# Homework 2 (Due 6:29pm PST Nov 4th, 2021): Word Vectorization, Regex Practice, and Similarity

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you would like to be assigned a partner, message me on Slack and I will pair you with someone.

In [1]:
import re
import sys

import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

A. Using the **McDonalds Yelp Review CSV file**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `hamburger` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `hamburgers` and `hamburger` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 
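As a rough illustration of the kind of regex cleaning described above, the sketch below collapses a few spellings of `hamburger` into one token before vectorizing. The pattern, the helper name `normalize_burgers`, and the sample reviews are all hypothetical; the real reviews may need a different or broader pattern.

```python
import re

# Hypothetical spelling variants; the real reviews may contain others.
BURGER_PATTERN = re.compile(r"\b(?:ham)?burgers?\b", flags=re.IGNORECASE)


def normalize_burgers(text: str) -> str:
    """Map 'burger', 'burgers', 'hamburger', 'hamburgers' to 'hamburger'."""
    return BURGER_PATTERN.sub("hamburger", text)


# Made-up sample reviews, not taken from the CSV file.
reviews = [
    "The Hamburgers were cold.",
    "My burger was fine, but two hamburgers were missing.",
]
cleaned = [normalize_burgers(review) for review in reviews]
print(cleaned)
```

Because the pattern is anchored with `\b` on both sides, compound words such as `cheeseburger` are left alone; whether those should also be merged is a modeling decision to explain in the notebook.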

In [11]:
mcd_rev = pd.read_csv('mcdonalds-yelp-negative-reviews.csv')
mcd_rev

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 3125: invalid continuation byte
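The error above means the file is not valid UTF-8. The actual encoding of the CSV is not stated anywhere in this notebook, so the sketch below only shows one possible workaround: try UTF-8 first and fall back to `latin-1`, which can decode any byte. The helper name `read_text_with_fallback` and the demo bytes are assumptions made for illustration.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes the file."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo on a throwaway file containing the same kind of byte (0xce) that broke the read.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"review\nfries were \xce cold\n")
    demo_path = tmp.name

text, used = read_text_with_fallback(demo_path)
os.remove(demo_path)
print(used)
```

`pd.read_csv` accepts the same choice through its `encoding` argument, for example `pd.read_csv('mcdonalds-yelp-negative-reviews.csv', encoding='latin-1')`. Since `latin-1` never raises, some characters may still be rendered incorrectly if the file was saved in a different encoding such as `cp1252`.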

B. **Stopwords, Stemming, Lemmatization Practice**

Using the `tale-of-two-cities.txt` file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then **count-vectorization**?
* Perform **lemmatization** and then **count-vectorization**?
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?
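To make "features (dimensions)" concrete, here is a small standard-library sketch of a document-term matrix. It is a toy stand-in, not `CountVectorizer` itself: it lowercases the text and keeps tokens of two or more word characters, which mirrors `CountVectorizer`'s default settings. The function name `document_term_matrix` and the two sample sentences are illustrative assumptions.

```python
import re
from collections import Counter

# Tokens of two or more word characters, like CountVectorizer's default token pattern.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(TOKEN_RE.findall(doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


# Two made-up sentences, each treated as one document.
docs = ["It was the best of times.", "It was the worst of times."]
vocab, dtm = document_term_matrix(docs)
print(len(vocab), vocab)
for row in dtm:
    print(row)
```

The number of features is the length of the vocabulary, so any step that merges or removes tokens (stemming, lemmatization, stopword removal) shrinks the number of columns.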

In [2]:
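# read the raw lines; str() of the list renders line breaks as literal "\n" text between quoted items
with open("tale-of-two-cities.txt", "r") as text:
    text_lines = str(text.readlines())

# replace those list separators (and any remaining literal "\n") with spaces
text_lines = text_lines.replace("\\n\', \'", " ")
text_lines = text_lines.replace("\\n", " ")

# we use nltk to tokenize by sentence, so each document is one sentence
sent_text = nltk.sent_tokenize(text_lines)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sent_text)
X = X.toarray()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
corpus_df = pd.DataFrame(X, columns = vectorizer.get_feature_names_out())
corpus_df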

Unnamed: 0,1757,1767,1792,21,aback,abandon,abandoned,abandoning,abandonment,abashed,...,your,yourn,yours,yourself,yourselves,youth,youthful,youthfulness,youths,zealous
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7728,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7729,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Status Quo (no stemming or lemmatization)
- With each sentence as a document, we get **7731 sentences and 9705 columns**
- Passing `stop_words='english'` to `CountVectorizer` gives **7731 sentences and 9420 columns**

Note: we drop the `stop_words='english'` setting from this point on.

In [3]:
len(sent_text)

7731

In [4]:
stemmer = nltk.stem.porter.PorterStemmer()

# stem every token and rejoin each sentence into a single string
stemmed_list = []
for sentence in sent_text:
    tokens = nltk.word_tokenize(sentence)
    stemmed_list.append(' '.join(stemmer.stem(token) for token in tokens))

# to see stemmed text, call stemmed_list

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(stemmed_list)
X = X.toarray()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
stemmed_df = pd.DataFrame(X, columns = vectorizer.get_feature_names_out())
stemmed_df

Unnamed: 0,1757,1767,1792,21,aback,abandon,abash,abat,abbay,abbaye,...,you,young,younger,youngest,your,yourn,yourself,yourselv,youth,zealou
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7728,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7729,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### After stemming, our 7731 sentences have 6682 columns, down from 9705 columns without stemming

In [5]:
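lemmatizer = nltk.stem.WordNetLemmatizer()

# lemmatize every token and rejoin each sentence into a single string
# (lemmatize() defaults to pos='n', so every token is treated as a noun)
lemmatized_list = []
for sentence in sent_text:
    tokens = nltk.word_tokenize(sentence)
    lemmatized_list.append(' '.join(lemmatizer.lemmatize(token) for token in tokens))

# to see lemmatized text, call lemmatized_list

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lemmatized_list)
X = X.toarray()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
lemmatized_df = pd.DataFrame(X, columns = vectorizer.get_feature_names_out())
lemmatized_df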

Unnamed: 0,1757,1767,1792,21,aback,abandon,abandoned,abandoning,abandonment,abashed,...,youngest,your,yourn,yours,yourself,yourselves,youth,youthful,youthfulness,zealous
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7728,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7729,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### After lemmatizing, our 7731 sentences have 8937 columns, down from 9705 columns without lemmatizing
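The column counts reported so far are easier to compare as percentage reductions from the 9705-column baseline. The short calculation below uses only the numbers stated in this notebook; the dictionary labels are just descriptive names chosen for the example.

```python
baseline = 9705  # columns with no stemming, lemmatization, or stopword removal

# Column counts reported in the cells above.
reported = {
    "stop_words='english' only": 9420,
    "stemming": 6682,
    "lemmatization": 8937,
}

# Percentage of columns removed relative to the baseline, rounded to one decimal.
reductions = {
    name: round(100 * (baseline - columns) / baseline, 1)
    for name, columns in reported.items()
}
for name, percent in reductions.items():
    print(f"{name}: {percent}% fewer columns")
```

Stemming removes roughly 31% of the columns while lemmatization removes roughly 8%, which is consistent with the lemmatized table still listing `abandon`, `abandoned`, and `abandoning` as separate columns.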

In [6]:
# we now perform the same thing but remove stopwords
from nltk.corpus import stopwords

lemmatizer = nltk.stem.WordNetLemmatizer()
stops = set(stopwords.words('english'))

lemmatized_list = []
for sentence in sent_text:
    tokens = nltk.word_tokenize(sentence)
    # note: this membership test is case-sensitive and the NLTK list is lowercase,
    # so capitalized stopwords (e.g. "The") are kept
    kept = [lemmatizer.lemmatize(token) for token in tokens if token not in stops]
    lemmatized_list.append(' '.join(kept))

# to see lemmatized text, call lemmatized_list

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lemmatized_list)
X = X.toarray()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
lemmatized_df = pd.DataFrame(X, columns = vectorizer.get_feature_names_out())
lemmatized_df

Unnamed: 0,1757,1767,1792,21,aback,abandon,abandoned,abandoning,abandonment,abashed,...,younger,youngest,your,yourn,yours,yourself,youth,youthful,youthfulness,zealous
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7728,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7729,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### After lemmatizing and removing stopwords, our 7731 sentences have 8925 columns. This is 12 fewer columns than the lemmatized frame without stopword removal
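One possible reason the drop is so small: the stopword check in the cell above compares tokens exactly as written, while NLTK's English stopword list is lowercase. A capitalized stopword at the start of a sentence therefore survives the filter, and `CountVectorizer` then lowercases it back into a column. The sketch below illustrates the difference with a tiny hypothetical stopword set and a made-up token list; it does not reproduce the notebook's numbers.

```python
# Tiny hypothetical stand-in for NLTK's lowercase English stopword list.
STOPS = {"the", "it", "was", "of"}


def drop_stopwords(tokens, case_sensitive):
    """Filter out stopwords, optionally lowercasing each token before the lookup."""
    if case_sensitive:
        return [token for token in tokens if token not in STOPS]
    return [token for token in tokens if token.lower() not in STOPS]


tokens = ["It", "was", "the", "best", "of", "times"]
kept_sensitive = drop_stopwords(tokens, case_sensitive=True)
kept_insensitive = drop_stopwords(tokens, case_sensitive=False)
print(kept_sensitive)
print(kept_insensitive)
```

In the notebook's loop, the equivalent change would be testing `j.lower() not in stops`. That would likely remove more columns than the 12 reported here, so the count above would need to be recomputed after making it.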