# Homework 2 (Due 6:29pm PST Nov 4th, 2021): Word Vectorization, Regex Practice, and Similarity

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you would like to be assigned a partner, message me on Slack and I will pair you with someone.

In [1]:
import re
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import sys

A. Using the **McDonalds Yelp Review CSV file**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `hamburger` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `hamburgers` and `hamburger` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 

In [2]:
# UTF-8 decoding failed on this file, so we read it as ISO-8859-1 (Latin-1)
mcd_rev = pd.read_csv('mcdonalds-yelp-negative-reviews.csv', encoding = "ISO-8859-1")
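A minimal sketch of why the UTF-8 read can fail, using an invented byte string rather than the actual CSV contents (we have not checked which bytes in the file cause the problem). It only shows the general fallback idea with the standard library.

```python
# the byte 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8 on its own
raw_bytes = b"caf\xe9 was dirty"

try:
    decoded = raw_bytes.decode("utf-8")
except UnicodeDecodeError:
    # fall back to Latin-1, which maps every byte to some character
    decoded = raw_bytes.decode("ISO-8859-1")

print(decoded)
```

Because Latin-1 never raises a decoding error, it can also hide genuinely mis-encoded characters, so it is worth scanning the resulting feature names for odd symbols.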

In [3]:
# build the stopword set now; it is applied later inside CountVectorizer

from nltk.corpus import stopwords
stopword_list = set(stopwords.words('english'))

# we add mcdonalds and its variants to our stopwords since we know the reviews are about mcdonalds
# note the text will already all be in lower case
# note CountVectorizer's default tokenizer splits on spaces and apostrophes,
# so the multi-word and apostrophe entries below never match a token
stopword_list.update([
    'mcdonalds', 'mcdonald\'s', 'mcdonald',
    'micky d\'s', 'mickey d\'s', 'mickey d',
    'mcd', 'mcdss',
])

# peek at 9 elements of the set (set order is arbitrary)
list(stopword_list)[1:10] 

['you', 'needn', 'wasn', 'their', 'isn', 'being', "hadn't", "don't", 'there']

In [3]:
#series of our reviews in lowercase
reviews = mcd_rev['review'].str.lower()
reviews

0       i'm not a huge mcds lover, but i've been to be...
1       terrible customer service. i came in at 9:30pm...
2       first they "lost" my order, actually they gave...
3       i see i'm not the only one giving 1 star. only...
4       well, it's mcdonald's, so you know what the fo...
                              ...                        
1520    i enjoyed the part where i repeatedly asked if...
1521    worst mcdonalds i've been in in a long time! d...
1522    when i am really craving for mcdonald's, this ...
1523    two points right out of the gate: 1. thuggery ...
1524    i wanted to grab breakfast one morning before ...
Name: review, Length: 1525, dtype: object

In [5]:
# regex cleaning

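# variations of burger: "cheeseburgers" and "cheese burger" both become "burger"
# note the second alternative also swallows any word before " burger" (e.g. "the burger")
burger = r'\w*(burger)s?|\w* (burger)s?'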

icecream = r'(ice cream)|(ice-cream)' # people write icecream differently

drive_through = r'drive.thr\w*|drivethr\w*' # drive-thru has many variations

mc = r'\b(mc-)|\b(mc )' #mc-chicken / mc chicken becomes mcchicken


for i in range(len(reviews)):
    # we want each of these to be a single token in the count vector,
    # so we collapse the variants to burger, icecream, drivethru and mc...
    reviews[i] = re.sub(burger, "burger", reviews[i])
    reviews[i] = re.sub(icecream, "icecream", reviews[i])
    reviews[i] = re.sub(drive_through, "drivethru", reviews[i])
    reviews[i] = re.sub(mc, "mc", reviews[i])
    # to spot-check, compare reviews[52] before and after cleaning
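To see what these substitutions actually do, here is a small standalone sketch that applies the same four patterns to invented sample sentences. The helper name `clean_review` and the tighter pattern `burger_strict` are hypothetical additions for illustration only; they are not part of the notebook.

```python
import re

# patterns copied from the cell above
burger = r'\w*(burger)s?|\w* (burger)s?'
icecream = r'(ice cream)|(ice-cream)'
drive_through = r'drive.thr\w*|drivethr\w*'
mc = r'\b(mc-)|\b(mc )'

def clean_review(text):
    """Apply the four substitutions in the same order as the loop above."""
    text = re.sub(burger, "burger", text)
    text = re.sub(icecream, "icecream", text)
    text = re.sub(drive_through, "drivethru", text)
    text = re.sub(mc, "mc", text)
    return text

# invented sample sentences, already lowercased
cleaned_order = clean_review("the drive-thru lost my mc chicken and ice-cream")
cleaned_plural = clean_review("i ordered two cheeseburgers")
# the second burger alternative also consumes the word before " burger"
cleaned_article = clean_review("the burger was cold")

# hypothetical tighter pattern: only rewrites the burger word itself
burger_strict = r'\b\w*burgers?\b'
strict_article = re.sub(burger_strict, "burger", "the burger was cold")

print(cleaned_order)
print(cleaned_plural)
print(cleaned_article)
print(strict_article)
```

The trade-off is that the stricter pattern leaves "cheese burger" as two tokens, whereas the original pattern merges it at the cost of occasionally deleting an unrelated preceding word.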

### Why stemming?
- We chose stemming because it collapses word variants more aggressively than lemmatization, which gives higher recall and fewer columns in the document-term matrix
- We only need word counts for this homework, so we do not need the part-of-speech tagging and dictionary lookup that lemmatization relies on
- It is simple and good enough for our purposes here
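A quick sketch of the collapsing behaviour we are relying on, using NLTK's `PorterStemmer` on a few words chosen for illustration. The stem `terribl` matches what appears in the stemmed reviews printed in this notebook.

```python
from nltk.stem.porter import PorterStemmer

demo_stemmer = PorterStemmer()

# inflected forms of one verb collapse to a single stem
order_forms = ["ordered", "ordering", "orders"]
order_stems = {demo_stemmer.stem(w) for w in order_forms}

# singular and plural nouns also share a stem
burger_stems = {demo_stemmer.stem(w) for w in ["hamburger", "hamburgers"]}

# stems need not be dictionary words
terrible_stem = demo_stemmer.stem("terrible")

print(order_stems, burger_stems, terrible_stem)
```

The cost of this aggressiveness is that stems are harder to read than lemmas, which matters more for interpretation than for raw counting.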

In [6]:
#stemming, create an empty list, loop through and stem
stemmer = nltk.stem.porter.PorterStemmer()

stemmed_list = []
for i in reviews:
    tokens = nltk.word_tokenize(i)
    x = ''
    for j in tokens:
        x = x + ' ' + stemmer.stem(word = j)
    stemmed_list.append(x)

stemmed_list[1:5]

[' terribl custom servic . i came in at 9:30pm and stood in front of the regist and no one bother to say anyth or help me for 5 minut . there wa no one els wait for their food insid either , just outsid at the window . i left and went to chickfila next door and wa greet befor i wa all the way insid . thi mcdonald is also dirti , the floor wa cover with drop food . obvious fill with surli and unhappi worker .',
 " first they `` lost '' my order , actual they gave it to someon one els than took 20 minut to figur out whi i wa still wait for my order.they after i wa ask what i need i repli , `` my order '' .they ask for my ticket and the asst mgr look at the ticket then incomplet fill it.i had to ask her to check to see if she fill it correctly.sh act as if she could n't be bother with that so i ask her again.sh begrudgingli check to she did in fact miss someth on the ticket.so after 22 minut i final had my breakfast biscuit platter.a i left an woman approach and identifi herself as the ma

In [7]:
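# removing stopwords
# recall we added McDonalds and some of its variations to the list of stopwords
# note the stopwords are unstemmed, so stemmed forms such as "thi" and "wa" are not removed
print(list(stopword_list)[1:10])

# pass a list: newer scikit-learn versions validate stop_words and expect a list, not a set
vectorizer = CountVectorizer(stop_words = list(stopword_list))
X = vectorizer.fit_transform(stemmed_list)
X = X.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
final_df = pd.DataFrame(X, columns = vectorizer.get_feature_names_out())
final_df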

['you', 'needn', 'wasn', 'their', 'isn', 'being', "hadn't", "don't", 'there']




Unnamed: 0,00,000,00am,00mi,00pm,01,0200,03pm,04,04am,...,zax,zee,zeke,zero,zesti,zip,zombi,zombie,zoom,î_
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1520,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1521,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1522,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1523,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [23]:
# Our top 50 most frequent words with CountVectors
final_df.sum().sort_values(ascending=False)[0:50]

thi          1887
wa           1873
order        1293
food          894
get           804
one           783
time          735
go            694
drivethru     603
place         545
servic        528
like          521
locat         446
onli          432
wait          431
becaus        405
would         377
ask           367
fri           360
peopl         354
even          347
back          340
got           329
work          328
manag         326
alway         314
custom        308
good          294
minut         285
never         283
want          278
veri          278
coffe         277
line          274
take          272
window        272
ever          271
ha            270
make          269
come          264
burger        258
look          256
give          249
right         248
say           248
went          241
eat           239
realli        237
know          237
fast          237
dtype: int64

B. **Stopwords, Stemming, Lemmatization Practice**

Using the `tale-of-two-cities.txt` file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then count-vectorization
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?

In [9]:
# read every line, then turn the list of lines into one long string
with open("tale-of-two-cities.txt", "r") as text:
    text_lines = str(text.readlines())
# remove the line separators left behind by str() on the list
text_lines = text_lines.replace("\\n\', \'", " ")
text_lines = text_lines.replace("\\n", " ")

# use nltk to split into sentences; each sentence is one document
sent_text = nltk.sent_tokenize(text_lines)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sent_text)
X = X.toarray()
corpus_df = pd.DataFrame(X, columns = vectorizer.get_feature_names_out())
corpus_df

Unnamed: 0,1757,1767,1792,21,aback,abandon,abandoned,abandoning,abandonment,abashed,...,your,yourn,yours,yourself,yourselves,youth,youthful,youthfulness,youths,zealous
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7728,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7729,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
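Calling `str()` on the list returned by `readlines()` leaves list punctuation (brackets, quotes, commas) mixed into the text. A possibly cleaner alternative, sketched here on an invented three-line sample written to a temporary file, is to read the file as one string and normalize the whitespace. Feeding text prepared this way to `nltk.sent_tokenize` may give sentence and column counts that differ from the ones reported in this notebook.

```python
import os
import tempfile

# invented sample standing in for tale-of-two-cities.txt
sample = "It was the best of times,\nit was the worst of times.\nIt was the age of wisdom.\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    # read the whole file as a single string instead of a list of lines
    with open(path, "r", encoding="utf-8") as f:
        raw = f.read()

# collapse newlines and repeated spaces into single spaces
flat_text = " ".join(raw.split())
print(flat_text)
```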


### Status Quo (no stemming or lemmatization)
- With each sentence as a document, we get **7731 sentences and 9705 columns**
- Simply adding `stop_words = 'english'` to the vectorizer, we get **7731 sentences and 9420 columns**

Note: the second bullet was only an exploratory check. The vectorizer above does not remove stopwords, so the stemming and lemmatization counts below are compared against the 9705-column baseline.

In [10]:
stemmer = nltk.stem.porter.PorterStemmer()

stemmed_list = []
for i in sent_text:
    tokens = nltk.word_tokenize(i)
    x = ''
    for j in tokens:
        x = x + ' ' + stemmer.stem(word = j)
    stemmed_list.append(x)

# to see stemmed text, call stemmed_list

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(stemmed_list)
X = X.toarray()
stemmed_df = pd.DataFrame(X, columns = vectorizer.get_feature_names_out())
stemmed_df

Unnamed: 0,1757,1767,1792,21,aback,abandon,abash,abat,abbay,abbaye,...,you,young,younger,youngest,your,yourn,yourself,yourselv,youth,zealou
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7728,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7729,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### After stemming, our 7731 sentences have 6682 columns, down from 9705 columns without stemming

In [11]:
# this snippet of code is taken from
# https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258

# part-of-speech tags give the lemmatizer context for each word
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)
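The tag conversion above boils down to looking at the first letter of the Penn Treebank tag. The dict-based sketch below shows the same mapping without importing NLTK; the names `PREFIX_TO_WORDNET` and `treebank_to_wordnet` are hypothetical, and the single-letter codes are the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`.

```python
# single-letter codes that mirror wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def treebank_to_wordnet(tag):
    """Hypothetical dict-based equivalent of nltk_tag_to_wordnet_tag."""
    # any other prefix (determiners, prepositions, punctuation) maps to None
    return PREFIX_TO_WORDNET.get(tag[:1])

sample_tags = ["JJ", "VBD", "NNS", "RB", "DT"]
mapped = [treebank_to_wordnet(t) for t in sample_tags]
print(list(zip(sample_tags, mapped)))
```

Tokens that map to `None` are passed through unchanged by `lemmatize_sentence`, which is why function words keep their surface form.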

In [12]:
lemmatized_list = []

for i in sent_text:
    x = lemmatize_sentence(i)
    lemmatized_list.append(x)

# to see lemmatized text, call lemmatized_list

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lemmatized_list)
X = X.toarray()
lemmatized_df = pd.DataFrame(X, columns = vectorizer.get_feature_names_out())
lemmatized_df

Unnamed: 0,1757,1767,1792,21,aback,abandon,abandoned,abandonment,abashed,abate,...,younger,your,yourn,yours,yourself,yourselves,youth,youthful,youthfulness,zealous
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7728,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7729,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# our original lemmatizer with no context


#lemmatizer = nltk.stem.WordNetLemmatizer()

#lemmatized_list = []
#for i in sent_text:
#    tokens = nltk.word_tokenize(i)
#    x = ''
#    for j in tokens:
#        x = x + ' ' + lemmatizer.lemmatize(word = j)
#    lemmatized_list.append(x)

# to see lemmatized text, call lemmatized_list


#vectorizer = CountVectorizer()
#X = vectorizer.fit_transform(lemmatized_list)
#X = X.toarray()
#lemmatized_df = pd.DataFrame(X, columns = vectorizer.get_feature_names())
#lemmatized_df

### After lemmatization, our 7731 sentences have 7969 columns, down from 9705 columns without lemmatization

In [14]:
# same lemmatization as above, but this time the vectorizer also removes English stopwords
lemmatized_list = []

for i in sent_text:
    x = lemmatize_sentence(i)
    lemmatized_list.append(x)

# to see lemmatized text, call lemmatized_list

vectorizer = CountVectorizer(stop_words = 'english')
X = vectorizer.fit_transform(lemmatized_list)
X = X.toarray()
lemmatized_df = pd.DataFrame(X, columns = vectorizer.get_feature_names_out())
lemmatized_df

Unnamed: 0,1757,1767,1792,21,aback,abandon,abandoned,abandonment,abashed,abate,...,yoked,yonder,yore,young,younger,yourn,youth,youthful,youthfulness,zealous
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7728,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7729,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### After lemmatization and then removing stopwords, our 7731 sentences have 7689 columns. Stopword removal dropped 280 columns
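To put the four results side by side, the short sketch below recomputes the reduction relative to the 9705-column baseline, using only the column counts reported in this notebook.

```python
baseline = 9705

# column counts reported in the sections above
feature_counts = {
    "stop_words='english' only": 9420,
    "stemming": 6682,
    "lemmatization": 7969,
    "lemmatization + stopwords": 7689,
}

reductions = {}
for name, n_features in feature_counts.items():
    removed = baseline - n_features
    percent = round(100 * removed / baseline, 1)
    reductions[name] = (removed, percent)
    print(f"{name}: {n_features} columns, {removed} fewer ({percent}%)")
```

Stemming gives by far the largest reduction (roughly 31%), while lemmatization is more conservative (roughly 18%, or about 21% once stopwords are also removed).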