# N-Grams, Regex, and TF-IDF

You are an analyst working at McDonalds' corporate headquarters, and charged with identifying areas for improvement to increase customer service.

Using the `mcdonalds-yelp-negative-reviews.csv` dataset, clean and parse the text reviews. 

Finally, generate a TF-IDF report that **visualizes** for each city what the major source of complaints with the McDonalds franchises are. Offer your analysis and business recommendations on next steps for the global SVP of Operations.

In [1]:
import nltk
nltk.download('stopwords')
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
lemmatizer = WordNetLemmatizer()
from nltk.tokenize import sent_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kailinghung/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
data = pd.read_csv("mcdonalds-yelp-negative-reviews.csv", encoding="latin1")

## explore data

In [3]:
data.shape

(1525, 3)

In [4]:
data.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [5]:
# city
city = data['city'].value_counts()
print(city)
print("there are",len(data['city'].value_counts()),"cities")

Las Vegas      409
Chicago        219
Los Angeles    167
New York       165
Atlanta        130
Houston        105
Portland        97
Dallas          75
Cleveland       71
Name: city, dtype: int64
there are 9 cities


## all reviews

In [6]:
allreview = list(data["review"].values)
type(allreview)

list

In [7]:
# word count for all reviews
# to find potential customize stop words 
words = [] 
word_count = {} 

for line in allreview: 
    for word in line.split(" "): 
        words.append(word.lower())
        
        if word not in word_count.keys(): 
            word_count[word] = 1
        else:
            word_count[word] += 1 

In [8]:
import operator
sorted_review = sorted(word_count.items(), key=operator.itemgetter(1),reverse=True)

In [9]:
# lemmatize all review
lemmatizer = WordNetLemmatizer()

def lemma(lines_review):
    sentence1 =[]
    for sentence in lines_review:
        token_words=word_tokenize(sentence)
        token_words
        stem_sentence=[]
        for word in token_words:
            stem_sentence.append(lemmatizer.lemmatize(word))
            stem_sentence.append(" ")
        sentence1.append("".join(stem_sentence))
    return sentence1

allreview = lemma(allreview)

print(allreview[2])
print(type(allreview))

First they `` lost '' my order , actually they gave it to someone one else than took 20 minute to figure out why I wa still waiting for my order.They after I wa asked what I needed I replied , `` my order '' .They asked for my ticket and the asst mgr looked at the ticket then incompletely filled it.I had to ask her to check to see if she filled it correctly.She acted a if she could n't be bothered with that so I asked her again.She begrudgingly checked to she did in fact miss something on the ticket.So after 22 minute I finally had my breakfast biscuit platter.As I left an woman approached and identified herself a the manager , she wa dressed a if she had just awoken in an old t-shirt and sweat pants.She said she had heard what happened and said she 'd take care of it.Well why did n't she intervene when she saw I wa growing annoyed with the incompetence ? 
<class 'list'>


In [10]:
vectorizer = TfidfVectorizer(ngram_range=(2,5),
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             max_df=0.5,
                             binary = True,
                             min_df=2, stop_words=stopwords.words('english')+ ['.', ',',"'s", 'wa','', "n't",'...',\
                                                                              'mcdonalds','mcdonald','McDonald','McDonalds',\
                                                                              'one','get','would','could','know','even','got',"fast","food"]) #customize stops words


In [11]:
X = vectorizer.fit_transform(allreview)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score["term"] = terms
score.sort_values(by="score", ascending=False, inplace=True)

In [12]:
score.head(30)

Unnamed: 0,score,term
drive thru,34.117692,drive thru
customer service,15.998619,customer service
worst ever,11.473978,worst ever
ice cream,10.704884,ice cream
order wrong,10.451155,order wrong
every time,8.786582,every time
big mac,8.555748,big mac
parking lot,8.13903,parking lot
order right,8.075159,order right
late night,7.437414,late night


# Las Vegas

In [13]:
# filter only Las Vegas
Vegas = data[data.city == 'Las Vegas']
r_vegas = list(Vegas['review'].values)

# lemmatize all review
lemma(r_vegas)

# vectorizer1 2-grams
vectorizer1 = TfidfVectorizer(ngram_range=(2,5),
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             binary = True,
                             max_df=0.3,
                             min_df=2, stop_words=stopwords.words('english')+ ['.', ',',"'s", 'wa','', "n't",'...',\
                                                                              'mcdonalds','mcdonald','McDonald','McDonalds',\
                                                                              'one','get','would','could','know','even','got',"fast","food"]) #customize stops words

# vectorizer2 3-grams
vectorizer2 = TfidfVectorizer(ngram_range=(3,3),
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             binary = True,
                             max_df=0.3,
                             min_df=2, stop_words=stopwords.words('english')+ ['.', ',',"'s", 'wa','', "n't",'...',\
                                                                              'mcdonalds','mcdonald','McDonald','McDonalds',\
                                                                              'one','get','would','could','know','even','got',"fast","food"]) #customize stops words

# vectorizer3
vectorizer3 = TfidfVectorizer(ngram_range=(4,4),
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             binary = True,
                             max_df=0.3,
                             min_df=1, stop_words=stopwords.words('english')+ ['.', ',',"'s", 'wa','', "n't",'...',\
                                                                              'mcdonalds','mcdonald','McDonald','McDonalds',\
                                                                              'one','get','would','could','know','even','got',"fast","food"]) #customize stops words

In [14]:
# 2-5 grams
vegas = vectorizer1.fit_transform(r_vegas)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(vegas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1.head(10)

Unnamed: 0,score,term
drive thru,16.660906,drive thru
customer service,6.546794,customer service
big mac,5.613363,big mac
worst ever,4.912069,worst ever
order right,4.506187,order right
las vegas,4.467361,las vegas
chicken nuggets,4.337296,chicken nuggets
order wrong,4.18409,order wrong
ice cream,4.048424,ice cream
sweet tea,3.812979,sweet tea


In [15]:
# 3-grams
vegas = vectorizer2.fit_transform(r_vegas)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(vegas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)


In [16]:
# 4-grams
vegas = vectorizer3.fit_transform(r_vegas)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(vegas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)


# Chicago

In [17]:
# filter only Chicago
Chicago = data[data.city == 'Chicago']
r_chicago = list(Chicago['review'].values)

# lemmatize review
r_chicago = lemma(r_chicago)

In [18]:
# 2-5 grams
chicago = vectorizer1.fit_transform(r_chicago)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(chicago.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

In [19]:
# 3-grams
chicago = vectorizer2.fit_transform(r_chicago)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(chicago.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

In [20]:
# 4-grams
chicago = vectorizer3.fit_transform(r_chicago)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(chicago.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

In [21]:
score1.to_csv('chicago.csv')

## Los Angeles

In [22]:
# filter only LA
LA = data[data.city == 'Los Angeles']
r_la = list(LA['review'].values)

# lemmatize review
r_la = lemma(r_la)

In [23]:
# 2-5 grams
la = vectorizer1.fit_transform(r_la)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(la.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)


In [24]:
# 3-grams
la = vectorizer2.fit_transform(r_la)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(la.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

In [25]:
# 4-grams
la = vectorizer3.fit_transform(r_la)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(la.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

## New York

In [26]:
# filter only LA
NY = data[data.city == 'New York']
r_ny = list(NY['review'].values)

# lemmatize review
r_ny = lemma(r_ny)

In [27]:
# 2-5 grams
ny = vectorizer1.fit_transform(r_ny)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(ny.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

In [28]:
# 3-grams
ny = vectorizer2.fit_transform(r_ny)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(ny.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

In [29]:
# 4-grams
ny = vectorizer3.fit_transform(r_ny)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(ny.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

In [30]:
score1.to_csv('ny.csv')

## Atlanta 

In [31]:
# filter only LA
Atlanta = data[data.city == 'Atlanta']
r_atlanta = list(Atlanta['review'].values)

# lemmatize review
r_atlanta = lemma(r_atlanta)

In [32]:
# 2-5 grams
atlanta = vectorizer1.fit_transform(r_atlanta)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(atlanta.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

In [33]:
# 3-grams
atlanta = vectorizer2.fit_transform(r_atlanta)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(atlanta.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

In [34]:
# 4-grams
atlanta = vectorizer3.fit_transform(r_atlanta)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(atlanta.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

In [35]:
score1.to_csv('altan.csv')

## Houston        

In [36]:
# filter only Houston
Houston = data[data.city == 'Houston']
r_houston = list(Houston['review'].values)

# lemmatize review
r_houston = lemma(r_houston)

In [37]:
# 2-5 grams
houston = vectorizer1.fit_transform(r_houston)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(houston.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

In [38]:
# 3-grams
houston = vectorizer2.fit_transform(r_houston)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(houston.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

In [39]:
# 4-grams
houston = vectorizer3.fit_transform(r_houston)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(houston.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

## Portland 

In [40]:
# filter only Portland
Portland = data[data.city == 'Portland']
r_portland = list(Portland['review'].values)

# lemmatize review
r_portland = lemma(r_portland)

In [41]:
# 2-5 grams
portland = vectorizer1.fit_transform(r_portland)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(portland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

In [42]:
# 3-grams
portland = vectorizer2.fit_transform(r_portland)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(portland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

In [43]:
# 4-grams
portland = vectorizer3.fit_transform(r_portland)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(portland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

## Dallas 

In [44]:
# filter only Dallas
Dallas = data[data.city == 'Dallas']
r_dallas = list(Dallas['review'].values)

# lemmatize review
r_dallas = lemma(r_dallas)

In [45]:
# 2-5 grams
dallas = vectorizer1.fit_transform(r_dallas)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(dallas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

In [46]:
# 3-grams
dallas = vectorizer2.fit_transform(r_dallas)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(dallas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

In [47]:
# 4-grams
dallas = vectorizer3.fit_transform(r_dallas)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(dallas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

In [48]:
score1.to_csv('dallas.csv')

## Cleveland 

In [49]:
# filter only Cleveland
Cleveland = data[data.city == 'Cleveland']
r_cleveland = list(Cleveland['review'].values)

# lemmatize review
r_cleveland = lemma(r_cleveland)

In [50]:
# 2-5 grams
cleveland = vectorizer1.fit_transform(r_cleveland)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(cleveland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1.head()

Unnamed: 0,score,term
drive thru,4.501727,drive thru
customer service,2.458793,customer service
worst ever,2.084463,worst ever
somewhere else,2.0,somewhere else
long wait,1.89608,long wait


In [51]:
# 3-grams
cleveland = vectorizer2.fit_transform(r_cleveland)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(cleveland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

In [52]:
# 4-grams
cleveland = vectorizer3.fit_transform(r_cleveland)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(cleveland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

## phrase count for high scored phrases

In [53]:
# count "drive thru"
import re

count = 0
for line in allreview:
    if len(re.findall(r"(drive thru)", line)) >= 1:
        count += 1
count

print("drive thru is mentioned in",round((count/1525)*100,3), '% of all comments')

drive thru is mentioned in 12.984 % of all comments


In [54]:
# count "drive thru" in Vagas
count = 0
for line in r_vegas:
    if len(re.findall(r"(drive thru)", line)) >= 1:
        count += 1

(count/409)*100

16.381418092909534

In [55]:
# count "order wrong" , "wrong order" , "order right" , "right order" "correct order" "order correct"
count = 0
for line in allreview:
    if len(re.findall(r"(order wrong|wrong order|order right|right order|correct order|order correct)", line)) >= 1:
        count += 1

print("wrong order, order wrong, order right, or right order is mentioned in",round((count/1525)*100,3), '% of all comments')

wrong order, order wrong, order right, or right order is mentioned in 8.656 % of all comments


In [56]:
# count "ice cream"
count = 0
for line in allreview:
    if len(re.findall(r"(ice cream|icecream)", line)) >= 1:
        count += 1
print("ice cream is mentioned in",round((count/1525)*100,3), '% of all comments')

ice cream is mentioned in 2.689 % of all comments


In [57]:
# count "french fry"
count = 0
for line in allreview:
    if len(re.findall(r"(french fry|frenchfry|fries)", line)) >= 1:
        count += 1
print("french fry is mentioned in",round((count/1525)*100,3), '% of all comments')

french fry is mentioned in 2.164 % of all comments


In [58]:
# count "big mac"
count = 0
for line in allreview:
    if len(re.findall(r"(big mac|bigmac)", line)) >= 1:
        count += 1
print("big mac is mentioned in",round((count/1525)*100,3), '% of all comments')

big mac is mentioned in 0.852 % of all comments


In [59]:
# count "chicken nugget"
count = 0
for line in allreview:
    if len(re.findall(r"(chicken nugget)", line)) >= 1:
        count += 1
print("chicken nugget is mentioned in",round((count/1525)*100,3), '% of all comments')

chicken nugget is mentioned in 2.098 % of all comments


In [60]:
#count iced coffee
count = 0
for line in allreview:
    if len(re.findall(r"(iced coffee)", line)) >= 1:
        count += 1
print("ced coffee is mentioned in",round((count/1525)*100,3), '% of all comments')

ced coffee is mentioned in 1.77 % of all comments


In [61]:
#count sweet tea
count = 0
for line in allreview:
    if len(re.findall(r"(sweet tea)", line)) >= 1:
        count += 1
print("sweet tea is mentioned in",round((count/1525)*100,3), '% of all comments')

sweet tea is mentioned in 1.705 % of all comments


In [62]:
# count "late night"
count = 0
for line in allreview:
    if len(re.findall(r"(late night|latenight)", line)) >= 1:
        count += 1
print("late night is mentioned in",round((count/1525)*100,3), '% of all comments')

late night is mentioned in 1.508 % of all comments


In [63]:
# count "parking lot"
count = 0
for line in allreview:
    if len(re.findall(r"(parking lot|parking)", line)) >= 1:
        count += 1
print("parking or parking lot is mentioned in",round((count/1525)*100,3), '% of all comments')

parking or parking lot is mentioned in 3.148 % of all comments
