# Homework 3 (Due 5:30pm PST April 2nd, 2019): N-Grams, Regex, and TF-IDF

### Submit via Slack/email.

You are an analyst at McDonald's corporate headquarters, charged with identifying areas for improvement in customer service.

Using the `mcdonalds-yelp-negative-reviews.csv` dataset, clean and parse the text reviews. Document the decisions you make:
- why remove/keep stopwords?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?

Finally, generate a TF-IDF report that **visualizes**, for each city, what the major sources of complaints about the McDonald's franchises are. Offer your analysis and business recommendations on next steps for the global SVP of Operations.

In [1]:
import nltk
nltk.download('stopwords')

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kailinghung/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
data = pd.read_csv("mcdonalds-yelp-negative-reviews.csv", encoding="latin1")

## Explore data

In [3]:
data.shape

(1525, 3)

In [4]:
data.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [5]:
# city
city = data['city'].value_counts()
print(city)
print("there are",len(data['city'].value_counts()),"cities")

Las Vegas      409
Chicago        219
Los Angeles    167
New York       165
Atlanta        130
Houston        105
Portland        97
Dallas          75
Cleveland       71
Name: city, dtype: int64
there are 9 cities


## All reviews

In [6]:
allreview = list(data["review"].values)
type(allreview)

list
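
The assignment asks about regex cleaning and substitution, and the raw reviews contain fused sentences such as `order.They` as well as several spellings of the brand name. What follows is a minimal sketch of a cleaning step that could run before counting or vectorizing, using only the standard `re` module. The function name `clean_review` and the particular substitutions are assumptions made for illustration; they are not part of the original notebook.

```python
import re

def clean_review(text):
    """Normalize one review string before tokenization (illustrative only)."""
    # Re-insert the space missing after sentence-ending punctuation ("order.They").
    text = re.sub(r"([.!?])([A-Za-z])", r"\1 \2", text)
    # Lowercase so "The" and "the" become the same token.
    text = text.lower()
    # Collapse spelling variants of the brand name into a single token.
    text = re.sub(r"\bmc\s?donald'?s?\b", "mcdonalds", text)
    # Replace anything that is not a letter, apostrophe, or whitespace with a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Squeeze repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(clean_review("My order.They lost it at Mc Donalds!!  McDonald's again..."))
```

If something like this were adopted, it could be applied with `allreview = [clean_review(r) for r in allreview]` before the later cells.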

In [7]:
# word count for all reviews,
# used to spot candidates for custom stop words
words = []
word_count = {}

for line in allreview:
    for word in line.split(" "):
        words.append(word.lower())

        # counts are keyed on the original casing, so 'the' and 'The' are tallied separately
        if word not in word_count:
            word_count[word] = 1
        else:
            word_count[word] += 1

In [8]:
import operator
sorted_review = sorted(word_count.items(), key=operator.itemgetter(1),reverse=True)
sorted_review

[('the', 6208),
 ('I', 4330),
 ('and', 4070),
 ('to', 3953),
 ('a', 3426),
 ('of', 1990),
 ('is', 1865),
 ('was', 1771),
 ('in', 1708),
 ('for', 1617),
 ('my', 1412),
 ('this', 1375),
 ('it', 1177),
 ('that', 1160),
 ('they', 1137),
 ('you', 1048),
 ('at', 1011),
 ('have', 937),
 ('on', 873),
 ('not', 860),
 ('but', 830),
 ('with', 795),
 ('The', 743),
 ('me', 705),
 ('are', 700),
 ('get', 649),
 ('be', 628),
 ('so', 607),
 ('order', 602),
 ('food', 589),
 ('one', 588),
 ("McDonald's", 585),
 ('had', 551),
 ('just', 532),
 ('up', 499),
 ('or', 486),
 ('drive', 472),
 ('there', 468),
 ('like', 466),
 ('as', 462),
 ('go', 459),
 ('when', 445),
 ('were', 438),
 ('no', 427),
 ('out', 424),
 ('your', 413),
 ('This', 402),
 ('only', 386),
 ('here', 384),
 ('if', 382),
 ('time', 379),
 ('because', 374),
 ('their', 371),
 ('place', 370),
 ('an', 361),
 ('been', 359),
 ('what', 356),
 ('from', 356),
 ('about', 354),
 ('we', 336),
 ('all', 333),
 ("don't", 333),
 ('would', 312),
 ('service', 311
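
The counts above list `'the'` and `'The'` (and `'this'` and `'This'`) as separate entries because the tally is case-sensitive. A possible alternative, sketched here with `collections.Counter`, folds case before counting. The helper name `count_words` and the three sample strings are made up for the example.

```python
from collections import Counter

def count_words(reviews):
    """Count case-folded, whitespace-separated words across all reviews."""
    counts = Counter()
    for review in reviews:
        # str.split() with no argument also splits on tabs and newlines
        counts.update(word.lower() for word in review.split())
    return counts

# Hypothetical sample, not taken from the dataset.
sample = ["The food was cold", "the order was wrong", "Cold fries"]
counts = count_words(sample)
print(counts.most_common(3))
```

`counts.most_common()` returns the same kind of sorted `(word, count)` list that the `sorted(..., key=operator.itemgetter(1))` call produces.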

In [9]:
# lemmatize all reviews
lemmatizer = WordNetLemmatizer()

def lemma(lines_review):
    """Tokenize each review and lemmatize every token; returns a new list of strings."""
    lemmatized_reviews = []
    for sentence in lines_review:
        token_words = word_tokenize(sentence)
        lemmas = []
        for word in token_words:
            lemmas.append(lemmatizer.lemmatize(word))
            lemmas.append(" ")
        lemmatized_reviews.append("".join(lemmas))
    return lemmatized_reviews

allreview = lemma(allreview)

print(allreview[2])
print(type(allreview))

First they `` lost '' my order , actually they gave it to someone one else than took 20 minute to figure out why I wa still waiting for my order.They after I wa asked what I needed I replied , `` my order '' .They asked for my ticket and the asst mgr looked at the ticket then incompletely filled it.I had to ask her to check to see if she filled it correctly.She acted a if she could n't be bothered with that so I asked her again.She begrudgingly checked to she did in fact miss something on the ticket.So after 22 minute I finally had my breakfast biscuit platter.As I left an woman approached and identified herself a the manager , she wa dressed a if she had just awoken in an old t-shirt and sweat pants.She said she had heard what happened and said she 'd take care of it.Well why did n't she intervene when she saw I wa growing annoyed with the incompetence ? 
<class 'list'>
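
The lemmatized text turns "was" into `wa` and "as" into `a`; elsewhere in the notebook "has", "less", "us" and "pass" show up as `ha`, `le`, `u` and `pas`. This is probably because `lemmatizer.lemmatize(word)` treats every token as a noun unless a part of speech is passed. One possible workaround is to leave a small set of protected words untouched. The sketch below assumes a hypothetical `PROTECTED` set and uses a stand-in function `naive_noun_lemma` (it only strips a trailing "s" and is not WordNet) so that it runs without any NLTK data.

```python
def naive_noun_lemma(word):
    """Stand-in for a noun lemmatizer: strips one trailing 's' (NOT WordNet)."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word

# Hypothetical list of words that should pass through unchanged.
PROTECTED = {"was", "as", "has", "us", "less", "this", "is", "pass"}

def lemmatize_tokens(tokens, lemmatize=naive_noun_lemma, protected=PROTECTED):
    """Lemmatize every token except those in the protected set."""
    return [t if t.lower() in protected else lemmatize(t) for t in tokens]

tokens = ["I", "was", "waiting", "20", "minutes", "as", "usual"]
print(lemmatize_tokens(tokens))
```

In the notebook, the real lemmatizer could be supplied as `lemmatize_tokens(tokens, lemmatize=lemmatizer.lemmatize)`.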


In [10]:
vectorizer = TfidfVectorizer(ngram_range=(2,5),
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             max_df=0.5,
                             binary=True,
                             min_df=2,
                             # custom stop words appended to NLTK's English list
                             stop_words=stopwords.words('english') + ['.', ',', "'s", 'wa', '', "n't", '...',
                                                                      'mcdonalds', 'mcdonald', 'McDonald', 'McDonalds',
                                                                      'one', 'get', 'would', 'could', 'know', 'even', 'got', "fast", "food"])


In [11]:
X = vectorizer.fit_transform(allreview)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score["term"] = terms
score.sort_values(by="score", ascending=False, inplace=True)

In [12]:
score.head(30)

Unnamed: 0,score,term
drive thru,34.117692,drive thru
customer service,15.998619,customer service
worst ever,11.473978,worst ever
ice cream,10.704884,ice cream
order wrong,10.451155,order wrong
every time,8.786582,every time
big mac,8.555748,big mac
parking lot,8.13903,parking lot
order right,8.075159,order right
late night,7.437414,late night
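
The assignment asks for a visualization, and the notebook only prints tables. As a dependency-free stopgap, the summed scores can be rendered as a text bar chart. The sketch below uses the top five overall scores from the table above; the function name `text_bar_chart` and the `width` parameter are assumptions for illustration. A plotting library would be the more usual choice for a report.

```python
def text_bar_chart(scores, width=40):
    """Return one line per term, with a bar scaled to the largest score."""
    top = max(score for _, score in scores)
    label_width = max(len(term) for term, _ in scores)
    lines = []
    for term, score in scores:
        # At least one '#' so that very small scores stay visible.
        bar = "#" * max(1, round(width * score / top))
        lines.append(f"{term:<{label_width}} {bar} {score:.2f}")
    return lines

# Top five overall terms, copied from the table above.
overall = [
    ("drive thru", 34.117692),
    ("customer service", 15.998619),
    ("worst ever", 11.473978),
    ("ice cream", 10.704884),
    ("order wrong", 10.451155),
]
print("\n".join(text_bar_chart(overall)))
```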


## Las Vegas

In [13]:
# filter only Las Vegas
Vegas = data[data.city == 'Las Vegas']
r_vegas = list(Vegas['review'].values)

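# lemmatize reviews
# note: lemma() returns a new list and the result is not assigned here,
# so the vectorizers below are fit on the original, unlemmatized reviews
lemma(r_vegas)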

# custom stop words, shared by the three vectorizers below
custom_stops = stopwords.words('english') + ['.', ',', "'s", 'wa', '', "n't", '...',
                                             'mcdonalds', 'mcdonald', 'McDonald', 'McDonalds',
                                             'one', 'get', 'would', 'could', 'know', 'even', 'got', "fast", "food"]

def make_vectorizer(ngram_range, min_df):
    return TfidfVectorizer(ngram_range=ngram_range,
                           token_pattern=r'\b[a-zA-Z]{3,}\b',
                           binary=True,
                           max_df=0.3,
                           min_df=min_df,
                           stop_words=custom_stops)

# vectorizer1: 2- to 5-grams
vectorizer1 = make_vectorizer((2, 5), min_df=2)

# vectorizer2: 3-grams
vectorizer2 = make_vectorizer((3, 3), min_df=2)

# vectorizer3: 4-grams (min_df=1 keeps n-grams that occur in a single review)
vectorizer3 = make_vectorizer((4, 4), min_df=1)

In [14]:
# 2-5 grams
vegas = vectorizer1.fit_transform(r_vegas)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(vegas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1.head(30)

Unnamed: 0,score,term
drive thru,16.660906,drive thru
customer service,6.546794,customer service
big mac,5.613363,big mac
worst ever,4.912069,worst ever
order right,4.506187,order right
las vegas,4.467361,las vegas
chicken nuggets,4.337296,chicken nuggets
order wrong,4.18409,order wrong
ice cream,4.048424,ice cream
sweet tea,3.812979,sweet tea


In [15]:
# 3-grams
vegas = vectorizer2.fit_transform(r_vegas)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(vegas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

score2.head(20)

Unnamed: 0,score,term
went drive thru,6.023178,went drive thru
worst service ever,4.885912,worst service ever
drive thru order,4.765669,drive thru order
never order right,4.662145,never order right
thru drive thru,4.657173,thru drive thru
ice cream machine,3.665731,ice cream machine
ordered big mac,3.41553,ordered big mac
piece chicken nuggets,3.413,piece chicken nuggets
every single time,3.11118,every single time
rainbow blue diamond,2.974732,rainbow blue diamond
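
Trigrams such as `went drive thru` and `thru drive thru` are not phrases anyone wrote. To my understanding, `TfidfVectorizer` drops stop words (and, with this `token_pattern`, any token shorter than three letters) before it assembles n-grams, so the n-grams bridge the gaps left by removed words. The sketch below imitates that order of operations in plain Python. The tiny `STOP` set is a hypothetical stand-in for the full stop word list, and the two sample sentences are invented.

```python
import re

TOKEN = re.compile(r"\b[a-zA-Z]{3,}\b")  # same pattern as the notebook's token_pattern
STOP = {"the", "through", "and"}          # hypothetical, much shorter than the real list

def ngrams_after_stopwords(text, n, stop_words=STOP):
    """Lowercase, tokenize, drop stop words, then join every run of n tokens."""
    tokens = [t for t in TOKEN.findall(text.lower()) if t not in stop_words]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams_after_stopwords("I went through the drive thru", 3))
print(ngrams_after_stopwords("Went thru the drive thru", 3))
```

Reading the n-gram tables with this in mind helps: `never order right` most likely stands for something like "never get my order right".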


In [16]:
score2.to_csv('vegas.csv')
score1.to_csv('vegas1.csv')

In [17]:
# 4-grams
vegas = vectorizer3.fit_transform(r_vegas)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(vegas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

score3.head(10)

Unnamed: 0,score,term
terrible service plain simple,1.0,terrible service plain simple
slowest ever especially mornings,1.0,slowest ever especially mornings
service either hit miss,1.0,service either hit miss
matter time day always,0.746944,matter time day always
guys never order right,0.707107,guys never order right
always busy usually cold,0.707107,always busy usually cold
drive line terrible painfully,0.707107,drive line terrible painfully
never order right hard,0.707107,never order right hard
busy usually cold sigh,0.707107,busy usually cold sigh
awful service extremely slow,0.707107,awful service extremely slow
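
The repeated values 1.0 and 0.707107 in this table (and 0.57735 in the later 4-gram tables) look like 1, 1/√2 and 1/√3. That is what one would expect if each of these 4-grams occurs in a single review: with `binary=True` every term in a review starts with the same weight, and L2 normalization then gives each of the review's k distinct 4-grams a weight of 1/√k. The sketch below reimplements this in plain Python, assuming scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`). The function `binary_tfidf` and the three toy "reviews" are illustrative only.

```python
import math

def binary_tfidf(docs):
    """docs: list of lists of terms. Returns one {term: weight} dict per doc."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    rows = []
    for doc in docs:
        # binary=True: term frequency is 1 if the term is present, whatever its count.
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
        raw = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in set(doc)}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows

# Toy reviews, each already reduced to its 4-grams; no 4-gram is shared.
docs = [
    ["terrible service plain simple"],
    ["always busy usually cold", "busy usually cold sigh"],
    ["gram a", "gram b", "gram c"],
]
for row in binary_tfidf(docs):
    print({term: round(weight, 6) for term, weight in row.items()})
```

Under this reading, a 4-gram score of 1.0 mostly signals a very short review rather than a frequent complaint, so the 4-gram tables rank one-off phrases rather than recurring themes.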


## Chicago

In [18]:
# filter only Chicago
Chicago = data[data.city == 'Chicago']
r_chicago = list(Chicago['review'].values)

# lemmatize review
lemma(r_chicago)

['I am a big fan of Mc Donalds , however the young lady at the register are unprofessional , and untrained . The promotion for buy one Quater pounder get one free after doing a review is lost upon the entire staff . This wa on a monday , but happened again on a tuesday . I am telling everyone i know , and i have a LARGE Twitter follower , also a a HUGE facebook following . ',
 'On my way to the local brewery not too far from here we decided to grab a bite to eat beforehand . We pulled into the drive through because getting out just seemed like an unnecessary risk . As we waited and finally got up to the menu , the order taker informed u that they were only accepting cash at the moment . My driver decided to exit the line but I insisted on getting him a meal . I risked life and limb to get that man dinner . I agree with the other reviewer that this place is le that satisfactory . The cashier wa not even the slightest bit delightful in my ordering experience . The restaurant did not look

In [19]:
# 2-5 grams
chicago = vectorizer1.fit_transform(r_chicago)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(chicago.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1.head()

Unnamed: 0,score,term
drive thru,11.378852,drive thru
parking lot,5.301684,parking lot
customer service,4.689828,customer service
every time,3.800888,every time
order wrong,3.425112,order wrong


In [20]:
# 3-grams
chicago = vectorizer2.fit_transform(r_chicago)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(chicago.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

score2.head(20)

Unnamed: 0,score,term
great customer service,2.687094,great customer service
piece chicken nugget,2.374188,piece chicken nugget
drive thru line,2.035073,drive thru line
coffee hot coffee,2.0,coffee hot coffee
red line station,2.0,red line station
salsa breakfast burrito,2.0,salsa breakfast burrito
time order wrong,2.0,time order wrong
receipt threw away,2.0,receipt threw away
give correct change,2.0,give correct change
last time went,1.930051,last time went


In [21]:
# 4-grams
chicago = vectorizer3.fit_transform(r_chicago)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(chicago.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

score3.head(10)

Unnamed: 0,score,term
forgot order waiting mins,1.0,forgot order waiting mins
breakfast stops idea ggrrr,1.0,breakfast stops idea ggrrr
expect lol lol lol,1.0,expect lol lol lol
special wokrer window friendly,0.707107,special wokrer window friendly
service ever avoid place,0.707107,service ever avoid place
worst service ever avoid,0.707107,worst service ever avoid
nothing special wokrer window,0.707107,nothing special wokrer window
good environment looks better,0.57735,good environment looks better
worst ever hope give,0.57735,worst ever hope give
ever hope give zero,0.57735,ever hope give zero


In [22]:
score1.to_csv('chicago.csv')

## Los Angeles

In [23]:
# filter only LA
LA = data[data.city == 'Los Angeles']
r_la = list(LA['review'].values)

# lemmatize review
lemma(r_la)

["Slowest drive-thru ever . You 're always bombarded by house le occupant that post at the drive- thru entrance and exit . Better option is to drive the extra 2 min west and go to the location on arlington . ",
 "If I could give this place negative star I wouldHorrible service , rude staff and their price on the menu do n't even match what they charge you they have price showing that to add bacon and $ .79 and then she charge me a $ 1.89 per piece of bacon boosted my meal up almost 3 dollar and some change just to add bacon to a breakfast sandwich and then when I speak with the manager in charge she there though well they 'll change the price eventually everyone ha attitude no one 's willing to help I work in customer service myself and if I wa a treat anyone the way that I wa treated this morning it will be a serious issue but of course because it 's McDonald 's care it 's a free-for-all to do whatever you want to treat people however you want to treat them with no repercussion I will

In [24]:
# 2-5 grams
la = vectorizer1.fit_transform(r_la)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(la.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1

Unnamed: 0,score,term
drive thru,12.506190,drive thru
worst ever,3.536440,worst ever
order wrong,3.201370,order wrong
iced coffee,2.982152,iced coffee
many times,2.853963,many times
customer service,2.493280,customer service
open hours,2.298396,open hours
late night,2.274971,late night
across street,2.198805,across street
took minutes,2.121605,took minutes


In [25]:
# 3-grams
la = vectorizer2.fit_transform(r_la)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(la.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

score2.head(20)

Unnamed: 0,score,term
drive thru line,3.014437,drive thru line
drive thru service,2.485056,drive thru service
customer service ever,2.371946,customer service ever
drive thru window,2.371946,drive thru window
back explained happened,2.0,back explained happened
ordered egg mcmuffin,2.0,ordered egg mcmuffin
order drive thru,2.0,order drive thru
gotten order wrong,2.0,gotten order wrong
every single time,2.0,every single time
worst service ever,2.0,worst service ever


In [26]:
# 4-grams
la = vectorizer3.fit_transform(r_la)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(la.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

score3.head(10)

Unnamed: 0,score,term
horrible service order wrong,1.0,horrible service order wrong
like drive thru service,1.0,like drive thru service
people world place enjoy,0.707107,people world place enjoy
cup coffee drive thru,0.707107,cup coffee drive thru
cash night sounds shady,0.707107,cash night sounds shady
dumbest people world place,0.707107,dumbest people world place
hey cup coffee drive,0.707107,hey cup coffee drive
accept cash night sounds,0.707107,accept cash night sounds
like place bcas ita,0.707107,like place bcas ita
place bcas ita neighborhood,0.707107,place bcas ita neighborhood


In [27]:
score1.to_csv('la.csv')

## New York

In [28]:
# filter only New York
NY = data[data.city == 'New York']
r_ny = list(NY['review'].values)

# lemmatize review
lemma(r_ny)

["1 . It 's a Mcdonalds.2 . It 's Harlem3 . It 's Harlem4 . Did I mention it 's Harlem ? I personally do not like Harlem and this is because Harlem is too overcrowded , and there are some people there with very nasty attitude . That you ca n't even look at them , without someone feeling some type of way . I avoid Harlem at any cost that I can . But I had to go to Mcdonalds , being said that this wa the closest . First off the female there wa out of it , I do n't think she knew where she wa , and they got all my order wrong . They put some really really nasty sauce spicy buffalo on my Mcchicken . Who put that crap on a Mcchicken who in they 're right mind would do some crazy thing like that ? That wa the most horrible Mcchicken I 've ever tasted in my entire life . People beware . ",
 "Awful service ( super slow ) . Just wanted a mcflurry and it took over ten minute with only two people ahead of me in line . Does n't seem clean . ",
 'Would you like roach with that ? ',
 "More expensive

In [29]:
# 2-5 grams
ny = vectorizer1.fit_transform(r_ny)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(ny.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1.head(20)

Unnamed: 0,score,term
customer service,4.737319,customer service
big mac,3.595069,big mac
worst ever,3.428556,worst ever
happy meal,3.037121,happy meal
bad service,2.949364,bad service
rush hour,2.91991,rush hour
homeless people,2.713449,homeless people
fries always,2.712008,fries always
late night,2.703424,late night
ice cream,2.621264,ice cream


In [30]:
# 3-grams
ny = vectorizer2.fit_transform(r_ny)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(ny.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

score2.head(20)

Unnamed: 0,score,term
ice cream sundae,2.0,ice cream sundae
inside waiting bus,2.0,inside waiting bus
people standing counter,2.0,people standing counter
person working cashier,2.0,person working cashier
long day work,1.707107,long day work
went massive renovation,1.707107,went massive renovation
egg cheese mcmuffin,1.414214,egg cheese mcmuffin
sausage egg cheese,1.414214,sausage egg cheese


In [31]:
# 4-grams
ny = vectorizer3.fit_transform(r_ny)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(ny.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

score3.head(10)

Unnamed: 0,score,term
really reason employees rude,1.0,really reason employees rude
found king bug burger,1.0,found king bug burger
needs work customer service,0.57735,needs work customer service
seen many signs saying,0.57735,seen many signs saying
many signs saying things,0.57735,many signs saying things
particular location needs work,0.57735,particular location needs work
wow seen many signs,0.57735,wow seen many signs
close ghetto experience welcome,0.57735,close ghetto experience welcome
want close ghetto experience,0.57735,want close ghetto experience
location needs work customer,0.57735,location needs work customer


In [32]:
score1.to_csv('ny.csv')

## Atlanta 

In [33]:
# filter only Atlanta
Atlanta = data[data.city == 'Atlanta']
r_atlanta = list(Atlanta['review'].values)

# lemmatize review
lemma(r_atlanta)

["I 'm not a huge mcds lover , but I 've been to better one . This is by far the worst one I 've ever been too ! It 's filthy inside and if you get drive through they completely screw up your order every time ! The staff is terribly unfriendly and nobody seems to care . ",
 'Terrible customer service . I came in at 9:30pm and stood in front of the register and no one bothered to say anything or help me for 5 minute . There wa no one else waiting for their food inside either , just outside at the window . I left and went to Chickfila next door and wa greeted before I wa all the way inside . This McDonalds is also dirty , the floor wa covered with dropped food . Obviously filled with surly and unhappy worker . ',
 "First they `` lost '' my order , actually they gave it to someone one else than took 20 minute to figure out why I wa still waiting for my order.They after I wa asked what I needed I replied , `` my order '' .They asked for my ticket and the asst mgr looked at the ticket then 

In [34]:
# 2-5 grams
atlanta = vectorizer1.fit_transform(r_atlanta)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(atlanta.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1.head(20)

Unnamed: 0,score,term
drive thru,8.828755,drive thru
customer service,2.955215,customer service
worst ever,2.524639,worst ever
ice cream,2.487052,ice cream
order wrong,2.403048,order wrong
northside hospital,2.322851,northside hospital
particular location,2.13472,particular location
like nothing,2.0,like nothing
every time,1.957103,every time
across street,1.950903,across street


In [35]:
# 3-grams
atlanta = vectorizer2.fit_transform(r_atlanta)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(atlanta.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

score2.head(20)

Unnamed: 0,score,term
order drive thru,3.332197,order drive thru
cars drive thru,2.238283,cars drive thru
went drive thru,2.030364,went drive thru
went back store,2.0,went back store
stay away location,2.0,stay away location
drive thru walk,1.745864,drive thru walk
northside hospital option,1.707107,northside hospital option
trapped northside hospital,1.707107,trapped northside hospital
sat drive thru,1.707107,sat drive thru
drive thru seems,1.57735,drive thru seems


In [36]:
# 4-grams
atlanta = vectorizer3.fit_transform(r_atlanta)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(atlanta.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

score3.head(10)

Unnamed: 0,score,term
regular close highway good,0.707107,regular close highway good
close highway good bad,0.707107,close highway good bad
ice cream machine always,0.66095,ice cream machine always
see giving star star,0.57735,see giving star star
giving star star need,0.57735,giving star star need
employees hanging friends front,0.57735,employees hanging friends front
disorganized order employees hanging,0.57735,disorganized order employees hanging
star star need say,0.57735,star star need say
order employees hanging friends,0.57735,order employees hanging friends
particular location worst employees,0.447214,particular location worst employees


In [37]:
score1.to_csv('atlanta.csv')

## Houston

In [38]:
# filter only Houston
Houston = data[data.city == 'Houston']
r_houston = list(Houston['review'].values)

# lemmatize review
lemma(r_houston)

['Manger is extremely rude , for a restaurant that pride itself for customer . It lack a lot of it ',
 "They recently gave this location a renovation and updated it . It look really nice and now ha an indoor play area for the kiddos . It 's super clean inside ( at least when I went ) though we actually had to ask someone where the trashcan wa since it wa hidden around a corner . Ketchup dispenser wa broken and they gave me sweet and sour sauce instead of sweet chili like I asked . Now , I usually do n't review fast food since it 's pretty much the same no matter where you go , but this is the location closest to my house , and I can honestly say I rarely ever go because of one thing ... they are SLOOWWWWWW ! In the drive through , expect to wait around 10+ minute . I even had two people behind me decide to leave after they had ordered , and I wa n't the one holding up the line ! I ordered a Spicy McChicken and nugget . I even considered if going inside would have been faster . Bottom l

In [39]:
# 2-5 grams
houston = vectorizer1.fit_transform(r_houston)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(houston.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1.head(20)

Unnamed: 0,score,term
drive thru,11.499161,drive thru
french fries,3.071101,french fries
customer service,2.872201,customer service
check order,2.646158,check order
every time,2.600101,every time
took minutes,2.198231,took minutes
order wrong,2.189673,order wrong
went drive,2.165907,went drive
dollar menu,2.114301,dollar menu
sausage biscuit,2.035795,sausage biscuit


In [40]:
# 3-grams
houston = vectorizer2.fit_transform(r_houston)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(houston.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

score2.head(20)

Unnamed: 0,score,term
went drive thru,3.351953,went drive thru
drive thru line,2.76358,drive thru line
minutes drive thru,2.432893,minutes drive thru
use drive thru,2.257521,use drive thru
drive thru order,2.116631,drive thru order
check order drive,2.0,check order drive
egg cheese biscuit,2.0,egg cheese biscuit
especially drive thru,2.0,especially drive thru
every single time,2.0,every single time
indoor play area,2.0,indoor play area


In [41]:
# 4-grams
houston = vectorizer3.fit_transform(r_houston)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(houston.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

score3.head(10)

Unnamed: 0,score,term
looking take order ends,1.0,looking take order ends
ice cream cones regrets,0.57735,ice cream cones regrets
mcdonaldsi ice cream cones,0.57735,mcdonaldsi ice cream cones
love mcdonaldsi ice cream,0.57735,love mcdonaldsi ice cream
micky dirty horrible place,0.5,micky dirty horrible place
indoor play area kids,0.5,indoor play area kids
great indoor play area,0.5,great indoor play area
dirty horrible place take,0.5,dirty horrible place take
nice clean great indoor,0.5,nice clean great indoor
clean great indoor play,0.5,clean great indoor play


In [42]:
score1.to_csv("Houston.csv")

## Portland 

In [43]:
# filter only Portland
Portland = data[data.city == 'Portland']
r_portland = list(Portland['review'].values)

# lemmatize review
lemma(r_portland)

["Dirtiest filthiest McDonalds I 've ever been to . Filthy floor in an empty restaurant , filthy credit card machine , dirty rag hanging on the garbage bin . Will never be back . Gross gross . Avoid unless you want a communicable disease or food poisoning . Ick . How do they pas their health department check ? The jewel on the crown wa a salad full of warm wilted lettuce . Did n't dare eat it . ",
 'Do not go in . Walk across the street & get a day old , shriveled up hot dog from the 7-11 . Just take my word for it . ',
 "`` Would you like any ketchup on your tray ? '' Huh ? Oh , right , you consider this a ghetto location where people will for some reason make off with ALL THE KETCHUP if you give them a chance . So it 's hoarded , doled out packet by precious packet . Want a bit more ? Better be ready for the glower a the manager put one additional packet on your tray . Oh , the sign that screamed `` no loitering '' wa also a very nice touch . `` That 's not going to be enough ? `` No

In [44]:
# 2-5 grams
portland = vectorizer1.fit_transform(r_portland)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(portland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1.head(20)

Unnamed: 0,score,term
drive thru,7.545466,drive thru
every time,3.080159,every time
care less,2.423831,care less
french fries,2.271284,french fries
went drive,2.123706,went drive
never eat,2.112641,never eat
cold fries,2.093191,cold fries
parking lot,2.03955,parking lot
tasted like,2.014434,tasted like
egg mcmuffin,2.0,egg mcmuffin


In [45]:
# 3-grams
portland = vectorizer2.fit_transform(r_portland)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(portland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

score2.head(20)

Unnamed: 0,score,term
went drive thru,3.0,went drive thru
drive thru window,2.0,drive thru window
every time visit,2.0,every time visit
gets order wrong,2.0,gets order wrong
location good experience,2.0,location good experience
order drive window,2.0,order drive window
terrible customer service,2.0,terrible customer service
chicken nuggets fish,1.0,chicken nuggets fish
fillet chicken sandwich,1.0,fillet chicken sandwich
fish fillet chicken,1.0,fish fillet chicken


In [46]:
# 4-grams
portland = vectorizer3.fit_transform(r_portland)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(portland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

score3.head(10)

Unnamed: 0,score,term
plenty interesting places eat,1.0,plenty interesting places eat
screw ice cream come,0.707107,screw ice cream come
mcd area less value,0.707107,mcd area less value
ice cream come education,0.707107,ice cream come education
expensive mcd area less,0.707107,expensive mcd area less
nice night crew rude,0.57735,nice night crew rude
crew nice night crew,0.57735,crew nice night crew
place park going observatory,0.57735,place park going observatory
park going observatory roscoe,0.57735,park going observatory roscoe
day crew nice night,0.57735,day crew nice night
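
A note on reading the 4-gram scores above: the values 1.0, 0.707107 and 0.57735 are consistent with row normalization rather than with frequency. The sketch below assumes `vectorizer3` keeps scikit-learn's default `norm="l2"` and that each of these 4-grams occurs in only one review, so every 4-gram in that review carries the same raw weight. The helper `equal_weight_score` is hypothetical and only illustrates the arithmetic.

```python
import math


def equal_weight_score(k):
    """Normalized weight of each n-gram in a review with k equally weighted n-grams."""
    # Assumption: every n-gram in the review has the same tf * idf value.
    raw = [1.0] * k
    # L2 normalization divides each weight by the length of the row vector.
    norm = math.sqrt(sum(w * w for w in raw))
    return raw[0] / norm


for k in (1, 2, 3):
    print(k, round(equal_weight_score(k), 6))
```

Under these assumptions a score of 1.0 means the review contained a single 4-gram after cleaning, 0.707107 means two, and 0.57735 means three. High 4-gram scores therefore point to very short reviews, not to widely shared complaints.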


In [47]:
score1.to_csv('portland.csv')

## Dallas 

In [48]:
# filter only Dallas
Dallas = data[data.city == 'Dallas']
r_dallas = list(Dallas['review'].values)

# lemmatize review
lemma(r_dallas)

["So it 's fast food and McDonalds at that . So let 's say expectation are very , very low . Sometimes though you just want food that 's fast and you know what to expect . Not every meal can be a 5 star dinner with an executive chef.Even with appropriate low expectation , this location take the cake for subpar service . About a week ago I wa exhausted and needed something to eat and there are few fast food restaurant near where I live . So off to McDonalds I went . Ordered a value meal and a single apple pie . Surprise ... no pie ! Ok , mistake happen . I 've worked drive through in a previous life and I get it . Even though it wa 10 pm and I wa the only car there , so it 's not like they were rushed . Still ... mistake happen.Fast forward to tonight . Again about 10 pm and exhausted and decide to go crazy and get McD 's twice in 1 week . So I try again to order an apple pie with my value meal . Now , maybe this is my fault , because I should have checked the bag before driving off . B

In [49]:
# 2-5 grams
dallas = vectorizer1.fit_transform(r_dallas)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(dallas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1.head(30)

Unnamed: 0,score,term
drive thru,8.12085,drive thru
customer service,5.085995,customer service
parking lot,3.504216,parking lot
big mac,2.110771,big mac
play area,2.056075,play area
looked like,2.014677,looked like
never back,1.803684,never back
staff friendly,1.654771,staff friendly
many times,1.590418,many times
iced coffee,1.590418,iced coffee


In [50]:
# 3-grams
dallas = vectorizer2.fit_transform(r_dallas)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(dallas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

score2

Unnamed: 0,score,term
children play area,2.0,children play area
drive thru order,2.0,drive thru order
drive thru ordered,2.0,drive thru ordered
service waited minutes,2.0,service waited minutes
time order right,2.0,time order right
drive thru window,1.707107,drive thru window
worst customer service,1.707107,worst customer service


In [51]:
# 4-grams
dallas = vectorizer3.fit_transform(r_dallas)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(dallas.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

score3.head(10)

Unnamed: 0,score,term
said end day still,0.5,said end day still
clean restaurant service said,0.5,clean restaurant service said
restaurant service said end,0.5,restaurant service said end
service said end day,0.5,service said end day
took min iced coffee,0.447214,took min iced coffee
messed location times refuse,0.447214,messed location times refuse
order messed location times,0.447214,order messed location times
refuse love iced coffees,0.447214,refuse love iced coffees
counter girl super rude,0.447214,counter girl super rude
location times refuse love,0.447214,location times refuse love


In [52]:
score1.to_csv('dallas.csv')

## Cleveland 

In [53]:
# filter only Cleveland
Cleveland = data[data.city == 'Cleveland']
r_cleveland = list(Cleveland['review'].values)

# lemmatize review
lemma(r_cleveland)

['Horrible Service , staff is not bothered , they are just lazy and dont care . I would say avoid this place . else you gon na spoil your day . Today I went to get coffee and the place looked chaotic with so many customer . I wasnt sure if all of those customer were waiting to order to pick their order . One of the staff member wa at cash register lazily looking at customer and not doing anything . When I wa about to ask her if I can order , she said she is ready for next order . I order a latte and it wa horrible , cold , and too much sugary syrup . I would rate 0 but thats not possible.Just Avoid It ',
 'Order # 488 today at 4pm . Nobody else in line . 10 second to get my drink but six minute ticket time . Manager came forward , grabbed the receipt out my hand , crumpled it up and handed me my food . Do I need to repeat that ? Corporate will be called . ',
 'This McDonalds is always horrible . I think that the only reason they are still open is because they are located in Tower City 

In [54]:
# 2-5 grams
cleveland = vectorizer1.fit_transform(r_cleveland)
terms = vectorizer1.get_feature_names()
tf_idf = pd.DataFrame(cleveland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score1 = pd.DataFrame(tf_idf, columns=["score"])
score1["term"] = terms
score1.sort_values(by="score", ascending=False, inplace=True)

score1.head()

Unnamed: 0,score,term
drive thru,5.02559,drive thru
customer service,2.516729,customer service
worst ever,2.165731,worst ever
somewhere else,2.0,somewhere else
quarter pounder,2.0,quarter pounder


In [55]:
# 3-grams
cleveland = vectorizer2.fit_transform(r_cleveland)
terms = vectorizer2.get_feature_names()
tf_idf = pd.DataFrame(cleveland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score2 = pd.DataFrame(tf_idf, columns=["score"])
score2["term"] = terms
score2.sort_values(by="score", ascending=False, inplace=True)

score2.head(20)

Unnamed: 0,score,term
turn around time,2.0,turn around time
drive thru slowest,2.0,drive thru slowest
complaints orders never,0.458831,complaints orders never
never received correctly,0.458831,never received correctly
received correctly time,0.458831,received correctly time
orders never received,0.458831,orders never received
constant arguing grill,0.458831,constant arguing grill
correctly time give,0.458831,correctly time give
counter massive customer,0.458831,counter massive customer
customer complaints orders,0.458831,customer complaints orders


In [56]:
# 4-grams
cleveland = vectorizer3.fit_transform(r_cleveland)
terms = vectorizer3.get_feature_names()
tf_idf = pd.DataFrame(cleveland.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score3 = pd.DataFrame(tf_idf, columns=["score"])
score3["term"] = terms
score3.sort_values(by="score", ascending=False, inplace=True)

score3.head(10)

Unnamed: 0,score,term
typically always bad service,1.0,typically always bad service
stopping every time eat,0.57735,stopping every time eat
hear heart stopping every,0.57735,hear heart stopping every
heart stopping every time,0.57735,heart stopping every time
though probably figured drove,0.5,though probably figured drove
super ghetto though probably,0.5,super ghetto though probably
dirty super ghetto though,0.5,dirty super ghetto though
ghetto though probably figured,0.5,ghetto though probably figured
customer complaints orders never,0.471405,customer complaints orders never
location run worst management,0.471405,location run worst management


In [57]:
score1.to_csv('cleveland.csv')
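
The per-city cells above repeat the same fit, sum and sort steps. A minimal sketch of one reusable helper follows. The name `top_terms` and its parameters are hypothetical, and the vectorizer settings here are placeholders because the notebook's `vectorizer1` to `vectorizer3` may be configured differently. Newer scikit-learn releases expose `get_feature_names_out()` in place of `get_feature_names()`, so the sketch checks for both.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def top_terms(docs, ngram_range=(2, 2), n=10, stop_words=None):
    """Return the n terms with the highest summed TF-IDF score across docs."""
    # Placeholder settings; pass the same options used for vectorizer1-3.
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, stop_words=stop_words)
    matrix = vectorizer.fit_transform(docs)
    # Use whichever feature-name accessor this scikit-learn version provides.
    if hasattr(vectorizer, "get_feature_names_out"):
        terms = list(vectorizer.get_feature_names_out())
    else:
        terms = list(vectorizer.get_feature_names())
    # Sum each term's column directly on the sparse matrix.
    scores = np.asarray(matrix.sum(axis=0)).ravel()
    table = pd.DataFrame({"term": terms, "score": scores})
    return table.sort_values("score", ascending=False).head(n).reset_index(drop=True)


# Toy reviews for illustration only, not taken from the dataset.
toy_reviews = [
    "drive thru slow",
    "drive thru order wrong",
    "drive thru closed early",
    "cold fries",
]
print(top_terms(toy_reviews, ngram_range=(2, 2), n=3))
```

With such a helper, each city section reduces to one call per n-gram range, for example `top_terms(r_cleveland, ngram_range=(3, 3), n=20)`.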

## phrase counts for high-scoring phrases

In [58]:
# count "drive thru"
import re

count = 0
for line in allreview:
    if len(re.findall(r"(drive thru)", line)) >= 1:
        count += 1

print("drive thru is mentioned in",round((count/1525)*100,3), '% of all comments')

drive thru is mentioned in 12.984 % of all comments


In [59]:
# count "drive thru" in Vegas
count = 0
for line in r_vegas:
    if len(re.findall(r"(drive thru)", line)) >= 1:
        count += 1

(count/409)*100

16.381418092909534
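
The Vegas calculation above can be extended to every city in one pass. The sketch below assumes a DataFrame with the `city` and `review` columns shown in `data.head()`. The function name `phrase_share_by_city` and the toy rows are hypothetical, and it matches against the raw review text rather than the cleaned lists.

```python
import pandas as pd


def phrase_share_by_city(frame, pattern):
    """Percentage of reviews per city that contain at least one match of pattern."""
    # True where the review text matches; missing reviews count as no match.
    contains = frame["review"].str.contains(pattern, regex=True, na=False)
    # The mean of a boolean Series is the share of True values.
    return (contains.groupby(frame["city"]).mean() * 100).round(3)


# Toy rows for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "city": ["Las Vegas", "Las Vegas", "Dallas", "Dallas"],
    "review": [
        "the drive thru line was endless",
        "cold fries",
        "drive thru got my order wrong",
        "waited at the drive thru window",
    ],
})
print(phrase_share_by_city(toy, r"drive thru"))
```

Grouping on the DataFrame avoids hard-coding each city's review count, such as the 409 used above.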

In [60]:
# count "order wrong" , "wrong order" , "order right" , "right order" "correct order" "order correct"
count = 0
for line in allreview:
    if len(re.findall(r"(order wrong|wrong order|order right|right order|correct order|order correct)", line)) >= 1:
        count += 1

print("wrong order, order wrong, order right, right order, correct order, or order correct is mentioned in",round((count/1525)*100,3), '% of all comments')

wrong order, order wrong, order right, right order, correct order, or order correct is mentioned in 8.656 % of all comments


In [61]:
# count "ice cream"
count = 0
for line in allreview:
    if len(re.findall(r"(ice cream|icecream)", line)) >= 1:
        count += 1
print("ice cream is mentioned in",round((count/1525)*100,3), '% of all comments')

ice cream is mentioned in 2.689 % of all comments


In [62]:
# count "french fry" (the pattern also matches any standalone "fries")
count = 0
for line in allreview:
    if len(re.findall(r"(french fry|frenchfry|fries)", line)) >= 1:
        count += 1
print("french fry is mentioned in",round((count/1525)*100,3), '% of all comments')

french fry is mentioned in 2.164 % of all comments


In [63]:
# count "big mac"
count = 0
for line in allreview:
    if len(re.findall(r"(big mac|bigmac)", line)) >= 1:
        count += 1
print("big mac is mentioned in",round((count/1525)*100,3), '% of all comments')

big mac is mentioned in 0.852 % of all comments


In [64]:
# count "chicken nugget"
count = 0
for line in allreview:
    if len(re.findall(r"(chicken nugget)", line)) >= 1:
        count += 1
print("chicken nugget is mentioned in",round((count/1525)*100,3), '% of all comments')

chicken nugget is mentioned in 2.098 % of all comments


In [65]:
# count "iced coffee"
count = 0
for line in allreview:
    if len(re.findall(r"(iced coffee)", line)) >= 1:
        count += 1
print("iced coffee is mentioned in",round((count/1525)*100,3), '% of all comments')

iced coffee is mentioned in 1.77 % of all comments


In [66]:
# count "sweet tea"
count = 0
for line in allreview:
    if len(re.findall(r"(sweet tea)", line)) >= 1:
        count += 1
print("sweet tea is mentioned in",round((count/1525)*100,3), '% of all comments')

sweet tea is mentioned in 1.705 % of all comments


In [67]:
# count "late night"
count = 0
for line in allreview:
    if len(re.findall(r"(late night|latenight)", line)) >= 1:
        count += 1
print("late night is mentioned in",round((count/1525)*100,3), '% of all comments')

late night is mentioned in 1.508 % of all comments


In [68]:
# count "parking" (this also covers "parking lot")
count = 0
for line in allreview:
    if len(re.findall(r"parking", line)) >= 1:
        count += 1
print("parking or parking lot is mentioned in",round((count/1525)*100,3), '% of all comments')

parking or parking lot is mentioned in 3.148 % of all comments
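
The counting cells in this section share one loop, so they can be folded into a small function. This is a sketch only: `phrase_share` is a hypothetical name, and it divides by the length of the list it receives instead of the hard-coded 1525. Matching stays case-sensitive unless `ignore_case=True` is passed, which is an added option the cells above do not have.

```python
import re


def phrase_share(reviews, pattern, ignore_case=False):
    """Percentage of reviews that contain at least one match of pattern."""
    if not reviews:
        return 0.0
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    # search() stops at the first match, which is all the count needs.
    hits = sum(1 for text in reviews if compiled.search(text))
    return round(100 * hits / len(reviews), 3)


# Toy reviews for illustration only, not taken from the dataset.
sample = ["drive thru was slow", "cold fries", "the drive thru line", "ok"]
print(phrase_share(sample, r"drive thru"))
```

Each cell then becomes a single call such as `phrase_share(allreview, r"late night|latenight")`.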
