# Class Activity 03 assignment with McDonald's data

## Imaginary problem statement

McDonald's receives **thousands of customer comments** on their website per day, and many of them are negative. Their corporate employees don't have time to read every single comment, but they do want to read a subset of comments that they are most interested in. In particular, the media has recently portrayed their employees as being rude, and so they want to review comments about **rude service**.

McDonald's has hired you to develop a system that ranks each comment by the **likelihood that it is referring to rude service**. They will use your system to build a "rudeness dashboard" for their corporate employees, so that employees can spend a few minutes each day examining the **most relevant recent comments**.

## Description of the data

Before hiring you, McDonald's used the [CrowdFlower platform](http://www.crowdflower.com/data-for-everyone) to pay humans to **hand-annotate** about 1500 comments with the **type of complaint**. The complaint types are listed below, with the encoding used in the data listed in parentheses:

- Bad Food (BadFood)
- Bad Neighborhood (ScaryMcDs)
- Cost (Cost)
- Dirty Location (Filthy)
- Missing Item (MissingFood)
- Problem with Order (OrderProblem)
- Rude Service (RudeService)
- Slow Service (SlowService)
- None of the above (na)

## Task 1

Read **`mcdonalds.csv`** into a pandas DataFrame and examine it. (It can be found in the **`data`** directory of the course repository.)

- The **policies_violated** column lists the type of complaint. If there is more than one type, the types are separated by newline characters.
- The **policies_violated:confidence** column lists CrowdFlower's confidence in the judgments of its human annotators for that row (higher is better).
- The **city** column is the McDonald's location.
- The **review** column is the actual text comment.
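
Because several complaint types can share one cell, it can help to look at the individual labels as well as the raw combinations. The following is a minimal sketch on a hypothetical toy DataFrame (the rows are invented, not taken from `mcdonalds.csv`), assuming the labels are separated by newline characters as described above:

```python
import pandas as pd

# Hypothetical stand-in for the real data: one label string per comment
toy = pd.DataFrame({
    "policies_violated": ["RudeService\nOrderProblem\nFilthy", "BadFood", "na", None],
})

# Split each cell on the newline separator, then give each label its own row
labels = toy["policies_violated"].str.split("\n").explode()

# Count how often each individual complaint type appears (nulls are skipped)
label_counts = labels.value_counts()
print(label_counts)
```

The same two calls could be applied to the real **policies_violated** column to see how often each complaint type occurs on its own.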

In [231]:
#reading McDonald's csv
import pandas as pd
path = '/content/mcdonalds.csv'
mcd = pd.read_csv(path)

In [232]:
#Examining Data
mcd.describe()
mcd.info()
mcd.head(5)
mcd['policies_violated'].value_counts().sort_index()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   _unit_id                      1525 non-null   int64  
 1   _golden                       1525 non-null   bool   
 2   _unit_state                   1525 non-null   object 
 3   _trusted_judgments            1525 non-null   int64  
 4   _last_judgment_at             1525 non-null   object 
 5   policies_violated             1471 non-null   object 
 6   policies_violated:confidence  1471 non-null   object 
 7   city                          1438 non-null   object 
 8   policies_violated_gold        0 non-null      float64
 9   review                        1525 non-null   object 
 10  Unnamed: 10                   0 non-null      float64
dtypes: bool(1), float64(2), int64(2), object(6)
memory usage: 120.8+ KB


BadFood                                                                101
BadFood\nCost                                                            4
BadFood\nFilthy                                                          2
BadFood\nFilthy\nRudeService                                             1
BadFood\nMissingFood                                                     1
                                                                      ... 
na\nCost                                                                 1
na\nScaryMcDs                                                            2
na\nScaryMcDs\nBadFood                                                   1
na\nSlowService\nScaryMcDs                                               1
na\nSlowService\nScaryMcDs\nRudeService\nOrderProblem\nFilthy\nCost      1
Name: policies_violated, Length: 146, dtype: int64

In [233]:
#Examining text for first row
mcd.loc[0,'review']

"I'm not a huge mcds lover, but I've been to better ones. This is by far the worst one I've ever been too! It's filthy inside and if you get drive through they completely screw up your order every time! The staff is terribly unfriendly and nobody seems to care."

## Task 2

Remove any rows from the DataFrame in which the **policies_violated** column has a **null value**. Check the shape of the DataFrame before and after to confirm that you only removed about 50 rows.

- **Note:** Null values are also known as "missing values", and are encoded in pandas with the special value "NaN". This is distinct from the "na" encoding used by CrowdFlower to denote "None of the above". Rows that contain "na" should **not** be removed.
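
The difference between a true null and the string "na" can be checked on a small example. This is a sketch with invented rows, assuming the same column names as the real data:

```python
import pandas as pd

# Hypothetical rows: one real null (None) and one "na" label
toy = pd.DataFrame({
    "policies_violated": ["RudeService", "na", None, "BadFood"],
    "review": ["rude cashier", "nothing wrong", "unlabeled", "cold fries"],
})

# Only the None entry counts as missing; the string "na" does not
n_missing = toy["policies_violated"].isnull().sum()
print(n_missing)

# dropna removes the null row and keeps the "na" row
cleaned = toy.dropna(subset=["policies_violated"])
print(cleaned.shape)
```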

In [234]:
#Null Row Removal
mcd.shape #1525 rows
mcd.policies_violated.isnull().sum() #54 null rows
mcd.dropna(subset = ['policies_violated'], inplace = True)
mcd.shape #1471 rows, so drop successful

(1471, 11)

## Task 3

Add a new column to the DataFrame called **"rude"** that is 1 if the **policies_violated** column contains the text "RudeService", and 0 if the **policies_violated** column does not contain "RudeService". The "rude" column is going to be your response variable, so check how many zeros and ones it contains.

In [235]:
# adding "rude" column to the dataframe
rude = []
for item in mcd['policies_violated']:
  if 'RudeService' in item:
    rude.append(1)
  else:
    rude.append(0)
print(rude.count(1))
print(rude.count(0))
mcd['rude'] = rude
mcd['rude'].head() # column added; 503 + 968 = 1471, which matches the row count after Task 2

503
968


0    1
1    1
2    0
3    0
4    1
Name: rude, dtype: int64
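
As an alternative to an explicit loop, the same flag can be built with pandas string methods. This is only a sketch on an invented Series; `rude_flag` is a hypothetical name, and `regex=False` is used so that the label is matched as plain text:

```python
import pandas as pd

# Hypothetical label strings, newline-separated like the real column
toy = pd.Series(["RudeService\nFilthy", "BadFood", "na\nRudeService", "SlowService"])

# True where the text contains "RudeService", converted to 1/0
rude_flag = toy.str.contains("RudeService", regex=False).astype(int)

# Number of zeros and ones in the response variable
print(rude_flag.value_counts())
```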

## Task 4

1. Define X (the **review** column) and y (the **rude** column).
2. Split X and y into training and testing sets (using the parameter **`random_state=1`**).
3. Use CountVectorizer (with the **default parameters**) to create document-term matrices from X_train and X_test.
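
The reason for fitting on the training text only can be seen on a tiny example. In this sketch (invented sentences), the vocabulary is learned from `train_docs`, and words that appear only in `test_docs` are ignored when the test text is transformed:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and testing documents
train_docs = ["the staff was rude", "the fries were cold"]
test_docs = ["rude manager and cold burger"]

vect = CountVectorizer()

# fit_transform learns the vocabulary from the training text and counts tokens
train_dtm = vect.fit_transform(train_docs)

# transform reuses that vocabulary, so both matrices have the same columns
test_dtm = vect.transform(test_docs)

print(sorted(vect.vocabulary_))
print(train_dtm.shape, test_dtm.shape)
print(test_dtm.toarray())
```

Only "rude" and "cold" are counted for the test document, since "manager", "and" and "burger" were never seen during fitting.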

In [236]:
# 1. define X and y
X = mcd.review
y = mcd.rude

In [237]:
# 2. Splitting into test and training sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [238]:
# Examining Shape
print(X_train.shape)
print(X_test.shape) #split looks right

(1103,)
(368,)


In [239]:
# 3. Using CountVectorizer to create DTM
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [240]:
# fit and transform for X_train DTM
X_train_dtm = vect.fit_transform(X_train)

In [241]:
# only transform on X_test DTM
X_test_dtm = vect.transform(X_test)

In [242]:
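# Examining Shape and last 50 features
print(X_train_dtm.shape)
print(X_test_dtm.shape) #7300 terms extracted
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
# The 'œæ' prefixes below look like mis-decoded characters in the CSV, so these tokens are probably encoding artifacts
print(vect.get_feature_names()[-50:])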

(1103, 7300)
(368, 7300)
['œæturns', 'œætwo', 'œæughhh', 'œæughhhhh', 'œæultimately', 'œæum', 'œæunfortunately', 'œæunreal', 'œæuntil', 'œæupon', 'œæuseless', 'œæusually', 'œævery', 'œæwait', 'œæwanna', 'œæwant', 'œæwas', 'œæwasn', 'œæway', 'œæwe', 'œæwell', 'œæwhat', 'œæwhatever', 'œæwhen', 'œæwhich', 'œæwhile', 'œæwho', 'œæwhy', 'œæwill', 'œæwish', 'œæwith', 'œæwon', 'œæword', 'œæwork', 'œæworkers', 'œæworst', 'œæwould', 'œæwow', 'œæwtf', 'œæya', 'œæyay', 'œæyeah', 'œæyears', 'œæyelp', 'œæyep', 'œæyes', 'œæyesterday', 'œæyet', 'œæyou', 'œæyour']


## Task 5

Fit a Multinomial Naive Bayes model to the training set, calculate the **predicted probabilities** (not the class predictions) for the testing set, and then calculate the **AUC**.

- **Note:** Because McDonald's only cares about ranking the comments by the likelihood that they refer to rude service, **classification accuracy** is not the relevant evaluation metric. **Area Under the Curve (AUC)** is a more useful evaluation metric for this scenario, since it measures the ability of the classifier to assign higher predicted probabilities to positive instances than to negative instances. See [AUC documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html#) for reference.
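
To make the ranking idea concrete, here is a small sketch with invented labels and scores. It compares `roc_auc_score` with accuracy at a 0.5 threshold, before and after halving every score; the halving is just an illustrative order-preserving change:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical true labels and predicted probabilities for four comments
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Halving every score keeps the ranking of the comments exactly the same
halved = [s / 2 for s in scores]

auc_original = roc_auc_score(y_true, scores)
auc_halved = roc_auc_score(y_true, halved)

# Accuracy needs a threshold (0.5 here), so it reacts to the rescaling
acc_original = accuracy_score(y_true, [int(s >= 0.5) for s in scores])
acc_halved = accuracy_score(y_true, [int(s >= 0.5) for s in halved])

print(auc_original, auc_halved)
print(acc_original, acc_halved)
```

The AUC is the same for both sets of scores because only the ordering matters, whereas the accuracy drops once every halved score falls below the threshold.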

In [243]:
# Using MultiNomial NB to predict probabilities and calculate AUC
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1] 

print('Features: ', X_train_dtm.shape[1])

#AUC Test
print(metrics.roc_auc_score(y_test, y_pred_prob))

Features:  7300
0.8426005404546177


## Task 6

Using Naive Bayes, try **tuning CountVectorizer** using some of the techniques we learned in class. Check the testing set **AUC** after each change, and find the set of parameters that increases AUC the most.

- **Hint:** It is highly recommended that you adapt the **`tokenize_test()`** function from class for this purpose, since it will allow you to iterate quickly through different sets of parameters.
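
One way to iterate quickly is to have the helper return its results and loop over a dictionary of candidate settings. The sketch below is not the `tokenize_test()` function from class; `auc_for`, `candidates` and the tiny corpus are all hypothetical and only illustrate the pattern:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus; 1 = rude service, 0 = anything else
train_text = ["the cashier was rude", "rude staff yelled at me",
              "the fries were cold", "my order was missing a burger"]
train_y = [1, 1, 0, 0]
test_text = ["such a rude manager", "cold burger again"]
test_y = [1, 0]

def auc_for(vect):
    """Return (number of features, testing set AUC) for one vectorizer."""
    train_dtm = vect.fit_transform(train_text)
    test_dtm = vect.transform(test_text)
    nb = MultinomialNB().fit(train_dtm, train_y)
    prob = nb.predict_proba(test_dtm)[:, 1]
    return train_dtm.shape[1], roc_auc_score(test_y, prob)

# Candidate CountVectorizer settings to compare
candidates = {
    "default": {},
    "stop_words": {"stop_words": "english"},
    "bigrams": {"ngram_range": (1, 2)},
}

results = {name: auc_for(CountVectorizer(**params)) for name, params in candidates.items()}
for name, (n_features, auc) in results.items():
    print(f"{name}: {n_features} features, AUC={auc:.3f}")
```

Returning the values instead of only printing them makes it easy to sort the settings by AUC afterwards.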

In [244]:
# Adapting tokenize_test function
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# define a function that accepts a vectorizer and calculates the testing set AUC
def tokenize_test(vect):
    
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated:
    # shape[1] is the number of columns in the DTM
    print('Features: ', X_train_dtm.shape[1])
    
    # use Multinomial Naive Bayes to predict the probability of rude service
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
    
    # print the AUC, computed from the true labels and the predicted probabilities
    print(metrics.roc_auc_score(y_test, y_pred_prob))

In [245]:
# Default parameters were already used in Task 5; here, skip the conversion to lowercase
vect = CountVectorizer(lowercase=False)
tokenize_test(vect) # keeping the original case very slightly decreased the AUC

Features:  8742
0.8406453663964394


In [246]:
#including 1 and 2 grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect) # adding bigrams lowered the AUC from 0.8426 to 0.8196

Features:  57936
0.8195994277539342


In [247]:
# removing stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect) # Removing the common english stopwords has slightly increased the AUC

Features:  7020
0.853520902877126


In [248]:
#removing doc-specific stop words (those appearing in more than 30% of documents)
vect = CountVectorizer(max_df=0.3)
tokenize_test(vect) #this has marginally increased AUC

Features:  7269
0.8523764107455094


In [249]:
# Only keeping terms that appear in at least 3 documents
vect = CountVectorizer(min_df=3)
tokenize_test(vect) # AUC barely changed (0.8430 vs 0.8426), with far fewer features

Features:  2449
0.8430456207280242


In [250]:
#Combination of factors
vect = CountVectorizer(stop_words = 'english', max_df=0.3, min_df = 6, lowercase= False, ngram_range=(1,2))
tokenize_test(vect) # this combination beats each single change above

Features:  1651
0.8558416785884597


In [251]:
#Combination of best performers
vect = CountVectorizer(stop_words = 'english', max_df=0.3, min_df = 2, lowercase= False)
tokenize_test(vect) # highest AUC of all the settings tried

Features:  3745
0.8583849944364966


## Task 7 (Optional)

The **city** column might be predictive of the response, but we are not currently using it as a feature. Let's see whether you can increase the AUC by adding it to the model:

1. Create a new DataFrame column, **review_city**, that concatenates the **review** text with the **city** text. One easy way to combine string columns in pandas is by using the [`Series.str.cat()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.cat.html) method. Make sure to use the **space character** as a separator, as well as replacing **null city values** with a reasonable string value (such as 'na').
2. Redefine X as the **review_city** column, and re-split X and y into training and testing sets.
3. When you run **`tokenize_test()`**, CountVectorizer will simply treat the city as an extra word in the review, and thus it will automatically be included in the model! Check to see whether it increased or decreased the AUC of your **model**.
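
The concatenation in step 1 can be tried on two invented rows first. This sketch assumes the other Series shares the same index, and uses `na_rep` to fill in the missing city:

```python
import pandas as pd

# Hypothetical review and city values; the second city is missing
review = pd.Series(["rude staff", "cold fries"])
city = pd.Series(["Atlanta", None])

# Join the two columns with a space, replacing the null city with 'na'
review_city = review.str.cat(city, sep=" ", na_rep="na")
print(review_city.tolist())
```

Note that CountVectorizer's default tokenization would split a multi-word city such as "Las Vegas" into two separate tokens.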

In [252]:
# 1. Creating review_city column
mcd['review_city'] = mcd['review'].str.cat(mcd['city'], join = 'outer', sep = ' ', na_rep='na')
mcd['review_city'][0] # looks like the text carried over correctly

"I'm not a huge mcds lover, but I've been to better ones. This is by far the worst one I've ever been too! It's filthy inside and if you get drive through they completely screw up your order every time! The staff is terribly unfriendly and nobody seems to care. Atlanta"

In [253]:
# 2. Redefining X with the review_city  and creating new training and testing sets
X = mcd.review_city
y = mcd.rude

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape)
print(X_test.shape) #split looks right

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

(1103,)
(368,)


In [254]:
# using tokenize_test to see if the AUC has increased compared to the default of 0.8426
vect = CountVectorizer()
tokenize_test(vect) # AUC rose by only 0.0002; the city column added just 3 new terms

Features:  7303
0.8428071848672706


## Task 8 (Optional)

New comments have been submitted to the McDonald's website, and you need to **score them with the likelihood** that they are referring to rude service.

1. Before making predictions on out-of-sample data, it is important to re-train your model on all relevant data using the tuning parameters and preprocessing steps that produced the best AUC above.
    - In other words, X should be defined using either **all rows** or **only those rows with a confidence_mean of at least 0.75**, whichever produced a better AUC above.
    - X should refer to either the **review column** or the **review_city column**, whichever produced a better AUC above.
    - CountVectorizer should be instantiated with the **tuning parameters** that produced the best AUC above.
    - **`train_test_split()`** should not be used during this process.
2. Build a document-term matrix (from X) called **X_dtm**, and examine its shape.
3. Read the new comments stored in **`mcdonalds_new.csv`** into a DataFrame called **new_comments**, and examine it.
4. If your model uses a **review_city** column, create that column in the new_comments DataFrame. (Otherwise, skip this step.)
5. Build a document-term matrix (from the **new_comments** DataFrame) called **new_dtm**, and examine its shape.
6. Train your best model (Naive Bayes) using **X_dtm** and **y**.
7. Predict the "rude probability" for each comment in **new_dtm**, and store the probabilities in an object called **new_pred_prob**.
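
The overall flow of these steps (fit on all labeled text, then score unseen text) can be sketched on invented data. This version swaps in scikit-learn's `make_pipeline` as a compact alternative to building the matrices by hand; the corpus and the names `all_text`, `new_text` and `ranked` are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented labeled corpus; 1 = rude service, 0 = anything else
all_text = ["the cashier was rude", "rude staff yelled at me",
            "the fries were cold", "my order was missing a burger"]
all_y = [1, 1, 0, 0]

# Invented unseen comments to score
new_text = ["cold coffee and stale bun", "the manager was rude to everyone"]

# The pipeline fits the vectorizer and the classifier on all labeled text
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(all_text, all_y)

# Column 1 of predict_proba is the probability of the "rude" class
new_scores = model.predict_proba(new_text)[:, 1]

# Rank the new comments from most to least likely to be about rude service
ranked = sorted(zip(new_scores, new_text), reverse=True)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

A pipeline applies the fitted vocabulary to new text automatically, which removes the risk of transforming the wrong object.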


In [255]:
# 1. Re-train on every labeled row (no train/test split here).
#    X and y must keep the same row order so each review stays paired with its label.
#    review_city edged out review with the default vectorizer (0.8428 vs 0.8426),
#    and these CountVectorizer settings gave the best AUC in Task 6.
X = mcd.review_city
y = mcd.rude

vect = CountVectorizer(stop_words='english', max_df=0.3, min_df=2, lowercase=False)

In [256]:
# 2. Build a new dtm from all of X
X_dtm = vect.fit_transform(X)
X_dtm.shape # 1471 rows, one per labeled comment

In [257]:
# 3. Read in new comments
new_comments = pd.read_csv('/content/mcdonalds_new.csv')

new_comments.head()
new_comments['review'][0]
new_comments.shape

(10, 2)

In [258]:
# 4. Create review_city column
new_comments['review_city'] = new_comments['review'].str.cat(new_comments['city'], join = 'outer', sep = ' ', na_rep='na')
new_comments['review_city'][0] # looks like the text carried over correctly

"Went through the drive through and ordered a #10 (cripsy sweet chili chicken wrap) without fries- the lady couldn't understand that I did not want fries and charged me for them anyways. I got the wrong order- a chicken sandwich and a large fries- my boyfriend took it back inside to get the correct order. The gentleman that ordered the chicken sandwich was standing there as well and she took the bag from my bf- glanced at the insides and handed it to the man without even offering to replace. I mean with all the scares about viruses going around... ugh DISGUSTING SERVICE. Then when she gave him the correct order my wrap not only had the sweet chili sauce on it, but the nasty (just not my first choice) ranch dressing on it!!!! I mean seriously... how lazy can you get!!!! I worked at McDonalds in Texas when I was 17 for about 8 months and I guess I was spoiled with good management. This was absolutely ridiculous. I was beyond disappointed. Las Vegas"

In [259]:
# 5. Creating new_dtm from the text column.
#    Passing the whole DataFrame would make CountVectorizer iterate over the column names instead of the comments.
new_dtm = vect.transform(new_comments['review_city'])
new_dtm.shape # expect 10 rows, one per new comment

In [260]:
# 6. Training Naive Bayes on the full document-term matrix and its labels
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_dtm, y)

In [None]:
# 7. Predict Prob

# keep column 1 of predict_proba: the probability of the "rude" class (label 1)
new_pred_prob = nb.predict_proba(new_dtm)[:, 1]

new_pred_prob
