# Homework with McDonald's sentiment data

## Imaginary problem statement

McDonald's receives **thousands of customer comments** on their website per day, and many of them are negative. Their corporate employees don't have time to read every single comment, but they do want to read a subset of comments that they are most interested in. In particular, the media has recently portrayed their employees as being rude, and so they want to review comments about **rude service**.

McDonald's has hired you to develop a system that ranks each comment by the **likelihood that it is referring to rude service**. They will use your system to build a "rudeness dashboard" for their corporate employees, so that employees can spend a few minutes each day examining the **most relevant recent comments**.

## Description of the data

Before hiring you, McDonald's used the [CrowdFlower platform](http://www.crowdflower.com/data-for-everyone) to pay humans to **hand-annotate** about 1500 comments with the **type of complaint**. The complaint types are listed below, with the encoding used in the data listed in parentheses:

- Bad Food (BadFood)
- Bad Neighborhood (ScaryMcDs)
- Cost (Cost)
- Dirty Location (Filthy)
- Missing Item (MissingFood)
- Problem with Order (OrderProblem)
- Rude Service (RudeService)
- Slow Service (SlowService)
- None of the above (na)

## Task 1

Read **`mcdonalds.csv`** into a pandas DataFrame and examine it. (It can be found in the **`data`** directory of the course repository.)

- The **policies_violated** column lists the type of complaint. If there is more than one type, the types are separated by newline characters.
- The **policies_violated:confidence** column lists CrowdFlower's confidence in the judgments of its human annotators for that row (higher is better).
- The **city** column is the McDonald's location.
- The **review** column is the actual text comment.
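Because a single row can carry several complaint types, it can help to count each type individually. The following is a minimal sketch on made-up rows (the `labels` Series is illustrative, not read from `mcdonalds.csv`), and it assumes pandas 0.25 or later for `Series.explode()`:

```python
import pandas as pd

# hypothetical stand-in for the policies_violated column
labels = pd.Series(['RudeService\nOrderProblem\nFilthy', 'RudeService', 'na'])

# split each cell on newlines, then give every complaint type its own row
type_counts = labels.str.split('\n').explode().value_counts()
print(type_counts)
```

The same two calls should work on `df.policies_violated` once its null rows have been dropped.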

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:

path = '../data/mcdonalds.csv'
df = pd.read_csv(path)

In [3]:
df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10
0,679455653,False,finalized,3,2/21/15 0:36,RudeService\nOrderProblem\nFilthy,1.0\n0.6667\n0.6667,Atlanta,,"I'm not a huge mcds lover, but I've been to be...",
1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,Terrible customer service. ŒæI came in at 9:30...,
2,679455655,False,finalized,3,2/21/15 0:26,SlowService\nOrderProblem,1.0\n1.0,Atlanta,,"First they ""lost"" my order, actually they gave...",
3,679455656,False,finalized,3,2/21/15 0:27,na,0.6667,Atlanta,,I see I'm not the only one giving 1 star. Only...,
4,679455657,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,"Well, it's McDonald's, so you know what the fo...",


In [4]:
df.shape

(1525, 11)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 11 columns):
_unit_id                        1525 non-null int64
_golden                         1525 non-null bool
_unit_state                     1525 non-null object
_trusted_judgments              1525 non-null int64
_last_judgment_at               1525 non-null object
policies_violated               1471 non-null object
policies_violated:confidence    1471 non-null object
city                            1438 non-null object
policies_violated_gold          0 non-null float64
review                          1525 non-null object
Unnamed: 10                     0 non-null float64
dtypes: bool(1), float64(2), int64(2), object(6)
memory usage: 120.8+ KB


In [6]:
df.dtypes

_unit_id                          int64
_golden                            bool
_unit_state                      object
_trusted_judgments                int64
_last_judgment_at                object
policies_violated                object
policies_violated:confidence     object
city                             object
policies_violated_gold          float64
review                           object
Unnamed: 10                     float64
dtype: object

In [7]:
df.review[0]

"I'm not a huge mcds lover, but I've been to better ones. This is by far the worst one I've ever been too! It's filthy inside and if you get drive through they completely screw up your order every time! The staff is terribly unfriendly and nobody seems to care."

## Task 2

Remove any rows from the DataFrame in which the **policies_violated** column has a **null value**. Check the shape of the DataFrame before and after to confirm that you only removed about 50 rows.

- **Note:** Null values are also known as "missing values", and are encoded in pandas with the special value "NaN". This is distinct from the "na" encoding used by CrowdFlower to denote "None of the above". Rows that contain "na" should **not** be removed.
- **Hint:** [How do I handle missing values in pandas?](https://www.youtube.com/watch?v=fCMrO_VzeL8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=16) explains how to do this.
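To illustrate the difference between a true missing value and the "na" label, here is a small sketch on invented data (the `toy` DataFrame and its contents are assumptions made for the example):

```python
import numpy as np
import pandas as pd

# hypothetical toy data: one "na" label and one true missing value
toy = pd.DataFrame({
    'policies_violated': ['RudeService', 'na', np.nan],
    'review': ['Rude cashier.', 'Nothing special.', 'No label recorded.'],
})

# dropna() removes only the NaN row; the string "na" is ordinary text
cleaned = toy.dropna(subset=['policies_violated'])
print(toy.shape, cleaned.shape)
```

Restricting `dropna()` with `subset` matters here, because other columns in the real data are entirely null.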

In [8]:
# to check missing values
df.isnull().sum()

_unit_id                           0
_golden                            0
_unit_state                        0
_trusted_judgments                 0
_last_judgment_at                  0
policies_violated                 54
policies_violated:confidence      54
city                              87
policies_violated_gold          1525
review                             0
Unnamed: 10                     1525
dtype: int64

In [9]:
# explore what the policies_violated column contains

df.policies_violated.dtype

dtype('O')

In [10]:
df.policies_violated[:5]

0    RudeService\nOrderProblem\nFilthy
1                          RudeService
2            SlowService\nOrderProblem
3                                   na
4                          RudeService
Name: policies_violated, dtype: object

In [11]:
df[df.policies_violated.isnull()]

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10
37,679455690,False,finalized,3,2/21/15 1:48,,,Atlanta,,Stopped here on the way downtown this morning ...,
60,679455713,False,finalized,3,2/21/15 0:20,,,Atlanta,,Went to order a meal via drive through and the...,
63,679455716,False,finalized,3,2/21/15 0:38,,,Atlanta,,Typical McDonald's restaurant located at Power...,
66,679455719,False,finalized,3,2/21/15 0:28,,,Atlanta,,Just sat at drive thru several minutes no one ...,
161,679455816,False,finalized,3,2/21/15 0:21,,,Las Vegas,,"Nice McDonald's inside, but they have a remote...",
251,679455906,False,finalized,3,2/21/15 0:21,,,Las Vegas,,They always have maintenence whenever I go. I ...,
318,679455976,False,finalized,3,2/21/15 0:44,,,Las Vegas,,They are making this the worst Vegas experienc...,
332,679455991,False,finalized,3,2/21/15 0:38,,,Las Vegas,,Was told at 5am that burgers arent made until ...,
358,679456017,False,finalized,3,2/21/15 0:24,,,Las Vegas,,I left the Hilton late last night and I was re...,
383,679456042,False,finalized,3,2/21/15 1:18,,,Las Vegas,,Just left here. Screwed up my order 3 times. W...,


In [12]:
df.shape

(1525, 11)

In [13]:
# drop a row if any of its values are missing; every row has at least one null, so no rows remain
df.dropna(how='any').shape

(0, 11)

In [14]:
# drop a row only if all of its values are missing
df.dropna(how='all').shape

(1525, 11)

In [15]:
# drop a row only if policies_violated is missing
df.dropna(subset=['policies_violated'], how='any').shape

(1471, 11)

In [16]:
df.dropna(subset=['policies_violated'], how='any', inplace=True)

In [17]:
#df['policies_violated']= df.policies_violated.fillna("Other")

In [18]:
df.isnull().sum()

_unit_id                           0
_golden                            0
_unit_state                        0
_trusted_judgments                 0
_last_judgment_at                  0
policies_violated                  0
policies_violated:confidence       0
city                              86
policies_violated_gold          1471
review                             0
Unnamed: 10                     1471
dtype: int64

In [19]:
df.shape

(1471, 11)

## Task 3

Add a new column to the DataFrame called **"rude"** that is 1 if the **policies_violated** column contains the text "RudeService", and 0 if the **policies_violated** column does not contain "RudeService". The "rude" column is going to be your response variable, so check how many zeros and ones it contains.

- **Hint:** [How do I use string methods in pandas?](https://www.youtube.com/watch?v=bofaC0IckHo&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=12) shows how to search for the presence of a substring, and [How do I change the data type of a pandas Series?](https://www.youtube.com/watch?v=V0AWyzVMf54&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=13) shows how to convert the boolean results (True/False) to integers (1/0).

In [20]:
df.policies_violated.value_counts()

na                                   295
RudeService                          177
SlowService                          127
OrderProblem                         116
BadFood                              101
                                    ... 
OrderProblem\nRudeService\nFilthy      1
BadFood\nRudeService\nFilthy           1
Cost\nMissingFood                      1
Cost\nFilthy                           1
RudeService\nBadFood\nFilthy           1
Name: policies_violated, Length: 146, dtype: int64

In [21]:
len(df[df.policies_violated=="RudeService"])

177

In [22]:
df.policies_violated=="RudeService"

0       False
1        True
2       False
3       False
4        True
        ...  
1520    False
1521    False
1522    False
1523    False
1524    False
Name: policies_violated, Length: 1471, dtype: bool

In [23]:
df.policies_violated.dtype

dtype('O')

In [24]:
df[df.policies_violated.str.contains('RudeService')]

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10
0,679455653,False,finalized,3,2/21/15 0:36,RudeService\nOrderProblem\nFilthy,1.0\n0.6667\n0.6667,Atlanta,,"I'm not a huge mcds lover, but I've been to be...",
1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,Terrible customer service. ŒæI came in at 9:30...,
4,679455657,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,"Well, it's McDonald's, so you know what the fo...",
7,679455660,False,finalized,3,2/21/15 0:15,RudeService,0.6801,Atlanta,,One Star and I'm beng kind. I blame management...,
8,679455661,False,finalized,3,2/21/15 0:29,SlowService\nRudeService\nMissingFood,1.0\n1.0\n0.6667,Atlanta,,Never been upset about any fast food drive thr...,
...,...,...,...,...,...,...,...,...,...,...,...
1509,679484693,False,finalized,3,2/21/15 0:20,BadFood\nRudeService,1.0\n0.6909,New York,,Really rude workers and customer service was j...,
1511,679492436,False,finalized,3,2/21/15 0:24,RudeService\nOrderProblem,1.0\n0.6585,Las Vegas,,Worst service ever!!!! There was so much trash...,
1512,679494190,False,finalized,3,2/21/15 0:09,OrderProblem\nRudeService,0.7056\n0.6377,Chicago,,Normally I don't review a chain unless somethi...,
1517,679497650,False,finalized,3,2/21/15 0:22,OrderProblem\nRudeService,1.0\n0.6786,Houston,,The drive thru got our order wrong AGAIN. I ca...,


Exactly 177 rows have "RudeService" as their only label, but other rows list it in combination with other complaint types, so we need a substring match rather than an equality test.
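The gap between exact matching and substring matching can be seen on a small made-up Series (the `labels` values below are illustrative only):

```python
import pandas as pd

# hypothetical labels: one pure RudeService row and one combined row
labels = pd.Series(['RudeService', 'SlowService\nRudeService', 'BadFood', 'na'])

# equality only finds rows whose entire label is "RudeService"
exact = (labels == 'RudeService').sum()

# str.contains() also finds rows that list it among other complaint types
contains = labels.str.contains('RudeService').sum()

# converting the boolean result to integers gives a 0/1 response variable
rude = labels.str.contains('RudeService').astype(int)
print(exact, contains, rude.tolist())
```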

In [25]:
df['rude'] = df.policies_violated.str.contains('RudeService').astype(int)

In [26]:
df[["rude","policies_violated"]].head(10)

Unnamed: 0,rude,policies_violated
0,1,RudeService\nOrderProblem\nFilthy
1,1,RudeService
2,0,SlowService\nOrderProblem
3,0,na
4,1,RudeService
5,0,BadFood\nSlowService
6,0,SlowService\nScaryMcDs
7,1,RudeService
8,1,SlowService\nRudeService\nMissingFood
9,0,na


In [27]:
df.rude.dtype

dtype('int64')

## Task 4

1. Define X (the **review** column) and y (the **rude** column).
2. Split X and y into training and testing sets (using the parameter **`random_state=1`**).
3. Use CountVectorizer (with the **default parameters**) to create document-term matrices from X_train and X_test.
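As a rough illustration of step 3, the sketch below fits a vectorizer on a few invented training documents and then transforms invented test documents. The document lists are assumptions for the example; the point is that the test matrix has the same columns as the training matrix, and words seen only in the test set are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical documents, not taken from the McDonald's data
train_docs = ['rude staff', 'cold fries', 'rude manager']
test_docs = ['rude cashier', 'cold burger']

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_docs)  # learns the vocabulary from the training data
test_dtm = vect.transform(test_docs)        # reuses that vocabulary without refitting

print(sorted(vect.vocabulary_))
print(train_dtm.shape, test_dtm.shape)
```

Calling `fit_transform()` on the test set instead would learn a different vocabulary, and the two matrices would no longer line up column by column.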

In [28]:
X=df["review"]
y=df["rude"]

In [29]:
print(X.shape)
print(y.shape)

(1471,)
(1471,)


In [30]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)

(1103,)
(368,)


In [31]:
from sklearn.feature_extraction.text import CountVectorizer
# instantiate the vectorizer
vect = CountVectorizer()

In [32]:
# learn training data vocabulary, then use it to create a document-term matrix
X_train_dtm=vect.fit_transform(X_train)

In [33]:
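# convert to a DataFrame for inspection (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there, here and below)
X_train_dtm=pd.DataFrame(X_train_dtm.toarray(), columns=vect.get_feature_names())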
X_train_dtm.shape

(1103, 7300)

In [34]:
X_train_dtm.head()

Unnamed: 0,00,00am,01,0200,03pm,04,04am,05,05i,0600,...,œæyay,œæyeah,œæyears,œæyelp,œæyep,œæyes,œæyesterday,œæyet,œæyou,œæyour
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [35]:
X_test_dtm = vect.transform(X_test)
X_test_dtm

<368x7300 sparse matrix of type '<class 'numpy.int64'>'
	with 22035 stored elements in Compressed Sparse Row format>

In [36]:
X_test_dtm=pd.DataFrame(X_test_dtm.toarray(), columns=vect.get_feature_names())

In [37]:
print(X_train_dtm.shape)
print(X_test_dtm.shape)

(1103, 7300)
(368, 7300)


## Task 5

Fit a Multinomial Naive Bayes model to the training set, calculate the **predicted probabilities** (not the class predictions) for the testing set, and then calculate the **AUC**. Repeat this task using a logistic regression model to see which of the two models achieves a better AUC.

- **Note:** Because McDonald's only cares about ranking the comments by the likelihood that they refer to rude service, **classification accuracy** is not the relevant evaluation metric. **Area Under the Curve (AUC)** is a more useful evaluation metric for this scenario, since it measures the ability of the classifier to assign higher predicted probabilities to positive instances than to negative instances.
- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to calculate predicted probabilities and AUC, and my [blog post and video](http://www.dataschool.io/roc-curves-and-auc-explained/) explain AUC in-depth.
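To make the ranking interpretation of AUC concrete, here is a small sketch with invented labels and scores. It compares `roc_auc_score` with a hand count of the positive/negative pairs in which the positive instance receives the higher score (the two agree when there are no tied scores, as in this example).

```python
from sklearn.metrics import roc_auc_score

# hypothetical true labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# count the (positive, negative) pairs that are ranked correctly
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
pairs_correct = sum(p > n for p in pos for n in neg)
pairwise = pairs_correct / float(len(pos) * len(neg))

print(auc, pairwise)
```

Three of the four pairs are ordered correctly, so both numbers come out to 0.75.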

In [38]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [39]:
# train the model using X_train_dtm 
%time nb.fit(X_train_dtm, y_train)

CPU times: user 75.2 ms, sys: 23.6 ms, total: 98.7 ms
Wall time: 58.3 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [40]:
# calculate predicted probabilities of class 1 (rude comments), not class predictions
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob[:5]

array([9.99234643e-01, 2.59242971e-01, 8.57868677e-01, 2.67695702e-06,
       2.68548521e-01])

In [42]:
# calculate AUC
from sklearn import metrics
metrics.roc_auc_score(y_test, y_pred_prob)

0.8426005404546177

### Repeat the process with logistic regression

In [43]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [44]:
%time logreg.fit(X_train_dtm, y_train)

CPU times: user 139 ms, sys: 5.2 ms, total: 144 ms
Wall time: 57.7 ms




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [45]:
y_pred_prob_logreg = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob_logreg[:5]

array([0.62535679, 0.10575876, 0.79011902, 0.03743445, 0.46656837])

In [46]:
metrics.roc_auc_score(y_test, y_pred_prob_logreg)

0.8233985058019392

**So Naive Bayes achieves a better testing AUC (0.843) than logistic regression (0.823).**

## Task 6

Using either Naive Bayes or logistic regression (whichever one had a better AUC in the previous step), try **tuning CountVectorizer** using some of the techniques we learned in class. Check the testing set **AUC** after each change, and find the set of parameters that increases AUC the most.

- **Hint:** It is highly recommended that you adapt the **`tokenize_test()`** function from class for this purpose, since it will allow you to iterate quickly through different sets of parameters.
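One way to iterate through parameter sets is to have the helper return its results and loop over several vectorizers. The sketch below is an assumption-laden variant of that idea, not the function from class: the helper name `auc_for`, the candidate settings, and the tiny invented data set are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

def auc_for(vect, X_train, y_train, X_test, y_test):
    """Fit vect and Naive Bayes on the training data; return (feature count, test AUC)."""
    train_dtm = vect.fit_transform(X_train)
    test_dtm = vect.transform(X_test)
    nb = MultinomialNB().fit(train_dtm, y_train)
    prob = nb.predict_proba(test_dtm)[:, 1]
    return train_dtm.shape[1], roc_auc_score(y_test, prob)

# hypothetical toy data (1 = rude, 0 = not rude)
toy_train = ['rude staff yelled at me', 'the cashier was rude and mean',
             'fries were cold', 'slow drive thru line',
             'friendly staff and fast service', 'manager was rude']
toy_y_train = [1, 1, 0, 0, 0, 1]
toy_test = ['rude cashier ignored me', 'the manager was mean',
            'cold fries again', 'fast and friendly service']
toy_y_test = [1, 1, 0, 0]

# candidate vectorizers to compare
candidates = {
    'default': CountVectorizer(),
    'stop_words': CountVectorizer(stop_words='english'),
    'bigrams': CountVectorizer(ngram_range=(1, 2)),
}

results = {name: auc_for(v, toy_train, toy_y_train, toy_test, toy_y_test)
           for name, v in candidates.items()}
for name, (n_features, auc) in results.items():
    print(name, n_features, round(auc, 3))
```

On data this small the AUC values carry no meaning; the loop structure is the part intended to transfer to the real training and testing sets.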

In [47]:
def tokenize_test(vect):
    
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features:', X_train_dtm.shape[1])
    
    # use Multinomial Naive Bayes to calculate predicted probabilities
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
    
    # print the AUC
    print('AUC:', metrics.roc_auc_score(y_test, y_pred_prob))

In [48]:
vect = CountVectorizer()
tokenize_test(vect)

Features: 7300
AUC: 0.8426005404546177


In [49]:
# tune CountVectorizer to increase the AUC
vect = CountVectorizer(stop_words='english', max_df=0.3, min_df=4)
tokenize_test(vect)

Features: 1732
AUC: 0.8621522810364012


## Task 7 (Challenge)

The **city** column might be predictive of the response, but we are not currently using it as a feature. Let's see whether we can increase the AUC by adding it to the model:

1. Create a new DataFrame column, **review_city**, that concatenates the **review** text with the **city** text. One easy way to combine string columns in pandas is by using the [`Series.str.cat()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.cat.html) method. Make sure to use the **space character** as a separator, as well as replacing **null city values** with a reasonable string value (such as 'na').
2. Redefine X as the **review_city** column, and re-split X and y into training and testing sets.
3. When you run **`tokenize_test()`**, CountVectorizer will simply treat the city as an extra word in the review, and thus it will automatically be included in the model! Check to see whether it increased or decreased the AUC of your **best model**.
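Step 1 can be sketched on a two-row toy DataFrame (the rows are invented for illustration). `sep` supplies the space between the two strings, and `na_rep` supplies the text used where the city is missing:

```python
import numpy as np
import pandas as pd

# hypothetical rows: one with a city and one without
toy = pd.DataFrame({
    'review': ['Rude staff.', 'Great fries.'],
    'city': ['Atlanta', np.nan],
})

# concatenate review and city, filling the missing city with 'na'
toy['review_city'] = toy.review.str.cat(toy.city, sep=' ', na_rep='na')
print(toy.review_city.tolist())
```

Without `na_rep`, a missing city would make the whole concatenated value missing as well.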

In [50]:
df.city.value_counts()

Las Vegas      390
Chicago        211
Los Angeles    162
New York       158
Atlanta        126
Houston        102
Portland        95
Dallas          74
Cleveland       67
Name: city, dtype: int64

In [51]:
len(df.city.value_counts())

9

In [52]:
df.city.isnull().sum()

86

In [53]:
# use str.cat() to combine the review with the city; sep adds a space and na_rep fills in missing cities
df['review_city'] = df.review.str.cat(df.city, sep=' ', na_rep='Other-City')

In [54]:
df.review_city[0]

"I'm not a huge mcds lover, but I've been to better ones. This is by far the worst one I've ever been too! It's filthy inside and if you get drive through they completely screw up your order every time! The staff is terribly unfriendly and nobody seems to care. Atlanta"

In [55]:
# redefine X and y
X = df.review_city
y = df.rude

In [56]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [57]:
vect = CountVectorizer(stop_words='english', max_df=0.3, min_df=4)
tokenize_test(vect)

Features: 1738
AUC: 0.8646002225401366


Adding the city raised the AUC slightly, from 0.8622 to 0.8646, with 1738 features instead of 1732.

## Task 8 (Challenge)

The **policies_violated:confidence** column may be useful, since it essentially represents a measurement of the training data quality. Let's see whether we can improve the AUC by only training the model using higher-quality rows!

To accomplish this, your first sub-task is to **calculate the mean confidence score for each row**, and then store those mean scores in a new column. For example, the confidence scores for the first row are `1.0\r\n0.6667\r\n0.6667`, so you should calculate a mean of `0.7778`. Here are the suggested steps:

1. Using the [`Series.str.split()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html) method, convert the **policies_violated:confidence** column into lists of one or more "confidence scores". Save the results as a new DataFrame column called **confidence_list**.
2. Define a function that calculates the mean of a list of numbers, and pass that function to the [`Series.apply()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html) method of the **confidence_list** column. That will calculate the mean confidence score for each row. Save those scores in a new DataFrame column called **confidence_mean**.
    - **Hint:** [How do I apply a function to a pandas Series or DataFrame?](https://www.youtube.com/watch?v=P_q0tkYqvSk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=30) explains how to use the `Series.apply()` method.

Your second sub-task is to **remove lower-quality rows from the training set**, and then repeat the model building and evaluation process. Here are the suggested steps:

1. Remove all rows from X_train and y_train that have a **confidence_mean lower than 0.75**. Check their shapes before and after to confirm that you removed about 300 rows.
2. Use the **`tokenize_test()`** function to check whether filtering the training data increased or decreased the AUC of your **best model**.
    - **Hint:** Even though X_train and y_train are separate from the DataFrame (`df` in this notebook), they can still be filtered using a boolean Series generated from it, because all three objects share the same index.
    - **Note:** It's important that we don't remove any rows from the testing set (X_test and y_test), because the testing set should be representative of the real-world data we will encounter in the future (which will contain both high-quality and low-quality rows).
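Both sub-tasks can be sketched on a small invented DataFrame. The column names `confidence` and `confidence_mean`, the row values, and the choice of which rows count as "training" rows are assumptions for the example; the 0.75 cutoff is the one suggested above.

```python
import numpy as np
import pandas as pd

# hypothetical reviews with newline-separated confidence scores
toy = pd.DataFrame({
    'review': ['a', 'b', 'c', 'd'],
    'confidence': ['1.0\n0.6667\n0.6667', '1', '0.6667', '1.0\n1.0'],
})

# split on whitespace (which includes newlines), then average the scores in each list
toy['confidence_mean'] = toy.confidence.str.split().apply(
    lambda scores: np.mean([float(s) for s in scores]))

# pretend rows 0, 2 and 3 ended up in the training set
X_train = toy.review.loc[[0, 2, 3]]

# build the boolean mask from the full DataFrame, aligned on the shared index
keep = toy.confidence_mean.loc[X_train.index] >= 0.75
X_train_filtered = X_train[keep]
print(X_train_filtered.index.tolist())
```

Selecting the mask with `.loc[X_train.index]` makes the index alignment explicit rather than relying on pandas to reindex the mask.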

In [58]:
df['policies_violated:confidence'][:5]

0    1.0\n0.6667\n0.6667
1                      1
2               1.0\n1.0
3                 0.6667
4                      1
Name: policies_violated:confidence, dtype: object

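Each row stores its scores as a single newline-separated string, such as `1.0\n0.6667\n0.6667`, so we need to split the string into individual scores.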

In [59]:
# split the column into lists of one or more confidence scores
df['confidence_list'] = df['policies_violated:confidence'].str.split()
df.confidence_list.head()

0    [1.0, 0.6667, 0.6667]
1                      [1]
2               [1.0, 1.0]
3                 [0.6667]
4                      [1]
Name: confidence_list, dtype: object

### Calculate the mean confidence

In [60]:
import numpy as np

# define a function that accepts a list of strings and returns the mean
def mean_of_list(conf_list):
    
    # convert the list to a NumPy array of floats
    conf_array = np.array(conf_list, dtype=float)
    
    # return the mean of the array
    return np.mean(conf_array)

In [61]:
# calculate the mean confidence score for each row
df['confidence_mean'] = df.confidence_list.apply(mean_of_list)
df.confidence_mean.head()

0    0.7778
1    1.0000
2    1.0000
3    0.6667
4    1.0000
Name: confidence_mean, dtype: float64

### Remove all rows from X_train and y_train that have a confidence_mean lower than 0.85

In [62]:
# check the shapes of X_train and y_train before removing any rows
print(X_train.shape)
print(y_train.shape)

(1103,)
(1103,)


In [63]:
# keep only the rows of X_train and y_train with a confidence_mean of at least 0.85
# (a stricter cutoff than the 0.75 suggested in the task, so more rows are removed)
X_train = X_train[df.confidence_mean >= 0.85]
y_train = y_train[df.confidence_mean >= 0.85]
print(X_train.shape)
print(y_train.shape)

(643,)
(643,)


In [64]:
# check whether it increased or decreased the AUC of my best model
vect = CountVectorizer(stop_words='english', max_df=0.3, min_df=4)
tokenize_test(vect)

Features: 1120
AUC: 0.8295501510093786

Filtering the training set lowered the AUC from 0.8646 to 0.8296, so the remaining steps use all rows.


## Task 9 (Challenge)

New comments have been submitted to the McDonald's website, and you need to **score them with the likelihood** that they are referring to rude service.

1. Before making predictions on out-of-sample data, it is important to re-train your model on all relevant data using the tuning parameters and preprocessing steps that produced the best AUC above.
    - In other words, X should be defined using either **all rows** or **only those rows with a confidence_mean of at least 0.75**, whichever produced a better AUC above.
    - X should refer to either the **review column** or the **review_city column**, whichever produced a better AUC above.
    - CountVectorizer should be instantiated with the **tuning parameters** that produced the best AUC above.
    - **`train_test_split()`** should not be used during this process.
2. Build a document-term matrix (from X) called **X_dtm**, and examine its shape.
3. Read the new comments stored in **`mcdonalds_new.csv`** into a DataFrame called **new_comments**, and examine it.
4. If your model uses a **review_city** column, create that column in the new_comments DataFrame. (Otherwise, skip this step.)
5. Build a document-term matrix (from the **new_comments** DataFrame) called **new_dtm**, and examine its shape.
6. Train your best model (Naive Bayes or logistic regression) using **X_dtm** and **y**.
7. Predict the "rude probability" for each comment in **new_dtm**, and store the probabilities in an object called **new_pred_prob**.
8. Print the **full text** for each new comment alongside its **"rude probability"**. (You may need to [increase the max_colwidth](https://www.youtube.com/watch?v=yiO43TQ4xvc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=28) to see the full text.) Examine the results, and comment on how well you think the model performed!
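The steps above fit the vectorizer and the model separately. As an alternative sketch (a swapped-in approach, not what the task prescribes), scikit-learn's `make_pipeline` can bundle the two so that new comments are vectorized with the training vocabulary automatically. The training texts, labels, and new comments below are invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical training data (1 = rude, 0 = not rude)
train_text = ['rude staff yelled at me', 'the cashier was rude and mean',
              'fries were cold', 'slow drive thru line',
              'friendly staff and fast service', 'manager was rude']
train_rude = [1, 1, 0, 0, 0, 1]

# hypothetical out-of-sample comments
new_text = ['rude cashier', 'cold fries']

# train on all available data, with no train/test split
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_text, train_rude)

# predicted probability of class 1 for each new comment
scores = model.predict_proba(new_text)[:, 1]

# rank the comments from most to least likely to be about rude service
ranked = pd.DataFrame({'comment': new_text, 'rude_probability': scores})
ranked = ranked.sort_values('rude_probability', ascending=False)
print(ranked)
```

In the real notebook the `CountVectorizer` inside the pipeline would carry the tuned parameters found in Task 6.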

In [65]:
# define X and y using the data that produced the best AUC above
X = df.review_city
y = df.rude

In [66]:
# instantiate CountVectorizer with the tuning parameters that produced the best AUC above
vect = CountVectorizer(stop_words='english', max_df=0.3, min_df=4)

In [67]:
# fit and transform X into X_dtm
X_dtm = vect.fit_transform(X)
X_dtm.shape

(1471, 2103)

### Reading out of sample new comments

In [68]:
# read mcdonalds_new.csv into new_comments
path = '../data/mcdonalds_new.csv'
new_comments = pd.read_csv(path)

In [69]:
new_comments

Unnamed: 0,city,review
0,Las Vegas,Went through the drive through and ordered a #...
1,Chicago,Phenomenal experience. Efficient and friendly ...
2,Los Angeles,Ghetto lady helped me at the drive thru. Very ...
3,New York,Close to my workplace. It was well manged befo...
4,Portland,I've made at least 3 visits to this particular...
5,Houston,Why did I revisited this McDonald's again. I...
6,Atlanta,This specific McDonald's is the bar I hold all...
7,Dallas,My friend and I stopped in to get a late night...
8,Cleveland,Friendly people but completely unable to deliv...
9,,"Having visited many McDonald's over the years,..."


In [70]:
# concatenate review and city, separated by a space, replacing nulls with 'na'
new_comments['review_city'] = new_comments.review.str.cat(new_comments.city, sep=' ', na_rep='na')

In [71]:
# transform new_comments.review_city into new_dtm
new_dtm = vect.transform(new_comments.review_city)
new_dtm.shape

(10, 2103)

In [72]:
# train a MultinomialNB model
nb = MultinomialNB()
nb.fit(X_dtm, y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [73]:
# calculate the predicted probability of "rude service" for each new comment
new_pred_prob = nb.predict_proba(new_dtm)[:, 1]
new_pred_prob

array([0.3355455 , 0.00815264, 0.95704371, 0.03903489, 0.39969738,
       0.05922779, 0.27144306, 0.99999236, 0.09260042, 0.02660779])

In [74]:
# print the comment text alongside the predicted "rude probability"
# note: use sort() instead of sort_values() prior to pandas 0.17
pd.DataFrame({'comment':new_comments.review_city, 'rude_probability':new_pred_prob}).sort_values('rude_probability', ascending=False)

Unnamed: 0,comment,rude_probability
7,My friend and I stopped in to get a late night...,0.999992
2,Ghetto lady helped me at the drive thru. Very ...,0.957044
4,I've made at least 3 visits to this particular...,0.399697
0,Went through the drive through and ordered a #...,0.335545
6,This specific McDonald's is the bar I hold all...,0.271443
8,Friendly people but completely unable to deliv...,0.0926
5,Why did I revisited this McDonald's again. I...,0.059228
3,Close to my workplace. It was well manged befo...,0.039035
9,"Having visited many McDonald's over the years,...",0.026608
1,Phenomenal experience. Efficient and friendly ...,0.008153


In [75]:
# widen the column display
pd.set_option('display.max_colwidth', 1000)

In [76]:
pd.DataFrame({'comment':new_comments.review_city, 'rude_probability':new_pred_prob}).sort_values('rude_probability', ascending=False)

Unnamed: 0,comment,rude_probability
7,My friend and I stopped in to get a late night snack and we were refused service. The store claimed to be 24 hours and the manager was standing right there doing paper work but would not help us. The cashier was only concerned with doing things for the drive thru and said that the manager said he wasn't allowed to help us. We thought it was a joke at first but when realized it wasn't we said goodbye and they just let us leave. I work in a restaurant and this is by far the worst service I have ever seen. I know it was late and maybe they didn't want to be there but it was completely ridiculous. I think the manager should be fired. Dallas,0.999992
2,Ghetto lady helped me at the drive thru. Very rude and disrespectful to the co workers. Never coming back. Yuck! Los Angeles,0.957044
4,"I've made at least 3 visits to this particular location just because it's right next to my office building.. and all my experience have been consistently bad. There are a few helpers taking your orders throughout the drive-thru route and they are the worst. They rush you in placing an order and gets impatient once the order gets a tad bit complicated. Don't even bother changing your mind oh NO! They will glare at you and snap at you if you want to change something. I understand its FAST food, but I want my order placed right. Not going back if I can help it. Portland",0.399697
0,"Went through the drive through and ordered a #10 (cripsy sweet chili chicken wrap) without fries- the lady couldn't understand that I did not want fries and charged me for them anyways. I got the wrong order- a chicken sandwich and a large fries- my boyfriend took it back inside to get the correct order. The gentleman that ordered the chicken sandwich was standing there as well and she took the bag from my bf- glanced at the insides and handed it to the man without even offering to replace. I mean with all the scares about viruses going around... ugh DISGUSTING SERVICE. Then when she gave him the correct order my wrap not only had the sweet chili sauce on it, but the nasty (just not my first choice) ranch dressing on it!!!! I mean seriously... how lazy can you get!!!! I worked at McDonalds in Texas when I was 17 for about 8 months and I guess I was spoiled with good management. This was absolutely ridiculous. I was beyond disappointed. Las Vegas",0.335545
6,"This specific McDonald's is the bar I hold all other fast food joints to now. Been working in this area for 3 years now and gone to this location many times for drive-through pickup. Service is always fast, food comes out right, and the staff is extremely warm and polite. Atlanta",0.271443
8,"Friendly people but completely unable to deliver what was ordered at the drive through. Out of my last 6 orders they got it right 3 times. Incidentally, the billing was always correct - they just could not read the order and deliver. Very frustrating! Cleveland",0.0926
5,"Why did I revisited this McDonald's again. I needed to use the restroom facilities and the women's bathroom didn't have soap, the floor was wet, the bathroom stink, and the toilets were nasty. This McDonald's is very nasty. Houston",0.059228
3,"Close to my workplace. It was well manged before. Now it's OK. The parking can be tight sometimes. Like all McDonald's, prices are getting expensive. New York",0.039035
9,"Having visited many McDonald's over the years, I have to say that this one is the most efficient one ever! Even though it is still fast food, the service at the drive-thru is the best. They rarely make a mistake and I never see anyone parked in the drive-thru slots where they bring food out because they don't have it ready. So, if you like McDonald's fast food, it doesn't get any better than this. na",0.026608
1,"Phenomenal experience. Efficient and friendly staff. Clean restrooms, good, fast service and bilingual staff. One of the best restaurants in the chain. Chicago",0.008153


### END OF NOTEBOOK