# Midterm Assignment: McDonald's Sentiment Data Analysis

## Problem

McDonald’s receives thousands of consumer comments on its website every day, and many of them are negative. Its corporate employees do not have time to read every comment, but they do want to read the subset they are most interested in. In particular, articles about rude service by McDonald’s employees have recently surfaced on social media. To take appropriate action, the company would now like to review comments about **rude service**.

You are hired to develop a system that ranks each comment by the **likelihood that it is referring to rude service**. They will use this system to build a “rudeness dashboard” for their corporate employees, so that the employees can spend a few minutes each day examining the **most relevant recent comments**.


## Data

McDonald’s used the CrowdFlower platform to pay humans to hand-annotate approximately 1500 comments with the type of complaint. The list of complaint types can be found below, with the encoding used listed in parentheses: 
- Bad Food (BadFood)
- Bad Neighborhood (ScaryMcDs)
- Cost (Cost)
- Dirty Location (Filthy)
- Missing Item (MissingFood)
- Problem with Order (OrderProblem)
- Rude Service (RudeService)
- Slow Service (SlowService)
- None of the above (na) 

You will be asked to perform a series of tasks. Multiple-choice questions (MCQs) are interspersed among these tasks; for each one, select the best option as your answer.

In [2]:
# for Python 2: use print only as a function
from __future__ import print_function

## Task 1

Read **'mcdonalds.csv'** into a pandas DataFrame and examine it. (Instructions: mcdonalds.csv can be found in “IVLE Workbin > Midterm Assignment”) 

A description of the more important columns to get you started: 
- The **policies_violated** column lists the type of complaint. If there is more than one type, the types are separated by newline characters.
- The **policies_violated:confidence** column lists CrowdFlower's confidence in the judgments of its human annotators for that row (higher is better).
- The **city** column is the McDonald's location.
- The **review** column is the actual text comment.
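
As a small illustration of the two multi-valued columns, the sketch below pairs each complaint type with its confidence score for a single row. The helper name `pair_labels_with_confidence` is hypothetical, and the sample strings are taken from the first row of the `data.head()` output; it assumes both cells list their values in the same order.

```python
def pair_labels_with_confidence(labels, confidences):
    """Pair newline-separated complaint types with their confidence scores."""
    # str.split() with no argument splits on any whitespace, including "\n"
    label_list = labels.split()
    score_list = [float(s) for s in confidences.split()]
    return list(zip(label_list, score_list))


pairs = pair_labels_with_confidence(
    "RudeService\nOrderProblem\nFilthy",
    "1.0\n0.6667\n0.6667",
)
print(pairs)
```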

**Please answer Question 1 as in midterm.pdf.** 

In [234]:
import pandas as pd
import numpy as np

data = pd.read_csv("mcdonalds.csv")
data.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10
0,679455653,False,finalized,3,2/21/15 0:36,RudeService\nOrderProblem\nFilthy,1.0\n0.6667\n0.6667,Atlanta,,"I'm not a huge mcds lover, but I've been to be...",
1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,Terrible customer service. ŒæI came in at 9:30...,
2,679455655,False,finalized,3,2/21/15 0:26,SlowService\nOrderProblem,1.0\n1.0,Atlanta,,"First they ""lost"" my order, actually they gave...",
3,679455656,False,finalized,3,2/21/15 0:27,na,0.6667,Atlanta,,I see I'm not the only one giving 1 star. Only...,
4,679455657,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,"Well, it's McDonald's, so you know what the fo...",


In [235]:
print(data.loc[127].review)

Ok I'm waiting for like 10 minutes to place my order with the staff walking back & forth just looking at me like I'm crazy. And another 10 minutes or so before i got my food, This location use to be my stop in the mornings when I worked near here but they have fallen way off.


## Task 2

Remove any rows from the DataFrame in which the policies_violated column has a null value.
- **Note**: Null values are also known as “missing values”, and are encoded in pandas with the special value “NaN”. This is different from the “na” encoding used by CrowdFlower to denote “None of the above”. Rows that contain “na” should not be removed.
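
The difference between a missing value and the string “na” can be checked on a toy DataFrame. This is a minimal sketch: the three rows are invented for illustration and are not taken from mcdonalds.csv.

```python
import numpy as np
import pandas as pd

# toy data (invented): one real label, one "na" label, one missing label
toy = pd.DataFrame({
    "policies_violated": ["RudeService", "na", np.nan],
    "review": ["Rude cashier.", "Great fries.", "No label given."],
})

# dropna(subset=...) removes rows only where the listed column is NaN;
# the string "na" is an ordinary value and is kept
toy_clean = toy.dropna(subset=["policies_violated"])
print(toy_clean["policies_violated"].tolist())
```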

**Please answer Questions 2 and 3 as in midterm.pdf.**

In [236]:
data['city'].isnull().sum()

87

In [237]:
data['policies_violated'].notnull().sum()

1471

In [238]:
# take an explicit copy so that new columns can be added to data_clean safely
data_clean = data[data['policies_violated'].notnull()].copy()
data_clean.head(3)

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10
0,679455653,False,finalized,3,2/21/15 0:36,RudeService\nOrderProblem\nFilthy,1.0\n0.6667\n0.6667,Atlanta,,"I'm not a huge mcds lover, but I've been to be...",
1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,Terrible customer service. ŒæI came in at 9:30...,
2,679455655,False,finalized,3,2/21/15 0:26,SlowService\nOrderProblem,1.0\n1.0,Atlanta,,"First they ""lost"" my order, actually they gave...",


In [239]:
data_clean.shape

(1471, 11)

## Task 3

Add a new column to the DataFrame called **"rude"** that is 1 if the **policies_violated** column contains the text "RudeService", and 0 if the **policies_violated** column does not contain "RudeService". The "rude" column is going to be your response variable, so check how many zeros and ones it contains.
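
One possible way to build such a 0/1 column is the vectorized `Series.str.contains()` method. The sketch below runs on a made-up Series of labels rather than the real data, so the counts it prints say nothing about the answer to Question 4.

```python
import pandas as pd

# invented labels in the same newline-separated encoding as policies_violated
labels = pd.Series(["RudeService\nOrderProblem", "SlowService", "na", "RudeService"])

# str.contains returns booleans; astype(int) maps True/False to 1/0.
# regex=False treats the pattern as a plain substring.
rude = labels.str.contains("RudeService", regex=False).astype(int)

print(rude.tolist())
print(rude.value_counts().to_dict())
```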

**Please answer Question 4 as in midterm.pdf.**

In [240]:
def find_rude_service(text):
    c = text.find('RudeService')
    return 1 if c != -1 else 0

In [241]:
data_clean['rude'] = data_clean['policies_violated'].apply(find_rude_service)



In [242]:
data_clean.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10,rude
0,679455653,False,finalized,3,2/21/15 0:36,RudeService\nOrderProblem\nFilthy,1.0\n0.6667\n0.6667,Atlanta,,"I'm not a huge mcds lover, but I've been to be...",,1
1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,Terrible customer service. ŒæI came in at 9:30...,,1
2,679455655,False,finalized,3,2/21/15 0:26,SlowService\nOrderProblem,1.0\n1.0,Atlanta,,"First they ""lost"" my order, actually they gave...",,0
3,679455656,False,finalized,3,2/21/15 0:27,na,0.6667,Atlanta,,I see I'm not the only one giving 1 star. Only...,,0
4,679455657,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,"Well, it's McDonald's, so you know what the fo...",,1


In [244]:
data_clean.shape

(1471, 12)

In [245]:
data_clean[data_clean['rude'] == 0]

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10,rude
2,679455655,False,finalized,3,2/21/15 0:26,SlowService\nOrderProblem,1.0\n1.0,Atlanta,,"First they ""lost"" my order, actually they gave...",,0
3,679455656,False,finalized,3,2/21/15 0:27,na,0.6667,Atlanta,,I see I'm not the only one giving 1 star. Only...,,0
5,679455658,False,finalized,3,2/21/15 0:13,BadFood\nSlowService,0.7111\n0.6444,Atlanta,,This has to be one of the worst and slowest Mc...,,0
6,679455659,False,finalized,3,2/21/15 0:36,SlowService\nScaryMcDs,0.6562\n0.6562,Atlanta,,I'm not crazy about this McDonald's. ŒæThis is...,,0
9,679455662,False,finalized,3,2/21/15 0:12,na,1,Atlanta,,This McDonald's has gotten much better. Usuall...,,0
10,679455663,False,finalized,3,2/21/15 0:38,SlowService,1,Atlanta,,Let's start here only reason I came into McDon...,,0
13,679455666,False,finalized,3,2/21/15 0:15,SlowService,1,Atlanta,,"Believe it or not, this used to be q really go...",,0
15,679455668,False,finalized,3,2/21/15 0:38,SlowService\nScaryMcDs,0.7112\n0.6641,Atlanta,,25 minutes in drive through line. Gunshots fro...,,0
16,679455669,False,finalized,3,2/21/15 2:30,SlowService\nMissingFood\nBadFood,1.0\n1.0\n1.0,Atlanta,,"Super slow service, food's terrible like its b...",,0
17,679455670,False,finalized,3,2/21/15 0:15,SlowService,1,Atlanta,,SLOW-SLOW-SLOW! ŒæDon't go here if you have a ...,,0


In [246]:
len(data_clean[data_clean['rude'] == 0]) / len(data_clean)

0.6580557443915703

## Task 4

Define X using the **review** column and y using the **rude** column. Split X and y into training and testing sets (using the parameter **`random_state=1`**). Use CountVectorizer (with the **default parameters**) to create document-term matrices from X_train and X_test. 
- Note: Please remember to follow the instructions carefully by setting the parameters as required for reproducibility of results. 
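
To make the idea of a document-term matrix concrete, here is a conceptual pure-Python sketch of what fitting on the training set and transforming the testing set means. It is not scikit-learn's implementation: the function names are hypothetical, and it only mimics the default behaviour of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

# tokens of 2+ word characters, the same idea as CountVectorizer's default pattern
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(train_docs):
    """Learn a sorted vocabulary from the training documents only."""
    return sorted({tok for doc in train_docs for tok in tokenize(doc)})


def transform(docs, vocabulary):
    """Count vocabulary terms per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return rows


train_docs = ["Rude staff, rude manager", "Slow drive thru"]
test_docs = ["Rude and slow cashier"]

vocab = fit_vocabulary(train_docs)
train_dtm = transform(train_docs, vocab)
test_dtm = transform(test_docs, vocab)

print(vocab)
print(train_dtm)
print(test_dtm)
```

The testing matrix has the same number of columns as the training matrix because words seen only in the testing set ("and", "cashier") are dropped.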

**Please answer Questions 5 and 6 as in midterm.pdf.**

In [303]:
X = data_clean['review']
Y = data_clean['rude']

In [304]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=1)

In [305]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [306]:
vect.fit_transform(X_train)

<1103x7300 sparse matrix of type '<class 'numpy.int64'>'
	with 70807 stored elements in Compressed Sparse Row format>

In [307]:
a = vect.transform(X_test).toarray()

In [308]:
count_true = 0
count_false = 0
for x in a[24]:
    if x == 1:
        count_true += 1
    else:
        count_false += 1
print(len(a[24]))
print(count_true)
print(count_false)
print(count_true + count_false)

7300
3
7297
7300


## Task 5

Fit a Multinomial Naive Bayes model to the training set, calculate the **predicted probabilities** for the testing set, and then calculate the AUC. Repeat this task using a logistic regression model to compare which of the two models achieves a better AUC. 
- **Note**: McDonald’s requires you to rank the comments by the likelihood that they refer to rude service. In this case, classification accuracy is NOT the relevant evaluation metric. Area Under Curve (AUC) is a more useful evaluation metric for this scenario, since it measures the ability of the classifier to assign higher predicted probabilities to positive instances than to negative instances. 
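
The ranking interpretation of AUC in the note above can be sketched directly: over all (positive, negative) pairs, count how often the positive instance receives the higher score, with ties counted as half. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions; in the notebook itself the value comes from `metrics.roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_toy = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_toy, scores)

# squaring the scores keeps their order, so the AUC is unchanged
auc_squared = pairwise_auc(y_toy, [s ** 2 for s in scores])
print(auc, auc_squared)
```

Because only the ordering of the scores matters, AUC suits a dashboard that sorts comments by predicted probability.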

**Please answer Questions 7, 8 and 9 as in midterm.pdf.** 

In [309]:
x_dtm = vect.fit_transform(X_train)
x_dtm

<1103x7300 sparse matrix of type '<class 'numpy.int64'>'
	with 70807 stored elements in Compressed Sparse Row format>

In [310]:
x_test_dtm = vect.transform(X_test)
x_test_dtm

<368x7300 sparse matrix of type '<class 'numpy.int64'>'
	with 22035 stored elements in Compressed Sparse Row format>

In [311]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [312]:
%time nb.fit(x_dtm, y_train)

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 2.25 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [313]:
y_predict = nb.predict(x_test_dtm[0:5])
y_predict

array([1, 0, 1, 0, 0])

In [262]:
y_predict[:5]

array([1, 0, 1, 0, 0])

In [263]:
from sklearn import metrics

In [266]:
y_pred_prob = nb.predict_proba(x_test_dtm)[:, 1]
y_pred_prob

array([  9.99234643e-01,   2.59242971e-01,   8.57868677e-01,
         2.67695702e-06,   2.68548521e-01,   9.97469782e-01,
         7.40699844e-04,   1.11406133e-09,   1.91914636e-01,
         4.11011564e-01,   9.97466984e-01,   5.38284222e-02,
         6.80854939e-07,   8.34493700e-01,   2.15804856e-01,
         9.95641179e-01,   1.19139453e-04,   1.00000000e+00,
         2.95609981e-03,   8.56160253e-02,   1.07358615e-07,
         1.24737099e-03,   3.17424163e-03,   9.99811373e-01,
         2.97503091e-01,   1.69157013e-07,   5.73190648e-01,
         1.80537581e-01,   9.23675492e-02,   7.26950593e-01,
         7.31090780e-01,   9.99998319e-01,   9.99999992e-01,
         1.53328396e-06,   9.99999991e-01,   1.84177211e-02,
         3.56210035e-01,   5.82027503e-01,   5.44354319e-04,
         2.65523485e-01,   1.27716883e-03,   9.76737932e-04,
         7.18630183e-01,   3.16802348e-02,   1.07181170e-09,
         5.65547636e-06,   4.53633177e-03,   2.72992981e-04,
         6.94722415e-02,

In [267]:
metrics.roc_auc_score(y_test, y_pred_prob)

0.84260054045461774

In [268]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [269]:
logreg.fit(x_dtm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [270]:
y_log_pred_prob = logreg.predict_proba(x_test_dtm)[:, 1]
y_log_pred_prob

array([  6.25389450e-01,   1.05754070e-01,   7.90141143e-01,
         3.74281004e-02,   4.66577393e-01,   8.91779105e-01,
         2.28889282e-04,   1.48801662e-02,   2.37555257e-01,
         1.08308607e-01,   9.76128478e-01,   2.07538902e-01,
         6.55979196e-03,   8.25495936e-01,   2.93810580e-01,
         9.37374585e-01,   6.78672491e-03,   9.99985911e-01,
         3.02491317e-02,   7.61189855e-01,   1.16738205e-01,
         6.42703408e-02,   8.25979119e-03,   6.37884017e-01,
         1.42652186e-01,   1.40337531e-03,   8.95294126e-01,
         5.53867472e-02,   3.27727873e-01,   2.43932246e-01,
         7.20170810e-01,   9.96852702e-01,   9.77313264e-01,
         4.06227978e-02,   6.71192105e-01,   1.10247089e-02,
         3.49702101e-02,   6.85325006e-02,   7.96734177e-01,
         4.06324124e-01,   7.42751390e-02,   1.96956427e-02,
         4.68482060e-01,   8.68938475e-02,   7.52135575e-05,
         1.62521599e-02,   6.36370303e-03,   3.57069959e-02,
         7.19950716e-02,

In [271]:
metrics.roc_auc_score(y_test, y_log_pred_prob)

0.8233667143538389

In [272]:
metrics.roc_auc_score(y_test, y_pred_prob) - metrics.roc_auc_score(y_test, y_log_pred_prob)

0.019233826100778839

## Task 6

Using Naive Bayes, try **tuning CountVectorizer** using some of the techniques we learned in class. Check the testing set AUC after each change, and find the set of parameters that increases AUC the most. (This is meant for your own learning experience)
- **Hint**: It is highly recommended that you adapt the **`tokenize_test()`** function from class for this purpose, since it will allow you to iterate quickly through different sets of parameters. 
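
Two of the parameters commonly tuned here, `min_df` and `max_df`, filter terms by document frequency. The pure-Python sketch below illustrates that filtering on invented, pre-tokenized documents; the function name is hypothetical and the code is not scikit-learn's implementation. It assumes the usual convention that an integer threshold is an absolute document count and a float is a proportion of documents.

```python
def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # integers are absolute counts, floats are proportions of the corpus
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


docs = [
    ["rude", "staff"],
    ["rude", "slow"],
    ["rude", "cold", "fries"],
    ["slow", "line"],
]

all_terms = filter_by_document_frequency(docs)
kept = filter_by_document_frequency(docs, min_df=2, max_df=0.5)
print(all_terms)
print(kept)
```

In this toy corpus "rude" appears in 3 of 4 documents and is removed by `max_df=0.5`, while the terms that appear only once are removed by `min_df=2`.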

**Please answer Questions 10 and 11 as in midterm.pdf.**

In [273]:
def tokenize_test(vect):
    # learn the vocabulary from X_train, then build both document-term matrices
    x_dtm = vect.fit_transform(X_train)
    x_test_dtm = vect.transform(X_test)
    
    # import and instantiate a Multinomial Naive Bayes model
    from sklearn.naive_bayes import MultinomialNB
    nb = MultinomialNB()
    
    nb.fit(x_dtm, y_train)
    y_pred_prob = nb.predict_proba(x_test_dtm)[:, 1]
    
    from sklearn import metrics
    return metrics.roc_auc_score(y_test, y_pred_prob)

In [274]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [275]:
tokenize_test(vect)

0.84260054045461774

In [276]:
# max_df: ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)
tokenize_test(vect)

0.84488952471785073

In [277]:
# min_df: only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
tokenize_test(vect)

0.84458750596089649

In [278]:
# stop_words: remove common English words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

0.85352090287712601

In [279]:
# max_features: only keep the 1000 most frequent terms
vect = CountVectorizer(max_features=1000)
tokenize_test(vect)

0.83009060562708625

In [280]:
# ngram_range: include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

0.81959942775393424

In [281]:
# combine min_df, max_df and English stop word removal
vect = CountVectorizer(min_df=4, max_df=0.3, stop_words='english')
tokenize_test(vect)

0.86215228103640118

In [282]:
x_dtm = vect.fit_transform(X_train)
x_dtm

<1103x1732 sparse matrix of type '<class 'numpy.int64'>'
	with 31806 stored elements in Compressed Sparse Row format>

## Task 7 

The city column might be predictive of the response, but we are currently not using it as a feature. We will now explore whether we can increase the AUC by adding city to the model. You are to do the following:
1. Create a new DataFrame column, review_city, that concatenates the review text with the city text. One easy way to combine string columns in pandas is by using the `Series.str.cat()` method. Make sure to use the whitespace character as a separator, as well as replacing null city values with a reasonable string value such as ‘na’. 
2. Redefine X using the review_city column, and re-split X and y into training and testing sets (using the parameter `random_state=1`). 
3. Using CountVectorizer with English stop word removal, `max_df=0.3` and `min_df=4`, check whether adding the city has increased or decreased the AUC.
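
Step 1 can be sketched on a toy DataFrame. The rows are invented for illustration; the call concatenates two string columns element-wise, using `sep` as the separator and `na_rep` as the replacement for missing cities.

```python
import numpy as np
import pandas as pd

# toy data (invented): the second row has no city
toy = pd.DataFrame({
    "review": ["slow service", "rude staff"],
    "city": ["Atlanta", np.nan],
})

# element-wise concatenation; NaN cities become the string "na"
toy["review_city"] = toy["review"].str.cat(toy["city"], sep=" ", na_rep="na")
print(toy["review_city"].tolist())
```

Without `na_rep`, a missing city would make the whole concatenated value missing, which would lose the review text for that row.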

**Please answer Question 12 as in midterm.pdf.** 

In [283]:
def get_review_city(text, city):
    return pd.Series([text, city]).str.cat(sep=' ', na_rep='na')

In [284]:
data_clean['review_city'] = data_clean.apply(lambda x: get_review_city(x['review'], x['city']), axis=1)
data_clean



Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10,rude,review_city
0,679455653,False,finalized,3,2/21/15 0:36,RudeService\nOrderProblem\nFilthy,1.0\n0.6667\n0.6667,Atlanta,,"I'm not a huge mcds lover, but I've been to be...",,1,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,Terrible customer service. ŒæI came in at 9:30...,,1,Terrible customer service. ŒæI came in at 9:30...
2,679455655,False,finalized,3,2/21/15 0:26,SlowService\nOrderProblem,1.0\n1.0,Atlanta,,"First they ""lost"" my order, actually they gave...",,0,"First they ""lost"" my order, actually they gave..."
3,679455656,False,finalized,3,2/21/15 0:27,na,0.6667,Atlanta,,I see I'm not the only one giving 1 star. Only...,,0,I see I'm not the only one giving 1 star. Only...
4,679455657,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,"Well, it's McDonald's, so you know what the fo...",,1,"Well, it's McDonald's, so you know what the fo..."
5,679455658,False,finalized,3,2/21/15 0:13,BadFood\nSlowService,0.7111\n0.6444,Atlanta,,This has to be one of the worst and slowest Mc...,,0,This has to be one of the worst and slowest Mc...
6,679455659,False,finalized,3,2/21/15 0:36,SlowService\nScaryMcDs,0.6562\n0.6562,Atlanta,,I'm not crazy about this McDonald's. ŒæThis is...,,0,I'm not crazy about this McDonald's. ŒæThis is...
7,679455660,False,finalized,3,2/21/15 0:15,RudeService,0.6801,Atlanta,,One Star and I'm beng kind. I blame management...,,1,One Star and I'm beng kind. I blame management...
8,679455661,False,finalized,3,2/21/15 0:29,SlowService\nRudeService\nMissingFood,1.0\n1.0\n0.6667,Atlanta,,Never been upset about any fast food drive thr...,,1,Never been upset about any fast food drive thr...
9,679455662,False,finalized,3,2/21/15 0:12,na,1,Atlanta,,This McDonald's has gotten much better. Usuall...,,0,This McDonald's has gotten much better. Usuall...


In [285]:
X_city = data_clean['review_city']
X_city

0       I'm not a huge mcds lover, but I've been to be...
1       Terrible customer service. ŒæI came in at 9:30...
2       First they "lost" my order, actually they gave...
3       I see I'm not the only one giving 1 star. Only...
4       Well, it's McDonald's, so you know what the fo...
5       This has to be one of the worst and slowest Mc...
6       I'm not crazy about this McDonald's. ŒæThis is...
7       One Star and I'm beng kind. I blame management...
8       Never been upset about any fast food drive thr...
9       This McDonald's has gotten much better. Usuall...
10      Let's start here only reason I came into McDon...
11      Other businesses throughout Metro Atlanta open...
12      The drive thru makes them lost a star since my...
13      Believe it or not, this used to be q really go...
14      As the previous yelpers have already stated, t...
15      25 minutes in drive through line. Gunshots fro...
16      Super slow service, food's terrible like its b...
17      SLOW-S

In [286]:
Y

0       1
1       1
2       0
3       0
4       1
5       0
6       0
7       1
8       1
9       0
10      0
11      1
12      1
13      0
14      1
15      0
16      0
17      0
18      0
19      1
20      0
21      0
22      1
23      0
24      0
25      0
26      1
27      0
28      1
29      0
       ..
1495    0
1496    0
1497    0
1498    0
1499    1
1500    0
1501    0
1502    1
1503    0
1504    1
1505    1
1506    0
1507    0
1508    0
1509    1
1510    0
1511    1
1512    1
1513    0
1514    0
1515    0
1516    0
1517    1
1518    0
1519    1
1520    0
1521    0
1522    0
1523    0
1524    0
Name: rude, dtype: int64

In [287]:
X_city_train, X_city_test, y_city_train, y_city_test = train_test_split(X_city, Y, random_state=1)

In [288]:
vect = CountVectorizer(min_df=4, max_df=0.3, stop_words='english')

In [289]:
# learn the vocabulary from X_city_train, then build both document-term matrices
x_dtm = vect.fit_transform(X_city_train)
x_test_dtm = vect.transform(X_city_test)

# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

nb.fit(x_dtm, y_city_train)
y_pred_prob = nb.predict_proba(x_test_dtm)[:, 1]

from sklearn import metrics
print(metrics.roc_auc_score(y_city_test, y_pred_prob))

0.864854554125


In [290]:
metrics.roc_auc_score(y_city_test, y_pred_prob) - 0.86215228103640118

0.0027022730885392088

## Task 8 

The **policies_violated:confidence** column may be useful as it is a measure of the training data quality. You are to calculate the **mean confidence** score for each row of your McDonald’s dataset (i.e. X_train together with X_test) and store these mean scores in a new column. For example, the confidence scores for the first row are 1.0\r\n0.6667\r\n0.6667, so you should calculate a mean of 0.7778. Here are some of the steps you can follow:
1. Using the `Series.str.split()` method, convert the policies_violated:confidence column into lists of one or more “confidence scores”. Save the results as a new DataFrame column called **confidence_list**. 
2. Apply a function that can calculate the mean of a list of numbers, and pass that function to the `Series.apply()` method of the **confidence_list** column. Save those scores in a new DataFrame column called **confidence_mean**. 
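
The worked example from the task description can be verified with a few lines of plain Python. The function name `mean_confidence` is hypothetical; splitting on any whitespace handles both `\n` and `\r\n` separators.

```python
def mean_confidence(raw):
    """Mean of the whitespace-separated confidence scores in one cell."""
    scores = [float(s) for s in raw.split()]
    return sum(scores) / len(scores)


first_row = mean_confidence("1.0\r\n0.6667\r\n0.6667")
single = mean_confidence("1")
print(round(first_row, 4))
print(single)
```

Here (1.0 + 0.6667 + 0.6667) / 3 = 2.3334 / 3, which rounds to 0.7778, matching the value quoted above.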

**Please answer Question 13 as in midterm.pdf.**

In [291]:
data_clean['confidence_list'] = data_clean['policies_violated:confidence'].str.split()
data_clean.head(2)



Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10,rude,review_city,confidence_list
0,679455653,False,finalized,3,2/21/15 0:36,RudeService\nOrderProblem\nFilthy,1.0\n0.6667\n0.6667,Atlanta,,"I'm not a huge mcds lover, but I've been to be...",,1,"I'm not a huge mcds lover, but I've been to be...","[1.0, 0.6667, 0.6667]"
1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,Terrible customer service. ŒæI came in at 9:30...,,1,Terrible customer service. ŒæI came in at 9:30...,[1]


In [292]:
def get_confidence_mean(lst):
    lst = [float(x) for x in lst]
    return sum(lst)/len(lst)

In [293]:
data_clean['confidence_mean'] = data_clean['confidence_list'].apply(get_confidence_mean)
data_clean.head(2)



Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10,rude,review_city,confidence_list,confidence_mean
0,679455653,False,finalized,3,2/21/15 0:36,RudeService\nOrderProblem\nFilthy,1.0\n0.6667\n0.6667,Atlanta,,"I'm not a huge mcds lover, but I've been to be...",,1,"I'm not a huge mcds lover, but I've been to be...","[1.0, 0.6667, 0.6667]",0.7778
1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,Terrible customer service. ŒæI came in at 9:30...,,1,Terrible customer service. ŒæI came in at 9:30...,[1],1.0


In [294]:
len(data_clean[data_clean['confidence_mean'] == 1])

785

We would now like to remove lower-quality rows from the training set to reduce noise. You are to remove all rows from X_train and y_train that have a confidence_mean lower than 0.75.
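
A minimal sketch of this kind of filtering, on invented toy Series, is shown below. The names `X_train_toy`, `y_train_toy` and `confidence_mean` are illustrative; the point is that the boolean mask is built from the training index only, so the testing rows are never touched.

```python
import pandas as pd

# toy training split (invented); the index labels mimic shuffled row numbers
X_train_toy = pd.Series(["rude staff", "slow line", "cold fries"], index=[7, 2, 5])
y_train_toy = pd.Series([1, 0, 0], index=[7, 2, 5])

# mean confidence for every row of the full dataset, train and test alike
confidence_mean = pd.Series([0.6667, 1.0, 0.9, 0.7778, 0.5], index=[2, 5, 7, 9, 11])

# look up the confidence of the training rows only, then threshold it
keep = confidence_mean.loc[X_train_toy.index] >= 0.75
X_kept = X_train_toy[keep]
y_kept = y_train_toy[keep]

print(X_kept.tolist())
print(y_kept.tolist())
```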

**Please answer Questions 14 and 15 as in midterm.pdf.**

In [295]:
data_clean.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10,rude,review_city,confidence_list,confidence_mean
0,679455653,False,finalized,3,2/21/15 0:36,RudeService\nOrderProblem\nFilthy,1.0\n0.6667\n0.6667,Atlanta,,"I'm not a huge mcds lover, but I've been to be...",,1,"I'm not a huge mcds lover, but I've been to be...","[1.0, 0.6667, 0.6667]",0.7778
1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,Terrible customer service. ŒæI came in at 9:30...,,1,Terrible customer service. ŒæI came in at 9:30...,[1],1.0
2,679455655,False,finalized,3,2/21/15 0:26,SlowService\nOrderProblem,1.0\n1.0,Atlanta,,"First they ""lost"" my order, actually they gave...",,0,"First they ""lost"" my order, actually they gave...","[1.0, 1.0]",1.0
3,679455656,False,finalized,3,2/21/15 0:27,na,0.6667,Atlanta,,I see I'm not the only one giving 1 star. Only...,,0,I see I'm not the only one giving 1 star. Only...,[0.6667],0.6667
4,679455657,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,"Well, it's McDonald's, so you know what the fo...",,1,"Well, it's McDonald's, so you know what the fo...",[1],1.0


In [296]:
X_city_train, X_city_test, y_city_train, y_city_test = train_test_split(X_city, Y, random_state=1)

In [297]:
# keep only the training rows whose mean confidence is at least 0.75
keep = data_clean.loc[X_city_train.index, 'confidence_mean'] >= 0.75
X_city_train = X_city_train[keep]
y_city_train = y_city_train[keep]

In [298]:
vect = CountVectorizer(min_df=4, max_df=0.3, stop_words='english')

In [299]:
# learn the vocabulary from the filtered training set, then build both document-term matrices
x_dtm = vect.fit_transform(X_city_train)
x_test_dtm = vect.transform(X_city_test)

# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

nb.fit(x_dtm, y_city_train)
y_pred_prob = nb.predict_proba(x_test_dtm)[:, 1]

from sklearn import metrics
print(metrics.roc_auc_score(y_city_test, y_pred_prob))

0.849690033381


In [300]:
len(X_city_train)

799