# Midterm Assignment: McDonald's Sentiment Data Analysis - Solution

## Problem

McDonald’s receives thousands of consumer comments on its website every day, and many of them are negative. Corporate employees do not have time to read every comment, but they do want to read the subset they are most interested in. In particular, articles about rude service by McDonald’s employees have recently surfaced on social media. In order to take appropriate action, the company would now like to review comments about **rude service**.

You are hired to develop a system that ranks each comment by the **likelihood that it is referring to rude service**. They will use this system to build a “rudeness dashboard” for their corporate employees, so that the employees can spend a few minutes each day examining the **most relevant recent comments**.


## Data

McDonald’s used the CrowdFlower platform to pay humans to hand-annotate approximately 1500 comments with the type of complaint. The list of complaint types can be found below, with the encoding used listed in parentheses: 
- Bad Food (BadFood)
- Bad Neighborhood (ScaryMcDs)
- Cost (Cost)
- Dirty Location (Filthy)
- Missing Item (MissingFood)
- Problem with Order (OrderProblem)
- Rude Service (RudeService)
- Slow Service (SlowService)
- None of the above (na) 

You will be asked to complete a series of tasks, with multiple-choice questions (MCQs) interspersed among them. For each question, select the best available option.

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

## Task 1

Read **'mcdonalds.csv'** into a pandas DataFrame and examine it. (Instructions: mcdonalds.csv can be found in “IVLE Workbin > Midterm Assignment > data”) 

A description of the more important columns to get you started: 
- The **policies_violated** column lists the type of complaint. If there is more than one type, the types are separated by newline characters.
- The **policies_violated:confidence** column lists CrowdFlower's confidence in the judgments of its human annotators for that row (higher is better).
- The **city** column is the McDonald's location.
- The **review** column is the actual text comment.
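
The newline-separated format of **policies_violated** can be unpacked as in the sketch below. The DataFrame `toy` and its rows are made up for illustration and are not taken from mcdonalds.csv; only the column names follow the description above.

```python
import pandas as pd

# Toy rows that mimic the column layout described above (not real data).
toy = pd.DataFrame({
    "policies_violated": ["RudeService\nSlowService", "BadFood", "na"],
    "review": ["hypothetical review A", "hypothetical review B", "hypothetical review C"],
})

# Split each cell on newlines to get one list of complaint types per row.
toy["complaint_types"] = toy["policies_violated"].str.split("\n")

# Flatten the lists and count how often each complaint type occurs.
type_counts = toy["complaint_types"].explode().value_counts()

print(toy["complaint_types"].tolist())
print(type_counts.to_dict())
```

A row with two complaint types contributes to both counts, which is why the later tasks test whether a cell *contains* "RudeService" rather than whether it equals it.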

**Please answer Question 1 as in midterm.pdf.** 

In [2]:
import pandas as pd
df = pd.read_csv('data/mcdonalds.csv')
df.loc[302,'review']

"Went here after work for a quick Mickey D fix. I had order a 6 piece chicken nuggets meal. Sorry to say but I think it will be my last time here. Don't get me wrong, the food tasted fine by my insides didn't think so."

## Task 2

Remove any rows from the DataFrame in which the policies_violated column has a null value.
- **Note**: Null values are also known as “missing values” and are encoded in pandas with the special value “NaN”. This is different from the “na” encoding used by CrowdFlower to denote “None of the above”. Rows that contain “na” should not be removed.

- **Note**: pandas.notnull() returns True if the object is not null and False if the object is null.
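
The difference between a true missing value and the string “na” can be seen on a small made-up column. This is a sketch on toy data, assuming the column is built directly in memory rather than read from mcdonalds.csv.

```python
import numpy as np
import pandas as pd

# Toy column: one real label, two missing values, and the string "na".
toy = pd.DataFrame({"policies_violated": ["RudeService", np.nan, "na", None]})

# isna() flags only true missing values; the string "na" is ordinary text.
is_missing = toy["policies_violated"].isna()

# Keep the rows that have a label, including the "na" (None of the above) row.
kept = toy[toy["policies_violated"].notnull()]

print(is_missing.tolist())
print(kept["policies_violated"].tolist())
```

Filtering with `notnull()` therefore drops the two missing rows and keeps the “na” row, as the task requires.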

**Please answer Question 2 as in midterm.pdf.**

In [3]:
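# Number of rows with a missing complaint type before filtering
print(df.policies_violated.isna().sum())
# Keep only rows with a label; rows labelled "na" are strings and are kept
df = df[df.policies_violated.notnull()]
# Confirm that no missing values remain, then check the new shape
print(df.policies_violated.isna().sum())
print(df.shape)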

54
0
(1471, 11)


## Task 3

Add a new column to the DataFrame called **"rude"** that is 1 if the **policies_violated** column contains the text "RudeService", and 0 if the **policies_violated** column does not contain "RudeService". The "rude" column is going to be your response variable, so check how many zeros and ones it contains.

- **Note**: The `.iloc[]` indexer can be used to select DataFrame rows by position.
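
A vectorized way to build such a flag, together with positional selection via `.iloc[]`, might look like the sketch below. The toy labels and the non-default index are assumptions chosen to show that `.iloc[]` counts positions, not index labels.

```python
import pandas as pd

# Toy labels with a non-default index (made up for illustration).
toy = pd.DataFrame(
    {"policies_violated": ["RudeService\nSlowService", "BadFood", "na", "RudeService"]},
    index=[10, 11, 12, 13],
)

# 1 if the cell contains the literal text "RudeService", else 0.
toy["rude"] = toy["policies_violated"].str.contains("RudeService", regex=False).astype(int)

# .iloc selects by position: the first three rows, whatever their labels are.
first_three = toy.iloc[:3]

# Share of those rows that are NOT flagged as rude.
share_not_rude = 1 - first_three["rude"].mean()

print(toy["rude"].tolist())
print(first_three.index.tolist())
print(share_not_rude)
```

Using `.mean()` on a 0/1 column gives the proportion of ones directly and avoids integer-division surprises on Python 2.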

**Please answer Question 3 as in midterm.pdf.**

In [4]:
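# Flag rows whose complaint types include "RudeService" (1 = rude, 0 = not rude)
df['rude'] = [1 if 'RudeService' in policies else 0 for policies in df.policies_violated]
# Share of the first 500 rows NOT flagged as rude (mean() avoids integer division on Python 2)
print(1 - df.head(500).rude.mean())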

0.654


## Task 4

Define X using the **review** column and y using the **rude** column. Split X and y into training and testing sets (using the parameter **`random_state=1`**). Use CountVectorizer (with the **default parameters**) to create document-term matrices from X_train and X_test. 
- **Note**: Please follow the instructions carefully and set the parameters as required, so that your results are reproducible.
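
The reason for calling `fit_transform` on the training text but only `transform` on the testing text can be seen on a toy corpus. The sentences below are made up; the point of the sketch is that the vocabulary is learned from the training documents alone, so words that appear only in the test document are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for X_train and X_test.
train_docs = ["the cashier was rude", "the fries were cold"]
test_docs = ["rude cashier and cold burger"]

vect = CountVectorizer()

# Learn the vocabulary from the training documents and count their tokens.
train_dtm = vect.fit_transform(train_docs)

# Reuse that vocabulary for the test documents; unseen words are dropped.
test_dtm = vect.transform(test_docs)

vocab = sorted(vect.vocabulary_)
print(vocab)
print(train_dtm.shape, test_dtm.shape)
print(test_dtm.sum())  # only "rude", "cashier", and "cold" are counted
```

Both matrices end up with the same number of columns, which is what allows a model fitted on the training matrix to score the testing matrix.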

**Please answer Question 4 as in midterm.pdf.**

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
X = df.review
y = df.rude
x_train, x_test, y_train, y_test = train_test_split(X,y,random_state=1)
vectorizer = CountVectorizer()
dt_train = vectorizer.fit_transform(x_train)
dt_test = vectorizer.transform(x_test)
print(dt_train.shape[1])
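# vocabulary_ maps each token to its column index; get_feature_names() was removed in newer scikit-learn releases
print(len(vectorizer.vocabulary_))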

7300
7300


## Task 5

Fit a Multinomial Naive Bayes model to the training set, calculate the **predicted probabilities** for the testing set, and then calculate the AUC. Repeat this task using a logistic regression model to compare which of the two models achieves a better AUC. 
- **Note**: McDonald’s requires you to rank the comments by the likelihood that they refer to rude service. In this case, classification accuracy is NOT the relevant evaluation metric. Area Under Curve (AUC) is a more useful evaluation metric for this scenario, since it measures the ability of the classifier to assign higher predicted probabilities to positive instances than to negative instances. 
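
The ranking interpretation of AUC can be checked by hand. The sketch below uses a hypothetical helper, `pairwise_auc`, and made-up labels and scores: it computes the fraction of (positive, negative) pairs in which the positive instance receives the higher score, counting ties as half.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = rude) and predicted probabilities.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]

auc = pairwise_auc(y_true, scores)

# Halving every score keeps the ordering, so the AUC is unchanged.
auc_rescaled = pairwise_auc(y_true, [s / 2.0 for s in scores])

print(auc, auc_rescaled)
```

Five of the six pairs are ordered correctly here, so the AUC is 5/6. Because only the ordering of the scores matters, AUC suits a dashboard that shows the top-ranked comments, whereas accuracy depends on a fixed classification threshold.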

**Please answer Questions 5 and 6 as in midterm.pdf.** 

In [6]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

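# Fit Multinomial Naive Bayes on the training document-term matrix
nb = MultinomialNB()
nb.fit(dt_train, y_train)
# Column 1 of predict_proba holds P(rude = 1) for each test comment
y_pred_prob = nb.predict_proba(dt_test)[:, 1]
y_pred_class = nb.predict(dt_test)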

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob, pos_label=1)
auc_nb = metrics.auc(fpr, tpr)
print(y_pred_prob[4])
print(y_pred_class[4])
print(auc_nb)

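# Repeat with logistic regression on the same matrices for comparison
lg = LogisticRegression()
lg.fit(dt_train, y_train)
# Column 1 of predict_proba holds P(rude = 1) for each test comment
y_pred_prob = lg.predict_proba(dt_test)[:, 1]
y_pred_class = lg.predict(dt_test)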

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob, pos_label=1)
auc_lg = metrics.auc(fpr, tpr)
print(y_pred_class[4])
print(auc_lg)
print(auc_nb - auc_lg)




0.26854852118209765
0
0.8426005404546177
0
0.8233985058019394
0.019202034652678335


## Task 6

Using Naive Bayes, try **tuning CountVectorizer** using some of the techniques we learned in class. Check the testing set AUC after each change, and find the set of parameters that increases AUC the most. (This exercise is meant for your own learning.)
- **Hint**: It is highly recommended that you adapt the **`tokenize_test()`** function from class for this purpose, since it will allow you to iterate quickly through different sets of parameters. 
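
The `tokenize_test()` function from class is not reproduced in this document, so the version below is only a hypothetical sketch of what such a helper could look like. Its signature, its return value, and the toy corpus are assumptions: it fits the given vectorizer, trains Multinomial Naive Bayes, and returns the number of features together with the testing set AUC.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB


def tokenize_test(vect, X_train, X_test, y_train, y_test):
    """Hypothetical helper: return (number of features, testing set AUC)."""
    train_dtm = vect.fit_transform(X_train)
    test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(train_dtm, y_train)
    # Column 1 holds the predicted probability of the positive class.
    prob_rude = nb.predict_proba(test_dtm)[:, 1]
    return train_dtm.shape[1], roc_auc_score(y_test, prob_rude)


# Tiny made-up corpus (1 = rude service), standing in for the real split.
X_train = [
    "the cashier was rude to me",
    "rude staff and rude manager",
    "the fries were cold",
    "my order was missing the fries",
]
y_train = [1, 1, 0, 0]
X_test = ["very rude cashier", "cold fries again"]
y_test = [1, 0]

# Try a few CountVectorizer settings and compare feature counts and AUC.
candidates = [
    CountVectorizer(),
    CountVectorizer(stop_words="english"),
    CountVectorizer(ngram_range=(1, 2)),
]
results = [tokenize_test(v, X_train, X_test, y_train, y_test) for v in candidates]
for n_features, auc in results:
    print("features: {}  AUC: {}".format(n_features, auc))
```

On the real data, the same loop could be run with `x_train`, `x_test`, `y_train`, and `y_test` from Task 4, changing one parameter at a time so that the effect of each change on AUC is visible.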


**Please answer Questions 7 and 8 as in midterm.pdf.**

In [7]:
vectorizer = CountVectorizer(stop_words='english', max_df = 0.3, min_df = 4)
dt_train = vectorizer.fit_transform(x_train)
dt_test = vectorizer.transform(x_test)
print(dt_train.shape[1])

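# Refit Naive Bayes on the matrices produced by the tuned vectorizer
nb = MultinomialNB()
nb.fit(dt_train, y_train)
# Column 1 of predict_proba holds P(rude = 1) for each test comment
y_pred_prob = nb.predict_proba(dt_test)[:, 1]
y_pred_class = nb.predict(dt_test)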

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob, pos_label=1)
auc_nb = metrics.auc(fpr, tpr)
print(y_pred_prob[4])
print(y_pred_class[4])
print(auc_nb)

1732
0.9104643300897306
1
0.8621522810364012
