# Machine Learning and NLP Exercises #

## Introduction ##

We will be using the same review data set from Kaggle from Week 2 for this exercise. The product we'll focus on this time is a cappuccino cup. The goal this week is not only to preprocess the data, but also to classify reviews as positive or negative based on the review text.

The following code will help you load in the data.

In [1]:
import numpy as np
import nltk
import pandas as pd
import re
import string
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
data = pd.read_csv('../data/coffee.csv')
data.head()

Unnamed: 0,user_id,stars,reviews
0,A2XP9IN4JOMROD,1,I wanted to love this. I was even prepared for...
1,A2TS09JCXNV1VD,5,Grove Square Cappuccino Cups were excellent. T...
2,AJ3L5J7GN09SV,2,I bought the Grove Square hazelnut cappuccino ...
3,A3CZD34ZTUJME7,1,"I love my Keurig, and I love most of the Keuri..."
4,AWKN396SHAQGP,1,It's a powdered drink. No filter in k-cup.<br ...


## Question 1 ##

* Determine how many reviews there are in total.
* Determine the percent of 1, 2, 3, 4 and 5 star reviews.
* Create a new data set for modeling with the following columns:
     - Column 1: 'positive' if stars = 4 or 5, and 'negative' if stars = 1 or 2
     - Column 2: review text
* Take a look at the number of positive and negative reviews in the newly created data set.

Checkpoint: the resulting data set should have 514 reviews (the 28 three-star reviews are excluded).

Use the preprocessing code below to clean the reviews data before moving on to modeling.

In [3]:
print('The number of reviews is',data.shape[0])

The number of reviews is 542


In [4]:
(data.groupby('stars').size())

stars
1     96
2     45
3     28
4     65
5    308
dtype: int64
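
The cell above reports counts, while the question asks for percentages. A minimal sketch of the conversion follows, using the counts printed above as a stand-in for the real column; the names `star_counts` and `star_percent` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Star counts copied from the groupby output above (stand-in for data['stars'])
star_counts = pd.Series({1: 96, 2: 45, 3: 28, 4: 65, 5: 308}, name='count')

# Convert counts to percentages of all reviews, rounded to two decimals
star_percent = (star_counts / star_counts.sum() * 100).round(2)
print(star_percent)

# On the full DataFrame, an equivalent one-liner would be:
# data['stars'].value_counts(normalize=True).sort_index() * 100
```
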

In [5]:
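# Requires the NLTK stop word list: run nltk.download('stopwords') once if it is missing
stopwords_english = set(stopwords.words('english'))
# Keep only runs of word characters (drops punctuation); the raw string avoids an invalid-escape warning
data.loc[:,"reviews"] = data.reviews.apply(lambda x:" ".join(re.findall(r'\w+',x)))
def remove_stopwords(s):
    # The match is case-sensitive, so capitalized words such as "I" are kept
    s = " ".join(word for word in s.split() if word not in stopwords_english)
    return s
data.loc[:,"reviews"] = data.reviews.apply(remove_stopwords)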
data.head()

Unnamed: 0,user_id,stars,reviews
0,A2XP9IN4JOMROD,1,I wanted love I even prepared somewhat like ch...
1,A2TS09JCXNV1VD,5,Grove Square Cappuccino Cups excellent Tasted ...
2,AJ3L5J7GN09SV,2,I bought Grove Square hazelnut cappuccino k cu...
3,A3CZD34ZTUJME7,1,I love Keurig I love Keurig coffees This insta...
4,AWKN396SHAQGP,1,It powdered drink No filter k cup br Just buy ...


In [6]:
# Label 1-2 star reviews as negative and 4-5 star reviews as positive; 3-star reviews become NaN
label_map = {1: 'negative', 2: 'negative', 4: 'positive', 5: 'positive'}
data['reviews_text'] = data['stars'].map(label_map)
# Keep only labeled reviews; .copy() avoids chained-assignment warnings later
data_trim = data[data['reviews_text'].notna()].copy()
data_trim.shape

(514, 4)

In [15]:
data_trim.head()

Unnamed: 0,user_id,stars,reviews,reviews_text
0,A2XP9IN4JOMROD,1,I wanted love I even prepared somewhat like ch...,negative
1,A2TS09JCXNV1VD,5,Grove Square Cappuccino Cups excellent Tasted ...,positive
2,AJ3L5J7GN09SV,2,I bought Grove Square hazelnut cappuccino k cu...,negative
3,A3CZD34ZTUJME7,1,I love Keurig I love Keurig coffees This insta...,negative
4,AWKN396SHAQGP,1,It powdered drink No filter k cup br Just buy ...,negative
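
The last bullet of the question asks for the number of positive and negative reviews. A sketch of that check is below; it rebuilds a stars column from the counts reported earlier so that it runs on its own, and the names `stars`, `labels` and `label_counts` are illustrative. On the real data, `data_trim['reviews_text'].value_counts()` would give the same tally.

```python
import numpy as np
import pandas as pd

# Rebuild a stars column from the counts reported earlier (stand-in for the real data)
stars = pd.Series(np.repeat([1, 2, 3, 4, 5], [96, 45, 28, 65, 308]), name='stars')

# Same labeling rule as the exercise: 1-2 negative, 4-5 positive, 3-star rows dropped
labels = stars.map({1: 'negative', 2: 'negative', 4: 'positive', 5: 'positive'}).dropna()

# Absolute counts and class shares
label_counts = labels.value_counts()
print(label_counts)
print((label_counts / label_counts.sum()).round(3))
```

The classes are imbalanced (roughly 73% positive), which is worth keeping in mind when reading accuracy later on.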


## Question 2 ##

Prepare the data for modeling:
* Split the data into training and test sets. You should have four sets of data - X_train, X_test, y_train, y_test

Create numerical features with Count Vectorizer. Create two document-term matrices:
* Matrix 1: Terms should be unigrams (single words), and values should be word counts (Hint: this is the Count Vectorizer default)
* Matrix 2: Terms should be unigrams and bigrams, and values should be binary values

Recommendation: Use Count Vectorizer's stop_words parameter to remove stop words from the reviews text.

In [16]:
x = data_trim.loc[:,'reviews']
y = data_trim.loc[:,'reviews_text']

In [17]:
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.2)
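
Because the split above is random and unseeded, results will differ from run to run. One possible variant, sketched here on hypothetical toy data, fixes the seed and stratifies on the label so both sets keep the same positive/negative ratio. The seed value 42 and the names `texts` and `labels` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the review text and labels (hypothetical data)
texts = pd.Series([f"review {i}" for i in range(100)])
labels = pd.Series(['positive'] * 70 + ['negative'] * 30)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.2,
    stratify=labels,   # keep the positive/negative ratio the same in both sets
    random_state=42,   # arbitrary seed so the split is reproducible
)

# Class shares should match between the training and test sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```
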

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
xtrain_cv = cv.fit_transform(xtrain).toarray()
xtest_cv = cv.transform(xtest).toarray()
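
The cell above builds only the default unigram count matrix. The sketch below shows how both matrices from the question might be produced with Count Vectorizer; the tiny corpus and the variable names (`cv1`, `cv2`, and so on) are hypothetical stand-ins for `xtrain` and `xtest`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny hypothetical corpus standing in for xtrain / xtest
train_docs = ["great coffee great taste", "tastes like powdered milk", "great value"]
test_docs = ["powdered coffee", "great great taste"]

# Matrix 1: unigram word counts (the defaults), with English stop words removed
cv1 = CountVectorizer(stop_words='english')
X_train_cv1 = cv1.fit_transform(train_docs)
X_test_cv1 = cv1.transform(test_docs)   # reuse the vocabulary learned on the training set

# Matrix 2: unigrams and bigrams, with binary (present/absent) values
cv2 = CountVectorizer(stop_words='english', ngram_range=(1, 2), binary=True)
X_train_cv2 = cv2.fit_transform(train_docs)
X_test_cv2 = cv2.transform(test_docs)

print(X_train_cv1.shape, X_train_cv2.shape)
print(X_train_cv1.max(), X_train_cv2.max())
```

Both vectorizers are fit on the training documents only, so the test set never influences the vocabulary. The matrices stay sparse here; scikit-learn's linear models and Naive Bayes classifiers accept sparse input, so calling `.toarray()` is optional.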

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr = LogisticRegression(solver='lbfgs')
lr.fit(xtrain_cv,ytrain)
y_pred = lr.predict(xtest_cv)

In [9]:
def build_ngrams(text,n=2):
    tokens = text.lower().split()
    return list(ngrams(tokens,n))

In [10]:
# Build bigrams for every cleaned review (corpus was previously undefined at this point)
corpus = [build_ngrams(document) for document in data['reviews']]

In [11]:
corpus[:3]

[[('i', 'wanted'),
  ('wanted', 'love'),
  ('love', 'i'),
  ('i', 'even'),
  ('even', 'prepared'),
  ('prepared', 'somewhat'),
  ('somewhat', 'like'),
  ('like', 'cheap'),
  ('cheap', 'circle'),
  ('circle', 'k'),
  ('k', 'cappuccino'),
  ('cappuccino', 'unfortunately'),
  ('unfortunately', 'product'),
  ('product', 'really'),
  ('really', 'greasy'),
  ('greasy', 'you'),
  ('you', 'actually'),
  ('actually', 'see'),
  ('see', 'grease'),
  ('grease', 'cup'),
  ('cup', 'it'),
  ('it', '80'),
  ('80', 'calories'),
  ('calories', 'per'),
  ('per', 'serving'),
  ('serving', 'taste'),
  ('taste', 'really'),
  ('really', 'really'),
  ('really', 'powder'),
  ('powder', 'tasting'),
  ('tasting', 'like'),
  ('like', 'powdered'),
  ('powdered', 'milk'),
  ('milk', 'i'),
  ('i', 'expecting'),
  ('expecting', 'starbucks'),
  ('starbucks', 'cap'),
  ('cap', 'k'),
  ('k', 'cup'),
  ('cup', 'i'),
  ('i', 'expecting'),
  ('expecting', 'little'),
  ('little', 'br'),
  ('br', 'br'),
  ('br', 'i'),
  ('i'

In [12]:
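# Each document is already a list of bigram tuples, so the analyzer passes it through unchanged
count_vect = CountVectorizer(analyzer=lambda x:x)
x = count_vect.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; ordering vocabulary_ by column index gives the same list
feature_names = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)
data_ngram = pd.DataFrame(x,columns=feature_names)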

In [13]:
data_ngram.head()

Unnamed: 0,"(0, 42)","(00, i)","(00, k)","(0g, fiber)","(0g, protein)","(1, 10)","(1, 2)","(1, 20)","(1, 3)","(1, 30)",...,"(yummy, must)","(yummy, perfect)","(yummy, real)","(yummy, strong)","(yummy, suitable)","(yummy, they)","(yummy, though)","(yummy, treat)","(yummy, you)","(yup, exactly)"
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
data_ngram.shape

(542, 10789)

## Question 3 ##

Use Logistic Regression to classify reviews as positive or negative. Do this for both matrices.
* Fit a Logistic Regression model on the training data
* Apply the model on the test data and calculate the following evaluation metrics: accuracy, precision, recall, F1 score
* Optional: Visualize the confusion matrix for both models
* Compare the metrics of the models trained on the two matrices

Recommendation: Create a function to calculate the metrics, since you'll be doing this multiple times.
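
One way to follow this recommendation is sketched below. The helper name `evaluate`, the choice of 'positive' as the positive class, and the toy documents are all assumptions made for illustration, not part of the original exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred, pos_label='positive'):
    """Return accuracy, precision, recall and F1, treating pos_label as the positive class."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, pos_label=pos_label),
        'recall': recall_score(y_true, y_pred, pos_label=pos_label),
        'f1': f1_score(y_true, y_pred, pos_label=pos_label),
    }

# Hand-made labels to show what the helper returns
demo_true = ['positive', 'positive', 'negative', 'negative']
demo_pred = ['positive', 'negative', 'negative', 'negative']
demo_scores = evaluate(demo_true, demo_pred)
print(demo_scores)

# The same helper applied to a Logistic Regression fit on a tiny hypothetical corpus
train_docs = ["great coffee great taste", "love this cappuccino",
              "tastes like powdered milk", "awful greasy drink"]
train_labels = ['positive', 'positive', 'negative', 'negative']
test_docs = ["great cappuccino", "greasy powdered drink"]
test_labels = ['positive', 'negative']

vec = CountVectorizer()
lr = LogisticRegression(solver='lbfgs')
lr.fit(vec.fit_transform(train_docs), train_labels)
lr_scores = evaluate(test_labels, lr.predict(vec.transform(test_docs)))
print(lr_scores)
```

Because the labels are strings, `pos_label` has to be passed explicitly; scikit-learn's default of `pos_label=1` would raise an error here.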

## Question 4 ##

Let's try using another machine learning technique to classify these reviews as positive or negative. Repeat the exercise from the previous question, except this time, use Naive Bayes instead of Logistic Regression.

For count data, use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB). For binary data, use [Bernoulli Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB).

Compare the results of both the Logistic Regression and Naive Bayes models.
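
A minimal sketch of the pairing described above, with count features going to Multinomial Naive Bayes and binary features to Bernoulli Naive Bayes. The toy documents and variable names are hypothetical; on the real data the fitted vectorizers and train/test splits from Question 2 would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Tiny hypothetical corpus standing in for the training and test reviews
train_docs = ["great coffee great taste", "love this cappuccino",
              "tastes like powdered milk", "awful greasy drink"]
train_labels = ['positive', 'positive', 'negative', 'negative']
test_docs = ["great cappuccino", "greasy powdered drink"]

# Count features -> Multinomial Naive Bayes
count_vec = CountVectorizer()
mnb = MultinomialNB()
mnb.fit(count_vec.fit_transform(train_docs), train_labels)
mnb_pred = mnb.predict(count_vec.transform(test_docs))

# Binary unigram + bigram features -> Bernoulli Naive Bayes
binary_vec = CountVectorizer(ngram_range=(1, 2), binary=True)
bnb = BernoulliNB()
bnb.fit(binary_vec.fit_transform(train_docs), train_labels)
bnb_pred = bnb.predict(binary_vec.transform(test_docs))

print(mnb_pred)
print(bnb_pred)
```
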

## Question 5 ##

Up to this point, we've been using Count Vectorizer to create document-term matrices to input into the models. For at least one of the four models you've created so far, use TF-IDF Vectorizer instead of Count Vectorizer, and see if it improves the results.
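
Swapping in the TF-IDF Vectorizer is mostly a drop-in change, as the sketch below suggests. It pairs TF-IDF with Logistic Regression on hypothetical toy data; which of the four models to revisit is left open by the exercise, so this pairing is only one possible choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical corpus standing in for the training and test reviews
train_docs = ["great coffee great taste", "love this cappuccino",
              "tastes like powdered milk", "awful greasy drink"]
train_labels = ['positive', 'positive', 'negative', 'negative']
test_docs = ["great cappuccino", "greasy powdered drink"]

# TF-IDF down-weights terms that appear in many documents; rows are L2-normalized by default
tfidf = TfidfVectorizer(stop_words='english')
X_train_tfidf = tfidf.fit_transform(train_docs)
X_test_tfidf = tfidf.transform(test_docs)   # reuse the vocabulary and IDF weights from training

model = LogisticRegression(solver='lbfgs')
model.fit(X_train_tfidf, train_labels)
tfidf_pred = model.predict(X_test_tfidf)
print(tfidf_pred)
```
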

Out of all of the models you've created, which model do you think best classifies positive and negative cappuccino cup reviews?