# Machine Learning and NLP Exercises

# Introduction

We will be using the same review data set from Kaggle for this exercise. The product we'll focus on this time is a cappuccino cup. The goal this week is not only to preprocess the data, but also to classify reviews as positive or negative based on the review text.

The following code will help you load in the data.


In [1]:
import nltk
import pandas as pd
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
data = pd.read_csv('/Users/Cassandra/Downloads/coffee.csv')
data.head()

Unnamed: 0,user_id,stars,reviews
0,A2XP9IN4JOMROD,1,I wanted to love this. I was even prepared for...
1,A2TS09JCXNV1VD,5,Grove Square Cappuccino Cups were excellent. T...
2,AJ3L5J7GN09SV,2,I bought the Grove Square hazelnut cappuccino ...
3,A3CZD34ZTUJME7,1,"I love my Keurig, and I love most of the Keuri..."
4,AWKN396SHAQGP,1,It's a powdered drink. No filter in k-cup.<br ...


# Question 1 

- Determine how many reviews there are in total.


Use the preprocessing code below to clean the reviews data before moving on to modeling.


In [3]:
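# Text preprocessing steps: drop tokens that contain digits, lowercase the text,
# and replace punctuation with spaces
import re
import string

# Replace every token that contains a digit (e.g. "2", "k9") with a space
alphanumeric = lambda x: re.sub(r"\w*\d\w*", ' ', x)
# Lowercase the text, then replace each punctuation character with a space
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

data['reviews'] = data.reviews.map(alphanumeric).map(punc_lower)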
data.head()

Unnamed: 0,user_id,stars,reviews
0,A2XP9IN4JOMROD,1,i wanted to love this i was even prepared for...
1,A2TS09JCXNV1VD,5,grove square cappuccino cups were excellent t...
2,AJ3L5J7GN09SV,2,i bought the grove square hazelnut cappuccino ...
3,A3CZD34ZTUJME7,1,i love my keurig and i love most of the keuri...
4,AWKN396SHAQGP,1,it s a powdered drink no filter in k cup br ...
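
To see what the two cleaning steps do, here is a small sketch that applies the same regular expressions to a single made-up review. The sample string and the `clean` helper are illustrative and not part of the data set.

```python
import re
import string

def clean(text):
    """Apply the notebook's two cleaning steps to a single string."""
    # Step 1: replace every token that contains a digit with a space
    text = re.sub(r"\w*\d\w*", " ", text)
    # Step 2: lowercase, then replace each punctuation character with a space
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())

sample = "I bought 2 K-cups for $9.99!"  # made-up example review
tokens = clean(sample).split()
print(tokens)  # ['i', 'bought', 'k', 'cups', 'for']
```

Note that the hyphen in "K-cups" becomes a space, which is why the cleaned reviews above show "k cup". With its default settings, `TfidfVectorizer` later ignores one-character tokens such as "i" and "k".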


In [4]:
len(data)

542
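
Besides the total, the number of reviews per star rating is worth checking before modeling, since a skewed distribution changes how an accuracy score should be read. A minimal sketch on a made-up frame (the `toy` data is illustrative; with the real data the same calls would be made on `data`):

```python
import pandas as pd

# Made-up stand-in for the review data
toy = pd.DataFrame({
    "stars": [5, 5, 5, 1, 4, 5, 2, 1],
    "reviews": ["placeholder text"] * 8,
})

total = len(toy)                                     # number of rows, i.e. reviews
per_star = toy["stars"].value_counts().sort_index()  # reviews per rating, ordered 1..5
share = per_star / total                             # fraction of reviews per rating

print(total)
print(per_star)
print(share)
```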

# Question 2: Classification *(20% testing, 80% training)*

Steps for classification:

### Step 1: Prepare the data (identify the feature and label)

In [5]:
X = data["reviews"]
y = data["stars"]
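
The introduction frames the task as positive versus negative, while `y` above keeps the original 1 to 5 stars. If a two-class target is wanted, one possible mapping is sketched below. The thresholds (4 and 5 as positive, 1 and 2 as negative, 3 dropped as neutral) are an assumption rather than something the exercise specifies, and the toy frame is made up.

```python
import pandas as pd

# Made-up stand-in for the review data
toy = pd.DataFrame({
    "stars": [1, 2, 3, 4, 5],
    "reviews": ["awful", "weak", "okay", "good", "excellent"],
})

# Assumed rule: drop neutral 3-star reviews, then label 4-5 as positive
binary = toy[toy["stars"] != 3].copy()
binary["sentiment"] = binary["stars"].apply(
    lambda s: "positive" if s >= 4 else "negative"
)

X_bin = binary["reviews"]
y_bin = binary["sentiment"]
print(y_bin.tolist())  # ['negative', 'negative', 'positive', 'positive']
```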

### Step 2: Vectorize the feature

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = "word")

X = vectorizer.fit_transform(X)
print(X.shape)

(542, 2320)
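
The shape above reads as 542 reviews by 2320 distinct words. To make the TF-IDF weighting itself more concrete, here is a small sketch on a made-up three-review corpus (the corpus and the `toy_` names are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["great coffee", "terrible coffee", "great taste"]  # made-up reviews

toy_vectorizer = TfidfVectorizer(analyzer="word")
toy_X = toy_vectorizer.fit_transform(corpus)

print(toy_X.shape)                          # (3, 4): 3 documents, 4 distinct words
print(sorted(toy_vectorizer.vocabulary_))   # ['coffee', 'great', 'taste', 'terrible']

# Look at the second review, "terrible coffee"
row = toy_X[1].toarray().ravel()
w_terrible = row[toy_vectorizer.vocabulary_["terrible"]]
w_coffee = row[toy_vectorizer.vocabulary_["coffee"]]

# "terrible" occurs in one review, "coffee" in two, so "terrible" gets the larger weight
print(w_terrible > w_coffee)  # True
```

Words that appear in fewer documents receive a higher weight, so rare but distinctive words count for more than common ones.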


### Step 3: Split the data into training and testing sets

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)

(433, 2320)
(109, 2320)
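
Step 2 fits the vectorizer on all 542 reviews before the split, so the vocabulary and IDF weights also reflect the test reviews. A common alternative is to split the raw text first and fit the vectorizer on the training portion only. The sketch below shows this with a scikit-learn `Pipeline` on made-up data; the toy texts and labels, and the use of `stratify`, are illustrative choices rather than part of the exercise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up reviews and star ratings
texts = ["great coffee", "love this cup", "tasty and smooth", "excellent flavor",
         "really good", "terrible taste", "awful powder", "too sweet and weak",
         "bad aftertaste", "not good at all"]
labels = [5, 5, 5, 5, 5, 1, 1, 1, 1, 1]

# Split the raw text first; stratify keeps the class proportions in both parts
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training text only
model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
model.fit(text_train, y_train)

# At prediction time the test text is only transformed, never used for fitting
predictions = model.predict(text_test)
print(len(text_train), len(text_test))  # 8 2
```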


### Step 4: Identify the model/classifier to be used. Feed the training data into the model

### - Decision Tree

In [8]:
from sklearn.tree import DecisionTreeClassifier

DT_Classifier = DecisionTreeClassifier()
DT_Classifier.fit(X_train, y_train)
DT_Classifier.predict(X_test)

array([1, 5, 5, 5, 5, 5, 5, 5, 4, 5, 4, 2, 5, 3, 5, 5, 5, 1, 5, 1, 2, 5,
       5, 5, 4, 5, 5, 4, 5, 5, 5, 2, 5, 3, 5, 5, 5, 3, 5, 1, 5, 3, 5, 5,
       5, 5, 5, 2, 5, 5, 5, 4, 1, 5, 5, 5, 5, 1, 5, 5, 4, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 1, 5, 5, 5, 4, 4, 5, 5, 1, 2, 5, 5, 1, 5, 3, 5, 1, 1,
       1, 5, 5, 1, 5, 5, 5, 1, 5, 5, 5, 1, 5, 5, 5, 3, 4, 5, 5, 5, 3])

### - Random Forest

In [9]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier()

rf_classifier.fit(X_train, y_train)

rf_classifier.predict(X_test)

array([5, 5, 5, 5, 5, 5, 5, 5, 1, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 5, 5, 5])

# Question 3 
Generate the accuracy scores for Decision Tree and Random Forest.  

In [10]:
from sklearn.metrics import accuracy_score

y_pred_dt = DT_Classifier.predict(X_test)
print("Decision Tree Accuracy: {}".format(accuracy_score(y_test, y_pred_dt)))

y_pred_rf = rf_classifier.predict(X_test)
print("Random Forest Accuracy: {}".format(accuracy_score(y_test, y_pred_rf)))

Decision Tree Accuracy: 0.5045871559633027
Random Forest Accuracy: 0.5963302752293578
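
Accuracy alone hides which ratings get confused with each other, and the Random Forest predictions above are almost all 5. The `confusion_matrix` function imported in the first cell can help here, together with a simple majority-class baseline. A minimal sketch on made-up labels (`y_true_toy` and `y_pred_toy` are illustrative; in the notebook `y_test` and `y_pred_rf` would be passed instead):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_toy = np.array([5, 5, 5, 1, 1, 3])  # made-up true ratings
y_pred_toy = np.array([5, 5, 5, 5, 1, 5])  # made-up predictions
labels = [1, 2, 3, 4, 5]

# Rows are true ratings, columns are predicted ratings, both in `labels` order
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)

# Baseline: always predict the most frequent true rating
values, counts = np.unique(y_true_toy, return_counts=True)
majority = values[counts.argmax()]
baseline = accuracy_score(y_true_toy, np.full_like(y_true_toy, majority))

print(accuracy_score(y_true_toy, y_pred_toy), baseline)
```

If a model's accuracy is close to the baseline, it has learned little beyond the most common rating. The matrix can also be drawn with the seaborn import from the first cell, for example `sns.heatmap(cm, annot=True, xticklabels=labels, yticklabels=labels)`.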


# Question 4
Predict the rating of this review using the Decision Tree and Random Forest classifiers:

__"I dislike this coffee, terrible taste and very greasy."__

In [11]:
test_sentence = "I dislike this coffee, terrible taste and very greasy"

# Apply the same cleaning that was used on the training reviews
test_sentence = re.sub(r"\w*\d\w*", " ", test_sentence)
test_sentence = re.sub("[%s]" % re.escape(string.punctuation), " ", test_sentence.lower())
# Reuse the already-fitted vectorizer (transform, not fit_transform)
test_sentence = vectorizer.transform([test_sentence])
rf_classifier.predict(test_sentence)

array([5])
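
The cell above only queries the Random Forest, while the question asks for both models. One way to cover both is a small helper that cleans the text once and asks each fitted classifier in turn. The sketch below is self-contained, so it trains toy models on made-up reviews; `predict_rating` is a hypothetical helper name, and in the notebook the existing `vectorizer`, `DT_Classifier`, and `rf_classifier` would be passed instead.

```python
import re
import string

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

def clean(text):
    """Same cleaning as in the preprocessing cell."""
    text = re.sub(r"\w*\d\w*", " ", text)
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())

def predict_rating(text, vectorizer, models):
    """Hypothetical helper: return one predicted rating per named model."""
    features = vectorizer.transform([clean(text)])
    return {name: int(model.predict(features)[0]) for name, model in models.items()}

# Made-up training data, only to keep the sketch runnable on its own
toy_reviews = ["great coffee love it", "terrible taste awful",
               "excellent flavor", "greasy and bad"]
toy_stars = [5, 1, 5, 1]

toy_vectorizer = TfidfVectorizer(analyzer="word")
toy_X = toy_vectorizer.fit_transform([clean(r) for r in toy_reviews])
toy_models = {
    "decision_tree": DecisionTreeClassifier(random_state=0).fit(toy_X, toy_stars),
    "random_forest": RandomForestClassifier(random_state=0).fit(toy_X, toy_stars),
}

result = predict_rating(
    "I dislike this coffee, terrible taste and very greasy.",
    toy_vectorizer,
    toy_models,
)
print(result)
```

Words that were not seen when the vectorizer was fitted are ignored by `transform`, which may be one reason a clearly negative sentence can still receive a high rating, alongside the heavy share of 5-star predictions seen earlier.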