# Exercise 3: Text Classification (Logistic Regression, Naive Bayes, K-NN)

Classify Amazon reviews of musical instruments using the following algorithms: <br>
1) Logistic Regression <br>
2) Naive Bayes <br>
3) K-Nearest Neighbour <br>
(Note: Reviews with an overall score <= 4 are treated as negative, and those with an overall score > 4 are treated as positive.)

The dataset can be obtained from http://jmcauley.ucsd.edu/data/amazon/ (direct link: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Musical_Instruments_5.json.gz). <br>
Citation: R. He, J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016.

In [11]:
import re
import warnings

import pandas as pd
# word_tokenize and WordNetLemmatizer need NLTK data, e.g. nltk.download('punkt')
# and nltk.download('wordnet'); newer NLTK releases may ask for 'punkt_tab' instead.
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [2]:
review_data = pd.read_json('data_ch3/reviews_Musical_Instruments_5.json', lines=True)
review_data.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1384719342 | [0, 0] | 5 | Not much to write about here, but it does exac... | 02 28, 2014 | A2IBPI20UZIR0U | cassandra tu "Yeah, well, that's just like, u... | good | 1393545600 |
| 1 | 1384719342 | [13, 14] | 5 | The product does exactly as it should and is q... | 03 16, 2013 | A14VAT5EAX3D9S | Jake | Jake | 1363392000 |
| 2 | 1384719342 | [1, 1] | 5 | The primary job of this device is to block the... | 08 28, 2013 | A195EZSQDW3E21 | Rick Bennette "Rick Bennette" | It Does The Job Well | 1377648000 |
| 3 | 1384719342 | [0, 0] | 5 | Nice windscreen protects my MXL mic and preven... | 02 14, 2014 | A2C00NNG1ZQQG2 | RustyBill "Sunday Rocker" | GOOD WINDSCREEN FOR THE MONEY | 1392336000 |
| 4 | 1384719342 | [0, 0] | 5 | This pop filter is great. It looks and perform... | 02 21, 2014 | A94QU4C90B1AX | SEAN MASLANKA | No more pops when I record my vocals. | 1392940800 |


In [3]:
review_data['overall'].value_counts()

5    6938
4    2084
3     772
2     250
1     217
Name: overall, dtype: int64

In [5]:
lemmatizer = WordNetLemmatizer()

In [8]:
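def clean_text(text):
    # Replace punctuation and underscores with spaces, then tokenize, lowercase, and lemmatize
    tokens = word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(text)))
    return ' '.join(lemmatizer.lemmatize(token.lower()) for token in tokens)

review_data['cleaned_review_text'] = review_data['reviewText'].apply(clean_text)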
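
To see what the regular expression in the cleaning step does on its own, here is a minimal sketch that leaves out tokenization and lemmatization, so it needs only the standard library. The helper name `strip_punctuation` and the sample sentence are illustrative assumptions, not part of the exercise.

```python
import re

def strip_punctuation(text):
    """Replace each run of punctuation or underscores with one space, then lowercase."""
    # [^\s\w] matches anything that is neither whitespace nor a word character;
    # the underscore is listed separately because \w treats it as a word character.
    return re.sub(r'([^\s\w]|_)+', ' ', str(text)).lower()

# Hypothetical sample sentence, not taken from the dataset
sample = "This pop_filter is great. It looks & performs like a studio filter!"
cleaned = strip_punctuation(sample)
print(cleaned)
print(cleaned.split())
```

The substitution can leave repeated spaces behind; in the notebook cell these disappear because the tokens are re-joined with single spaces.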

In [10]:
review_data[['cleaned_review_text', 'reviewText', 'overall']].head()

| | cleaned_review_text | reviewText | overall |
|---|---|---|---|
| 0 | not much to write about here but it doe exactl... | Not much to write about here, but it does exac... | 5 |
| 1 | the product doe exactly a it should and is qui... | The product does exactly as it should and is q... | 5 |
| 2 | the primary job of this device is to block the... | The primary job of this device is to block the... | 5 |
| 3 | nice windscreen protects my mxl mic and preven... | Nice windscreen protects my MXL mic and preven... | 5 |
| 4 | this pop filter is great it look and performs ... | This pop filter is great. It looks and perform... | 5 |


In [14]:
tfidf_model = TfidfVectorizer(max_features=500)
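tfidf_matrix = tfidf_model.fit_transform(review_data['cleaned_review_text'])
# vocabulary_ assigns column indices in alphabetical order of the terms,
# so sorting the terms recovers the column labels
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=sorted(tfidf_model.vocabulary_))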
tfidf_df.head()

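| | 10 | 100 | 12 | 20 | 34 | able | about | accurate | acoustic | actually | ... | won | work | worked | worth | would | wrong | year | yet | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.159684 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.134327 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.085436 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.067074 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.115312 | 0.0 | 0.0 | 0.0 | 0.07988 | 0.111989 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.339573 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.303608 | 0.0 |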
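
For intuition about the weights in the table above, the following sketch computes TF-IDF by hand for a three-document toy corpus. It assumes the scheme that scikit-learn documents as the `TfidfVectorizer` default (raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row). The function name `tfidf_rows` and the corpus are illustrative, whitespace splitting stands in for the vectorizer's tokenizer, and this is not the library's actual implementation.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed default scheme)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# Hypothetical toy corpus
corpus = ["great pop filter", "great windscreen", "pop filter work great"]
vocabulary, rows = tfidf_rows(corpus)
print(vocabulary)
for row in rows:
    print([round(w, 3) for w in row])
```

A term that appears in every document, such as `great` here, receives the smallest IDF, which is why common words get lower weights than rarer ones within the same review.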


In [32]:
# Treat reviews with an overall score <= 4 as negative (encoded as 0)
# and reviews with an overall score > 4 as positive (encoded as 1).

review_data['target'] = (review_data['overall'] > 4).astype(int)
review_data['target'].value_counts()

1    6938
0    3323
Name: target, dtype: int64
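
The counts above show that the two classes are imbalanced, which matters when reading accuracy figures later. As a small sketch using only the counts printed above, the share that a trivial always-positive classifier would get right can be computed as follows (the variable names are illustrative).

```python
# Class counts copied from the value_counts() output above
class_counts = {1: 6938, 0: 3323}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(f"total reviews: {total}")
print(f"majority class: {majority_class}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A classifier is only informative on this labelling if it does better than this baseline of roughly 68%.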

## Logistic Regression

In [23]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(tfidf_df,review_data['target'])
predicted_labels = logreg.predict(tfidf_df)
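
The cell above fits the model and predicts on the same rows, so the resulting figures describe training performance rather than performance on unseen reviews. One common alternative is to hold out part of the data. The sketch below shows the idea with the standard library only; the helper `holdout_indices`, the 80/20 ratio, and the seed are assumptions for illustration. In scikit-learn, `sklearn.model_selection.train_test_split` serves the same purpose.

```python
import random

def holdout_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into disjoint train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

# 10261 is the number of reviews in this dataset
train_idx, test_idx = holdout_indices(10261)
print(len(train_idx), len(test_idx))
```

With these indices, one would fit on `tfidf_df.iloc[train_idx]` and evaluate on `tfidf_df.iloc[test_idx]`.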

In [31]:
logreg.predict_proba(tfidf_df)[:,1]

array([0.57128804, 0.68592538, 0.56024427, ..., 0.65982122, 0.55011385,
       0.21210023])
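
The array above holds the estimated probability of the positive class for each review. For a binary model, the predicted label corresponds to thresholding that probability at 0.5. A minimal sketch using the six probabilities visible in the truncated output (the `threshold` name is illustrative, and other cut-offs can be chosen to trade precision against recall):

```python
# Positive-class probabilities copied from the truncated output above
probabilities = [0.57128804, 0.68592538, 0.56024427, 0.65982122, 0.55011385, 0.21210023]

threshold = 0.5  # assumed default decision boundary for a binary classifier
labels = [1 if p > threshold else 0 for p in probabilities]
print(labels)
```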

In [25]:
review_data['predicted_labels'] = predicted_labels

In [26]:
pd.crosstab(review_data['target'], review_data['predicted_labels'])

| target | predicted_labels = 0 | predicted_labels = 1 |
|---|---|---|
| 0 | 1543 | 1780 |
| 1 | 626 | 6312 |
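
The cross-tabulation above can be summarized with the usual metrics. The sketch below derives accuracy, precision, and recall for the positive class from the four counts; the function name `summarize_confusion` is an illustrative assumption, and the counts are copied from the table above (rows are true labels, columns are predictions).

```python
def summarize_confusion(tn, fp, fn, tp):
    """Compute accuracy, plus precision and recall for the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Logistic regression counts, measured on the training rows
metrics = summarize_confusion(tn=1543, fp=1780, fn=626, tp=6312)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

More than half of the negative reviews (1780 of 3323) are predicted as positive, which the accuracy figure alone would hide.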


## Naive Bayes Classifier

In [33]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(tfidf_df,review_data['target'])
predicted_labels = nb.predict(tfidf_df)

In [34]:
nb.predict_proba(tfidf_df)[:,1]

array([9.97730158e-01, 3.63599675e-09, 9.45692105e-07, ...,
       2.46001047e-02, 3.43660991e-08, 1.72767906e-27])
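
The probabilities above are far more extreme than the logistic regression ones. A likely reason is that naive Bayes multiplies one likelihood per feature, so with 500 features even small per-feature differences compound. The sketch below illustrates that effect with made-up numbers; `posterior_positive` and the per-feature log-likelihood gap of 0.1 are assumptions for illustration, not values taken from the fitted model.

```python
import math

def posterior_positive(log_joint_neg, log_joint_pos):
    """Turn two class log-joint scores into P(positive) via a stable softmax."""
    top = max(log_joint_neg, log_joint_pos)
    exp_neg = math.exp(log_joint_neg - top)
    exp_pos = math.exp(log_joint_pos - top)
    return exp_pos / (exp_neg + exp_pos)

n_features = 500
per_feature_gap = 0.1  # hypothetical average log-likelihood advantage per feature

# Summing log-likelihoods is equivalent to multiplying likelihoods
total_gap = n_features * per_feature_gap
favours_positive = posterior_positive(0.0, total_gap)
favours_negative = posterior_positive(total_gap, 0.0)
print(favours_positive, favours_negative)
```

Probabilities this close to 0 or 1 are therefore better read as a ranking signal than as calibrated confidence.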

In [35]:
review_data['predicted_labels_nb'] = predicted_labels

In [36]:
pd.crosstab(review_data['target'], review_data['predicted_labels_nb'])

| target | predicted_labels_nb = 0 | predicted_labels_nb = 1 |
|---|---|---|
| 0 | 2333 | 990 |
| 1 | 2380 | 4558 |


## K-Nearest Neighbour

In [38]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(tfidf_df,review_data['target'])
review_data['predicted_labels_knn'] = knn.predict(tfidf_df)
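
For reference, k-nearest neighbours labels a point by majority vote among the k closest training points. The following is a minimal pure-Python sketch under an assumed Euclidean distance, with a hypothetical helper name (`knn_predict`) and toy two-dimensional data; it is not how scikit-learn implements the search.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters with labels 0 and 1
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

near_first_cluster = knn_predict(points, labels, (0.15, 0.15))
training_point_again = knn_predict(points, labels, (1.0, 1.0))
print(near_first_cluster, training_point_again)
```

When predicting on the training rows themselves, as in the cell above, each point is at distance zero from itself and so contributes one of its own k votes, which tends to flatter the result.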

In [39]:
pd.crosstab(review_data['target'], review_data['predicted_labels_knn'])

| target | predicted_labels_knn = 0 | predicted_labels_knn = 1 |
|---|---|---|
| 0 | 2594 | 729 |
| 1 | 375 | 6563 |
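
To compare the three classifiers side by side, the sketch below recomputes accuracy from the three cross-tabulations in this notebook. The dictionary layout and names are illustrative; the counts are copied from the tables, and all of them were measured on the training rows.

```python
# (tn, fp, fn, tp) copied from the three cross-tabulations above
confusions = {
    "logistic regression": (1543, 1780, 626, 6312),
    "naive Bayes": (2333, 990, 2380, 4558),
    "k-NN (k=3)": (2594, 729, 375, 6563),
}

accuracies = {}
for model, (tn, fp, fn, tp) in confusions.items():
    accuracies[model] = (tn + tp) / (tn + fp + fn + tp)
    print(f"{model}: {accuracies[model]:.3f}")
```

Because none of these figures come from held-out data, the ordering may not carry over to unseen reviews.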

