# Amazon Fine Food Reviews SVC


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review

## Importing libraries and loading the dataset :
* Cleaning and handling deduplication of data is already performed.

In [1]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from collections import Counter
from sklearn import datasets, neighbors
import plotly



In [2]:
conn = sqlite3.connect('final.sqlite')

In [3]:
data = pd.read_sql_query("""
SELECT *
FROM Reviews""", conn)
data.head(2)

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
0,138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,b'witti littl book make son laugh loud recit c...
1,138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,positive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",b'grew read sendak book watch realli rosi movi...


In [4]:
data.shape

(364171, 12)

In [5]:
# Sampling the data :
sample_data = data.sample(n=50000)
sample_data['Score'].value_counts()

positive    42199
negative     7801
Name: Score, dtype: int64

## Time Based Splitting :

In [6]:
# Sorting the sample data using Time column
sorted_sample = sample_data.sort_values(by='Time')
sorted_sample.head(2)

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
30,138683,150501,0006641040,AJ46FKXOVC7NR,Nicholas A Mesiano,2,2,positive,940809600,This whole series is great way to spend time w...,I can remember seeing the show when it aired o...,b'rememb see show air televis year ago child s...
330,346055,374359,B00004CI84,A344SMIA5JECGM,Vincent P. Ross,1,2,positive,944438400,A modern day fairy tale,"A twist of rumplestiskin captured on film, sta...",b'twist rumplestiskin captur film star michael...


In [1]:
# Getting the Labels i.e the Score out of the dataframe.
y = sorted_sample['Score']
# Removing the Labels, i.e. the Score column, from the dataframe, as they must not be part of the features used to train the SVC.
sorted_sample = sorted_sample.drop(columns='Score')
sorted_sample.head(2)


In [8]:
# Splitting into Train and Test sets (oldest 40,000 reviews for training, newest 10,000 for testing) -
x_train = sorted_sample[0:40000]
y_train = y[0:40000]
x_test = sorted_sample[40000:50000]
y_test = y[40000:50000]

print ("Training Set - ", x_train.shape)
print ("Test Set - ", x_test.shape)

Training Set -  (40000, 11)
Test Set -  (10000, 11)
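The time-based split above can be sketched on a toy frame. This is a minimal illustration, not the notebook's own code: the helper name `time_based_split`, the `train_fraction` parameter, and the toy values are all assumptions made for the example.

```python
import pandas as pd

def time_based_split(df, time_col, train_fraction=0.8):
    """Sort by time and cut once, so no test row is older than any training row."""
    ordered = df.sort_values(by=time_col, kind="mergesort")  # stable sort
    cut = int(len(ordered) * train_fraction)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy frame standing in for the reviews table (values are made up).
toy = pd.DataFrame({
    "Time": [50, 10, 40, 20, 30],
    "Score": ["positive", "negative", "positive", "positive", "negative"],
})
toy_train, toy_test = time_based_split(toy, "Time")
print(toy_train["Time"].tolist(), toy_test["Time"].tolist())
```

Cutting after sorting means the model is always evaluated on reviews written after the ones it was trained on, which is closer to how it would be used in practice than a random split.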


## Bag of Words :

In [9]:
# Generating bag of words features.
count_vect = CountVectorizer()
bow_train = count_vect.fit_transform(x_train['CleanedText'].values)
bow_train.shape

(40000, 24930)

In [10]:
bow_test = count_vect.transform(x_test['CleanedText'].values)
bow_test.shape

(10000, 24930)
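The train and test matrices have the same number of columns because the vocabulary is learned from the training text only and then reused for the test text. A small sketch with made-up documents (the `toy_*` names are hypothetical) shows the effect, including what happens to words never seen during fitting:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good tea good price", "bad coffee"]
test_docs = ["good coffee", "stale biscuit"]

toy_vect = CountVectorizer()
toy_bow_train = toy_vect.fit_transform(train_docs)  # learns the vocabulary
toy_bow_test = toy_vect.transform(test_docs)        # reuses it, unseen words are ignored

print(sorted(toy_vect.vocabulary_))
print(toy_bow_train.shape, toy_bow_test.shape)
print(toy_bow_test.toarray())
```

The second test document contains only unseen words, so its row is all zeros.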

#### Finding optimal gamma and C using grid search and randomized search :

In [11]:
# Importing SVC and Gridsearch and Randomsearch
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

In [None]:
clf = SVC()
param = [{'C': [10**-3, 10**-2, 10**-1, 10**0, 10**1, 10**2, 10**3], 'gamma':[0.01, 0.1, 1, 10]}]
param

[{'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.1, 1, 10]}]

In [None]:
# Using grid search to find optimal C and gamma
model1 = GridSearchCV(clf, param)
model1.fit(bow_train, y_train)

In [None]:
# Getting the best model
model1.best_estimator_

In [None]:
# Evaluating the model
model1.score(bow_test, y_test)

**Grid Search Conclusions for Bag of Words**
* GridSearch gave the value of C as 10 and gamma as 0.01.
* Accuracy is 87.1 %
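The selected values can also be read directly from the fitted search object through `best_params_` and `best_score_`. The sketch below runs the same kind of search on a small synthetic dataset; the data, the reduced grid, and the `cv=3` setting are assumptions chosen to keep the example fast, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the review features (not the real data).
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
toy_search = GridSearchCV(SVC(kernel="rbf"), toy_grid, cv=3, scoring="accuracy")
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)            # chosen C and gamma
print(round(toy_search.best_score_, 3))   # mean cross-validated accuracy
print(len(toy_search.cv_results_["params"]))  # number of combinations tried
```

Note that `best_score_` is the cross-validated score on the training folds, while `score(X_test, y_test)` reports accuracy on held-out data; the two can differ.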

**Using RandomizedSearchCV now**

In [None]:
# Import randint library to generate distributions.
from scipy.stats import randint as sp_randint

In [None]:
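# Note: sp_randint draws integers only, so fractional bounds such as 10**-3 and 10**-1
# are not sampled as continuous values; the gamma = 0 reported in later sections
# is consistent with the lower bound being truncated to 0.
param_2 = {'C': sp_randint(10**-3, 10**3), 'gamma': sp_randint(10**-1, 10**1)}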
param_2
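Because C and gamma span several orders of magnitude, a continuous log-uniform distribution is a common alternative to an integer distribution for randomized search. The following is a sketch under assumptions, not the configuration used in this notebook: it relies on `scipy.stats.loguniform` (available in SciPy 1.4 and later), the ranges are chosen to mirror the grid in `param`, and the data is synthetic.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Hypothetical search space: strictly positive, continuous, log-scaled.
toy_param_dist = {
    "C": loguniform(1e-3, 1e3),
    "gamma": loguniform(1e-2, 1e1),
}

# A few draws to show that gamma is never 0 and is not restricted to integers.
gamma_draws = toy_param_dist["gamma"].rvs(size=5, random_state=0)
print(gamma_draws)

# Synthetic stand-in for the review features (not the real data).
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
toy_random_search = RandomizedSearchCV(
    SVC(kernel="rbf"), toy_param_dist, n_iter=5, cv=3, random_state=0
)
toy_random_search.fit(X_toy, y_toy)
print(toy_random_search.best_params_)
```

With this kind of distribution, small values such as 0.01 are as likely to be sampled as large ones such as 100, which is usually what is wanted for regularisation and kernel-width parameters.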

In [None]:
# Using randomized search to find optimal C and gamma
model2 = RandomizedSearchCV(clf, param_2)
model2.fit(bow_train, y_train)

In [None]:
# Getting the best model
model2.best_estimator_

In [None]:
# Evaluating the model
model2.score(bow_test, y_test)

**RandomizedSearchCV results for Bag of Words**
* C obtained is 712 and gamma obtained is 2.
* Accuracy of model = 81.05 %, about 6 percentage points lower than what was obtained using GridSearchCV

## TFIDF :

In [None]:
# Generating TFIDF features.
tfidf = TfidfVectorizer()
tf_train = tfidf.fit_transform(x_train['CleanedText'].values)
tf_train.shape

In [None]:
tf_test = tfidf.transform(x_test['CleanedText'].values)
tf_test.shape
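TF-IDF differs from raw counts in two ways that matter for an SVC: words that occur in many documents get a lower weight, and each row is scaled to unit length by default. A toy sketch with made-up documents (the `toy_tfidf` name and the documents are assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good tea", "good coffee", "good biscuit tea"]

toy_tfidf = TfidfVectorizer()          # default settings: smoothed idf, L2-normalised rows
weights = toy_tfidf.fit_transform(docs)
vocab = toy_tfidf.vocabulary_          # word -> column index

row0 = weights[0].toarray().ravel()    # weights for "good tea"
# "good" occurs in every document, "tea" in only two, so "tea" is weighted higher.
print(row0[vocab["good"]] < row0[vocab["tea"]])
print(np.linalg.norm(row0))
```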

#### Grid Search :

In [None]:
# Using grid search to find optimal C and gamma
model1 = GridSearchCV(clf, param)
model1.fit(tf_train, y_train)

In [None]:
# Getting best model
model1.best_estimator_

In [None]:
# Score on test data 
model1.score(tf_test, y_test)

**Grid Search Conclusions for TFIDF**
* GridSearch gave the value of C as 100 and gamma as 0.01.
* Accuracy is 87.1 %, identical to the BoW SVC using GridSearchCV

**Using RandomizedSearchCV now**

In [None]:
# Using randomized search to find optimal C and gamma
model2 = RandomizedSearchCV(clf, param_2)
model2.fit(tf_train, y_train)

In [None]:
# Getting the best model
model2.best_estimator_

In [None]:
# Evaluating the model
model2.score(tf_test, y_test)

**RandomizedSearchCV results for TFIDF**
* C obtained is 532 and gamma obtained is 0, which is unexpected.
* Accuracy of model = 81.05 %, about 6 percentage points lower than what was obtained using GridSearchCV


**The results obtained from BoW and TFIDF are identical: the same accuracy for GridSearchCV and the same accuracy for RandomizedSearchCV.**

## Word2Vec :
* We will train W2V on our train dataset.

In [None]:
# Helpers that strip HTML tags and punctuation from a review.
import re

HTML_TAG = re.compile(r'<.*?>')

def cleanhtml(sentence):
    """Return the sentence with HTML tags replaced by spaces."""
    # CleanedText shows up as bytes (b'...') in the output above, so decode it first.
    if isinstance(sentence, bytes):
        sentence = sentence.decode('utf-8')
    return HTML_TAG.sub(' ', sentence)

def cleanpunc(sentence):
    """Remove quote-like characters and turn separators into spaces."""
    # Inside a character class '|' is a literal character, so it is listed only once.
    cleaned = re.sub(r'[?!\'"#|]', '', sentence)
    cleaned = re.sub(r'[.,)(/|]', ' ', cleaned)
    return cleaned

In [None]:
# Tokenising the train set: strip HTML tags and punctuation, keep alphabetic words only.
import gensim

def tokenize_reviews(reviews):
    """Return one list of lower-cased alphabetic tokens per review."""
    list_of_sent = []
    for sent in reviews:
        filtered_sentence = []
        for w in cleanhtml(sent).split():
            for cleaned_words in cleanpunc(w).split():
                if cleaned_words.isalpha():
                    filtered_sentence.append(cleaned_words.lower())
        list_of_sent.append(filtered_sentence)
    return list_of_sent

list_of_sent_train = tokenize_reviews(x_train['CleanedText'].values)

In [None]:
# Doing the same for the test dataset.
list_of_sent_test = tokenize_reviews(x_test['CleanedText'].values)

In [None]:
# Training the word2vec model using the train dataset.
# Note: gensim 4.x renamed the `size` argument to `vector_size`.
w2v_model=gensim.models.Word2Vec(list_of_sent_train,min_count=5,size=200, workers=4)

### Avg-W2V :

In [None]:
sent_vectors_train = []  # the avg-w2v for each sentence/review in the train set
for sent in list_of_sent_train:  # for each review/sentence
    # The accumulator must have the same length as a word vector (200 here, not 20).
    sent_vec = np.zeros(w2v_model.wv.vector_size)
    cnt_words = 0  # num of words with a valid vector in the sentence/review
    for word in sent:  # for each word in a review/sentence
        try:
            sent_vec += w2v_model.wv[word]
            cnt_words += 1
        except KeyError:  # word was dropped by min_count, so it has no vector
            pass
    if cnt_words:  # avoid dividing by zero when no word has a vector
        sent_vec /= cnt_words
    sent_vectors_train.append(sent_vec)
    
print (len(sent_vectors_train))
print (len(sent_vectors_train[0]))

In [None]:
sent_vectors_test = []  # the avg-w2v for each sentence/review in the test set
for sent in list_of_sent_test:  # for each review/sentence
    # The accumulator must have the same length as a word vector (200 here, not 20).
    sent_vec = np.zeros(w2v_model.wv.vector_size)
    cnt_words = 0  # num of words with a valid vector in the sentence/review
    for word in sent:  # for each word in a review/sentence
        try:
            sent_vec += w2v_model.wv[word]
            cnt_words += 1
        except KeyError:  # word was dropped by min_count, so it has no vector
            pass
    if cnt_words:  # avoid dividing by zero when no word has a vector
        sent_vec /= cnt_words
    sent_vectors_test.append(sent_vec)
    
print (len(sent_vectors_test))
print (len(sent_vectors_test[0]))
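The averaging step can be checked in isolation with a tiny hand-made embedding table. This is a sketch only: the function name `average_vector` and the two-dimensional toy vectors are assumptions, and a plain dictionary stands in for the trained Word2Vec model.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 2-dimensional embeddings standing in for w2v_model.wv.
toy_vectors = {"good": np.array([1.0, 0.0]), "tea": np.array([0.0, 1.0])}

avg_known = average_vector(["good", "tea", "unknownword"], toy_vectors, dim=2)
avg_empty = average_vector(["unknownword"], toy_vectors, dim=2)
print(avg_known)  # unknown words are skipped, not counted in the mean
print(avg_empty)  # no known words: zeros rather than NaN
```

Returning zeros for reviews with no known words keeps NaN values out of the feature matrix, which `SVC.fit` would otherwise reject.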

#### GridSearch on obtained AVG-W2V data -

In [None]:
# Using grid search to find optimal C and gamma
model1 = GridSearchCV(clf, param)
model1.fit(sent_vectors_train, y_train)

In [None]:
# Getting best model
model1.best_estimator_

In [None]:
# Score on test data 
model1.score(sent_vectors_test, y_test)

**Grid Search Conclusions for AVG-W2V**
* GridSearch gave the value of C as 0.001 and gamma as 0.01.
* Accuracy is 81.05 %

**Using RandomizedSearchCV now**

In [None]:
# Using randomized search to find optimal C and gamma
model2 = RandomizedSearchCV(clf, param_2)
model2.fit(sent_vectors_train, y_train)

In [None]:
# Getting the best model
model2.best_estimator_

In [None]:
# Getting score on test data
model2.score(sent_vectors_test, y_test)

**RandomizedSearchCV results for AVG-W2V**
* C obtained is 506 and gamma obtained is 0.
* Accuracy of model = 81.05 %, the same as with GridSearchCV.

### TFIDF-W2V :

In [None]:
# We will first create tfidf w2v features.
tf_idf_vect = TfidfVectorizer()
final_tf_idf = tf_idf_vect.fit_transform(x_train['CleanedText'].values)
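# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out().
tfidf_feat = tf_idf_vect.get_feature_names() # tfidf words/col-names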

In [None]:
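tfidf_sent_vectors_train = []  # the tfidf-w2v for each sentence/review in the train set
vocab_index = tf_idf_vect.vocabulary_  # word -> column index in final_tf_idf
for row, sent in enumerate(list_of_sent_train):  # for each review/sentence
    # The accumulator must have the same length as a word vector (200 here, not 20).
    sent_vec = np.zeros(w2v_model.wv.vector_size)
    weight_sum = 0.0  # sum of tf-idf weights of words with a valid vector
    for word in sent:  # for each word in a review/sentence
        try:
            vec = w2v_model.wv[word]
            # tf-idf weight of this word in this review; named tf_idf_weight so it
            # does not overwrite the TfidfVectorizer stored in `tfidf` earlier.
            tf_idf_weight = final_tf_idf[row, vocab_index[word]]
            sent_vec += (vec * tf_idf_weight)
            weight_sum += tf_idf_weight
        except KeyError:  # word missing from the w2v or tf-idf vocabulary
            pass
    if weight_sum:  # avoid dividing by zero
        sent_vec /= weight_sum
    tfidf_sent_vectors_train.append(sent_vec)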
    
print (len(tfidf_sent_vectors_train))
print (len(tfidf_sent_vectors_train[0]))
print (tfidf_sent_vectors_train[0])

In [None]:
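tfidf_sent_vectors_test = []  # the tfidf-w2v for each sentence/review in the test set
# Test rows need their own tf-idf matrix; final_tf_idf holds the training rows only.
final_tf_idf_test = tf_idf_vect.transform(x_test['CleanedText'].values)
vocab_index = tf_idf_vect.vocabulary_  # word -> column index
for row, sent in enumerate(list_of_sent_test):  # for each review/sentence
    # The accumulator must have the same length as a word vector (200 here, not 20).
    sent_vec = np.zeros(w2v_model.wv.vector_size)
    weight_sum = 0.0  # sum of tf-idf weights of words with a valid vector
    for word in sent:  # for each word in a review/sentence
        try:
            vec = w2v_model.wv[word]
            tf_idf_weight = final_tf_idf_test[row, vocab_index[word]]
            sent_vec += (vec * tf_idf_weight)
            weight_sum += tf_idf_weight
        except KeyError:  # word missing from the w2v or tf-idf vocabulary
            pass
    if weight_sum:  # avoid dividing by zero
        sent_vec /= weight_sum
    tfidf_sent_vectors_test.append(sent_vec)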
    
print (len(tfidf_sent_vectors_test))
print (len(tfidf_sent_vectors_test[0]))
print (tfidf_sent_vectors_test[1])
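The weighted average computed above is the sum of each word vector times its TF-IDF weight, divided by the sum of the weights. A small numeric sketch makes the arithmetic explicit; the function name `weighted_average_vector`, the toy vectors, and the toy weights are assumptions, with plain dictionaries standing in for the Word2Vec model and the TF-IDF matrix.

```python
import numpy as np

def weighted_average_vector(tokens, vectors, weights, dim):
    """Weight each known token's vector, then divide by the total weight."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for token in tokens:
        if token in vectors and token in weights:
            total += vectors[token] * weights[token]
            weight_sum += weights[token]
    # A review with no usable words keeps the zero vector instead of becoming NaN.
    return total / weight_sum if weight_sum > 0 else total

# Hypothetical embeddings and tf-idf weights for one review.
toy_vectors = {"good": np.array([1.0, 0.0]), "tea": np.array([0.0, 1.0])}
toy_weights = {"good": 1.0, "tea": 3.0}

weighted = weighted_average_vector(["good", "tea"], toy_vectors, toy_weights, dim=2)
print(weighted)  # (1*[1,0] + 3*[0,1]) / 4
```

Compared with the plain average, the rarer and therefore more heavily weighted word pulls the review vector towards itself.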

#### GridSearchCV :

In [None]:
# Using grid search to find optimal C and gamma
model1 = GridSearchCV(clf, param)
model1.fit(tfidf_sent_vectors_train, y_train)

In [None]:
# Getting the best model
model1.best_estimator_

In [None]:
# nan values are assigned 0 (if present)
tfidf_sent_vectors_test = np.array(tfidf_sent_vectors_test)
tfidf_sent_vectors_test = np.nan_to_num(tfidf_sent_vectors_test)

In [None]:
# Evaluating the model
model1.score(tfidf_sent_vectors_test, y_test)

**Grid Search Conclusions for TFIDF-W2V**
* GridSearch gave the value of C as 0.001 and gamma as 0.01.
* Accuracy is 81.05 %

**Using RandomizedSearchCV now**

In [None]:
# Using randomized search to find optimal C and gamma
model2 = RandomizedSearchCV(clf, param_2)
model2.fit(tfidf_sent_vectors_train, y_train)

In [None]:
# Getting the best model
model2.best_estimator_

In [None]:
# Evaluating the model
model2.score(tfidf_sent_vectors_test, y_test)

**RandomizedSearchCV results for TFIDF-W2V**
* C obtained is 228 and gamma obtained is 0.
* Accuracy of model = 81.05 %

## Conclusions :
1. The Bag of Words and TFIDF models have the same accuracy: 87.1 % using GridSearchCV and 81.05 % using RandomizedSearchCV.
2. Avg-W2V and TFIDF-W2V also have the same accuracy of 81.05 % using both grid and randomized search.

<table>
    <tr>
        <th>Model</th><th>GridSearch C and gamma</th><th>RandomSearch C and gamma</th>
    </tr>
    <tr>
        <td>Bag of words</td><td>C = 10 , gamma = 0.01</td><td>C = 712, gamma = 2</td>
    </tr>
    <tr>
        <td>TFIDF</td><td>C = 100, gamma = 0.01</td><td> C = 532, gamma = 0</td>
    </tr>
    <tr>
        <td>Avg-W2V</td><td>C = 0.001, gamma = 0.01</td><td>C = 506, gamma = 0</td>
    </tr>
    <tr>
        <td>Tfidf-W2V</td><td>C = 0.001, gamma = 0.01</td><td>C = 228, gamma = 0</td>
    </tr>
</table>
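The same 81.05 % figure appears for several quite different feature sets and hyperparameters. One possible explanation, which the results above do not confirm, is that those models predict the majority class for every test review. A quick way to check is to compare against a majority-class baseline and inspect the confusion matrix. The sketch below uses made-up labels and hypothetical `toy_*` names; on the real data, `y_test` and the model's predictions would take their place.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up, imbalanced labels standing in for y_test.
toy_y_test = np.array(["positive"] * 8 + ["negative"] * 2)
toy_X_test = np.zeros((len(toy_y_test), 1))  # features are ignored by the baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X_test, toy_y_test)
baseline_pred = baseline.predict(toy_X_test)

baseline_acc = accuracy_score(toy_y_test, baseline_pred)
baseline_cm = confusion_matrix(toy_y_test, baseline_pred, labels=["negative", "positive"])
print(baseline_acc)  # equals the share of the majority class
print(baseline_cm)   # the "negative" prediction column is all zeros
```

If a tuned model's accuracy matches this baseline and its confusion matrix has an empty column, accuracy alone is not separating the models, and a metric such as per-class recall or F1 may be more informative on imbalanced data like this.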