# Amazon Fine Food Reviews

Dataset link: https://www.kaggle.com/snap/amazon-fine-food-reviews

## Context

This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Each review includes product and user information, a rating, and the plain-text review. The dataset also includes reviews from all other Amazon categories.

**Data includes:**

* Reviews from Oct 1999 - Oct 2012
* 568,454 reviews
* 256,059 users
* 74,258 products
* 260 users with > 50 reviews

## Importing the Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import string
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split as tts

## Importing the dataset

In [2]:
data=pd.read_csv("Reviews.csv")

In [3]:
# description of data
data.describe()

|  | Id | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time |
|---|---|---|---|---|---|
| count | 568454.0 | 568454.0 | 568454.0 | 568454.0 | 568454.0 |
| mean | 284227.5 | 1.743817 | 2.22881 | 4.183199 | 1296257000.0 |
| std | 164098.679298 | 7.636513 | 8.28974 | 1.310436 | 48043310.0 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 939340800.0 |
| 25% | 142114.25 | 0.0 | 0.0 | 4.0 | 1271290000.0 |
| 50% | 284227.5 | 0.0 | 1.0 | 5.0 | 1311120000.0 |
| 75% | 426340.75 | 2.0 | 2.0 | 5.0 | 1332720000.0 |
| max | 568454.0 | 866.0 | 923.0 | 5.0 | 1351210000.0 |


In [4]:
# number of rows and columns
data.shape

(568454, 10)

In [5]:
# names of the columns
data.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [6]:
# top 5 elements of the dataset
data.head()

|  | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [7]:
# last 5 rows of the dataset
data.tail()

|  | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 568449 | 568450 | B001EO7N10 | A28KG5XORO54AY | Lettie D. Carter | 0 | 0 | 5 | 1299628800 | Will not do without | Great for sesame chicken..this is a good if no... |
| 568450 | 568451 | B003S1WTCU | A3I8AFVPEE8KI5 | R. Sawyer | 0 | 0 | 2 | 1331251200 | disappointed | I'm disappointed with the flavor. The chocolat... |
| 568451 | 568452 | B004I613EE | A121AA1GQV751Z | pksd "pk_007" | 2 | 2 | 5 | 1329782400 | Perfect for our maltipoo | These stars are small, so you can give 10-15 o... |
| 568452 | 568453 | B004I613EE | A3IBEVCTXKNOH | Kathy A. Welch "katwel" | 1 | 1 | 5 | 1331596800 | Favorite Training and reward treat | These are the BEST treats for training and rew... |
| 568453 | 568454 | B001LR2CU2 | A3LGQPJCZVL9UC | srfell17 | 0 | 0 | 5 | 1338422400 | Great Honey | I am very satisfied ,product is as advertised,... |


In [8]:
# number of reviews for each score
data["Score"].value_counts()

5    363122
4     80655
1     52268
3     42640
2     29769
Name: Score, dtype: int64

The counts above show that the dataset is imbalanced: the scores are far from evenly distributed, and 5-star reviews alone account for 363,122 of the 568,454 reviews.
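The imbalance can be quantified by turning the counts printed above into shares of the total. The snippet below is a minimal sketch that hard-codes those counts; the variable names are illustrative and not part of the notebook.

```python
# Score counts copied from the value_counts() output above.
score_counts = {5: 363122, 4: 80655, 1: 52268, 3: 42640, 2: 29769}

total = sum(score_counts.values())

# Share of all reviews that received each score.
shares = {score: count / total for score, count in score_counts.items()}

for score in sorted(shares, reverse=True):
    print(f"score {score}: {shares[score]:.1%}")
```

On the full DataFrame, `data["Score"].value_counts(normalize=True)` should give the same shares directly.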

Scores of 4 and 5 can clearly be treated as positive, and scores of 1 and 2 as negative. A score of 3 is ambiguous, since it could fall into either category, so we treat it as neutral and exclude those reviews.

In [9]:
data=data[data["Score"]!=3]

In [10]:
# counting the values of score
data["Score"].value_counts()

5    363122
4     80655
1     52268
2     29769
Name: Score, dtype: int64

In [11]:
# converting all the scores to positive (4, 5) and negative (1, 2)
def change_score(x):
    if x>3:
        return 'positive'
    return 'negative'


data["Score"]=data["Score"].map(change_score)
data["Score"].value_counts()

positive    443777
negative     82037
Name: Score, dtype: int64

## Processing the Data

### Cleaning the Data

In [12]:
# removing duplicate reviews: same user, profile name, timestamp and text

data=data.sort_values("ProductId")
data=data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"],keep='first')
data.shape

(364173, 10)
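To make the effect of the de-duplication step concrete, here is a small sketch on a toy DataFrame. The rows are made up for illustration; only the column names and the `sort_values` / `drop_duplicates` calls mirror the cell above.

```python
import pandas as pd

# Toy data: the first two rows are the same review posted under two products.
reviews = pd.DataFrame({
    "ProductId":   ["B002", "B001", "B003"],
    "UserId":      ["U1", "U1", "U2"],
    "ProfileName": ["ann", "ann", "bob"],
    "Time":        [100, 100, 200],
    "Text":        ["great", "great", "bad"],
})

# Sort by product, then keep the first row of each duplicate group.
deduped = (
    reviews.sort_values("ProductId")
           .drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"],
                            keep="first")
)

print(deduped["ProductId"].tolist())  # ['B001', 'B003']
```

Because the rows are sorted by `ProductId` first, the copy that survives is the one with the smallest product id in each duplicate group.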

In [13]:
# the helpfulness numerator can never exceed the helpfulness denominator, so drop rows where it does

data=data[data.HelpfulnessNumerator<=data.HelpfulnessDenominator]
data.shape

(364171, 10)

In [14]:
# taking the necessary data from the dataset

data=data[["Score","Text"]]
data.shape

(364171, 2)

In [15]:
# top 5 elements of the dataset
data.head()

|  | Score | Text |
|---|---|---|
| 150523 | positive | this witty little book makes my son laugh at l... |
| 150505 | positive | I grew up reading these Sendak books, and watc... |
| 150506 | positive | This is a fun way for children to learn their ... |
| 150507 | positive | This is a great little book to read aloud- it ... |
| 150508 | positive | This is a book of poetry about the months of t... |


### Text processing

**Text can be processed in the following ways:**

* Removing all the stopwords
* Converting all the letters to lower case
* Stemming
* Lemmatization
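The cleaning cells in this notebook lower-case the text and strip markup, punctuation and digits. As a rough illustration of the first two items in the list, the sketch below lower-cases a sentence and removes stop words. The tiny `STOP_WORDS` set and the `remove_stopwords` helper are assumptions made for this example, not part of the notebook; real stop-word lists are much longer.

```python
import re

# Tiny illustrative stop-word set (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "this", "to", "of", "and", "it"}

def remove_stopwords(text: str) -> str:
    """Lower-case the text, split it into words and drop the stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)

example = "This is a great little book to read aloud"
print(remove_stopwords(example))  # great little book read aloud
```

scikit-learn's `TfidfVectorizer` also accepts `stop_words="english"`, which applies a built-in list at vectorization time. Stemming and lemmatization usually come from a dedicated NLP library and are not shown here.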

In [16]:
# 1st round of cleaning
def clean_html(text):
    # strip HTML tags such as <br />
    return re.sub(r'<.*?>','',text)

def clean_text1(text):
    text=text.lower()
    # drop anything inside square brackets
    text=re.sub(r'\[.*?\]','',text)
    # drop punctuation
    text=re.sub('[%s]'%re.escape(string.punctuation),'',text)
    # drop words that contain digits
    text=re.sub(r'\w*\d\w*','',text)
    return text

data['Text']=data['Text'].apply(clean_html)
data['Text']=data['Text'].apply(clean_text1)
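To see what the first cleaning round does to a single review, the sketch below re-declares the two functions so that it runs on its own and applies them to a made-up sentence. The sample text is an assumption for illustration.

```python
import re
import string

def clean_html(text):
    # strip HTML tags such as <br />
    return re.sub(r'<.*?>', '', text)

def clean_text1(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                               # bracketed text
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # words with digits
    return text

sample = "Great taffy<br />at a GREAT price, 5 stars!"
cleaned = clean_text1(clean_html(sample))
print(repr(cleaned))  # 'great taffyat a great price  stars'
```

Note that replacing a tag with an empty string joins the words on either side of it ("taffyat"), and removing the digit leaves a double space. Replacing tags with a single space instead would be one way to avoid the merged words, if that matters for the features.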

In [17]:
# 2nd round of cleaning
def clean_text2(text):
    text=re.sub('[‘’“”…]','',text)
    text=re.sub('\n','',text)
    return text

data['Text']=data['Text'].apply(clean_text2)

In [18]:
# top 10 elements of the data
data.head(10)

|  | Score | Text |
|---|---|---|
| 150523 | positive | this witty little book makes my son laugh at l... |
| 150505 | positive | i grew up reading these sendak books and watch... |
| 150506 | positive | this is a fun way for children to learn their ... |
| 150507 | positive | this is a great little book to read aloud it h... |
| 150508 | positive | this is a book of poetry about the months of t... |
| 150510 | positive | a charming rhyming book that describes the cir... |
| 150511 | positive | i set aside at least an hour each day to read ... |
| 150512 | positive | i remembered this book from my childhood and g... |
| 150513 | positive | its a great book with adorable illustrations ... |
| 150514 | positive | this book is a family favorite and was read to... |


## Splitting the dataset

In [19]:
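# turn the review text into at most 2500 TF-IDF features
tfidf=TfidfVectorizer(max_features=2500)
x=data["Text"].values
# GaussianNB expects a dense array, hence .toarray()
x=tfidf.fit_transform(x).toarray()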

In [20]:
y=data.iloc[0:,0].values

In [24]:
xtrain,xtest,ytrain,ytest=tts(x,y,test_size=0.25,random_state=225)
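The notebook fits the TF-IDF vectorizer on all reviews and splits afterwards, so the vocabulary and IDF weights are learned partly from the test reviews. A common alternative is to split the raw text first and fit the vectorizer on the training part only. The sketch below shows that ordering on a handful of made-up reviews; the toy data and the `stratify` argument are assumptions for this example, not what the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned review text and labels (illustrative only).
texts = [
    "great taffy at a great price", "product arrived stale and broken",
    "my dog loves this food", "tasted like cough medicine",
    "best treats for training", "disappointed with the flavor",
    "will not do without it", "not as advertised",
]
labels = ["positive", "negative"] * 4

# Split first, so the test reviews stay unseen while the vocabulary is built.
train_texts, test_texts, ytrain, ytest = train_test_split(
    texts, labels, test_size=0.25, random_state=225, stratify=labels
)

tfidf = TfidfVectorizer(max_features=2500)
xtrain = tfidf.fit_transform(train_texts)  # learn vocabulary and IDF on training text
xtest = tfidf.transform(test_texts)        # reuse them for the test text

print(xtrain.shape, xtest.shape)
```

Both matrices stay sparse here. `GaussianNB` would still need `.toarray()` on each of them before fitting and predicting.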

### Training the model

In [29]:
'''import xgboost as xgb
from sklearn.model_selection import GridSearchCV

clf = xgb.XGBClassifier()
parameters = {
     "eta"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
     "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
     "min_child_weight" : [ 1, 3, 5, 7 ],
     "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
     "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
     }

model=GridSearchCV(clf,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)
'''
from sklearn.naive_bayes import GaussianNB
clf=GaussianNB()
clf.fit(xtrain,ytrain)

GaussianNB(priors=None, var_smoothing=1e-09)

In [30]:
ypred=clf.predict(xtest)

In [31]:
# model accuracy (arguments are y_true, y_pred)
accuracy_score(ytest,ypred)

0.7828388783322167

In [33]:
A=confusion_matrix(ytest,ypred)
print(A)

[[11699  2517]
 [17254 59573]]


In [34]:
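# f1 score for the 'negative' class
# (rows of A are true labels, columns are predicted labels, in sorted order: negative, positive)
recall=A[0][0]/(A[0][0]+A[0][1])
precision=A[0][0]/(A[0][0]+A[1][0])
F1=2*recall*precision/(recall+precision)
print(F1)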

0.5420093122379486
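The F1 value above describes only the 'negative' class. As a worked check, the sketch below recomputes accuracy and the per-class precision, recall and F1 from the printed confusion matrix in plain Python. The helper name `per_class_scores` is an assumption for this example; the matrix values are copied from the output above, with rows as true labels and columns as predictions in sorted label order.

```python
# Confusion matrix copied from the output above.
# Row/column 0 is 'negative', row/column 1 is 'positive'.
matrix = [[11699, 2517],
          [17254, 59573]]

def per_class_scores(matrix, k):
    """Precision, recall and F1 when class k is treated as the target class."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                   # true class k, predicted otherwise
    fp = sum(row[k] for row in matrix) - tp    # predicted class k, truly otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

neg_precision, neg_recall, neg_f1 = per_class_scores(matrix, 0)
pos_precision, pos_recall, pos_f1 = per_class_scores(matrix, 1)

print(f"accuracy: {accuracy:.4f}")
print(f"negative: precision={neg_precision:.3f} recall={neg_recall:.3f} f1={neg_f1:.3f}")
print(f"positive: precision={pos_precision:.3f} recall={pos_recall:.3f} f1={pos_f1:.3f}")
```

The negative class has high recall but low precision: many positive reviews are predicted as negative, which is why its F1 is well below the overall accuracy. With the fitted model at hand, `sklearn.metrics.classification_report(ytest, ypred)` reports the same per-class figures.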
