# Amazon Fine Food Reviews Analysis
### Context
This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Each review includes product and user information, a rating, and a plain-text review. The dataset also includes reviews from all other Amazon categories.

###### Information about dataset
###### Reviews from Oct 1999 - Oct 2012
###### 568,454 reviews
###### 256,059 users
###### 74,258 products
###### 260 users with > 50 reviews

## Attribute Information
1. ID
2. ProductId
3. UserId
4. ProfileName
5. HelpfulnessNumerator - Number of users who found the review helpful
6. HelpfulnessDenominator - Number of users who indicated whether they found the review helpful or not
7. **Score** - Rating between 1 and 5
8. Time - Timestamp for the review
9. Summary - Brief summary of the review
10. **Text** - Text of the review

#### Rating:
#### 1, 2 - Negative; 3 - Neutral; 4, 5 - Positive
#### Ignore 3 (neutral reviews are excluded)
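
The rating rule above can be sketched as a small function. This is only an illustration: `score_to_label` is a hypothetical helper name, not something the notebook defines, and returning `None` for the neutral rating is an assumption about how "ignore 3" might be expressed.

```python
from typing import Optional

def score_to_label(score: int) -> Optional[int]:
    """Map a 1-5 star rating to a binary sentiment label.

    Returns 0 for negative (1-2), 1 for positive (4-5), and None for the
    neutral rating 3, which this notebook ignores.
    """
    if score == 3:
        return None
    return 1 if score > 3 else 0

ratings = [5, 1, 4, 2, 3]
# Keep only the ratings that map to a label (drops the 3).
labels = [score_to_label(r) for r in ratings if score_to_label(r) is not None]
print(labels)
```
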

### Load the dataset (SQLite)

In [1]:
import os,sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')

import nltk
import string
import sqlite3

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [2]:
con= sqlite3.connect('database.sqlite')
con

<sqlite3.Connection at 0x27f16892e40>

In [3]:
# Approach 1: load up to 5,000 reviews whose score is not 3 (neutral)
filtered_data=pd.read_sql_query("""select * from reviews where score!=3 limit 5000""",con)
filtered_data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
4995,5423,B00622CYVS,A17ASMX6QMO6XY,E. Harvill,0,1,2,1277424000,Not so tasty...,"My baby didn't seem into these dinners, so I t..."
4996,5424,B00622CYVS,A32DHN8U74GCAR,"Granola Girl ""michele j.""",0,1,4,1240790400,Food Delivery,This is great! Organic baby food options - de...
4997,5425,B00622CYVS,A2YHXAZLCLDT8D,"Mark Smith ""Food lover""",0,1,5,1236988800,Dinner time is Earths Best TIme !!,My little guy loves to try new foods..so this ...
4998,5426,B00622CYVS,A2NYT3UXUTBY23,C&GHoll,1,3,2,1249603200,Wrong item shipped,We ordered the Earth's best 2nd dinner variety...
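
The filter in the query above can be tried without the real database file. The sketch below uses a hypothetical in-memory table with only three columns and made-up rows. It also adds an `ORDER BY`, since a `LIMIT` without one returns rows in whatever order SQLite happens to produce them.

```python
import sqlite3

# Hypothetical in-memory stand-in for database.sqlite, with only three columns.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con_demo.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [(1, 5, "great"), (2, 3, "okay"), (3, 1, "bad"), (4, 4, "good")],
)

# Same idea as the notebook query: drop neutral (3-star) reviews, cap the row count.
rows = con_demo.execute(
    "SELECT Id, Score FROM reviews WHERE Score != 3 ORDER BY Id LIMIT ?", (2,)
).fetchall()
print(rows)
con_demo.close()
```
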


In [4]:
def partition(x):
    if x<3:
        return 0
    return 1

# Changing reviews with score less than 3 to be negative(0) and more than 3 to be positive(1)

actualScore=filtered_data['Score']
PositiveNegative=actualScore.map(partition)
filtered_data['Score']=PositiveNegative
print('Number of data points in our dataset',filtered_data.shape)
filtered_data.head(10)

Number of data points in our dataset (5000, 10)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,1,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,1,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,1,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,1,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,1,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...


In [5]:
filtered_data.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [6]:
# Approach 2: users with more than one review
# Just for reference; this result is not used later
display=pd.read_sql_query(
"""
select UserId, ProductId, ProfileName, Time, Score, Text, count(*) from reviews
group by UserId
Having count(*)>1
""",con
)
display.shape

(80668, 7)

In [7]:
display.head()

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,count(*)
0,#oc-R115TNMSPFT9I7,B005ZBZLT4,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ESG,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B005ZBZLT4,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ESG,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBEV0,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [8]:
display['count(*)'].sum()

393063
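
The grouping query above counts reviews per user and keeps only users with more than one; the sum of those counts is the number of reviews written by such users. Note that in SQLite the non-aggregated columns in that query (ProductId, Text and so on) come from one arbitrary row of each group. A minimal sketch on a hypothetical in-memory table with made-up rows, selecting only the grouped column and an aliased count:

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE reviews (UserId TEXT, ProductId TEXT)")
con_demo.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("u1", "p1"), ("u1", "p2"), ("u2", "p1"),
     ("u3", "p3"), ("u3", "p3"), ("u3", "p4")],
)

# Users with more than one review, with an explicit alias for the count.
repeat_users = con_demo.execute(
    """
    SELECT UserId, COUNT(*) AS n_reviews
    FROM reviews
    GROUP BY UserId
    HAVING COUNT(*) > 1
    ORDER BY UserId
    """
).fetchall()
print(repeat_users)

# Total number of reviews written by those repeat users.
total_reviews = sum(n for _, n in repeat_users)
print(total_reviews)
con_demo.close()
```
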

In [9]:
# Sort the data by ProductId in ascending order

sorted_data=filtered_data.sort_values('ProductId',axis=0,ascending=True,inplace=False,
                                     kind='quicksort')
sorted_data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
2546,2774,B00002NCJC,A196AJHU9EASJN,Alex Chaffee,0,0,1,1282953600,thirty bucks?,Why is this $[...] when the same product is av...
2547,2775,B00002NCJC,A13RRPGE79XFFH,reader48,0,0,1,1281052800,Flies Begone,We have used the Victor fly bait for 3 seasons...
1145,1244,B00002Z754,A3B8RCEI0FXFI6,B G Chase,10,10,1,962236800,WOW Make your own 'slickers' !,I just received my shipment and could hardly w...
1146,1245,B00002Z754,A29Z5PI9BW2PU3,Robbie,7,7,1,961718400,Great Product,This was a really good idea and the final prod...
2942,3204,B000084DVR,A1UGDJP1ZJWVPF,"T. Moore ""thoughtful reader""",1,1,1,1177977600,Good stuff!,I'm glad my 45lb cocker/standard poodle puppy ...
...,...,...,...,...,...,...,...,...,...,...
711,765,B009HINRX8,A1OEL4UZT3KKI4,"coffee drinker in PA ""coffee drinker in PA""",0,0,1,1344988800,great coffee - terrible price,"This is one of the best choices, in my opinion..."
710,764,B009HINRX8,ADDBLG0CFY9AI,S.A.D.,1,1,1,1326758400,Best of the Tassimo's,We've tried many Tassimo flavors. This is by ...
709,763,B009HINRX8,A3N9477PUE6WMR,patc477,4,4,1,1323302400,Good Tasting cup o' joe,This is a bold blend that has a great taste. T...
713,768,B009HINRX8,A2CAZG1CQ8BQI5,Patricia J. Nohalty,0,0,1,1337212800,Kona for Tassimo,Of all the coffee's available for Tassimo this...


In [10]:
sorted_data.shape

(5000, 10)

In [11]:
# Remove duplicate reviews, if any exist (same user, product, time and text)
final=sorted_data.drop_duplicates(subset=['UserId','ProductId','Time','Text'],
                                 keep='first',inplace=False)
final.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
2546,2774,B00002NCJC,A196AJHU9EASJN,Alex Chaffee,0,0,1,1282953600,thirty bucks?,Why is this $[...] when the same product is av...
2547,2775,B00002NCJC,A13RRPGE79XFFH,reader48,0,0,1,1281052800,Flies Begone,We have used the Victor fly bait for 3 seasons...
1145,1244,B00002Z754,A3B8RCEI0FXFI6,B G Chase,10,10,1,962236800,WOW Make your own 'slickers' !,I just received my shipment and could hardly w...
1146,1245,B00002Z754,A29Z5PI9BW2PU3,Robbie,7,7,1,961718400,Great Product,This was a really good idea and the final prod...
2942,3204,B000084DVR,A1UGDJP1ZJWVPF,"T. Moore ""thoughtful reader""",1,1,1,1177977600,Good stuff!,I'm glad my 45lb cocker/standard poodle puppy ...


In [12]:
final.shape

(4994, 10)
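
For intuition, the "keep the first row per key" behaviour used above can be written in plain Python. This is a dependency-free sketch, not the pandas implementation; `drop_duplicate_reviews` and the sample rows are made up for illustration.

```python
def drop_duplicate_reviews(rows, key_fields=("UserId", "ProductId", "Time", "Text")):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

rows = [
    {"Id": 1, "UserId": "u1", "ProductId": "p1", "Time": 100, "Text": "nice"},
    {"Id": 2, "UserId": "u1", "ProductId": "p1", "Time": 100, "Text": "nice"},  # exact repeat
    {"Id": 3, "UserId": "u1", "ProductId": "p2", "Time": 100, "Text": "nice"},  # other product
]
deduped = drop_duplicate_reviews(rows)
print([r["Id"] for r in deduped])
```

Because ProductId is part of the key, the same text posted by the same user at the same time under a different product is not treated as a duplicate here.
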

In [13]:
final['Score'].value_counts()/(len(final))*100

1    83.780537
0    16.219463
Name: Score, dtype: float64

In [14]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
final.shape

(4994, 10)
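
The filter above keeps only rows where the helpful votes do not exceed the total votes; on this sample it removes nothing. The same consistency rule can be folded into a ratio helper. A small sketch, where `helpfulness_ratio` is a hypothetical name and returning `None` for unusable rows is an assumption:

```python
def helpfulness_ratio(numerator, denominator):
    """Share of helpful votes, or None when nobody voted or the counts are inconsistent."""
    if denominator == 0 or numerator > denominator:
        return None
    return numerator / denominator

print(helpfulness_ratio(3, 3))
print(helpfulness_ratio(0, 0))
print(helpfulness_ratio(3, 1))
```
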

# Text preprocessing

In [15]:
# Print a couple of sample reviews from the Text column
sent_0=final['Text'].values[0]
print(sent_0)
print('.......................')

sent_500=final['Text'].values[500]
print(sent_500)
print('.......................')

Why is this $[...] when the same product is available for $[...] here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
.......................
I bought these for my  grandbabies because they are earths best and organic. They like them,but not as well as the cinnamon ones. I tried them,thought they tasted kind of like dirt.
.......................


In [16]:
from bs4 import BeautifulSoup

In [17]:
soup=BeautifulSoup(sent_0,'lxml')
text=soup.get_text()
print(text)

Why is this $[...] when the same product is available for $[...] here?http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDYThe Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
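
In the output above, removing the `<br />` tags glues neighbouring text together ("here?http://..."). One way to avoid that is to replace each tag with a space. The sketch below does this with a plain regular expression instead of BeautifulSoup, so it is a swapped-in technique that only handles simple tags; `strip_tags` and the sample string are made up. BeautifulSoup's `get_text` also accepts a `separator` argument that may serve the same purpose.

```python
import re

def strip_tags(text: str) -> str:
    """Replace anything that looks like an HTML tag with a space, then squeeze whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

sample = "Great taffy.<br /><br />Would buy again."
print(strip_tags(sample))
```
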


In [18]:
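import re

def decontracted(phrase):
    # Expand common English contractions; domain-specific cases may need SME input
    phrase=re.sub(r"won't","will not",phrase)
    phrase=re.sub(r"don't","do not",phrase)
    # Generic rule last; the leading space keeps words apart ("didn't" -> "did not")
    phrase=re.sub(r"n't"," not",phrase)
    return phrase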
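
Rule order matters when expanding contractions: irregular forms such as "won't" have to be handled before a generic "n't" rule, otherwise "won't" would become "wo not". A self-contained sketch of that idea follows; the rule list and the name `expand_contractions` are illustrative assumptions, not a complete treatment of English contractions.

```python
import re

# (pattern, replacement) pairs; irregular forms come before the generic "n't" rule.
CONTRACTION_RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
]

def expand_contractions(phrase: str) -> str:
    """Apply each rule in order and return the expanded phrase."""
    for pattern, replacement in CONTRACTION_RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

expanded = expand_contractions("I won't say they're bad, but I didn't like them")
print(expanded)
```
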

In [19]:
# Replace every run of non-alphanumeric characters with a single space
import re
sent_0=re.sub('[^a-zA-Z0-9]+',' ',sent_0)
print(sent_0)

Why is this when the same product is available for here br http www amazon com VICTOR FLY MAGNET BAIT REFILL dp B00004RBDY br br The Victor M380 and M502 traps are unreal of course total fly genocide Pretty stinky but only right nearby 


Preprocessing tools: re (regular expressions) and bs4 (BeautifulSoup, for stripping HTML).

In [20]:
from nltk.corpus import stopwords

In [21]:
# Combining all text preprocessing
from tqdm import tqdm

preprocessed_review=[]

for sentences in tqdm(final['Text'].values):
    sentences=re.sub(r"http\S+","",sentences)
    sentences=BeautifulSoup(sentences,'lxml').get_text()
    #sentences=decontracted(sentences)
    sentences=re.sub('[^a-zA-Z0-9]+',' ',sentences)
    sentences=sentences.lower()
    #sentences=' '.join(e for e in sentences.split() if e not in stopwords.words('english'))
    preprocessed_review.append(sentences.strip())

100%|████████████████████████████████████████████████████████████████████████████| 4994/4994 [00:01<00:00, 4400.59it/s]


In [22]:
preprocessed_review[0]

'why is this when the same product is available for here the victor m380 and m502 traps are unreal of course total fly genocide pretty stinky but only right nearby'
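
The cleaning steps in the loop above can also be packaged as one function, with optional stop-word removal. The sketch below is dependency-free and rests on several assumptions: it uses a regex in place of BeautifulSoup, `clean_review` is a hypothetical name, the stop-word set is a tiny made-up sample rather than NLTK's English list, and `example.com` is a placeholder URL.

```python
import re

# Tiny illustrative stop-word set (an assumption); NLTK's English list is much larger.
STOP_WORDS = {"is", "this", "the", "for", "and", "of", "but", "are", "when", "a"}

def clean_review(text: str, stop_words=frozenset()) -> str:
    text = re.sub(r"http\S+", " ", text)          # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags (regex stand-in for BeautifulSoup)
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)    # keep letters and digits only
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return " ".join(tokens)

raw = "Pretty stinky,<br />but only right nearby. See http://example.com/x"
print(clean_review(raw))
print(clean_review(raw, STOP_WORDS))
```

Whether to remove stop words is a judgment call for sentiment work, since words such as "not" carry polarity; the notebook leaves that step commented out.
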

# Build Machine Learning Algorithm

Machine learning approaches:
BOW / n-gram / TF-IDF / NB / RF / XGB

Deep learning approaches:
word2vec / GloVe / BERT

In [23]:
# BOW
from sklearn.feature_extraction.text import CountVectorizer

In [24]:
count_vect= CountVectorizer()

In [25]:
count_vect.fit(preprocessed_review)
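# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('some feature name', count_vect.get_feature_names_out()[:20].tolist())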
print('........................')
final_counts= count_vect.transform(preprocessed_review)
print('The type of count vectorizer', type(final_counts))
print('The shape of the text BOW', final_counts.get_shape())
print('The number of unique words', final_counts.get_shape()[1])

some feature name ['00', '000', '000kwh', '002', '0100', '0174', '02', '03', '03510', '042608460503serving', '0472066978', '06', '0738551856', '09', '090', '0g', '0gcholesterol', '0gprotein', '0mg', '0mgsodium']
........................
The type of count vectorizer <class 'scipy.sparse._csr.csr_matrix'>
The shape of the text BOW (4994, 13593)
The number of unique words 13593
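
To make the bag-of-words matrix concrete, here is a hand-rolled version on two made-up documents. It is a sketch, not what CountVectorizer does internally: it splits on whitespace, whereas CountVectorizer's default token pattern keeps only tokens of two or more alphanumeric characters and returns a sparse matrix.

```python
from collections import Counter

docs = ["good taffy good price", "bad taffy"]

# Vocabulary: sorted unique tokens, one column per word.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row per document, one column per vocabulary word, cell = raw count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(tok, 0) for tok in vocab])

print(vocab)
print(matrix)
```
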


In [27]:
# N-gram
count_vect= CountVectorizer(ngram_range=(1, 2),max_features=5000)

final_bigram= count_vect.fit_transform(preprocessed_review)

print('The type of the n-gram matrix', type(final_bigram))
print('The shape of the n-gram matrix', final_bigram.get_shape())
print('The number of n-gram features', final_bigram.get_shape()[1])

The type of the n-gram matrix <class 'scipy.sparse._csr.csr_matrix'>
The shape of the n-gram matrix (4994, 5000)
The number of n-gram features 5000
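
With `ngram_range=(1, 2)` the features are single words plus pairs of adjacent words, and `max_features=5000` keeps only the 5,000 most frequent of them across the corpus. The sketch below shows how such n-grams are formed from a token list; `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("not good at all".split())
print(grams)
```

Bigrams such as "not good" keep some word order that single-word counts lose, which can matter for sentiment.
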


In [28]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf= TfidfVectorizer()

final_tfidf= tfidf.fit_transform(preprocessed_review)

print('The type of the TF-IDF matrix', type(final_tfidf))
print('The shape of the TF-IDF matrix', final_tfidf.get_shape())
print('The number of unique words', final_tfidf.get_shape()[1])

The type of the TF-IDF matrix <class 'scipy.sparse._csr.csr_matrix'>
The shape of the TF-IDF matrix (4994, 13593)
The number of unique words 13593
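
TF-IDF scales each word count by how rare the word is across documents. The sketch below computes it by hand on two made-up documents, assuming scikit-learn's documented defaults (`smooth_idf=True`, so idf = ln((1 + n) / (1 + df)) + 1, and `norm='l2'`). It splits on whitespace and builds dense lists, so it illustrates the formula rather than the library's implementation.

```python
import math
from collections import Counter

docs = ["good taffy", "bad taffy"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
n_docs = len(docs)

# Document frequency and smoothed idf: ln((1 + n) / (1 + df)) + 1
df = {t: sum(t in toks for toks in tokenized) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

rows = []
for toks in tokenized:
    tf = Counter(toks)
    weights = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])  # L2-normalise each document

print(vocab)
print(rows)
```

In this toy corpus "taffy" appears in every document, so it gets the smallest idf and a lower weight than "good" or "bad" within each row.
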


In [29]:
x=final_tfidf.toarray()
x

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [30]:
x.shape

(4994, 13593)

In [31]:
final.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [32]:
y=final['Score']
y

2546    1
2547    1
1145    1
1146    1
2942    1
       ..
711     1
710     1
709     1
713     1
1362    0
Name: Score, Length: 4994, dtype: int64

In [33]:
# Split the data into train and test

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=101)
print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)

(3495, 13593) (1499, 13593) (3495,) (1499,)


In [34]:
from sklearn.naive_bayes import MultinomialNB
nb_model=MultinomialNB().fit(x_train,y_train)
y_pred_train_nb=nb_model.predict(x_train)
y_pred_test_nb=nb_model.predict(x_test)
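
For intuition about what the classifier above estimates, here is a compact multinomial Naive Bayes on word counts with add-one (Laplace) smoothing. It is a sketch under stated assumptions: the function names and toy documents are made up, it uses raw counts whereas the notebook feeds TF-IDF weights, and `alpha=1.0` mirrors scikit-learn's default smoothing value.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word likelihoods."""
    vocab = sorted({t for d in docs for t in d.split()})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        token_counts[y].update(d.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc.split() if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

docs = ["good tasty", "good price", "bad stale", "tasty good good"]
labels = [1, 1, 0, 1]
prior, likelihood = train_multinomial_nb(docs, labels)
print(predict("stale bad", prior, likelihood))
print(predict("good", prior, likelihood))
```
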

In [35]:
print(accuracy_score(y_train,y_pred_train_nb))
print('....................')
print(accuracy_score(y_test,y_pred_test_nb))

0.8394849785407725
....................
0.8345563709139426
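
These accuracies are best read next to the class balance: positive reviews make up about 83.78% of `final` (cell 13), so a model that always predicts 1 would score roughly that on the full set. Accuracy alone may therefore not show whether the classifier separates negative reviews at all. The sketch below, on toy labels that are not the notebook's data, computes a majority-class baseline and confusion counts in plain Python; the notebook's already imported `confusion_matrix` and `classification_report` could give the same view on the real predictions.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label in y_true."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    pairs = Counter(zip(y_true, y_pred))
    return pairs[(0, 0)], pairs[(0, 1)], pairs[(1, 0)], pairs[(1, 1)]

# Toy labels (not the notebook's data): 5 positives, 1 negative, model predicts all 1s.
y_true_demo = [1, 1, 1, 1, 1, 0]
y_pred_demo = [1, 1, 1, 1, 1, 1]
print(majority_baseline_accuracy(y_true_demo))
print(confusion_counts(y_true_demo, y_pred_demo))
```

A zero in the true-negative slot, as in this toy case, means the negative class is never predicted even though accuracy looks high.
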
