# Amazon Fine Food Reviews Analysis
### Context
This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. It also includes reviews from all other Amazon categories.

###### Information about dataset
###### Reviews from Oct 1999 - Oct 2012
###### 568,454 reviews
###### 256,059 users
###### 74,258 products
###### 260 users with > 50 reviews

## Attribute Information
1. Id
2. ProductId
3. UserId
4. ProfileName
5. HelpfulnessNumerator - Number of users who found the review helpful
6. HelpfulnessDenominator - Number of users who indicated whether they found the review helpful or not
7. Score - Rating between 1 and 5
8. Time - Timestamp for the review
9. Summary - Brief summary of the review
10. Text - Text of the review
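The `Time` values in the sample rows look like Unix timestamps. A minimal sketch of turning them into readable dates, assuming the column holds seconds since the epoch (the `sample` frame and the `ReviewDate` column are illustrative names, not part of the dataset):

```python
import pandas as pd

# Two Time values copied from the sample rows shown in this notebook.
sample = pd.DataFrame({"Time": [1303862400, 1346976000]})

# Assumption: the integers are seconds since the Unix epoch.
sample["ReviewDate"] = pd.to_datetime(sample["Time"], unit="s")
print(sample)
```

Under that assumption, both values fall inside the Oct 1999 - Oct 2012 window stated above.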


# Objective: Given a review, determine whether it is positive (score 4 or 5) or negative (score 1 or 2)
# Reviews with a score of 3 are neutral, so we ignore them

# Q: How do we determine whether a review is positive or negative?

### Load the dataset - SQLite dataset

In [2]:
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')


import nltk
import string
import sqlite3

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score


In [3]:
con = sqlite3.connect('database.sqlite')
con

<sqlite3.Connection at 0x15dcd2826c0>

In [None]:
"""
select * from reviews where Score != 3
"""

In [7]:
filtered_data = pd.read_sql_query("""select * from reviews where score !=3 limit 5000""", con)
filtered_data.shape

(5000, 10)
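The neutral score and the row limit can also be bound as query parameters rather than written into the SQL string. The sketch below uses an in-memory database with a simplified, made-up `reviews` table so that it runs without `database.sqlite`; the real table has more columns.

```python
import sqlite3

# In-memory stand-in for database.sqlite; the three-column layout is a simplification.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE reviews (Id INTEGER, Score INTEGER, Text TEXT)")
demo_con.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [(1, 5, "great"), (2, 3, "okay"), (3, 1, "awful"), (4, 4, "good")],
)

# Bind the neutral score and the row limit as parameters.
rows = demo_con.execute(
    "SELECT Id, Score FROM reviews WHERE Score != ? ORDER BY Id LIMIT ?",
    (3, 5000),
).fetchall()
demo_con.close()

print(rows)  # the row with Score 3 is excluded
```

`pd.read_sql_query` accepts a `params` argument as well, so the same placeholders should carry over to the pandas call used in this notebook.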

In [8]:
filtered_data.head(1)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...


In [9]:
filtered_data['Score'].value_counts()

5    3420
4     767
1     504
2     309
Name: Score, dtype: int64

In [10]:
def partition(x):
    if x <3:
        return 0
    return 1

# Map scores below 3 to negative (0) and scores above 3 to positive (1); score 3 was already filtered out by the query

actualScore = filtered_data['Score']
PositiveNegative = actualScore.map(partition)
filtered_data['Score'] = PositiveNegative
print("Number of data points in our dataset", filtered_data.shape)
filtered_data.head(10)

Number of data points in our dataset (5000, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,1,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,1,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,1,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,1,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,1,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...
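Because neutral reviews are already excluded, the same labels can be produced with one vectorized comparison instead of a Python-level `map`. A small sketch on toy scores (the `scores` series is illustrative data, not taken from the dataset):

```python
import pandas as pd

def partition(x):
    # Same rule as in the notebook: below 3 is negative (0), otherwise positive (1).
    if x < 3:
        return 0
    return 1

scores = pd.Series([5, 4, 1, 2, 5], name="Score")

# Vectorized form: True/False from the comparison, cast to 1/0.
labels = (scores > 3).astype(int)

# Both approaches agree as long as no score equals 3.
print(labels.tolist())
print(scores.map(partition).tolist())
```

The two forms differ only for a score of exactly 3, which `partition` would label 1 and a stricter pipeline would drop, so the filter in the SQL query remains essential.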


In [11]:
filtered_data.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [None]:
# just for your reference

In [12]:
display = pd.read_sql_query(""" 
select UserId, ProductId, ProfileName, Time, Score, Text, count(*) from reviews 
group by UserID
Having count(*)>1""", con)
display.shape

(80668, 7)

In [13]:
display.head()

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,count(*)
0,#oc-R115TNMSPFT9I7,B005ZBZLT4,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ESG,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B005ZBZLT4,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ESG,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBEV0,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [14]:
display['count(*)'].sum()

393063
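The query above finds users with more than one review (80,668 of them in the output shown) and then sums their review counts. The same `GROUP BY ... HAVING` logic can be sketched in pandas on a toy frame (the `toy` data below is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "Score":  [5, 4, 1, 5, 5, 2],
})

# Equivalent of: GROUP BY UserId HAVING count(*) > 1
counts = toy.groupby("UserId").size()
repeat_users = counts[counts > 1]

print(len(repeat_users))   # number of users with more than one review
print(repeat_users.sum())  # total reviews written by those users
```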

In [15]:
# actual dataset
filtered_data.head(2)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...


In [16]:
# Sort the dataset by ProductId in ascending order
sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False,
                                       kind='quicksort')

In [18]:
sorted_data.shape

(5000, 10)

In [19]:
# Remove duplicate reviews, if any exist (a list keeps the column order explicit)
final = sorted_data.drop_duplicates(subset=['UserId', 'ProductId', 'Time', 'Text'],
                                   keep='first', inplace=False)

In [20]:
final.shape

(4994, 10)

In [25]:
final.head(100)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
2546,2774,B00002NCJC,A196AJHU9EASJN,Alex Chaffee,0,0,1,1282953600,thirty bucks?,Why is this $[...] when the same product is av...
2547,2775,B00002NCJC,A13RRPGE79XFFH,reader48,0,0,1,1281052800,Flies Begone,We have used the Victor fly bait for 3 seasons...
1145,1244,B00002Z754,A3B8RCEI0FXFI6,B G Chase,10,10,1,962236800,WOW Make your own 'slickers' !,I just received my shipment and could hardly w...
1146,1245,B00002Z754,A29Z5PI9BW2PU3,Robbie,7,7,1,961718400,Great Product,This was a really good idea and the final prod...
2942,3204,B000084DVR,A1UGDJP1ZJWVPF,"T. Moore ""thoughtful reader""",1,1,1,1177977600,Good stuff!,I'm glad my 45lb cocker/standard poodle puppy ...
...,...,...,...,...,...,...,...,...,...,...
932,1011,B0002MKFEM,A3QLX72AO0DD5Z,Carlito Picache,1,2,1,1226361600,Way too salty,I tried this and I found it too salty.<br />Pl...
4090,4433,B0002NYO98,A376TWN7I4HMZ8,helios,0,1,1,1324252800,Exaclty what i ordered,"Again, exactly what I ordered. No fuss, no mus..."
4089,4432,B0002NYO98,A5DVX3B075B09,Patricia Kays,0,0,1,1338940800,LOVELY JUNIPER BERRIES,"Dried berries, still with texture and the quin..."
4270,4642,B0002NYO9I,A376TWN7I4HMZ8,helios,0,1,1,1324252800,Exaclty what i ordered,"Again, exactly what I ordered. No fuss, no mus..."
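The two rows from the user "helios" above share the same `Time`, `Summary`, and text preview but have different `ProductId` values, so they survive de-duplication when `ProductId` is part of the subset. Whether such rows should count as duplicates is a judgment call; the sketch below only illustrates the difference on made-up data.

```python
import pandas as pd

# Toy rows: one user posts identical text at the same time under two product ids.
toy = pd.DataFrame({
    "ProductId": ["P1", "P2", "P2"],
    "UserId":    ["U1", "U1", "U1"],
    "Time":      [100, 100, 100],
    "Text":      ["same text", "same text", "same text"],
})

# Subset used in this notebook: the P1 and P2 copies are both kept.
with_product = toy.drop_duplicates(
    subset=["UserId", "ProductId", "Time", "Text"], keep="first"
)

# Alternative (an assumption, not the notebook's choice): ignore ProductId.
without_product = toy.drop_duplicates(
    subset=["UserId", "Time", "Text"], keep="first"
)

print(len(with_product), len(without_product))
```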


In [22]:
final['Score'].value_counts()

1    4184
0     810
Name: Score, dtype: int64

In [23]:
final['HelpfulnessNumerator'].sum()

7854

In [24]:
final['HelpfulnessDenominator'].sum()

10068

In [26]:
# Sanity check: the number of helpful votes cannot exceed the total number of votes
final = final[final.HelpfulnessNumerator <= final.HelpfulnessDenominator]

In [27]:
final.shape

(4994, 10)
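The shape is unchanged, so none of the 5,000 sampled rows violated the check. If a helpfulness ratio is wanted later, rows with a denominator of 0 need care. A hedged sketch on toy values (the `HelpfulnessRatio` column is a hypothetical addition, not part of the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "HelpfulnessNumerator":   [1, 0, 3, 3],
    "HelpfulnessDenominator": [1, 0, 3, 2],  # last row is invalid: 3 > 2
})

# Same filter as in the notebook.
valid = toy[toy["HelpfulnessNumerator"] <= toy["HelpfulnessDenominator"]].copy()

# Replace zero denominators with NaN so that the ratio is NaN instead of a division by zero.
denominator = valid["HelpfulnessDenominator"].where(valid["HelpfulnessDenominator"] > 0)
valid["HelpfulnessRatio"] = valid["HelpfulnessNumerator"] / denominator

print(valid)
```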

# Text Preprocessing 

In [28]:
# Print some sample reviews from the Text column
sent_0 = final['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = final['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1500 = final['Text'].values[1500]
print(sent_1500)
print("="*50)

sent_2757 = final['Text'].values[2757]
print(sent_2757)
print("="*50)

sent_4900 = final['Text'].values[4900]
print(sent_4900)
print("="*50)

Why is this $[...] when the same product is available for $[...] here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
I like Kettle chips but was really disappointed with this order because they were over cooked.  I bought a bigger sized bag from Costco and the chips were all cooked perfectly. So I don't know why these were different.<br />Will never buy these chips from here again.
These cookies are both yummy and well priced. I recommend them. Other cookies I like are Mrs. Fields(packaged), Chips ahoy soft baked, fig newtons, oreos, and elf cookies.
It felt weird ordering 24 bags of chips off the internet that I had never even tried before, but I am SO GLAD I DID. They are delicious! I've tried 4 flavors so far & have loved each one. These are a win. :)
I like the slight pineapple flavor in this better than the plain coconut water. Good 

In [31]:
# Remove URLs from the sample reviews
import re
sent_0 = re.sub(r"http\S+","",sent_0)
sent_1000 = re.sub(r"http\S+","",sent_1000)
sent_1500 = re.sub(r"http\S+","",sent_1500)
sent_2757 = re.sub(r"http\S+","",sent_2757)
sent_4900 = re.sub(r"http\S+","",sent_4900)
print(sent_0)
print("*******"*10)
print(sent_1000)
print("*******"*10)
print(sent_1500)
print("*******"*10)
print(sent_2757)
print("*******"*10)
print(sent_4900)
print("*******"*10)

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
**********************************************************************
I like Kettle chips but was really disappointed with this order because they were over cooked.  I bought a bigger sized bag from Costco and the chips were all cooked perfectly. So I don't know why these were different.<br />Will never buy these chips from here again.
**********************************************************************
These cookies are both yummy and well priced. I recommend them. Other cookies I like are Mrs. Fields(packaged), Chips ahoy soft baked, fig newtons, oreos, and elf cookies.
**********************************************************************
It felt weird ordering 24 bags of chips off the internet that I had never even tried before, but I am SO GLAD I DID. They are delicious! I've tried 
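In the first output above, the pattern `http\S+` also swallows the `<br` that directly follows the URL, which leaves a stray ` />` behind. A possible tighter pattern stops at whitespace or at the start of a tag; this is a sketch, not a general-purpose URL matcher.

```python
import re

raw = (
    "Why is this $[...] when the same product is available for $[...] here?"
    "<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY"
    "<br /><br />The Victor M380 and M502 traps are unreal"
)

# Pattern used in the notebook: runs until the next whitespace, so it eats "<br".
greedy = re.sub(r"http\S+", "", raw)

# Tighter variant: stop at whitespace or "<" so the following tag stays intact.
tight = re.sub(r"https?://[^\s<]+", "", raw)

print(greedy)
print(tight)
```

In the full pipeline the leftover ` />` is later removed by the non-alphanumeric substitution, so the final cleaned text is the same either way.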

In [32]:
from bs4 import BeautifulSoup

In [33]:
soup = BeautifulSoup(sent_0, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

Why is this $[...] when the same product is available for $[...] here? />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


In [70]:
text = re.sub('[^a-zA-Z0-9]+',' ', text)
text = text.lower()
print(text)

why is this when the same product is available for here the victor m380 and m502 traps are unreal of course total fly genocide pretty stinky but only right nearby 


In [64]:
def decontracted(phrase):
    # Specific irregular contractions first (you may need SME support for domain terms)
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "cannot", phrase)
    phrase = re.sub(r"don\'t", "do not", phrase)
    # General patterns; the leading space in each replacement keeps the words separated
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    return phrase

In [65]:
sent_1000 = final['Text'].values[1000]
print(sent_1000)
print("="*50)

I like Kettle chips but was really disappointed with this order because they were over cooked.  I bought a bigger sized bag from Costco and the chips were all cooked perfectly. So I don't know why these were different.<br />Will never buy these chips from here again.


In [66]:
sent_1000 = decontracted(sent_1000)
print(sent_1000)
print("="*50)

I like Kettle chips but was really disappointed with this order because they were over cooked.  I bought a bigger sized bag from Costco and the chips were all cooked perfectly. So I do not know why these were different.<br />Will never buy these chips from here again.
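The order of the substitutions matters when expanding contractions: irregular forms such as "won't" have to be handled before the generic "n't" suffix, and the generic replacement needs a leading space so that "wasn't" does not become "wasnot". A small standalone sketch (`expand` is a hypothetical helper written for this illustration):

```python
import re

def expand(phrase):
    # Irregular forms first: "won't" is not "wo" + "not".
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "cannot", phrase)
    # Generic suffix afterwards; the leading space keeps the words apart.
    phrase = re.sub(r"n't", " not", phrase)
    return phrase

result = expand("I won't say it wasn't good, but I can't recommend it.")
print(result)
```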


In [67]:
sent_0 = final['Text'].values[0]
print(sent_0)

Why is this $[...] when the same product is available for $[...] here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


In [68]:
# Replace every run of non-alphanumeric characters with a single space
sent_0 = re.sub('[^a-zA-Z0-9]+',' ', sent_0)
print(sent_0)

Why is this when the same product is available for here br http www amazon com VICTOR FLY MAGNET BAIT REFILL dp B00004RBDY br br The Victor M380 and M502 traps are unreal of course total fly genocide Pretty stinky but only right nearby 


In [72]:
final.head(1)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
2546,2774,B00002NCJC,A196AJHU9EASJN,Alex Chaffee,0,0,1,1282953600,thirty bucks?,Why is this $[...] when the same product is av...


In [73]:
# Combine all of the preprocessing steps above and apply them to every review
from tqdm import tqdm

preprocessed_review = []

for sentences in tqdm(final['Text'].values):
    sentences = re.sub(r"http\S+","",sentences)
    sentences = BeautifulSoup(sentences, 'lxml').get_text()
    sentences = decontracted(sentences)
    sentences = re.sub('[^a-zA-Z0-9]+',' ', sentences)
    sentences = sentences.lower()
    preprocessed_review.append(sentences.strip())

100%|████████████████████████████████████████████████████████████████████████████| 4994/4994 [00:02<00:00, 1980.25it/s]


In [74]:
preprocessed_review[0]

'why is this when the same product is available for here the victor m380 and m502 traps are unreal of course total fly genocide pretty stinky but only right nearby'

In [75]:
preprocessed_review[1500]

'these cookies are both yummy and well priced i recommend them other cookies i like are mrs fields packaged chips ahoy soft baked fig newtons oreos and elf cookies'

In [76]:
preprocessed_review[1000]

'i like kettle chips but was really disappointed with this order because they were over cooked i bought a bigger sized bag from costco and the chips were all cooked perfectly so i do not know why these were different will never buy these chips from here again'