# Amazon Fine Food Reviews Analysis

Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review

#### Objective:
Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 is considered positive and a rating of 1 or 2 is considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.
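The labelling rule above can be sketched as a small function. This is only an illustration on a made-up list of scores; the name `label_from_score` is an assumption and is not used elsewhere in the notebook.

```python
def label_from_score(score):
    """Map a 1-5 star rating to a polarity label.

    Returns 1 for positive (4 or 5), 0 for negative (1 or 2),
    and None for the neutral rating 3, which is excluded.
    """
    if score > 3:
        return 1
    if score < 3:
        return 0
    return None


# Made-up scores, for illustration only.
scores = [5, 1, 4, 3, 2]
labels = [label_from_score(s) for s in scores]

# Keep only the reviews that received a label (drops the neutral one).
labelled = [(s, l) for s, l in zip(scores, labels) if l is not None]
print(labelled)
```

The neutral review disappears from the result, which mirrors discarding the rows with `Score == 3` before mapping the rest to 0/1.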

In [1]:
import sys
!{sys.executable} -m pip install gensim



In [1]:
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os
%matplotlib inline



In [2]:
# Loading the dataset
df = pd.read_csv("Reviews.csv")
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [3]:
df.shape

(568454, 10)

In [4]:
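# Discard neutral reviews, i.e. rows with Score == 3.
# .copy() gives an independent DataFrame, which avoids a possible
# SettingWithCopyWarning when df["Score"] is overwritten later.

df = df[df["Score"] != 3].copy()
df.shape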

(525814, 10)

In [5]:
# Map a star rating to a polarity label: 1 = positive (4 or 5), 0 = negative (1 or 2)

def reviews(x):
    if x > 3:
        return 1
    else:
        return 0

In [6]:
# Replace the star rating in "Score" with the binary polarity label
actual = df["Score"]
pos_neg = actual.map(reviews)
df["Score"] = pos_neg
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 1 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 0 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 1 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 0 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 1 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [7]:
print(df.shape)

(525814, 10)


# Exploratory Data Analysis

## 1. Data cleaning: Remove duplicates

In [8]:
a = df["UserId"].value_counts()
print(a.shape)

(243414,)


The number of unique user IDs (243,414) is much smaller than the number of rows (525,814), so many users have written more than one review. This suggests that duplicate entries may be present, which we check next.

In [9]:
df[df["UserId"]=="AR5J8UI46CURR"]

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 73790 | 73791 | B000HDOPZG | AR5J8UI46CURR | Geetha Krishnan | 2 | 2 | 1 | 1199577600 | LOACKER QUADRATINI VANILLA WAFERS | DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ... |
| 78444 | 78445 | B000HDL1RQ | AR5J8UI46CURR | Geetha Krishnan | 2 | 2 | 1 | 1199577600 | LOACKER QUADRATINI VANILLA WAFERS | DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ... |
| 138276 | 138277 | B000HDOPYM | AR5J8UI46CURR | Geetha Krishnan | 2 | 2 | 1 | 1199577600 | LOACKER QUADRATINI VANILLA WAFERS | DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ... |
| 138316 | 138317 | B000HDOPYC | AR5J8UI46CURR | Geetha Krishnan | 2 | 2 | 1 | 1199577600 | LOACKER QUADRATINI VANILLA WAFERS | DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ... |
| 155048 | 155049 | B000PAQ75C | AR5J8UI46CURR | Geetha Krishnan | 2 | 2 | 1 | 1199577600 | LOACKER QUADRATINI VANILLA WAFERS | DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ... |


As can be seen above, the same user has multiple reviews with identical values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On closer analysis it was found that:<br>
<br>
ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)<br>
<br>
ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8), and so on.<br>

From this analysis it was inferred that reviews whose fields are identical except for ProductId belong to the same product, differing only in flavour or quantity. To reduce redundancy, it was decided to eliminate the rows that share these fields.<br>

To do this, we first sort the data by ProductId and then keep only the first review in each group of duplicates, deleting the others. For example, in the case above only the review for ProductId=B000HDL1RQ remains. This ensures that there is only one representative for each product; deduplicating without sorting could leave different representatives for the same product.
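The sort-then-keep-first idea can be sketched in plain Python, without pandas, to make the logic explicit. The rows are a shortened toy version of the records shown earlier (review texts abbreviated), and the names `deduplicate` and `key_fields` are assumptions made for this illustration.

```python
# Toy rows: three copies of one review under different ProductIds, plus one other review.
rows = [
    {"ProductId": "B000HDOPZG", "UserId": "AR5J8UI46CURR", "ProfileName": "Geetha Krishnan",
     "Time": 1199577600, "Text": "DELICIOUS WAFERS"},
    {"ProductId": "B000HDL1RQ", "UserId": "AR5J8UI46CURR", "ProfileName": "Geetha Krishnan",
     "Time": 1199577600, "Text": "DELICIOUS WAFERS"},
    {"ProductId": "B000HDOPYM", "UserId": "AR5J8UI46CURR", "ProfileName": "Geetha Krishnan",
     "Time": 1199577600, "Text": "DELICIOUS WAFERS"},
    {"ProductId": "B001E4KFG0", "UserId": "A3SGXH7AUHU8GW", "ProfileName": "delmartian",
     "Time": 1303862400, "Text": "Good dog food"},
]


def deduplicate(rows, key_fields=("UserId", "ProfileName", "Time", "Text")):
    """Sort by ProductId, then keep the first row for each distinct key."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r["ProductId"]):
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


kept = deduplicate(rows)
print([row["ProductId"] for row in kept])
```

Because the rows are sorted first, the surviving copy is always the one with the smallest ProductId, here B000HDL1RQ, regardless of the order in which the rows were loaded.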

In [10]:
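# Sorting data according to ProductId in ascending order
# (named sorted_df to avoid shadowing Python's built-in sorted())
sorted_df = df.sort_values('ProductId', axis=0, inplace=False)

In [11]:
# Removing the duplicate entries: rows that agree on all four columns count as duplicates
final = sorted_df.drop_duplicates(subset=["UserId","ProfileName","Time","Text"], keep="first", inplace=False)
print(final.shape)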
print(final.shape)

(364173, 10)


In [12]:
# Percentage of data retained after removing duplicates
print(((final['Id'].size*1.0)/(df['Id'].size*1.0))*100)

69.25890143662969


<b>Observation:</b> In the two rows shown below, HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not possible in practice, so these two rows are also removed from the calculations.

In [13]:
final[final["HelpfulnessNumerator"]>final["HelpfulnessDenominator"]]

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 64421 | 64422 | B000MIDROQ | A161DK06JJMCYF | J. E. Stephens "Jeanne" | 3 | 1 | 1 | 1224892800 | Bought This for My Son at College | My son loves spaghetti so I didn't hesitate or... |
| 44736 | 44737 | B001EQ55RW | A2V0I904FH7ABY | Ram | 3 | 2 | 1 | 1212883200 | Pure cocoa taste with crunchy almonds inside | It was almost a 'love at first bite' - the per... |


In [14]:
final = final[final["HelpfulnessNumerator"]<=final["HelpfulnessDenominator"]]

In [15]:
print(final["Score"].value_counts())

1    307061
0     57110
Name: Score, dtype: int64


# Text Preprocessing

Now that deduplication is finished, the data needs some preprocessing before we continue with the analysis and build the prediction model.

In the preprocessing phase we do the following, in this order:

1. Remove the HTML tags
2. Remove punctuation and a limited set of special characters such as , or . or #
3. Check that the word consists of English letters only and is not alphanumeric
4. Check that the word is longer than 2 characters (it was found that there are no two-letter adjectives)
5. Convert the word to lowercase
6. Remove stop words
7. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming)<br>

After that, we collect the words used to describe positive and negative reviews.
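The seven steps listed above can be sketched with the standard library alone. This is a rough illustration, not the notebook's actual cleaning loop: `clean_review` and `TOY_STOPWORDS` are made-up names, the stop-word set is a tiny stand-in for NLTK's list, and stemming is left as a pluggable `stem` argument (identity by default) so the block runs without NLTK.

```python
import re

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list).
TOY_STOPWORDS = {"the", "is", "a", "and", "this", "it", "was"}


def clean_review(text, stopwords=TOY_STOPWORDS, stem=lambda word: word):
    """Apply the seven preprocessing steps to one review and return its tokens."""
    text = re.sub(r"<[^>]+>", " ", text)      # 1. strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)      # 2. replace punctuation with spaces
    tokens = []
    for word in text.split():
        # 3. keep only words made purely of English letters
        if not (word.isascii() and word.isalpha()):
            continue
        # 4. drop words of length 2 or less
        if len(word) <= 2:
            continue
        word = word.lower()                   # 5. lowercase
        if word in stopwords:                 # 6. remove stop words
            continue
        tokens.append(stem(word))             # 7. stem (identity unless a stemmer is passed)
    return tokens


print(clean_review("<br />This taffy is GREAT, and it was 100% fresh!"))
```

To get actual Snowball stemming, one could pass `stem=SnowballStemmer("english").stem` from `nltk.stem`.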

In [17]:
# Let's have a look at some of the reviews
for i in range(0,5):
    print(final["Text"].values[i])
    print('='*50)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
I grew up reading these Sendak books, and watching the Really Rosie movie that incorporates them, and love them. My son loves them too. I do however, miss the hard cover version. The paperbacks seem kind of flimsy and it takes two hands to keep the pages open.
This is a fun way for children to learn their months of the year!  We will learn all of the poems throughout the school year.  they like the handmotions which I invent for each poem.
This is a great little book to read aloud- it has a nice rhythm as well as good repetition that little ones like, in the lines about "chicken soup with rice".  The child gets to go

In [18]:
from bs4 import BeautifulSoup   # It will remove all html tags from a text

In [25]:
from nltk.stem import SnowballStemmer

In [19]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [21]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [22]:
stop = set(stopwords.words('english'))
print(stop)

{'having', 'each', 'won', 'y', 'its', 'did', "you've", "that'll", 'your', 'there', 'ma', 'mightn', 'have', 'below', 'off', 'ourselves', 'hadn', "isn't", 'didn', "don't", 'of', 'shouldn', 'under', 'yours', 'for', 'once', "you'll", 're', 'more', 'against', 'doing', 'hers', 'she', 'any', 'or', "wouldn't", 'can', 'are', 'most', 'wasn', 'be', 'weren', 'on', 'an', 'after', 'my', 'what', 'why', 'to', 'from', 'wouldn', 'couldn', 'theirs', 'mustn', "mightn't", 'they', "it's", "she's", 'where', 'few', 'should', "aren't", 'about', 'which', 'further', 'out', 'don', 'we', 'yourself', 'these', "couldn't", "haven't", "mustn't", "shan't", 'all', 'a', 'does', 's', 'was', 'if', 'them', 'and', 'hasn', 'while', "weren't", 'nor', 'when', "hadn't", 'only', 'd', "hasn't", 'now', "doesn't", 'no', 'over', 'needn', 'between', 'own', "needn't", 'being', 'whom', 'were', 'doesn', 'i', 'before', 'o', 'aren', 'same', 'by', 'some', 'himself', 'itself', 'do', "wasn't", 'up', 't', 'other', 'so', 'this', 'isn', 'our', '

In [24]:
# Negation words such as "no", "nor" and "not" carry sentiment, so we remove them from the stop-word list
lst = ["no","nor","not"]
for word in lst:
    stop.remove(word)

In [30]:
preprocessed_reviews = []

for sentence in final["Text"].values:
    sentence = re.sub(r"http\S+", "", sentence)                   # remove all URLs
    # remove all HTML tags (parser named explicitly to avoid bs4's parser-guessing warning)
    sentence = BeautifulSoup(sentence, "html.parser").get_text()
    sentence = decontracted(sentence)                             # expand contractions, e.g. "won't" -> "will not"
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()          # remove words containing digits
    sentence = re.sub(r"[^a-zA-Z]+", ' ', sentence)               # replace runs of non-letters with a space
    # lowercase every word and drop stop words
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stop)
    preprocessed_reviews.append(sentence.strip())

In [32]:
preprocessed_reviews[100]

'pros dog anything treat not smell bad many treats easy break smaller pieces nothing artificial easy digestion cons costly dog treats overall great product expensive dog anything treat several phobias including getting car walking doorways ignores fears get treat'

In [41]:
# Let's divide the reviews into positive and negative based on the "Score" feature
pos_reviews = []
neg_reviews = []
for review, score in zip(preprocessed_reviews, final["Score"].values):
    if score == 1:
        pos_reviews.append(review)
    else:
        neg_reviews.append(review)

In [44]:
print(f"Number of positive reviews: {len(pos_reviews)}")
print(f"Number of negative reviews: {len(neg_reviews)}")

Number of positive reviews: 307061
Number of negative reviews: 57110


# Featurization

## 1. Bag of Words

In [35]:
count_vec = CountVectorizer()
final_counts = count_vec.fit_transform(preprocessed_reviews)

In [36]:
type(final_counts)

scipy.sparse.csr.csr_matrix

In [37]:
print(final_counts.get_shape())

(364171, 116757)


Here each review is represented as a 116,757-dimensional sparse vector, where each dimension corresponds to a distinct word in the vocabulary.
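To make the representation concrete, here is a minimal library-free sketch of the bag-of-words idea on two made-up reviews. It only illustrates the counting; it is not how `CountVectorizer` is implemented, and `to_count_vector` is a hypothetical helper.

```python
from collections import Counter

# Two made-up, already preprocessed reviews.
reviews = ["great taffy great price", "taffy not great"]

# Vocabulary: every distinct word, sorted so each word gets a fixed column index.
vocabulary = sorted({word for review in reviews for word in review.split()})


def to_count_vector(review):
    """Return the word counts of one review, one entry per vocabulary word."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]


vectors = [to_count_vector(review) for review in reviews]
print(vocabulary)
print(vectors)
```

With a vocabulary of four words each review becomes a 4-dimensional vector; on the full dataset the same construction yields one dimension per distinct word, and most entries are zero, which is why a sparse matrix is used.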

In [38]:
print(final_counts.get_shape()[1])

116757


## 2. Bi-grams, n-grams

In [46]:
# Note: negation words such as "not" should be kept (not removed as stop words) before building n-grams
count_vect = CountVectorizer(ngram_range=(1,2))
final_bigrams = count_vect.fit_transform(preprocessed_reviews)

In [47]:
final_bigrams.get_shape()

(364171, 3923391)
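The jump in dimensionality comes from adding every pair of adjacent words to the vocabulary. A small sketch of n-gram generation (the helper `ngrams` is a made-up name, not a scikit-learn function) also shows why negation words are worth keeping:

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams of length n_min..n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# The same phrase with and without the negation word "not".
with_negation = ngrams("taste not good".split())
without_negation = ngrams("taste good".split())
print(with_negation)
print(without_negation)
```

If "not" had been removed as a stop word, the bi-gram "not good" would never be produced and the phrase would look the same as a positive one.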

## 3. TF-IDF

In [49]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2),min_df=10)
final_tf_idf = tf_idf_vect.fit_transform(preprocessed_reviews)

In [50]:
print(final_tf_idf.get_shape())

(364171, 203036)


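In the sparse matrix above, each vector has far fewer dimensions (203,036) than in the BoW bi-gram representation (3,923,391), because min_df=10 discards every term that appears in fewer than 10 reviews.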
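The effect of `min_df` and the TF-IDF weighting can be sketched without scikit-learn. The sketch below assumes the smoothed formula documented for scikit-learn's defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. It handles unigrams only, the three documents are made up, and `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

# Three made-up, already tokenised reviews.
docs = [
    "great taffy great price".split(),
    "taffy not great".split(),
    "price not great stale".split(),
]


def tfidf(docs, min_df=1):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # min_df drops terms that occur in too few documents.
    vocab = sorted(term for term, count in df.items() if count >= min_df)
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocab, rows


vocab_all, rows_all = tfidf(docs, min_df=1)
vocab_min2, rows_min2 = tfidf(docs, min_df=2)
print(len(vocab_all), len(vocab_min2))
```

Raising `min_df` from 1 to 2 removes the word that occurs in a single document, so the vectors get shorter; the same mechanism, with `min_df=10`, is what shrinks the vocabulary on the full dataset.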

## 4. Word2Vec

**Note:** Here we build our own w2v model using the Amazon Fine Food Reviews dataset.

Gensim 3.x to 4 migration guide: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

In [52]:
# Each review is converted into a list of words
list_of_sentences = [sentence.split() for sentence in preprocessed_reviews]

In [56]:
print(list_of_sentences[0:2])

[['witty', 'little', 'book', 'makes', 'son', 'laugh', 'loud', 'recite', 'car', 'driving', 'along', 'always', 'sing', 'refrain', 'learned', 'whales', 'india', 'drooping', 'roses', 'love', 'new', 'words', 'book', 'introduces', 'silliness', 'classic', 'book', 'willing', 'bet', 'son', 'still', 'able', 'recite', 'memory', 'college'], ['grew', 'reading', 'sendak', 'books', 'watching', 'really', 'rosie', 'movie', 'incorporates', 'love', 'son', 'loves', 'however', 'miss', 'hard', 'cover', 'version', 'paperbacks', 'seem', 'kind', 'flimsy', 'takes', 'two', 'hands', 'keep', 'pages', 'open']]


In [57]:
w2v_model = Word2Vec(list_of_sentences,min_count=10,vector_size=50)

In [75]:
print(w2v_model.wv.most_similar('tasty'))
print('='*50)
print(w2v_model.wv.most_similar('worst'))

[('satisfying', 0.8210705518722534), ('delicious', 0.8176749348640442), ('tastey', 0.8020315766334534), ('yummy', 0.7951156497001648), ('filling', 0.7417910695075989), ('flavorful', 0.7311116456985474), ('nutritious', 0.6694338321685791), ('hearty', 0.659784197807312), ('nice', 0.6545760035514832), ('surprisingly', 0.6527333855628967)]
[('nastiest', 0.862835705280304), ('greatest', 0.771416962146759), ('disgusting', 0.7435508966445923), ('vile', 0.7294278144836426), ('best', 0.7288384437561035), ('horrible', 0.7157167792320251), ('terrible', 0.7150712609291077), ('awful', 0.7039044499397278), ('weakest', 0.7007498741149902), ('horrid', 0.687414824962616)]
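The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking is shown here; the 3-dimensional vectors are invented for illustration, and this `most_similar` function is a toy stand-in, not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "tasty": [0.9, 0.1, 0.0],
    "delicious": [0.8, 0.2, 0.1],
    "cardboard": [0.0, 0.1, 0.9],
}


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar("tasty", toy_vectors))
```

Word2Vec places words that occur in similar contexts close together, which may be why 'greatest' and 'best' show up among the neighbours of 'worst' in the output above: superlatives tend to appear in the same positions in a sentence even when their sentiment is opposite.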


In [80]:
w2v_words = w2v_model.wv
# print("number of words that occurred at least 10 times:", len(w2v_words.index_to_key))
# print("sample words:", w2v_words.index_to_key[0:50])

In [86]:
w2v_model.wv.key_to_index["witty"]

22696

## 5. Average Word2Vec

In [73]:
# Number of times the word "drooping" occurs in the training corpus
a = w2v_model.wv.get_vecattr("drooping","count")
a

10

In [87]:
# compute average word2vec for each review
sent_vectors = []     # the avg w2v for each review is stored in this list
for sentence in list_of_sentences:
    vec = np.zeros(50)          # the dimension of each vector is 50 (vector_size above)
    cnt_words = 0               # number of words in the review that have a vector
    for word in sentence:       # for each word in the sentence/review
        try:
            vector = w2v_model.wv[word]
            vec += vector
            cnt_words += 1
        except KeyError:        # word occurred fewer than min_count times, so it has no vector
            continue
    # Reviews with no in-vocabulary word are skipped, so sent_vectors can be
    # shorter than list_of_sentences and is then no longer aligned with the labels.
    if cnt_words != 0:
        sent_vectors.append(vec/cnt_words)

In [88]:
print(len(sent_vectors))
print(sent_vectors[0])

363192
[ 3.05343097e-01 -2.51927575e-01  2.44831819e-01  6.64331497e-02
 -8.17800929e-01 -6.83222758e-02 -1.48967075e-01  5.25359204e-01
  9.68132051e-02 -6.10650596e-01  1.20988605e-01  3.88652331e-01
  3.04124480e-01  5.97012126e-01  3.46680455e-01  1.42247000e-01
 -4.78030783e-01  3.30011004e-01  8.25699644e-03  1.95626044e-01
  5.57158574e-01  5.42458074e-02  7.91113347e-02 -5.03733936e-01
  1.52047142e-01 -2.37758904e-01  2.94700058e-03 -2.59543861e-01
  7.28126620e-01  4.70374918e-01 -1.91287728e-01 -6.65873332e-01
  5.66748748e-01 -1.64712197e-01 -1.64926932e-02 -1.33539044e-01
  2.29899972e-01  2.08916660e-01  4.64938128e-01  4.97668109e-01
 -4.85685967e-01  3.97745898e-01  3.84096739e-01 -7.29525856e-01
 -4.38981841e-01  1.45476821e-01 -3.14960632e-04  5.58972559e-02
  3.92719044e-01  1.33777236e-01]
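Note that the loop produced 363,192 vectors for 364,171 reviews, because reviews without any in-vocabulary word were skipped. If the vectors need to stay aligned with the labels, one option is to emit a zero vector for such reviews. The sketch below shows that variant on toy data with plain Python lists; `average_vectors` and `toy_embeddings` are made-up names, and the zero-vector fallback is an assumption, not what the notebook does.

```python
def average_vectors(sentences, embeddings, dim):
    """Return one averaged vector per sentence; a zero vector if no word is known."""
    result = []
    for sentence in sentences:
        total = [0.0] * dim
        count = 0
        for word in sentence:
            vector = embeddings.get(word)
            if vector is None:      # word has no embedding, skip it
                continue
            total = [t + v for t, v in zip(total, vector)]
            count += 1
        if count:
            total = [t / count for t in total]
        result.append(total)        # always append, so indices match the input
    return result


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"great": [1.0, 0.0], "taffy": [0.0, 1.0]}
sentences = [["great", "taffy"], ["unknownword"], ["great"]]

vectors = average_vectors(sentences, toy_embeddings, dim=2)
print(len(vectors), vectors)
```

The output has exactly one vector per input sentence, so row `i` of the features still corresponds to label `i`.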
