# Amazon Fine Food Reviews Analysis


[Data Source](https://www.kaggle.com/snap/amazon-fine-food-reviews)




The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

- Number of reviews: 568,454
- Number of users: 256,059
- Number of products: 74,258
- Timespan: Oct 1999 - Oct 2012
- Number of Attributes/Columns in data: 10

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How do we determine whether a review is positive or negative?<br>
<br>
[Ans] We can use the Score/Rating. A rating of 4 or 5 is considered positive, and a rating of 1 or 2 is considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.
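The labelling rule described above can be sketched as a small function. This is only an illustration: the name `polarity_from_score` and the choice of `None` for neutral reviews are assumptions made here, not part of the dataset or of the notebook's own code.

```python
from typing import Optional

def polarity_from_score(score: int) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative) or None (neutral)."""
    if score in (4, 5):
        return 1
    if score in (1, 2):
        return 0
    return None  # a 3-star review is treated as neutral and ignored

labels = [polarity_from_score(s) for s in [5, 1, 4, 2, 3]]
print(labels)  # [1, 0, 1, 0, None]
```

Returning `None` for a rating of 3 makes the "ignored" case explicit instead of silently assigning it to one of the two classes.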




# Download Data

In [1]:
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
api.dataset_download_files("snap/amazon-fine-food-reviews",path="data/")

# Load Data

In [2]:
from zipfile import ZipFile

zip_file = ZipFile("data/amazon-fine-food-reviews.zip")
zip_file.namelist()

['Reviews.csv', 'database.sqlite', 'hashes.txt']

In [3]:
zip_file.extract("database.sqlite",path="data/")

'data\\database.sqlite'

In [3]:
import pandas as pd
import sqlite3

con = sqlite3.connect("data/database.sqlite")
pd.read_sql_query("select name from sqlite_master where type ='table'",con)

Unnamed: 0,name
0,Reviews


In [4]:
con.execute("pragma table_info(Reviews)").fetchall()

[(0, 'Id', 'INTEGER', 0, None, 1),
 (1, 'ProductId', 'TEXT', 0, None, 0),
 (2, 'UserId', 'TEXT', 0, None, 0),
 (3, 'ProfileName', 'TEXT', 0, None, 0),
 (4, 'HelpfulnessNumerator', 'INTEGER', 0, None, 0),
 (5, 'HelpfulnessDenominator', 'INTEGER', 0, None, 0),
 (6, 'Score', 'INTEGER', 0, None, 0),
 (7, 'Time', 'INTEGER', 0, None, 0),
 (8, 'Summary', 'TEXT', 0, None, 0),
 (9, 'Text', 'TEXT', 0, None, 0)]

In [5]:
filtered_data = pd.read_sql_query("select * from Reviews where Score != 3 limit 5000" ,con)
filtered_data.shape

(5000, 10)

In [6]:
filtered_data["Score"].head()

0    5
1    1
2    4
3    2
4    5
Name: Score, dtype: int64

In [7]:
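def partition(x):
    # Scores of 3 were excluded by the query above, so x < 3 means a 1- or 2-star review.
    if x < 3:
        return 0  # negative
    return 1      # positive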

filtered_data["Score"] = filtered_data["Score"].map(partition)
filtered_data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
4995,5423,B00622CYVS,A17ASMX6QMO6XY,E. Harvill,0,1,0,1277424000,Not so tasty...,"My baby didn't seem into these dinners, so I t..."
4996,5424,B00622CYVS,A32DHN8U74GCAR,"Granola Girl ""michele j.""",0,1,1,1240790400,Food Delivery,This is great! Organic baby food options - de...
4997,5425,B00622CYVS,A2YHXAZLCLDT8D,"Mark Smith ""Food lover""",0,1,1,1236988800,Dinner time is Earths Best TIme !!,My little guy loves to try new foods..so this ...
4998,5426,B00622CYVS,A2NYT3UXUTBY23,C&GHoll,1,3,0,1249603200,Wrong item shipped,We ordered the Earth's best 2nd dinner variety...


# EDA

In [8]:
display = pd.read_sql_query("""select * from reviews where score != 3 and userid ='AR5J8UI46CURR' order by productid""",con)

In [9]:
display.shape

(5, 10)

In [10]:
display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
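The output above shows one user posting the same text at the same timestamp under several product IDs. One possible way to look for such groups directly in SQL is a `GROUP BY ... HAVING` query. The sketch below runs against a tiny in-memory table with made-up rows (`demo_con` and the sample values are hypothetical), not against the real database.

```python
import sqlite3

# In-memory stand-in for the Reviews table, with invented rows.
demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE Reviews (Id INTEGER, ProductId TEXT, UserId TEXT, Time INTEGER, Text TEXT)"
)
demo_con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?, ?, ?)",
    [
        (1, "P1", "U1", 100, "same text"),
        (2, "P2", "U1", 100, "same text"),   # same user, time and text as row 1
        (3, "P3", "U2", 200, "another review"),
    ],
)

# Groups of rows sharing user, timestamp and text, reported with their size.
dupes = demo_con.execute(
    """
    SELECT UserId, Time, Text, COUNT(*) AS n
    FROM Reviews
    GROUP BY UserId, Time, Text
    HAVING COUNT(*) > 1
    """
).fetchall()
print(dupes)  # [('U1', 100, 'same text', 2)]
demo_con.close()
```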


In [11]:
# remove duplicate reviews: same user, profile name, timestamp and text
final_data = filtered_data.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep="first")
final_data.shape

(4986, 10)

In [13]:
(len(final_data)/len(filtered_data))*100

99.72

In [14]:
final_data = final_data[final_data["HelpfulnessNumerator"]<=final_data["HelpfulnessDenominator"]]
final_data.shape

(4986, 10)

In [18]:
print(final_data["Score"].value_counts())

1    4178
0     808
Name: Score, dtype: int64
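The counts above are imbalanced, which matters when a classifier is evaluated later. As a rough sketch, the share of the majority class is the accuracy of a trivial model that always predicts "positive". The counts are copied from the output above; the variable names are illustrative.

```python
from collections import Counter

# Class counts taken from the value_counts() output above.
counts = Counter({1: 4178, 0: 808})
total = sum(counts.values())

majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total
print(f"always predicting {majority_label}: {baseline_accuracy:.1%}")  # 83.8%
```

Any model trained on this sample should be compared against that baseline rather than against 50%.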


In [17]:
print(final_data["Score"])

0       1
1       0
2       1
3       0
4       1
       ..
4995    0
4996    1
4997    1
4998    0
4999    0
Name: Score, Length: 4986, dtype: int64


# Text Preprocessing

In [19]:
# print a few sample reviews (fixed positions, not random)
sent_0 = final_data['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = final_data['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1500 = final_data['Text'].values[1500]
print(sent_1500)
print("="*50)

sent_4900 = final_data['Text'].values[4900]
print(sent_4900)
print("="*50)

I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
This is not jerky, this is processed, hard like a rock, very greasy and stale smelling stripe of something that you can't break into anything smaller than 2 inches long and that certainly is not the size of a training treat! The dogs- 45lb dogs that will eat anything- were not impressed, it was hard to chew, and it sounded like they were crunching rocks, most of them spat it out after a few chews, left it there, this would be the first time they would not eat something in their entire lives, these dogs will work for lettuce. Where is a zero star button?
Aboulutely love Popchips!I first tried these healthy chips at a marathon i did in California. I like this variety pack because i got to try alot of the flavors ive never had.
M

In [21]:
# remove URLs (anything starting with "http" up to the next whitespace)
import re
re.sub(r"http\S+","",sent_0)

'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

In [22]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
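A short usage sketch may help show why the order of the rules matters. The block below restates the same substitutions as an ordered table so that it runs on its own; `RULES` and `expand` are names introduced here for illustration.

```python
import re

# Same substitutions as decontracted() above, as an ordered table.
# The two specific rules must run before the general "n't" rule.
RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand(phrase):
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

print(expand("I won't buy it again, they're stale and can't be chewed"))
# I will not buy it again, they are stale and can not be chewed

# Applying only the general rule shows why "won't" needs its own entry.
print(re.sub(r"n't", " not", "won't"))  # wo not

# Known limitation: possessives are expanded as if they were "is".
print(expand("the dog's bowl"))  # the dog is bowl
```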

In [23]:
# remove words containing digits: https://stackoverflow.com/a/18082370/4084039
sent_0 = re.sub(r"\S*\d\S*", "", sent_0).strip()
print(sent_0)

I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.


In [24]:
# remove special characters: https://stackoverflow.com/a/5843547/4084039
sent_1500 = re.sub('[^A-Za-z0-9]+', ' ', sent_1500)
print(sent_1500)

Aboulutely love Popchips I first tried these healthy chips at a marathon i did in California I like this variety pack because i got to try alot of the flavors ive never had 


In [25]:
# Note: 'no' and 'not' are absent from this set, so negations survive preprocessing.
stopwords = {'br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's",
            'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
            'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was',
            'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
            'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
            'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to',
            'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
            'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
            'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
            "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't",
            'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven',
            "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan',
            "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn',
            "wouldn't"}

In [34]:
from tqdm import tqdm
from bs4 import BeautifulSoup


preprocessed_reviews = []


for sentence in tqdm(final_data["Text"].values):
    sentence = re.sub(r"http\S+", "", sentence)               # drop URLs
    sentence = BeautifulSoup(sentence, "lxml").get_text()     # strip HTML tags
    sentence = decontracted(sentence)                         # expand contractions
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()      # drop tokens containing digits
    sentence = re.sub(r"[^A-Za-z]+", " ", sentence)           # keep letters only
    # lowercase and remove stopwords
    sentence = " ".join([e.lower() for e in sentence.split() if e.lower() not in stopwords])
    preprocessed_reviews.append(sentence.strip())

100%|██████████| 4986/4986 [00:02<00:00, 2270.37it/s]
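To see what each step of the loop above does, it can help to trace a single made-up review through a reduced version of the pipeline. The sketch below is a simplification under stated assumptions: `clean` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a small subset, a regex stands in for BeautifulSoup, and only one contraction rule is applied.

```python
import re

# Small illustrative subset of the stopword set defined earlier.
DEMO_STOPWORDS = {"br", "the", "i", "this", "is", "it", "at", "a", "and", "for", "did"}

def clean(text):
    text = re.sub(r"http\S+", "", text)           # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)          # crude tag removal (stand-in for BeautifulSoup)
    text = re.sub(r"n't", " not", text)           # a single contraction rule, for brevity
    text = re.sub(r"\S*\d\S*", "", text).strip()  # drop tokens containing digits
    text = re.sub(r"[^A-Za-z]+", " ", text)       # keep letters only
    return " ".join(w.lower() for w in text.split() if w.lower() not in DEMO_STOPWORDS)

raw = "I didn't like this!<br />Paid $12 at http://example.com for 2 bags."
cleaned = clean(raw)
print(cleaned)  # not like paid bags
```

The negation "not" survives because it is not in the stopword set, which keeps "not like" distinguishable from "like" in later n-gram features.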


# Bag of Words

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
count_vect.fit(preprocessed_reviews)

final_counts = count_vect.transform(preprocessed_reviews)
print(final_counts.shape)
print(type(final_counts))

(4986, 12997)
<class 'scipy.sparse.csr.csr_matrix'>


In [37]:
print(count_vect.get_feature_names_out())

['aa' 'aahhhs' 'aback' ... 'zucchini' 'zupas' 'zuppa']
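The idea behind the count matrix can be shown by hand on a toy corpus. This is a simplified sketch, not what `CountVectorizer` does internally: it splits on whitespace only and skips the vectorizer's own tokenization rules. The alphabetical ordering of the vocabulary matches the feature names printed above.

```python
from collections import Counter

docs = ["good dog food", "dog loves food food"]

# Vocabulary: every distinct word, sorted alphabetically.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cell = word count.
matrix = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

print(vocabulary)  # ['dog', 'food', 'good', 'loves']
print(matrix)      # [[1, 1, 1, 0], [1, 2, 0, 1]]
```

With about 13,000 columns and only a few dozen distinct words per review, nearly every cell is zero, which is why the real result is stored as a sparse matrix.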


# Bi-grams and n-grams

In [38]:
count_vect = CountVectorizer(ngram_range=(1,2))
final_bigrams = count_vect.fit_transform(preprocessed_reviews)
final_bigrams.shape

(4986, 137837)

In [40]:
print(count_vect.get_feature_names_out())

['aa' 'aa sumatra' 'aahhhs' ... 'zupas pathetic' 'zuppa' 'zuppa engelesia']
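With `ngram_range=(1,2)` the features are all single words plus all pairs of adjacent words, which is why the feature count grows from 12,997 to 137,837. A minimal sketch of that enumeration, using a hypothetical helper named `ngrams`:

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

features = ngrams("not good taste".split())
print(features)  # ['not', 'good', 'taste', 'not good', 'good taste']
```

The bigram `not good` carries the negation that the unigrams `not` and `good` lose when counted separately.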


# TF-IDF

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(preprocessed_reviews)

final_tf_idf.shape

(4986, 137837)
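The weights stored in this matrix can be reproduced by hand on a toy corpus. The sketch below follows the defaults documented for `TfidfVectorizer` (raw counts for term frequency, `smooth_idf=True`, `norm='l2'`), so idf is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The function name `tfidf` and the two-document corpus are illustrative.

```python
import math
from collections import Counter

docs = [["good", "food"], ["bad", "food"]]
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # tf * smoothed idf
    weights = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w, c in counts.items()}
    # L2-normalise the row so its squared weights sum to 1
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

row = tfidf(docs[0])
print(row)
```

The word `food` occurs in both documents and so receives a lower weight than `good`, which occurs in only one. The normalisation step is also why every weight in a row is at most 1.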

In [46]:
print(final_tf_idf[1].data)

[0.18512007 0.18512007 0.18512007 0.18512007 0.18512007 0.18512007
 0.10770806 0.18512007 0.18512007 0.18512007 0.18512007 0.18512007
 0.18512007 0.18512007 0.18512007 0.18512007 0.18512007 0.13901079
 0.18512007 0.16589149 0.13386683 0.15602835 0.08619514 0.03415098
 0.15355665 0.12566441 0.0854555  0.08827067 0.25520578 0.13211705
 0.35322257 0.15602835 0.09672757 0.11223086]


In [49]:
print(final_tf_idf[1].indices)

[ 95211 101171  61471 130863  38597 117897  81903 128605 109910 110495
    676  89048  89069 104096  62867  64600   5921  94975 101170  61468
 130854  38589 117847  80803 128601 109882 110348    558  89047 104083
  62865  64598   5878  94949]


In [52]:
print(final_tf_idf[1])

  (0, 95211)	0.18512006600681238
  (0, 101171)	0.18512006600681238
  (0, 61471)	0.18512006600681238
  (0, 130863)	0.18512006600681238
  (0, 38597)	0.18512006600681238
  (0, 117897)	0.18512006600681238
  (0, 81903)	0.1077080556285421
  (0, 128605)	0.18512006600681238
  (0, 109910)	0.18512006600681238
  (0, 110495)	0.18512006600681238
  (0, 676)	0.18512006600681238
  (0, 89048)	0.18512006600681238
  (0, 89069)	0.18512006600681238
  (0, 104096)	0.18512006600681238
  (0, 62867)	0.18512006600681238
  (0, 64600)	0.18512006600681238
  (0, 5921)	0.18512006600681238
  (0, 94975)	0.13901078705098982
  (0, 101170)	0.18512006600681238
  (0, 61468)	0.16589148764645154
  (0, 130854)	0.13386683412402858
  (0, 38589)	0.15602834999785767
  (0, 117847)	0.08619514175990735
  (0, 80803)	0.03415098092479426
  (0, 128601)	0.15355664505546718
  (0, 109882)	0.1256644119194074
  (0, 110348)	0.08545549736806055
  (0, 558)	0.08827066721492412
  (0, 89047)	0.2552057759081932
  (0, 104083)	0.13211705128161333
  (0

# Word2Vec

In [58]:
list_of_word = [sentence.split() for sentence in preprocessed_reviews]
list_of_word[:2]

[['bought',
  'several',
  'vitality',
  'canned',
  'dog',
  'food',
  'products',
  'found',
  'good',
  'quality',
  'product',
  'looks',
  'like',
  'stew',
  'processed',
  'meat',
  'smells',
  'better',
  'labrador',
  'finicky',
  'appreciates',
  'product',
  'better'],
 ['product',
  'arrived',
  'labeled',
  'jumbo',
  'salted',
  'peanuts',
  'peanuts',
  'actually',
  'small',
  'sized',
  'unsalted',
  'not',
  'sure',
  'error',
  'vendor',
  'intended',
  'represent',
  'product',
  'jumbo']]

In [60]:
from gensim.models import Word2Vec

w2v_model = Word2Vec(list_of_word,min_count=2,workers=4,vector_size=50)
w2v_model.wv.most_similar("great")

[('excellent', 0.9902483820915222),
 ('alternative', 0.9895371198654175),
 ('tasty', 0.9892227053642273),
 ('regular', 0.989077627658844),
 ('snack', 0.9886667132377625),
 ('chewy', 0.9885573983192444),
 ('licorice', 0.9884496331214905),
 ('though', 0.9884100556373596),
 ('either', 0.9883403778076172),
 ('amazing', 0.9883047938346863)]

In [61]:
w2v_model.wv.most_similar("like")

[('bitter', 0.9948434829711914),
 ('strong', 0.9935378432273865),
 ('taste', 0.9934603571891785),
 ('sweet', 0.9915247559547424),
 ('flavor', 0.9908263683319092),
 ('tastes', 0.9895326495170593),
 ('rich', 0.9842400550842285),
 ('dark', 0.981838583946228),
 ('really', 0.9803654551506042),
 ('smooth', 0.9781405925750732)]
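The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure on plain Python lists follows; the function name `cosine` and the toy vectors are illustrative.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors, 0
print(same_direction, orthogonal)
```

In the outputs above, every neighbour scores close to 0.99, including loosely related words such as `though` and `either`. Scores this uniform may indicate that the vectors have not separated much, which is plausible for a model trained on roughly 5,000 reviews.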

In [64]:
w2v_model.wv.key_to_index

{'not': 0,
 'like': 1,
 'good': 2,
 'great': 3,
 'taste': 4,
 'one': 5,
 'product': 6,
 'would': 7,
 'flavor': 8,
 'love': 9,
 'coffee': 10,
 'food': 11,
 'chips': 12,
 'tea': 13,
 'no': 14,
 'really': 15,
 'get': 16,
 'best': 17,
 'much': 18,
 'amazon': 19,
 'use': 20,
 'time': 21,
 'buy': 22,
 'also': 23,
 'tried': 24,
 'little': 25,
 'find': 26,
 'make': 27,
 'price': 28,
 'better': 29,
 'bag': 30,
 'try': 31,
 'even': 32,
 'mix': 33,
 'well': 34,
 'chocolate': 35,
 'hot': 36,
 'eat': 37,
 'free': 38,
 'water': 39,
 'dog': 40,
 'first': 41,
 'made': 42,
 'could': 43,
 'found': 44,
 'used': 45,
 'bought': 46,
 'box': 47,
 'sugar': 48,
 'cup': 49,
 'flavors': 50,
 'sweet': 51,
 'recommend': 52,
 'brand': 53,
 'delicious': 54,
 'since': 55,
 'store': 56,
 'order': 57,
 'way': 58,
 'many': 59,
 'go': 60,
 'think': 61,
 'two': 62,
 'favorite': 63,
 'still': 64,
 'know': 65,
 'gluten': 66,
 'salt': 67,
 'nice': 68,
 'tastes': 69,
 'add': 70,
 'got': 71,
 'makes': 72,
 'drink': 73,
 'bit':