# Amazon Fine Food Reviews Analysis
### Context
This dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

###### Information about dataset
###### Reviews from Oct 1999 - Oct 2012
###### 568,454 reviews
###### 256,059 users
###### 74,258 products
###### 260 users with > 50 reviews

## Attribution Information
1. ID
2. ProductId
3. UserId
4. ProfileName
5. HelpfulnessNumerator - Number of users who found the review helpful
6. HelpfulnessDenominator - Number of users who indicated whether they found the review helpful or not
7. Score - Rating between 1 and 5 ****
8. Time - Timestamp for the review
9. Summary - Brief summary of the review
10. Text - Text of the review *****


In [1]:
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


In [2]:
import nltk
import string
import sqlite3
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

In [3]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [4]:
con= sqlite3.connect(r"F:\AI-ML\Adv Ai&ML WKDY-MR 2023\4. NLP\5th April 2023\database.sqlite")
con

<sqlite3.Connection at 0x1c45ea3ba40>

In [5]:
filtered_data= pd.read_sql_query("""select * from reviews where score!= 3 limit 5000""", con)
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [6]:
def partition(x):
    if x<3:
        return 0
    return 1

In [7]:
# applying mapping
actual= filtered_data['Score']
PositiveNegative= actual.map(partition)
filtered_data['Score']= PositiveNegative
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [8]:
sorted_data= filtered_data.sort_values(by='ProductId', kind= 'quicksort', ascending=True)

In [9]:
sorted_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
2546,2774,B00002NCJC,A196AJHU9EASJN,Alex Chaffee,0,0,1,1282953600,thirty bucks?,Why is this $[...] when the same product is av...
2547,2775,B00002NCJC,A13RRPGE79XFFH,reader48,0,0,1,1281052800,Flies Begone,We have used the Victor fly bait for 3 seasons...
1145,1244,B00002Z754,A3B8RCEI0FXFI6,B G Chase,10,10,1,962236800,WOW Make your own 'slickers' !,I just received my shipment and could hardly w...
1146,1245,B00002Z754,A29Z5PI9BW2PU3,Robbie,7,7,1,961718400,Great Product,This was a really good idea and the final prod...
2942,3204,B000084DVR,A1UGDJP1ZJWVPF,"T. Moore ""thoughtful reader""",1,1,1,1177977600,Good stuff!,I'm glad my 45lb cocker/standard poodle puppy ...


In [10]:
final= sorted_data.drop_duplicates(subset= {'UserId', 'ProductId', 'Text'}, keep='first')
final.shape

(4993, 10)

In [11]:
final['Score'].value_counts()

Score
1    4183
0     810
Name: count, dtype: int64

In [12]:
final= final[final['HelpfulnessNumerator']<= final['HelpfulnessDenominator']]
final.shape

(4993, 10)

In [13]:
samples= []
for i in range(5):
    samples.append(final['Text'][np.random.randint(0, final.shape[0])])
samples

["These are so nice and creamy!  Usually I don't justify buying k-cups for things like cocoa that you can get cheaper in a packet, but this is so convenient and very good tasting .. way better than most powdered cocoa packets I've tried.  As far as price comparison, it is equivalent to gourmet-types of cocoa that would easily cost $1+ per packet.  I also like that it comes in a variety pack with 3 flavor options.  Highly recommend",
 'Melitta Pods are an inexpensive alternative to k-cups  that also save in the amount of material you have to throw away. Buyer Beware: Pods vary GREATLY in the amount of coffee they have in them. These allow for a good size cup of coffee.',
 'I enjoy my coffee with plenty of sugar and a flavored creamer. This provided a good base for that experience. As a non-connoisseur, this is just about right: not too strong, not bitter, good "coffee" flavor.<br /><br />For the sake of comparison, I most often drink the ground version of <a href="http://www.amazon.com/

In [14]:
# Text preprocessing
import re
from bs4 import BeautifulSoup

In [15]:
soup= BeautifulSoup(samples[-1], 'html')
soup.get_text()

'I have 4 adult Shih-Tzus and while they do prefer it when I cook for them...... they do eat this and it has helped eliminate the brownish tear stains they get when I feed them some dry foods.'

In [16]:
def decontracted(phrase):
    # specific
    phrase= re.sub(r"won't", "will not", phrase)
    phrase= re.sub(r"can't", "can not", phrase)
    phrase= re.sub(r"don't", "do not", phrase)
    phrase= re.sub(r"n't", "not", phrase)
    phrase= re.sub(r"won't", "will not", phrase)
    phrase= re.sub(r"'ve", "have", phrase)
    phrase= re.sub(r"'m", "am", phrase)
    phrase= re.sub(r"'re", "are", phrase)
    phrase= re.sub(r"'t", "have", phrase)
    phrase= re.sub(r"'ll", "will", phrase)

    return phrase

In [17]:
decontracted("i 'm at the cafe, i 've done my project 'll you come along with me")

'i am at the cafe, i have done my project will you come along with me'

In [33]:
# combining all preprocessing
from tqdm import tqdm
from nltk.tokenize import word_tokenize

lamm= WordNetLemmatizer()

preprocessed_review= []
for sent in tqdm(final['Text'].values):
    sent= sent.lower()
    sent= re.sub(r'https\S+',"", sent)
    sent= re.sub(r'http\S+',"", sent)
    sent= BeautifulSoup(sent, 'html').get_text()
    sent= decontracted(sent)
    sent= re.sub(r'[^a-zA-Z]+'," ", sent)
    sent= ' '.join([lamm.lemmatize(word) for word in word_tokenize(sent) if word not in stopwords.words('english')])
    preprocessed_review.append(sent.strip())

100%|██████████████████████████████████████████████████████████████████████████████| 4993/4993 [01:56<00:00, 42.93it/s]


In [34]:
preprocessed_review[:10]

['product available victor trap unreal course total fly genocide pretty stinky right nearby',
 'used victor fly bait season beat great product',
 'received shipment could hardly wait try product love slicker call instead sticker removed easily daughter designed sign printed reverse use car window printed beautifully havehe print shop program going lot fun product window everywhere surface like tv screen computer monitor',
 'really good idea final product outstanding use decal car window everybody asks bought decal made two thumb',
 'iam glad lb cocker standard poodle puppy love stuff trust brand superior nutrition compare label previous feed pedigree mostly corn little dude healthy happy high energy glossy coat also superior nutrition produce smaller compact stool',
 'using food month find excellent fact two dog coton de tulear lb standard poodle puppy lb love food thriving coat excellent condition overall structure perfect good tasting dog good good deal owner around best food ever us

## Building ML algo 
### Bag of word / ngram / tfidf
### Naive bayes / random forest / xgboost

### Word2Vec / Glove / Bert -- DL

In [35]:
# Bag of word
from sklearn.feature_extraction.text import CountVectorizer

count_vect= CountVectorizer()
count_vect.fit(preprocessed_review)
print("some features name", count_vect.get_feature_names_out()[:20])
print("="*30)
final_counts= count_vect.transform(preprocessed_review)
print(type(final_counts))
print(final_counts.shape)
print(final_counts.shape[1])


some features name ['aa' 'aahhhs' 'aback' 'abandon' 'abates' 'abbott' 'abby' 'abdominal'
 'abiding' 'ability' 'able' 'abor' 'aboulutely' 'absence' 'absent'
 'absoloutely' 'absolute' 'absolutely' 'absolutley' 'absolutly']
<class 'scipy.sparse._csr.csr_matrix'>
(4993, 11716)
11716


In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf= TfidfVectorizer()
tfidf.fit(preprocessed_review)
print("some features name", tfidf.get_feature_names_out()[:20])
print("="*30)
final_counts_tf= tfidf.transform(preprocessed_review)
print(type(final_counts_tf))
print(final_counts_tf.shape)
print(final_counts_tf.shape[1])

some features name ['aa' 'aahhhs' 'aback' 'abandon' 'abates' 'abbott' 'abby' 'abdominal'
 'abiding' 'ability' 'able' 'abor' 'aboulutely' 'absence' 'absent'
 'absoloutely' 'absolute' 'absolutely' 'absolutley' 'absolutly']
<class 'scipy.sparse._csr.csr_matrix'>
(4993, 11716)
11716


In [40]:
vectors= pd.DataFrame(final_counts_tf.toarray(), columns=tfidf.get_feature_names_out())
vectors

Unnamed: 0,aa,aahhhs,aback,abandon,abates,abbott,abby,abdominal,abiding,ability,...,zippy,zito,zola,zomg,zon,zoo,zpkm,zucchini,zupas,zuppa
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4988,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4989,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4990,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4991,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
y= final['Score'].values

In [47]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test= train_test_split(vectors, y, test_size=0.2, random_state=101)

In [50]:
y_train

array([1, 1, 1, ..., 1, 0, 1], dtype=int64)

In [52]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate_model(actual, predicted):
    accuracy= accuracy_score(actual, predicted)
    con_mat= confusion_matrix(actual, predicted)
    classi_report= classification_report(actual, predicted)

    return accuracy, con_mat, classi_report

In [53]:
from sklearn.naive_bayes import MultinomialNB

ratings= MultinomialNB().fit(x_train, y_train)
y_pred_train_nb= ratings.predict(x_train)
y_pred_test_nb= ratings.predict(x_test)    

In [54]:
evaluate_model(y_train, y_pred_train_nb)

(0.8345017526289434,
 array([[   2,  660],
        [   1, 3331]], dtype=int64),
 '              precision    recall  f1-score   support\n\n           0       0.67      0.00      0.01       662\n           1       0.83      1.00      0.91      3332\n\n    accuracy                           0.83      3994\n   macro avg       0.75      0.50      0.46      3994\nweighted avg       0.81      0.83      0.76      3994\n')

In [55]:
evaluate_model(y_test, y_pred_test_nb)

(0.8518518518518519,
 array([[  0, 148],
        [  0, 851]], dtype=int64),
 '              precision    recall  f1-score   support\n\n           0       0.00      0.00      0.00       148\n           1       0.85      1.00      0.92       851\n\n    accuracy                           0.85       999\n   macro avg       0.43      0.50      0.46       999\nweighted avg       0.73      0.85      0.78       999\n')