# Amazon Fine Food Reviews

## Sentiment Analysis

https://www.kaggle.com/snap/amazon-fine-food-reviews

## Context
This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. It also includes reviews from all other Amazon categories.

Data includes:
- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews

Attribute Info:
- Id : row id of the review
- ProductId : unique id of the product
- UserId : unique id of the user
- ProfileName : profile name of the user
- HelpfulnessNumerator : number of users who found the review helpful
- HelpfulnessDenominator : number of users who indicated whether they found the review helpful or not
- Score : rating from 1 to 5
- Time : timestamp of the review
- Summary : brief summary of the review
- Text : text of the review

#### We will eliminate some of the features, such as Id and Score.

### Objective:
Given a review, determine whether it is positive (rating 4 or 5) or negative (rating 1 or 2).

### Importing Packages

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer,CountVectorizer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer

In [2]:
# use sqlite3 to connect to the SQLite database file
connection=sqlite3.connect("database.sqlite")

In [32]:
filtered_data=pd.read_sql_query("""
Select * from Reviews where Score != 3
""",connection)

In [33]:
filtered_data.shape

(525814, 10)

In [34]:
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


### Removing duplicates

In [35]:
# rows that share the same user, profile name, timestamp and text are treated as duplicates
filtered_data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"],keep="first",inplace=True)

In [36]:
filtered_data.shape

(364173, 10)

### Cleaning data with common-sense checks

HelpfulnessNumerator should always be <= HelpfulnessDenominator: HelpfulnessNumerator counts the thumbs-up votes a review received, while HelpfulnessDenominator counts all votes, thumbs-up and thumbs-down together.

<br>
So let's check whether any data points violate this condition so that we can remove them.
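
Before filtering, it can help to see which rows break the rule. The following is a minimal sketch on made-up rows (the values and the `is_consistent` helper are illustrative assumptions, not taken from the dataset):

```python
# Toy rows standing in for the Reviews table; the values are invented for illustration.
rows = [
    {"Id": 1, "HelpfulnessNumerator": 1, "HelpfulnessDenominator": 1},
    {"Id": 2, "HelpfulnessNumerator": 3, "HelpfulnessDenominator": 2},  # violates the rule
    {"Id": 3, "HelpfulnessNumerator": 0, "HelpfulnessDenominator": 4},
]

def is_consistent(row):
    """Thumbs-up votes can never exceed the total number of votes."""
    return row["HelpfulnessNumerator"] <= row["HelpfulnessDenominator"]

# Ids of the rows that would be dropped, and the rows that survive the check.
violations = [r["Id"] for r in rows if not is_consistent(r)]
kept = [r for r in rows if is_consistent(r)]

print(violations)
print(len(kept))
```

The same condition, written as a boolean mask on the DataFrame, is what the next cell applies to the real data.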

In [37]:
filtered_data=filtered_data[filtered_data.HelpfulnessNumerator<=filtered_data.HelpfulnessDenominator]

In [38]:
filtered_data.shape

(364171, 10)

In [39]:
filtered_data.Score.value_counts()

5    250963
4     56098
1     36306
2     20804
Name: Score, dtype: int64

### Replacing scores with positive or negative

In [None]:
#Give review with score>3 as positive review and Score<3 as negative review

In [40]:
def f(num):
    if num <3:
        return "negative"
    else:
        return "positive"

In [41]:
filtered_data["Score"]=filtered_data["Score"].apply(f)

In [43]:
filtered_data.Score.value_counts()

positive    307061
negative     57110
Name: Score, dtype: int64

In [46]:
## Highly Imbalanced Dataset
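
To put a number on the imbalance, here is a small sketch that takes the two class counts from the output above and derives class shares and "balanced" class weights. The weighting rule, total / (number of classes * class count), is the one scikit-learn uses for `class_weight='balanced'`; whether to apply such weights later is left open here.

```python
# Class counts copied from the value_counts() output above.
counts = {"positive": 307061, "negative": 57110}

total = sum(counts.values())
# Fraction of all reviews that fall in each class.
shares = {label: n / total for label, n in counts.items()}
# How many positive reviews there are per negative review.
ratio = counts["positive"] / counts["negative"]
# "Balanced" weights: rarer classes get proportionally larger weights.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.3f}, weight={weights[label]:.3f}")
print(f"positive:negative ratio = {ratio:.2f}")
```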

## Text To d-dim Vector

#### Why convert?

If we convert text to vectors, we can use linear algebra techniques to classify and visualize the data.
For example, we can look for a plane (or line) that separates the points so that positive reviews fall on one side and negative reviews on the other.
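
As a toy illustration of the separating-plane idea, the sketch below classifies a point by the sign of w·x + b. The 2-d vectors, the weights `w` and the offset `b` are hand-picked assumptions; in practice they would be learned from the vectorized reviews.

```python
def side_of_plane(x, w, b):
    """Return 'positive' if the point x lies on the w.x + b >= 0 side, else 'negative'."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "positive" if score >= 0 else "negative"

# Hypothetical plane and hypothetical 2-d review vectors.
w, b = [1.0, -1.0], 0.0
reviews = {"great tasty": [3.0, 0.0], "stale awful": [0.0, 2.0]}

labels = {text: side_of_plane(vec, w, b) for text, vec in reviews.items()}
print(labels)
```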

# [3].  Text Preprocessing.

Now that we have finished deduplication, our data requires some preprocessing before we go further with the analysis and build the prediction model.

Hence, in the preprocessing phase we do the following, in the order below:

1. Begin by removing the HTML tags
2. Remove any punctuation or a limited set of special characters such as , or . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (as it was researched that there are no 2-letter adjectives)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)<br>

After which we collect the words used to describe positive and negative reviews.
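
Steps 1 to 6 above can be sketched with the standard library alone. This is only an illustration: it swaps in a regular expression for HTML stripping, uses a tiny made-up stopword set, and leaves out stemming (step 7), which needs NLTK. The function name `clean_review` is hypothetical.

```python
import re

# Tiny illustrative stopword set; the notebook uses a much longer list.
STOPWORDS = {"the", "this", "was", "and", "for", "with"}

def clean_review(text, stopwords=STOPWORDS):
    text = re.sub(r"<[^>]+>", " ", text)          # 1. drop HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. drop punctuation / special characters
    kept = []
    for word in text.split():
        if not word.isalpha():                    # 3. keep words made of letters only
            continue
        if len(word) <= 2:                        # 4. keep words longer than 2 characters
            continue
        word = word.lower()                       # 5. lowercase
        if word in stopwords:                     # 6. drop stopwords
            continue
        kept.append(word)
    return " ".join(kept)

print(clean_review("This taffy was GREAT!<br />Bought 2 bags for $10."))
```

The cells that follow walk through a fuller version of the same idea on a real review, using BeautifulSoup for the HTML and a Snowball stemmer at the end.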

In [53]:
sent_250000=filtered_data["Text"].values[250000]
print(sent_250000)

Since this product's launch my cat has been hooked.  He gets the best of the best and the Fancy Feast Medleys line is most assuredly that (I purchase all but the soufle varieties, which my cat seems to not care for as much).  In my local stores, a 12-pack of this would run me anywhere from $10 - $13.  It was worth that expense, but finding them on Amazon definitely has been a huge savings to me!  the subscribe and save option is perfect - I pay just a few dollars more for 24 cans than I would have for half that!  I won't even consider purchasing these in a store again - so long as Amazon has this phenomenal price and program, they have my business.<br /><br />On a side note - if you haven't ever seen these in person, the food actually looks quasi-edible for humans!  I take pride in serving my cat these, knowing he's getting a tasty and nutrious snack.  Also, though my cat doesn't care for the soufle variety, I know plenty that do.  Try them all and find Your cat's favorites!


In [54]:
from bs4 import BeautifulSoup

In [63]:
soup=BeautifulSoup(sent_250000,"lxml")
text=soup.get_text()
print(text)

Since this product's launch my cat has been hooked.  He gets the best of the best and the Fancy Feast Medleys line is most assuredly that (I purchase all but the soufle varieties, which my cat seems to not care for as much).  In my local stores, a 12-pack of this would run me anywhere from $10 - $13.  It was worth that expense, but finding them on Amazon definitely has been a huge savings to me!  the subscribe and save option is perfect - I pay just a few dollars more for 24 cans than I would have for half that!  I won't even consider purchasing these in a store again - so long as Amazon has this phenomenal price and program, they have my business.On a side note - if you haven't ever seen these in person, the food actually looks quasi-edible for humans!  I take pride in serving my cat these, knowing he's getting a tasty and nutrious snack.  Also, though my cat doesn't care for the soufle variety, I know plenty that do.  Try them all and find Your cat's favorites!


In [64]:
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [67]:
text=decontracted(text)
print(text)

Since this product is launch my cat has been hooked.  He gets the best of the best and the Fancy Feast Medleys line is most assuredly that (I purchase all but the soufle varieties, which my cat seems to not care for as much).  In my local stores, a 12-pack of this would run me anywhere from $10 - $13.  It was worth that expense, but finding them on Amazon definitely has been a huge savings to me!  the subscribe and save option is perfect - I pay just a few dollars more for 24 cans than I would have for half that!  I will not even consider purchasing these in a store again - so long as Amazon has this phenomenal price and program, they have my business.On a side note - if you have not ever seen these in person, the food actually looks quasi-edible for humans!  I take pride in serving my cat these, knowing he is getting a tasty and nutrious snack.  Also, though my cat does not care for the soufle variety, I know plenty that do.  Try them all and find Your cat is favorites!


In [68]:
# remove words that contain digits (raw string avoids invalid-escape warnings)
text = re.sub(r"\S*\d\S*", "", text).strip()
print(text)

Since this product is launch my cat has been hooked.  He gets the best of the best and the Fancy Feast Medleys line is most assuredly that (I purchase all but the soufle varieties, which my cat seems to not care for as much).  In my local stores, a  of this would run me anywhere from  -   It was worth that expense, but finding them on Amazon definitely has been a huge savings to me!  the subscribe and save option is perfect - I pay just a few dollars more for  cans than I would have for half that!  I will not even consider purchasing these in a store again - so long as Amazon has this phenomenal price and program, they have my business.On a side note - if you have not ever seen these in person, the food actually looks quasi-edible for humans!  I take pride in serving my cat these, knowing he is getting a tasty and nutrious snack.  Also, though my cat does not care for the soufle variety, I know plenty that do.  Try them all and find Your cat is favorites!


In [69]:
# remove special characters:
text = re.sub('[^A-Za-z0-9]+', ' ', text)
print(text)

Since this product is launch my cat has been hooked He gets the best of the best and the Fancy Feast Medleys line is most assuredly that I purchase all but the soufle varieties which my cat seems to not care for as much In my local stores a of this would run me anywhere from It was worth that expense but finding them on Amazon definitely has been a huge savings to me the subscribe and save option is perfect I pay just a few dollars more for cans than I would have for half that I will not even consider purchasing these in a store again so long as Amazon has this phenomenal price and program they have my business On a side note if you have not ever seen these in person the food actually looks quasi edible for humans I take pride in serving my cat these knowing he is getting a tasty and nutrious snack Also though my cat does not care for the soufle variety I know plenty that do Try them all and find Your cat is favorites 


In [70]:
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [74]:
# Combining all the above
from tqdm import tqdm
sno = nltk.stem.SnowballStemmer('english')
preprocessed_reviews = []
# tqdm prints a progress bar
for sentance in tqdm(filtered_data['Text'].values):
    sentance = re.sub(r"http\S+", "", sentance) # remove URLs
    sentance = BeautifulSoup(sentance, 'lxml').get_text() # remove html tags
    sentance = decontracted(sentance) # expand contractions
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip() # remove words containing digits
    sentance = re.sub('[^A-Za-z]+', ' ', sentance) # keep letters only
    # https://gist.github.com/sebleier/554280
    # lowercase, drop stopwords, then stem each remaining word
    sentance = ' '.join(sno.stem(e.lower()) for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentance.strip())

100%|█████████████████████████████████████████████████████████████████████████| 364171/364171 [07:56<00:00, 763.77it/s]


In [75]:
preprocessed_reviews[5000]

'latest updat past babi food stage may miss formal announc recent heard switch packag late complet bpa free confirm custom servic leav remain review wife made babi food daughter first eight month daughter go begin daycar certain conveni factor send jar babi food bought earth best organ figur closest bought brand make varieti pack nice easi way stock differ kind babi enjoy enjoy sever flavor though hate well pleas babi seem pleas needless say pretti surpris read review bpa lid spent time tri verifi lot differ blog state could not find anyth definit bother clear earth best know peopl concern bpa lid facebook site litter peopl ask question admin answer way food safeti babi health highest prioriti call us discuss slight paraphras go websit zero articl bpa none faq section either noth not worri enough issu stop use purchas would felt whole lot better compani product would acknowledg parent concern tell parent someth hey check food laboratori test regular amount bpa found food x low no healt

In [79]:
import pickle
# save the cleaned reviews; the with-block closes the file automatically
with open("CleanedData.pickle","wb") as pickle_out:
    pickle.dump(preprocessed_reviews,pickle_out)

In [81]:
# load the cleaned reviews back; the with-block closes the file automatically
with open("CleanedData.pickle","rb") as pickle_in:
    data=pickle.load(pickle_in)

In [85]:
data[5000]

'latest updat past babi food stage may miss formal announc recent heard switch packag late complet bpa free confirm custom servic leav remain review wife made babi food daughter first eight month daughter go begin daycar certain conveni factor send jar babi food bought earth best organ figur closest bought brand make varieti pack nice easi way stock differ kind babi enjoy enjoy sever flavor though hate well pleas babi seem pleas needless say pretti surpris read review bpa lid spent time tri verifi lot differ blog state could not find anyth definit bother clear earth best know peopl concern bpa lid facebook site litter peopl ask question admin answer way food safeti babi health highest prioriti call us discuss slight paraphras go websit zero articl bpa none faq section either noth not worri enough issu stop use purchas would felt whole lot better compani product would acknowledg parent concern tell parent someth hey check food laboratori test regular amount bpa found food x low no healt

# Bag of Words(BOW)

Bag of words is a technique in which text is converted to vectors.<br>
It is widely used for classification/filtering problems.<br>
Typically each vector holds the frequency of every vocabulary word in the document, so we can measure the similarity between vectors and classify them.<br>

It is one of the most widely used information retrieval (IR) techniques.
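
To make the idea concrete, here is a minimal pure-Python sketch of bag of words on two made-up cleaned reviews. It mirrors what a count-based vectorizer does conceptually (one column per vocabulary word, one row per document); the helper name `bow_vector` is an assumption for illustration.

```python
from collections import Counter

# Hypothetical cleaned (stemmed) reviews.
docs = ["tasti babi food", "babi food babi enjoy"]

# Vocabulary: sorted unique words, one vector position per word.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Real corpora produce very long, mostly-zero vectors, which is why the vectorizer below returns a sparse matrix.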

In [87]:
count_vect = CountVectorizer() #scikit-learn

final_counts = count_vect.fit_transform(data)

In [89]:
print(type(final_counts))
print(final_counts.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(364171, 86290)


In [90]:
count_vect = CountVectorizer(ngram_range=(1,2))
final_uni_bigrams_count = count_vect.fit_transform(data)
print(type(final_uni_bigrams_count))
print(final_uni_bigrams_count.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(364171, 2924350)


Observation: with unigrams and bigrams we get 2,924,350 unique features (about 29 lakh, i.e. 2.9 million), whereas the unigram-only bag of words gave 86,290 unique words.
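
The jump in feature count comes from every adjacent word pair becoming its own feature. A short sketch of how unigrams and bigrams are generated from one tokenized review (the `ngrams` helper is a hypothetical name, written here only for illustration):

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "love hot sauc".split()
print(ngrams(tokens))
```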

In [91]:
# save the sparse unigram+bigram count matrix; the with-block closes the file automatically
with open("n_grams.pickle","wb") as pickle_out:
    pickle.dump(final_uni_bigrams_count,pickle_out)

# TF-IDF

### Term Frequency - Inverse Doc Frequency

Term Frequency = (number of occurrences of word wi in the document / number of words in the document)

Inverse Doc Frequency = log(number of documents in the corpus / number of documents that contain word wi)


tfIdf = tf * idf <br>

Term frequency increases as the word occurs more often in the document.

Inverse document frequency increases as the word becomes rarer in the document corpus.
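
The definitions above can be worked through by hand on a tiny corpus. This sketch uses three made-up tokenized documents and the plain textbook formulas (tf as a relative count, idf as the log of total documents over documents containing the word); the function names are illustrative.

```python
import math

# Hypothetical tokenized corpus of three short documents.
docs = [
    "hot sauc tasti".split(),
    "hot tea".split(),
    "tasti tea tea".split(),
]

def tf(word, doc):
    """Occurrences of the word in the document divided by the document length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log(total documents / documents containing the word); the word must occur somewhere."""
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# 'sauc' appears in one document, 'hot' in two, so 'sauc' scores higher in docs[0].
for word in ["sauc", "hot"]:
    print(word, round(tfidf(word, docs[0], docs), 4))
```

Note that scikit-learn's `TfidfVectorizer`, with its default settings, uses a smoothed variant, idf = ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row, so its values will differ from this sketch.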

In [92]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(data)

In [93]:
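# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
features = tf_idf_vect.get_feature_names_out()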
len(features)

2924350

### Function to retrieve top 25 features/words for a given review

In [96]:
def top_tfidf_feats(row,features,top_n=25):
    # argsort sorts ascending, so reverse it and keep the indices of the top_n largest tf-idf values
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i],row[i]) for i in topn_ids]

    df = pd.DataFrame(top_feats, columns=['features','tfidf'])
    return df

top_tfidf = top_tfidf_feats(final_tf_idf[10,:].toarray()[0],features,10)

In [97]:
top_tfidf

Unnamed: 0,features,tfidf
0,sauc,0.255037
1,hot sauc,0.182221
2,tequila,0.166049
3,love hot,0.147328
4,inclan realiz,0.127044
5,know cactus,0.127044
6,citi bum,0.127044
7,incred servic,0.127044
8,bum magic,0.127044
9,tasteless burn,0.127044


# Word2Vec

In Word2Vec, each word is converted to a vector, whereas in bag of words each sentence (review) is converted to a vector. <br/>
https://www.tensorflow.org/tutorials/word2vec <br>
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
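
Word vectors are usually compared with cosine similarity, which is what gensim's `similarity` method reports for a pair of words. Here is a minimal sketch of that measure; the 3-d vectors are invented for illustration and are not real Word2Vec values.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors; real GoogleNews vectors have 300 dimensions.
toy = {
    "woman": [0.9, 0.8, 0.1],
    "man": [0.8, 0.9, 0.2],
    "sauce": [0.1, 0.0, 0.9],
}

print(cosine_similarity(toy["woman"], toy["man"]))
print(cosine_similarity(toy["woman"], toy["sauce"]))
```

Vectors pointing in similar directions score close to 1, unrelated ones score close to 0.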

In [104]:
#!pip install gensim

In [101]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

In [103]:
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

In [106]:
model["king"]

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [107]:
model.similarity('woman','man')

0.76640123

In [109]:
model.most_similar('woman')

[('man', 0.7664012312889099),
 ('girl', 0.7494640946388245),
 ('teenage_girl', 0.7336829900741577),
 ('teenager', 0.631708562374115),
 ('lady', 0.6288785934448242),
 ('teenaged_girl', 0.6141784191131592),
 ('mother', 0.607630729675293),
 ('policewoman', 0.6069462299346924),
 ('boy', 0.5975908041000366),
 ('Woman', 0.5770983099937439)]

# Avg W2v, TFIDF-W2v

Average W2V: given a sentence, compute the w2v vector of every word in the sentence, sum the vectors, and divide by the number of words; this gives the average W2V of the sentence.<br>

Note: Avg W2V is used to get the sentence vector.
<br>

TF-IDF weighted W2V: compute the w2v vector of each word in the sentence and multiply it by the tfidf of that word; sum these weighted vectors over all words and divide by the sum of the tfidf values.

Avg W2V = sum( w2v(wi) ) / (no. of words in the sentence)
<br>
TF-IDF weighted W2V = sum( ti * w2v(wi) ) / sum(ti)
<br>
Here 'ti' is the tfidf of word wi.
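
Both formulas can be checked on a toy example. In this sketch the 2-d word vectors and the tfidf weights are invented, and words without a vector (here "zzz") are simply skipped; the helper names are assumptions for illustration.

```python
# Made-up 2-d word vectors and tfidf weights.
toy_w2v = {"hot": [1.0, 0.0], "sauc": [0.0, 1.0]}
toy_tfidf = {"hot": 1.0, "sauc": 3.0}

def avg_w2v(words, w2v):
    """Plain average of the vectors of all words that have a vector."""
    vecs = [w2v[w] for w in words if w in w2v]
    if not vecs:
        return None  # no usable word in the sentence
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def tfidf_w2v(words, w2v, weights):
    """Average of the word vectors, each weighted by the word's tfidf."""
    known = [w for w in words if w in w2v and w in weights]
    total = sum(weights[w] for w in known)
    if total == 0:
        return None  # no usable word in the sentence
    dim = len(w2v[known[0]])
    return [sum(weights[w] * w2v[w][i] for w in known) / total for i in range(dim)]

words = ["hot", "sauc", "zzz"]
print(avg_w2v(words, toy_w2v))
print(tfidf_w2v(words, toy_w2v, toy_tfidf))
```

The weighted version pulls the sentence vector toward "sauc" because that word carries the larger tfidf weight.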

In [113]:
# average Word2Vec
# compute the average word2vec for each review.
sent_vectors = [] # the avg-w2v for each sentence/review is stored in this list
for sent in data: # for each review/sentence
    sent_vec = np.zeros(300) # GoogleNews word vectors have 300 dimensions
    cnt_words = 0 # num of words with a valid vector in the sentence/review
    for word in sent.split(): # data holds strings, so split each review into words
        try:
            vec = model[word] # model is a KeyedVectors object, so index it directly
            sent_vec += vec
            cnt_words += 1
        except KeyError: # word is not in the word2vec vocabulary
            pass
    if cnt_words != 0: # avoid dividing by zero when no word has a vector
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)

print(len(sent_vectors)) #no.of reviews
print(len(sent_vectors[0])) #no.of dimensions

364171
300


In [None]:
# TF-IDF weighted Word2Vec
# Assumption: the undefined names list_of_sent and w2v_model in the original cell
# refer to `data` and `model` from the cells above.
tfidf_vocab = tf_idf_vect.vocabulary_ # dict mapping each tfidf term to its column index
# final_tf_idf is the sparse matrix with row = sentence, col = term and cell_val = tfidf

tfidf_sent_vectors = [] # the tfidf-w2v for each sentence/review is stored in this list
for row, sent in enumerate(data[0:1000]): # for each of the first 1000 reviews/sentences
    sent_vec = np.zeros(300) # GoogleNews word vectors have 300 dimensions
    weight_sum = 0 # sum of the tfidf weights of the words with a valid vector
    for word in sent.split(): # for each word in a review/sentence
        col = tfidf_vocab.get(word) # dict lookup instead of a slow list.index() search
        if col is None or word not in model: # skip words missing from either vocabulary
            continue
        # obtain the tfidf of the word in this sentence/review
        tfidf = final_tf_idf[row, col]
        sent_vec += (model[word] * tfidf)
        weight_sum += tfidf
    if weight_sum != 0: # avoid dividing by zero when no word was usable
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)