# Amazon Food Products Reviews Analysis

Data source : https://www.kaggle.com/snap/amazon-fine-food-reviews

## Context

This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

Reviews.csv: Pulled from the corresponding SQLite table named Reviews in database.sqlite
database.sqlite: Contains the table 'Reviews'

Data includes:
- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews

Attribute Information:
1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review

#### Objective:
Given a review, we'll try to determine whether the review is positive (Rating of 4 or 5) or negative (rating of 1 or 2).

# Loading the data (from SQLite)

The dataset is available in two forms
1. .csv file
2. SQLite Database

The SQLite database is easier to query, so we load the data from SQLite rather than from the .csv file.
Since we only want the overall sentiment of each review (positive or negative), we deliberately ignore all reviews with a Score of 3. If the rating is 4 or 5, the review is labeled "positive"; if it is 1 or 2, it is labeled "negative".

In [1]:
%matplotlib inline

In [3]:
import sqlite3

In [3]:
# we create a connection to our database.
con = sqlite3.connect("database.sqlite")
# Now we can start fetching Data

In [4]:
import pandas as pd
filtered_dataset = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE score != 3
""",con)

In [5]:
filtered_dataset.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [6]:
filtered_dataset.shape

(525814, 10)

In [7]:
def assignsentiment(x):
    if x < 3:
        return "negative"
    else: return "positive"
# Let's map our function on score column
filtered_dataset["Score"] = filtered_dataset["Score"].map(assignsentiment)

In [8]:
filtered_dataset.head()
# Scores have been replaced by sentiments (+ or -)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...


# Exploratory Data Analysis (EDA)

## Data Cleaning : Deduplication (removing duplicates)

In [9]:
pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE score!=3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductId
"""    
,con)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


In [10]:
pd.read_sql_query("""SELECT *
From Reviews
WHERE score !=3 AND UserId="ABXLMWJIXXAIN"
ORDER BY ProductId
"""
,con)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,320691,B000CQ26E0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",0,0,4,1187740800,"Fast, Easy and organic","For speed and wholesome goodness, Annie's can ..."
1,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
2,468954,B004DMGQKE,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",0,0,5,1351209600,Awesome service and great products,We sent this product as a gift to my husband's...


In [11]:
#A1UQRSCLF8GW1T
pd.read_sql_query(""" SELECT *
FROM Reviews
WHERE score != 3 AND UserId = "A1UQRSCLF8GW1T"
ORDER BY ProductId    
"""    
,con)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,483943,B003XDH6M6,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1315699200,"Yum, tasty",Didn't know what to expect from Newman's own l...
1,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [12]:
sorted_dataset = filtered_dataset.sort_values(by = ["ProductId"],axis=0,ascending=True,inplace = False, kind="quicksort",na_position="last")

In [13]:
# Use a list for `subset`: a set has no defined order and recent pandas versions may reject it
final_dataset = sorted_dataset.drop_duplicates(subset = ["UserId","ProfileName","Time","Text"], keep = "first", inplace = False)

In [14]:
final_dataset.shape

(364173, 10)

In [15]:
# Percentage of the filtered data that remains in final_dataset after deduplication
print((final_dataset.shape[0]/sorted_dataset.shape[0])*100)

69.25890143662969
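
The queries above show one user posting the same text, with the same timestamp, under several ProductIds, which is why duplicates are dropped on (UserId, ProfileName, Time, Text) after sorting by ProductId. As a minimal sketch of that logic without pandas, the toy rows and the helper `deduplicate` below are made up for illustration and are not part of the notebook:

```python
# Toy rows shaped like the Reviews table (all values are invented).
rows = [
    {"ProductId": "B2", "UserId": "U1", "ProfileName": "Ann", "Time": 100, "Text": "great wafers"},
    {"ProductId": "B1", "UserId": "U1", "ProfileName": "Ann", "Time": 100, "Text": "great wafers"},
    {"ProductId": "B3", "UserId": "U2", "ProfileName": "Bob", "Time": 200, "Text": "too salty"},
]

def deduplicate(rows, keys=("UserId", "ProfileName", "Time", "Text")):
    """Sort by ProductId, then keep the first row for each combination of `keys`."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r["ProductId"]):
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            kept.append(row)
    return kept

unique_rows = deduplicate(rows)
print([r["ProductId"] for r in unique_rows])  # ['B1', 'B3']
```

Sorting first makes the choice of the surviving row deterministic: the copy with the smallest ProductId is the one that is kept.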


In [16]:
# HelpfulnessNumerator can not logically be greater than HelpfulnessDenominator
# So let's remove data points where HelpfulnessNumerator > HelpfulnessDenominator
pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE score !=3 AND HelpfulnessNumerator > HelpfulnessDenominator
"""
,con)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
1,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...


In [17]:
final_dataset = final_dataset[final_dataset["HelpfulnessNumerator"]<=final_dataset["HelpfulnessDenominator"]]

In [18]:
final_dataset.shape

(364171, 10)

In [19]:
# How many positive and negative reviews are present in our dataset
final_dataset["Score"].value_counts()

positive    307061
negative     57110
Name: Score, dtype: int64

# Text preprocessing: Stemming, stop-words removal, lemmatization

Before moving on to analysis and prediction models, the review text still needs preprocessing and cleaning.
In this pre-processing phase we apply the following steps:
1. Remove HTML tags
2. Remove punctuation and a limited set of special characters such as , . # ! ?
3. Check that the word consists only of letters (no digits or alphanumeric mixes)
4. Check that the word is longer than 2 characters (two-letter words are rarely adjectives)
5. Convert the word to lowercase
6. Remove stop words
7. Finally, stem the word with the Snowball stemmer (generally considered an improvement over the Porter stemmer)
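
To make the steps concrete, here is a small sketch of steps 1 to 6 applied to a single made-up review, using only the standard library. The stop-word set `TOY_STOPWORDS` and the function `preprocess` are hypothetical stand-ins: the notebook itself uses NLTK's English stop-word list, and step 7 (stemming) is left out here because it needs NLTK.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
TOY_STOPWORDS = {"the", "and", "this", "was", "but"}

def preprocess(review, stopwords=TOY_STOPWORDS):
    """Apply steps 1-6 to one review and return the surviving words."""
    text = re.sub(r"<.*?>", " ", review)       # 1. strip HTML tags
    text = re.sub(r"[!?'\"#|]", "", text)      # 2. delete some punctuation ...
    text = re.sub(r"[.,/()|]", " ", text)      #    ... and turn the rest into spaces
    words = []
    for word in text.split():
        if word.isalpha() and len(word) > 2:   # 3. letters only, 4. longer than 2
            word = word.lower()                # 5. lowercase
            if word not in stopwords:          # 6. drop stop words
                words.append(word)
    return words

print(preprocess("This taffy was GREAT!<br />Soft, chewy and 100% fresh."))
# ['taffy', 'great', 'soft', 'chewy', 'fresh']
```

Note that `100%` is discarded by the letters-only check, and that the order of the checks matters: lowercasing happens before the stop-word lookup so that "This" matches "this".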

#### Let's import the packages that we will use during the following tasks.

In [7]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np

import string
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

import pickle

In [21]:
import re
stopwrd = set(stopwords.words("english")) #set of stopwords
snoball = nltk.stem.SnowballStemmer('english') #initialising the snowball stemmer
# Two helper functions: cleanhtml and cleanpunc
def cleanhtml(sentence): # replace any HTML tag in the text with a space
    htmlpattern = re.compile('<.*?>')
    cleantext = re.sub(htmlpattern,' ',sentence)
    return cleantext
def cleanpunc(sentence): # remove punctuation and a few special characters
    # inside [...] a "|" is a literal pipe, not an alternation, so each character is listed once
    cleaned = re.sub(r'[!?\'"#|]',r'',sentence)
    cleaned = re.sub(r'[.,/()|]',r' ',cleaned)
    return cleaned

In [22]:
print(stopwrd)
print("--------------------------------")
print(snoball.stem("delicious"))

{'ourselves', 'in', 'when', 'own', 'him', 'very', 'did', "it's", 'with', 'between', 'our', 'as', 'does', 've', 'for', 'my', 'at', 'will', 'while', 'herself', "haven't", 'he', 'your', 'itself', 'some', 'into', 'didn', 'off', 'were', 'are', "hadn't", 'am', "you've", 'weren', 'but', 'can', 'should', 'all', 'hasn', 'now', 'was', "wasn't", 'aren', "should've", "mightn't", "she's", 'such', "you'd", 'because', 'how', 'down', 'then', 'than', 'who', 'the', 'having', 'won', 'too', 'those', 'y', 'out', 'of', 't', 'shouldn', 'his', 'hadn', 'do', 'being', 'mustn', "mustn't", 'don', 'more', 'if', "weren't", 'hers', 'same', 'not', "isn't", 'has', 'yours', 'm', 'we', 'ma', 'doesn', 'is', 'under', 'both', "shan't", 'isn', 'during', 's', 'you', 'wouldn', 'through', 'i', 'to', 'mightn', 'himself', 'she', 'they', 'had', 'an', 'against', 'until', "hasn't", 'from', 'be', 'yourself', 'wasn', "won't", 'again', 're', 'myself', 'haven', 'll', "shouldn't", 'above', 'me', 'it', 'each', 'below', 'couldn', 'just', 

In [23]:
# Apply, step by step, the checks listed in the pre-processing phase.
# This cell takes a while to run because it processes about 364k reviews.
import time
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
final_string = []
all_positive_words = []  # stemmed words from positive reviews
all_negative_words = []  # stemmed words from negative reviews
scores = final_dataset['Score'].values  # fetch once instead of on every word
for i, sent in enumerate(final_dataset['Text'].values):
    filtered_sentence = []
    sent = cleanhtml(sent)  # remove HTML tags
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            # keep purely alphabetic words longer than 2 characters
            if cleaned_words.isalpha() and len(cleaned_words) > 2:
                if cleaned_words.lower() not in stopwrd:
                    s = snoball.stem(cleaned_words.lower()).encode('utf8')
                    filtered_sentence.append(s)
                    if scores[i] == 'positive':
                        all_positive_words.append(s)  # words used in positive reviews
                    elif scores[i] == 'negative':
                        all_negative_words.append(s)  # words used in negative reviews
    final_string.append(b" ".join(filtered_sentence))  # cleaned review as one bytes string
print("It took ", time.perf_counter() - start_time, "seconds")

It took  663.0993470728447 seconds


In [24]:
final_dataset["CleanedText"] = final_string
final_dataset['CleanedText']=final_dataset['CleanedText'].str.decode("utf-8")

#### The reviews ("Text") column of our dataset has now been cleaned and stemmed.

In [25]:
final_dataset["CleanedText"][0] # New Text(review) column

'bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better'

In [26]:
final_dataset["Text"][0] # Old Text(review) column

'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

* Let's save our final cleaned dataset to an SQLite table so that it can be leveraged in the future

In [27]:
conn = sqlite3.connect("final.sqlite")
c=conn.cursor()
conn.text_factory = str
final_dataset.to_sql('Reviews', conn,  schema=None, if_exists='replace', index=True, index_label=None, chunksize=None, dtype=None)

# Bag of Words (BoW) 

In [5]:
connn = sqlite3.connect("final.sqlite")
import pandas as pd
final_dataset = pd.read_sql_query("""
SELECT *
FROM Reviews
""",connn)

In [8]:
count_vect = CountVectorizer()
BoW_sparse_matrix = count_vect.fit_transform(final_dataset["CleanedText"].values)

In [9]:
BoW_sparse_matrix.get_shape()

(364171, 71624)

In [10]:
# Number of non-zero entries in the row of the second review (index 1)
print(BoW_sparse_matrix[1].nnz)

26


In [31]:
# As we can see our BoW matrix is extremely sparse !

In [11]:
print("the type of count vectorizer ",type(BoW_sparse_matrix)) # sparse matrix storage/representation technique
print("the shape of our text BOW vectorizer ",BoW_sparse_matrix.get_shape())
print("the number of unique words ", BoW_sparse_matrix.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of our text BOW vectorizer  (364171, 71624)
the number of unique words  71624
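
As an illustration of what the count matrix above contains, the following sketch builds a bag-of-words representation by hand for two toy, already-cleaned reviews. The names `docs`, `vocabulary` and `bow_vector` are made up for this example; `CountVectorizer` does the same job at scale and stores the result sparsely.

```python
from collections import Counter

docs = ["good tast good flavor", "bad tast"]  # toy, already-cleaned reviews

# Vocabulary: every distinct word, sorted so that the column order is reproducible.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

matrix = [bow_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['bad', 'flavor', 'good', 'tast']
print(matrix)      # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each row has one column per vocabulary word, so with 71,624 distinct words and only a few dozen words per review, almost every entry is zero.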


# Bi-Grams and n-Grams

In [34]:
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
freq_dist_positive=nltk.FreqDist(all_positive_words)
freq_dist_negative=nltk.FreqDist(all_negative_words)
print("Most Common Positive Words : ",freq_dist_positive.most_common(20))
print("Most Common Negative Words : ",freq_dist_negative.most_common(20))
print("It took ",time.perf_counter() - start_time, "seconds")

Most Common Positive Words :  [(b'like', 139429), (b'tast', 129047), (b'good', 112766), (b'flavor', 109624), (b'love', 107357), (b'use', 103888), (b'great', 103870), (b'one', 96726), (b'product', 91033), (b'tri', 86791), (b'tea', 83888), (b'coffe', 78814), (b'make', 75107), (b'get', 72125), (b'food', 64802), (b'would', 55568), (b'time', 55264), (b'buy', 54198), (b'realli', 52715), (b'eat', 52004)]
Most Common Negative Words :  [(b'tast', 34585), (b'like', 32330), (b'product', 28218), (b'one', 20569), (b'flavor', 19575), (b'would', 17972), (b'tri', 17753), (b'use', 15302), (b'good', 15041), (b'coffe', 14716), (b'get', 13786), (b'buy', 13752), (b'order', 12871), (b'food', 12754), (b'dont', 11877), (b'tea', 11665), (b'even', 11085), (b'box', 10844), (b'amazon', 10073), (b'make', 9840)]
It took  10.572324116275013 seconds


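Many of the most frequent words (like, tast, good, flavor, product, ...) appear in both the positive and the negative list, so single words alone separate the two classes poorly. Bi-grams can help here because they keep some of the local context around each word.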
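
A small sketch of this effect, using two invented stemmed reviews (the helper `ngrams` is hypothetical and only mimics what `CountVectorizer(ngram_range=(1,2))` extracts):

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

positive = "realli like tast".split()
negative = "dont like tast".split()

shared_unigrams = set(positive) & set(negative)
pos_bigrams = set(ngrams(positive, 2))
neg_bigrams = set(ngrams(negative, 2))

print(sorted(shared_unigrams))            # ['like', 'tast']
print(sorted(pos_bigrams - neg_bigrams))  # ['realli like']
print(sorted(neg_bigrams - pos_bigrams))  # ['dont like']
```

The unigrams "like" and "tast" occur in both reviews, while the bi-grams "realli like" and "dont like" tell them apart. The price is a much larger feature space.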

In [13]:
# uni-grams and bi-grams (ngram_range=(1,2)); larger n-grams work the same way
import time
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
count_vect2 = CountVectorizer(ngram_range=(1,2))
Final_bigrams = count_vect2.fit_transform(final_dataset["CleanedText"].values)
print(time.perf_counter()-start_time," seconds")

60.092445783170895  seconds


In [14]:
Final_bigrams.shape

(364171, 2923725)

In [15]:
type(Final_bigrams)

scipy.sparse.csr.csr_matrix

# TF-IDF

In [16]:
time_start = time.perf_counter()  # time.clock() was removed in Python 3.8
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tfidf = tf_idf_vect.fit_transform(final_dataset["CleanedText"].values)
print(time.perf_counter()-time_start," seconds")
print("The type of countvectorizer: ",type(final_tfidf))
print("The size of tfidf vectorizer: ", final_tfidf.get_shape())
print("The number of unique uni-grams and bi-grams in our dataset: ",final_tfidf.get_shape()[1])

91.91368483393458  seconds
The type of countvectorizer:  <class 'scipy.sparse.csr.csr_matrix'>
The size of tfidf vectorizer:  (364171, 2923725)
The number of unique uni-grams and bi-grams in our dataset:  2923725
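
For intuition about the values stored in this matrix, here is a hand-rolled sketch of the weighting, assuming scikit-learn's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row). The toy documents and the function `tfidf` are made up for illustration and only cover uni-grams.

```python
import math
from collections import Counter

docs = [["good", "tast"], ["bad", "tast"], ["good", "good", "flavor"]]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: weight} for one document, L2-normalised."""
    counts = Counter(doc)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[1])
print(weights)  # "bad" (1 document) gets a higher weight than "tast" (2 documents)
```

Terms that are rare across the corpus are weighted up, and terms that appear in many reviews are weighted down.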


In [21]:
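# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
features = tf_idf_vect.get_feature_names()
print(len(features))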

2923725


In [18]:
features[10:20]

['aaa magazin',
 'aaa perfect',
 'aaa plus',
 'aaa rate',
 'aaa spelt',
 'aaa tue',
 'aaaa',
 'aaaaa',
 'aaaaa kid',
 'aaaaa start']

In [27]:
# Let's look at the row at index 10 (the 11th review) as a dense array
print(final_tfidf[10,:].toarray()[0])

[0. 0. 0. ... 0. 0. 0.]


In [20]:
# source: https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df
# Convert the row at index 2 (the third review) of final_tfidf to an array, then apply top_tfidf_feats
top_tfidf = top_tfidf_feats(final_tfidf[2,:].toarray()[0],features,25)
top_tfidf

Unnamed: 0,feature,tfidf
0,poem,0.38315
1,handmot,0.239019
2,throughout school,0.239019
3,invent poem,0.239019
4,learn poem,0.239019
5,like handmot,0.239019
6,handmot invent,0.239019
7,poem throughout,0.239019
8,way children,0.231628
9,learn month,0.226384


# Word2Vec

### 1. Google-News based Word2Vec

In [33]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
# Load the pre-trained Google News Word2Vec vectors
time_start = time.perf_counter()  # time.clock() was removed in Python 3.8
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(time.perf_counter()-time_start," seconds!")

113.97020279370645  seconds!


In [35]:
model["business"]  # `model` is already a KeyedVectors object, so no `.wv` is needed

array([ 1.03759766e-02, -4.85839844e-02, -1.26953125e-01, -1.08886719e-01,
        3.02734375e-02,  2.23388672e-02, -7.32421875e-04,  9.91210938e-02,
        8.20312500e-02,  1.80664062e-02,  6.59179688e-02,  1.04492188e-01,
       -1.00097656e-01, -1.21093750e-01, -1.06933594e-01,  9.52148438e-02,
        7.56835938e-02,  1.96289062e-01,  1.37695312e-01, -4.95605469e-02,
       -9.22851562e-02, -6.64062500e-02, -7.47070312e-02, -1.04003906e-01,
        2.46582031e-02,  1.08398438e-01,  6.68945312e-02,  1.03515625e-01,
        1.01074219e-01, -5.00488281e-02,  4.24804688e-02, -2.25585938e-01,
       -3.56445312e-02,  5.95703125e-02,  1.44531250e-01,  1.53320312e-01,
        2.01171875e-01, -4.71191406e-02, -1.59912109e-02, -2.62451172e-02,
        1.27929688e-01,  1.78710938e-01, -8.44726562e-02,  1.39648438e-01,
       -1.87500000e-01, -1.85546875e-01,  1.02539062e-01,  6.25000000e-02,
       -6.98242188e-02,  2.87109375e-01, -2.29492188e-02,  3.39355469e-02,
       -3.34472656e-02,  

In [47]:
model.similarity("business","money")

0.23376924

In [48]:
model.most_similar("business") # the words most similar to "business"

[('businesses', 0.6623775362968445),
 ('busines', 0.6080313324928284),
 ('busi_ness', 0.5612965226173401),
 ('PETER_PASSI_covers', 0.5530025959014893),
 ('Business', 0.5466139316558838),
 ('businesss', 0.5441080331802368),
 ('Sopris_supplemental_solutions', 0.5252544283866882),
 ('company', 0.5192004442214966),
 ('entrepreneurial', 0.5077816247940063),
 ('buiness', 0.5039401650428772)]

In [55]:
model.most_similar("tasti") # raises a KeyError: the stemmed word "tasti" is not in the Google News vocabulary

KeyError: "word 'tasti' not in vocabulary"

In [56]:
model.most_similar("taste")

[('tastes', 0.6838271021842957),
 ('flavor', 0.6630197763442993),
 ('tasted', 0.6162089109420776),
 ('Harry_Potter_butterbeer', 0.589458703994751),
 ('tasting', 0.5604723691940308),
 ('tangy_taste', 0.5567916035652161),
 ('aftertaste', 0.5558385252952576),
 ('bitter_taste', 0.5491952896118164),
 ('carbonated_cough_syrup', 0.5455324649810791),
 ('taste_buds', 0.5368086099624634)]

### 2. Train our own Word2Vec model using our own text corpus (final_dataset)

In [57]:
list_of_sent = []
for sent in final_dataset["CleanedText"].values:
    list_of_sent.append(sent.split())

In [58]:
print(final_dataset["CleanedText"].values[0])

witti littl book make son laugh loud recit car drive along alway sing refrain hes learn whale india droop love new word book introduc silli classic book will bet son still abl recit memori colleg


In [59]:
print(list_of_sent[0])

['witti', 'littl', 'book', 'make', 'son', 'laugh', 'loud', 'recit', 'car', 'drive', 'along', 'alway', 'sing', 'refrain', 'hes', 'learn', 'whale', 'india', 'droop', 'love', 'new', 'word', 'book', 'introduc', 'silli', 'classic', 'book', 'will', 'bet', 'son', 'still', 'abl', 'recit', 'memori', 'colleg']


In [60]:
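import gensim
# gensim >= 4.0 uses `vector_size`; in gensim 3.x the same parameter was called `size`
perso_word2vec_model = gensim.models.Word2Vec(list_of_sent,min_count = 5,vector_size=50,workers=4)

In [63]:
# gensim >= 4.0 exposes the vocabulary as `wv.key_to_index`; in gensim 3.x it was `wv.vocab`
w2v_vocab = list(perso_word2vec_model.wv.key_to_index)
print("number of words that appear min 5 times: ",len(w2v_vocab))
print("sample words [0:50]:",w2v_vocab[0:50])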

number of words that appear min 5 times:  21938
sample words [0:50]: ['deni', 'penc', 'vervein', 'rosemari', 'stretchi', 'anis', 'supermart', 'snapshot', 'benchmark', 'maloney', 'kernal', 'nuk', 'nan', 'loophol', 'boomi', 'indic', 'tastybit', 'sneek', 'canola', 'noah', 'dubbl', 'magician', 'brisk', 'pint', 'creamer', 'rasberri', 'voluptu', 'wheaten', 'thicken', 'almost', 'duller', 'kirkland', 'monstros', 'intriqu', 'afterthought', 'retent', 'maltipoo', 'wer', 'honest', 'pau', 'redmond', 'fire', 'armpit', 'remain', 'fad', 'stereo', 'watercress', 'sleepytim', 'amozon', 'vial']


In [64]:
perso_word2vec_model.wv.most_similar("tasti")

[('delici', 0.8074445724487305),
 ('yummi', 0.7940266728401184),
 ('tastey', 0.7560350894927979),
 ('good', 0.7038044333457947),
 ('satisfi', 0.6868366599082947),
 ('nice', 0.6649437546730042),
 ('hearti', 0.6585618853569031),
 ('nutriti', 0.6512829065322876),
 ('crunchi', 0.6447467803955078),
 ('terrif', 0.6427789926528931)]

In [65]:
perso_word2vec_model.wv["tasti"]

array([-0.02701611, -0.10889673, -1.0262495 , -0.46574706,  0.44552073,
       -0.8640214 , -0.85324043, -1.5809108 , -0.99243325, -0.59499544,
        0.7233709 , -0.29276377, -1.7476803 , -0.6396086 ,  0.8851758 ,
       -1.0584092 , -2.700595  , -1.1784486 ,  0.8550985 ,  0.5342878 ,
        0.86521477,  0.1449451 , -2.479155  ,  0.43289283,  1.1408037 ,
       -0.44291678, -1.304791  ,  3.484737  ,  0.49329612,  1.9975141 ,
        1.9462025 , -0.7166388 ,  0.4743421 ,  1.2148111 , -1.1655399 ,
        0.4190834 , -2.4263234 ,  2.6667461 , -1.2812158 , -0.44299066,
        2.9221117 , -0.13341868,  1.4185519 , -1.6105909 , -0.6283417 ,
        0.90079725,  4.4634657 ,  2.0880275 , -1.9010131 ,  2.3108134 ],
      dtype=float32)

In [69]:
perso_word2vec_model.wv.similarity("tasti","delici")

0.80744463
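
The value returned by `similarity` is the cosine similarity of the two word vectors, which is why it matches the score listed for 'delici' in the `most_similar("tasti")` output. A minimal sketch of that computation on made-up two-dimensional vectors (the function `cosine_similarity` is written here for illustration and is not gensim's implementation):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Because only the angle matters, two vectors of very different lengths can still have a similarity close to 1.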

In [74]:
perso_word2vec_model.wv.most_similar("tast")

[('aftertast', 0.770721435546875),
 ('flavor', 0.7697957754135132),
 ('tase', 0.734691858291626),
 ('flavour', 0.686099112033844),
 ('textur', 0.6777291297912598),
 ('overpow', 0.6550370454788208),
 ('overbear', 0.6543506979942322),
 ('hint', 0.6433494091033936),
 ('smell', 0.6336493492126465),
 ('underton', 0.624985933303833)]

## Avg Word2Vec, TFIDF-Word2Vec

In [77]:
# Average Word2Vec: represent each review by the mean of its word vectors.
# The original run was interrupted (see the KeyboardInterrupt below): testing membership
# against the list w2v_vocab is O(n) per word, whereas a set lookup is O(1).
w2v_vocab_set = set(w2v_vocab)
sent_vectors = []  # the avg-w2v for each sentence/review is stored in this list
for sent in list_of_sent:  # for each review/sentence
    sent_vec = np.zeros(50)  # accumulator with the same dimension as the word vectors (50)
    cnt_words = 0  # number of words with a valid vector in the sentence/review
    for word in sent:  # for each word in a review/sentence
        if word in w2v_vocab_set:
            vec = perso_word2vec_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))

KeyboardInterrupt: 

In [None]:
# TF-IDF weighted Word2Vec
# final_tfidf is the sparse matrix with row = review, column = term and cell value = tf-idf
tfidf_vocab = tf_idf_vect.vocabulary_  # dict: term -> column index (O(1) lookup instead of list.index)
w2v_vocab_set = set(w2v_vocab)  # set membership is O(1)

tfidf_sent_vectors = []  # the tfidf-w2v for each sentence/review is stored in this list
for row, sent in enumerate(list_of_sent):  # for each review/sentence
    sent_vec = np.zeros(50)  # accumulator with the same dimension as the word vectors (50)
    weight_sum = 0  # sum of the tf-idf weights of the words that have a vector
    for word in sent:  # for each word in a review/sentence
        if word in w2v_vocab_set and word in tfidf_vocab:
            vec = perso_word2vec_model.wv[word]
            # tf-idf weight of this word in this review
            tf_idf = final_tfidf[row, tfidf_vocab[word]]
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
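
Both review representations above are weighted means of word vectors: the plain average uses a weight of 1 for every word, the TF-IDF variant uses each word's tf-idf value. A toy sketch with invented two-dimensional embeddings (the function `weighted_average` is hypothetical and assumes a non-empty list of equal-length vectors):

```python
def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors; returns zeros if the weights sum to 0."""
    dim = len(vectors[0])
    total = sum(weights)
    if total == 0:
        return [0.0] * dim
    return [sum(w * vec[k] for vec, w in zip(vectors, weights)) / total
            for k in range(dim)]

word_vectors = [[1.0, 0.0], [0.0, 1.0]]  # made-up embeddings for two words

print(weighted_average(word_vectors, [1.0, 1.0]))  # [0.5, 0.5]   plain average
print(weighted_average(word_vectors, [3.0, 1.0]))  # [0.75, 0.25] first word weighted 3x
```

With TF-IDF weights, distinctive words pull the review vector towards their own embedding, while very common words contribute less.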