# Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How do we determine whether a review is positive or negative?<br>
<br> 
[Ans] We can use the Score/Rating. A rating of 4 or 5 is considered positive, and a rating of 1 or 2 is considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.




## Loading the data

The dataset is available in two forms
1. .csv file
2. SQLite Database

To load the data, we use the SQLite database because it makes the data easier to query and visualise efficiently.
<br> 

Since we only want the overall sentiment of each review (positive or negative), we deliberately ignore all reviews with a Score of 3. If the score is above 3, the review is labeled "positive"; otherwise, it is labeled "negative".
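The filtering and labeling rule described above can be sketched with an in-memory SQLite table. This is only an illustration: the rows are made up, and the `Label` column and `CASE` expression are assumptions used here in place of the pandas `map` call in the next cell.

```python
import sqlite3

# Toy in-memory table standing in for the real Reviews table (rows are made up).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "great"), (2, 3, "okay"), (3, 1, "awful"), (4, 4, "nice")],
)

# Drop neutral reviews (Score = 3) and derive the polarity label in SQL.
rows = con.execute(
    """
    SELECT Id,
           CASE WHEN Score > 3 THEN 'positive' ELSE 'negative' END AS Label
    FROM Reviews
    WHERE Score != 3
    ORDER BY Id
    """
).fetchall()
con.close()

print(rows)
```

The neutral review (Id 2) is excluded, and the remaining rows carry a two-class label.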

In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

[nltk_data] Downloading package stopwords to /home/malhar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# using the SQLite Table to read data.
con = sqlite3.connect('database.sqlite') 

# filtering only positive and negative reviews i.e. 
# not taking into consideration those reviews with Score=3
filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 """, con) 
filtered_data['Score'] = filtered_data['Score'].map({1:'negative', 2:'negative', 4:'positive', 5:'positive'})
print("Number of data points in our data", filtered_data.shape)
print()
print(filtered_data['Score'].value_counts())
filtered_data.head(3)

Number of data points in our data (525814, 10)

positive    443777
negative     82037
Name: Score, dtype: int64


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


#  Exploratory Data Analysis

## Data Cleaning: Deduplication

The reviews data contains many duplicate entries (see the table below), so duplicates had to be removed to get unbiased results from the analysis. The following is an example:

In [3]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", con)
display.head(2)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As can be seen above, the same user has multiple reviews with the same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis it was found that <br>
<br> 
ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)<br>
<br> 
ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on<br>

It was inferred that reviews with the same parameters other than ProductId belong to the same product in a different flavour or quantity. Hence, to reduce redundancy, rows having the same parameters were eliminated.<br>

The method used is to first sort the data by ProductId and then keep only the first of each group of similar reviews, deleting the others. For example, in the case above only the review for ProductId=B000HDL1RQ remains. Sorting first ensures that there is only one representative for each product; deduplication without sorting could leave different representatives for the same product.
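The sort-then-keep-first idea can be sketched in plain Python. The rows and the `deduplicate` helper below are hypothetical and only mirror the logic; the notebook itself relies on pandas `sort_values` and `drop_duplicates`.

```python
# Hypothetical rows: the same review posted under two product variants, plus one unique review.
reviews = [
    {"Id": 1, "ProductId": "P2", "UserId": "U1", "ProfileName": "A", "Time": 100, "Text": "same text"},
    {"Id": 2, "ProductId": "P1", "UserId": "U1", "ProfileName": "A", "Time": 100, "Text": "same text"},
    {"Id": 3, "ProductId": "P3", "UserId": "U2", "ProfileName": "B", "Time": 200, "Text": "other text"},
]

def deduplicate(rows):
    """Sort by ProductId, then keep the first row per (UserId, ProfileName, Time, Text)."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r["ProductId"]):
        key = (row["UserId"], row["ProfileName"], row["Time"], row["Text"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

kept_ids = [row["Id"] for row in deduplicate(reviews)]
print(kept_ids)
```

Because the rows are sorted first, the surviving duplicate is always the one with the smallest ProductId, regardless of the input order.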

In [4]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId')

#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"})
print(final.shape)

#Checking to see how much % of data still remains
print((final['Id'].size)/(filtered_data['Id'].size)*100)

del filtered_data
del sorted_data

(364173, 10)
69.25890143662969


<b>Observation:-</b> In the two rows shown below, the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not practically possible. Hence these two rows are also removed from the calculations.

In [5]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND (Id=44737 OR Id=64422)
ORDER BY ProductID
""", con)

display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
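When a score filter is combined with several alternative Ids, it helps to remember that SQL's `AND` binds more tightly than `OR`. The sketch below uses a made-up three-row table to show how the result changes with and without parentheses.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER)")
con.executemany("INSERT INTO Reviews VALUES (?, ?)", [(1, 3), (2, 5), (3, 4)])

# Parsed as (Score != 3 AND Id = 2) OR Id = 1: the neutral row with Id = 1 slips through.
loose = con.execute(
    "SELECT Id FROM Reviews WHERE Score != 3 AND Id = 2 OR Id = 1 ORDER BY Id"
).fetchall()

# Parentheses apply the score filter to both Ids.
strict = con.execute(
    "SELECT Id FROM Reviews WHERE Score != 3 AND (Id = 2 OR Id = 1) ORDER BY Id"
).fetchall()
con.close()

print(loose, strict)
```

For the two rows shown above the distinction does not change the output, since both have a score other than 3.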


In [6]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(364171, 10)


positive    307061
negative     57110
Name: Score, dtype: int64

## [7.2.3] Text Preprocessing: Stemming, stop-word removal and Lemmatization.

Now that deduplication is finished, the data requires some preprocessing before we go further with the analysis and build the prediction model.

In the preprocessing phase we do the following, in the order below:

1. Remove URLs from reviews
2. Remove any HTML tags
3. Expand contractions in reviews, e.g. you'll -> you will
4. Remove punctuation and a limited set of special characters such as , or . or #
5. Check that the word is made up of English letters and is not alphanumeric
6. Convert the word to lowercase
7. Remove stop words
8. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming). This step is optional.

After this we collect the words used to describe positive and negative reviews.

In [7]:
# Decontract english sentences
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase


test = "Hey I'm Yann, how're you and how's it going ? That's interesting: I'd love to hear more about it."
print(decontracted(test))

Hey I am Yann, how are you and how is it going ? That is interesting: I would love to hear more about it.
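The expansion rules are heuristics. As a small illustration, the `'s` rule cannot tell a contraction from a possessive; the `expand_s` helper below is a hypothetical one-rule excerpt of `decontracted`, not part of the notebook.

```python
import re

def expand_s(phrase: str) -> str:
    """Apply only the `'s` rule from decontracted(): always expand to ' is'."""
    return re.sub(r"'s", " is", phrase)

contraction = expand_s("how's it going")
possessive = expand_s("my dog's food")
print(contraction)
print(possessive)
```

In this pipeline the side effect is mostly harmless, because "is" is in the stop-word list and is removed in a later step.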


In [8]:
#function to clean the word of any html-tags
def cleanhtml(review): 
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', review)
    return cleantext.strip()

test = '''<div><br />
<h1 >Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>'''
print(cleanhtml(test))

Title 
 A long text........  
  a link


In [9]:
# Function to remove any URLs
def cleanurl(review):
    # raw string so that regex escapes such as \w and \/ reach the regex engine unchanged
    cleanr = re.compile(r"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?")
    cleantext = re.sub(cleanr, ' ', review)
    return cleantext.strip()

test = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"
print(cleanurl(test))

text1
text2
 
text3
text4
 
text5
text6


In [10]:
#function to clean the review of any punctuation or special characters
def cleanpunc(review): 
    cleanr = re.compile("[^a-zA-Z0-9]")
    cleaned = re.sub(cleanr, ' ', review)
    return  cleaned.strip()
test = 'hi, my n$ame is @malhar.'
print(cleanpunc(test))

hi  my n ame is  malhar


In [11]:
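# Stop-word list based on https://gist.github.com/sebleier/554280
# 'no', 'nor' and 'not' are deliberately left out of the list so that negations survive preprocessing.
# Note: this set shadows the `stopwords` module imported from nltk.corpus above.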

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [12]:
#Code for implementing step-by-step the checks mentioned in the pre-processing phase
# this code takes a while to run as it processes every review in `final` (about 364k rows).

if not os.path.isfile('final.sqlite'):
    final_string = []
    
    for review in tqdm(final['Text']):
        filtered_review = []
        review = cleanurl(review)
        review = cleanhtml(review)
        review = decontracted(review)
        review = cleanpunc(review)
        for word in review.split():
            if word.isalpha() and len(word) > 2 and word.lower() not in stopwords:
                # stemmed_word = sno.stem(word.lower())
                filtered_review.append(word.lower())
        
        str1 = ' '.join(filtered_review)  
        final_string.append(str1)
        
    #adding a column of CleanedText which displays the data after pre-processing of the review    
    final['CleanedText'] = final_string 
    
    # store the final table in an SQLite database for future use.
    conn = sqlite3.connect('final.sqlite')
    final.to_sql('Reviews', conn,  schema=None, if_exists='replace', \
                 index=False, index_label=None, chunksize=None, dtype=None)
    conn.close()
    
else:
    print('SQL table already exists')
    conn = sqlite3.connect('final.sqlite')
    final = pd.read_sql_query(""" SELECT * FROM Reviews """, conn)
    conn.close()                

100%|██████████| 364171/364171 [00:43<00:00, 8355.00it/s]


In [13]:
print(final.shape)
final.head()

(364171, 11)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,witty little book makes son laugh loud recite ...
138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,positive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",grew reading sendak books watching really rosi...
138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,positive,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...,fun way children learn months year learn poems...
138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,positive,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...,great little book read aloud nice rhythm well ...
138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,positive,1018396800,A great way to learn the months,This is a book of poetry about the months of t...,book poetry months year goes month cute little...


# [7.2.2] Bag of Words (BoW)

In [88]:
#BoW
count_vect = CountVectorizer() #in scikit-learn
final_counts = count_vect.fit_transform(final['CleanedText'])
print("the type of count vectorizer ",type(final_counts))
print("the shape of out text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])

# Dump in a pickle file
if not os.path.isfile('./pickle_files/bow.pkl'):
    with open('./pickle_files/bow.pkl', 'wb') as f:
        pickle.dump(final_counts, f)

if os.path.isfile('./pickle_files/bow.pkl'):
    print('exists')

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (364171, 102831)
the number of unique words  102831
exists
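To make the shape of the matrix above concrete, here is a minimal pure-Python sketch of what a bag-of-words representation holds: one row per review and one column per vocabulary word. The two toy documents are made up, and whitespace splitting is a simplification of `CountVectorizer`'s default tokenizer.

```python
from collections import Counter

docs = ["witty little book", "little book little fun"]  # toy cleaned reviews

# Vocabulary: sorted unique tokens, each mapped to a column index.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}

def bow_row(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

matrix = [bow_row(d) for d in docs]
print(list(vocab))
print(matrix)
```

The real matrix has 364171 rows and 102831 columns, almost all of them zero, which is why scikit-learn returns it in sparse format.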


## [7.2.4] Bi-Grams and n-Grams.

In [83]:
#bi-gram, tri-gram and n-gram
#removing stop words like "not" should be avoided before building n-grams

count_vect = CountVectorizer(ngram_range=(1,2) ) #in scikit-learn
final_bigram_counts = count_vect.fit_transform(final['CleanedText'])
print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of out text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (364171, 3909269)
the number of unique words including both unigrams and bigrams  3909269
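The jump from about 100k to about 3.9M columns comes from adding every pair of adjacent words as a feature. A minimal sketch of n-gram generation follows; the `ngrams` helper is hypothetical and only imitates what `ngram_range=(1,2)` produces for one tokenized review.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

features = ngrams("taste not good".split())
print(features)
```

The bigram "not good" is only available because "not" was kept out of the stop-word list; with "not" removed, the review would reduce to "taste good".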


# [7.2.5] TF-IDF

In [90]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(final['CleanedText'])
print("the type of count vectorizer ",type(final_tf_idf))
print("the shape of out text TFIDF vectorizer ",final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text TFIDF vectorizer  (364171, 3938108)
the number of unique words including both unigrams and bigrams  3938108


In [91]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
features = tf_idf_vect.get_feature_names()
features = np.array(features)
print("some sample features(unique words in the corpus)",features[100000:100010])

some sample features(unique words in the corpus) ['always commercials' 'always committed' 'always company' 'always compare'
 'always comparing' 'always comparison' 'always compatible'
 'always competing' 'always competitive' 'always complain']


In [97]:
# extract the top 25 tfidf values and corresponding features for row_index=1
first_row = final_tf_idf[1,:].toarray()[0]
top_25_idx = np.argsort(first_row)[::-1][:25]
print('Top 25 tfidf values :')
print(first_row[top_25_idx])
print()
print('Top 25 features')
print(features[top_25_idx])

Top 25 tfidf values :
[0.1832815  0.1832815  0.1832815  0.1832815  0.1832815  0.1832815
 0.1832815  0.1832815  0.1832815  0.1832815  0.1832815  0.1832815
 0.17761395 0.17761395 0.17761395 0.17761395 0.17359276 0.16792521
 0.16792521 0.16390402 0.16390402 0.16225766 0.15823647 0.15511739
 0.15256892]

Top 25 features
['incorporates love' 'movie incorporates' 'reading sendak' 'flimsy takes'
 'pages open' 'rosie movie' 'paperbacks seem' 'books watching'
 'cover version' 'grew reading' 'keep pages' 'version paperbacks'
 'sendak books' 'really rosie' 'paperbacks' 'miss hard' 'watching really'
 'hard cover' 'kind flimsy' 'however miss' 'hands keep' 'seem kind'
 'sendak' 'rosie' 'two hands']


In [98]:
# source: https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = features[topn_ids]
    top_tfidf = row[topn_ids]
    df = pd.DataFrame()
    df['feature'] = top_feats
    df['tfidf'] = top_tfidf
    return df

top_tfidf = top_tfidf_feats(final_tf_idf[1,:].toarray()[0],features,25)

In [99]:
top_tfidf

Unnamed: 0,feature,tfidf
0,incorporates love,0.183282
1,movie incorporates,0.183282
2,reading sendak,0.183282
3,flimsy takes,0.183282
4,pages open,0.183282
5,rosie movie,0.183282
6,paperbacks seem,0.183282
7,books watching,0.183282
8,cover version,0.183282
9,grew reading,0.183282


# [7.2.6] Word2Vec

In [None]:
# Using Google News Word2Vectors

# in this project we are using a model pretrained by Google
# it's a 3.3G file; once you load it into memory
# it occupies ~9Gb, so please do this step only if you have >12G of RAM
# we will provide a pickle file which contains a dict
# with all our corpus words as keys and model[word] as values
# To use this code snippet, download "GoogleNews-vectors-negative300.bin" 
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.


# http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W17SRFAzZPY
# you can comment out this whole cell
# or change these variables according to your needs
# is_your_ram_gt_16g=False
# want_to_read_sub_set_of_google_w2v = True
# want_to_read_whole_google_w2v = True
# if not is_your_ram_gt_16g:
#     if want_to_read_sub_set_of_google_w2v and  os.path.isfile('google_w2v_for_amazon.pkl'):
#         with open('google_w2v_for_amazon.pkl', 'rb') as f:
#             # model is a dict object; you can directly access any word vector using model[word]
#             model = pickle.load(f)
# else:
#     if want_to_read_whole_google_w2v and os.path.isfile('GoogleNews-vectors-negative300.bin'):
#         model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# print("the vector representation of word 'computer'",model.wv['computer'])
# print("the similarity between the words 'woman' and 'man'",model.wv.similarity('woman', 'man'))
# print("the most similar words to the word 'woman'",model.wv.most_similar('woman'))
# this will raise an error
# model.wv.most_similar('tasti')  # "tasti" is the stemmed word for tasty, tastful

In [116]:
# Train your own Word2Vec model using your own text corpus
list_of_sent=[]
for sent in tqdm(final['CleanedText']):
    list_of_sent.append(sent.split())

100%|██████████| 364171/364171 [00:02<00:00, 125897.89it/s]


In [101]:
print(final['CleanedText'][0])
print()
print(list_of_sent[0])

witty little book makes son laugh loud recite car driving along always sing refrain learned whales india drooping roses love new words book introduces silliness classic book willing bet son still able recite memory college

['witty', 'little', 'book', 'makes', 'son', 'laugh', 'loud', 'recite', 'car', 'driving', 'along', 'always', 'sing', 'refrain', 'learned', 'whales', 'india', 'drooping', 'roses', 'love', 'new', 'words', 'book', 'introduces', 'silliness', 'classic', 'book', 'willing', 'bet', 'son', 'still', 'able', 'recite', 'memory', 'college']


In [117]:
# min_count = 5 considers only words that occurred at least 5 times
# (gensim >= 4.0 renames the `size` argument to `vector_size`)
w2v_model=Word2Vec(list_of_sent,min_count=5,size=50, workers=4)

In [118]:
w2v_words = list(w2v_model.wv.vocab)
print("number of words that occured minimum 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])

number of words that occured minimum 5 times  33330
sample words  ['witty', 'little', 'book', 'makes', 'son', 'laugh', 'loud', 'recite', 'car', 'driving', 'along', 'always', 'sing', 'refrain', 'learned', 'whales', 'india', 'drooping', 'roses', 'love', 'new', 'words', 'introduces', 'silliness', 'classic', 'willing', 'bet', 'still', 'able', 'memory', 'college', 'grew', 'reading', 'sendak', 'books', 'watching', 'really', 'rosie', 'movie', 'incorporates', 'loves', 'however', 'miss', 'hard', 'cover', 'version', 'seem', 'kind', 'flimsy', 'takes']


In [119]:
w2v_model.wv.most_similar('tasty')

[('delicious', 0.8256422281265259),
 ('tastey', 0.821725606918335),
 ('satisfying', 0.7980048656463623),
 ('yummy', 0.7775086164474487),
 ('flavorful', 0.7342123985290527),
 ('filling', 0.725837767124176),
 ('nice', 0.6583796143531799),
 ('good', 0.6512887477874756),
 ('nutritious', 0.6473573446273804),
 ('enjoyable', 0.6380774974822998)]
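The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking follows, using made-up 3-dimensional vectors in place of the trained 50-dimensional ones.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, for illustration only.
toy = {
    "tasty": [0.9, 0.1, 0.0],
    "delicious": [0.8, 0.2, 0.1],
    "weird": [0.0, 0.3, 0.9],
}

query = "tasty"
ranked = sorted(
    (w for w in toy if w != query),
    key=lambda w: cosine(toy[query], toy[w]),
    reverse=True,
)
print(ranked)
```

Words whose vectors point in a similar direction to the query come first, which is how "delicious" ends up ahead of unrelated words.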

In [105]:
w2v_model.wv.most_similar('like')

[('weird', 0.7790811061859131),
 ('okay', 0.7028170824050903),
 ('gross', 0.696906566619873),
 ('kind', 0.6921790838241577),
 ('ok', 0.6859133243560791),
 ('prefer', 0.6795724630355835),
 ('resemble', 0.6708050966262817),
 ('good', 0.664630651473999),
 ('appealing', 0.6597133874893188),
 ('odd', 0.6571176648139954)]

# [7.2.7] Avg W2V, TFIDF-W2V

In [106]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sent): # for each review/sentence
    sent_vec = np.zeros(50) # start from zeros; word vectors have 50 dimensions
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))

# Dump in a pickle file
if not os.path.isfile('./pickle_files/avg_w2v.pkl'):
    with open('./pickle_files/avg_w2v.pkl', 'wb') as f:
        pickle.dump(sent_vectors, f)

100%|██████████| 364171/364171 [28:13<00:00, 215.07it/s]


364171
50
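The averaging step can be sketched without gensim. The `average_vector` helper and the toy embeddings below are hypothetical. The sketch looks words up in a dict, which takes constant time on average, whereas testing membership in a Python list scans the list; that scan may be one reason the loop above takes so long.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding; zeros if none do."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)  # dict lookup, no scan over the vocabulary
        if vec is None:
            continue  # skip out-of-vocabulary words
        total = [t + x for t, x in zip(total, vec)]
        count += 1
    return [t / count for t in total] if count else total

# Made-up 2-dimensional embeddings, for illustration only.
toy = {"good": [1.0, 0.0], "food": [0.0, 2.0]}
avg = average_vector(["good", "food", "xyzzy"], toy, dim=2)
empty = average_vector(["xyzzy"], toy, dim=2)
print(avg, empty)
```

Reviews with no in-vocabulary words map to the zero vector, matching the behaviour of the cell above.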


In [120]:
model = TfidfVectorizer()
tf_idf_matrix = model.fit_transform(final['CleanedText'])
# we are building a dictionary with the word as key and its idf as value
dictionary = dict(zip(model.get_feature_names(), list(model.idf_)))

# Dump in a pickle file
if not os.path.isfile('./pickle_files/tfidf.pkl'):
    with open('./pickle_files/tfidf.pkl', 'wb') as f:
        pickle.dump(tf_idf_matrix, f)

In [121]:
# TF-IDF weighted Word2Vec
tfidf_feat = model.get_feature_names() # tfidf words/col-names
# tf_idf_matrix is the sparse matrix with row = sentence, col = word and cell_val = tfidf

tfidf_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm(list_of_sent): # for each review/sentence 
    sent_vec = np.zeros(50) # start from zeros; word vectors have 50 dimensions
    weight_sum =0; # sum of the tf-idf weights of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
#             to reduce the computation, we compute the tf-idf value directly:
#             dictionary[word] = idf value of word in the whole corpus
#             sent.count(word)/len(sent) = tf value of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1

if not os.path.isfile('./pickle_files/tfidf_w2v.pkl'):
    with open('./pickle_files/tfidf_w2v.pkl', 'wb') as f:
        pickle.dump(tfidf_sent_vectors, f)

100%|██████████| 364171/364171 [31:58<00:00, 189.78it/s] 
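The weighting scheme can be illustrated on toy data. The helper below is a hypothetical pure-Python mirror of the loop above: each occurrence of a word contributes its vector scaled by idf times term frequency, and the sum is divided by the total weight. The embeddings and idf values are made up.

```python
def tfidf_weighted_vector(tokens, embeddings, idf, dim):
    """Weighted mean of word vectors, with weight = idf * (count in review / review length)."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embeddings or tok not in idf:
            continue  # skip words without a vector or an idf value
        weight = idf[tok] * tokens.count(tok) / len(tokens)
        total = [t + weight * x for t, x in zip(total, embeddings[tok])]
        weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total

# Made-up embeddings and idf values, for illustration only.
toy_vectors = {"good": [1.0, 0.0], "food": [0.0, 1.0]}
toy_idf = {"good": 1.0, "food": 3.0}
weighted = tfidf_weighted_vector(["good", "food"], toy_vectors, toy_idf, dim=2)
print(weighted)
```

The rarer word ("food", with the higher idf) pulls the review vector towards its own embedding, which is the intended difference from the plain average.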


In [31]:
tf_idf_matrix=TfidfVectorizer()
final_tf_idf = tf_idf_matrix.fit_transform(["hi how are how you","how is going on"])

# print(final_tf_idf.todense())
dictionary = dict(zip(tf_idf_matrix.get_feature_names(), list(tf_idf_matrix.idf_)))
print("idf value of 'you' as per tf_idf_matrix.idf_:",dictionary['you'])
feature_tfidf=tf_idf_matrix.get_feature_names()
print(feature_tfidf)
print('number of features in tf idf=',len(feature_tfidf))
print(final_tf_idf[0,feature_tfidf.index('you')])
print(dictionary['you']*1)
print(np.log((1+2)/(1+1))+1)

idf value of 'you' as per tf_idf_matrix.idf_: 1.4054651081081644
['are', 'going', 'hi', 'how', 'is', 'on', 'you']
number of features in tf idf= 7
0.4461008073765536
1.4054651081081644
1.4054651081081644
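The matrix entry for "you" (0.4461) is smaller than its idf (1.4055) because `TfidfVectorizer` by default L2-normalises each row after multiplying term counts by the smoothed idf. The sketch below reproduces both numbers by hand for the same two sentences, assuming those default settings.

```python
import math

docs = [d.split() for d in ["hi how are how you", "how is going on"]]
n_docs = len(docs)

def idf(term):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in d for d in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

first = docs[0]
# Raw tf-idf for every distinct word in the first sentence: count * idf.
raw = {t: first.count(t) * idf(t) for t in set(first)}
# L2 norm of the row, then the normalised value for "you".
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf_you = raw["you"] / norm

print(idf("you"))
print(tfidf_you)
```

This is also why the TF-IDF weighted Word2Vec cell, which uses `idf * tf` directly, works with unnormalised weights: the division by `weight_sum` plays the role of the normalisation.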
