# Amazon Fine Food Reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

    Number of reviews: 568,454<br>
    Number of users: 256,059<br>
    Number of products: 74,258<br>
    Timespan: Oct 1999 - Oct 2012<br>
    Number of attributes: 10<br>

Attribute Information<br>

1. Id<br>
2. ProductId ---> unique id of the product<br>
3. UserId ---> unique id of the user<br>
4. ProfileName<br>
5. HelpfulnessNumerator ---> number of users who found the review helpful<br>
6. HelpfulnessDenominator ---> number of users who indicated whether they found the review helpful or not<br>
7. Score ---> rating between 1 and 5<br>
8. Time ---> timestamp of the review<br>
9. Summary ---> brief summary of the review<br>
10. Text ---> text of the review<br>
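
The `Time` values in the sample rows shown later in this notebook (for example `1303862400`) look like Unix epoch seconds. Assuming that interpretation, a minimal sketch for turning one into a calendar date could be (`to_utc_date` is a hypothetical helper, not part of the dataset tooling):

```python
from datetime import datetime, timezone

def to_utc_date(epoch_seconds):
    """Convert Unix epoch seconds to a UTC calendar date (assumed encoding of Time)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()

# 1303862400 is the Time value of the first sample review in this notebook
print(to_utc_date(1303862400))
```

If the assumption holds, the earliest sample timestamps fall in 1999, which agrees with the timespan listed above.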

#### Objective

Given a review, determine whether it is positive (score 4 or 5) or negative (score 1 or 2).

<br>
[Q]. How do we determine whether a review is positive or negative?<br>
<br>
[Ans]. We use the Score attribute: a score of 4 or 5 is treated as a positive rating and a score of 1 or 2 as a negative rating. Reviews with a score of 3 are treated as neutral and left out. This is only an approximate way of assigning polarity (pos / neg).
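
The rule above can be sketched as a small function. `polarity` is a hypothetical helper written for illustration; the notebook's own labelling function appears further down and uses different label strings.

```python
def polarity(score):
    """Map a 1-5 star score to a label; 3 is treated as neutral and returns None."""
    if score in (4, 5):
        return "positive"
    if score in (1, 2):
        return "negative"
    return None  # score 3 (or anything unexpected) carries no clear polarity

print([polarity(s) for s in range(1, 6)])
# → ['negative', 'negative', None, 'positive', 'positive']
```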

# Loading Data

The dataset is available in two formats:

1. CSV file
2. SQLite database

We can work with either file to analyse the data; the cells below load the SQLite database.
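
As a rough illustration of the two loading routes, the sketch below builds a three-row stand-in for the data, reads it once as CSV and once through SQLite, and checks that filtering out score 3 gives the same rows either way. The toy rows and the in-memory database are assumptions made for the example; the real file names are not shown here.

```python
import io
import sqlite3
import pandas as pd

# Tiny stand-in for the real data; column names follow the attribute list above
csv_text = "Id,Score,Summary\n1,5,Good Quality Dog Food\n2,1,Not as Advertised\n3,3,Just okay\n"

# 1. CSV: pd.read_csv accepts a path or any file-like object
from_csv = pd.read_csv(io.StringIO(csv_text))

# 2. SQLite: write the same rows into an in-memory database, then query them back
con = sqlite3.connect(":memory:")
from_csv.to_sql("Reviews", con, index=False)
from_sql = pd.read_sql_query("SELECT * FROM Reviews WHERE Score != 3", con)
con.close()

# Filtering in pandas should give the same rows as filtering in SQL
filtered_csv = from_csv[from_csv["Score"] != 3].reset_index(drop=True)
print(filtered_csv.values.tolist() == from_sql.values.tolist())
```

The SQL route pushes the filter into the query, which avoids loading the neutral reviews into memory at all.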



In [1]:

import warnings 
warnings.filterwarnings('ignore')

# Libraries used to analyse the data
import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
sn.set()

In [300]:
# Loading data using sqlite
con = sqlite3.connect('database.sqlite')

# Select the reviews where the score is not equal to 3

filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 """, con)

filtered_data.head(3)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [301]:
filtered_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525814 entries, 0 to 525813
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      525814 non-null  int64 
 1   ProductId               525814 non-null  object
 2   UserId                  525814 non-null  object
 3   ProfileName             525814 non-null  object
 4   HelpfulnessNumerator    525814 non-null  int64 
 5   HelpfulnessDenominator  525814 non-null  int64 
 6   Score                   525814 non-null  int64 
 7   Time                    525814 non-null  int64 
 8   Summary                 525814 non-null  object
 9   Text                    525814 non-null  object
dtypes: int64(5), object(5)
memory usage: 40.1+ MB


In [302]:
# Label reviews with score > 3 as "Postive" and reviews with score < 3 as "Negative"
# (the label spelling "Postive" is kept because later cells compare against it)

def partition(x):
    if x < 3:
        return "Negative"
    return "Postive"


actual_score = filtered_data['Score']
PostiveNegative = actual_score.map(partition)
filtered_data['Score'] = PostiveNegative
print("Number of data points are ", filtered_data.shape[0])
filtered_data.head(3)

Number of data points are  525814


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,Postive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,Negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,Postive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


# Exploratory Data Analysis

## [2] Data Cleaning : Data Duplication

The data contains duplicate reviews. They must be removed so that repeated entries do not bias the model during training.

In [303]:
display = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3  
                                 AND UserID = "AR5J8UI46CURR"
                                  ORDER BY ProductID """, con)

display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


* **As shown above, the same user has multiple reviews with identical values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis we found that:**<br>

* **ProductId = B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookie, 8.82-Ounce (Pack of 8)**<br>

* **ProductId = B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookie, 8.82 Ounce (Pack of 8), and so on**<br>

* **The reviews are identical in everything except ProductId because the seller presented these products as the same item: all are Loacker wafers that differ only in flavour. The same review, with the same Summary, Text and Score, was therefore attached to each product.**<br>

* **If such duplicates are kept, they can bias the model: during training the algorithm leans towards the repeated data.**<br>

* **To remove them, first sort the data by ProductId in ascending order, then drop the duplicates with the built-in pandas method `drop_duplicates`.**
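
A minimal sketch of the sort-then-deduplicate step on a toy frame follows. The three rows are a hand-made miniature of the situation described above (the review texts are shortened stand-ins), so only the mechanics carry over.

```python
import pandas as pd

# Hypothetical miniature: one review copied across two product ids, plus an unrelated review
toy = pd.DataFrame({
    "ProductId":   ["B000HDOPZG", "B000HDL1RQ", "B001E4KFG0"],
    "UserId":      ["AR5J8UI46CURR", "AR5J8UI46CURR", "A3SGXH7AUHU8GW"],
    "ProfileName": ["Geetha Krishnan", "Geetha Krishnan", "delmartian"],
    "Time":        [1199577600, 1199577600, 1303862400],
    "Text":        ["DELICIOUS WAFERS.", "DELICIOUS WAFERS.", "Good dog food."],
})

# Sort first so that keep="first" always keeps the row with the lowest ProductId
deduped = (
    toy.sort_values("ProductId")
       .drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep="first")
)
print(deduped["ProductId"].tolist())
# → ['B000HDL1RQ', 'B001E4KFG0']
```

ProductId is deliberately left out of `subset`, since it is the one column in which the duplicated reviews differ.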


In [304]:
# Sort the data by ProductId in ascending order

sorted_data=filtered_data.sort_values('ProductId', axis = 0, ascending=True, kind = 'quicksort', na_position = 'last')

In [305]:
# Remove duplicate reviews, keeping the first occurrence after sorting

final = sorted_data.drop_duplicates(subset = ["UserId", "ProfileName", "Time", "Text"], keep = 'first')
final.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,Postive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...
138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,Postive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc..."
138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,Postive,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...
138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,Postive,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...
138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,Postive,1018396800,A great way to learn the months,This is a book of poetry about the months of t...


In [306]:
# Check what fraction of the data remains

(final['Id'].size*1.0) / (filtered_data['Id'].size*1.0)

0.6925890143662968

### Observation

Some rows have a HelpfulnessNumerator greater than the HelpfulnessDenominator. This is impossible in practice, so those rows should be removed.

In [307]:
noise = final[ final.HelpfulnessNumerator > final.HelpfulnessDenominator]

In [308]:
noise

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
59301,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,Postive,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
41159,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,Postive,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [309]:
# These rows should not be considered

final = final[ final.HelpfulnessNumerator <= final.HelpfulnessDenominator]

In [310]:
print("Number of datapoints after basic EDA is ", final.shape[0])

print("\nNumber of positive and negative rating ")
final.Score.value_counts()

Number of datapoints after basic EDA is  364171

Number of positive and negative rating 


Postive     307061
Negative     57110
Name: Score, dtype: int64

In [311]:
# Libraries for NLP text-to-vector conversion

import nltk
import string

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc
from sklearn import metrics

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

#from gensim.models import Word2Vec
#from gensim.models import KeyedVectors

import pickle
import tqdm
import os
import re           # Tutorial about regular expressions: https://pymotw.com/2/re/

In [312]:
import time

# [7.2.2] Bag of Words (BoW)

1. **`CountVectorizer` is the scikit-learn class used to convert text into vectors with the Bag of Words technique.**

In [313]:
# BoW

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)

In [314]:
type(final_counts)

scipy.sparse.csr.csr_matrix

1. **A sparse matrix is a matrix in which most of the elements are zero.**
2. **In a review's vector, each word that appears in the text is encoded as a non-zero value (its frequency), whereas words that do not occur in the text are encoded as zero.**
3. **For example, if the corpus has 1000 unique words, every vector has length 1000. A review with 10-15 distinct words then fills only 10-15 cells with non-zero values, and the remaining 985-990 cells are zero.**
4. **Hence the matrix returned by `CountVectorizer` is a sparse matrix.**
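
To make the sparsity argument concrete, here is a from-scratch sketch on a three-document toy corpus. It only mimics the idea behind a Bag of Words matrix with plain Python; it is not how `CountVectorizer` is implemented, and the documents are made up.

```python
from collections import Counter

docs = ["good tasty food", "not good", "tasty tasty tea"]

# Vocabulary: every unique word in the corpus, sorted (one column per word)
vocab = sorted({w for d in docs for w in d.split()})

# Dense BoW rows: count of each vocabulary word in each document
rows = [[Counter(d.split())[w] for w in vocab] for d in docs]

# A sparse representation stores only the non-zero cells as {column index: count}
sparse_rows = [{j: c for j, c in enumerate(r) if c} for r in rows]

zeros = sum(c == 0 for r in rows for c in r)
print(vocab)
print(rows)
print(sparse_rows)
print(f"{zeros} of {len(docs) * len(vocab)} cells are zero")
```

Even with five words, more than half of the cells are zero; with a vocabulary of about 115k words and short reviews, the proportion is far higher, which is why only the non-zero cells are stored.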

In [315]:
final_counts.get_shape()

(364171, 115281)

In [316]:
final_counts.shape

(364171, 115281)

# [7.2.3] Text preprocessing

Before converting text into vectors we must clean it; cleaning makes the resulting vectors more compact and more meaningful.

1. **Remove HTML tags such as the `<br />` line-break tag.**
2. **Remove punctuation and a limited set of special characters such as #, * and so on.**
3. **Check that each word consists only of letters and is not alphanumeric.**
4. **Check that the word is longer than 2 characters (the assumption being that no adjective has 2 or fewer letters).**
5. **Convert the word to lowercase.**
6. **Remove stopwords.**
7. **Reduce each word to its stem (for example, comparative and superlative forms map to the base word). We use Snowball stemming, which was observed to work better than Porter stemming.**

Afterwards we collect the words used in positive and in negative reviews.
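
The steps can be sketched end to end on a single sentence. This is a dependency-free illustration under a few assumptions: the stopword set is a tiny hypothetical one (the notebook uses NLTK's English list), step 2 strips every non-alphanumeric character (broader than the notebook's own punctuation function), and stemming (step 7) is left out.

```python
import re

# Hypothetical, tiny stopword list for illustration only
STOPWORDS = {"the", "is", "and", "this", "was", "it", "a", "of"}

def preprocess(text):
    """Apply steps 1-6 of the list above to one review and return the kept words."""
    text = re.sub(r"<.*?>", " ", text)            # 1. drop HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. replace punctuation/special characters
    kept = []
    for word in text.split():
        if not word.isalpha():                    # 3. skip tokens that are not purely letters
            continue
        if len(word) <= 2:                        # 4. skip words of 2 or fewer characters
            continue
        word = word.lower()                       # 5. lowercase
        if word in STOPWORDS:                     # 6. drop stopwords
            continue
        kept.append(word)
    return kept

print(preprocess("This tea is GREAT!<br />Bought 2 boxes, and it was not bitter."))
# → ['tea', 'great', 'bought', 'boxes', 'not', 'bitter']
```

Note that "not" survives here because it is absent from the toy stopword set, which matters later when bi-grams such as "not bitter" are built.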

In [317]:
nltk.download('stopwords') # downloading stopwords to overcome error

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mani\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [318]:
# We use the re (regular expression) module to remove unwanted characters from the text.
# re.compile pre-compiles a pattern, and re.sub replaces every match of the pattern
# with a string supplied by the caller.


stop = set(stopwords.words('english'))
print("Number of stop words in english are ", len(stop))

# initialising the snowball stemmer
sno = nltk.stem.SnowballStemmer('english') 
word = 'tasty'
print("\nStem word of ",word," is ", sno.stem(word).encode('utf8'))

Number of stop words in english are  179

Stem word of  tasty  is  b'tasti'


In [319]:
# keep "not" and "very" in the text: they matter for sentiment and for the n-grams built later
stop.remove('not')
stop.remove('very')
stop

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's'

### Cleaning HTML tags

1. **Define a function that cleans HTML tags.**
2. **`re.compile` pre-compiles the pattern.**
3. **`re.sub` replaces every match of the pattern in a sentence with a string supplied by the caller.**
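
A short sketch of the two calls, using made-up sample strings. It also shows why the pattern uses the non-greedy `.*?` rather than `.*`.

```python
import re

# Compile once, reuse many times; ".*?" is non-greedy, so each tag is matched separately
tag_pattern = re.compile(r"<.*?>")

samples = ["great <b>taste</b>", "line one<br />line two"]
print([tag_pattern.sub(" ", s) for s in samples])

# A greedy ".*" would swallow everything between the first "<" and the last ">",
# including the word "taste"
print(repr(re.sub(r"<.*>", " ", "great <b>taste</b>")))
```

Replacing each tag with a space rather than an empty string keeps the words on either side of a tag from being glued together.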

In [320]:
start_time = time.time()
def clean_html(sentence):
    cleaner = re.compile('<.*?>')
    cleaned_text = re.sub(cleaner, ' ', sentence)
    return cleaned_text
clean_html("my name is manivarsh <br /> in my class")

'my name is manivarsh   in my class'

In [321]:
final.Text = final.Text.map(clean_html)
final.Text[0]

'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

### Cleaning Punctuations

1. **The `clean_punc` function removes punctuation from a sentence.**
2. **Punctuation carries no useful meaning once the text is converted into vectors.**

In [322]:
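def clean_punc(sentence):
    # Replace each listed punctuation character with a space. Inside [...] the "|"
    # is a literal character, not an alternation, so it only needs to appear once.
    cleaned = re.sub(r'[?!\'"#|]', r' ', sentence)
    cleaned = re.sub(r'[.,)(|/]', r' ', cleaned)
    return cleaned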
clean_punc("what's your name, are you aware of dogs? in campus.")

'what s your name  are you aware of dogs  in campus '

In [323]:
final.Text = final.Text.map(clean_punc)
final.Text[0]

'I have bought several of the Vitality canned dog food products and have found them all to be of good quality  The product looks more like a stew than a processed meat and it smells better  My Labrador is finicky and she appreciates this product better than  most '

In [324]:
# Count the total number of words across all reviews
length = 0
for sentence in final.Text:
    length += len(sentence.split())
length

29074266

### **Removing words that contain non-letters, words of 2 or fewer characters, and stopwords**

In [325]:
def Filter(x):
    """Keep alphabetic, non-stopword words longer than 2 characters and stem them."""
    filtered_sentence = []
    for word in x.split():
        if word.isalpha() and len(word) > 2 and word.lower() not in stop:
            # stem the lowercased word; the result is stored as utf-8 bytes
            filtered_sentence.append(sno.stem(word.lower()).encode('utf8'))
    return b" ".join(filtered_sentence)

In [326]:
final.Text = final.Text.map(Filter)
final.Text[0]

b'bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better'

In [327]:
def Pos_Neg_Analysis(x, y):
    """Split the words of all reviews into positive-review and negative-review lists."""
    all_positive_words = []
    all_negative_words = []
    for sentence, label in zip(x, y.values):
        if label == "Postive":        # label spelling as produced by partition()
            all_positive_words.extend(sentence.split())
        elif label == "Negative":
            all_negative_words.extend(sentence.split())
    return (all_positive_words, all_negative_words)

In [328]:
analysis = Pos_Neg_Analysis(final.Text, final.Score)
end_time = time.time()
print("Time taken to execution is ", end_time - start_time)

Time taken to execution is  249.35561299324036


In [329]:
final.Text[1]

b'product arriv label jumbo salt peanut peanut actual small size unsalt not sure error vendor intend repres product jumbo'

# [7.2.4] uni-grams, n-grams

### Motivation

1. **We have separated the words of positive and negative reviews; let us analyse them to motivate n-grams.**
2. **We check the frequency distribution of the words that appear most often in positive and in negative reviews.**
3. **`nltk.FreqDist` counts the words, and its `most_common` method returns the words that occur most frequently.**

In [330]:
freq_dist_pos = nltk.FreqDist(analysis[0])
freq_dist_neg = nltk.FreqDist(analysis[1])
print("Top 20 words occuring positive reviews are ", freq_dist_pos.most_common(20))
print("\nTop 20 words frequent in negative reviews are ", freq_dist_neg.most_common(20))

Top 20 words occuring positive reviews are  [(b'not', 146807), (b'like', 139432), (b'tast', 129068), (b'good', 112789), (b'flavor', 109637), (b'love', 107396), (b'great', 103924), (b'use', 103893), (b'one', 96740), (b'product', 91054), (b'veri', 90849), (b'tri', 86803), (b'tea', 83915), (b'coffe', 78830), (b'make', 75111), (b'get', 72130), (b'food', 64811), (b'would', 55759), (b'time', 55275), (b'buy', 54202)]

Top 20 words frequent in negative reviews are  [(b'not', 54382), (b'tast', 34592), (b'like', 32331), (b'product', 28227), (b'one', 20569), (b'flavor', 19580), (b'would', 18064), (b'tri', 17756), (b'veri', 17013), (b'use', 15303), (b'good', 15044), (b'coffe', 14720), (b'get', 13786), (b'buy', 13752), (b'order', 12871), (b'food', 12756), (b'tea', 11667), (b'even', 11085), (b'box', 10845), (b'amazon', 10076)]


### Observation

1. **Stems such as "tast" and "like" are frequent in both positive and negative reviews.**
2. **The reason is that negative reviews contain phrases such as "not like" and "not tasty", but the count vectorizer splits them into the single words "not" and "like" or "tast".**
3. **By default the count vectorizer uses uni-grams.**
4. **We can overcome this problem by using bi-grams.**
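
The effect can be seen on two toy stemmed reviews. The `ngrams` helper below is a hypothetical stand-in that imitates the (min, max) range idea with plain Python; the vectorizer's real tokenisation is more involved.

```python
from collections import Counter

def ngrams(tokens, min_n=1, max_n=2):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    out = []
    for n in range(min_n, max_n + 1):
        out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

positive = "like tast".split()        # toy stemmed positive review
negative = "not like tast".split()    # toy stemmed negative review

# With uni-grams alone, both reviews contain "like" and "tast"
print(Counter(ngrams(positive, 1, 1)))
print(Counter(ngrams(negative, 1, 1)))

# With bi-grams added, "not like" appears only in the negative review
print(ngrams(negative, 1, 2))
# → ['not', 'like', 'tast', 'not like', 'like tast']
```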

In [331]:
# Bi-grams, tri-grams and, in general, n-grams

# The stopword "not" must be kept in the text so that n-grams such as "not like" can form
count_vect = CountVectorizer(ngram_range = (1, 2))    # documented under sklearn.feature_extraction.text
final_bigram_counts = count_vect.fit_transform(final.Text.values)

In [332]:
final_bigram_counts.get_shape()

(364171, 2863047)


1. **With uni-grams the count vectorizer produces about 115k dimensions.**
2. **With uni-grams and bi-grams it produces about 2.9 million dimensions.**
3. **`ngram_range = (1, 2)` generates uni-grams and bi-grams; `ngram_range = (1, 3)` generates uni-grams, bi-grams and tri-grams.**
4. **`ngram_range = (min_value, max_value)`**

# [7.2.5] TF - IDF (Term Frequency - Inverse Document Frequency) 
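
As a reference point for this section: TF-IDF scales a term's count in a document by how rare the term is across documents. The sketch below is a from-scratch illustration on a toy corpus. It assumes the smoothed variant that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row; it is not the library's implementation.

```python
import math
from collections import Counter

docs = [d.split() for d in ["good tast", "not good", "tast tast tea"]]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed inverse document frequency (assumed to match the scikit-learn default)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalised TF-IDF vector of one tokenised document."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

print(vocab)
for doc in docs:
    print([round(v, 3) for v in tfidf_row(doc)])
```

In the second toy document, "not" receives a larger weight than "good" because "not" occurs in fewer documents.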

In [335]:
tf_idf_vect = TfidfVectorizer(ngram_range = (1, 2))
final_tf_idf = tf_idf_vect.fit_transform(final.Text.values)

In [336]:
final_tf_idf.get_shape()

(364171, 2863047)

In [340]:
# Get the feature names after the text has been converted into vectors with TF-IDF
# (newer scikit-learn releases replace get_feature_names with get_feature_names_out)

features = tf_idf_vect.get_feature_names()
len(features)

2863047

In [341]:
# first ten features

features[:10]

['aa',
 'aa pleas',
 'aaa',
 'aaa aaa',
 'aaa class',
 'aaa condit',
 'aaa hockey',
 'aaa job',
 'aaa magazin',
 'aaa perfect']

In [345]:
# Print the TF-IDF vector at row index 3 by converting the sparse row to a dense array

type(final_tf_idf[3,:])
final_tf_idf[3,:].toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

In [365]:
# Get the top_n TF-IDF values in a row and return them with their feature names.
# np.argsort returns the indices that sort the array in ascending order.
# [::-1] reverses that order, because the largest values sit at the end after sorting.
# [:top_n] then keeps only the indices of the top_n values.
# top_feats is a list of (feature name, tfidf value) tuples for those indices.

def top_tfidf_feat(row, features, top_n = 25):
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ["features", "tfidf"]
    return df
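
The argsort, reverse, slice idiom is easiest to follow on a handful of numbers. The values and feature names below are made up for the illustration.

```python
import numpy as np

row = np.array([0.10, 0.00, 0.45, 0.30])          # toy TF-IDF values for one review
features = ["tea", "box", "not like", "tast"]     # hypothetical feature names

order = np.argsort(row)      # indices in ascending order of value: [1, 0, 3, 2]
top2 = order[::-1][:2]       # reverse, then keep the indices of the two largest values
print([(features[i], float(row[i])) for i in top2])
# → [('not like', 0.45), ('tast', 0.3)]
```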

In [366]:
top_tfidf = top_tfidf_feat(final_tf_idf[1,:].toarray()[0], features, 25)

In [368]:
# The result contains both bi-grams and uni-grams

top_tfidf

Unnamed: 0,features,tfidf
0,keep page,0.192299
1,paperback seem,0.192299
2,grew read,0.192299
3,movi incorpor,0.192299
4,incorpor love,0.192299
5,read sendak,0.192299
6,rosi movi,0.192299
7,page open,0.192299
8,version paperback,0.192299
9,flimsi take,0.192299


# [7.2.6]  Word2Vec

1. **A Word2Vec model pre-trained by Google is available, but the file is large (about 1.9 GB), so a high-end machine is needed to work with it.**
2. **The system needs at least 12 GB of RAM to process this 1.9 GB of data.**

###  Building word2vec on amazon data

1. **We train Word2Vec on the data without stemming.**
2. **So we load the dataset again from scratch, because the previous copy has already been stemmed.**

In [370]:
import gensim

In [371]:
# Loading data using sqlite
con = sqlite3.connect('database.sqlite')

# Select the reviews where the score is not equal to 3

filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 """, con)

filtered_data.head(3)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [372]:
# Label reviews with score > 3 as "Postive" and reviews with score < 3 as "Negative"

def partition(x):
    if x < 3:
        return "Negative"
    return "Postive"


actual_score = filtered_data['Score']
PostiveNegative = actual_score.map(partition)
filtered_data['Score'] = PostiveNegative
print("Number of data points are ", filtered_data.shape[0])
filtered_data.head(3)

Number of data points are  525814


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,Postive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,Negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,Postive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [373]:
sorted_data=filtered_data.sort_values('ProductId', axis = 0, ascending=True, kind = 'quicksort', na_position = 'last')

In [374]:
final = sorted_data.drop_duplicates(subset = ["UserId", "ProfileName", "Time", "Text"], keep = 'first')

In [375]:
final = final[ final.HelpfulnessNumerator <= final.HelpfulnessDenominator]

In [380]:
# Build a list of tokenised reviews: strip HTML, strip punctuation, keep alphabetic words in lowercase
list_of_sent = []
for sent in final.Text.values:
    filtered_sentence = []
    sent = clean_html(sent)
    for w in sent.split():
        for cleaned_words in clean_punc(w).split():
            if cleaned_words.isalpha():
                filtered_sentence.append(cleaned_words.lower())
    list_of_sent.append(filtered_sentence)

In [382]:
print(final.Text.values[0])
print("====================================PROCESSESED===============================")
print(list_of_sent[0])

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
['this', 'witty', 'little', 'book', 'makes', 'my', 'son', 'laugh', 'at', 'loud', 'i', 'recite', 'it', 'in', 'the', 'car', 'as', 'we', 're', 'driving', 'along', 'and', 'he', 'always', 'can', 'sing', 'the', 'refrain', 'he', 's', 'learned', 'about', 'whales', 'india', 'drooping', 'i', 'love', 'all', 'the', 'new', 'words', 'this', 'book', 'introduces', 'and', 'the', 'silliness', 'of', 'it', 'all', 'this', 'is', 'a', 'classic', 'book', 'i', 'am', 'willing', 'to', 'bet', 'my', 'son', 'will', 'still', 'be', 'able', 'to', 'recite', 'from', 'memory', 'when', 'he', 'is', 'in', 'college']


In [386]:
# sentences = list_of_sent, the tokenised reviews used as the training corpus
# min_count = 5: ignore words whose frequency is less than 5
# size = dimensionality of the word vectors (named vector_size in gensim 4 and later)
# workers = number of worker threads used to train the model (faster training on multicore machines)

word2vec_model = gensim.models.Word2Vec(list_of_sent, min_count = 5, size = 50, workers = 4)

In [393]:
# list of words in the model (gensim 4 and later expose the vocabulary as wv.key_to_index)

words = list(word2vec_model.wv.vocab)
print(len(words))

33294


In [394]:
# Words most similar to "tasty"

word2vec_model.wv.most_similar('tasty')

[('tastey', 0.9198199510574341),
 ('yummy', 0.8654152154922485),
 ('satisfying', 0.8420820832252502),
 ('filling', 0.8388378620147705),
 ('delicious', 0.816381573677063),
 ('flavorful', 0.8044866323471069),
 ('addicting', 0.7936882972717285),
 ('tasteful', 0.756055474281311),
 ('versatile', 0.7468998432159424),
 ('delish', 0.7440531849861145)]

In [390]:
word2vec_model.wv.most_similar('like')

[('resemble', 0.6897900104522705),
 ('prefer', 0.6688624620437622),
 ('dislike', 0.6548731923103333),
 ('mean', 0.6270785331726074),
 ('overpower', 0.6257542371749878),
 ('alright', 0.6105120778083801),
 ('think', 0.6041259765625),
 ('enjoy', 0.5896117687225342),
 ('weird', 0.5778809785842896),
 ('gross', 0.5769407153129578)]
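
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking with made-up 3-dimensional vectors (the trained vectors here have 50 dimensions, and these numbers are purely illustrative):

```python
import numpy as np

# Made-up vectors for illustration only
vectors = {
    "tasty":     np.array([0.9, 0.1, 0.0]),
    "delicious": np.array([0.8, 0.2, 0.1]),
    "box":       np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "tasty"
ranked = sorted(
    ((w, cosine(vectors[query], v)) for w, v in vectors.items() if w != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Because the score reflects shared contexts rather than sentiment, words of opposite polarity can still rank as similar, as "dislike" does for "like" in the output above.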

In [392]:
count_vect_feat = count_vect.get_feature_names()      # list of words from Bag of Words
count_vect_feat.index('like')
print(count_vect_feat[64055])

along vacat


# [7.2.7] Average Word2Vec and TF-IDF * word2Vec

In [402]:
# compute the average word2vec vector of each review

# the average w2v vector of each sentence/review is stored in this list
sent_vectors = []
for sent in list_of_sent:  # for each review/sentence

    # word vectors have length 50, so the running sum has length 50 as well
    sent_vec = np.zeros(50)

    # count the words that have a vector, to divide by when averaging
    word_count = 0

    # for each word in a review/sentence
    for word in sent:
        try:
            vec = word2vec_model.wv[word]
            sent_vec += vec
            word_count += 1
        except KeyError:
            # the word was dropped by min_count, so the model has no vector for it
            pass

    # avoid dividing by zero when no word of the review is in the vocabulary
    if word_count != 0:
        sent_vec /= word_count
    sent_vectors.append(sent_vec)

# print only the sizes; printing every vector exceeds the notebook's output rate limit
print(len(sent_vectors), len(sent_vectors[0]))



###  TF-IDF * Word2Vec

In [None]:
# TF-IDF weighted Word2Vec

tfidf_feat = tf_idf_vect.get_feature_names()  # list of unique words/feature names of tf-idf

# map each feature name to its column once; list.index() would rescan the whole list per word
tfidf_index = {feat: col for col, feat in enumerate(tfidf_feat)}

# final_tf_idf is the sparse TF-IDF matrix

# the tf-idf weighted w2v vector of each sentence/review is stored in this list
tfidf_sent_vectors = []
row = 0

# for each sentence/review
for sent in list_of_sent:
    sent_vec = np.zeros(50)  # the word vectors have 50 dimensions
    weight_sum = 0

    # for each word in the sentence/review
    for word in sent:
        try:
            vec = word2vec_model.wv[word]

            # obtain the tf-idf of the word in this sentence/review
            tfidf = final_tf_idf[row, tfidf_index[word]]
            sent_vec += (vec * tfidf)
            weight_sum += tfidf
        except KeyError:
            # word missing from the word2vec vocabulary or from the tf-idf features
            pass
    # avoid dividing by zero when no word contributed
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1
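
The weighting scheme is easier to check on a toy example. The sketch below uses hypothetical 2-dimensional word vectors and hand-picked TF-IDF weights for one review; `weighted_sentence_vector` is an illustrative helper, not part of gensim or scikit-learn.

```python
import numpy as np

# Hypothetical 2-d word vectors and TF-IDF weights for a single toy review
word_vectors = {"not": np.array([1.0, 0.0]), "tasty": np.array([0.0, 1.0])}
tfidf_weights = {"not": 0.2, "tasty": 0.6}

def weighted_sentence_vector(tokens, dim=2):
    """TF-IDF weighted mean of word vectors; words missing from either table are skipped."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for tok in tokens:
        if tok in word_vectors and tok in tfidf_weights:
            total += word_vectors[tok] * tfidf_weights[tok]
            weight_sum += tfidf_weights[tok]
    # Keep the zero vector when nothing matched, instead of dividing by zero
    return total / weight_sum if weight_sum else total

print(weighted_sentence_vector(["not", "tasty", "unseenword"]))
print(weighted_sentence_vector(["unseenword"]))
```

Setting every weight to 1 reduces this to the plain average Word2Vec computed earlier, so the two representations differ only in how much each word counts.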