## Amazon Fine Food Review Analysis

Attribute Information:
    1. Id
    2. ProductID - unique identifier for the product
    3. UserID - unique identifier for the user
    4. ProfileName
    5. HelpfulnessNumerator - number of users who found the review useful
    6. HelpfulnessDenominator - number of users indicating whether they found the review helpful or not
    7. Score - rating between 1 & 5
    8. Time - timestamp of the review
    9. Summary - brief summary of the review
    10. Text - text of the review.

In [5]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jayraj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<b>Task:</b>

Given a review, determine whether the review is positive (Rating of 4 / 5) or negative (Rating of 1 / 2).

In [6]:
# Using SQLite Table to read the data
con =sqlite3.connect('/Users/jayraj/Applied_AI_Course/Applied_ai_course/Datasets/Amazon Fine Dine Reviews/database.sqlite')

In [7]:
# Filtering only positive and negative reviews i.e.
# Not taking into considerations those reviews with Score=3
filtered_data=pd.read_sql_query("""
SELECT *
FROM REVIEWS
WHERE Score !=3
""", con)

In [8]:
# Creating a function to give reviews with score > 3, a positive rating, and reviews with a score < 3, a negative rating 
def partition(x):
    if x<3:
        return 'negative'
    return 'positive'

In [9]:
# Changing the scores with score less than 3 positive and vice versa
actualScore=filtered_data['Score']
positiveNegative=actualScore.map(partition)
filtered_data['Score']=positiveNegative

In [10]:
filtered_data
print(filtered_data.head())

   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   

   HelpfulnessNumerator  HelpfulnessDenominator     Score        Time  \
0                     1                       1  positive  1303862400   
1                     0                       0  negative  1346976000   
2                     1                       1  positive  1219017600   
3                     3                       3  negative  1307923200   
4                     0                       0  positive  1350777600   

                 Summary                                               Text  
0  Good Quality Dog Food  I have bought several of the Vitalit

## Data cleaning and deduplication of the data

In the real world, when we do machine learning we spend 20% to 30% time in Data Cleaning & Preprocessing. 

The Dataset has many duplicate rows.

Hence it is necessary to remove duplicates in order to get unbiased results for the analysis of the data. It is not adding any value to the system.

In [None]:
display=pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score !=3 AND UserId='AR5J8UI46CURR'
ORDER BY ProductID
""", con)
print(display)
print(display.shape)

       Id   ProductId         UserId      ProfileName  HelpfulnessNumerator  \
0   78445  B000HDL1RQ  AR5J8UI46CURR  Geetha Krishnan                     2   
1  138317  B000HDOPYC  AR5J8UI46CURR  Geetha Krishnan                     2   
2  138277  B000HDOPYM  AR5J8UI46CURR  Geetha Krishnan                     2   
3   73791  B000HDOPZG  AR5J8UI46CURR  Geetha Krishnan                     2   
4  155049  B000PAQ75C  AR5J8UI46CURR  Geetha Krishnan                     2   

   HelpfulnessDenominator  Score        Time  \
0                       2      5  1199577600   
1                       2      5  1199577600   
2                       2      5  1199577600   
3                       2      5  1199577600   
4                       2      5  1199577600   

                             Summary  \
0  LOACKER QUADRATINI VANILLA WAFERS   
1  LOACKER QUADRATINI VANILLA WAFERS   
2  LOACKER QUADRATINI VANILLA WAFERS   
3  LOACKER QUADRATINI VANILLA WAFERS   
4  LOACKER QUADRATINI VANILLA WAFERS

<h3> Sorting the data according to the product Id in ascending order</h3>

In [31]:
sorted_data=filtered_data.sort_values("ProductId", axis=0, ascending=True)
sorted_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...
138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,positive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc..."
138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,positive,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...
138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,positive,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...
138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,positive,1018396800,A great way to learn the months,This is a book of poetry about the months of t...


<h3>Deduplication of entries</h3>

In [32]:
final=sorted_data.drop_duplicates(subset={'UserId', 'ProfileName', 'Time', 'Text'}, keep='first', inplace=False)
final.shape


(364173, 10)

In [None]:
# Checking how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

69.25890143662969

69% of data is remaining after removing the duplicates

In [None]:
display=pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score !=3 AND Id=44737 OR Id=64422
ORDER BY ProductID
""", con)
print(display)
print(display.shape)

      Id   ProductId          UserId              ProfileName  \
0  64422  B000MIDROQ  A161DK06JJMCYF  J. E. Stephens "Jeanne"   
1  44737  B001EQ55RW  A2V0I904FH7ABY                      Ram   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     3                       1      5  1224892800   
1                     3                       2      4  1212883200   

                                        Summary  \
0             Bought This for My Son at College   
1  Pure cocoa taste with crunchy almonds inside   

                                                Text  
0  My son loves spaghetti so I didn't hesitate or...  
1  It was almost a 'love at first bite' - the per...  
(2, 10)


<h4>Observation 2</h4> HelpfulnessNumerator should always be lesser than HelpfulnessDenominator.

In [42]:
# Keeping the data wherein HelpfulnessNumerators<=HelpfulnessDenominator
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
print(final.shape)

(364171, 10)


In [None]:
final.Score.value_counts()

positive    307061
negative     57110
Name: Score, dtype: int64

<h4>Next Phase of Preprocessing</h4>

Given the 8 features as input we have to predict the sentiment polarity (+ve/-ve).
We are determining the polarity by the Score.

The most useful features are Summary and Text.

Given any problem if we can convert into the problem of vectors, we can leverage the power of Linear Algebra.

How do you convert text into numerical vectors?

Convert Review Text into d-D vector in d-D space.

Suppose we have many vectors. Each point represents a d-D representation of a review in d-D space.

Draw a hyperplane 'pi' separating all positive reviews and all negative reviews.

1. Converting Review-text into a d-D vector
2. Finding a plane to separate the reviews.

<h4>Rules/Properties of this conversion </h4>

Suppose we have 3 reviews - $r_{1}$, $r_{2}$ and $r_{3}$. d-D representation of vectors for $r_{1}$ -> $v_{1}$, $r_{2}$ -> $v_{2}$, $r_{3}$ -> $v_{3}$.

If $r_{1}$ and $r_{2}$ are more similar semantically that $r_{1}$ and $r_{3}$, i.e. Eng. sim($r_{1}$, $r_{2}$) > Eng. sim($r_{1}$, $r_{3}$) then the dist($v_{1}$, $v_{2}$) < dist($v_{1}$, $v_{3}$). 

If $r_{1}$ & $r_{2}$ are more similar, $v_{1}$ and $v_{2}$ must be close i.e. <b>length($v_{1}$  - $v_{2}$) < length($v_{1}$ - $v_{3}$)</b>

<b>find {text -> d-D vector} such that similiar text must be closer geometrically.</b>

## Bag of words (BoW)

Simplest Technique to convert text to a numerical vector is <b>Bag Of Words(BoW)</b>

$r_{1}$: This pasta is very tasty and affordable.

$r_{2}$: This pasta is not tasty and is affordable.

$r_{3}$: This pasta is delicious and cheap.

$r_{4}$: Pasta is tasty and pasta tastes good.

In NLP, a review is known as a document. Set of documents is called<b> corpus</b>.

1. Constructing a dictionary - set of all unique words in the reviews. 
{This, pasta, ...}
2. Construct Vector $v_{i}$ of size 'd'. Each word is a different dimension and each cell corresponds to # of times the word occurs in the review/document $r_{i}$.

$v_{i}$ is a sparse vector - most of the elements are zero.

<b>Objective of BoW:</b> Similar text must result as closer vectors.

BoW is thought of counting the common words when all the values exist only once. How many common words exist? 

BoW does not work very well when there are small changes in the terminology we are using.

<b>Binary BoW</b> or <b>Boolean BoW</b> is a variation of BoW. Instead of putting count, we put 1 if the word occurs atleast once and 0 if the word doesn't exist. 

||$v_{1}$ - $v_{2}$|| = $\sqrt number of different words$ between documents/reviews $r_{1}$ and $r_{2}$.

All the words like {This, is, and} do not matter much. What matters the most is the non-trivial words.

Removing the trivial words <b>Stop-words</b>.

If I remove the Stop-words, BoW vector will be smaller and more meaningful. You throw these Stop-words while constructing the vector.

In English 'not' is also considered as a Stop-word.

<b>Note: So, removing the stop-words is not always the best choice.</b>

<h4>Text Pre-processing steps</h4>
1. Removing <b>Stop-words</b>.
2. Convert all your words <b>lowercase</b>.
3. <b>Stemming</b>: words coming from the same base word in English. Eg. tastes, tasful, tasty -> tast. Convert all these words into their common form i.e. taste and replace them with the common form. Related words are considered as single root word.
Stemming algorithms - PorterStemmer, SnowballStemmer 
4. <b>Lemmatization</b>: breaking up a sentence into words. A space is used to break the sentence into words. 
Eg. This pasta is very tasty. This is the best in New York.
But there can be complex words like New York. It is a location. 
Often times we break the sentence but there are lemmatizers available which will group New York into 1 word. 
5. Tasty and delicious are synonyms - very similar in meaning. But in BoW, we are considering them as 2 different words which are nowhere related because they are 2 different dimensions. In BoW we are not taking semantic meaning of words into consideration. A technique called <b>Word2Vec</b> where we try to get semantic meaning of these words into consideration when we build vectors of text.  

<b>BoW + Text Preprocessing </b>
Converting text to a d-D vector which doesn't guarantee semantic meaning of words will be at the same place. 
$r_{1}$ and $r_{3}$ are semantically same because our algorithm still think they are different. 

<b>The <u>drawback</u> of BoW is <i>it doesn't take semantic meaning into consideration</i>.</b>

In [44]:
count_vect=CountVectorizer() # scikit learn
final_counts=count_vect.fit_transform(final.Text.values)

<h3> Example to understand countvectorizer() </h3>

In [56]:
corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
X=count_vect.fit_transform(corpus)
print(count_vect.get_feature_names())
print(X.toarray())

count_vect2=CountVectorizer(analyzer='word', ngram_range=(2,2)) # bigrams
Y=count_vect2.fit_transform(corpus)
print(count_vect2.get_feature_names())
print(Y.toarray())


['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]


In [67]:
type(final_counts)

scipy.sparse.csr.csr_matrix

Its a sparse matrix (dictionary). The space from (m x n) has been reduced to space (k)

In [68]:
final_counts.shape
# 364171 rows and 115281 column containing bag of words

(364171, 115281)

## Text Preprocessing: Stemming, stop word removal and Lemmatization
- Removal of html tags
- Remove any punctuations or limited set of special characters like, or. or # etc
- Check if the word is made up of english letters and isnot alpha numeric
- Check to see if the length of word is greater than 2 
- Convert the words to lowercase
- Remove stopwords
- Finally snowball stemming the word

In [69]:
# To find sentences containing HTML tags
import re
i=0
for sent in final['Text'].values:
    if(len(re.findall('<.*?>', sent))):
        print(i)
        print(sent)
        break
    i=i+1


6
I set aside at least an hour each day to read to my son (3 y/o). At this point, I consider myself a connoisseur of children's books and this is one of the best. Santa Clause put this under the tree. Since then, we've read it perpetually and he loves it.<br /><br />First, this book taught him the months of the year.<br /><br />Second, it's a pleasure to read. Well suited to 1.5 y/o old to 4+.<br /><br />Very few children's books are worth owning. Most should be borrowed from the library. This book, however, deserves a permanent spot on your shelf. Sendak's best.


In [70]:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# Set of stop words
stop=set(stopwords.words('english')) 

# Initializing the snowball stemming
sno=nltk.stem.SnowballStemmer('english')

def cleanhtml(sentence):
    cleanr=re.compile('<.*?')
    cleantext=re.sub(cleanr, ' ', sentence)
    return cleantext

def cleanpunc(sentence):
    cleaned=re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned=re.sub(r'[.|,|)|(|\|/]',r' ',cleaned)
    return cleaned
print(stop)
print('************************************')
print(sno.stem('tasty'))

{'we', 'who', 'why', 'between', 'o', 'you', 'own', "shouldn't", 'no', 'up', 'm', 'hadn', 'been', 'a', 'your', 'have', 'the', 'off', 'myself', 'there', 'for', 'wouldn', 'by', 'until', 'ain', 'having', 'am', 'this', 'nor', 'as', 'that', 're', 'how', 'mustn', 'they', 'were', 'about', 'yourselves', 'too', 'wasn', 'because', 'or', 'but', 'of', 'its', 'my', 't', 'him', 'below', 'don', 'more', 'few', "wasn't", 'i', 'some', "needn't", 'be', 'further', 'an', 'itself', 'very', 'do', 's', 'can', 'being', "it's", 'down', "should've", 'isn', 'whom', "that'll", 'against', 'hasn', 'when', 'has', "wouldn't", 'while', 'before', 'did', 'couldn', 'd', 'yourself', 'y', 'our', 'it', 'such', "haven't", 'will', 'here', "mustn't", "she's", 'through', 'theirs', 'are', 'in', 'any', 'shouldn', 'herself', 'she', "couldn't", 'what', 'shan', 'themselves', 'each', "doesn't", 'to', 'so', 'doesn', 'ma', 'now', "mightn't", "hadn't", 'doing', 'aren', 'yours', 'is', 'other', "isn't", 'does', 'needn', 'them', 'was', 'out'

In [71]:
# Code for implementing step-by-step preprocessing

i = 0
str1 = ' '
final_string = []
all_positive_words = []
all_negative_words = []
s = ''
for sentence in final['Text'].values:
    filtered_sentence = []
    #print(sentence)
    sentence = cleanhtml(sentence)
    for w in sentence.split():
        for cleaned_words in cleanpunc(w).split():
            if((cleaned_words.isalpha()) & (len(cleaned_words) > 2)):
                if(cleaned_words.lower() not in stop):
                    s = (sno.stem(cleaned_words.lower())).encode('utf8')
                    filtered_sentence.append(s)
                    if(final['Score'].values)[i] == 'positive':
                        all_positive_words.append(s)
                    if(final['Score'].values)[i] == 'negative':
                        all_negative_words.append(s)
                else:
                    continue
            else:
                continue
    str1 = b" ".join(filtered_sentence)
    
    final_string.append(str1)
    i += 1   


In [72]:
# CleanedText column addition
final['CleanedText']=final_string

In [None]:
# Store final table into an SQlite table for future
conn=sqlite3.connect('final.sqlite')
c=conn.cursor()
conn.text_factory=str
final.to_sql('Reviews', conn, schema=None, if_exists='replace')

## Bi Grams and n-Grams

$r_{1}$: This pasta is very tasty and affordable.

$r_{2}$: This pasta is not tasty and is affordable.

After removing stop-words $v_{1}$ and $v_{2}$ are exactly the same => $r_{1}$ and $r_{2}$ are very similar which is not TRUE. 

$r_{1}$ and $r_{2}$ are completely opposite. 

<b>Uni-gram</b>: Each word is considered as a dimension.
<b>Bi-gram</b>: Pairs of consecutive words is considered as a dimension.
<b>Tri-gram</b>: 3 consecutive words is considered as a dimension.
<b>n-gram</b>: n consecutive words is considered as a dimension.

<b>Why n-gram?</b> Uni-gram based BoW discards the sequence information. But using bi-gram, tri-gram or n-gram we are trying to retain some of the partial sequence information. 

Bi-gram, Tri-gram or n-gram can be easily encorporated into BoW.  

<b># of bi-grams >= # of uni-grams</b> because the number of pairs of consecutive words is greater than or equal to uni-grams. 

<b># of n-grams >= ... >= # of tri-grams >= # of bi-grams >= # of uni-grams</b>

For n-grams, where n > 1, dimensionality 'd' increases drastically.

In [73]:
freq_dist_positive=nltk.FreqDist(all_positive_words)
freq_dist_negative=nltk.FreqDist(all_negative_words)
print("Most common positive words: ", freq_dist_positive.most_common(20))
print("Most common negative words: ", freq_dist_negative.most_common(20))

Most common positive words:  [(b'like', 139150), (b'tast', 128631), (b'good', 112216), (b'flavor', 109473), (b'love', 107034), (b'use', 103627), (b'great', 102818), (b'product', 99504), (b'one', 95360), (b'tri', 86237), (b'tea', 83824), (b'coffe', 78610), (b'make', 74835), (b'get', 71962), (b'food', 64752), (b'amazon', 57832), (b'would', 55297), (b'time', 55225), (b'buy', 53903), (b'realli', 52569)]
Most common negative words:  [(b'tast', 34489), (b'like', 32284), (b'product', 29504), (b'one', 20420), (b'flavor', 19561), (b'would', 17901), (b'tri', 17676), (b'use', 15275), (b'good', 14977), (b'coffe', 14677), (b'get', 13758), (b'buy', 13690), (b'order', 12846), (b'food', 12742), (b'dont', 11683), (b'tea', 11657), (b'amazon', 11258), (b'even', 10983), (b'box', 10841), (b'make', 9816)]


From the above results positive and negative words overlap so it is better to consider pair of words i.e. bigrams, trigrams and n grams of words must be created.

In [74]:
# Bi-gram, tri-gram and n-gram

# Removing stop words like "not" should be avoided before building n-grams
count_vect=CountVectorizer(ngram_range=(1,2)) # in Scikit learn
final_bigram_counts=count_vect.fit_transform(final['Text'].values)


In [75]:
final_bigram_counts.get_shape()

(364171, 2910192)

## TF-IDF

Variation of BoW

Let us assume we have 'N' documents / reviews. Each review is a combination of words.

Let us assume $r_{1}$ has some words. Similarly, other documents too.

$r_{1}$: $W_{1}$, $W_{2}$, $W_{3}$, $W_{2}$, $W_{5}$             --> 5 words

$r_{2}$: $W_{1}$, $W_{3}$, $W_{4}$, $W_{5}$, $W_{6}$, $W_{2}$    --> 6 words

$r_{3}$: 

.
.
.

$r_{N}$:

TF($W_{i}$, $r_{j}$) = # of times $W_{i}$ occurs in $r_{j}$ / total number of words in $r{j}$
TF($W_{2}$, $r_{1}$) = 2 / 5

<b>0 <= TF($W_{i}$, $r_{j}$) <= 1 </b> Can be interpreted as Probability.

<u>BoW</u> and <u>TF-IDF</u> are techniques done on the text for <i><b>Information Retrieval</b></i> (sub-area of NLP).

TF can be thought of as how often does $W_{i}$ occur in $r_{j}$. If it has all the same words then it has a TF of 1 else if the word occurs a very few times, the TF has a very small value. <b> More often the word occurs, the higher the frequency. </b>

<b>Term Frequency can be thought of as the probability of finding a word $W_{i}$ in a document $r_{j}$. </b>

<b><i>IDF- Inverse Document Frequency</i></b> is for a word $W_{i}$ in a corpus.

Suppose Dataset/Corpus ($D_{c}$) has the following documents:
 
$r_{1}$: $W_{1}$, $W_{2}$, $W_{3}$, $W_{2}$, $W_{5}$             --> 5 words

$r_{2}$: $W_{1}$, $W_{3}$, $W_{4}$, $W_{5}$, $W_{6}$, $W_{2}$    --> 6 words

$r_{3}$: 

.
.
.

$r_{N}$:

<b>IDF($W_{i}$, $D_{c}$) = log(N/$n_{i}$)</b>, where N is the number of documents and $n_{i}$ is the number of documents which contain the word $W_{i}$

<b>Since $n_{i}$ <= N, N/$n_{i}$ >= 1. So, log(N/$n_{i}$) >= 0</b>

1. IDF >= 0
2. If $n_{i}$ increases, then N/$n_{i}$ decreases. Here monotonic function log(N/$n_{i}$) decreases. 
<b>If $W_{i}$ is more frequent in $D_{c}$, the IDF is lower.</b> Hence, if IDF increases, $n_{i}$ decreases and vice-versa.

<b><i>If $W_{i}$ is more frequent, IDF will be low and if $W_{i}$ is very rare, IDF will be high.</i></b>

Given documents {$r_{1}$, $r_{2}$, $r_{3}$,..., $r_{j}$} in $D_{c}$, <b>TF-IDF: TF($W_{i}$, $r_{j}$) * IDF($W_{i}$, $D_{c}$)</b>, TF($W_{i}$, $r_{j}$) is higher if $W_{i}$ is frequent in $r_{j}$ and IDF($W_{i}$, $D_{c}$) is higher when $W_{i}$ is rare in $D_{c}$.

<b>TF-IDF gives
- gives more importance to rarer words in $D_{c}$.
- gives more importance if a word is more frequent in a document/review.</b>

But TF-IDF has a <u>drawback</u> that it <i><b>does not</b> take semantic meaning of words</i>.

In [76]:
tf_idf_vect=TfidfVectorizer(ngram_range=(1,2))
final_tf_idf=tf_idf_vect.fit_transform(final['Text'].values)

In [77]:
final_tf_idf.get_shape()

(364171, 2910192)

In [78]:
features=tf_idf_vect.get_feature_names()
len(features)

2910192

In [79]:
features[100000:100010]

['ales until',
 'ales ve',
 'ales would',
 'ales you',
 'alessandra',
 'alessandra ambrosia',
 'alessi',
 'alessi added',
 'alessi also',
 'alessi and']

In [119]:
# Convert a row in sparse matrix to numpy array
print(final_tf_idf)
print(final_tf_idf[3,:].toarray()[0])

  (0, 1268863)	0.08275523980718573
  (0, 1322643)	0.05736320423975409
  (0, 1181493)	0.05904930622559946
  (0, 2815806)	0.06353114992572337
  (0, 1562605)	0.12941275579745923
  (0, 1032033)	0.11573055882069516
  (0, 2075535)	0.12941275579745923
  (0, 2616462)	0.12541096971801555
  (0, 49126)	0.04475667962178877
  (0, 283308)	0.054444217381069276
  (0, 2381603)	0.0769491583240197
  (0, 2837880)	0.07781013221409352
  (0, 2324657)	0.0937743722525583
  (0, 324043)	0.11573055882069516
  (0, 2608442)	0.10301992784348538
  (0, 2838349)	0.06715422842681543
  (0, 126268)	0.0893377308667213
  (0, 361978)	0.11856987122963351
  (0, 552188)	0.12941275579745923
  (0, 1319489)	0.0927612148772383
  (0, 2578785)	0.026787698091940104
  (0, 105321)	0.06840813514421003
  (0, 1333169)	0.053575946913953454
  (0, 1739789)	0.046544793491389236
  (0, 2255427)	0.12941275579745923
  :	:
  (364170, 1254867)	0.03283894855017248
  (364170, 2614666)	0.05504239102854844
  (364170, 2897041)	0.040848095932348116
  (364

In [82]:
def top_tfidf_features(row, features, top_n = 25):
    '''Get top n tfidf values in row and return them with their corresponding values'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_features = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_features)
    df.columns = ['feature', 'tfidf']
    return df

top_tfidf = top_tfidf_features(final_tf_idf[1,:].toarray()[0], features, 25)


In [None]:
top_tfidf

Unnamed: 0,feature,tfidf
0,sendak books,0.173437
1,rosie movie,0.173437
2,paperbacks seem,0.173437
3,cover version,0.173437
4,these sendak,0.173437
5,the paperbacks,0.173437
6,pages open,0.173437
7,really rosie,0.168074
8,incorporates them,0.168074
9,paperbacks,0.168074


## Word2Vec

<b>Word2Vec</b> takes semantic meaning of words into consideration.

This algorithm takes a word and converts it into a d-D vector where d is typically, 50, 100, 200 or 300. But this is not a sparse vector. But BoW / TF-IDF represented sentences into sparse vectors. 

Consider a 300-D vector. The higher the dimensions, more powerful is the representation.

1. If $W_{1}$ and $W_{2}$ are semantically similar, then $v_{1}$ and $v_{2}$ are closer.
2. In Word2Vec, it satisfies the relationships. 

<b>($V_{man}$ - $V_{woman}$) || ($V_{king}$ - $V_{queen}$) </b>

Word2Vec learns relationships automatically from raw-text.

Word2Vec takes a very large text Corpus as input and for every word it builds a vector. 

Larger dimensions --> more information rich the vector is. If we have a higher dimensional vector it can learn far more complex relationships. 

If $D_{c}$ is large, the higher is the dimensionality. 

Word2Vec looks at sequence information of words. Intuitively, for any word Word2Vec looks at neighborhood of that word. 

N($W_{i}$) is very similar to N($W_{j}$), then $v_{i}$ is very similar to $v_{j}$.

In [122]:
import gensim
i = 0
list_of_sentences = []
for sentence in final['Text'].values:
    filtered_sentence = []
    sentence = cleanhtml(sentence)
    for w in sentence.split():
        for cleaned_words in cleanpunc(w).split():
            if(cleaned_words.isalpha()):
                filtered_sentence.append(cleaned_words.lower())
            else:
                continue
    list_of_sentences.append(filtered_sentence)

In [123]:
print(final['Text'].values[0])
print('----------------------------')
print(list_of_sentences[0])

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
----------------------------
['this', 'witty', 'little', 'book', 'makes', 'my', 'son', 'laugh', 'at', 'loud', 'i', 'recite', 'it', 'in', 'the', 'car', 'as', 'were', 'driving', 'along', 'and', 'he', 'always', 'can', 'sing', 'the', 'refrain', 'hes', 'learned', 'about', 'whales', 'india', 'drooping', 'i', 'love', 'all', 'the', 'new', 'words', 'this', 'book', 'introduces', 'and', 'the', 'silliness', 'of', 'it', 'all', 'this', 'is', 'a', 'classic', 'book', 'i', 'am', 'willing', 'to', 'bet', 'my', 'son', 'will', 'still', 'be', 'able', 'to', 'recite', 'from', 'memory', 'when', 'he', 'is', 'in', 'college']


In [124]:
print(final['Text'].values[1])
print('----------------------------')
print(list_of_sentences[1])

I grew up reading these Sendak books, and watching the Really Rosie movie that incorporates them, and love them. My son loves them too. I do however, miss the hard cover version. The paperbacks seem kind of flimsy and it takes two hands to keep the pages open.
----------------------------
['i', 'grew', 'up', 'reading', 'these', 'sendak', 'books', 'and', 'watching', 'the', 'really', 'rosie', 'movie', 'that', 'incorporates', 'them', 'and', 'love', 'them', 'my', 'son', 'loves', 'them', 'too', 'i', 'do', 'however', 'miss', 'the', 'hard', 'cover', 'version', 'the', 'paperbacks', 'seem', 'kind', 'of', 'flimsy', 'and', 'it', 'takes', 'two', 'hands', 'to', 'keep', 'the', 'pages', 'open']


In [125]:
w2v_model = gensim.models.Word2Vec(list_of_sentences, min_count = 5, vector_size = 50, workers = 4)

In [126]:
words = list(w2v_model.wv.index_to_key)
print(len(words))

33521


In [None]:
w2v_model.wv.most_similar('tasty')

[('tastey', 0.9309494495391846),
 ('yummy', 0.8578251004219055),
 ('satisfying', 0.8419342637062073),
 ('filling', 0.8285698890686035),
 ('delicious', 0.8193663954734802),
 ('flavorful', 0.7889677286148071),
 ('nutritious', 0.7660471200942993),
 ('delish', 0.7587419748306274),
 ('addicting', 0.756447970867157),
 ('versatile', 0.7515951991081238)]

In [None]:
w2v_model.wv.most_similar('like')

[('resemble', 0.6977817416191101),
 ('mean', 0.6769933700561523),
 ('dislike', 0.6450912952423096),
 ('prefer', 0.6284522414207458),
 ('overpower', 0.6175543069839478),
 ('think', 0.6041595339775085),
 ('enjoy', 0.5950704216957092),
 ('expect', 0.5933476090431213),
 ('miss', 0.576910674571991),
 ('fake', 0.5741640329360962)]

In [None]:
count_vect_feature = count_vect.get_feature_names()
count_vect_feature.index('like')
print(count_vect_feature[64055])

activity great


# Avg-Word2Vec, tf-idf weighted Word2Vec

<h4> Avg-Word2Vec </h4>

Word2Vec takes a word and converts it into a d-D vector. 

But $r_{i}$ is a sequence of words/sentences.

How do I convert my sentences to a vector using Word2Vec?

Suppose we have a review $r_{1}$ containing words
$r_{1}$: $W_{1}$, $W_{2}$, $W_{1}$, $W_{3}$, $W_{4}$, $W_{5}$

Suppose I want to convert $r_{1}$ to $v_{1}$, take the Avg Word2Vec representation. 
Take the first word $W_{1}$, convert it into a vector as $v_{1}$

For each word in $r_{1}$, I am getting a vector representation.

W2V($W_{1}$) + W2V($W_{2}$) + W2V($W_{1}$) + W2V($W_{3}$) + W2V($W_{4}$) + W2V($W_{5}$)

Each of these vectors will be d-D. Add all these vectors and then divide the sum by the number of words. 

Suppose in $r_{1}$, there are $n_{1}$ words then $v_{1}$ becomes <b>1/$n_{1}$[W2V($W_{1}$) + W2V($W_{2}$) + W2V($W_{1}$) + W2V($W_{3}$) + W2V($W_{4}$) + W2V($W_{5}$)]</b>

$v_{1}$ is the vector representation of review $r_{1}$. This is known as Avg-Word2Vec. It is not perfect but it works well. This is the simplest way to leverage Word2Vec to build sentence vectors. 

<h4> tf-idf weighted Word2Vec </h4>

Suppose we have a review $r_{1}$ containing words
$r_{1}$: $W_{1}$, $W_{2}$, $W_{1}$, $W_{3}$, $W_{4}$, $W_{5}$

We compute tf-idf for $r_{1}$ as $t_{1}$, $t_{2}$, $t_{3}$, $t_{4}$, $t_{5}$.

When we compute tf-idf-weighted Word2Vec of $r_{1}$

<b>tfidf-W2V($r_{1}$) = [$t_{1}$ * W2V($W_{1}$) + $t_{2}$ * W2V($W_{2}$) + $t_{3}$ * W2V($W_{3}$) + $t_{4}$ * W2V($W_{4}$) + $t_{5}$ * W2V($W_{5}$)] / ($t_{1}$ + $t_{2}$ + $t_{3}$ + $t_{4}$ + $t_{5}$)</b> where $t_{i}$ is tf-idf of the word $w_{i}$ in review $r_{1}$ or <b> $t_{i}$ = tf-idf($w_{i}$, $r_{1}$)
    
    
<b>If all $t_{i}$'s are 1, then tfidf-W2V is same as Avg-Word2Vec.</b>

<h3>Avg-Word2Vec and tf-idf weighted Word2Vec are simple weighting strategies to convert sentences/paragraphs to vectors.</h3>

In [None]:
# Average word2vec
# Compute average word2vec for each review
sent_vectors=[]
for sent in list_of_sentences:
    sent_vec=np.zeros(50)
    cnt_words=0
    for word in sent:
        try:
            vec=w2v_model.wv[word]
            sent_vec+=vec
            cnt_words+=1
        except:
            pass
    sent_vec/=cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))

    

  sent_vec/=cnt_words


364171
50


In [None]:
# TF-IDF weighted Word2Vec
tfidf_feat=tf_idf_vect.get_feature_names()

tfidf_sent_vectors=[]
row=0
for sent in list_of_sentences:
    sent_vec=np.zeros(50)
    weight_sum=0
    for word in sent:
        try:
            vec=w2v_model.wv[word]

            tfidf =final_tf_idf[row, tfidf_feat.index(word)]
            sent_vec+=(vec*tfidf)
            weight_sum+=tfidf
        except:
            pass
    sent_vec/=weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row+=1




## Applying t-SNE

In [127]:
final_string

[b'witti littl book make son laugh loud recit car drive along alway sing refrain hes learn whale india droop love new word book introduc silli classic book will bet son still abl recit memori colleg',
 b'grew read sendak book watch realli rosi movi incorpor love son love howev miss hard cover version paperback seem kind flimsi take two hand keep page open',
 b'fun way children learn month year learn poem throughout school year like handmot invent poem',
 b'great littl book read nice rhythm well good repetit littl one like line chicken soup rice child get month year wonder place like bombay nile eat well know get eat kid mauric sendak version ice skate treat rose head long time wont even know came surpris came littl witti book',
 b'book poetri month year goe month cute littl poem along love book realli fun way learn month poem creativ author purpos write book give children fun way learn month children also learn thing poetri rhythm read book',
 b'charm rhyme book describ circumst eat do

In [128]:
final

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
138706,150524,0006641040,ACITT7DI6IDDL,shari zychinski,0,0,positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,b'witti littl book make son laugh loud recit c...
138688,150506,0006641040,A2IW4PEEKO2R0U,Tracy,1,1,positive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",b'grew read sendak book watch realli rosi movi...
138689,150507,0006641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,positive,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...,b'fun way children learn month year learn poem...
138690,150508,0006641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,positive,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...,b'great littl book read nice rhythm well good ...
138691,150509,0006641040,A3CMRKGE0P909G,Teresa,3,4,positive,1018396800,A great way to learn the months,This is a book of poetry about the months of t...,b'book poetri month year goe month cute littl ...
...,...,...,...,...,...,...,...,...,...,...,...
178145,193174,B009RSR8HO,A4P6AN2L435PV,romarc,0,0,positive,1350432000,LOVE!! LOVE!!,"LOVE, LOVE this sweetener!! I use it in all m...",b'love love sweeten use bake unsweeten flavor ...
173675,188389,B009SF0TN6,A1L0GWGRK4BYPT,Bety Robinson,0,0,positive,1350518400,Amazing!! Great sauce for everything!,You have to try this sauce to believe it! It s...,b'tri sauc believ start littl sweet honey tast...
204727,221795,B009SR4OQ2,A32A6X5KCP7ARG,sicamar,1,1,positive,1350604800,Awesome Taste,I bought this Hazelnut Paste (Nocciola Spread)...,b'bought hazelnut past nocciola spread local s...
5259,5703,B009WSNWC4,AMP7K1O84DH1T,ESTY,0,0,positive,1351209600,DELICIOUS,Purchased this product at a local store in NY ...,b'purchas product local store kid love quick e...


## Applying TNSE on Text BOW vectors

In [1]:
from sklearn.preprocessing import StandardScaler
print(final_bigram_counts.shape)
std_data = StandardScaler(with_mean = False).fit_transform(final_bigram_counts)
print(std_data.shape)
type(std_data)
std_data=std_data.todense()
print(type(std_data))

NameError: name 'final_bigram_counts' is not defined