# Amazon Fine Food Reviews Analysis

This dataset consists of reviews of fine foods from amazon. 

Number of reviews : 568,454
    
No. of users : 256,059
    
No. of Products: 74,258 
    
Reviews from Oct 1999 - Oct 2012
    
Number of Attributes/Columns in data: 10
    

Attributes info :

1. Id
2. ProductId : Unique identifier for the product
3. UserId : Unqiue identifier for the user
4. ProfileName: Profile name of the user
5. HelpfulnessNumerator: Number of users who found the review helpful
6. HelpfulnessDenominator : Number of users who indicated whether they found the review helpful or not
7. Score : Rating between 1 and 5
8. Time : Timestamp for the review
9. Summary : Brief summary of the review
10. Text : Text of the review

Objective of Task:
1. Given a review , determine whether a review is positive(rating between 4and 5)  or negative (rating between 1 and 2) 



Q. How to determine if a review is positive or negative? 

Ans. W could use the Score/Rating attribute . 
    A Score of 4 or 5 could be considered as postive review.
    A Score 1 or 3 could be considered as neagtive review.
    A review of 3 is neutral and can be ignored.
    This is an apporximate and proxy way of determining  the polarity (postive/negative) of a review.

# Loading the Data
The dataset if available in two forms 
1. csv file
2. SQLITE Database
In order to load the data ,We have used the SQLITE dataset as it easier to query the data and visualize the data efficienctly.

Here as we only want to get the global sentiment of the recomendation (positive or negative),we will purposefully ignore all scores equal to 3 ,If the Score_id above 3,then the recommendation will be set to "positve" .Otheriwse ,it will be set to "negative"

In [1]:
# importing imp libraries that we will need
import warnings
warnings.filterwarnings('ignore')
import sqlite3
import pandas as pd
import numpy as np 
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve ,auc

import nltk 
from nltk.stem.porter import PorterStemmer
import re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
from tqdm import tqdm

# 1. Reading Data

In [2]:
# using the SQLITE table to read data.
con =sqlite3.connect('database.sqlite')
#filtering only positive and negative reviews and not taking into consideration those reviews with Score=

filtered_data=pd.read_sql_query(""" SELECT  * FROM Reviews WHERE Score !=3 LIMIT 568454""",con)
#print(filtered_data.shape)
def partition(x):
    if x<3:
        return 0
    return 1
#changing reviews with score less than 3 to be positve and vice versa
actualScore=filtered_data['Score']  # it will give only score attributes
Positive_negative = actualScore.map(partition)  
# it will use the partion func we have created and chang the score in 1 or 0and 
#save in variable "Positive_negative"
#print(filtered_data)
filtered_data['Score']=Positive_negative  # it will replace  the score column with 0 and 1 (from 1,2&4,5 to 0 &1)

print('No. of data points in our dataset',filtered_data.shape)
filtered_data.head(3)

No. of data points in our dataset (525814, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [3]:
# counting the no. of time the same user has given the review
display = pd.read_sql_query("""
SELECT UserId,ProductId ,ProfileName,Time,Score,Text,COUNT(*)
FROM Reviews
GROUP BY UserId 
HAVING COUNT(*)>1
""",con)

In [4]:
print(display.shape)
display.head()

(80668, 7)


Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
0,#oc-R115TNMSPFT9I7,B005ZBZLT4,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ESG,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B005ZBZLT4,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ESG,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBEV0,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [5]:
display[display['UserId']=='A2S8CZX8Y19QQX']

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)


In [6]:
display['COUNT(*)'].sum() #total no. of distinct user give the reviews

393063

# Exploratory Data Analysis 
### 2. Data Cleaning Deduplication:
    It is observed (as shown in the data below)that reviews data had many duplicates entries . Hence it was necessary to remove dupicates in order to get unbaised results for the analysis of the data .

In [7]:
display=pd.read_sql_query("""
SELECT * 
FROM Reviews
WHERE Score !=3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductId
""",con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As can be seen above the same user has multiple reviews of the with the same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text and on doing analysis it was found that

ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies,8.82-Ounce Packages (Pack of 8)

ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on

It was inferred after analysis that reviews with same parameters other than ProductId belonged to the same product just having different flavour or quantity. Hence inorder to reduce reduntancy it was decided to eleminate the rows having same parameters.

The method used for the same was that we first sort the data according to ProductId and then just keep the first similiar product reviews and delete the others .This method ensures that there is only one representatives for each product and deduplication without sorting would lead to possibility of different reprsentatives still existing for the same product.

In [8]:
# sorting data according to Productid in ascending order 
sorted_data= filtered_data.sort_values('ProductId',axis=0 ,ascending=True,inplace=False,kind='quicksort',na_position='last')
sorted_data.shape

(525814, 10)

In [9]:
final_data=sorted_data.drop_duplicates(subset={"UserId","ProfileName" , "Time","Text"},keep="first",inplace=False)
final_data.shape

(364173, 10)

In [10]:
# Checking to see how much % of data still remains 
(final_data['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

69.25890143662969

<b>Observation:-</b> It was also seen that in few rows given below the value of HelpfulnessNumerator is greater than HelpfulnessDenominator which is practically impossible hence these rows are too removed from calculations.

In [11]:
display=pd.read_sql_query("""
SELECT *
FROM Reviews 
WHERE SCORE !=3  AND Id=44737 OR Id=64422
ORDER BY ProductId 
""",con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [12]:
final_data=final_data[final_data['HelpfulnessNumerator']<=final_data['HelpfulnessDenominator']]

In [13]:
#Before starting the next phase of preprocessing lets see the number of entries left
print(final_data.shape)

(364171, 10)


In [14]:
#How many positive and negative reviews are present in our datatset?
final_data['Score'].value_counts()

1    307061
0     57110
Name: Score, dtype: int64

### 3. Text Preprocessing.
Till we have finished data deduplication ,now we will to preprocessing part before we  go for further analysis and making the prediction models .

Hence in Preprocessing part we will do the the following things:
1. Removing of HTML tags.
2. Removing any punctuation or spme special character like , or . or # etc.
3. Check the word is made of made up of english letter and is not alpha-numeric 
4. Check to see if the length of the word is greater than 2(as it was research that there is no adjective in 2 letters).
5. Convert the words to lowercase .
6. Remove the stopwords 
7. Using Snowball Stemming the word (now using Porter Stemming as Snowball Stemming is better than Porter Stemming).

In [15]:
# finding html tag in sentences 
i=0
for review in final_data['Text'].values:
    if (len(re.findall('<.*?>',review))):
        print(i)
        print(review)
        break
    i=i+1

6
I set aside at least an hour each day to read to my son (3 y/o). At this point, I consider myself a connoisseur of children's books and this is one of the best. Santa Clause put this under the tree. Since then, we've read it perpetually and he loves it.<br /><br />First, this book taught him the months of the year.<br /><br />Second, it's a pleasure to read. Well suited to 1.5 y/o old to 4+.<br /><br />Very few children's books are worth owning. Most should be borrowed from the library. This book, however, deserves a permanent spot on your shelf. Sendak's best.


In [16]:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop = set(stopwords.words('english')) # creating set of stopwords
print(stop) # it will to show all the stopwords in NLTK
excluding_stop = ['against','not','don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't",
             'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 
             'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",'shouldn', "shouldn't", 'wasn',
             "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
print('****'*15)

stop = [word for word in stop if word not in excluding_stop]
print(' ')
print(stop)
snowstem=nltk.stem.SnowballStemmer('english') # intialising the snowball stemmer
print(' ')
print('****'*15)
print("base_word of tasty:", snowstem.stem('tasty')) # it will tell us the base word or do stemming

# creating a function to clean the word of any html-tags. The function will remove the html tag and evrything beteween them with "1" space 
def cleanhtml(sentence):
    cleanr = re.compile('<.8?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext
# creating a function to clean the punctuation or special characters .The funvtion will create the punctuation with empty string
def cleanpunc(sentence):
    cleaned= re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned= re.sub(r'[.|,|)|(|\|/]',r'',cleaned)
    return cleaned
 

{'shouldn', 'she', 'under', 'i', "shouldn't", 'few', 'both', 'they', 'myself', 'needn', 'them', 'our', 'other', "won't", 'that', 'this', 'did', 'further', 'each', 'between', "hasn't", 'as', 'of', "didn't", 'own', 'won', "weren't", 'y', 'themselves', 'with', 'mightn', 'how', 'too', 'has', 'you', "wasn't", 'were', 'been', 'nor', 'at', 'was', 'during', "don't", 'why', 'it', 'what', 'your', 's', 'ain', 'off', 'mustn', "you'd", 'ours', 'yours', "she's", 'then', "it's", 'can', 'weren', "aren't", "couldn't", 'herself', 'which', 'into', 'having', 'after', 'some', 'do', 'is', 'their', 'be', 'there', 'him', 'such', 'didn', "that'll", 'itself', 'all', 'had', 'where', 'doing', 'being', 'd', 'below', "hadn't", 'don', 'over', 'down', 'than', 'the', 'he', 'on', 'now', 'in', "haven't", 'an', 'until', 'above', 'any', 'yourselves', 'his', 'most', 'through', 'once', 'no', 've', 'hasn', 'if', 'only', 'hadn', 'doesn', 'have', 'out', 'aren', 'ourselves', 'those', 'when', 'because', "mustn't", 'for', 'more',

In [17]:
print('Print some random reviews')

print(" ")
review_239=final_data['Text'].values[239]
print(review_239)
print("*"*50)

review_1239=final_data['Text'].values[1239]
print(review_1239)
print("*"*50)

review_1500=final_data['Text'].values[1500]
print(review_1500)
print("*"*50)

review_5000=final_data['Text'].values[5000]
print(review_5000)
print("*"*50)

review_25000=final_data['Text'].values[25000]
print(review_25000)
print("*"*50)


Print some random reviews
 
Why is this $[...] when the same product is available for $[...] here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
**************************************************
My teething puppy loves this toy.  The nubs that protrude all over the toy are wonderful on his gums.  It it smaller than I thought and my large breed puppy has outgrown it before his teething phase is over.  I still let him chew on it, but only when he is supervised.
**************************************************
Great ingredients although, chicken should have been 1st rather than chicken broth, the only thing I do not think belongs in it is Canola oil. Canola or rapeseed is not someting a dog would ever find in nature and if it did find rapeseed in nature and eat it, it would poison them. Today's Food industries have convinced the masses th

In [19]:
import os

In [20]:
if not os.path.isfile('final1.sqlite'):
    final_string=[]
    all_positive_words=[]   # storing all the words from +ve review .
    all_negative_words=[]   # storing all the words from -ve review .
    for i ,review in enumerate(tqdm(final_data['Text'].values)):
        filter_sentence=[]
        #print(review)
        review=cleanhtml(review) # remove the html tags and data within tags with "1" space
        for word in review.split():
            #Using cleanpunc(w).split(),, one more split function here example:w="abc.def" ,cleanpunc will return "abc def"
            #if we dont use .split() function then we will be considering "abc def" as a single word, 
            #but if we use .split() function we will get "abc",def
            for cleaned_words in cleanpunc(word).split():
                if((cleaned_words.isalpha()) & (len(cleaned_words)>2)):
                    if (cleaned_words.lower() not in stop):
                        # using snowballstemmer ex: tasty or tasteful = "tasti"
                        s=(snowstem.stem(cleaned_words.lower())).encode('utf8')
                        filter_sentence.append(s)
                        if (final_data['Score'].values[i])==1:
                            all_positive_words.append(s)  #list of all words used to describe positive reviews
                        if (final_data['Score'].values[i])==0:
                            all_negative_words.append(s) #list of all words used to describe negative reviews reviews
                    else :
                        continue
                else :
                    continue
        #print(filter_sentence)
        str1 = b" ".join(filter_sentence)  # final words of cleaned words
        #print('*'*20)
        final_string.append(str1)
        i+1


100%|█████████████████████████████████████████████████████████████████████████| 364171/364171 [16:46<00:00, 361.72it/s]


In [21]:
final_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,this witty little book makes my son laugh at l...
138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,1,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc..."
138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,1,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...
138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,1,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...
138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,1,1018396800,A great way to learn the months,This is a book of poetry about the months of t...


In [22]:
#### Storing the data into .sqlite file #####
final_data['Clean_text']=final_string #adding column of Clean_text which display the data after pre-processing of the review.
#print(final_data[['Text','Clean_text']])
final_data['Clean_text']=final_data['Clean_text'].str.decode("utf-8")

In [23]:
# Srore the final table into an SQLITE table for future 
conn= sqlite3.connect('final1.sqlite')
c=conn.cursor()
conn.text_factory=str
final_data.to_sql('reviews',conn,schema=None,if_exists='replace',index=True,index_label=None,chunksize=None,dtype=None)
conn.close()

In [24]:
final_data.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text',
       'Clean_text'],
      dtype='object')

In [25]:
final_data[['Clean_text','Text']]

Unnamed: 0,Clean_text,Text
138706,witti littl book make son laugh loud recit car...,this witty little book makes my son laugh at l...
138688,grew read sendak book watch realli rosi movi i...,"I grew up reading these Sendak books, and watc..."
138689,fun way children learn month year learn poem t...,This is a fun way for children to learn their ...
138690,great littl book read nice rhythm well good re...,This is a great little book to read aloud- it ...
138691,book poetri month year goe month cute littl po...,This is a book of poetry about the months of t...
...,...,...
178145,love love sweeten use bake unsweeten flavor co...,"LOVE, LOVE this sweetener!! I use it in all m..."
173675,tri sauc believ start littl sweet honey tast b...,You have to try this sauce to believe it! It s...
204727,bought hazelnut past nocciola spread local sho...,I bought this Hazelnut Paste (Nocciola Spread)...
5259,purchas product local store kid love quick eas...,Purchased this product at a local store in NY ...


# BOW(Bag of words): A simple technique to convert words to vector `

In [26]:
#Bow
count_vect=CountVectorizer() #in sklearn: Convert a collection of text documents to a matrix of token counts
final_count=count_vect.fit_transform(final_data['Clean_text'].values)

In [27]:
print("The type of count_vetorizer:",type(final_count))
print('The shape of the text Bow vectorizer:',final_count.get_shape())
print('The number of Unique words:',final_count.get_shape()[1])

The type of count_vetorizer: <class 'scipy.sparse.csr.csr_matrix'>
The shape of the text Bow vectorizer: (364171, 118433)
The number of Unique words: 118433


In [28]:
#final_data.shape 

# Bi-Grams And n-Grams
# Motivation
WE have list of words describing positive and negative reviews lets analyze them .

Analysing by getting the frequency distribution of the words as shown belod

In [30]:
freq_dist_positive= nltk.FreqDist(all_positive_words)
print('Most Common Positive words : ',freq_dist_positive.most_common(20) )

Most Common Positive words :  [(b'not', 144959), (b'like', 138317), (b'tast', 125969), (b'good', 109764), (b'love', 106484), (b'flavor', 106298), (b'use', 102853), (b'great', 101091), (b'one', 94365), (b'product', 88408), (b'tri', 85078), (b'tea', 81730), (b'coffe', 76533), (b'make', 74441), (b'get', 71688), (b'food', 62955), (b'would', 55114), (b'buy', 53551), (b'time', 53532), (b'realli', 52384)]


In [31]:
freq_dist_negative= nltk.FreqDist(all_negative_words)
print('Most Common Negative words : ',freq_dist_negative.most_common(20) )

Most Common Negative words :  [(b'not', 53605), (b'tast', 33814), (b'like', 32053), (b'product', 27395), (b'one', 20160), (b'flavor', 18883), (b'would', 17857), (b'tri', 17512), (b'use', 15141), (b'good', 14600), (b'coffe', 14271), (b'get', 13705), (b'buy', 13606), (b'order', 12694), (b'food', 12360), (b'dont', 11599), (b'tea', 11329), (b'even', 10931), (b'box', 10531), (b'make', 9789)]


<b>Observation:-</b>
1. From above we can see that the most common positive and negative word overlap for eg. 'like' and like could be used as 'not like' etc.

    So it is good idea to consider pairs of words(means using bi-gram) or sequence of n consecutive words(n-grams)

In [32]:
#bi-gram  , tri-gram and n-gram
#removing stop words like "not" should be avoided before building n-grams.
#ngram_range of (1, 1)(bydefault) means only unigrams, (1,2) means unigrams and bigrams, and (2, 2) means only bigrams.
count_vect=CountVectorizer(ngram_range=(1,2)) 
final_bigram_count= count_vect.fit_transform(final_data['Clean_text'].values)

In [33]:
print("The type of count_vetorizer:",type(final_bigram_count))
print('The shape of the text Bow vectorizer:',final_bigram_count.get_shape())
print('The number of Unique words including both unigrams and bigrams :',final_bigram_count.get_shape()[1])

The type of count_vetorizer: <class 'scipy.sparse.csr.csr_matrix'>
The shape of the text Bow vectorizer: (364171, 3024895)
The number of Unique words including both unigrams and bigrams : 3024895


# TF-IDF

In [34]:
tf_idf_vect=TfidfVectorizer(ngram_range=(1,2))
final_tf_idf_vect=tf_idf_vect.fit_transform(final_data['Text'].values)

In [35]:
print("The type of tf_idf_vect:",type(final_tf_idf_vect))
print('The shape of the text TF_IDF vectorizer:',final_tf_idf_vect.get_shape())
print('The number of Unique words including both unigrams and bigrams:',final_tf_idf_vect.get_shape()[1]) 

The type of tf_idf_vect: <class 'scipy.sparse.csr.csr_matrix'>
The shape of the text TF_IDF vectorizer: (364171, 2910192)
The number of Unique words including both unigrams and bigrams: 2910192


In [36]:
features= tf_idf_vect.get_feature_names() # to get each of the feature name
len(features)
print('Some sample feature (unique words in corpus):')
print(features[100000:100010])

Some sample feature (unique words in corpus):
['ales until', 'ales ve', 'ales would', 'ales you', 'alessandra', 'alessandra ambrosia', 'alessi', 'alessi added', 'alessi also', 'alessi and']


In [37]:
#Convert a row in sparsematrix to a numpy array
print(final_tf_idf_vect[3,:].toarray()[0]) # it will convert the sparse matrix for review 3 inta array

[0. 0. 0. ... 0. 0. 0.]


In [38]:
#Source : https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feature(row,features,top_n=25):
    """getting top n tfidf values in row and return them with their corresponding features names"""
    top_ids=np.argsort(row)[::-1][:top_n]
    tops_features=[(features[i],row[i]) for i in top_ids]
    df=pd.DataFrame(tops_features)
    df.columns=['fetures','tfidf']
    return df

top_tf_idf_=top_tfidf_feature(final_tf_idf_vect[1,:].toarray()[0],features,25)


In [39]:
top_tf_idf_ #top 25 term in a given sentence of review "1"

Unnamed: 0,fetures,tfidf
0,sendak books,0.173437
1,rosie movie,0.173437
2,paperbacks seem,0.173437
3,cover version,0.173437
4,these sendak,0.173437
5,the paperbacks,0.173437
6,pages open,0.173437
7,really rosie,0.168074
8,incorporates them,0.168074
9,paperbacks,0.168074


# Word2Vec

In [None]:
#Using Google News Word2Vcetors

# in this project we are using a pretrained model by Google,its 3.3G file ,once you load this into your memory 
#it will occupy ~9Gb,so please do this step only if you have > 12Gb of RAM.
#we will provide a pickle file which contains a dict,& it contains all our courpus words as keys & model[word] as values
#To use this code-snippet,download "GoogleNews-vectors-negative300.bin"
#from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
model= KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin')

In [None]:
#it will show the vetor representation of word "computer"
model.wv['computer']

In [None]:
# to see numeric_simlarity between 2 words
model.wv.most_similarity('woman','man')

In [None]:
# to see most similar_words to the word 'woman'
model.wv.most_similar('woman')

In [None]:
# previously we seen that "tasti " is stemmed word for "tasty and tastful"
model.wv.most_similar('tasti')  #it will raise the error as there  word "tasti" not in vocabulary

In [None]:
model.wv.most_similar('tasty')

In [40]:
# Train your own Word2Vec model using your own text corpus
import gensim
i=0
list_of_sent=[]
for sent in final_data['Clean_text'].values:
    list_of_sent.append(sent.split())
    

In [41]:
print(final_data['Clean_text'].values[0])
print('*'*40)
print(list_of_sent[0])

witti littl book make son laugh loud recit car drive along alway sing refrain hes learn whale india droop love new word book introduc silli classic book will bet son still abl recit memori colleg
****************************************
['witti', 'littl', 'book', 'make', 'son', 'laugh', 'loud', 'recit', 'car', 'drive', 'along', 'alway', 'sing', 'refrain', 'hes', 'learn', 'whale', 'india', 'droop', 'love', 'new', 'word', 'book', 'introduc', 'silli', 'classic', 'book', 'will', 'bet', 'son', 'still', 'abl', 'recit', 'memori', 'colleg']


In [44]:
# To Train the word2vec model
W2v_model=gensim.models.Word2Vec(list_of_sent,min_count=5,size=50,workers=4)
#min_count :means if a word doesnt occur atleast 5 times don't create word2vec
# vector_size :is the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto.
#workers : the last of the major parameters (full list here) is for training parallelization, to speed up training:

In [46]:
words=list(W2v_model.wv.vocab)
print('number of words that occured minimum 5 times',len(words))

23105


In [48]:
print('Sample words:',words[0:50])

Sample words ['witti', 'littl', 'book', 'make', 'son', 'laugh', 'loud', 'recit', 'car', 'drive', 'along', 'alway', 'sing', 'refrain', 'hes', 'learn', 'whale', 'india', 'droop', 'love', 'new', 'word', 'introduc', 'silli', 'classic', 'will', 'bet', 'still', 'abl', 'memori', 'colleg', 'grew', 'read', 'sendak', 'watch', 'realli', 'rosi', 'movi', 'incorpor', 'howev', 'miss', 'hard', 'cover', 'version', 'paperback', 'seem', 'kind', 'flimsi', 'take', 'two']


In [52]:
W2v_model.wv.most_similar('tasti')

[('delici', 0.8072959184646606),
 ('yummi', 0.7740971446037292),
 ('tastey', 0.7396873235702515),
 ('satisfi', 0.6935549378395081),
 ('hearti', 0.6812673807144165),
 ('nice', 0.6781516671180725),
 ('good', 0.6735666990280151),
 ('nutriti', 0.6584853529930115),
 ('crunchi', 0.6568784713745117),
 ('terrif', 0.6428861618041992)]

In [54]:
W2v_model.wv.most_similar('like')

[('weird', 0.7368110418319702),
 ('okay', 0.7074517011642456),
 ('dislik', 0.7043203115463257),
 ('alright', 0.689321756362915),
 ('prefer', 0.6711335182189941),
 ('appeal', 0.6690617203712463),
 ('funki', 0.6625389456748962),
 ('odd', 0.6554189324378967),
 ('gross', 0.6553378701210022),
 ('resembl', 0.6423805952072144)]

In [55]:
count_vec_feature= count_vect.get_feature_names() # list of words in the BOW 
count_vec_feature.index('like')
#print(count_vec_feature[64055])

1497957

In [56]:
print(count_vec_feature[1497957])

like
