# Amazon Fine Food Reviews Analysis

This dataset consists of reviews of fine foods from Amazon.

Number of reviews : 568,454
    
No. of users : 256,059
    
No. of Products: 74,258 
    
Reviews from Oct 1999 - Oct 2012
    
Number of Attributes/Columns in data: 10
    

Attributes info :

1. Id
2. ProductId : Unique identifier for the product
3. UserId : Unique identifier for the user
4. ProfileName: Profile name of the user
5. HelpfulnessNumerator: Number of users who found the review helpful
6. HelpfulnessDenominator : Number of users who indicated whether they found the review helpful or not
7. Score : Rating between 1 and 5
8. Time : Timestamp for the review
9. Summary : Brief summary of the review
10. Text : Text of the review
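
The Time values that appear in this notebook's outputs (for example 1303862400) look like Unix timestamps in seconds. The attribute list does not say so explicitly, so treat that as an assumption. Under it, a minimal sketch of turning a value into a calendar date with the standard library could look like this (`to_date` is a hypothetical helper, not part of the notebook):

```python
from datetime import datetime, timezone

def to_date(unix_seconds: int) -> str:
    """Convert a Unix timestamp in seconds to an ISO date string (UTC)."""
    # Assumption: the Time column stores seconds since 1970-01-01 UTC.
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).strftime("%Y-%m-%d")

print(to_date(1303862400))
print(to_date(939340800))
```

Readable dates make it easier to check the stated review period or to split the data by time later on.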

Objective of Task:
1. Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).



Q. How to determine if a review is positive or negative?

Ans. We could use the Score/Rating attribute.
    A Score of 4 or 5 could be considered a positive review.
    A Score of 1 or 2 could be considered a negative review.
    A Score of 3 is neutral and can be ignored.
    This is an approximate, proxy way of determining the polarity (positive/negative) of a review.
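
The labelling rule above can be sketched as a small function. This is only an illustration of the rule; `score_to_label` is a hypothetical name and the scores are made-up values:

```python
from typing import Optional

def score_to_label(score: int) -> Optional[int]:
    """Return 1 for positive (4 or 5), 0 for negative (1 or 2), None for neutral (3)."""
    if score == 3:
        return None  # neutral reviews are ignored
    return 1 if score > 3 else 0

scores = [5, 1, 3, 4, 2]  # made-up example scores
labels = [score_to_label(s) for s in scores if score_to_label(s) is not None]
print(labels)
```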

# Loading the Data
The dataset is available in two forms:
1. CSV file
2. SQLite database

To load the data, we use the SQLite database, as it makes it easier to query and visualize the data efficiently.

Since we only want the overall sentiment of a recommendation (positive or negative), we purposefully ignore all reviews with a Score equal to 3. If the Score is above 3, the recommendation is labelled "positive"; otherwise it is labelled "negative".
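
One way to see why the SQLite form is convenient is that the filtering and labelling can be expressed directly in SQL. The sketch below uses an in-memory database with a made-up, three-column `Reviews` table rather than the real `database.sqlite`, so the schema and rows are assumptions for illustration only:

```python
import sqlite3

# In-memory stand-in for the real database; the rows are invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "great"), (2, 3, "okay"), (3, 1, "bad"), (4, 4, "good")],
)

# Drop neutral reviews and derive the sentiment label in one query.
rows = con.execute("""
    SELECT Id,
           CASE WHEN Score > 3 THEN 'positive' ELSE 'negative' END AS Sentiment
    FROM Reviews
    WHERE Score != 3
    ORDER BY Id
""").fetchall()
con.close()
print(rows)
```

The notebook itself applies the label in pandas after loading; doing it in SQL is simply an alternative.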

In [1]:
# importing the libraries that we will need
import warnings
warnings.filterwarnings('ignore')  # suppress warnings raised by the code that follows
# %matplotlib inline  # sets the matplotlib backend to 'inline':
#                     # plot output is displayed in frontends like the Jupyter notebook,
#                     # directly below the code cell that produced it,
#                     # and the resulting plots are also stored in the notebook document
import sqlite3
import pandas as pd
import numpy as np 
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve ,auc

import nltk 
from nltk.stem.porter import PorterStemmer
import re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
from tqdm import tqdm

# 1. Reading Data

In [2]:
# using the SQLite table to read the data
con = sqlite3.connect('database.sqlite')

# keep only positive and negative reviews, i.e. skip reviews with Score = 3
filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 568454 """, con)
# print(filtered_data.shape)

def partition(x):
    """Map a raw score to a label: 0 (negative) if below 3, else 1 (positive)."""
    if x < 3:
        return 0
    return 1

# relabel reviews: Score < 3 becomes 0 (negative), Score > 3 becomes 1 (positive)
actualScore = filtered_data['Score']  # the raw Score column
Positive_negative = actualScore.map(partition)  # apply partition() to every score
filtered_data['Score'] = Positive_negative  # replace the Score column (1, 2 -> 0 and 4, 5 -> 1)

print('No. of data points in our dataset',filtered_data.shape)
filtered_data.head(3)

No. of data points in our dataset (525814, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [6]:
filtered_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525814 entries, 0 to 525813
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      525814 non-null  int64 
 1   ProductId               525814 non-null  object
 2   UserId                  525814 non-null  object
 3   ProfileName             525814 non-null  object
 4   HelpfulnessNumerator    525814 non-null  int64 
 5   HelpfulnessDenominator  525814 non-null  int64 
 6   Score                   525814 non-null  int64 
 7   Time                    525814 non-null  int64 
 8   Summary                 525814 non-null  object
 9   Text                    525814 non-null  object
dtypes: int64(5), object(5)
memory usage: 40.1+ MB


In [3]:
# counting how many reviews each user has written (only users with more than one review)
display = pd.read_sql_query("""
            SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*)
            FROM Reviews
            GROUP BY UserId 
            HAVING COUNT(*)>1
            """,con)

In [4]:
print(display.shape)
display.head()

(80668, 7)


Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
0,#oc-R115TNMSPFT9I7,B005ZBZLT4,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ESG,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B005ZBZLT4,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ESG,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBEV0,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [5]:
display[display['UserId']=='A2S8CZX8Y19QQX']

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)


In [6]:
display['COUNT(*)'].sum()  # total number of reviews written by users who have more than one review

393063

# Exploratory Data Analysis 
### 2. Data Cleaning: Deduplication
The reviews data contains many duplicate entries (as shown below). Removing these duplicates is necessary to get unbiased results from the analysis.

In [7]:
display=pd.read_sql_query("""
SELECT * 
FROM Reviews
WHERE Score !=3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductId
""",con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As can be seen above, the same user has multiple reviews with the same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis it was found that

ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)

ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8), and so on.

It was inferred that reviews with the same parameters other than ProductId belong to the same product in a different flavour or quantity. Hence, in order to reduce redundancy, it was decided to eliminate the rows having the same parameters.

The method used was to first sort the data by ProductId and then keep only the first of each group of identical reviews, deleting the others. This ensures that there is only one representative for each product; deduplication without sorting could leave different representatives for the same product.
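
The sort-then-keep-first idea can be sketched without pandas. The rows below are invented and `deduplicate` is a hypothetical helper; the key columns mirror the ones used for deduplication in this notebook:

```python
# Made-up rows: the first two are the same review posted under two product ids.
rows = [
    {"ProductId": "B2", "UserId": "U1", "ProfileName": "ann", "Time": 100, "Text": "tasty"},
    {"ProductId": "B1", "UserId": "U1", "ProfileName": "ann", "Time": 100, "Text": "tasty"},
    {"ProductId": "B3", "UserId": "U2", "ProfileName": "bob", "Time": 200, "Text": "stale"},
]

def deduplicate(rows):
    """Sort by ProductId, then keep the first row for each (user, name, time, text) key."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r["ProductId"]):
        key = (row["UserId"], row["ProfileName"], row["Time"], row["Text"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

kept = deduplicate(rows)
print([r["ProductId"] for r in kept])
```

Because the rows are sorted first, the surviving copy is always the one with the smallest ProductId, regardless of the original row order.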

In [8]:
# sorting the data by ProductId in ascending order
sorted_data= filtered_data.sort_values('ProductId',axis=0 ,ascending=True,inplace=False,kind='quicksort',na_position='last')
sorted_data.shape

(525814, 10)

In [9]:
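# a list (rather than a set) keeps the column order explicit and is the idiomatic form for `subset`
final_data = sorted_data.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep="first", inplace=False)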
final_data.shape

(364173, 10)

In [10]:
# Checking what percentage of the data remains after deduplication
(final_data['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

69.25890143662969

<b>Observation:-</b> In a few rows (shown below) the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not possible in practice, so these rows are also removed from the analysis.

In [11]:
display=pd.read_sql_query("""
SELECT *
FROM Reviews 
    WHERE Score != 3 AND (Id=44737 OR Id=64422)
ORDER BY ProductId 
""",con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [12]:
final_data=final_data[final_data['HelpfulnessNumerator']<=final_data['HelpfulnessDenominator']]

In [13]:
#Before starting the next phase of preprocessing lets see the number of entries left
print(final_data.shape)

(364171, 10)


In [14]:
# How many positive and negative reviews are present in our dataset?
final_data['Score'].value_counts()

1    307061
0     57110
Name: Score, dtype: int64

### 3. Text Preprocessing
Now that data deduplication is finished, we move on to preprocessing before further analysis and building the prediction models.

In the preprocessing phase we do the following:
1. Remove HTML tags.
2. Remove punctuation and special characters such as , or . or # etc.
3. Check that the word is made up of English letters and is not alphanumeric.
4. Check that the length of the word is greater than 2 (as it was reported that there are no two-letter adjectives).
5. Convert the word to lowercase.
6. Remove stopwords.
7. Stem the word using Snowball stemming (preferred here over Porter stemming).
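
The steps above can be sketched as a single function using only the standard library. This is a simplified illustration, not the notebook's actual pipeline: `STOPWORDS` is a tiny made-up list standing in for NLTK's, and stemming (step 7) is only marked with a comment because it needs NLTK.

```python
import re

# Hypothetical, tiny stopword list for illustration; the notebook uses NLTK's list.
STOPWORDS = {"the", "is", "this", "and", "was", "for"}

def preprocess(review: str) -> list:
    text = re.sub(r"<.*?>", " ", review)         # 1. remove HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # 2. remove punctuation / special characters
    words = []
    for word in text.split():
        if not word.isalpha():                   # 3. skip tokens that contain digits
            continue
        if len(word) <= 2:                       # 4. skip words of length 2 or less
            continue
        word = word.lower()                      # 5. lowercase
        if word in STOPWORDS:                    # 6. drop stopwords
            continue
        words.append(word)                       # 7. a stemmer would be applied here
    return words

print(preprocess("This tea is GREAT!<br />Bought 2packs for the office."))
```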

In [15]:
# finding html tag in sentences 
i=0
for review in final_data['Text'].values:
    if (len(re.findall('<.*?>',review))):
        print(i)
        print(review)
        break
    i=i+1

6
I set aside at least an hour each day to read to my son (3 y/o). At this point, I consider myself a connoisseur of children's books and this is one of the best. Santa Clause put this under the tree. Since then, we've read it perpetually and he loves it.<br /><br />First, this book taught him the months of the year.<br /><br />Second, it's a pleasure to read. Well suited to 1.5 y/o old to 4+.<br /><br />Very few children's books are worth owning. Most should be borrowed from the library. This book, however, deserves a permanent spot on your shelf. Sendak's best.


In [16]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop = set(stopwords.words('english'))  # set of NLTK English stopwords
print(stop)  # show all the stopwords in NLTK
# negation words are kept in the text, since they carry sentiment
excluding_stop = ['against','not','don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't",
                  'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't",
                  'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",'shouldn', "shouldn't", 'wasn',
                  "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
print('****'*15)

stop = stop - set(excluding_stop)  # still a set, so membership tests stay fast
print(' ')
print(stop)
snowstem = nltk.stem.SnowballStemmer('english')  # initialising the Snowball stemmer
print(' ')
print('****'*15)
print("base_word of tasty:", snowstem.stem('tasty'))  # prints the stem (base form) of the word

# remove HTML tags: each tag, including everything between its angle brackets, is replaced with one space
def cleanhtml(sentence):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext

# remove punctuation and special characters: quote-like characters are deleted,
# separators are replaced with a space so that "abc.def" becomes "abc def"
def cleanpunc(sentence):
    cleaned = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]', r' ', cleaned)
    return cleaned


{"you've", 'the', 'more', 'before', 'd', 'shan', 'very', "won't", 'here', 'each', 'both', 'been', "that'll", 'if', 'out', 'had', 'than', 'any', 'that', 'but', 't', 'were', 'ma', 'all', 'is', "don't", 'other', "should've", 'our', "didn't", 'your', 'we', 'down', 'isn', 'doing', 'where', "you'd", 'its', 'this', 's', 'himself', 'these', 'up', 'their', 'while', 'ain', 'not', 'will', 'hadn', 'doesn', 'o', 'shouldn', 'you', "weren't", 'in', 'same', 'yourselves', 'her', 'itself', "hadn't", 'll', 'she', 'and', 'on', 'mightn', 'don', 'those', 'was', 'what', "haven't", 'his', 'then', 'hasn', "wasn't", 'only', 'haven', 'after', 'can', 'theirs', 'under', 'or', 'few', 'they', 'off', 'didn', "mustn't", "wouldn't", 'until', 'some', 'me', 'won', 'he', 'wouldn', 'i', 'with', 'over', 'ourselves', 'ours', 'above', 'yourself', "it's", 'themselves', 'are', 'aren', 'weren', 'once', 'herself', 'how', "you're", 'couldn', "mightn't", 'yours', 'do', 'between', 'should', 'as', 'myself', 'which', 'most', 'when', '

In [18]:
print('Print some random reviews')

print(" ")
review_0=final_data['Text'].values[0]
print(review_0)
print("*"*50)

review_1239=final_data['Text'].values[1239]
print(review_1239)
print("*"*50)

review_1500=final_data['Text'].values[1500]
print(review_1500)
print("*"*50)

review_5000=final_data['Text'].values[5000]
print(review_5000)
print("*"*50)

review_25000=final_data['Text'].values[25000]
print(review_25000)
print("*"*50)


Print some random reviews
 
this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
**************************************************
My teething puppy loves this toy.  The nubs that protrude all over the toy are wonderful on his gums.  It it smaller than I thought and my large breed puppy has outgrown it before his teething phase is over.  I still let him chew on it, but only when he is supervised.
**************************************************
Great ingredients although, chicken should have been 1st rather than chicken broth, the only thing I do not think belongs in it is Canola oil. Canola or rapeseed is not someting a dog would ever find in nature and if it did f

In [19]:
# Remove URLs from the text
review_0 = re.sub(r"http\S+", "", review_0)
review_1239 = re.sub(r"http\S+", "", review_1239)
review_1500 = re.sub(r"http\S+", "", review_1500)
review_5000 = re.sub(r"http\S+", "", review_5000)

print(review_0)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college


In [1]:
pip install lxml


Note: you may need to restart the kernel to use updated packages.


In [20]:
## https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
from bs4 import BeautifulSoup
soup = BeautifulSoup(review_0,'lxml')
text=soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(review_1239,'lxml')
text=soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(review_1500,'lxml')
text=soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(review_5000,'lxml')
text=soup.get_text()
print(text)
print("="*50)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
My teething puppy loves this toy.  The nubs that protrude all over the toy are wonderful on his gums.  It it smaller than I thought and my large breed puppy has outgrown it before his teething phase is over.  I still let him chew on it, but only when he is supervised.
Great ingredients although, chicken should have been 1st rather than chicken broth, the only thing I do not think belongs in it is Canola oil. Canola or rapeseed is not someting a dog would ever find in nature and if it did find rapeseed in nature and eat it, it would poison them. Today's Food industries have convinced the masses that Canola oil is a sa

In [21]:
import os
import re
def decontracted(phrase): #https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [23]:
# remove words that contain digits  # https://stackoverflow.com/a/18082370/4084039
review_0 = re.sub(r"\S*\d\S*", "", review_0).strip()
print(review_0)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college


In [24]:
# remove special characters  # https://stackoverflow.com/a/5843547/4084039
review_1500 = re.sub('[^A-Za-z0-9]+', ' ', review_1500)
print(review_1500)

Great ingredients although chicken should have been 1st rather than chicken broth the only thing I do not think belongs in it is Canola oil Canola or rapeseed is not someting a dog would ever find in nature and if it did find rapeseed in nature and eat it it would poison them Today s Food industries have convinced the masses that Canola oil is a safe and even better oil than olive or virgin coconut facts though say otherwise Until the late 70 s it was poisonous until they figured out a way to fix that I still like it but it could be better 


In [30]:
final_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,this witty little book makes my son laugh at l...
138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,1,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc..."
138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,1,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...
138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,1,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...
138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,1,1018396800,A great way to learn the months,This is a book of poetry about the months of t...


In [32]:
# Combining all the steps above
from tqdm import tqdm
preprocessed_review = []
# tqdm prints a progress bar
for sentence in tqdm(final_data['Text'].values):
    sentence = re.sub(r"http\S+", "", sentence)            # remove URLs
    sentence = BeautifulSoup(sentence, 'lxml').get_text()  # strip HTML tags
    sentence = decontracted(sentence)                      # expand contractions
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()   # drop words that contain digits
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)         # keep letters only
    # https://gist.github.com/sebleier/554280
    # lowercase every word and drop stopwords
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stop)
    preprocessed_review.append(sentence.strip())
    

100%|████████████████████████████████████████████████████████████████████████| 364171/364171 [05:52<00:00, 1034.13it/s]


# Another way of doing preprocessing
if not os.path.isfile('final.sqlite'):
    final_string = []
    all_positive_words = []  # all words from positive reviews
    all_negative_words = []  # all words from negative reviews
    scores = final_data['Score'].values

    for i, review in enumerate(tqdm(final_data['Text'].values)):
        filter_sentence = []
        review = cleanhtml(review)  # replace each HTML tag with a single space
        for word in review.split():
            # cleanpunc() may turn one token into several words, e.g. "abc.def" -> "abc def",
            # so split again to treat "abc" and "def" as separate words
            for cleaned_words in cleanpunc(word).split():
                if cleaned_words.isalpha() and len(cleaned_words) > 2:
                    if cleaned_words.lower() not in stop:
                        # Snowball stemming, e.g. "tasty" -> "tasti"
                        s = snowstem.stem(cleaned_words.lower()).encode('utf8')
                        filter_sentence.append(s)
                        if scores[i] == 1:
                            all_positive_words.append(s)  # words used in positive reviews
                        if scores[i] == 0:
                            all_negative_words.append(s)  # words used in negative reviews
        str1 = b" ".join(filter_sentence)  # the cleaned review as a single byte string
        final_string.append(str1)


In [38]:
final_data['Cleaned_text'] = preprocessed_review  # add a Cleaned_text column holding each review after preprocessing
#print(final_data[['Text','Cleaned_text']])

In [39]:
final_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Cleaned_text
138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,witty little book makes son laugh loud recite ...
138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,1,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",grew reading sendak books watching really rosi...
138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,1,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...,fun way children learn months year learn poems...
138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,1,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...,great little book read aloud nice rhythm well ...
138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,1,1018396800,A great way to learn the months,This is a book of poetry about the months of t...,book poetry months year goes month cute little...


In [23]:
# Store the final table in an SQLite database for future use
conn= sqlite3.connect('final1.sqlite')
c=conn.cursor()
conn.text_factory=str
final_data.to_sql('reviews',conn,schema=None,if_exists='replace',index=True,index_label=None,chunksize=None,dtype=None)
conn.close()

In [40]:
final_data.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text',
       'Cleaned_text'],
      dtype='object')

In [42]:
final_data[['Cleaned_text','Text']]

Unnamed: 0,Cleaned_text,Text
138706,witty little book makes son laugh loud recite ...,this witty little book makes my son laugh at l...
138688,grew reading sendak books watching really rosi...,"I grew up reading these Sendak books, and watc..."
138689,fun way children learn months year learn poems...,This is a fun way for children to learn their ...
138690,great little book read aloud nice rhythm well ...,This is a great little book to read aloud- it ...
138691,book poetry months year goes month cute little...,This is a book of poetry about the months of t...
...,...,...
178145,love love sweetener use baking unsweetened fla...,"LOVE, LOVE this sweetener!! I use it in all m..."
173675,try sauce believe starts little sweet honey ta...,You have to try this sauce to believe it! It s...
204727,bought hazelnut paste nocciola spread local sh...,I bought this Hazelnut Paste (Nocciola Spread)...
5259,purchased product local store ny kids love qui...,Purchased this product at a local store in NY ...


# BOW (Bag of Words): A simple technique to convert words to vectors

In [43]:
#Bow
count_vect=CountVectorizer()  
final_count=count_vect.fit_transform(final_data['Cleaned_text'].values)

In [44]:
print("The type of count_vectorizer:",type(final_count))
print('The shape of the text Bow vectorizer:',final_count.get_shape())
print('The number of Unique words:',final_count.get_shape()[1])

The type of count_vectorizer: <class 'scipy.sparse.csr.csr_matrix'>
The shape of the text Bow vectorizer: (364171, 116771)
The number of Unique words: 116771
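
To make the shape above concrete: each row is one review and each column is one vocabulary word, holding that word's count in the review. A minimal hand-rolled sketch on two made-up documents is shown below. It only illustrates the idea; `CountVectorizer` additionally applies its own tokenization rules and stores the result as a sparse matrix.

```python
from collections import Counter

docs = ["tasty tea tasty", "bad tea"]  # made-up, already cleaned reviews

# The vocabulary is the sorted set of all distinct words in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, with one entry per vocabulary word.
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)
print(vectors)
```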


In [45]:
#final_data.shape 

# Bi-Grams And n-Grams


In [48]:
# bi-grams, tri-grams and n-grams
# Removing stop words like "not" should be avoided before building n-grams.
# ngram_range=(1, 1) (the default) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
count_vect=CountVectorizer(ngram_range=(1,2)) 
final_bigram_count= count_vect.fit_transform(final_data['Cleaned_text'].values)

In [49]:
print("The type of count_vectorizer:",type(final_bigram_count))
print('The shape of the text Bow vectorizer:',final_bigram_count.get_shape())
print('The number of Unique words including both unigrams and bigrams :',final_bigram_count.get_shape()[1])

The type of count_vectorizer: <class 'scipy.sparse.csr.csr_matrix'>
The shape of the text Bow vectorizer: (364171, 3934332)
The number of Unique words including both unigrams and bigrams : 3934332
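
The jump in the number of features comes from every adjacent word pair becoming its own column. A small sketch of how unigrams and bigrams are generated from a token list is shown below; `ngrams` is a hypothetical helper that mimics the meaning of `ngram_range=(1, 2)`, not scikit-learn's implementation:

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams("not good at all".split()))
```

The bigram "not good" is what lets a model tell this phrase apart from "good", which is why negation words are worth keeping before n-grams are built.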


# TF-IDF

In [50]:
tf_idf_vect=TfidfVectorizer(ngram_range=(1,2))
final_tf_idf_vect=tf_idf_vect.fit_transform(final_data['Cleaned_text'].values)

In [51]:
print("The type of tf_idf_vect:",type(final_tf_idf_vect))
print('The shape of the text TF_IDF vectorizer:',final_tf_idf_vect.get_shape())
print('The number of Unique words including both unigrams and bigrams:',final_tf_idf_vect.get_shape()[1]) 

The type of tf_idf_vect: <class 'scipy.sparse.csr.csr_matrix'>
The shape of the text TF_IDF vectorizer: (364171, 3934332)
The number of Unique words including both unigrams and bigrams: 3934332
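
For intuition about the values stored in this matrix, here is a hand computation on two made-up documents. It assumes the default `TfidfVectorizer` settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); with other settings the numbers would differ.

```python
import math
from collections import Counter

docs = [["tea", "tasty"], ["tea", "bad"]]  # made-up tokenized reviews
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = Counter(w for d in docs for w in set(d))
# Smoothed idf, as assumed above.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf(doc):
    """Return the L2-normalized tf-idf vector of one tokenized document."""
    tf = Counter(doc)
    raw = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

weights = tfidf(docs[0])
print(dict(zip(vocab, weights)))
```

"tea" occurs in both documents, so it receives a lower weight than "tasty", which occurs in only one.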


In [52]:
# get the name of each feature
# (on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2)
features = tf_idf_vect.get_feature_names()
print(len(features))
print('Some sample feature (unique words in corpus):')
print(features[100000:100010])

3934332
Some sample feature (unique words in corpus):
['altoids fact', 'altoids falvors', 'altoids fan', 'altoids favorite', 'altoids favorites', 'altoids finding', 'altoids flavor', 'altoids flavors', 'altoids flavour', 'altoids found']


In [53]:
#Convert a row in sparsematrix to a numpy array
print(final_tf_idf_vect[3,:].toarray()[0]) # it will convert the sparse matrix for review 3 into array

[0. 0. 0. ... 0. 0. 0.]


In [54]:
#Source : https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feature(row,features,top_n=25):
    """getting top n tfidf values in row and return them with their corresponding features names"""
    top_ids=np.argsort(row)[::-1][:top_n]
    tops_features=[(features[i],row[i]) for i in top_ids]
    df=pd.DataFrame(tops_features)
    df.columns=['features','tfidf']
    return df

top_tf_idf_=top_tfidf_feature(final_tf_idf_vect[1,:].toarray()[0],features,25)


In [55]:
top_tf_idf_ #top 25 term in a given sentence of review "1"

Unnamed: 0,features,tfidf
0,incorporates love,0.183116
1,rosie movie,0.183116
2,cover version,0.183116
3,paperbacks seem,0.183116
4,flimsy takes,0.183116
5,reading sendak,0.183116
6,movie incorporates,0.183116
7,books watching,0.183116
8,grew reading,0.183116
9,version paperbacks,0.183116


# Word2Vec

In [None]:
# Using Google News Word2Vec vectors

# In this project we use a model pretrained by Google. It is a 3.3 GB file; once loaded into memory
# it occupies ~9 GB, so please run this step only if you have more than 12 GB of RAM.
# We will provide a pickle file containing a dict with all our corpus words as keys and model[word] as values.
# To use this code snippet, download "GoogleNews-vectors-negative300.bin"
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
# the file is in the binary word2vec format, so binary=True is required
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
# show the vector representation of the word "computer"
# (`model` is already a KeyedVectors object, so it is used directly rather than through .wv)
model['computer']

In [None]:
# numeric similarity between two words
model.similarity('woman', 'man')

In [None]:
# the words most similar to 'woman'
model.most_similar('woman')

In [None]:
# earlier we saw that "tasti" is the stemmed form of "tasty"
model.most_similar('tasti')  # raises a KeyError, as the word "tasti" is not in the vocabulary

In [None]:
model.most_similar('tasty')
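
The similarity scores that gensim reports between word vectors are cosine similarities. A minimal sketch of that computation is shown below; the three-dimensional vectors are invented for illustration and are not taken from the Google News model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "word vectors"; real ones have 300 dimensions.
woman = [0.9, 0.1, 0.3]
man = [0.8, 0.2, 0.3]
pizza = [-0.1, 0.9, -0.4]

print(cosine_similarity(woman, man))
print(cosine_similarity(woman, pizza))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score near 0 or below.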

In [56]:
# Train your own Word2Vec model on your own text corpus
import gensim
list_of_sent = []  # each review as a list of tokens
for sent in final_data['Cleaned_text'].values:
    list_of_sent.append(sent.split())
    

In [41]:
print(final_data['Cleaned_text'].values[0])
print('*'*40)
print(list_of_sent[0])

witti littl book make son laugh loud recit car drive along alway sing refrain hes learn whale india droop love new word book introduc silli classic book will bet son still abl recit memori colleg
****************************************
['witti', 'littl', 'book', 'make', 'son', 'laugh', 'loud', 'recit', 'car', 'drive', 'along', 'alway', 'sing', 'refrain', 'hes', 'learn', 'whale', 'india', 'droop', 'love', 'new', 'word', 'book', 'introduc', 'silli', 'classic', 'book', 'will', 'bet', 'son', 'still', 'abl', 'recit', 'memori', 'colleg']


In [57]:
# Train the Word2Vec model
W2v_model=gensim.models.Word2Vec(list_of_sent,min_count=5,size=50,workers=4)
# min_count: words that occur fewer than 5 times in the corpus get no vector
# size: the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto
#       (this parameter is named vector_size in gensim 4.0 and later)
# workers: the number of worker threads used to parallelize, and so speed up, training
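
To make the effect of `min_count` concrete, the sketch below counts word frequencies over a toy corpus and keeps only the words that reach the threshold. The helper `words_kept_by_min_count` is hypothetical and only mimics the filtering rule; it is not gensim code.

```python
from collections import Counter

def words_kept_by_min_count(sentences, min_count):
    """Return the words whose total corpus frequency is at least min_count."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}

toy_sentences = [
    ["good", "coffee"],
    ["good", "tea"],
    ["good", "coffee", "price"],
]
kept = words_kept_by_min_count(toy_sentences, min_count=2)
print(sorted(kept))  # ['coffee', 'good']
```

Here "tea" and "price" occur only once each, so with a threshold of 2 they would receive no vector.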

In [66]:
print(W2v_model.wv['great'])

[-0.6107101  -0.67034346 -2.6221585  -1.4425089  -1.3013947   4.1874156
 -0.9494372  -1.0349554  -0.33052203 -2.5994067   2.6629179  -1.655124
  2.3965137   0.58096296 -0.67995054  0.13025503  0.79518604  3.0029774
  2.8996725   1.8602083   0.50494504 -0.21483125 -0.52742594 -0.51032174
  0.8326055  -1.3442563   1.6809897  -1.0757923  -0.44997677 -2.2265508
 -0.3954141  -0.06144696  0.2763246  -2.5573807   1.6016845  -1.7303195
  1.2434839   1.2322925   3.070183   -5.606472    1.1513461   0.32933712
  0.39576283 -1.2074327  -2.2284095  -1.3324424   0.7905092  -1.6390716
  2.5836208   2.6731517 ]


In [59]:
print(W2v_model.wv.most_similar('great'))
print('='*50)
print(W2v_model.wv.most_similar('worst'))

[('fantastic', 0.8883295059204102), ('terrific', 0.8827334046363831), ('awesome', 0.8770381808280945), ('good', 0.8632175922393799), ('excellent', 0.8463320732116699), ('wonderful', 0.8063865303993225), ('perfect', 0.7888157367706299), ('nice', 0.7448451519012451), ('fabulous', 0.7446472644805908), ('amazing', 0.7292959690093994)]
[('nastiest', 0.8599145412445068), ('greatest', 0.7761662006378174), ('disgusting', 0.7535935640335083), ('best', 0.7331047654151917), ('terrible', 0.7247992753982544), ('awful', 0.713337779045105), ('horrible', 0.7061472535133362), ('tastiest', 0.7038686871528625), ('horrid', 0.6958554983139038), ('vile', 0.6948801279067993)]


In [62]:
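# wv.vocab is the gensim 3.x attribute; gensim 4.0 and later expose the vocabulary as wv.key_to_index
w2vwords=list(W2v_model.wv.vocab)
print('number of words that occurred at least 5 times',len(w2vwords))

number of words that occurred at least 5 times 33584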


In [63]:
print('Sample words:',w2vwords[0:50])

Sample words: ['witty', 'little', 'book', 'makes', 'son', 'laugh', 'loud', 'recite', 'car', 'driving', 'along', 'always', 'sing', 'refrain', 'learned', 'whales', 'india', 'drooping', 'roses', 'love', 'new', 'words', 'introduces', 'silliness', 'classic', 'willing', 'bet', 'still', 'able', 'memory', 'college', 'grew', 'reading', 'sendak', 'books', 'watching', 'really', 'rosie', 'movie', 'incorporates', 'loves', 'however', 'miss', 'hard', 'cover', 'version', 'seem', 'kind', 'flimsy', 'takes']


# Converting text into vectors using Avg W2V, TFIDF-W2V

# Avg W2V

In [69]:
list_of_sent[0]

['witty',
 'little',
 'book',
 'makes',
 'son',
 'laugh',
 'loud',
 'recite',
 'car',
 'driving',
 'along',
 'always',
 'sing',
 'refrain',
 'learned',
 'whales',
 'india',
 'drooping',
 'roses',
 'love',
 'new',
 'words',
 'book',
 'introduces',
 'silliness',
 'classic',
 'book',
 'willing',
 'bet',
 'son',
 'still',
 'able',
 'recite',
 'memory',
 'college']

In [71]:
# check that a given word made it into the Word2Vec vocabulary
print('memory' in w2vwords)

True

In [72]:
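# computing the average word2vec vector for each review
# (np is numpy and tqdm is the progress-bar helper; both are assumed to be imported earlier)
w2v_vocab=set(w2vwords)  # set membership is O(1); testing against the list scans it for every word
sent_vectors=[]
for sent in tqdm(list_of_sent):
    sent_vec=np.zeros(50)  # 50 = vector size used when training W2v_model
    count_words=0
    for word in sent:
        if word in w2v_vocab:
            vec=W2v_model.wv[word]
            sent_vec+=vec
            count_words+=1
    # reviews with no in-vocabulary words keep the all-zero vector
    if count_words!=0:
        sent_vec/=count_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))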

100%|█████████████████████████████████████████████████████████████████████████| 364171/364171 [39:57<00:00, 151.89it/s]

364171
50
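
The averaging step can be checked by hand on a toy example. The sketch below uses hypothetical 2-dimensional vectors in a plain dict instead of the trained model; `average_word_vector` is an illustrative helper, not a function from the notebook.

```python
import numpy as np

def average_word_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have an embedding; zeros if none do."""
    sent_vec = np.zeros(dim)
    n_known = 0
    for word in tokens:
        if word in word_vectors:
            sent_vec += word_vectors[word]
            n_known += 1
    return sent_vec / n_known if n_known else sent_vec

toy = {"good": np.array([1.0, 0.0]), "coffee": np.array([0.0, 1.0])}

# "unknownword" has no vector, so only two words contribute to the average.
avg = average_word_vector(["good", "coffee", "unknownword"], toy, dim=2)
empty = average_word_vector(["unknownword"], toy, dim=2)
print(avg, empty)
```

Out-of-vocabulary words are skipped rather than counted as zeros, so they do not pull the average towards the origin.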





# *Tf-idf Weighted W2V*

In [79]:
tf_idf_model=TfidfVectorizer()
tf_idf_model.fit(final_data['Cleaned_text'].values)
# creating a dictionary with a word as the key and its idf as the value
dictionary=dict(zip(tf_idf_model.get_feature_names(),list(tf_idf_model.idf_)))


In [110]:
tfidf_feat=tf_idf_model.get_feature_names()
# dictionary[word]: idf value of the word over the whole corpus
# sent.count(word)/len(sent): tf value of the word in this review
# Computing tf*idf this way avoids indexing into the sparse tf-idf matrix for every word.
w2v_vocab=set(w2vwords)  # set lookup instead of scanning the list for every word

tf_idf_sent_vectors=[]
for sent in tqdm(list_of_sent):
    sent_vec=np.zeros(50)
    weight_sum=0
    for word in sent:
        # the keys of dictionary are exactly the tf-idf feature names in tfidf_feat
        if word in w2v_vocab and word in dictionary:
            vec=W2v_model.wv[word]
            tf_idf=dictionary[word]*(sent.count(word)/len(sent))
            sent_vec+=(vec*tf_idf)
            weight_sum+=tf_idf
    # reviews with no usable words keep the all-zero vector
    if weight_sum!=0:
        sent_vec/=weight_sum
    tf_idf_sent_vectors.append(sent_vec)

100%|███████████████████████████████████████████████████████████████████████| 364171/364171 [16:27:21<00:00,  6.15it/s]


In [111]:
print(len(tf_idf_sent_vectors))
print(len(tf_idf_sent_vectors[0]))

364171
50
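
A toy example makes the weighting easier to verify by hand. The idf values and 2-dimensional vectors below are hypothetical, and `tfidf_weighted_vector` is an illustrative helper that mirrors the loop above (one contribution per token occurrence), not code from the notebook.

```python
import numpy as np
from collections import Counter

def tfidf_weighted_vector(tokens, word_vectors, idf, dim):
    """Weighted average of word vectors, with weight = idf * term frequency."""
    sent_vec = np.zeros(dim)
    weight_sum = 0.0
    counts = Counter(tokens)  # term counts, computed once per review
    for word in tokens:
        if word in word_vectors and word in idf:
            weight = idf[word] * counts[word] / len(tokens)
            sent_vec += weight * word_vectors[word]
            weight_sum += weight
    return sent_vec / weight_sum if weight_sum else sent_vec

toy_vectors = {"good": np.array([1.0, 0.0]), "coffee": np.array([0.0, 1.0])}
toy_idf = {"good": 1.0, "coffee": 3.0}  # the rarer word gets the larger idf

# Weights: good = 1.0 * 1/2 = 0.5, coffee = 3.0 * 1/2 = 1.5, sum = 2.0
weighted = tfidf_weighted_vector(["good", "coffee"], toy_vectors, toy_idf, dim=2)
print(weighted)  # [0.25 0.75]
```

Compared with the plain average, which would give `[0.5, 0.5]` here, the result leans towards the rarer and therefore more informative word.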



## 