# Amazon Fine Food Reviews Analysis

Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews<br>

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454<br>
Number of Users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of **Attributes/Columns** : 10<br>

### Attributes Information:


1. Id
2. ProductId : Unique identifier for the product
3. UserId : Unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator : Number of users who found the review helpful
6. HelpfulnessDenominator : Number of users who indicated whether they found the review helpful or not
7. Score : Rating between 1 and 5
8. Time : Timestamp of the review
9. Summary : Brief summary of the review
10. Text : Text of the review

### Objectives:
Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).

[Q] How do we determine whether a review is positive or negative?<br>
[Ans]: We can use the score (rating). A rating of 4 or 5 is treated as positive and a rating of 1 or 2 as negative. A review with a score of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity of a review.







# Loading the dataset

The dataset is available in two forms:
  1. .csv file
  2. SQLite database

We use the SQLite database because it makes the data easier to query and visualize efficiently.

Here we only want the overall sentiment of a review (positive/negative), so we ignore reviews with a score (rating) of 3.
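As a minimal sketch of this filtering idea, the snippet below builds a toy in-memory SQLite database and drops the neutral rows. Only the `Reviews` table name and the `Score` column come from the dataset; the rows and the `con_demo` name are made up for illustration, and only the standard library is used.

```python
import sqlite3

# Toy in-memory database standing in for database.sqlite (hypothetical rows).
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con_demo.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "great"), (2, 3, "okay"), (3, 1, "bad"), (4, 4, "good")],
)

# Keep only the non-neutral reviews (Score different from 3).
rows = con_demo.execute(
    "SELECT Id, Score FROM Reviews WHERE Score != 3 ORDER BY Id"
).fetchall()
print(rows)
con_demo.close()
```

The same `WHERE Score != 3` condition can be passed to `pd.read_sql_query` when the result is wanted as a DataFrame.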

### Importing the required libraries

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

MessageError: ignored

In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer


from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os



In [None]:
# con = sqlite3.connect('database.sqlite')
# using sqlite3 to read data.
con = sqlite3.connect('/content/gdrive/MyDrive/databases/database.sqlite')

#[1.] Reading data

In [None]:
filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 10000""", con)
print("number of data points", filtered_data.shape)

number of data points (10000, 10)


In [None]:
filtered_data.head(2)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...


In [None]:
# Reviews with Score == 3 are neutral and were already excluded by the query above.
# Map the remaining scores to a binary label: Score < 3 -> negative (0), Score > 3 -> positive (1).

def partition(x):
  if x<3:
    return 0
  return 1


# replacing each score with its label: 1 for positive (score > 3), 0 for negative (score < 3)

actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
# updating to dataset
filtered_data['Score'] = positiveNegative
print("number of data points in our data", filtered_data.shape)
filtered_data.head(3)

number of data points in our data (10000, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [None]:
# grouping the data by userId to check if there is more than one record for one userId
display = pd.read_sql_query(
    """SELECT UserId, ProductId, profileName, Time, Score, Text, COUNT(*)
    FROM Reviews GROUP BY UserId
    HAVING COUNT(*)>1
    """,  con)


print(display.shape) # number of users with more than one review
display.head()


(80668, 7)


Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
0,#oc-R115TNMSPFT9I7,B007Y59HVM,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ET0,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B007Y59HVM,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ET0,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBE1U,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [None]:
display[display['UserId']=='AZY10LLTJ71NX']

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
80638,AZY10LLTJ71NX,B006P7E5ZI,"undertheshrine ""undertheshrine""",1334707200,5,I was recommended to try green tea extract to ...,5


In [None]:
# total number of reviews written by users with more than one review
display['COUNT(*)'].sum()

393063

#Exploratory Data Analysis
#[2.] Data Cleaning: Deduplication

The reviews data contains many duplicate entries (as shown in the table below), so the duplicates have to be removed to get unbiased results from the analysis. Following is an example:

In [None]:
# let's filter the data to check for duplicates
# Assumption: if the timestamp, summary and text are the same for the same UserId, the rows are likely duplicates
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId='AR5J8UI46CURR'
ORDER BY ProductID
""", con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As can be seen above, HelpfulnessNumerator, HelpfulnessDenominator, Time, Summary and Text are all identical for the same user. On further analysis it was found that

ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8) <br>

ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8)

It was inferred that reviews identical in every field except ProductId belong to the same product in a different flavour or quantity. Hence, to reduce redundancy, it was decided to eliminate all but one of these rows.

METHOD: First we sort the data by ProductId, then keep the first row of each group of identical reviews and delete the others. For example, in the group above only the review for B000HDL1RQ is retained. Sorting first ensures that there is exactly one, consistently chosen representative for each group; dropping duplicates without sorting could leave a different representative in place depending on the row order.
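The sort-then-keep-first idea can be sketched in plain Python as follows. This is an illustration only: the rows, the `deduplicate` helper and its `key_fields` default are hypothetical, and the notebook itself does the same job with pandas `sort_values` followed by `drop_duplicates(keep='first')`.

```python
# Hypothetical rows: the same review text posted under three product IDs.
rows = [
    {"ProductId": "B000HDOPZG", "UserId": "U1", "Time": 1199577600, "Text": "DELICIOUS WAFERS"},
    {"ProductId": "B000HDL1RQ", "UserId": "U1", "Time": 1199577600, "Text": "DELICIOUS WAFERS"},
    {"ProductId": "B000HDOPYC", "UserId": "U1", "Time": 1199577600, "Text": "DELICIOUS WAFERS"},
    {"ProductId": "B001E4KFG0", "UserId": "U2", "Time": 1303862400, "Text": "Good dog food"},
]

def deduplicate(rows, key_fields=("UserId", "Time", "Text")):
    """Sort by ProductId, then keep the first row seen for each key."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r["ProductId"]):
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

kept = deduplicate(rows)
print([r["ProductId"] for r in kept])
```

Because the rows are sorted before the first occurrence is kept, the surviving representative does not depend on the order in which the rows were loaded.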


In [None]:
# sorting data according to productId in ascending order
sorted_data = filtered_data.sort_values('ProductId', axis = 0, ascending=True, inplace = False, kind = 'quicksort', na_position='last')


In [None]:
final = sorted_data.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep='first', inplace=False)
final.shape

(9564, 10)

In [None]:
# checking what percentage of the data is retained
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

95.64

In [None]:
# We know that HelpfulnessNumerator cannot be greater than HelpfulnessDenominator
# let's check
display = pd.read_sql_query("""
select * from reviews r
where r.HelpfulnessNumerator>r.HelpfulnessDenominator limit 500
""", con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
1,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...


##Observations :
In a few rows HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not possible in practice, so these rows are removed as well.

In [None]:
final = final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

In [None]:
# before starting the next phase let's check the number of entries left
print(final.shape)
# checking the number of negative and positive count
final['Score'].value_counts()

(9564, 10)


1    7976
0    1588
Name: Score, dtype: int64

In [None]:
# we have 7976 positive and 1588 negative reviews hence dataset is imbalanced
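
To put a number on the imbalance, the class shares can be worked out from the counts printed above. This is plain arithmetic on those two counts; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
counts = {1: 7976, 0: 1588}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
ratio = counts[1] / counts[0]  # positive reviews per negative review

print(f"total reviews: {total}")
for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
print(f"positive-to-negative ratio: {ratio:.2f}")
```

On this sample, a classifier that always predicts the positive class would already be right for roughly five out of six reviews, which is worth keeping in mind when accuracy is used as a metric.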

#[3.] Text Preprocessing.

Now that we have finished deduplication, our data requires some preprocessing before we go further with the analysis and build a prediction model.

Hence, in the preprocessing phase we do the following, in the order below:

  1. Begin by removing HTML tags
  2. Remove punctuation and a limited set of special characters such as %, $ and .
  3. Check that each word consists of English letters only and is not alphanumeric
  4. Convert the text to lowercase
  5. Remove stop words
  6. Finally, stem each word with the Snowball or Porter stemmer, whichever works better

After that, we will collect the words used to describe positive and negative reviews.
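
A compact sketch of such a cleaning function, using only the standard library, might look like this. The `preprocess` name and the tiny `STOP_WORDS` set are assumptions made for illustration; tags are stripped with a crude regular expression rather than a real HTML parser, and stemming is left out because it needs NLTK.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"the", "is", "a", "and", "it", "this"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags (crude regex, not a parser)
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # drop punctuation, digits and other symbols
    text = text.lower()                       # lowercase everything
    words = [w for w in text.split() if w not in STOP_WORDS]  # remove stop words
    return " ".join(words)

print(preprocess("This is <br />a GREAT snack, and it costs $5!"))
```

The cells that follow apply the same kind of steps one at a time to real reviews, using BeautifulSoup for the HTML part.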



##[3.1] Reviews['Text'] preprocessing :

In [None]:
# printing some sample reviews
sent_10 = final['Text'].values[10]
print(sent_10)
print("="*50)

sent_1000 = final['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1500 = final['Text'].values[1500]
print(sent_1500)
print("="*50)

sent_5000 = final['Text'].values[5000]
print(sent_5000)
print("="*50)


This is my cat's third favorite food.  It's great stuff - the gravy is so very thick and the food looks like bits of steak slicked off for your friend.  My cat licks the bowl clean every time and wants more.  They should send me discounts for as much of this stuff I buy.  I buy it by the case if that gives you some idea!
15 month old loves to eat them on the go! They seem great for a healthy, quick, and easy snack!
These chips are truly amazing. They have it all. They're light, crisp, great tasting, nice texture, AND they're all natural... AND low in fat and sodium! Need I say more? I recently bought a bag of them at a regular grocery store, and couldn't belive my taste buds. That's why I excited why I saw them here on Amazon, and decided to buy a case!
It's amazing how little popularity that Diamond Pet Foods seems to have considering the quality ingredients their foods contain!  I worked at a pet store for about four years. Our number one recommendations on foods were Natural Balance

In [None]:
# remove URLs from the text
sent_1500 = re.sub(r"http\S+","",sent_1500) # \S+ => one or more non-whitespace characters
sent_5000 = re.sub(r"http\S+","",sent_5000)
print(sent_5000)

It's amazing how little popularity that Diamond Pet Foods seems to have considering the quality ingredients their foods contain!  I worked at a pet store for about four years. Our number one recommendations on foods were Natural Balance and Wellness- two EXCEPTIONAL foods that are pretty expensive.  But for those owners who have multiple dogs or even one large breed dog, those brands can cripple your budget.  So when I would get customers who absolutely LOVED their pets and wanted to be able to feed them healthy food without breaking their banks, I recommended Diamond.  Diamond dog food costs about the same as those commercial brands like Purina and Iams, but contains ONLY natural, healthy ingredients.  Just remember, the less garbage you feed your dog (by-products, fillers) the less your dog uses the bathroom- AND they stay fuller, longer!


In [None]:
# remove all HTML tags from the text
# BeautifulSoup is a library commonly used for web scraping
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_1000,'lxml')  # parser: 'lxml' or 'html.parser'
text = soup.get_text()
print(text)

15 month old loves to eat them on the go! They seem great for a healthy, quick, and easy snack!


In [None]:
# decontraction of some words
def decontracted(phrase):
  # specific
  phrase = re.sub(r"won't", "will not", phrase)
  phrase = re.sub(r"can't", "can not",phrase)

  # general
  phrase = re.sub(r"n\'t"," not",phrase)
  phrase = re.sub(r"\'re"," are",phrase)
  phrase = re.sub(r"\'s"," is",phrase)
  phrase = re.sub(r"\'d"," would",phrase)
  phrase = re.sub(r"\'ll"," will",phrase)
  phrase = re.sub(r"\'t"," not",phrase)
  phrase = re.sub(r"\'ve"," have",phrase)
  phrase = re.sub(r"\'m"," am",phrase)
  return phrase

In [None]:
print(decontracted(sent_5000))

It is amazing how little popularity that Diamond Pet Foods seems to have considering the quality ingredients their foods contain!  I worked at a pet store for about four years. Our number one recommendations on foods were Natural Balance and Wellness- two EXCEPTIONAL foods that are pretty expensive.  But for those owners who have multiple dogs or even one large breed dog, those brands can cripple your budget.  So when I would get customers who absolutely LOVED their pets and wanted to be able to feed them healthy food without breaking their banks, I recommended Diamond.  Diamond dog food costs about the same as those commercial brands like Purina and Iams, but contains ONLY natural, healthy ingredients.  Just remember, the less garbage you feed your dog (by-products, fillers) the less your dog uses the bathroom- AND they stay fuller, longer!


In [None]:
# removing words that contain digits
sent_1500 = re.sub(r"\S*\d\S*", "", sent_1500).strip()
print(sent_1500)

These chips are truly amazing. They have it all. They're light, crisp, great tasting, nice texture, AND they're all natural... AND low in fat and sodium! Need I say more? I recently bought a bag of them at a regular grocery store, and couldn't belive my taste buds. That's why I excited why I saw them here on Amazon, and decided to buy a case!


In [None]:
# keeping only letters and digits; every run of other characters is replaced with a space
sent_1500 = re.sub('[^A-Za-z0-9]+', ' ',sent_1500)
print(sent_1500)

These chips are truly amazing They have it all They re light crisp great tasting nice texture AND they re all natural AND low in fat and sodium Need I say more I recently bought a bag of them at a regular grocery store and couldn t belive my taste buds That s why I excited why I saw them here on Amazon and decided to buy a case 


In [None]:
# In review analysis, words like 'no', 'nor' and 'not' are very important, as their presence can change the sentiment of a sentence,
# so they are deliberately left out of the stop-word list below
# <br /><br /> ==> after the above steps, we are left with "br br",
# so 'br' is included in the stop-word list
# if we had <br/> instead of <br />, these tags would have been removed in the first step

stopwords= set(['br', 'the', 'i', '_i_', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [None]:
# combining all the above steps to preprocess the reviews

from tqdm import tqdm
# tqdm displays a progress bar with real-time updates
preprocessed_reviews = []
for sentence in tqdm(final['Text'].values):
  sentence = re.sub(r"http\S+","", sentence) # remove URLs
  sentence = BeautifulSoup(sentence,'lxml').get_text() # strip HTML tags
  sentence = decontracted(sentence) # expand contractions
  sentence = re.sub(r"\S*\d\S*","", sentence).strip() # drop words containing digits
  sentence = sentence.replace("_","") # removing underscore from sentence
  sentence = re.sub('[^A-Za-z0-9]+', ' ', sentence) # replace runs of other characters with a space
  # https://www.geeksforgeeks.org/python-remove-punctuation-from-string/
  sentence = re.sub(r'[^\w\s]',"", sentence)
  sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
  preprocessed_reviews.append(sentence.strip())

100%|██████████| 9564/9564 [00:05<00:00, 1831.50it/s]


In [None]:
preprocessed_reviews[1500]

'chips truly amazing light crisp great tasting nice texture natural low fat sodium need say recently bought bag regular grocery store could not belive taste buds excited saw amazon decided buy case'

##[3.2] Reviews['Summary'] preprocessing :

In [None]:
sum_text = final['Summary'].values
print(sum_text[0])
print("="*50)
print(sum_text[1500])
print("="*50)
print(sum_text[3000])
print("="*50)
print(sum_text[5000])

Flies Begone
Excellent Tortilla chips
organic dog food
Best Dog Food For the Price!


In [None]:
preprocessed_summary = []
for summary in tqdm(final['Summary'].values):
  summary = re.sub(r"http\S+","", summary) # remove URLs
  summary = BeautifulSoup(summary,'lxml').get_text() # strip HTML tags
  summary = decontracted(summary) # expand contractions
  summary = re.sub(r"\S*\d\S*","", summary).strip() # drop words containing digits
  summary = summary.replace("_","") # removing underscore from summary
  summary = re.sub('[^A-Za-z0-9]+', ' ', summary) # replace runs of other characters with a space
  # https://www.geeksforgeeks.org/python-remove-punctuation-from-string/
  summary = re.sub(r'[^\w\s]',"", summary)
  summary = ' '.join(e.lower() for e in summary.split() if e.lower() not in stopwords)
  preprocessed_summary.append(summary.strip())

100%|██████████| 9564/9564 [00:02<00:00, 4154.14it/s]


In [None]:
preprocessed_summary[250]

'mountain blend coffee availability'

#[4.0] Featurization :

##[4.1] Bag Of Words


In [None]:
# BoW for text
count_vector = CountVectorizer() # from sklearn.feature_extraction.text
count_vector.fit(preprocessed_reviews)
print("some feature names : ", count_vector.get_feature_names()[:10])
print("="*60)
final_counts = count_vector.transform(preprocessed_reviews)
print("Type of count_vectors : ", type(final_counts))
print("shape of our text BOW vectorizer : ", final_counts.get_shape())
print("number of unique words : ", final_counts.get_shape()[1])
df = pd.DataFrame(final_counts.toarray(), columns = count_vector.get_feature_names())
df.head()

some feature names :  ['aa', 'aaaa', 'aahhhs', 'ab', 'aback', 'abandon', 'abates', 'abberline', 'abbott', 'abdominal']
Type of count_vectors :  <class 'scipy.sparse.csr.csr_matrix'>
shape of our text BOW vectorizer :  (9564, 24742)
number of unique words :  24742


Unnamed: 0,aa,aaaa,aahhhs,ab,aback,abandon,abates,abberline,abbott,abdominal,...,zoom,zotz,zucchini,zucchinii,zuke,zukes,zukeseriously,zupas,zuppa,ît
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
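
To make the table above concrete, here is a minimal pure-Python sketch of what a bag-of-words matrix contains. The two-document corpus is made up, and the sketch splits on whitespace only, whereas `CountVectorizer` applies its own tokenisation (by default it ignores single-character tokens).

```python
from collections import Counter

docs = ["great chips great price", "not great taste"]  # toy corpus (made up)

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cell = word count.
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts[word] for word in vocab])

print(vocab)
for row in bow:
    print(row)
```

Most cells are zero even in this tiny example, which is why the real matrix is stored in a sparse format.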


##[4.1.2] Bi-Gram and n-Grams.

In [None]:
#bi-gram, tri-gram and n-gram

# stop words like "not" should not be removed before building n-grams
# count_vect = CountVectorizer(ngram_range=(1,2))
# please do read the CountVectorizer documentation http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# the values min_df=10 and max_features=5000 are a choice; you can pick your own
count_vect = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=5000)
final_bigram_counts = count_vect.fit_transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of our text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of our text BOW vectorizer  (9564, 5000)
the number of unique words including both unigrams and bigrams  5000


In [None]:
#bag of words for summary
count_summary = CountVectorizer()
count_summary.fit(preprocessed_summary)
print("some features name: ", count_summary.get_feature_names()[:10])
final_summary_counts = count_summary.transform(preprocessed_summary)
df1 = pd.DataFrame(final_summary_counts.toarray(), columns=count_summary.get_feature_names())
df1.head()

some features name:  ['able', 'absolute', 'absolutel', 'absolutely', 'acaigrape', 'acceptable', 'accidents', 'according', 'accurate', 'acid']


Unnamed: 0,able,absolute,absolutel,absolutely,acaigrape,acceptable,accidents,according,accurate,acid,...,yummies,yumminess,yummmmm,yummy,yummywonderful,zany,zero,zots,zuke,zukes
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# bi-gram and n-gram of summary
# stop words like "not" should not be removed before building n-grams
# count_vect = CountVectorizer(ngram_range=(1,2))
# please do read the CountVectorizer documentation http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# the values min_df=10 and max_features=5000 are a choice; you can pick your own
count_vect_summary = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=5000)
final_bigram_counts_summary = count_vect_summary.fit_transform(preprocessed_summary)
print("the type of count vectorizer ",type(final_bigram_counts_summary))
print("the shape of our text BOW vectorizer ",final_bigram_counts_summary.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts_summary.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of our text BOW vectorizer  (9564, 413)
the number of unique words including both unigrams and bigrams  413
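
As a small illustration of what `ngram_range=(1,2)` adds to the vocabulary, the sketch below lists the unigrams and bigrams of one short token sequence. The `ngrams` helper and the example sentence are hypothetical and are not part of scikit-learn.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "could not believe taste".split()  # toy example
features = ngrams(tokens, 1) + ngrams(tokens, 2)  # same idea as ngram_range=(1, 2)
print(features)
```

The bigram "could not" keeps the negation attached to its verb, which is the reason "not" is kept out of the stop-word list.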


##[4.2.0] Tf-idf

In [None]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_reviews)
print("Some sample features(unique words in the corpus): ", tf_idf_vect.get_feature_names()[:10])
print("="*50)

final_tfidf = tf_idf_vect.transform(preprocessed_reviews)
print("the type of tf-idf matrix", type(final_tfidf))
print("the shape of tf-idf matrix ", final_tfidf.get_shape())
print("number of unique unigrams and bi-grams", final_tfidf.get_shape()[1])


Some sample features(unique words in the corpus):  ['ability', 'able', 'able buy', 'able eat', 'able find', 'able get', 'able order', 'able use', 'absolute', 'absolute best']
the type of tf-idf matrix <class 'scipy.sparse.csr.csr_matrix'>
the shape of tf-idf matrix  (9564, 5517)
number of unique unigrams and bi-grams 5517


In [None]:
# viewing final_tfidf as a DataFrame
data_reviews = pd.DataFrame(final_tfidf.toarray(), columns = tf_idf_vect.get_feature_names())
data_reviews.head()

Unnamed: 0,ability,able,able buy,able eat,able find,able get,able order,able use,absolute,absolute best,...,yr old,yrs,yuban,yuck,yum,yummy,zero,zip,zip lock,ziplock
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
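
The weights in this table can be reproduced by hand on a toy corpus. The sketch below assumes scikit-learn's default settings, where the smoothed inverse document frequency is idf(t) = ln((1 + n) / (1 + df(t))) + 1, the term frequency is the raw count, and each row is L2-normalised. The corpus and the helper names are made up for illustration.

```python
import math
from collections import Counter

docs = [d.split() for d in ["great chips", "great taste", "bad taste"]]  # toy corpus
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed idf, the form scikit-learn documents for smooth_idf=True.
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(doc):
    tf = Counter(doc)
    weights = {w: tf[w] * idf(w) for w in tf}
    norm = math.sqrt(sum(v * v for v in weights.values()))  # L2 normalisation
    return {w: v / norm for w, v in weights.items()}

vec = tfidf(docs[0])
print(vec)
```

In the first document "chips" gets a larger weight than "great" because it occurs in fewer documents.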


In [None]:
# training our own word2vec model on our own corpus
list_of_sentences = []

for sentence in preprocessed_reviews:
  list_of_sentences.append(sentence.split()) # builds a list of token lists, one per review

In [None]:
wrd2vec = Word2Vec(list_of_sentences, min_count = 5, size=100,workers=4)
print(wrd2vec.wv.most_similar("great"))
print("="*50)
print(wrd2vec.wv.most_similar("worst"))

[('excellent', 0.9006282687187195), ('good', 0.886471152305603), ('alternative', 0.8453378677368164), ('easy', 0.8432695269584656), ('wonderful', 0.8426951169967651), ('especially', 0.8396201729774475), ('still', 0.8395291566848755), ('regular', 0.8337851762771606), ('works', 0.8318023681640625), ('way', 0.8312546014785767)]
[('absolute', 0.9930570125579834), ('jamaica', 0.9924485683441162), ('blends', 0.992203414440155), ('stephen', 0.9912437200546265), ('varieties', 0.9899473786354065), ('avid', 0.9897904396057129), ('italian', 0.9896245002746582), ('hawaiian', 0.9896040558815002), ('rodeo', 0.9893539547920227), ('compleats', 0.9889849424362183)]


In [None]:
w2v_list = list(wrd2vec.wv.vocab)
print("number of words that occurred at least 5 times: ", len(w2v_list))
print("="*50)
print("sample words ",w2v_list[:50])

number of words that occurred at least 5 times:  5600
sample words  ['used', 'fly', 'bait', 'seasons', 'ca', 'not', 'beat', 'great', 'product', 'available', 'traps', 'course', 'total', 'pretty', 'stinky', 'right', 'nearby', 'received', 'shipment', 'could', 'hardly', 'wait', 'try', 'love', 'call', 'instead', 'stickers', 'removed', 'easily', 'daughter', 'designed', 'printed', 'use', 'car', 'windows', 'beautifully', 'print', 'shop', 'program', 'going', 'lot', 'fun', 'everywhere', 'like', 'tv', 'computer', 'really', 'good', 'idea', 'final']


##[4.2.1] Converting text into vectors using Avg W2V, TFIDF-W2V

In [None]:
# average word2vec
# compute the average word2vec vector for each review

w2v_words = set(w2v_list) # set lookup is O(1); searching the list scans the whole vocabulary
sent_vectors = []
for sent in tqdm(list_of_sentences):
  sent_vec = np.zeros(100) # as word vectors are of length 100
  cnt_word = 0
  for word in sent:
    if word in w2v_words:
      vec = wrd2vec.wv[word]
      sent_vec+=vec
      cnt_word+=1
  if cnt_word !=0:
    sent_vec/=cnt_word
  sent_vectors.append(sent_vec)
print('\n', len(sent_vectors))
print(len(sent_vectors[0]))

100%|██████████| 9564/9564 [00:16<00:00, 586.05it/s]


 9564
100
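
The averaging itself is easy to check by hand on made-up numbers. In the sketch below the three-dimensional `toy_vectors` and the `average_vector` helper are hypothetical and do not come from the trained model.

```python
# Toy 3-dimensional "embeddings" (made-up numbers, not from a trained model).
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "chips": [3.0, 2.0, 0.0],
}

def average_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    known = 0
    for tok in tokens:
        if tok in vectors:
            total = [t + v for t, v in zip(total, vectors[tok])]
            known += 1
    return [t / known for t in total] if known else total

print(average_vector(["great", "chips", "zzz"], toy_vectors))
```

A review with no known words keeps the all-zero vector, which matches the `cnt_word != 0` guard in the loop above.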





##[4.2.2] TFIDF weighted W2v

In [None]:
model = TfidfVectorizer()
model.fit(preprocessed_reviews)
# building a dictionary with each word as key and its idf as value
dictionary = dict(zip(model.get_feature_names(), list(model.idf_)))

In [None]:
# TF-IDF Weighted word2vec
tfidf_feat = model.get_feature_names() # storing the features
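
One common way to finish this step is to take a weighted average of the word vectors, with each word weighted by its term frequency times its idf. The sketch below shows that formulation on made-up numbers; it is an assumption about the general technique, not necessarily the loop this notebook goes on to use, and `toy_vectors`, `toy_idf` and `tfidf_weighted_vector` are hypothetical names.

```python
toy_vectors = {"great": [1.0, 0.0], "chips": [0.0, 1.0]}  # made-up embeddings
toy_idf = {"great": 1.0, "chips": 3.0}                    # made-up idf values

def tfidf_weighted_vector(tokens, vectors, idf, dim=2):
    """Weighted average of word vectors, weight = tf * idf."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in set(tokens):
        if tok in vectors and tok in idf:
            tf = tokens.count(tok) / len(tokens)  # relative frequency in this review
            weight = tf * idf[tok]
            total = [t + weight * v for t, v in zip(total, vectors[tok])]
            weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total

print(tfidf_weighted_vector(["great", "chips"], toy_vectors, toy_idf))
```

Compared with the plain average, rarer words such as "chips" here pull the review vector further towards their own direction.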
