#  Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.




## Loading the data

The dataset is available in two forms:
1. .csv file
2. SQLite Database



Here, as we only want the overall sentiment of each review (positive or negative), we deliberately ignore all reviews with a Score equal to 3. If the score is above 3, the review is labelled "positive". Otherwise, it is labelled "negative".
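The SQLite form allows the neutral reviews to be filtered out at query time instead of after loading. The following is a minimal sketch using only the standard library; the table name `Reviews` and the in-memory demo rows are assumptions made for illustration and are not read from the actual database file.

```python
import sqlite3

def load_polar_reviews(con):
    """Return (Score, Text) rows, skipping neutral reviews (Score == 3)."""
    # Assumes a table named "Reviews" with "Score" and "Text" columns.
    cur = con.execute(
        "SELECT Score, Text FROM Reviews WHERE Score != 3 ORDER BY rowid"
    )
    return cur.fetchall()

# Tiny in-memory stand-in for the real database file.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Score INTEGER, Text TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?)",
    [(5, "great taffy"), (3, "it was okay"), (1, "not as advertised")],
)
rows = load_polar_reviews(con)
labels = ["positive" if score > 3 else "negative" for score, _ in rows]
print(labels)
```

With the real file, the connection would instead be opened on the downloaded database path.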

In [1]:
%matplotlib inline
import re
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv(r'M:\IIST\SEM 2\Data Modelling Lab 2\NLP\NLP MINI Project\Reviews.csv')
df.head(2)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...


In [4]:
df.shape

(568454, 10)

## Give reviews with Score>3 a positive rating, and reviews with a score<3 a negative rating.

In [5]:
df = df[df['Score'] !=3]
df['Score'].unique()

array([5, 1, 4, 2], dtype=int64)

In [6]:
# Give reviews with Score>3 a positive rating, 
# and reviews with a score<3 a negative rating.
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'

In [10]:
actualScore = df['Score']
actualScore.head(5)

0    5
1    1
2    4
3    2
4    5
Name: Score, dtype: int64

In [11]:
positiveNegative = actualScore.map(partition)
positiveNegative.head(5)

0    positive
1    negative
2    positive
3    negative
4    positive
Name: Score, dtype: object

In [24]:
df["Score"]

0         5
1         1
2         4
3         2
4         5
         ..
568449    5
568450    2
568451    5
568452    5
568453    5
Name: Score, Length: 525814, dtype: int64

In [25]:
df['Score'] = positiveNegative

In [26]:
df["Score"]

0         positive
1         negative
2         positive
3         negative
4         positive
            ...   
568449    positive
568450    negative
568451    positive
568452    positive
568453    positive
Name: Score, Length: 525814, dtype: object

In [27]:
df.shape

(525814, 10)

In [28]:
#looking at the number of attributes and size of the data
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...


#  Exploratory Data Analysis

##  Cleaning: Deduplication

The reviews data turned out to contain many duplicate entries (as shown in the tables below). These duplicates had to be removed in order to get unbiased results from the analysis. The following is an example:

In [29]:
df['UserId'].shape

(525814,)

In [30]:
df['UserId'].unique().shape

(243414,)

In [31]:
df['UserId'].value_counts()

A3OXHLG6DIBRW8    425
A1YUL9PCJR3JTY    391
AY12DBB0U420B     365
A281NPSIMI1C2R    346
A1Z54EM24Y40LL    230
                 ... 
A5S8J4ZXZEDZS       1
A1VX4T4MH72DSG      1
A2YIEYNBHPVCN0      1
A2IF9L6X12C0XM      1
A1HWOFIHYMMH76      1
Name: UserId, Length: 243414, dtype: int64

In [32]:
df["ProductId"].value_counts()

B007JFMH8M    857
B002QWHJOU    611
B002QWP89S    611
B0026RQTGE    611
B002QWP8H0    611
             ... 
B000OWDOHK      1
B002ZH8ZEY      1
B0056CS3B0      1
B0047T9IHG      1
B006TACYS6      1
Name: ProductId, Length: 72005, dtype: int64

In [33]:
df[["UserId","ProfileName","Time","Text"]][df['UserId']=='A3OXHLG6DIBRW8']

Unnamed: 0,UserId,ProfileName,Time,Text
369,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1282176000,"Green Mountain ""Nantucket Blend"" K-Cups make a..."
813,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1272240000,"Trident ""Strawberry Twist"" sugarless gum is ve..."
3306,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
3416,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
3926,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1237161600,"Hershey ""Sugar Free Caramel Filled Chocolates""..."
...,...,...,...,...
561988,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1346112000,This Starbucks French Roast K-Cup is indeed a ...
562279,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1329609600,The Blue Diamond Jalapeno Smokehouse Almonds a...
563973,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1279411200,Kirkland Jelly Beans are a great value and all...
567686,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...


In [34]:
# combine both conditions into one boolean mask instead of chaining two filters
mask = (df['UserId']=='A3OXHLG6DIBRW8') & (df['Time']==1321401600)
a = df.loc[mask, ["ProductId","UserId","ProfileName","Time","Text"]]
a


Unnamed: 0,ProductId,UserId,ProfileName,Time,Text
3306,B005K4Q1VI,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
3416,B005K4Q1VI,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
242938,B005K4Q4KG,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
243048,B005K4Q4KG,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
344998,B0076MLL12,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
345108,B0076MLL12,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
430280,B005K4Q1T0,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
430390,B005K4Q1T0,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
567686,B005K4Q68Q,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
567796,B005K4Q68Q,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...


In [35]:
a.shape

(10, 5)

In [36]:
b = a.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
b

Unnamed: 0,ProductId,UserId,ProfileName,Time,Text
430280,B005K4Q1T0,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
430390,B005K4Q1T0,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
3306,B005K4Q1VI,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
3416,B005K4Q1VI,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
242938,B005K4Q4KG,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
243048,B005K4Q4KG,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
567686,B005K4Q68Q,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
567796,B005K4Q68Q,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
344998,B0076MLL12,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
345108,B0076MLL12,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...


In [37]:
final=b.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final

Unnamed: 0,ProductId,UserId,ProfileName,Time,Text
430280,B005K4Q1T0,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...
430390,B005K4Q1T0,A3OXHLG6DIBRW8,"C. F. Hill ""CFH""",1321401600,These Grove Square Hot Cocoa flavors are by fa...


In [38]:
text = final["Text"]
if text.iloc[0] == text.iloc[1]:
    print("equal")
else:
    print("not equal")
    

not equal


In [39]:
df['Time'].value_counts()

1350345600    1060
1322179200    1024
1322438400    1001
1346889600     949
1344211200     919
              ... 
1066348800       1
1102032000       1
1091664000       1
1096761600       1
1077494400       1
Name: Time, Length: 3157, dtype: int64

As can be seen above, the same user has multiple reviews with identical values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. Further analysis showed that, for example: <br>
<br> 
ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)<br>
<br> 
ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on<br>

It was inferred after analysis that reviews with same parameters other than ProductId belonged to the same product just having different flavour or quantity. Hence in order to reduce redundancy it was decided to eliminate the rows having same parameters.<br>

The method used was to first sort the data by ProductId, then keep only the first review in each group of identical reviews and delete the others. For example, in the case above only the review for ProductId=B000HDL1RQ remains. Sorting first ensures that the same representative is always kept for a group of duplicates; deduplicating without sorting could leave different representatives for the same product.
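The sort-then-keep-first rule can be illustrated without pandas. The sketch below is a toy with made-up rows (all field values are hypothetical); for each (UserId, ProfileName, Time, Text) key it keeps the row with the smallest ProductId.

```python
def deduplicate(rows):
    """Keep one row per (UserId, ProfileName, Time, Text), lowest ProductId first."""
    seen = set()
    kept = []
    # Sorting first makes the surviving representative deterministic.
    for row in sorted(rows, key=lambda r: r["ProductId"]):
        key = (row["UserId"], row["ProfileName"], row["Time"], row["Text"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

rows = [
    {"ProductId": "P2", "UserId": "U1", "ProfileName": "a", "Time": 10, "Text": "same text"},
    {"ProductId": "P1", "UserId": "U1", "ProfileName": "a", "Time": 10, "Text": "same text"},
    {"ProductId": "P3", "UserId": "U2", "ProfileName": "b", "Time": 11, "Text": "other text"},
]
kept = deduplicate(rows)
print([r["ProductId"] for r in kept])
```

Whatever order the input arrives in, the duplicate pair is always represented by `P1`.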

In [42]:
# Original Data set
df.shape

(525814, 10)

In [40]:
#Sorting data according to ProductId in ascending order
sorted_data=df.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [41]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape

(364173, 10)

In [43]:
364173/525814

0.6925890143662968

In [44]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(df['Id'].size*1.0)*100

69.25890143662969

<b>Observation:</b> In the two rows shown below, the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not possible in practice, so these two rows are also removed from the calculations.

In [45]:
df[df['HelpfulnessNumerator']> df['HelpfulnessDenominator'] ]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
44736,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,positive,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
64421,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,positive,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...


In [46]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]


In [47]:
#Before starting the next phase of preprocessing, let's see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(364171, 10)


positive    307061
negative     57110
Name: Score, dtype: int64
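The class counts above show a strong imbalance, which matters when accuracy is reported for a classifier. As a quick arithmetic check (the variable names are illustrative), a model that always predicts "positive" would already reach roughly 84% accuracy on this data:

```python
positive, negative = 307061, 57110
total = positive + negative            # 364171 reviews after cleaning
baseline_accuracy = positive / total   # always predict the majority class
print(f"majority-class accuracy: {baseline_accuracy:.3f}")
print(f"positive:negative ratio: {positive / negative:.2f}")
```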

##   Text Preprocessing: Stemming, stop-word removal and Lemmatization.

Now that deduplication is finished, the data requires some preprocessing before we go on to the analysis and build the prediction model.

In the preprocessing phase we do the following, in this order:

1. Remove the HTML tags
2. Remove punctuation and a limited set of special characters such as , or . or # etc.
3. Check that the word is made up of English letters only and is not alphanumeric
4. Check that the word is longer than 2 characters (on the assumption that no adjective has only 2 letters)
5. Convert the word to lowercase
6. Remove stop words
7. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming)<br>

After this, we collect the words used to describe positive and negative reviews.
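The seven steps can be sketched as a single function. This is a minimal illustration, not the notebook's implementation: punctuation handling is simplified to "replace with a space", the stop-word set is a tiny hypothetical stand-in for a full list such as NLTK's, and the `stem` argument defaults to the identity so that a real stemmer (for example the `stem` method of NLTK's `SnowballStemmer('english')`) can be passed in.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "and", "this", "was", "are", "for", "but"}

def preprocess(text, stem=lambda word: word):
    """Apply steps 1-7 to one review and return the list of kept tokens."""
    text = re.sub(r"<.*?>", " ", text)            # 1. strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. punctuation -> space
    tokens = []
    for word in text.split():
        if not word.isalpha():                    # 3. letters only
            continue
        if len(word) <= 2:                        # 4. longer than 2 characters
            continue
        word = word.lower()                       # 5. lowercase
        if word in STOP_WORDS:                    # 6. drop stop words
            continue
        tokens.append(stem(word))                 # 7. stem (identity by default)
    return tokens

tokens = preprocess("This taffy was <b>GREAT</b>, and the 5-star price is ok!")
print(tokens)
```

Keeping the stemmer as a parameter makes it easy to compare Snowball and Porter stemming on the same input.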

In [48]:
final = final.sample(frac=.05)
final.shape

(18209, 10)

In [49]:
#After sampling, let's see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(18209, 10)


positive    15373
negative     2836
Name: Score, dtype: int64

In [50]:
z = final['Text'].values
z[-2]

'This maybe a little pricey, but its some of the best bacon I have bad. Hands down.I just bought 10 lbs more of it.'

In [51]:
z[6]

"I was going to just put this on my cereal in the morning. Now I add it to my protein shake suppliments and I've started cooking with it by using it as a breading for fish, chicken, or pretty much anything. It's fantastic! I'm going to end up ordering more soon."

In [52]:
z.shape

(18209,)

## find sentences containing HTML tags

In [53]:
# find sentences containing HTML tags
x = []
for i, sent in enumerate(final['Text'].values):
    if re.search('<.*?>', sent):
        x.append(i)  # index of a review containing at least one tag

print("Number of  HTML TAGS")
len(x)

Number of  HTML TAGS


4617

In [54]:
# print the first review containing an HTML tag
for i, sent in enumerate(final['Text'].values):
    if re.search('<.*?>', sent):
        print(i)
        print(sent)
        break

0
Kind Clusters makes a great cereal but I enjoy eating it right out of the bag any time of day. Tastes great and bonus - it's good for me. I can't beat the value either. Any way you look at it, the product gets a 5-star review from me. <a href="http://www.amazon.com/gp/product/B005IW4WFO">Kind Healthy Grains Cinnamon Oat Clusters with Flax Seeds,11 ounce bag  (Pack of 3)</a>


##  find sentences containing punctuations

In [55]:
# print the first review containing both punctuation and an HTML tag
for i, sent in enumerate(final['Text'].values):
    # inside [...] the '|' is a literal character, so it is listed only once
    if re.search(r'[?!\'"#|]', sent) and re.search('<.*?>', sent):
        print(i)
        print(sent)
        break

0
Kind Clusters makes a great cereal but I enjoy eating it right out of the bag any time of day. Tastes great and bonus - it's good for me. I can't beat the value either. Any way you look at it, the product gets a 5-star review from me. <a href="http://www.amazon.com/gp/product/B005IW4WFO">Kind Healthy Grains Cinnamon Oat Clusters with Flax Seeds,11 ounce bag  (Pack of 3)</a>


In [56]:
def cleanhtml(sentence):
    """Replace every HTML tag in the sentence with a space."""
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext

In [57]:
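def cleanpunc(sentence):
    """Delete quotes and symbols, and turn separators into spaces."""
    # Inside [...] the '|' is a literal character, not alternation,
    # so it is listed once here to keep the same behaviour.
    cleaned = re.sub(r'[?!\'>:"{}#|]', r'', sentence)
    cleaned = re.sub(r'[.,()/]', r' ', cleaned)
    return cleaned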

In [58]:
# for every Review in Reviews
a = final['Text'].values
b = a[23]
b

"I love these. Lately I have been trying different types of spicy peanuts (Da Bomb, Dave's Burning) and these are just as good as the rest. Not as hot as Da Bomb, and not as peanut buttery like Dave's(least heat of the three). I will definitely be getting more...mainly because the other two aren't on amazons Subscribe and Save.<br /><br />I love the Habanero! (also try 100% Pain Garlic hot sauce---delicious)"

In [59]:
# Clean html format
sent=cleanhtml(b)
sent

"I love these. Lately I have been trying different types of spicy peanuts (Da Bomb, Dave's Burning) and these are just as good as the rest. Not as hot as Da Bomb, and not as peanut buttery like Dave's(least heat of the three). I will definitely be getting more...mainly because the other two aren't on amazons Subscribe and Save.  I love the Habanero! (also try 100% Pain Garlic hot sauce---delicious)"

In [60]:
k = sent.split()
print(k)

['I', 'love', 'these.', 'Lately', 'I', 'have', 'been', 'trying', 'different', 'types', 'of', 'spicy', 'peanuts', '(Da', 'Bomb,', "Dave's", 'Burning)', 'and', 'these', 'are', 'just', 'as', 'good', 'as', 'the', 'rest.', 'Not', 'as', 'hot', 'as', 'Da', 'Bomb,', 'and', 'not', 'as', 'peanut', 'buttery', 'like', "Dave's(least", 'heat', 'of', 'the', 'three).', 'I', 'will', 'definitely', 'be', 'getting', 'more...mainly', 'because', 'the', 'other', 'two', "aren't", 'on', 'amazons', 'Subscribe', 'and', 'Save.', 'I', 'love', 'the', 'Habanero!', '(also', 'try', '100%', 'Pain', 'Garlic', 'hot', 'sauce---delicious)']


In [61]:
l =[]
for word in k:
    
    l.append(cleanpunc(word))
    
print(l)  

['I', 'love', 'these ', 'Lately', 'I', 'have', 'been', 'trying', 'different', 'types', 'of', 'spicy', 'peanuts', ' Da', 'Bomb ', 'Daves', 'Burning ', 'and', 'these', 'are', 'just', 'as', 'good', 'as', 'the', 'rest ', 'Not', 'as', 'hot', 'as', 'Da', 'Bomb ', 'and', 'not', 'as', 'peanut', 'buttery', 'like', 'Daves least', 'heat', 'of', 'the', 'three  ', 'I', 'will', 'definitely', 'be', 'getting', 'more   mainly', 'because', 'the', 'other', 'two', 'arent', 'on', 'amazons', 'Subscribe', 'and', 'Save ', 'I', 'love', 'the', 'Habanero', ' also', 'try', '100%', 'Pain', 'Garlic', 'hot', 'sauce---delicious ']


# Train your own Word2Vec model on your own text corpus

# Gensim

### Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. Gensim is implemented in Python and Cython. It is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages, which target only in-memory processing.
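The streaming design means a corpus does not have to be a list held in memory: `Word2Vec` accepts any iterable that can be restarted and that yields one list of tokens per sentence. Below is a minimal sketch of such an iterable, assuming a hypothetical text file with one pre-cleaned review per line.

```python
import os
import tempfile

class ReviewCorpus:
    """Restartable iterable yielding one list of tokens per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be
        # iterated several times without being held in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip empty lines
                    yield tokens

# Demo with a throwaway file; the path and contents are made up.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("great taffy\n\nnot as advertised\n")
corpus = ReviewCorpus(tmp.name)
first_pass = list(corpus)
second_pass = list(corpus)
os.remove(tmp.name)
print(first_pass)
```

Restartability matters because training makes several passes over the data: one to build the vocabulary and one per training epoch.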

In [62]:
import gensim

In [63]:
from tqdm import tqdm

## Cleaning HTML Tags and punctuations

In [64]:

list_of_sent=[]
for sent in tqdm(final['Text'].values):
    filtered_sentence=[]
    sent=cleanhtml(sent)
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if(cleaned_words.isalpha()):    
                filtered_sentence.append(cleaned_words.lower())
            else:
                continue 
    list_of_sent.append(filtered_sentence)
    

100%|██████████████████████████████████████████████████████████████████████████| 18209/18209 [00:04<00:00, 4380.85it/s]


In [65]:
print(final['Text'].values[0])
print("*****************************************************************")
print(list_of_sent[0])

Kind Clusters makes a great cereal but I enjoy eating it right out of the bag any time of day. Tastes great and bonus - it's good for me. I can't beat the value either. Any way you look at it, the product gets a 5-star review from me. <a href="http://www.amazon.com/gp/product/B005IW4WFO">Kind Healthy Grains Cinnamon Oat Clusters with Flax Seeds,11 ounce bag  (Pack of 3)</a>
*****************************************************************
['kind', 'clusters', 'makes', 'a', 'great', 'cereal', 'but', 'i', 'enjoy', 'eating', 'it', 'right', 'out', 'of', 'the', 'bag', 'any', 'time', 'of', 'day', 'tastes', 'great', 'and', 'bonus', 'its', 'good', 'for', 'me', 'i', 'cant', 'beat', 'the', 'value', 'either', 'any', 'way', 'you', 'look', 'at', 'it', 'the', 'product', 'gets', 'a', 'review', 'from', 'me', 'kind', 'healthy', 'grains', 'cinnamon', 'oat', 'clusters', 'with', 'flax', 'seeds', 'ounce', 'bag', 'pack', 'of']


In [67]:
import gensim
# `size` is the gensim < 4.0 name of this parameter; gensim >= 4.0 calls it `vector_size`
w2v_model=gensim.models.Word2Vec(list_of_sent,min_count=5,size=50, workers=4)

In [69]:
# `wv.vocab` exists in gensim < 4.0; in gensim >= 4.0 use `w2v_model.wv.key_to_index`
words = list(w2v_model.wv.vocab)
print(len(words))

8570


In [68]:
w2v_model.wv['i']

array([-1.0210929 ,  1.6562191 , -0.29065284, -0.30392447, -0.23833333,
       -2.0100305 ,  0.80751044,  1.3289204 , -0.69519705, -0.7417597 ,
       -0.72237635, -0.408462  , -1.6105518 , -0.9014356 ,  0.05991743,
       -1.5222048 ,  1.6339517 , -1.0536928 ,  0.5844174 ,  0.50248456,
        2.8097956 ,  0.47611067, -1.256445  ,  1.347353  ,  0.80845666,
       -0.68396777,  1.5766245 ,  1.9654882 ,  2.3661225 ,  1.2732943 ,
       -0.7295285 ,  1.5499783 ,  1.3415154 , -0.554769  , -3.1751623 ,
        0.20062573, -0.8697609 ,  0.59165484, -2.6876993 ,  1.4245079 ,
       -1.466184  , -0.9764718 ,  4.530648  ,  0.23597795, -2.5227275 ,
        0.8262703 ,  0.37284896,  0.5542649 , -2.52011   ,  0.75568724],
      dtype=float32)

In [70]:
w2v_model.wv.most_similar('i')

[('we', 0.6611853241920471),
 ('myself', 0.49198752641677856),
 ('opportunity', 0.4617806077003479),
 ('id', 0.43928658962249756),
 ('wed', 0.4285999536514282),
 ('glad', 0.39421820640563965),
 ('straws', 0.38483697175979614),
 ('recipient', 0.38466525077819824),
 ('initially', 0.3799441158771515),
 ('unlikely', 0.37972235679626465)]

In [71]:
w2v_model.wv.most_similar('like')

[('mean', 0.6177371144294739),
 ('prefer', 0.5929393768310547),
 ('think', 0.5828632116317749),
 ('horrible', 0.5749399662017822),
 ('weird', 0.5503914952278137),
 ('skimp', 0.5472136735916138),
 ('bother', 0.5378204584121704),
 ('liked', 0.5309802293777466),
 ('know', 0.5261911153793335),
 ('good', 0.5199871063232422)]
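The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows the computation on small made-up vectors (the numbers are not taken from the trained model).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [-2.0, 1.0, 0.0]  # orthogonal to a
sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

This is probably also why a word such as "horrible" can rank close to "like" in the list above: the similarity reflects words that occur in similar contexts, not words that share a sentiment.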

In [72]:
#count_vect = CountVectorizer() #in scikit-learn
#final_counts = count_vect.fit_transform(final['Text'].values)

In [73]:
from tqdm import tqdm

# TF-IDF

In [74]:

tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(final['Text'].values)


In [75]:
final_tf_idf.get_shape()


(18209, 406554)

In [76]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
features = tf_idf_vect.get_feature_names()
len(features)


406554

In [77]:
features[100000:100010]


['developed sore',
 'developed spring',
 'developed strong',
 'developed superiour',
 'developed taste',
 'developed terrible',
 'developed the',
 'developed to',
 'developed uti',
 'developed vomiting']

In [78]:
# convert a row of the sparse matrix to a numpy array
print(final_tf_idf[3,:].toarray()[0]) 


[0. 0. 0. ... 0. 0. 0.]


In [79]:
# source: https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

top_tfidf = top_tfidf_feats(final_tf_idf[1,:].toarray()[0],features,25)

# TFIDF-W2V

### TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

### It has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).
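As a worked example of the product described above, the sketch below computes TF-IDF by hand for a toy corpus. It uses the smoothed IDF and the L2 row normalisation that scikit-learn's `TfidfVectorizer` applies by default; the three documents and the helper names are made up for illustration.

```python
import math
from collections import Counter

docs = [["good", "taffy"], ["good", "coffee"], ["bad", "coffee", "coffee"]]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    # Smoothed variant: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(doc)
    raw = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, "taffy" receives a higher weight than "good" because it appears in fewer documents of the corpus.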

In [80]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(final['Text'].values)


In [81]:
tfidf_feat = tf_idf_vect.get_feature_names() # tfidf words/col-names
# final_tf_idf is the sparse matrix with row = review, col = word or bigram, cell value = tfidf

tfidf_sent_vectors = [] # the tfidf-w2v vector for each review is stored in this list
row = 0
for sent in tqdm(list_of_sent[1:250]): # for each review (the first review is skipped)
    sent_vec = np.zeros(50) # accumulator; the word vectors have 50 dimensions
    weight_sum = 0 # sum of the weights of the words that have a valid vector
    for word in sent: # for each word in the review
        try:
            vec = w2v_model.wv[word]
            # NOTE: this is the column index of the word in the tf-idf vocabulary,
            # not its tf-idf value, so the average is weighted by vocabulary position
            tf_idf = tfidf_feat.index(word)
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
        except (KeyError, ValueError): # word missing from the w2v or tf-idf vocabulary
            pass
    sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1

100%|████████████████████████████████████████████████████████████████████████████████| 249/249 [04:52<00:00,  1.18s/it]
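Note that `tfidf_feat.index(word)` in the loop above returns the column position of the word in the vocabulary rather than its TF-IDF value, and each call scans the full feature list, which likely accounts for most of the running time. A sketch of a weighted average that uses actual TF-IDF weights with constant-time lookups is given below. The function name, the `idf` dictionary and the toy vectors are assumptions for illustration; with scikit-learn, such a dictionary could be built from the fitted vectorizer's `vocabulary_` and `idf_` attributes.

```python
import numpy as np
from collections import Counter

def tfidf_weighted_vector(tokens, word_vectors, idf, dim):
    """Average the word vectors of `tokens`, weighted by tf * idf."""
    counts = Counter(tokens)
    sent_vec = np.zeros(dim)
    weight_sum = 0.0
    for word, count in counts.items():
        if word not in word_vectors or word not in idf:
            continue  # skip words without a vector or an idf value
        weight = (count / len(tokens)) * idf[word]
        sent_vec += weight * np.asarray(word_vectors[word], dtype=float)
        weight_sum += weight
    # Guard against reviews with no known words (avoids division by zero).
    return sent_vec / weight_sum if weight_sum > 0 else sent_vec

# Toy inputs; the vectors and idf values are made up.
word_vectors = {"good": [1.0, 0.0], "taffy": [0.0, 1.0]}
idf = {"good": 1.0, "taffy": 3.0}
vec = tfidf_weighted_vector(["good", "taffy", "unknown"], word_vectors, idf, dim=2)
print(vec)
```

Using this weighting would change the sentence vectors, so the downstream results would have to be recomputed.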


In [82]:
tfidf_sent_vectors

[array([ 0.58819955,  0.32212378, -0.53847136,  0.08145294,  0.67445032,
         0.79645481, -0.03563814, -0.61212155, -0.20453362,  0.00901039,
        -0.46704635,  0.52748719, -0.88462477,  0.0997008 , -0.03059553,
        -0.22769248, -0.78329203,  0.50340322,  0.84317135,  0.05938958,
         0.03647718,  0.36094766, -0.23812012,  0.23962454,  0.07095457,
         0.07486947,  0.50863129,  0.35297264,  0.23074683, -0.43016828,
         0.12093718,  0.24272389,  0.41042191, -0.0780606 , -0.24160862,
        -0.29825919,  0.45832736, -0.00979008,  0.10387476, -0.35432108,
        -0.78240953,  0.17926363,  0.64759755, -0.53630346,  0.287089  ,
         0.40308982, -0.04894509,  0.22170548, -0.67142313, -0.48433367]),
 array([-0.76338243,  0.02908873, -0.8991248 ,  1.05484207, -0.55022031,
         0.48600439, -0.18565574, -0.1228809 , -1.76772659,  0.30488618,
        -1.2318429 ,  0.71288802, -1.28376903, -0.03728773,  0.46580976,
         0.13030297, -0.11150959,  0.37264401,  0

In [123]:
Sent2Vec = pd.DataFrame(tfidf_sent_vectors)
Sent2Vec.shape

(249, 50)

In [121]:
Final_1_to_250 = final.iloc[1:250]
Final_1_to_250.shape

(249, 10)

In [127]:
Sent2Vec.reset_index(drop=True, inplace=True)
Final_1_to_250.reset_index(drop=True, inplace=True)

In [128]:
Sent2Vec.shape

(249, 50)

In [129]:
Final_1_to_250.shape

(249, 10)

In [130]:
Final_df_with_sentVec = pd.concat([Final_1_to_250,Sent2Vec], axis = 1 )

In [131]:
Final_df_with_sentVec.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,...,40,41,42,43,44,45,46,47,48,49
0,75310,B0009X2A60,A3QF86LI1H6351,"Rev. Joyce E. Perdue ""J. Perdue""",0,0,positive,1311811200,Cat in Love with Food,My cat will eat nothing else. Makes his coat ...,...,-0.78241,0.179264,0.647598,-0.536303,0.287089,0.40309,-0.048945,0.221705,-0.671423,-0.484334
1,461370,B0000DG5CH,A1IWKPQIFU4ATS,Nicky Nickleby,0,0,positive,1104364800,JACK SPARROW LOVES IT!,BUY IT AS A GIFT AND GIVE TO A PIRATE YOU KNOW...,...,-1.005146,-0.831135,0.85206,-0.36514,-1.019252,1.096068,-0.776007,0.039781,0.155959,-0.571045
2,147856,B005HBRVKO,A3CQIIOSPAOQRC,Sumatocha,0,0,positive,1322697600,Best granola in the world!!!,"If you are looking for a healthy, great tastin...",...,-0.744679,-0.462751,0.762627,-0.665582,-0.376273,0.438594,-0.576625,-0.093627,-0.337893,-0.476891
3,190852,B000XN0YB4,A3O1DWMZUNXLPL,42247,0,0,positive,1302912000,Wonderful Product,I thought this were very good. I bought the m...,...,0.069917,0.192627,0.289097,-0.438832,-0.104104,0.452769,-0.084883,0.065292,0.726657,-0.378159
4,499104,B005MRUWLI,A2M54NAUU5DJSK,Catherine,4,8,negative,1319673600,Big Disappointment!,I love the idea of the drawer that you can put...,...,0.002509,-0.099832,0.8602,-0.527537,-0.495638,0.221569,-0.191833,-0.004455,0.576154,-0.233135


In [133]:
Final_df_with_sentVec.to_csv('Final_df_with_sentVec_1_250.csv')

In [148]:
df_sent2Vec = pd.read_csv(r"M:\IIST\SEM 2\Data Modelling Lab 2\NLP\NLP MINI Project\Final_df_with_sentVec_1_250.csv")

In [149]:
df_sent2Vec['Score'] = df_sent2Vec['Score'].replace(["positive","negative"],[1,0])

In [150]:
df_sent2Vec['Score']

0      1
1      1
2      1
3      1
4      0
      ..
244    1
245    1
246    1
247    0
248    1
Name: Score, Length: 249, dtype: int64

In [174]:
df_sent2Vec.isnull().values.any()

False

In [164]:
X = df_sent2Vec.drop(["Unnamed: 0","Id","ProductId","UserId","ProfileName","HelpfulnessNumerator","HelpfulnessDenominator","Time", "Summary","Score","Text"],axis = 1)

In [167]:
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,0.5882,0.322124,-0.538471,0.081453,0.67445,0.796455,-0.035638,-0.612122,-0.204534,0.00901,...,-0.78241,0.179264,0.647598,-0.536303,0.287089,0.40309,-0.048945,0.221705,-0.671423,-0.484334
1,-0.763382,0.029089,-0.899125,1.054842,-0.55022,0.486004,-0.185656,-0.122881,-1.767727,0.304886,...,-1.005146,-0.831135,0.85206,-0.36514,-1.019252,1.096068,-0.776007,0.039781,0.155959,-0.571045
2,-0.224117,0.74182,-0.555298,0.230184,-0.21436,0.205023,-0.518163,-0.157328,-1.277998,-0.20496,...,-0.744679,-0.462751,0.762627,-0.665582,-0.376273,0.438594,-0.576625,-0.093627,-0.337893,-0.476891
3,-0.026295,0.404695,-0.455078,-0.16741,0.63793,0.116415,-0.224196,-0.34163,-0.85268,-0.164182,...,0.069917,0.192627,0.289097,-0.438832,-0.104104,0.452769,-0.084883,0.065292,0.726657,-0.378159
4,-0.187803,0.230454,-0.781271,-0.611435,0.147687,0.008649,0.020957,-0.30696,-1.252965,-0.178395,...,0.002509,-0.099832,0.8602,-0.527537,-0.495638,0.221569,-0.191833,-0.004455,0.576154,-0.233135


In [249]:
Y = df_sent2Vec["Score"]

In [250]:
Y.head()

0    1
1    1
2    1
3    1
4    0
Name: Score, dtype: int64

In [260]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [261]:
X_train.shape

(199, 50)

In [262]:
y_train.shape

(199,)

In [296]:
from sklearn.linear_model import LogisticRegression

In [297]:
clf = LogisticRegression(random_state=0,max_iter=10000).fit(X_train,y_train)

In [298]:
y_pred = clf.predict(X_test)
y_pred.shape

(50,)

In [299]:
y_test.shape

(50, 1)

In [300]:
y_pred = y_pred.reshape(-1,1)
y_test = np.array(y_test).reshape(-1,1)

In [301]:
confusion_matrix(y_test, y_pred)

array([[ 1,  9],
       [ 1, 39]], dtype=int64)

In [1]:
40/50

0.8

## SIGMOID FUNCTION to convert CONTINUOUS VALUES TO PROBABILITIES

In [269]:
def cont2prob(x):
    return 1/(1 + np.exp(-x))
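For large negative inputs, `np.exp(-x)` overflows and NumPy emits a runtime warning, although the result still rounds to 0. A numerically stable variant, offered here as an optional sketch (the name `stable_sigmoid` is illustrative), only ever exponentiates non-positive numbers:

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that avoids overflow by exponentiating only -|x|."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # always in (0, 1], so it cannot overflow
    # x >= 0: 1 / (1 + exp(-x));  x < 0: exp(x) / (1 + exp(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

probs = stable_sigmoid([-1000.0, 0.0, 1000.0])
print(probs)
```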

### LOGISTIC REGRESSION TRAIN METHOD using GRADIENT DESCENT ALGORITHM

In [270]:
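def Logistic_gradient_descent_TRAIN(x,y,iteration,lr):
    # results are kept in globals so that the TEST function can reuse w0 and w
    global w0, w, iterationlist, wlist, w0list, jwlist, mse_list
    global y_pred_conti_TRAIN, y_pred_prob_TRAIN

    k = list(x.columns)
    w0 = 0               # intercept
    w = [0]*len(k)       # one weight per feature
    iterationlist = []   # iteration numbers
    wlist = []           # weight vector after each update
    w0list = []          # intercept after each update
    jwlist = []          # cost at each iteration
    mse_list = []        # not used

    for i in range(iteration):
        iterationlist.append(i)

        # linear score and predicted probability for every training sample
        y_pred_conti_TRAIN = np.dot(x,w) + w0
        y_pred_prob_TRAIN = cont2prob(y_pred_conti_TRAIN)

        # cross-entropy cost (summed over the samples, not averaged)
        jw = - np.sum( np.dot(y,np.log(y_pred_prob_TRAIN)) + np.dot( (1 - y), np.log(1 - y_pred_prob_TRAIN) ) )
        jwlist.append(jw)

        # gradients of the cost with respect to w0 and w
        w0d = np.sum(y_pred_prob_TRAIN-y)
        wd = np.dot((y_pred_prob_TRAIN-y),x)

        # gradient descent update
        w = w - lr*np.array(wd)
        wlist.append(w)
        w0 = w0 - lr*np.array(w0d)
        w0list.append(w0)

    print("w0:{} w:{} cost:{}".format(w0,w,jw))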
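For reference, the quantities computed inside the training loop correspond to the expressions below, where $p_i$ is the predicted probability for sample $i$, $m$ is the number of training samples and $\eta$ is the learning rate `lr`. The sums are not divided by $m$, matching the code.

```latex
\begin{aligned}
J(w, w_0) &= -\sum_{i=1}^{m} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr],
  \qquad p_i = \sigma(w^\top x_i + w_0) = \frac{1}{1 + e^{-(w^\top x_i + w_0)}} \\
\frac{\partial J}{\partial w} &= \sum_{i=1}^{m} (p_i - y_i)\, x_i \\
\frac{\partial J}{\partial w_0} &= \sum_{i=1}^{m} (p_i - y_i) \\
w &\leftarrow w - \eta\, \frac{\partial J}{\partial w},
  \qquad w_0 \leftarrow w_0 - \eta\, \frac{\partial J}{\partial w_0}
\end{aligned}
```

Because the gradient is a sum rather than a mean, the effective step size grows with the number of training samples, which is one reason a small learning rate such as `1e-3` is a sensible choice here.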

### FUNCTION CALL TO LOGISTIC REGRESSION USING GRADIENT DESCENT
#### Logistic_gradient_descent_TRAIN(X_train , y_train , iteration , lr)

### LOGISTIC REGRESSION TEST METHOD using GRADIENT DESCENT ALGORITHM

In [271]:
def Logistic_gradient_descent_TEST(x,y,decision_boundary_probability):
    # uses the global w and w0 set by Logistic_gradient_descent_TRAIN
    global y_pred_conti_TEST, y_pred_prob_TEST
    global TN,FN,TP,FP

    y_pred_conti_TEST = np.dot(x,w) + w0
    y_pred_prob_TEST = cont2prob(y_pred_conti_TEST)

    # class 1 if the predicted probability exceeds the decision boundary
    y_pred_class = [1 if i>decision_boundary_probability else 0 for i in y_pred_prob_TEST]

    # compare the predictions with the labels passed in as y
    C = confusion_matrix(y,y_pred_class)
    TN = C[0][0]
    FN = C[1][0]
    TP = C[1][1]
    FP = C[0][1]
    print(C)
    

In [279]:
Logistic_gradient_descent_TRAIN(X_train,y_train,10000,1e-3)

w0:1.665208093410704 w:[-1.26198681  1.81215813 -0.62073714  4.08543386  1.32551675 -0.47983064
  1.95845197 -1.84223338  2.38358736 -2.48798195 -3.25703321  0.09167127
 -1.47662013 -1.1957899  -0.5638654   0.46971467 -1.29391416  3.73733268
 -1.80903256  1.02253198  3.0661478   1.19079525  2.70851297  0.6221902
 -0.73127707  0.0296016   0.11804109 -0.91767144 -0.23221185  1.68318837
 -0.36116404 -0.24248065 -1.55268628  2.81546798  0.74139095 -3.1055933
 -0.52386193  0.33266842 -2.00590342 -0.49656473 -4.07084494 -0.07474801
  0.14622637 -1.9145876   3.79982449  2.65271265  0.49768462 -3.85026463
  2.19921083 -3.64210379] cost:29.28314652996548


In [283]:
Logistic_gradient_descent_TEST(X_test,y_test,0.5)

[[ 5  5]
 [ 3 37]]


In [287]:
42/50

0.84
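Because the test set holds 40 positive and only 10 negative reviews, accuracy alone hides how each model treats the minority class. The sketch below derives per-class recall and balanced accuracy from the two confusion matrices reported above, using the `[[TN, FP], [FN, TP]]` layout that scikit-learn produces for labels 0 and 1; the helper name is illustrative.

```python
def summarize(matrix):
    """Accuracy and per-class recall from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    recall_negative = tn / (tn + fp)
    recall_positive = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "recall_negative": recall_negative,
        "recall_positive": recall_positive,
        "balanced_accuracy": (recall_negative + recall_positive) / 2,
    }

sklearn_lr = summarize([[1, 9], [1, 39]])
gradient_descent_lr = summarize([[5, 5], [3, 37]])
for name, result in [("scikit-learn", sklearn_lr),
                     ("gradient descent", gradient_descent_lr)]:
    print(name, {key: round(value, 4) for key, value in result.items()})
```

On these numbers the scikit-learn model recovers only 1 of the 10 negative reviews, while the gradient descent model recovers 5, a difference that the accuracy figures of 0.80 and 0.84 barely show.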