# LSTM on Amazon Fine Food Reviews Data

### Amazon Fine Food Reviews dataset: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
   - Number of reviews: 568,454
   - Number of users: 256,059
   - Number of products: 74,258
   - Timespan: Oct 1999 - Oct 2012
   - Number of attributes/columns in the data: 10 (including the class attribute)

Attribute Information:

   - Id
   - ProductId - unique identifier for the product
   - UserId - unique identifier for the user
   - ProfileName
   - HelpfulnessNumerator - number of users who found the review helpful
   - HelpfulnessDenominator - total number of users who indicated whether they found the review helpful or not
   - Time - timestamp for the review
   - Summary - brief summary of the review
   - Text - text of the review
   - Score (class label) - rating between 1 and 5 (ratings 1 and 2 are negative, ratings 4 and 5 are positive, and rating 3 is neutral)

### Columns/Attributes Created by Cleaning
- Cleaned data: (Refer: "Preprocessing-Amazon fine food review.ipynb")
    - c_text - cleaned and stemmed version of the "Text" attribute from the original dataset.
    - c_summary - cleaned and stemmed version of the "Summary" attribute from the original dataset.
    - nostem_text - cleaned version of the "Text" attribute without stemming.
    - nostem_summary - cleaned version of the "Summary" attribute without stemming.

# 1 Importing Required Libraries

In [1]:
import warnings
# tqdm_notebook is deprecated in recent tqdm releases; tqdm.notebook.tqdm is its replacement
from tqdm.notebook import tqdm
import re, os, sqlite3, pickle
import pandas as pd 

from collections import Counter 

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

warnings.simplefilter('ignore')

from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix,csc_matrix
from sklearn.model_selection import train_test_split

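# 2 Creating IMDB-style data

Here "IMDB-style" refers to the format of the Keras IMDB dataset: each review is stored as a list of integers, and each integer is the frequency rank of a word (rank 1 is the most frequent word).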

## Importing Cleaned data

In [2]:
link = sqlite3.connect('../data_clean/cleaned_reviews.sqlite')
cleaned_data= pd.read_sql_query(''' SELECT * FROM creviews ''',link)
link.close()


cleaned_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,c_text,c_summary,nostem_text,nostem_summary
0,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,witti littl book make son laugh loud recit car...,everi book educ,witty little book makes son laugh loud recite ...,every book educational
1,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,1,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",grew read sendak book watch realli rosi movi i...,love book miss hard cover version,grew reading sendak books watching really rosi...,love book miss hard cover version
2,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,1,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...,fun way children learn month year learn poem t...,chicken soup rice month,fun way children learn months year learn poems...,chicken soup rice months
3,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,1,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...,great littl book read nice rhythm well good re...,good swingi rhythm read aloud,great little book read nice rhythm well good r...,good swingy rhythm reading aloud
4,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,1,1018396800,A great way to learn the months,This is a book of poetry about the months of t...,book poetri month year goe month cute littl po...,great way learn month,book poetry months year goes month cute little...,great way learn months


## Sorting Cleaned data

In [3]:
cleaned_data=cleaned_data[['Id','Score','Time','c_text']]
cleaned_data.head()

Unnamed: 0,Id,Score,Time,c_text
0,150524,1,939340800,witti littl book make son laugh loud recit car...
1,150506,1,1194739200,grew read sendak book watch realli rosi movi i...
2,150507,1,1191456000,fun way children learn month year learn poem t...
3,150508,1,1076025600,great littl book read nice rhythm well good re...
4,150509,1,1018396800,book poetri month year goe month cute littl po...


In [4]:
# Sort by Time so that the later train/test split is chronological
cleaned_data = cleaned_data.sort_values('Time',
                                        axis=0, 
                                        ascending=True, 
                                        inplace=False,
                                        kind='quicksort', 
                                        na_position='last')

text = cleaned_data['c_text'].values
y_true=cleaned_data['Score']
cleaned_data.head()

Unnamed: 0,Id,Score,Time,c_text
0,150524,1,939340800,witti littl book make son laugh loud recit car...
30,150501,1,940809600,rememb see show air televis year ago child sis...
215,76882,1,948672000,bought apart infest fruit fli hour trap mani f...
241,1245,1,961718400,realli good idea final product outstand use de...
242,1244,1,962236800,receiv shipment could hard wait tri product lo...


## Train-Test split

In [5]:
# Split 70:30 into train and test; shuffle=False keeps the time order (oldest reviews in train)
train_text, test_text, y_train, y_test = train_test_split(text, y_true, 
                                                    test_size=0.3, 
                                                    shuffle=False)
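
Because the rows were sorted by Time and the split is not shuffled, every training review should predate every test review. The sketch below illustrates that property with plain slicing on a toy list; the timestamps, review strings, and the names `toy_reviews`, `toy_train`, and `toy_test` are made up for illustration and are not part of the notebook.

```python
# Toy (timestamp, review) pairs; values are illustrative only
toy_reviews = [(1194739200, 'review b'), (939340800, 'review a'),
               (1018396800, 'review c'), (1076025600, 'review d'),
               (1191456000, 'review e'), (948672000, 'review f'),
               (961718400, 'review g'), (962236800, 'review h'),
               (940809600, 'review i'), (1200000000, 'review j')]

# Sort chronologically, as the notebook does with sort_values('Time')
toy_sorted = sorted(toy_reviews, key=lambda pair: pair[0])

# Keep the oldest 70% for training and the most recent 30% for testing
cut = len(toy_sorted) * 7 // 10
toy_train, toy_test = toy_sorted[:cut], toy_sorted[cut:]

latest_train = max(t for t, _ in toy_train)
earliest_test = min(t for t, _ in toy_test)
print(len(toy_train), len(toy_test), latest_train <= earliest_test)
```

A chronological split of this kind avoids evaluating the model on reviews that are older than the ones it was trained on.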


In [6]:
train_text, test_text

(array(['witti littl book make son laugh loud recit car drive along alway sing refrain hes learn whale india droop love new word book introduc silli classic book will bet son still abl recit memori colleg',
        'rememb see show air televis year ago child sister later bought day thirti someth use seri book song student teach preschool turn whole school purchas along book children tradit live',
        'bought apart infest fruit fli hour trap mani fli within day practic gone may not long term solut fli drive crazi consid buy one surfac sticki tri avoid touch',
        ...,
        'buy cider mill fall kept eat hope healthier potato chip cider mill close good kettl corn hard find look amazon well bought case warehous deal husband take one bag mom assist live home gone alreadi corn tast great habit form husband alway pick shop corn hasnt complain keep go garag get anoth bag eat get home work definit buy box empti',
        'pick tin republ tea peppermint chocol today local store not kn

## Counting word frequency

In [7]:
# Build the vocabulary with CountVectorizer (fitted on the train text only)
count = CountVectorizer() 

train_vector=count.fit_transform(train_text)
test_vector=count.transform(test_text)
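# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
feature_names = count.get_feature_names_out()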


print('Shape of Train vector',train_vector.shape)
print('Shape of Test vector',test_vector.shape)
print('Total number of unique words in the Train data -',
      len(feature_names))

Shape of Train vector (254786, 59059)
Shape of Test vector (109194, 59059)
Total number of unique words in the Train data - 59059


## Converting csr to csc sparse format and calculating the frequency of each word in the training corpus

In [9]:
col_vector=csc_matrix(train_vector)

idx=0
freq_dict=dict()

for idx in tqdm(range(len(feature_names))):
    wrd=feature_names[idx]
    val=col_vector[:,idx].todense().sum()
    freq_dict[wrd]=val
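
The loop above densifies one column at a time. A possibly faster alternative is to sum the sparse matrix along axis 0 in a single call. This is a minimal sketch on a toy matrix; `toy_vector`, `toy_features`, and `toy_freq` are hypothetical stand-ins for `train_vector`, `feature_names`, and `freq_dict`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-term matrix: 3 documents x 4 vocabulary terms (made-up counts)
toy_vector = csr_matrix(np.array([[1, 0, 2, 0],
                                  [0, 1, 1, 0],
                                  [3, 0, 0, 1]]))
toy_features = ['good', 'like', 'not', 'tast']

# Summing over axis 0 yields one total per column (term) as a 1 x n matrix
totals = np.asarray(toy_vector.sum(axis=0)).ravel()

# Pair each term with its corpus-wide count
toy_freq = {word: int(count) for word, count in zip(toy_features, totals)}
print(toy_freq)
```

With this approach the conversion to csc format would not be needed, since no per-column slicing takes place.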


## Converting dict to dataframe and sorting frequency in descending order

In [10]:
df_freq=pd.DataFrame.from_dict(freq_dict,
                               orient='index',
                               dtype=None,
                               columns=['freq'])

sorted_data = df_freq.sort_values('freq',
                                axis=0, 
                                ascending=False, 
                                inplace=False,
                                kind='quicksort', 
                                na_position='last')
print(sorted_data.head())

          freq
not     137870
like    117929
tast    113255
flavor   90151
good     88663


## Creating ranked order by resetting dataframe index

In [11]:
sorted_data=sorted_data.reset_index()
sorted_data.head()

Unnamed: 0,index,freq
0,not,137870
1,like,117929
2,tast,113255
3,flavor,90151
4,good,88663


## Creating dict of words with their ranks

In [12]:
wrd_rank=dict()
rank=1
for i in tqdm(range(len(sorted_data))):
    wrd=sorted_data.iloc[i]['index']
    wrd_rank[wrd]=rank
    rank+=1
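
The same word-to-rank mapping can be built without row-by-row `iloc` access by enumerating the words in descending frequency order. The sketch below uses the five most frequent words from the output shown earlier; the list `toy_words_by_freq` stands in for `sorted_data['index']` and is an assumption for illustration.

```python
# Words already sorted by descending frequency (top five from the output above)
toy_words_by_freq = ['not', 'like', 'tast', 'flavor', 'good']

# Rank 1 is the most frequent word, rank 2 the next, and so on
toy_rank = {word: rank for rank, word in enumerate(toy_words_by_freq, start=1)}
print(toy_rank)
```

Starting the ranks at 1 leaves 0 unused, which is convenient if 0 is later needed as a padding value.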


In [13]:
# Saving the word rank dictionary
with open( './model/wrd_rank.pkl','wb') as fi:
    pickle.dump(wrd_rank,fi)

## Creating Ranked format of Amazon Reviews data

### For Train data

In [14]:
train_ranked_data=[]
train_review_len=[]
for sent in tqdm(train_text):
    row=[]
    for word in sent.split():
        try:
            row.append(wrd_rank[word])
        except KeyError:
            # Word is missing from the rank dictionary; skip it
            pass
    train_ranked_data.append(row)
    train_review_len.append(len(row))


### For Test data

In [15]:
test_ranked_data=[]
test_review_len=[]
for sent in tqdm(test_text):
    row=[]
    for word in sent.split():
        try:
            row.append(wrd_rank[word])
        except KeyError:
            # Word was not seen in the train data; skip it
            pass
    test_ranked_data.append(row)
    test_review_len.append(len(row))
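
The train and test loops apply the same encoding, so they could share one helper. The sketch below is one way to write it; `encode_reviews` and the toy inputs are hypothetical names, and words missing from the rank dictionary are dropped, as in the loops above.

```python
def encode_reviews(reviews, word_to_rank):
    """Map each review string to a list of ranks, skipping unknown words."""
    encoded = []
    for review in reviews:
        # Membership test replaces try/except: unknown words are simply omitted
        encoded.append([word_to_rank[w] for w in review.split()
                        if w in word_to_rank])
    return encoded


# Toy rank dictionary and reviews; 'zzz' plays the role of an unseen test word
toy_rank = {'not': 1, 'like': 2, 'tast': 3, 'flavor': 4, 'good': 5}
toy_encoded = encode_reviews(['tast good', 'not like zzz flavor', ''], toy_rank)
toy_lengths = [len(row) for row in toy_encoded]
print(toy_encoded, toy_lengths)
```

Because the vocabulary is built from the train text only, test reviews can contain words with no rank, and their encoded length may be shorter than the number of words in the review.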


## Saving the data

In [20]:
review_len={'train_review_len':train_review_len,
             'test_review_len':test_review_len}


with open( './model/review_length_list.pkl','wb') as fi:
    pickle.dump(review_len,fi)

In [21]:
data={'train_ranked_data':train_ranked_data,
      'test_ranked_data':test_ranked_data,
      'y_train':y_train,
      'y_test':y_test}

with open( './model/amazon_imdb_form_data.pkl','wb') as fi:
    pickle.dump(data,fi)
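
The saved reviews have different lengths, so they would typically be padded or truncated to a fixed length before being fed to an LSTM. Since the ranks start at 1, the value 0 is free to act as padding. The sketch below pads and truncates on the left, which mirrors the default behavior of Keras `pad_sequences`; the helper `pad_left` and the toy sequences are hypothetical, and `max_len` is assumed to be at least 1.

```python
import numpy as np


def pad_left(sequences, max_len, pad_value=0):
    """Left-pad (or left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[-max_len:]  # keep only the last max_len ranks
        if trimmed:               # an empty review stays all padding
            out[i, -len(trimmed):] = trimmed
    return out


# Toy ranked reviews of length 2, 5, and 0
toy_padded = pad_left([[3, 5], [1, 2, 4, 5, 3], []], max_len=4)
print(toy_padded)
```

The review lengths stored in `review_length_list.pkl` could help in choosing a suitable `max_len`.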

## Conclusion:
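   - The Amazon Fine Food Reviews data has been converted to an IMDB-style format, in which each review is a list of word frequency ranks, and saved as a pickle file (./model/amazon_imdb_form_data.pkl).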