## Text Classification on Amazon Fine Food Dataset with Google Word2Vec Word Embeddings in Gensim and training using LSTM In Keras.

### IMPORTING THE MODULES

In [48]:
# Ignore the warnings
import warnings
warnings.filterwarnings('ignore')

# data visualization and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
#configure
# sets matplotlib to inline and displays graphs below the corresponding cell.
# matplotlib inline
style.use('fivethirtyeight')
sns.set(style='whitegrid', color_codes=True)

# nltk
import nltk

#preprocessing
from nltk.corpus import stopwords  #stopwords
from nltk import word_tokenize, sent_tokenize  # tokenizing
from nltk.stem import PorterStemmer, LancasterStemmer  # using the Porter Stemmer and Lancaster Stemmer and others
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer  # lemmatizer from WordNet

# for part-of-speech tagging
from nltk import pos_tag

# for named entity recognition (NER)
from nltk import ne_chunk

# vectorizers for creating the document-term-matrix (DTM)
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

# BeautifulSoup library
from bs4 import BeautifulSoup

import re  # regex

#model_selection
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

#evaluation
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.metrics import classification_report
from mlxtend.plotting import plot_confusion_matrix


#preprocessing (scikit-learn)
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer  # 'Imputer' is deprecated from 'sklearn.preprocessing'

#classification.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB

#stop-words
stop_words = set(nltk.corpus.stopwords.words('english'))

#keras
import keras
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding, Input, LSTM  # cannot import name 'CuDNNLSTM' from 'keras.layers'

from keras.models import Model
from keras.preprocessing.text import text_to_word_sequence

#gensim w2v
#word2vec
from gensim.models import Word2Vec

### LOADING THE DATASET

In [49]:
rev_frame = pd.read_csv(r'./input/Reviews.csv')
df = rev_frame.copy()

In [50]:
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568438 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB


In [52]:
# sum every numeric column per (user, product) pair
df.groupby(['UserId', 'ProductId']).sum(numeric_only=True)

UserId,ProductId,Id,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time
#oc-R103C0QSV1DF5E,B006Q820X0,136323,1,2,5,1343088000
#oc-R109MU5OBBZ59U,B008I1XPKA,516062,0,1,5,1350086400
#oc-R10LFEMQEW6QGZ,B008I1XPKA,516079,0,1,5,1345939200
#oc-R10LT57ZGIB140,B0026LJ3EA,378693,0,0,3,1310601600
#oc-R10UA029WVWIUI,B006Q820X0,136545,0,0,1,1342483200
...,...,...,...,...,...,...
AZZV9PDNMCOZW,B003SNX4YA,422838,0,0,4,1329436800
AZZVNIMTTMJH6,B000FI4O90,190698,0,0,5,1268179200
AZZY649VYAHQS,B000N9VLJ2,222781,1,1,5,1309737600
AZZYCJOJLUDYR,B001SB22UG,131469,0,0,5,1337472000


In [53]:
print(df['Time'].min())
print(df['Time'].max())

939340800
1351209600
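The two numbers above are easier to read as calendar dates. A minimal sketch, assuming the `Time` column holds Unix timestamps in seconds (UTC); the helper name `unix_to_date` is only for illustration:

```python
from datetime import datetime, timezone

def unix_to_date(ts):
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to an ISO date string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

first_review = unix_to_date(939340800)    # df['Time'].min()
last_review = unix_to_date(1351209600)    # df['Time'].max()
print(first_review, last_review)
```

Under that assumption the range runs from October 1999 to October 2012, which agrees with the dataset description on Kaggle.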


In [54]:
len(df['UserId'].unique())

256059

#### A brief description of the dataset from the Overview tab on Kaggle:

Data includes:
- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews

### DATA CLEANING AND PRE-PROCESSING

Since I am concerned here with **sentiment analysis**, I shall keep only the 'Text' and 'Score' columns.

In [55]:
df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [56]:
df = df[['Text', 'Score']]

In [57]:
df.rename({'Text': 'review', 'Score': 'rating'}, axis=1, inplace=True)

In [58]:
print(df.shape)
df.head()

(568454, 2)


Unnamed: 0,review,rating
0,I have bought several of the Vitality canned d...,5
1,Product arrived labeled as Jumbo Salted Peanut...,1
2,This is a confection that has been around a fe...,4
3,If you are looking for the secret ingredient i...,2
4,Great taffy at a great price. There was a wid...,5


Let us now see if any of the columns has null values.

In [59]:
# check for null values
print(df['rating'].isnull().sum())
df['review'].isnull().sum()  # no null values.

0


0

Note that exact duplicates (the same review text with the same rating) add no information, so I will keep only the first instance of each and drop the rest. Rows that share a review text but carry different ratings are left untouched by this step.

In [60]:
# remove duplicates: for every set of duplicate rows we keep only the first one.
df.drop_duplicates(subset=['review', 'rating'], keep='first', inplace=True)

In [61]:
# now check the shape. note that the shape is reduced, which shows that we did have duplicate rows.
print(df.shape)
df.head()

(393675, 2)


Unnamed: 0,review,rating
0,I have bought several of the Vitality canned d...,5
1,Product arrived labeled as Jumbo Salted Peanut...,1
2,This is a confection that has been around a fe...,4
3,If you are looking for the secret ingredient i...,2
4,Great taffy at a great price. There was a wid...,5
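To make the effect of `subset=['review', 'rating']` concrete, here is a small sketch on a toy frame (the data is invented for illustration). It also shows a hypothetical stricter variant, not used in this notebook, that drops any text appearing with more than one rating:

```python
import pandas as pd

toy = pd.DataFrame({
    "review": ["great taffy", "great taffy", "great taffy", "stale chips"],
    "rating": [5, 5, 1, 2],
})

# Same call as in the notebook: only rows repeating BOTH text and rating are dropped.
exact = toy.drop_duplicates(subset=["review", "rating"], keep="first")

# Hypothetical stricter variant: keep only texts that carry a single distinct rating.
ratings_per_text = exact.groupby("review")["rating"].transform("nunique")
consistent = exact[ratings_per_text == 1]

print(len(exact), len(consistent))
```

In the toy frame the text "great taffy" survives the first step twice (once with rating 5, once with rating 1), so conflicting labels for the same text can remain in the data.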


Let us now print some reviews and see if we can get insights from the text.

In [62]:
# printing some reviews to see insights.
for review in df['review'][:5]:
    print(review + "\n\n")

I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.


Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".


This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.


If you are looking for the se

There is not much I can figure out from these, except that there are some stray words and some punctuation that we have to remove before moving ahead.

**But note that if I remove the punctuation now, it will be difficult to break the reviews into sentences, which is what the Word2Vec constructor in Gensim expects. So we will first break the text into sentences and then clean those sentences.**

Since we are doing sentiment analysis, I will convert the values in the score column to a sentiment label: 0 for ratings of 3 or less and 1 (positive) for ratings of 4 or 5.

In [63]:
def mark_sentiment(rating):
    if (rating<=3):
        return 0
    else:
        return 1

In [64]:
df['sentiment'] = df['rating'].apply(mark_sentiment)

In [65]:
df.drop(['rating'], axis=1, inplace=True)

In [66]:
df.head()

Unnamed: 0,review,sentiment
0,I have bought several of the Vitality canned d...,1
1,Product arrived labeled as Jumbo Salted Peanut...,0
2,This is a confection that has been around a fe...,1
3,If you are looking for the secret ingredient i...,0
4,Great taffy at a great price. There was a wid...,1


In [67]:
df['sentiment'].value_counts()

1    306819
0     86856
Name: sentiment, dtype: int64

As you can see, the sentiment column now holds the sentiment of the corresponding product review.

#### Pre-processing steps :

1) First **remove punctuation and HTML tags**, if any. HTML tags may be present because the data was presumably scraped from the web.

2) **Tokenize** the reviews into tokens or words.

3) Next **remove the stop words and shorter words** as they cause noise.

4) **Stem or lemmatize** the words, depending on what works better. Here I have used a lemmatizer.

In [68]:
# function to clean and pre-process the text.
lemmatizer = WordNetLemmatizer()  # created once and reused across calls

def clean_reviews(review):

    # 1. Remove HTML tags
    review_text = BeautifulSoup(review, "lxml").get_text()

    # 2. Keep letters only (regular expression); everything else becomes a space
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    # 3. Convert to lower case and split into tokens
    word_tokens = review_text.lower().split()

    # 4. Remove stop words (the `stop_words` set built with the imports) and lemmatize the rest
    word_tokens = [lemmatizer.lemmatize(w) for w in word_tokens if w not in stop_words]

    cleaned_review = " ".join(word_tokens)

    return cleaned_review
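Step 3 of the list mentions removing shorter words, but `clean_reviews` as written only filters stop words. A minimal sketch of how a length filter could be added; the tiny stop-word set and the threshold `MIN_TOKEN_LEN` are assumptions for illustration, not values from this notebook:

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "is", "a", "of", "and", "it", "to"}
MIN_TOKEN_LEN = 3  # assumed threshold, not taken from the notebook

def drop_short_and_stop_words(text):
    """Keep letters only, lower-case, then drop stop words and very short tokens."""
    tokens = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return [t for t in tokens if t not in TOY_STOP_WORDS and len(t) >= MIN_TOKEN_LEN]

tokens = drop_short_and_stop_words("It is a tiny mouthful of heaven, so go try it!")
print(tokens)
```

Whether such a filter helps is an empirical question: short words like "no" or "ok" can carry sentiment, so it may be worth comparing results with and without it.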

Note that pre-processing all the reviews takes far too much time, so I will take only 100K reviews. To balance the classes I have taken an equal number of instances (50,000) of each sentiment.

In [69]:
len(df)

393675

In [70]:
pos_df = df.loc[df.sentiment==1, :][:50000]
neg_df = df.loc[df.sentiment==0, :][:50000]
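The slices above take the first 50,000 rows of each class in file order, which is not a random sample. A sketch of a seeded random alternative on a toy frame, assuming pandas' `DataFrame.sample`; the toy data and `n_per_class` are illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
})

n_per_class = 3  # the notebook uses 50000
# random_state makes the draw reproducible from run to run
pos = toy[toy.sentiment == 1].sample(n=n_per_class, random_state=42)
neg = toy[toy.sentiment == 0].sample(n=n_per_class, random_state=42)
balanced = pd.concat([pos, neg], ignore_index=True)

print(balanced["sentiment"].value_counts().to_dict())
```

This is only a suggestion; the results reported in this notebook come from the first-rows slicing shown in the cell above.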

In [71]:
pos_df.head()

Unnamed: 0,review,sentiment
0,I have bought several of the Vitality canned d...,1
2,This is a confection that has been around a fe...,1
4,Great taffy at a great price. There was a wid...,1
5,I got a wild hair for taffy and ordered this f...,1
6,This saltwater taffy had great flavors and was...,1


In [72]:
neg_df.head()

Unnamed: 0,review,sentiment
1,Product arrived labeled as Jumbo Salted Peanut...,0
3,If you are looking for the secret ingredient i...,0
12,My cats have been happily eating Felidae Plati...,0
16,I love eating them and they are good for watch...,0
26,"The candy is just red , No flavor . Just plan...",0


We can now combine the reviews of each sentiment and shuffle them so that their order carries no information.

In [73]:
# combining
df = pd.concat([pos_df, neg_df], ignore_index=True)

In [74]:
print(df.shape)
df.head()

(100000, 2)


Unnamed: 0,review,sentiment
0,I have bought several of the Vitality canned d...,1
1,This is a confection that has been around a fe...,1
2,Great taffy at a great price. There was a wid...,1
3,I got a wild hair for taffy and ordered this f...,1
4,This saltwater taffy had great flavors and was...,1


In [75]:
# shuffling rows
df = df.sample(frac=1).reset_index(drop=True)
print(df.shape)
df.head()

(100000, 2)


Unnamed: 0,review,sentiment
0,I really wish Amazon would list ingredients fo...,0
1,Green Mountain Coffee - Dark Magic & Double Bl...,1
2,I admit to being one of those annoying self-co...,1
3,I am very disappointed that Gerber has added D...,0
4,taste just like pumpkin pie. not too sweet.<br...,1


### CREATING GOOGLE WORD2VEC WORD EMBEDDINGS IN GENSIM

In this section I create the word embeddings in Gensim. I had planned to use pre-trained word embeddings, such as the Google Word2Vec vectors trained on the Google News corpus or the well-known Stanford GloVe embeddings. But as soon as I load those embeddings through Gensim, the runtime dies and the kernel crashes, perhaps because the model contains about 3 million (30 lakh) words and exceeds the RAM available on Google Colab.

Because of this, for now I have created the embeddings by training on my own corpus.

In [76]:
# import gensim
# # load Google's pre-trained Word2Vec model.
# pre_w2v_model = gensim.models.KeyedVectors.load_word2vec_format(r'drive/Colab Notebooks/amazon food reviews/GoogleNews-vectors-negative300.bin', binary=True) 

First we need to break our data into sentences, which is the input format the Word2Vec class in Gensim expects. For this I have used the Punkt English tokenizer from NLTK.

In [77]:
df.head()

Unnamed: 0,review,sentiment
0,I really wish Amazon would list ingredients fo...,0
1,Green Mountain Coffee - Dark Magic & Double Bl...,1
2,I admit to being one of those annoying self-co...,1
3,I am very disappointed that Gerber has added D...,0
4,taste just like pumpkin pie. not too sweet.<br...,1


In [78]:
from tqdm import tqdm

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = []
n_sents = 0  # running sentence count (named n_sents so the built-in `sum` is not shadowed)

for review in tqdm(df['review']):
    sents = tokenizer.tokenize(review.strip())
    n_sents += len(sents)
    for sent in sents:
        cleaned_sent = clean_reviews(sent)  # text preprocessing
        sentences.append(cleaned_sent.split())  # could also use word_tokenize here
print("sum: ", n_sents)
print(len(sentences))

100%|██████████| 100000/100000 [02:33<00:00, 650.00it/s]

sum:  512639
512639





Now let us print some sentences just to check that they are in the correct format.

In [79]:
# trying to print few sentences
for te in sentences[:5]:
    print(te)

['really', 'wish', 'amazon', 'would', 'list', 'ingredient', 'food', 'product', 'sell', 'whether', 'human', 'animal', 'consumption']
['reason', 'assumed', 'greenies', 'natural', 'completely', 'fault', 'looking', 'ingredient', 'another', 'site', 'buying', 'yeah', 'corn', 'based', 'thus', 'different', 'junk', 'get', 'supermarket']
['oh', 'except', 'cost']
['buy']
['sure', 'cat', 'like', 'seems', 'like', 'treat', 'give', 'plus', 'simply', 'liking', 'make', 'good', 'healthy', 'much', 'like', 'inhaling', 'box', 'cheez', 'deliciously', 'naughty', 'list', 'ingredient', 'salmon', 'greenies', 'anyone', 'care', 'chicken', 'meal', 'ground', 'brewer', 'rice', 'ground', 'wheat', 'corn', 'gluten', 'meal', 'poultry', 'fat', 'preserved', 'mixed', 'tocopherol', 'sprayed', 'dried', 'hydrolyzed', 'chicken', 'protein', 'concentrate', 'oat', 'fiber', 'salmon', 'meal', 'natural', 'chicken', 'flavor', 'vegetable', 'oil', 'preserved', 'mixed', 'tocopherol', 'natural', 'poultry', 'fish', 'flavor', 'sodium', 'gl

Now actually creating the word2vec embeddings.

In [80]:
import gensim
# written against gensim 3.x; in gensim 4+ the `size` argument is named `vector_size`
w2v_model = gensim.models.Word2Vec(sentences=sentences, size=300, window=10, min_count=1)

#### Parameters:

- **sentences :** The tokenized sentences obtained above.
- **size :** The dimension of the vector used to represent each word.
- **window :** The number of words on either side of a word that are treated as its context.
- **min_count :** The minimum number of times a word must appear for its embedding to be learnt.
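To illustrate what `window` means, the sketch below lists the (center, context) pairs produced for one short sentence. It is a simplified picture written for this explanation, not Gensim's implementation; the actual training adds refinements that are not shown here.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs, looking up to `window` words to each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = context_pairs(["oh", "except", "cost"], window=1)
print(pairs)
```

With `window=10`, as used above, most of the short cleaned sentences in this corpus fall entirely inside a single context window.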


In [81]:
# the constructor above already trained on `sentences`; this continues training for 10 more epochs
w2v_model.train(sentences, epochs=10, total_examples=len(sentences))

(38408827, 41193730)

Now we can try some things with the word2vec embeddings.

In [82]:
# embedding of a particular word.
w2v_model.wv.get_vector('like')

array([-5.83688974e-01,  5.70252955e-01,  1.79976702e+00, -1.00749405e-02,
       -9.56964314e-01, -5.00047088e-01, -6.18228137e-01, -4.70298618e-01,
       -1.31175053e+00,  2.45180416e+00,  5.45927644e-01,  6.46868765e-01,
       -4.49483216e-01, -1.03179228e+00,  1.57077420e+00,  1.43029165e+00,
        1.00573264e-01,  6.64683223e-01, -6.85839057e-01, -1.96350098e-01,
        6.33605480e-01,  6.77996725e-02, -6.09865189e-01, -1.07628977e+00,
        4.55098860e-02,  1.51399195e-01,  1.00566551e-01,  4.40805480e-02,
        9.97650683e-01, -9.86560509e-02, -4.15108979e-01, -1.03947461e-01,
       -1.05583215e+00,  7.63348401e-01, -6.99360907e-01, -1.33468226e-01,
        2.30404034e-01, -6.94058061e-01, -8.50645304e-01, -8.54234457e-01,
       -2.47394174e-01,  1.05335310e-01, -1.21529090e+00,  3.91671667e-03,
       -6.55919671e-01,  1.52958243e-03, -1.53237998e+00, -7.91378438e-01,
       -1.38934505e+00,  9.80016232e-01, -1.60336435e+00, -9.32592034e-01,
        3.62756640e-01,  

In [83]:
# total number of extracted words.
vocab = w2v_model.wv.vocab  # gensim 3.x attribute; gensim 4+ exposes `wv.key_to_index` instead
print("The total number of words are : ",len(vocab))

The total number of words are :  56379


In [84]:
# words most similar to a given word.
w2v_model.wv.most_similar('like')

[('reminded', 0.49845317006111145),
 ('weird', 0.4922550618648529),
 ('alright', 0.48545411229133606),
 ('strange', 0.46986639499664307),
 ('reminds', 0.45902395248413086),
 ('akin', 0.44934549927711487),
 ('funny', 0.4454163908958435),
 ('reminiscent', 0.42602288722991943),
 ('gross', 0.42208850383758545),
 ('remind', 0.41538625955581665)]

In [85]:
# similarity between two words
w2v_model.wv.similarity('good','like')

0.3864205
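The similarity score above is the cosine similarity between the two word vectors. A minimal NumPy sketch of that formula on toy vectors (the vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

same = cosine_similarity(a, a)        # identical direction
diagonal = cosine_similarity(a, b)    # 45 degrees apart
orthogonal = cosine_similarity(a, c)  # 90 degrees apart
print(same, diagonal, orthogonal)
```

Values close to 1 mean the vectors point in nearly the same direction, and values near 0 mean they are unrelated in this sense.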

Now we create a dictionary mapping each word in the vocab to its embedding. It will be used when creating the embedding matrix (for feeding to the Keras embedding layer).

In [86]:
print("The no of words :",len(vocab))
# print(vocab)

The no of words : 56379


In [87]:
# print(vocab)
vocab = list(vocab.keys())

In [88]:
word_vec_dict = {}
for word in vocab:
    word_vec_dict[word] = w2v_model.wv.get_vector(word)
print("The no of key-value pairs : ", len(word_vec_dict)) # should come equal to vocab size

The no of key-value pairs :  56379


In [89]:
# # just check
# for word in vocab[:5]:
#   print(word_vec_dict[word])

### PREPARING THE DATA FOR KERAS EMBEDDING LAYER.

Now we have obtained the w2v embeddings. But there are a couple of steps required by the Keras embedding layer before we can move on.

**Also note that since the w2v embeddings have now been built, we can preprocess our review column using the function we saw above.**

In [90]:
# cleaning reviews.
df['clean_review'] = df['review'].apply(clean_reviews)

We need to find the maximum length of any document (a review, in our case). We will pad all reviews to this same length, which is required by the Keras embedding layer. Do check [this](https://www.kaggle.com/rajmehra03/a-detailed-explanation-of-keras-embedding-layer) kernel on Kaggle for a wonderful explanation of the Keras embedding layer.

In [92]:
df.head()

Unnamed: 0,review,sentiment,clean_review
0,I really wish Amazon would list ingredients fo...,0,really wish amazon would list ingredient food ...
1,Green Mountain Coffee - Dark Magic & Double Bl...,1,green mountain coffee dark magic double black ...
2,I admit to being one of those annoying self-co...,1,admit one annoying self confessed coffee snob ...
3,I am very disappointed that Gerber has added D...,0,disappointed gerber added dha green bean son u...
4,taste just like pumpkin pie. not too sweet.<br...,1,taste like pumpkin pie sweet expensive placing...


In [93]:
# number of unique words = 56379
# now since we will have to pad we need to find the maximum length of any document.

maxi = -1

for i, rev in enumerate(df['clean_review']):
    tokens = rev.split()
    if (len(tokens) > maxi):
        maxi = len(tokens)
print(maxi)

1564
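Padding every review to the length of the single longest one can waste a lot of memory if most reviews are much shorter. As a possible alternative, not used in this notebook, the sketch below picks the padded length from a percentile of the length distribution. The toy `lengths` array stands in for the real token counts, and the 90th percentile is an arbitrary choice:

```python
import numpy as np

# Toy token counts standing in for [len(rev.split()) for rev in df['clean_review']]
lengths = np.array([12, 30, 45, 8, 60, 25, 1564, 40, 33, 18])

longest = int(lengths.max())
p90 = int(np.percentile(lengths, 90))          # candidate padded length
covered = float((lengths <= p90).mean())       # share of reviews that fit without truncation
print(longest, p90, covered)
```

Reviews longer than the chosen length would then be truncated, so this trades a little information for a much smaller input tensor.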


#### Now we integer encode the words in the reviews using Keras tokenizer. 

**Note that there are two important variables: vocab_size, the total number of unique words (plus one for the reserved padding index 0), and max_rev_len, the length of every document after padding. Both of these are required by the Keras embedding layer.**

In [95]:
tok = Tokenizer()
tok.fit_on_texts(df['clean_review'])
vocab_size = len(tok.word_index) + 1
encd_rev = tok.texts_to_sequences(df['clean_review'])

In [96]:
max_rev_len = 1565  # padded length: one more than the longest review (1564 tokens)
vocab_size = len(tok.word_index) + 1  # total no of words (+1 because index 0 is reserved for padding)
embed_dim = 300  # embedding dimension as chosen in the word2vec constructor

In [97]:
# now padding to have a maximum length of 1565
pad_rev = pad_sequences(encd_rev, maxlen=max_rev_len, padding='post')
pad_rev.shape  # note that we had 100k reviews and we have padded each review to a length of 1565 tokens.

(100000, 1565)
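For intuition, here is a pure-Python sketch of what `padding='post'` does to a single sequence, together with Keras' default `truncating='pre'` for sequences that are too long. The helper `pad_post` is written for this explanation and is not part of Keras:

```python
def pad_post(seq, maxlen, value=0):
    """Mimic padding='post' with the default truncating='pre' for one sequence."""
    seq = list(seq)[-maxlen:]                    # too long: keep the LAST maxlen ids
    return seq + [value] * (maxlen - len(seq))   # too short: padding goes at the end

short = pad_post([4, 9, 2], maxlen=5)
long_ = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short, long_)
```

In this notebook no review exceeds 1565 tokens, so nothing is truncated; only the trailing zeros are added.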

### CREATING THE EMBEDDING MATRIX

#### Now we need to pass the w2v word embeddings to the embedding layer in Keras. For this we will create the embedding matrix and pass it as the 'embeddings_initializer' parameter to the layer.

**The embedding matrix will have dimensions (vocab_size, embed_dim), where the word_index of each word from the Keras tokenizer is its row index in the matrix and the corresponding row is its w2v vector ;)**

**Note that there may be words that are not present in the embeddings learnt by the w2v model. The embedding matrix rows corresponding to those words will be vectors of all zeros.**

**If you are wondering why a word would be missing: here we have trained on our own corpus, so almost every word has a vector. With pre-trained embeddings, however, some words specific to our dataset may be absent, and in those cases we can use a fixed vector of zeros for all such words. Words can also be missing if you have filtered out rare words by setting min_count in the w2v constructor.**

In [99]:
# now creating the embedding matrix
embed_matrix = np.zeros(shape=(vocab_size, embed_dim))
for word, i in tok.word_index.items():
    embed_vector = word_vec_dict.get(word)
    if embed_vector is not None:  # word is in the vocabulary learned by the w2v model
        embed_matrix[i] = embed_vector
    # if the word is not found, the corresponding row of embed_matrix stays all zeros.
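It can be useful to check how many tokenizer words actually received a vector. A small sketch on toy data; `word_index` and `vectors` are hypothetical stand-ins for `tok.word_index` and `word_vec_dict`:

```python
import numpy as np

# Toy stand-ins for tok.word_index and word_vec_dict (hypothetical values).
word_index = {"taffy": 1, "great": 2, "zzzunseen": 3}
vectors = {"taffy": np.array([0.1, 0.2]), "great": np.array([0.3, -0.4])}

embed_dim = 2
matrix = np.zeros((len(word_index) + 1, embed_dim))
for word, i in word_index.items():
    vec = vectors.get(word)
    if vec is not None:
        matrix[i] = vec

# Row 0 is reserved for padding, so only rows of real words are inspected.
missing = [w for w, i in word_index.items() if not matrix[i].any()]
coverage = 1 - len(missing) / len(word_index)
print(missing, round(coverage, 3))
```

A word whose learnt vector happened to be exactly zero would also be reported as missing, but that is very unlikely in practice.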

In [100]:
# checking
print(embed_matrix[1])

[-5.83688974e-01  5.70252955e-01  1.79976702e+00 -1.00749405e-02
 -9.56964314e-01 -5.00047088e-01 -6.18228137e-01 -4.70298618e-01
 -1.31175053e+00  2.45180416e+00  5.45927644e-01  6.46868765e-01
 -4.49483216e-01 -1.03179228e+00  1.57077420e+00  1.43029165e+00
  1.00573264e-01  6.64683223e-01 -6.85839057e-01 -1.96350098e-01
  6.33605480e-01  6.77996725e-02 -6.09865189e-01 -1.07628977e+00
  4.55098860e-02  1.51399195e-01  1.00566551e-01  4.40805480e-02
  9.97650683e-01 -9.86560509e-02 -4.15108979e-01 -1.03947461e-01
 -1.05583215e+00  7.63348401e-01 -6.99360907e-01 -1.33468226e-01
  2.30404034e-01 -6.94058061e-01 -8.50645304e-01 -8.54234457e-01
 -2.47394174e-01  1.05335310e-01 -1.21529090e+00  3.91671667e-03
 -6.55919671e-01  1.52958243e-03 -1.53237998e+00 -7.91378438e-01
 -1.38934505e+00  9.80016232e-01 -1.60336435e+00 -9.32592034e-01
  3.62756640e-01  5.22699773e-01  8.50506201e-02  3.11784893e-02
  1.34979284e+00 -7.53154933e-01 -5.99547684e-01  2.65056968e-01
 -2.19297230e-01  2.86805

### PREPARING TRAIN AND VALIDATION SETS.

In [101]:
# prepare the train and test sets first
Y = keras.utils.to_categorical(df['sentiment'])  # one hot target as required by NN.
x_train, x_test, y_train, y_test = train_test_split(pad_rev, Y, test_size=0.20, random_state=42)
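For readers unfamiliar with one-hot targets, this NumPy sketch shows the kind of array that the one-hot encoding step produces for binary labels. It is an equivalent illustration on toy labels, not the Keras implementation:

```python
import numpy as np

labels = np.array([0, 1, 1, 0])

# Row k of the 2x2 identity matrix is the one-hot vector for class k.
one_hot = np.eye(2)[labels]
print(one_hot)

# argmax along the class axis recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

Each row has a 1 in the column of its class, which is why the final layer of the network below has two output units.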

### BUILDING A MODEL AND FINALLY PERFORMING TEXT CLASSIFICATION

Having done all the prerequisites, we finally move on to building the model in Keras.

**Note that I have commented out the LSTM layer, as including it causes the training loss to get stuck at a value of about 0.6932. I don't know why ;(.**

**In case someone knows please comment below.**
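One observation that may help with the diagnosis: 0.693 is ln 2, which is exactly the binary cross-entropy of a model that outputs 0.5 for every example. In other words, the network is most likely predicting "no idea" regardless of the input. The sketch below just verifies the arithmetic:

```python
import math

def binary_cross_entropy(y_true, p):
    """Binary cross-entropy for one example with true label y_true and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

loss_at_half = binary_cross_entropy(1, 0.5)
print(round(loss_at_half, 4))
```

A plausible but untested explanation (an assumption on my part, not something verified in this notebook) is that with `padding='post'` and a length of 1565 the LSTM reads long runs of padding zeros after the real words of each review. Things that may be worth trying are a much shorter `max_rev_len`, `padding='pre'`, or masking the padding index in the embedding layer.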

In [103]:
from keras.initializers import Constant
from keras.layers import ReLU
from keras.layers import Dropout

model = Sequential()
model.add(Embedding(input_dim=vocab_size,
                   output_dim=embed_dim,
                   input_length=max_rev_len,
                  embeddings_initializer=Constant(embed_matrix)))

# model.add(CuDNNLSTM(64,return_sequences=False)) # loss gets stuck at about 0.6932
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.50))
# model.add(Dense(16,activation='relu'))
# model.add(Dropout(0.20))
model.add(Dense(2, activation='sigmoid'))  # sigmoid for bin. classification

Let us now print a summary of the model.

In [104]:
model.summary()
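Since the summary output is not shown here, the sizes of the dense layers can be worked out by hand from the values defined earlier. The embedding layer's parameter count is `vocab_size * embed_dim`, which depends on a `vocab_size` value not printed in this notebook, so it is left out of this sketch:

```python
max_rev_len = 1565   # padded review length
embed_dim = 300      # embedding dimension
hidden_units = 16    # Dense(16)
n_classes = 2        # Dense(2)

flatten_width = max_rev_len * embed_dim                       # values per review after Flatten
dense_1_params = flatten_width * hidden_units + hidden_units  # weights + biases
dense_2_params = hidden_units * n_classes + n_classes         # weights + biases
print(flatten_width, dense_1_params, dense_2_params)
```

Almost all of the trainable weights outside the embedding sit in the first dense layer, a direct consequence of flattening 1565 positions of 300 dimensions each.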

In [105]:
# compile the model
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3), loss='binary_crossentropy', metrics=['accuracy'])

In [106]:
# specify batch size and epochs for training.
epochs=5
batch_size = 64

In [107]:
# fitting the model.
model.fit(x_train, 
          y_train, 
          epochs=epochs, 
          batch_size=batch_size,
          validation_data=(x_test, y_test)
         )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f0cd05fc090>

#### Note that the loss as well as the val_loss is still decreasing. You can train for more epochs, but I am not so patient ;)

**The final accuracy after 5 epochs is about 84% which is pretty decent.**

### FURTHER IDEAS :

1) ProductId and UserId can be used to track the general ratings of a given product and also the review pattern of a particular user, e.g. whether he or she is strict in reviewing or not.

2) The Helpfulness features may tell us about the product. This is because the greater the number of people reacting to a review, the stronger or more critical it is expected to be.

3) The Summary column can also give a hint.

4) One can also try pre-trained embeddings like GloVe word vectors etc.

5) Lastly, tuning the network hyperparameters is always an option ;).
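As a starting point for idea 2, the two helpfulness columns could be combined into a single ratio. A minimal sketch, assuming `HelpfulnessNumerator` counts the helpful votes and `HelpfulnessDenominator` counts all votes; the helper name is hypothetical:

```python
def helpfulness_ratio(numerator, denominator):
    """Fraction of voters who found a review helpful; None when nobody voted."""
    if denominator <= 0:
        return None
    return numerator / denominator

all_helpful = helpfulness_ratio(3, 3)
half_helpful = helpfulness_ratio(1, 2)
no_votes = helpfulness_ratio(0, 0)
print(all_helpful, half_helpful, no_votes)
```

Many rows in this dataset have a denominator of 0, so how to treat the "no votes" case would need to be decided before using the ratio as a feature.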

 