## Text Classification on the Amazon Fine Food Dataset with Google Word2Vec Word Embeddings in Gensim and Training Using an LSTM in Keras

### IMPORTING THE MODULES

In [1]:
# Ignore the warnings
import warnings
warnings.filterwarnings('ignore')

# data visualization and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
#configure
# sets matplotlib to inline and displays graphs below the corresponding cell.
# matplotlib inline
style.use('fivethirtyeight')
sns.set(style='whitegrid', color_codes=True)

# nltk
import nltk

#preprocessing
from nltk.corpus import stopwords  #stopwords
from nltk import word_tokenize, sent_tokenize  # tokenizing
from nltk.stem import PorterStemmer, LancasterStemmer  # using the Porter Stemmer and Lancaster Stemmer and others
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer  # lemmatizer from WordNet

# for part-of-speech tagging
from nltk import pos_tag

# for named entity recognition (NER)
from nltk import ne_chunk

# vectorizers for creating the document-term-matrix (DTM)
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

# BeautifulSoup library
from bs4 import BeautifulSoup

import re  # regex

#model_selection
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

#evaluation
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.metrics import classification_report
from mlxtend.plotting import plot_confusion_matrix


# preprocessing (scikit-learn)
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer  # 'Imputer' is deprecated from 'sklearn.preprocessing'

#classification.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB

#stop-words
stop_words = set(nltk.corpus.stopwords.words('english'))

#keras
import keras
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding, Input, LSTM  # cannot import name 'CuDNNLSTM' from 'keras.layers'

from keras.models import Model
from keras.preprocessing.text import text_to_word_sequence

#gensim w2v
#word2vec
from gensim.models import Word2Vec

### LOADING THE DATASET

In [2]:
rev_frame = pd.read_csv(r'./input/Reviews.csv')
df = rev_frame.copy()

In [3]:
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568438 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB


In [5]:
# sum the numeric columns per (user, product) pair
df.groupby(['UserId', 'ProductId']).sum(numeric_only=True)

UserId,ProductId,Id,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time
#oc-R103C0QSV1DF5E,B006Q820X0,136323,1,2,5,1343088000
#oc-R109MU5OBBZ59U,B008I1XPKA,516062,0,1,5,1350086400
#oc-R10LFEMQEW6QGZ,B008I1XPKA,516079,0,1,5,1345939200
#oc-R10LT57ZGIB140,B0026LJ3EA,378693,0,0,3,1310601600
#oc-R10UA029WVWIUI,B006Q820X0,136545,0,0,1,1342483200
...,...,...,...,...,...,...
AZZV9PDNMCOZW,B003SNX4YA,422838,0,0,4,1329436800
AZZVNIMTTMJH6,B000FI4O90,190698,0,0,5,1268179200
AZZY649VYAHQS,B000N9VLJ2,222781,1,1,5,1309737600
AZZYCJOJLUDYR,B001SB22UG,131469,0,0,5,1337472000


In [6]:
print(df['Time'].min())
print(df['Time'].max())

939340800
1351209600


In [7]:
len(df['UserId'].unique())

256059

#### A brief description of the dataset from the Overview tab on Kaggle:

Data includes:
- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews
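
The date range in this description can be checked against the minimum and maximum of the 'Time' column printed earlier, which are Unix epoch seconds. Below is a minimal sketch using only the standard library; the helper name `to_utc_date` is my own and not part of the notebook.

```python
from datetime import datetime, timezone

# Minimum and maximum of df['Time'] as printed earlier (Unix epoch seconds).
time_min, time_max = 939340800, 1351209600


def to_utc_date(epoch_seconds):
    """Convert Unix epoch seconds to a UTC calendar date."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()


first_review = to_utc_date(time_min)
last_review = to_utc_date(time_max)
print(first_review, last_review)  # 1999-10-08 2012-10-26
```

On the full frame, `pd.to_datetime(df['Time'], unit='s')` should give the same conversion for the whole column.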

### DATA CLEANING AND PRE-PROCESSING

Since I am concerned here with **sentiment analysis**, I shall keep only the 'Text' and 'Score' columns.

In [7]:
df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [8]:
df = df[['Text', 'Score']]

In [9]:
df.rename({'Text': 'review', 'Score': 'rating'}, axis=1, inplace=True)

In [10]:
print(df.shape)
df.head()

(568454, 2)


Unnamed: 0,review,rating
0,I have bought several of the Vitality canned d...,5
1,Product arrived labeled as Jumbo Salted Peanut...,1
2,This is a confection that has been around a fe...,4
3,If you are looking for the secret ingredient i...,2
4,Great taffy at a great price. There was a wid...,5


Let us now check whether either column has any null values.

In [11]:
# check for null values
print(df['rating'].isnull().sum())
df['review'].isnull().sum()  # no null values.

0


0

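Note that there is no point in keeping several copies of the same review text with the same score, so for every set of duplicates I will keep only one instance and drop the rest. The call below compares both 'review' and 'rating', so an identical text that appears with different ratings is kept once per rating.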

In [12]:
# remove duplicates: for every duplicated (review, rating) pair keep only the first row.
df.drop_duplicates(subset=['review', 'rating'], keep='first', inplace=True)

In [13]:
# now check the shape. note that the row count has dropped, which shows that we did have duplicate rows.
print(df.shape)
df.head()

(393675, 2)


Unnamed: 0,review,rating
0,I have bought several of the Vitality canned d...,5
1,Product arrived labeled as Jumbo Salted Peanut...,1
2,This is a confection that has been around a fe...,4
3,If you are looking for the secret ingredient i...,2
4,Great taffy at a great price. There was a wid...,5
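
To see what the `subset` argument does, here is a small sketch on a made-up toy frame (the rows are hypothetical, not taken from the dataset). It compares the call used above with a stricter alternative that keeps one row per review text regardless of rating.

```python
import pandas as pd

# Hypothetical toy data: the same review text appears three times,
# twice with rating 5 and once with rating 1.
toy = pd.DataFrame({
    "review": ["great taffy", "great taffy", "great taffy", "stale chips"],
    "rating": [5, 5, 1, 2],
})

# Same call as in the notebook: only identical (review, rating) pairs collapse.
by_pair = toy.drop_duplicates(subset=["review", "rating"], keep="first")

# Stricter alternative: one row per review text, whatever its rating.
by_text = toy.drop_duplicates(subset=["review"], keep="first")

print(len(by_pair), len(by_text))  # 3 2
```

With the notebook's call, a text that was posted with conflicting ratings survives once per rating; whether that is desirable depends on how much label noise you are willing to accept.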


Let us now print some reviews and see if we can get insights from the text.

In [16]:
# printing some reviews to see insights.
for review in df['review'][:5]:
    print(review + "\n\n")

I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.


Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".


This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.


If you are looking for the se

Not much stands out, except that the text contains some stray words and punctuation that we have to remove before moving ahead.

**But note that if I remove the punctuation now, it will be difficult to break the reviews into sentences, which the Word2Vec constructor in Gensim requires. So we will first break the text into sentences and then clean those sentences.**
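
A minimal sketch of why the order matters. It assumes a naive regex splitter in place of NLTK's `sent_tokenize`, and the sample review and both helper names are hypothetical. Gensim's Word2Vec expects an iterable of token lists, one list per sentence, which is the shape produced here.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def clean_sentence(sentence):
    """Keep letters only, lower-case, and return a list of tokens."""
    return re.sub(r"[^a-zA-Z]", " ", sentence).lower().split()


review = "Great taffy at a great price. Delivery was quick! Would I buy it again? Yes."

# Split first, then clean: one token list per sentence.
sentences = [clean_sentence(s) for s in split_sentences(review)]
print(sentences)

# Clean first, then split: the sentence boundaries are already gone.
cleaned_first = split_sentences(" ".join(clean_sentence(review)))
print(len(cleaned_first))  # 1
```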

Since we are doing sentiment analysis, I will convert the values in the score column to a sentiment label: 0 (negative) for ratings of 3 or less, and 1 (positive) otherwise.

In [18]:
def mark_sentiment(rating):
    if (rating<=3):
        return 0
    else:
        return 1

In [19]:
df['sentiment'] = df['rating'].apply(mark_sentiment)

In [21]:
df.drop(['rating'], axis=1, inplace=True)

In [22]:
df.head()

Unnamed: 0,review,sentiment
0,I have bought several of the Vitality canned d...,1
1,Product arrived labeled as Jumbo Salted Peanut...,0
2,This is a confection that has been around a fe...,1
3,If you are looking for the secret ingredient i...,0
4,Great taffy at a great price. There was a wid...,1


In [23]:
df['sentiment'].value_counts()

1    306819
0     86856
Name: sentiment, dtype: int64

As you can see, the sentiment column now holds the sentiment of the corresponding product review.
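
As a quick check of the labelling rule, the sketch below restates `mark_sentiment` in compact form and lists the label for every possible rating. As written, the function assigns 0 to ratings 1 to 3, so neutral 3-star reviews count as negative. The class share is computed from the `value_counts()` output above.

```python
def mark_sentiment(rating):
    """Same rule as the notebook: ratings of 3 or less are negative (0)."""
    return 0 if rating <= 3 else 1


mapping = {rating: mark_sentiment(rating) for rating in range(1, 6)}
print(mapping)  # {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}

# Class counts copied from the value_counts() output above.
pos, neg = 306819, 86856
positive_share = round(pos / (pos + neg), 3)
print(positive_share)  # 0.779
```

Roughly 78% of the de-duplicated reviews are positive, so the classes are clearly imbalanced at this point.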

#### Pre-processing steps:

1) First, **remove punctuation and HTML tags**, if any. HTML tags may be present because the data must have been scraped from the web.

2) **Tokenize** the reviews into tokens or words.

3) Next, **remove the stop words and shorter words**, as they cause noise.

4) **Stem or lemmatize** the words, depending on which works better. Here I have used a lemmatizer.

In [25]:
# function to clean and pre-process the text.
lemmatizer = WordNetLemmatizer()  # built once and reused across calls

def clean_reviews(review):

    # 1. Remove HTML tags
    review_text = BeautifulSoup(review, "lxml").get_text()

    # 2. Keep only letters (regular expression)
    review_text = re.sub(r"[^a-zA-Z]", " ", review_text)

    # 3. Convert to lower case and split into tokens
    word_tokens = review_text.lower().split()

    # 4. Remove stop words (module-level stop_words set) and lemmatize the rest
    word_tokens = [lemmatizer.lemmatize(w) for w in word_tokens if w not in stop_words]

    cleaned_review = " ".join(word_tokens)

    return cleaned_review
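
To make the effect of these steps concrete without needing bs4 or the NLTK data files, here is a dependency-free approximation. It is only a sketch: it strips tags with a regex instead of BeautifulSoup, uses a deliberately tiny hypothetical stop-word list, and skips lemmatization. The sample text is made up.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "is", "a", "and", "this", "it", "i"}


def clean_review_sketch(review):
    """Approximate clean_reviews without bs4/NLTK (no lemmatization)."""
    no_tags = re.sub(r"<[^>]+>", " ", review)                # 1. strip HTML tags
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)        # 2. keep letters only
    tokens = letters_only.lower().split()                    # 3. lower-case and split
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]    # 4. drop stop words
    return " ".join(kept)


sample = "This taffy is GREAT!<br />I ordered 2 bags, and the price was fair."
cleaned = clean_review_sketch(sample)
print(cleaned)  # taffy great ordered bags price was fair
```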

Pre-processing all the reviews takes far too long, so I will use only 100K reviews. To balance the classes, I take an equal number of instances of each sentiment.

In [26]:
len(df)

393675

In [27]:
pos_df = df.loc[df.sentiment==1, :][:50000]
neg_df = df.loc[df.sentiment==0, :][:50000]

In [28]:
pos_df.head()

Unnamed: 0,review,sentiment
0,I have bought several of the Vitality canned d...,1
2,This is a confection that has been around a fe...,1
4,Great taffy at a great price. There was a wid...,1
5,I got a wild hair for taffy and ordered this f...,1
6,This saltwater taffy had great flavors and was...,1


In [29]:
neg_df.head()

Unnamed: 0,review,sentiment
1,Product arrived labeled as Jumbo Salted Peanut...,0
3,If you are looking for the secret ingredient i...,0
12,My cats have been happily eating Felidae Plati...,0
16,I love eating them and they are good for watch...,0
26,"The candy is just red , No flavor . Just plan...",0


We can now combine the reviews of each sentiment and shuffle them so that their order carries no information.

In [30]:
# combining
df = pd.concat([pos_df, neg_df], ignore_index=True)

In [31]:
print(df.shape)
df.head()

(100000, 2)


Unnamed: 0,review,sentiment
0,I have bought several of the Vitality canned d...,1
1,This is a confection that has been around a fe...,1
2,Great taffy at a great price. There was a wid...,1
3,I got a wild hair for taffy and ordered this f...,1
4,This saltwater taffy had great flavors and was...,1


In [32]:
# shuffling rows
df = df.sample(frac=1).reset_index(drop=True)
print(df.shape)
df.head()

(100000, 2)


Unnamed: 0,review,sentiment
0,I've enjoyed Ambassador Organics products (Goo...,1
1,I love this tea! It's the best I've ever had....,1
2,Brewing directions are right on tin so don't k...,1
3,This is a fantastic rice. It cooks perfectly i...,1
4,"If you use this conditioner, buy the shampoo t...",0
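
The cells above take the first 50,000 rows of each class, which depends on the order of the file, and the shuffle uses no seed, so the row order changes on every run. As an alternative (not what this notebook does), the sketch below draws a random, seeded sample per class on a hypothetical toy frame. It assumes pandas 1.1 or later, where `groupby(...).sample` is available.

```python
import pandas as pd

# Hypothetical toy frame with an imbalanced label, standing in for df.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
})

n_per_class = 3  # the notebook uses 50000

# Draw the same number of rows from each class at random, then shuffle.
balanced = (
    toy.groupby("sentiment")
       .sample(n=n_per_class, random_state=42)
       .sample(frac=1, random_state=42)
       .reset_index(drop=True)
)
print(balanced["sentiment"].value_counts().to_dict())
```

Fixing `random_state` makes both the sample and the shuffle repeatable across runs.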


### CREATING GOOGLE WORD2VEC WORD EMBEDDINGS IN GENSIM

In this section I create the word embeddings in Gensim. I had planned to use pre-trained word embeddings such as the Google Word2Vec vectors trained on the Google News corpus or the well-known Stanford GloVe embeddings. But as soon as I load those embeddings through Gensim, the runtime dies and the kernel crashes, perhaps because the model contains 30 lakh (3 million) words and exceeds the RAM available on Google Colab.

Because of this, for now I have created the embeddings by training on my own corpus.
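
A rough back-of-the-envelope estimate supports the memory explanation. The sketch assumes 3 million vectors of 300 dimensions stored as 32-bit floats, which matches the Google News model. It counts only the raw matrix and ignores the vocabulary dictionary and any temporary copies, so real usage would be higher. The helper name `matrix_gib` is my own.

```python
# Assumptions: 3,000,000 vectors ("30 lakh"), 300 dimensions, float32 storage.
VOCAB_SIZE = 3_000_000
DIMENSIONS = 300
BYTES_PER_FLOAT = 4


def matrix_gib(n_vectors, dims=DIMENSIONS, bytes_per_value=BYTES_PER_FLOAT):
    """Size in GiB of a dense n_vectors x dims embedding matrix."""
    return n_vectors * dims * bytes_per_value / 1024**3


full_model = matrix_gib(VOCAB_SIZE)
first_500k = matrix_gib(500_000)
print(f"{full_model:.2f} GiB")  # 3.35 GiB
print(f"{first_500k:.2f} GiB")  # 0.56 GiB
```

Gensim's `KeyedVectors.load_word2vec_format` accepts a `limit` argument that reads only the first N vectors, which may be one way to fit pre-trained vectors into a Colab session; I have not tried it here.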