<a href="https://colab.research.google.com/github/NoerNikmat/deep_learning_for_nlp/blob/main/NLP_Sentiment_Analyst_using_Deep_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **About Dataset**


### Description

**Women's E-Commerce Clothing Reviews on Kaggle**

---

***Link Dataset***: 

https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews


***Context***

Welcome. This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.



***Content***

This dataset includes **23486 rows** and **10 feature variables**. Each row corresponds to a customer review, and includes the variables:

- **Clothing ID**: Integer categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Rating**: Positive ordinal integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
- **Positive Feedback Count**: Positive integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product department.
- **Class Name**: Categorical name of the product class.


***Acknowledgements***

Anonymous but real source


***Inspiration***

Nicapotato, the owner of the Women's E-Commerce Clothing Reviews dataset, looks forward to quality NLP work on it. There are also some great opportunities for feature engineering and multivariate analysis.

***Publication***

[Statistical Analysis on E-Commerce Reviews, with Sentiment Classification using Bidirectional Recurrent Neural Network](https://www.researchgate.net/publication/323545316_Statistical_Analysis_on_E-Commerce_Reviews_with_Sentiment_Classification_using_Bidirectional_Recurrent_Neural_Network)

by [Abien Fred Agarap - Github](https://github.com/AFAgarap/ecommerce-reviews-analysis)



### Metadata

**Usage Information**

---



- License [CC0: Public Domain](https://creativecommons.org/publicdomain/zero/1.0/)
- Visibility **public**


**Maintainers**

---



- Dataset owner [nicapotato](https://www.kaggle.com/nicapotato)

**Updates**

---
    
- Expected update frequency: not specified
- Last updated: 2018-02-04
- Date created: 2018-02-04
- Current version: Version 1

## **Objectives**

**Problem Framing**
* How can the sentiment of reviews in the Women's E-Commerce Clothing Reviews dataset be predicted?


**Ideal Outcome**
* Each review is classified by the model as positive, negative, or neutral.
* Success means sentiment prediction accuracy above 90%.
* Failure means sentiment prediction accuracy is no better than the current heuristics.


**Heuristics**
* Consider reviews by people who bought products in the past. Assume that new items bought by these people will also receive positive, negative, or neutral reviews.
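
The "no better than current heuristics" criterion needs a number to compare against. The notebook does not define one, so the following is only a minimal sketch that assumes a majority-class baseline stands in for the heuristic. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical toy labels (1 = positive, 0 = neutral, -1 = negative).
toy_labels = [1, 1, 1, 1, 1, 1, 1, -1, -1, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

A model would then count as a success only if its test accuracy clearly exceeds this baseline on the real label distribution.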

**Formulation of the problem**
* Comparison of Naive Bayes Model and Support Vector Machine (SVM)

## **References**

- [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course)
- [Predicting Sentiment from Clothing Reviews](https://www.kaggle.com/burhanykiyakoglu/predicting-sentiment-from-clothing-reviews)
- Source from the lecturers 

## **Programming With Python**

### **Data Pre-processing**

**Import Dataset**

    How to use the Kaggle dataset

        * Log in to your account on the Kaggle website.
        * Find the dataset, notebook, and other information you need.
        * Create an API token (.json file) from your profile.
        * Download the API token .json file to your local computer.

Install the Kaggle package to download the dataset into Google Colab

In [1]:
!pip install -q kaggle

Upload Kaggle API key

In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"noer001","key":"<redacted>"}'}

In [3]:
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Download dataset from Kaggle

In [4]:
! kaggle datasets download -d 'nicapotato/womens-ecommerce-clothing-reviews/download'  

Downloading womens-ecommerce-clothing-reviews.zip to /content
  0% 0.00/2.79M [00:00<?, ?B/s]
100% 2.79M/2.79M [00:00<00:00, 45.9MB/s]


In [5]:
!ls

kaggle.json  sample_data  womens-ecommerce-clothing-reviews.zip


In [6]:
!unzip -q womens-ecommerce-clothing-reviews.zip

In [7]:
!ls

 kaggle.json  'Womens Clothing E-Commerce Reviews.csv'
 sample_data   womens-ecommerce-clothing-reviews.zip


**Import Libraries**

In [8]:
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from nltk.sentiment.vader import SentimentIntensityAnalyzer



In [9]:
woman = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
woman

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses


Working with Text

In [10]:
pd.set_option('max_colwidth', 500)
woman[["Title","Review Text", "Rating"]].sample(2)

Unnamed: 0,Title,Review Text,Rating
2668,Different colors are sized differently,"I tried these shorts on in 5 colors and found a wide variety of sizing. in the dark grey and taupe i needed to size down one from my usual 28 to a size 27. in the sapphire, lavender and red my regular size 28 fit perfectly. i ended up purchasing the dark grey and lavender with a discount. i gave 4 stars for the variation in sizing but once i got the correct fit the shorts have not stretched out on me throughout the day in either color. the length is perfect, short enough to not look frumpy, but",4
3989,Love - soft and feminine,"I decided to order this on a whim during the promotion on tops, why? because it looked decent and i like the other tops from the same brand. i am so glad i did, it is even better in person.\r\n\r\ncolor: nice ivory with subtle speckle in it. very neutral in color, but also on the warmer side. the fabric is a little sheer, you could definitely see my bright colored bra, but a nude one will do the trick.\r\n\r\nthe length was a little longer, and the sleeves were a bit long too (doesn't come i...",5


Text Cleaning

In [11]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [12]:
ps = PorterStemmer()
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))

In [13]:
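def preprocessing(data):
    txt = data.str.lower().str.cat(sep=' ')  # lower-case every row and join all rows into one string
    words = tokenizer.tokenize(txt)  # split the string into word tokens
    words = [w for w in words if w not in stop_words]  # drop English stop words
    # words = [ps.stem(w) for w in words]  # optional Porter stemming (disabled)
    return words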

In [14]:
woman['tokenized'] = woman["Review Text"].astype(str).str.lower()  # Lower-case the text
woman['tokenized'] = woman['tokenized'].apply(tokenizer.tokenize)  # Tokenize each row
woman['tokenized'] = woman['tokenized'].apply(lambda x: [w for w in x if w not in stop_words])  # Remove stop words from each row
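
To make the three cleaning steps concrete, here is a small standalone sketch applied to a single sentence. It uses only the standard library, and the stop-word set is a tiny hypothetical stand-in for NLTK's English list, so its output will not match the notebook's exactly on other inputs.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook
# uses NLTK's much larger English list.
TOY_STOP_WORDS = {"this", "is", "it", "to", "the", "and", "so", "i", "s"}


def clean_review(text, stop_words=TOY_STOP_WORDS):
    """Lower-case, split on runs of word characters, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


print(clean_review("Love this dress! It's sooo pretty."))
```

Because the pattern `\w+` splits on apostrophes, "it's" becomes the two tokens "it" and "s", which is why single letters such as "s" belong in the stop-word list.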

Data Pre-processing for Sentiment Analysis

In [15]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [16]:
# Pre-Processing
SIA = SentimentIntensityAnalyzer()
woman['Review Text']= woman['Review Text'].astype(str)

# Applying Model, Variable Creation
woman['Polarity Score'] = woman['Review Text'].apply(lambda x: SIA.polarity_scores(x)['compound'])
woman['Neutral Score'] = woman['Review Text'].apply(lambda x: SIA.polarity_scores(x)['neu'])
woman['Negative Score'] = woman['Review Text'].apply(lambda x: SIA.polarity_scores(x)['neg'])
woman['Positive Score'] = woman['Review Text'].apply(lambda x: SIA.polarity_scores(x)['pos'])

# Converting the compound score (range -1 to 1) to a categorical variable
woman['Sentiment'] = ''
woman.loc[woman['Polarity Score'] > 0, 'Sentiment'] = 'Positive'
woman.loc[woman['Polarity Score'] == 0, 'Sentiment'] = 'Neutral'
woman.loc[woman['Polarity Score'] < 0, 'Sentiment'] = 'Negative'
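
The thresholding rule can be read as a small pure function. The sketch below restates it outside pandas, pairing each sentiment name with the numeric label (1, 0, -1) that the notebook assigns later. The function name is hypothetical, and the negative example score is made up.

```python
def sentiment_from_compound(compound):
    """Map a compound score in [-1, 1] to (sentiment name, numeric label)."""
    if compound > 0:
        return "Positive", 1
    if compound < 0:
        return "Negative", -1
    return "Neutral", 0


# 0.8932 is the compound score of the first review; -0.4215 is a made-up value.
for score in (0.8932, 0.0, -0.4215):
    print(score, sentiment_from_compound(score))
```

Note that only a score of exactly 0 is treated as neutral here, so the neutral class is likely to be small.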

In [17]:
def string_unlist(strlist):
    return " ".join(strlist)

woman["tokenized_unlist"] = woman["tokenized"].apply(string_unlist)
woman.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,tokenized,Polarity Score,Neutral Score,Negative Score,Positive Score,Sentiment,tokenized_unlist
0,0,767,33,,Absolutely wonderful - silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates,"[absolutely, wonderful, silky, sexy, comfortable]",0.8932,0.272,0.0,0.728,Positive,absolutely wonderful silky sexy comfortable
1,1,1080,34,,"Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. i bought a petite and am 5'8"". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.",5,1,4,General,Dresses,Dresses,"[love, dress, sooo, pretty, happened, find, store, glad, bc, never, would, ordered, online, bc, petite, bought, petite, 5, 8, love, length, hits, little, knee, would, definitely, true, midi, someone, truly, petite]",0.9729,0.664,0.0,0.336,Positive,love dress sooo pretty happened find store glad bc never would ordered online bc petite bought petite 5 8 love length hits little knee would definitely true midi someone truly petite
2,2,1077,60,Some major design flaws,"I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - ...",3,0,0,General,Dresses,Dresses,"[high, hopes, dress, really, wanted, work, initially, ordered, petite, small, usual, size, found, outrageously, small, small, fact, could, zip, reordered, petite, medium, ok, overall, top, half, comfortable, fit, nicely, bottom, half, tight, layer, several, somewhat, cheap, net, layers, imo, major, design, flaw, net, layer, sewn, directly, zipper, c]",0.9427,0.792,0.027,0.181,Positive,high hopes dress really wanted work initially ordered petite small usual size found outrageously small small fact could zip reordered petite medium ok overall top half comfortable fit nicely bottom half tight layer several somewhat cheap net layers imo major design flaw net layer sewn directly zipper c
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!",5,1,0,General Petite,Bottoms,Pants,"[love, love, love, jumpsuit, fun, flirty, fabulous, every, time, wear, get, nothing, great, compliments]",0.5727,0.34,0.226,0.434,Positive,love love love jumpsuit fun flirty fabulous every time wear get nothing great compliments
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!,5,1,6,General,Tops,Blouses,"[shirt, flattering, due, adjustable, front, tie, perfect, length, wear, leggings, sleeveless, pairs, well, cardigan, love, shirt]",0.9291,0.7,0.0,0.3,Positive,shirt flattering due adjustable front tie perfect length wear leggings sleeveless pairs well cardigan love shirt


In [18]:
conditions = [
    woman['Sentiment'] == "Positive",
    woman['Sentiment'] == "Negative",
    woman['Sentiment'] == "Neutral"]
choices = [1,-1,0]
woman['label'] = np.select(conditions, choices)
woman.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,tokenized,Polarity Score,Neutral Score,Negative Score,Positive Score,Sentiment,tokenized_unlist,label
0,0,767,33,,Absolutely wonderful - silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates,"[absolutely, wonderful, silky, sexy, comfortable]",0.8932,0.272,0.0,0.728,Positive,absolutely wonderful silky sexy comfortable,1
1,1,1080,34,,"Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. i bought a petite and am 5'8"". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.",5,1,4,General,Dresses,Dresses,"[love, dress, sooo, pretty, happened, find, store, glad, bc, never, would, ordered, online, bc, petite, bought, petite, 5, 8, love, length, hits, little, knee, would, definitely, true, midi, someone, truly, petite]",0.9729,0.664,0.0,0.336,Positive,love dress sooo pretty happened find store glad bc never would ordered online bc petite bought petite 5 8 love length hits little knee would definitely true midi someone truly petite,1
2,2,1077,60,Some major design flaws,"I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - ...",3,0,0,General,Dresses,Dresses,"[high, hopes, dress, really, wanted, work, initially, ordered, petite, small, usual, size, found, outrageously, small, small, fact, could, zip, reordered, petite, medium, ok, overall, top, half, comfortable, fit, nicely, bottom, half, tight, layer, several, somewhat, cheap, net, layers, imo, major, design, flaw, net, layer, sewn, directly, zipper, c]",0.9427,0.792,0.027,0.181,Positive,high hopes dress really wanted work initially ordered petite small usual size found outrageously small small fact could zip reordered petite medium ok overall top half comfortable fit nicely bottom half tight layer several somewhat cheap net layers imo major design flaw net layer sewn directly zipper c,1
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!",5,1,0,General Petite,Bottoms,Pants,"[love, love, love, jumpsuit, fun, flirty, fabulous, every, time, wear, get, nothing, great, compliments]",0.5727,0.34,0.226,0.434,Positive,love love love jumpsuit fun flirty fabulous every time wear get nothing great compliments,1
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!,5,1,6,General,Tops,Blouses,"[shirt, flattering, due, adjustable, front, tie, perfect, length, wear, leggings, sleeveless, pairs, well, cardigan, love, shirt]",0.9291,0.7,0.0,0.3,Positive,shirt flattering due adjustable front tie perfect length wear leggings sleeveless pairs well cardigan love shirt,1


In [19]:
woman.shape

(23486, 19)

In [20]:
#woman.to_csv('woman.csv')

### **Data Preparation**

Import Libraries 

In [21]:
from keras.models import Sequential
from keras.initializers import Constant
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Input, Dropout, LSTM, Activation, Embedding, Bidirectional, Flatten, Dense, SimpleRNN

In [22]:
woman.columns

Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name', 'tokenized', 'Polarity Score',
       'Neutral Score', 'Negative Score', 'Positive Score', 'Sentiment',
       'tokenized_unlist', 'label'],
      dtype='object')

Import GloVe

GloVe (Global Vectors for Word Representation) provides pre-trained word vectors, used here to represent the words contained in the reviews.

Source: [glove.6B.50d](https://www.kaggle.com/watts2/glove6b50dtxt)



In [23]:
! kaggle datasets download -d 'watts2/glove6b50dtxt/download'  

Downloading glove6b50dtxt.zip to /content
 72% 49.0M/67.7M [00:01<00:00, 30.4MB/s]
100% 67.7M/67.7M [00:01<00:00, 60.8MB/s]


In [24]:
!ls

 glove6b50dtxt.zip  'Womens Clothing E-Commerce Reviews.csv'
 kaggle.json	     womens-ecommerce-clothing-reviews.zip
 sample_data


In [25]:
!unzip -q glove6b50dtxt.zip

In [26]:
!ls

 glove.6B.50d.txt    kaggle.json  'Womens Clothing E-Commerce Reviews.csv'
 glove6b50dtxt.zip   sample_data   womens-ecommerce-clothing-reviews.zip


In [27]:
#Word Preprocessing
#creating subset of the dataframe
recc_lstm=woman[['Review Text','label','Recommended IND']]
recc_lstm=recc_lstm.dropna()  #dropping rows with missing values; 'Review Text' was cast to str earlier, so missing reviews appear as the string 'nan' and are kept

In [28]:
#converting the panda series to numpy array
X=recc_lstm['Review Text']
X=np.array(X)
Y=recc_lstm['label']
Y=np.array(Y)

In [29]:
#tokenizing the strings
tokenizer=Tokenizer()
tokenizer.fit_on_texts(X)
sequencer=tokenizer.texts_to_sequences(X)
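
As a rough illustration of what fitting a tokenizer and converting texts to sequences produces, the sketch below builds a frequency-ranked word index by hand. It is a simplified stand-in, not the Keras implementation: it only splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation. All names and texts are hypothetical.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by descending frequency, starting at index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is deliberately left unused so it can serve as padding later.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


toy_texts = ["love this dress", "love this shirt", "love it"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index)
print(toy_index)
print(toy_sequences)
```

The most frequent word receives index 1, so small indices correspond to common words.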

In [30]:
#finding maximum length of a review
maxLen=max(len(seq) for seq in sequencer)

print ('Maximum sequence length:',maxLen)

Maximum sequence length: 116


In [31]:
#creating the word to index and index to word vectors
word_to_index=tokenizer.word_index #dictionary that maps words in the reviews to indices
index_to_word=tokenizer.index_word #dictionary that maps indices back to words

In [32]:
#loading the GloVe embeddings
embeddings_dict = {} #dictionary of words and their corresponding GloVe vector representation
indices=0

with open("/content/glove.6B.50d.txt", 'r', encoding ='utf8') as f:
    for line in f:
        words = line.split()
        word = words[0]
        vector = np.asarray(words[1:], "float32")
        embeddings_dict[word] = vector
        indices+=1
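
The loop above relies on the GloVe text layout: one token per line, followed by its vector components separated by spaces. The sketch below parses two made-up lines in that layout from an in-memory file, using 3 dimensions instead of 50 and plain Python lists instead of NumPy arrays.

```python
import io

# Two made-up lines in the GloVe text layout: a token followed by its
# vector components, separated by spaces (3 dimensions here, not 50).
toy_glove_file = io.StringIO(
    "dress 0.1 -0.2 0.3\n"
    "shirt 0.4 0.5 -0.6\n"
)

toy_embeddings = {}
for line in toy_glove_file:
    token, *components = line.split()
    toy_embeddings[token] = [float(c) for c in components]

print(toy_embeddings["dress"])
```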

In [33]:
#preparing embedding matrix
vocab_size=len(word_to_index)+1 #+1 because Tokenizer indices start at 1; index 0 is reserved for padding
embedding_dim=50 #number of dimensions chosen in the GloVe representation
present=0
absent=0
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_to_index.items():
    #embedding_vector = embeddings_dict[word]
    embedding_vector=  embeddings_dict.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        present+=1
    else:
        absent+=1

print ("Number of words in the matrix",present)
print ("Number of missing words",absent)

Number of words in the matrix 12065
Number of missing words 2783
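
A toy version of the same construction shows where the zero rows come from. The sketch assumes a hypothetical three-word vocabulary and two-dimensional vectors: row 0 stays zero because no word maps to index 0, and the row of a word without a pre-trained vector also stays zero.

```python
import numpy as np

# Hypothetical vocabulary and pre-trained vectors ("sooo" has no vector).
toy_word_index = {"love": 1, "dress": 2, "sooo": 3}
toy_vectors = {"love": np.array([0.1, 0.2]), "dress": np.array([0.3, 0.4])}

toy_dim = 2
toy_matrix = np.zeros((len(toy_word_index) + 1, toy_dim))
for word, idx in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[idx] = vec  # copy the pre-trained vector into row idx

print(toy_matrix)
```

In the notebook, the 2783 missing words start from zero vectors in the same way and can only acquire useful values because the embedding layer is trainable.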


    
Converts the tokenized sequences to a 2D matrix. Each row in the matrix is one review, and each column index corresponds to one word position in the text.
    
Arguments:

* sequencer -- List of lists of word indices, one inner list per review.
* maxLen -- Length of the longest inner list in sequencer.

Returns:
* X_indices -- 2D matrix where each row corresponds to one review and each column index corresponds to a word position in the review; shorter reviews are padded with zeros.


In [34]:
#preprocessing the text to create indices

def sentences_to_indices(sequencer,maxLen):
    X_indices=np.zeros((len(sequencer),maxLen))
    for i in range(len(sequencer)):
        j=0
        for n in sequencer[i]:
            X_indices[i,j]= n
            j+=1
    return X_indices

X_indices=sentences_to_indices(sequencer,maxLen)
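
The effect of this function is easiest to see on a tiny input. The sketch below is an equivalent formulation using slice assignment, with a hypothetical function name. It assumes no sequence is longer than `max_len`, which holds in the notebook because `maxLen` is the length of the longest review.

```python
import numpy as np


def pad_post(sequences, max_len):
    """Right-pad each sequence with zeros up to max_len."""
    padded = np.zeros((len(sequences), max_len))
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq  # the remaining columns stay 0
    return padded


toy_padded = pad_post([[1, 2, 3], [4, 5]], max_len=4)
print(toy_padded)
```

The padding value 0 matches the index that the tokenizer never assigns to a word.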

### Modeling the text data

In [35]:
#dividing into training and test data set
np.random.seed(2)
X_tr,X_test,Y_tr,Y_test=train_test_split(X_indices,Y,test_size=0.2)
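
The split above depends on the global NumPy seed. One way to make a split reproducible on its own is to shuffle row indices with an explicit generator, as sketched below in plain NumPy. The function name is hypothetical, and the rounding of the test-set size may differ slightly from scikit-learn's. `train_test_split` also accepts a `random_state` argument for the same purpose.

```python
import numpy as np


def split_indices(n_rows, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) from a seeded random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return order[n_test:], order[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same partition, regardless of any other random calls made in between.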

In [36]:
#building the Bidirectional LSTM model
early_stopping=EarlyStopping(monitor='val_loss',patience=5)
model_save=ModelCheckpoint('top_model.hdf5',save_best_only=True)
model=Sequential()
model.add(Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    input_length=maxLen,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=True,
))
model.add(Bidirectional(LSTM(units = 10, return_sequences= True)))
#model.add(Dropout(rate=0.5))
model.add(Bidirectional(LSTM(units = 10, return_sequences= False)))
model.add(Dense(120,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.add(Dense(240,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.add(Dense(360,activation='relu'))
model.add(Dense(1,activation='sigmoid'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 116, 50)           742450    
_________________________________________________________________
bidirectional (Bidirectional (None, 116, 20)           4880      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 20)                2480      
_________________________________________________________________
dense (Dense)                (None, 120)               2520      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 121       
_________________________________________________________________
dense_2 (Dense)              (None, 240)               480       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 2
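
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The helper names are hypothetical, and the vocabulary size is taken from the word counts printed earlier plus one padding index.

```python
def embedding_params(vocab_size, dim):
    """One vector of length dim per vocabulary index."""
    return vocab_size * dim


def bilstm_params(input_dim, units):
    """Two directions, each with 4 gates of input weights, recurrent weights, and a bias."""
    one_direction = 4 * (units * (input_dim + units) + units)
    return 2 * one_direction


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


vocab_size = 12065 + 2783 + 1  # words found + words missing + padding index
print(embedding_params(vocab_size, 50))  # embedding
print(bilstm_params(50, 10))             # bidirectional
print(bilstm_params(20, 10))             # bidirectional_1
print(dense_params(20, 120))             # dense
print(dense_params(120, 1))              # dense_1
print(dense_params(1, 240))              # dense_2
```

The second bidirectional layer takes 20 inputs because the first one concatenates its forward and backward outputs of 10 units each.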

In [37]:
#compiling and fitting the model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_his=model.fit(X_tr, Y_tr, epochs = 50, batch_size = 987, validation_split=0.2, shuffle=True, verbose=True, callbacks=[early_stopping,model_save])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
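
The labels built earlier take three values (-1, 0, 1), whereas binary cross-entropy on a single sigmoid output assumes targets in {0, 1}. If a three-class formulation is wanted, one common pairing is a three-unit softmax output with sparse categorical cross-entropy, after shifting the labels to 0, 1, 2. In Keras that would correspond to `Dense(3, activation='softmax')` and `loss='sparse_categorical_crossentropy'`. The sketch below is plain NumPy and only illustrates the label shift and the loss computation. It is not the notebook's model, and the logits are made up.

```python
import numpy as np


def shift_labels(labels):
    """Map labels from {-1, 0, 1} to class ids {0, 1, 2}."""
    return np.asarray(labels) + 1


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(class_ids, probs):
    """Mean negative log-probability assigned to the true class."""
    picked = probs[np.arange(len(class_ids)), class_ids]
    return float(-np.log(picked).mean())


# Made-up labels and logits for four reviews.
toy_sentiments = [-1, 0, 1, 1]
toy_logits = np.array([[2.0, 0.1, 0.1],
                       [0.1, 2.0, 0.1],
                       [0.1, 0.1, 2.0],
                       [0.1, 0.1, 2.0]])

class_ids = shift_labels(toy_sentiments)
probs = softmax(toy_logits)
loss = sparse_categorical_crossentropy(class_ids, probs)
print(class_ids, round(loss, 4))
```

With this formulation the predicted class is the arg-max over the three probabilities, and subtracting 1 recovers the original -1, 0, 1 coding.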


In [39]:
#evaluating the model on the test data set
loss,accuracy=model.evaluate(X_test,Y_test)
print ('Loss=',loss)
print ('Accuracy=',accuracy)

Loss= 0.426622211933136
Accuracy= 0.8961260318756104
