# E-Commerce Review Classification Using LSTM

* Tutorial : [BISA.AI Academy](https://www.youtube.com/watch?v=RYI0tqngVy4&ab_channel=BISAAIAcademy)
* Dataset : [Womens Ecommerce Clothing Reviews](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews)

## Import modules

In [1]:
# Preprocessing and Visualization modules
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('/kaggle/input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


From the data we only need the 'Review Text' and 'Class Name' columns, so let's drop all the others.

## Cleaning the data

In [4]:
df_new = df[['Review Text', 'Class Name']]

In [5]:
df_new

Unnamed: 0,Review Text,Class Name
0,Absolutely wonderful - silky and sexy and comf...,Intimates
1,Love this dress! it's sooo pretty. i happene...,Dresses
2,I had such high hopes for this dress and reall...,Dresses
3,"I love, love, love this jumpsuit. it's fun, fl...",Pants
4,This shirt is very flattering to all due to th...,Blouses
...,...,...
23481,I was very happy to snag this dress at such a ...,Dresses
23482,"It reminds me of maternity clothes. soft, stre...",Knits
23483,"This fit well, but the top was very see throug...",Dresses
23484,I bought this dress for a wedding i have this ...,Dresses


Next, we remove the rows that have a null value in either column.

In [6]:
# subset limits the null check to these columns: a row is dropped only if one of them is null.
# Assigning the result (instead of inplace=True on a slice of df) avoids SettingWithCopyWarning.
df_new = df_new.dropna(axis=0, subset=['Review Text', 'Class Name'])
df_new.head()


Unnamed: 0,Review Text,Class Name
0,Absolutely wonderful - silky and sexy and comf...,Intimates
1,Love this dress! it's sooo pretty. i happene...,Dresses
2,I had such high hopes for this dress and reall...,Dresses
3,"I love, love, love this jumpsuit. it's fun, fl...",Pants
4,This shirt is very flattering to all due to th...,Blouses
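To make the effect of `subset` concrete, here is a minimal sketch on a made-up DataFrame (the `toy` frame and its values are illustrative and not taken from the dataset). Only rows missing a value in one of the listed columns are removed; a missing `Title` alone does not cause a drop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the review dataset
toy = pd.DataFrame({
    "Title": [np.nan, "Nice", "Okay", np.nan],
    "Review Text": ["Great fit", np.nan, "Too small", "Lovely"],
    "Class Name": ["Dresses", "Knits", np.nan, "Pants"],
})

# Count missing values per column before dropping anything
print(toy.isna().sum())

# Drop a row only when 'Review Text' or 'Class Name' is missing
toy_clean = toy.dropna(subset=["Review Text", "Class Name"])
print(toy_clean)
```

Running `df_new.isna().sum()` before and after the cleaning step is a quick way to confirm that no nulls remain in the two columns.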


## Split Data

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

vectorizer = CountVectorizer()

In [8]:
X_train, X_test, y_train, y_test = train_test_split(df_new['Review Text'], df_new['Class Name'], 
                                                    test_size=0.2, random_state=42)

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
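The vectorizer is fitted on the training text only and then reused on the test text, so words that appear only in the test set are ignored. A small sketch of that behaviour, using a made-up mini corpus (`train_docs`, `test_docs`, and `toy_vectorizer` are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus for illustration only
train_docs = ["love this dress", "this shirt is flattering"]
test_docs = ["love this jumpsuit"]

toy_vectorizer = CountVectorizer()
train_counts = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_counts = toy_vectorizer.transform(test_docs)        # reuses it unchanged

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)                  # words seen during fitting, in column order
print(test_counts.toarray())  # 'jumpsuit' has no column, so it is not counted
```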

## Naive Bayes

In [9]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
y_pred = nb_model.predict(X_test)
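As an alternative arrangement (not what the notebook does), the vectorizer and the classifier can be chained in a scikit-learn pipeline, which makes it harder to fit the vectorizer on test data by accident. This is a sketch on invented toy reviews; `toy_texts`, `toy_labels`, and `toy_pipeline` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews and classes for illustration only
toy_texts = [
    "this dress is pretty",
    "lovely summer dress",
    "these jeans fit well",
    "comfortable denim jeans",
]
toy_labels = ["Dresses", "Dresses", "Jeans", "Jeans"]

# The pipeline vectorizes raw text and then applies Naive Bayes
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

toy_pred = toy_pipeline.predict(["denim jeans"])
print(toy_pred)
```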

## Metrics

In [10]:
from sklearn.metrics import confusion_matrix, classification_report

In [11]:
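# Note: scikit-learn's signature is confusion_matrix(y_true, y_pred). With the arguments
# in this order, rows are predicted classes and columns are true classes.
print(confusion_matrix(y_pred, y_test))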

[[ 244    0   11   15    4    8    2  102    1    0    6    4    0    1
     7    0    3    7    2]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0]
 [  59    0 1202   33   12   56    2   85    2    2   28   23   60   16
   105    9   47   30   17]
 [   0    0    0    7    0    1    0    1    0    0    0    0    1    0
     1    0    7    0    0]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0]
 [   0    0    0    0    0   10    0    1    0    0    0    2    0    0
     0    0    1    0    0]
 [   1    0    0    0    0    1  175    2    0    7    5    0   25    5
     0    1    0    0    1]
 [ 281    0   44  126   12   60   13  723   23    4   50    7   22    9
    14   14  106   23    2]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0    1    0    0    0    0
     0    0    0    0    0]


In [12]:
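# Note: the arguments are again passed as (y_pred, y_test), so the precision and recall
# columns below are swapped relative to the usual (y_true, y_pred) call, and support
# counts predicted labels rather than true ones.
print(classification_report(y_pred, y_test))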

                precision    recall  f1-score   support

       Blouses       0.42      0.59      0.49       417
Casual bottoms       0.00      0.00      0.00         0
       Dresses       0.95      0.67      0.79      1788
    Fine gauge       0.03      0.39      0.06        18
     Intimates       0.00      0.00      0.00         0
       Jackets       0.07      0.71      0.12        14
         Jeans       0.75      0.78      0.77       223
         Knits       0.78      0.47      0.59      1533
      Layering       0.00      0.00      0.00         0
       Legwear       0.03      1.00      0.06         1
        Lounge       0.00      0.00      0.00         2
     Outerwear       0.00      0.00      0.00         1
         Pants       0.60      0.55      0.57       293
        Shorts       0.00      0.00      0.00         0
        Skirts       0.17      0.90      0.29        30
         Sleep       0.03      1.00      0.05         1
      Sweaters       0.39      0.54      0.45  

  _warn_prf(average, modifier, msg_start, len(result))
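The order of the arguments matters when reading these metrics. The following sketch, on made-up labels (`y_true_toy` and `y_pred_toy` are illustrative), shows that swapping the arguments transposes the confusion matrix and exchanges precision with recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only
y_true_toy = ["Dresses", "Dresses", "Knits", "Knits", "Knits"]
y_pred_toy = ["Dresses", "Knits", "Knits", "Knits", "Knits"]
toy_classes = ["Dresses", "Knits"]

# Conventional order: rows are true classes, columns are predicted classes
cm_conventional = confusion_matrix(y_true_toy, y_pred_toy, labels=toy_classes)
# Swapped order: the same counts, transposed
cm_swapped = confusion_matrix(y_pred_toy, y_true_toy, labels=toy_classes)
print(cm_conventional)
print(cm_swapped)

# Precision in the conventional call equals recall in the swapped call
precision_knits = precision_score(y_true_toy, y_pred_toy, pos_label="Knits")
swapped_recall_knits = recall_score(y_pred_toy, y_true_toy, pos_label="Knits")
print(precision_knits, swapped_recall_knits)
```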


## LSTM

In [13]:
# Convert text into sequences of integer token IDs
from keras.preprocessing.text import Tokenizer

# Pad the sequences so that every input has the same length
from keras.preprocessing.sequence import pad_sequences

In [14]:
MAX_NB_WORDS = 5000
MAX_SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 100

In [15]:
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters="!@#$%^&*()-=_+{}|:<>?/.,';[]\`~'",
                      lower=True)
tokenizer.fit_on_texts(df_new['Review Text'].values)
word_index = tokenizer.word_index

print("Found %s unique tokens" % len(word_index))

X = tokenizer.texts_to_sequences(df_new['Review Text'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print("Shape of data tensor", X.shape)

Found 16015 unique tokens
Shape of data tensor (22628, 250)
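With its default settings, `pad_sequences` pads with zeros on the left and, for sequences longer than `maxlen`, drops tokens from the beginning. The NumPy sketch below imitates that default behaviour so the resulting shape is easier to picture; `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad and left-truncate integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all zeros
        trunc = seq[-maxlen:]         # keep only the last maxlen tokens
        out[i, -len(trunc):] = trunc  # right-align, padding value on the left
    return out

# Hypothetical token ID sequences of different lengths
toy_sequences = [[4, 7], [1, 2, 3, 4, 5, 6], []]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```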


In [16]:
Y = pd.get_dummies(df_new['Class Name']).values
print('Shape of label tensor', Y.shape)

Shape of label tensor (22628, 20)
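Taking `.values` discards the column names, which are needed later to turn a predicted index back into a class name. A minimal sketch, on made-up labels, of keeping the column order next to the one-hot matrix (`toy_labels`, `class_names`, and `Y_toy` are illustrative names):

```python
import pandas as pd

# Hypothetical labels for illustration only
toy_labels = pd.Series(["Dresses", "Knits", "Dresses", "Pants"])

dummies = pd.get_dummies(toy_labels, dtype=float)
class_names = dummies.columns.to_numpy()  # column i corresponds to class_names[i]
Y_toy = dummies.to_numpy()
print(class_names)
print(Y_toy)

# Decode the one-hot rows back into class names
decoded = class_names[Y_toy.argmax(axis=1)]
print(decoded)
```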


In [17]:
# Split the data after tokenizing and padding
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20,
                                                    random_state=42)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(18102, 250) (18102, 20)
(4526, 250) (4526, 20)


## Build LSTM Architecture

In [18]:
len(df_new['Class Name'].value_counts())

20

In [19]:
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense
from keras.models import Sequential

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(20, activation='softmax')) # 20 output units, one per class

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 250, 100)          500000    
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 250, 100)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 20)                2020      
Total params: 582,420
Trainable params: 582,420
Non-trainable params: 0
_________________________________________________________________
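The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with the sizes from this notebook; the variable names are illustrative.

```python
vocab_size = 5000      # MAX_NB_WORDS
embedding_dim = 100    # EMBEDDING_DIM
lstm_units = 100
num_classes = 20

# One embedding vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# Four gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Weight matrix plus bias for the output layer
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```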


In [20]:
model.fit(X_train, y_train, epochs=20, batch_size=64, validation_split=0.2)

Epoch 1/20
...
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7ffad3f1a950>

In [21]:
y_pred = model.predict(X_test)
y_pred.shape

(4526, 20)

In [24]:
from mlxtend.preprocessing import one_hot

In [25]:
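# Index of the most probable class for the first test sample
# (more than one index is returned if several classes tie for the maximum)
result = np.where(y_pred[0] == np.amax(y_pred[0]))
# With its default settings, one_hot sizes the vector from the largest label it sees,
# so the output below has 4 entries (class index 3) rather than 20
one_hot(result[0])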

array([[0., 0., 0., 1.]])
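To decode every prediction at once, `argmax` along the class axis gives the predicted index per row, and an identity matrix gives one-hot rows of fixed width. This is a sketch with invented probabilities and class names; in the notebook, the names would presumably come from the columns of `pd.get_dummies(df_new['Class Name'])`.

```python
import numpy as np

# Hypothetical class names and softmax outputs for three samples
class_names = np.array(["Blouses", "Dresses", "Knits", "Pants"])
probs = np.array([
    [0.10, 0.20, 0.10, 0.60],
    [0.05, 0.80, 0.10, 0.05],
    [0.30, 0.30, 0.35, 0.05],
])

pred_idx = probs.argmax(axis=1)                     # most probable class per row
pred_names = class_names[pred_idx]                  # map indices back to names
one_hot_preds = np.eye(len(class_names))[pred_idx]  # fixed-width one-hot rows

print(pred_idx)
print(pred_names)
print(one_hot_preds)
```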

In [34]:
model.save('ecommerce_model.h5')
print("Model saved")

Model saved
