# Women's Clothing E-Commerce Recommendation

# Women E-Commerce Clothing Reviews dataset

### Dataset Description

This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review and includes the variables listed under Columns.

**Objective**
The objective is to classify whether or not the customer recommends the product.

**Columns**
1. Clothing ID: Integer categorical variable that refers to the specific piece being reviewed.

2. Age: Positive integer variable for the reviewer's age.

3. Title: String variable for the title of the review.

4. Review Text: String variable for the review body.

5. Rating: Positive ordinal integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).

6. Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

7. Positive Feedback Count: Positive integer documenting the number of other customers who found this review positive.

8. Division Name: Categorical name of the product's high-level division.

9. Department Name: Categorical name of the product department.

10. Class Name: Categorical name of the product class.

# 1. Importing the packages and the dataset(unlabeled)

In [1]:
# Importing the libraries
import numpy as np
import pandas as pd
import re,string
import gensim

The unlabeled data is analyzed first.

In [2]:
# Reading the dataset(unlabeled)
eccom_unlabel = pd.read_csv('eccomerce_unlabeled.csv')

In [3]:
# Getting the head of the dataset
eccom_unlabel.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,6,General,Tops,Blouses


In [4]:
# Removing the 'Unnamed: 0' column (axis is passed by keyword; newer pandas versions reject the positional form)
eccom_unlabel = eccom_unlabel.drop('Unnamed: 0', axis=1)

# 2. Checking for null values and treating them

In [5]:
# Checking for null values
eccom_unlabel.isnull().sum()

Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

In [6]:
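# Filling the null values with 0 (the text columns also receive the integer 0; clean_string below maps such non-string values to " ")
eccom_unlabel = eccom_unlabel.fillna(value=0)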
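
Filling every column with 0 also writes the integer 0 into the text columns (it shows up as a Title of 0 in the labeled data further down). A minimal sketch of a per-column alternative follows, using a made-up two-row frame. The fill values `""` and `"Unknown"` and the name `demo` are assumptions chosen for illustration, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the review table, with missing text and category values
demo = pd.DataFrame({
    "Title": [np.nan, "Flattering shirt"],
    "Review Text": ["Love this dress!", np.nan],
    "Division Name": ["General", np.nan],
})

# Fill each column with a value of its own type instead of a blanket 0
fill_values = {"Title": "", "Review Text": "", "Division Name": "Unknown"}
demo_filled = demo.fillna(value=fill_values)

print(demo_filled.isnull().sum().sum())  # total number of missing values left
```

Whether this is preferable depends on how the downstream cells treat missing text; the notebook's own pipeline works with the 0 fill because the cleaning step replaces non-string values.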

# 3. Data Cleaning

In [7]:
# Data Cleaning
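def clean_string(text):                                                           # Cleans one review; the parameter is named text so it does not shadow the imported string module
  try:
    text=re.sub(r'https?://\S+','',text)                                          # Remove URLs
    text=re.sub(r"[^A-Za-z0-9]"," ",text)                                         # Replace every non-alphanumeric character with a space
    words=text.strip().lower().split()                                            # Lowercase and collapse repeated whitespace
    return " ".join(words)
  except TypeError:                                                               # Non-string input, e.g. the 0 used to fill missing reviews
    return " "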
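
To make the effect of the cleaning step concrete, here is a small self-contained sketch that applies the same ideas (keep letters and digits, lowercase, collapse whitespace) to a made-up review. The function name `clean_review` and the sample text are illustrative assumptions:

```python
import re

def clean_review(text):
    """Illustrative stand-in for the cleaning steps described above."""
    if not isinstance(text, str):
        return " "  # mirrors the fallback used for non-string input
    text = re.sub(r"[^A-Za-z0-9]", " ", text)      # punctuation becomes a space
    return " ".join(text.strip().lower().split())  # lowercase, single spaces

sample = "Love this dress!  It's sooo pretty."
print(clean_review(sample))
print(repr(clean_review(0)))
```

Note that apostrophes are replaced by spaces, so contractions are split into two tokens (`it's` becomes `it s`), which is why tokens such as `'didn', 't'` appear in the tokenized reviews later on.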

In [8]:
# Cleaning the Review Text column
eccom_unlabel['Review Text']=eccom_unlabel['Review Text'].apply(clean_string)                                  # Finally cleaned format is applied on the reviews


In [9]:
# Getting the no of samples
print ("No.of samples \n:",(len(eccom_unlabel)))

No.of samples 
: 23486


# 4. Splitting each review for getting word tokens

In [10]:
# Splitting each review to have word tokens
Document=[]
for doc in eccom_unlabel['Review Text']:
  Document.append(doc.split(' '))
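
One detail worth checking here: `clean_string` returns a single space for missing reviews, and splitting that on `' '` yields empty-string tokens rather than an empty list. The sketch below, on two made-up cleaned reviews, compares `split(' ')` with the argument-free `split()`:

```python
cleaned_reviews = ["this dress is perfection", " "]  # the second entry stands for a missing review

tokens_single_space = [doc.split(" ") for doc in cleaned_reviews]
tokens_whitespace = [doc.split() for doc in cleaned_reviews]

print(tokens_single_space[1])  # ['', '']
print(tokens_whitespace[1])    # []
```

Using `split()` would keep the empty string out of the Word2Vec corpus; whether that matters in practice depends on how many reviews are missing.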

In [11]:
# Length of the document
len(Document)

23486

In [12]:
# Exploring the split reviews
Document[11]                                                    

['this', 'dress', 'is', 'perfection', 'so', 'pretty', 'and', 'flattering']

In [118]:
print(len(Document[13]))                                                          # Length of the document at index 13; it has 76 words in it
print(Document[13])

76
['bought', 'the', 'black', 'xs', 'to', 'go', 'under', 'the', 'larkspur', 'midi', 'dress', 'because', 'they', 'didn', 't', 'bother', 'lining', 'the', 'skirt', 'portion', 'grrrrrrrrrrr', 'my', 'stats', 'are', '34a', '28', '29', '36', 'and', 'the', 'xs', 'fit', 'very', 'smoothly', 'around', 'the', 'chest', 'and', 'was', 'flowy', 'around', 'my', 'lower', 'half', 'so', 'i', 'would', 'say', 'it', 's', 'running', 'big', 'the', 'straps', 'are', 'very', 'pretty', 'and', 'it', 'could', 'easily', 'be', 'nightwear', 'too', 'i', 'm', '5', '6', 'and', 'it', 'came', 'to', 'just', 'below', 'my', 'knees']


# 5. Applying the Word 2 Vector in the column

In [119]:
# Import logging
import logging                                                                   

In [120]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Training the Word2Vec model
model=gensim.models.Word2Vec(Document,                                           # List of tokenized reviews
                          min_count=10,                                          # Ignore words that appear fewer than 10 times in the corpus
                          workers=4,                                             # Number of worker threads used for training (faster on multicore machines)
                          size=50,                                               # Each word is represented by a 50-dimensional vector, i.e. the hidden layer has 50 units (gensim 4 renamed this parameter to vector_size)
                          window=5)                                              # Up to 5 neighbors on either side of a word

2020-06-10 02:38:30,537 : INFO : collecting all words and their counts
2020-06-10 02:38:30,537 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-06-10 02:38:30,624 : INFO : PROGRESS: at sentence #10000, processed 591576 words, keeping 9788 word types
2020-06-10 02:38:30,706 : INFO : PROGRESS: at sentence #20000, processed 1192608 words, keeping 13252 word types
2020-06-10 02:38:30,737 : INFO : collected 14168 word types from a corpus of 1399859 raw words and 23486 sentences
2020-06-10 02:38:30,738 : INFO : Loading a fresh vocabulary
2020-06-10 02:38:30,755 : INFO : effective_min_count=10 retains 3505 unique words (24% of original 14168, drops 10663)
2020-06-10 02:38:30,756 : INFO : effective_min_count=10 leaves 1375465 word corpus (98% of original 1399859, drops 24394)
2020-06-10 02:38:30,768 : INFO : deleting the raw counts dictionary of 14168 items
2020-06-10 02:38:30,768 : INFO : sample=0.001 downsamples 60 most-common words
2020-06-10 02:38:30,769 : IN
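
The log line `effective_min_count=10 retains 3505 unique words` reflects the `min_count` filter, and `window` controls which neighbours count as context. The following is a simplified sketch of both ideas on a made-up corpus. It is not gensim's implementation: gensim additionally shrinks the window at random and downsamples frequent words. The helper `context_pairs` is hypothetical:

```python
from collections import Counter

# Toy corpus of tokenized reviews (made up for illustration)
corpus = [
    ["love", "this", "dress"],
    ["love", "this", "top"],
    ["this", "dress", "runs", "small"],
]

# min_count: keep only words whose corpus frequency reaches the threshold
min_count = 2
counts = Counter(word for doc in corpus for word in doc)
vocab = {word for word, n in counts.items() if n >= min_count}

# window: pair each in-vocabulary word with neighbours at most `window` positions away
def context_pairs(doc, window):
    kept = [w for w in doc if w in vocab]  # out-of-vocabulary words are dropped first
    pairs = []
    for i, center in enumerate(kept):
        lo, hi = max(0, i - window), min(len(kept), i + window + 1)
        pairs.extend((center, kept[j]) for j in range(lo, hi) if j != i)
    return pairs

print(sorted(vocab))
print(context_pairs(corpus[0], window=1))
```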

In [121]:
# Vocab words length
print(len(model.wv.vocab))

3505


In [122]:
# Dimension of each vector
print(model.wv.vector_size)       

50


In [123]:
# Shape of the vectors
model.wv.vectors.shape  

(3505, 50)

# 6. Finding similarities with the given words

In [124]:
# Finding the words most similar to the given word
model.wv.most_similar("elastic")

2020-06-10 02:38:36,183 : INFO : precomputing L2-norms of word weight vectors


[('band', 0.8767274618148804),
 ('waistband', 0.8104285001754761),
 ('string', 0.7814128994941711),
 ('gaping', 0.7778440713882446),
 ('drawstring', 0.7710692882537842),
 ('adjustable', 0.769332230091095),
 ('pleats', 0.7525403499603271),
 ('seam', 0.7354264259338379),
 ('empire', 0.7313741445541382),
 ('slight', 0.7260311841964722)]

In [125]:
# Finding the words most similar to the given word
model.wv.most_similar("thin")

[('thick', 0.8345667123794556),
 ('sheer', 0.8288353681564331),
 ('heavy', 0.809961199760437),
 ('itchy', 0.8096404671669006),
 ('stiff', 0.7992092967033386),
 ('scratchy', 0.7959129810333252),
 ('flimsy', 0.786957859992981),
 ('substantial', 0.7443051338195801),
 ('stretchy', 0.7316536903381348),
 ('clingy', 0.7047092318534851)]

In [126]:
# Finding the words most similar to the given word
model.wv.most_similar("casual")

[('dressy', 0.8592880964279175),
 ('professional', 0.7696312069892883),
 ('formal', 0.760412871837616),
 ('everyday', 0.7477831840515137),
 ('office', 0.7186474800109863),
 ('business', 0.7093658447265625),
 ('versatile', 0.707597017288208),
 ('statement', 0.691520094871521),
 ('night', 0.687825620174408),
 ('fancy', 0.6745193004608154)]
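
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal NumPy sketch of that measure, on made-up 3-dimensional vectors rather than the trained embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for learned embeddings
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Because the measure depends on shared contexts rather than meaning, opposites can score high: `thick` tops the list for `thin`, and `dressy` for `casual`, presumably because reviewers use them in the same kinds of sentences.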

# 7. Getting the vectors for the given words

In [127]:
# Getting the vectors for the given word
model.wv["good"]                                                                  

array([-1.176295  , -0.45844513, -2.5261326 , -0.80591667, -1.2087256 ,
        1.1122762 , -0.37928355, -0.37312347,  1.0273516 , -1.8836976 ,
        0.64909995, -3.514252  , -0.5397141 ,  0.7332397 ,  1.2798525 ,
        0.8106409 ,  0.18375742,  0.14993669, -2.5257838 , -0.8134738 ,
        1.8921258 , -0.911001  ,  1.6516182 ,  0.84880006,  0.7936893 ,
        0.77655774,  2.263096  , -0.08433716,  0.4509756 , -1.7902579 ,
       -0.3375684 , -0.91015935, -1.4423649 ,  2.4932146 ,  1.0489174 ,
       -0.28217334, -2.4501994 , -0.7577879 , -0.09641537, -1.5083404 ,
       -0.24535249,  0.99679506, -2.2660577 ,  0.6605534 ,  0.0208884 ,
       -2.7304204 , -0.88867676, -0.21991247, -2.7705712 ,  2.6571522 ],
      dtype=float32)

In [128]:
# Getting the vectors for the given word
model.wv['ugly']

array([ 0.07112465,  0.2635235 , -0.2920789 , -0.15770221,  0.15314275,
       -0.13191661,  0.01873192, -0.09989369,  0.22969107,  0.07122152,
       -0.09486619, -0.0740578 ,  0.09546576,  0.10995583,  0.0222888 ,
        0.0044242 , -0.07734506,  0.00986315,  0.01937341, -0.0689163 ,
       -0.172906  , -0.16759486,  0.08832748,  0.00245991,  0.03076484,
       -0.07046505, -0.12630461, -0.09965779,  0.21890326,  0.02806855,
       -0.07381344,  0.09758095, -0.13874969,  0.08826001, -0.13341103,
        0.20776902,  0.16249816, -0.19754645, -0.0517891 ,  0.00286017,
       -0.11863533, -0.2139306 ,  0.15927167, -0.3561667 , -0.16951601,
       -0.01266793,  0.15520552,  0.18639816, -0.02249255,  0.02963131],
      dtype=float32)

# 8. Saving the model

In [129]:
# Saving the model
model.save("word 2 vector woman ecommerce")

2020-06-10 02:38:40,334 : INFO : saving Word2Vec object under word 2 vector woman ecommerce, separately None
2020-06-10 02:38:40,335 : INFO : not storing attribute vectors_norm
2020-06-10 02:38:40,336 : INFO : not storing attribute cum_table
2020-06-10 02:38:40,355 : INFO : saved word 2 vector woman ecommerce


The saved Word2Vec model is used from here on as a pre-trained embedding for the classifier.

# 9. Reading the labeled dataset

In [178]:
# Reading the dataset
eccom_label = pd.read_csv('eccommerce_labeled.csv')

In [179]:
eccom_label.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,0,absolutely wonderful silky and sexy and comfor...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,0,love this dress it s sooo pretty i happened to...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,i had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,i love love love this jumpsuit it s fun flirty...,5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,this shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [180]:
# Dropping the unwanted column
eccom_label = eccom_label.drop('Unnamed: 0', axis=1)

# 10. Splitting x and y variables into train and test

In [148]:
# Splitting the dataset into train and test using train_test_split
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(eccom_label['Review Text'],eccom_label['Recommended IND'],test_size=0.3,random_state=2)

# 11. Building the tokenizer

In [149]:
# Building tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer                         # Public tf.keras path instead of the private tensorflow.python module

#Vocab size
top_words = 10000

t = Tokenizer(num_words=top_words)
t.fit_on_texts(X_train.tolist())

#Get the word index for each of the word in the review
X_train = t.texts_to_sequences(X_train.tolist())
X_test = t.texts_to_sequences(X_test.tolist())
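
As a rough illustration of what the tokenizer does, the sketch below ranks words by frequency (most frequent word gets index 1, leaving 0 free for padding) and keeps only indices below `num_words` when converting text to sequences. The helpers `build_word_index` and `to_sequences` are hypothetical stand-ins written in plain Python, not the Keras implementation:

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, keeping only known words with an index below num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

train_texts = ["love this dress", "love this", "love"]
word_index = build_word_index(train_texts)
print(word_index)
print(to_sequences(["love this new dress"], word_index, num_words=3))
```

Words unseen during fitting (`new`) and words ranked at or beyond `num_words` (`dress`) are silently dropped, which is why the tokenizer is fitted on the training split only and then applied to both splits.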

# 12. Using pad sequences to make each review size equal

In [150]:
# Using pad sequences to make each review size equal
from tensorflow.keras.preprocessing import sequence                               # Public tf.keras path instead of the private tensorflow.python module

#Each review size
max_review_length = 300

X_train = sequence.pad_sequences(X_train,maxlen=max_review_length,padding='post')
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length, padding='post')
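
With `padding='post'`, zeros are appended to short reviews. Truncation is left at its default here, which in Keras removes tokens from the start of reviews longer than `maxlen`. A small NumPy sketch of that behaviour; `pad_post` is a hypothetical helper, not the Keras function:

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Zero-pad at the end; over-long sequences keep their last maxlen tokens."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]              # drop tokens from the front if too long
        padded[row, :len(trimmed)] = trimmed
    return padded

print(pad_post([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
```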

# 13. Building Embedding Matrix from Pre-Trained model

In [151]:
# Building Embedding Matrix from Pre-Trained model
word2vec = gensim.models.Word2Vec.load('word 2 vector woman ecommerce')

#Embedding Length
embedding_vector_length = word2vec.wv.vectors.shape[1]

print('Loaded word2vec model..')
print('Model shape: ', word2vec.wv.vectors.shape)

2020-06-10 02:44:25,292 : INFO : loading Word2Vec object from word 2 vector woman ecommerce
2020-06-10 02:44:25,542 : INFO : loading wv recursively from word 2 vector woman ecommerce.wv.* with mmap=None
2020-06-10 02:44:25,542 : INFO : setting ignored attribute vectors_norm to None
2020-06-10 02:44:25,542 : INFO : loading vocabulary recursively from word 2 vector woman ecommerce.vocabulary.* with mmap=None
2020-06-10 02:44:25,546 : INFO : loading trainables recursively from word 2 vector woman ecommerce.trainables.* with mmap=None
2020-06-10 02:44:25,546 : INFO : setting ignored attribute cum_table to None
2020-06-10 02:44:25,547 : INFO : loaded word 2 vector woman ecommerce


Loaded word2vec model..
Model shape:  (3505, 50)


In [152]:
# Vector size
word2vec.wv.vector_size

50

# 14. Building matrix for the current data

In [153]:
# Initialize the embedding matrix to all zeros
embedding_matrix = np.zeros((top_words + 1, # Vocabulary size + 1; the extra row is for index 0, which is used for padding
                             embedding_vector_length))

# Steps for populating the embedding matrix

# 1. Check each word in the tokenizer vocabulary to see if it exists in the
#    pre-trained word2vec model.
# 2. If found, update the embedding matrix with the embedding for the word
#    from the word2vec model.

for word, i in sorted(t.word_index.items(),key=lambda x:x[1]):
    if i > top_words:
        break
    if word in word2vec.wv.vocab:
        embedding_vector = word2vec.wv[word]
        embedding_matrix[i] = embedding_vector
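
Words that the tokenizer knows but the Word2Vec model does not (for example, words below `min_count`) keep an all-zero row. A quick coverage check can show how many rows are affected. The sketch below uses tiny made-up stand-ins (`word_index`, `pretrained`) in place of the real tokenizer and model:

```python
import numpy as np

# Hypothetical stand-ins: a tokenizer word index and a pre-trained vector lookup
word_index = {"love": 1, "dress": 2, "grrrrrrrrrrr": 3}
pretrained = {"love": np.array([0.1, 0.2]), "dress": np.array([0.3, 0.4])}

top_words, dim = 3, 2
matrix = np.zeros((top_words + 1, dim))
for word, i in word_index.items():
    if i <= top_words and word in pretrained:
        matrix[i] = pretrained[word]

# Rows 1..top_words that are still all zeros had no pre-trained vector
missing = [w for w, i in word_index.items() if i <= top_words and not matrix[i].any()]
coverage = 1 - len(missing) / min(top_words, len(word_index))
print(missing, round(coverage, 2))
```

Since the embedding layer is frozen later on, words with zero rows contribute nothing to the classifier, so a low coverage figure would be a reason to revisit `min_count`.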

In [154]:
#Checking embeddings for word 'good'
embedding_matrix[t.word_index['good']]

array([-1.17629504, -0.45844513, -2.52613258, -0.80591667, -1.20872557,
        1.1122762 , -0.37928355, -0.37312347,  1.02735162, -1.88369763,
        0.64909995, -3.51425195, -0.5397141 ,  0.73323971,  1.27985251,
        0.81064087,  0.18375742,  0.14993669, -2.52578378, -0.81347382,
        1.89212584, -0.91100103,  1.65161824,  0.84880006,  0.79368931,
        0.77655774,  2.26309609, -0.08433716,  0.4509756 , -1.79025793,
       -0.3375684 , -0.91015935, -1.44236493,  2.49321461,  1.04891741,
       -0.28217334, -2.45019937, -0.75778788, -0.09641537, -1.50834036,
       -0.24535249,  0.99679506, -2.26605773,  0.6605534 ,  0.0208884 ,
       -2.73042035, -0.88867676, -0.21991247, -2.77057123,  2.65715218])

# 15. Building the graph

In [155]:
# Building the graph
from tensorflow.keras.models import Sequential                                    # Public tf.keras paths instead of the private tensorflow.python module
from tensorflow.keras.layers import Dropout, Dense, Embedding, Flatten

#Build a sequential model
model1 = Sequential()

# 16. Adding the embedding layer

In [156]:
# Adding the embedding layer
model1.add(Embedding(top_words + 1,
                    embedding_vector_length,
                    input_length=max_review_length,
                    weights=[embedding_matrix],                                    # Pre-trained embedding
                    trainable=False)                                               # We do not want to change embedding
         )

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


# 17. Flattening embedding and output layers

In [157]:
# Flatten the embedding output and add the dense layers
model1.add(Flatten())                                                             # Flatten reshapes the (300, 50) embedding output of each review into a single 15000-long vector
model1.add(Dense(200,activation='relu'))                                          # Dense is a fully connected layer
model1.add(Dense(100,activation='relu'))
model1.add(Dropout(0.5))                                                          # Dropout helps reduce overfitting and makes the model generalize better
model1.add(Dense(60,activation='relu'))
model1.add(Dropout(0.4))
model1.add(Dense(30,activation='relu'))
model1.add(Dropout(0.3))
model1.add(Dense(1,activation='sigmoid'))                                         # Sigmoid is used because the output variable is binary

model1.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
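
To get a feel for the size of this network, the parameter counts can be worked out by hand from the layer sizes above. The sketch below is plain arithmetic with no TensorFlow involved; the helper `dense_params` is a hypothetical name:

```python
max_review_length = 300
embedding_dim = 50
vocab_rows = 10000 + 1           # top_words + 1

def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights plus biases

# Flatten output size followed by the Dense layer widths
layer_sizes = [max_review_length * embedding_dim, 200, 100, 60, 30, 1]
trainable = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
frozen = vocab_rows * embedding_dim  # embedding weights, not trained

print(trainable)  # 3028221
print(frozen)     # 500050
```

Almost all trainable weights (3,000,200 of them) sit in the first Dense layer, because flattening 300 positions of 50 dimensions produces a 15000-long input.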


# 18. Fitting the model

In [159]:
# Training the model
model1.fit(X_train,Y_train,
          epochs=5,
          batch_size=200,          
          validation_data=(X_test, Y_test))

Train on 16440 samples, validate on 7046 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x2b9f3955780>

# 19. Testing the model

In [166]:
# Predicting the recommendation probability for four reviews from the test split
model1.predict(X_test[5:9])

array([[0.99993086],
       [0.94569564],
       [0.9877014 ],
       [0.9994308 ]], dtype=float32)

In [167]:
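# Note: train_test_split shuffles the rows, so X_test[5:9] does not correspond to rows 5-8 of eccom_label; Y_test.iloc[5:9] holds the matching labels
eccom_label.iloc[5:9,:]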

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
5,1080,49,Not for the very petite,i love tracy reese dresses but this one is not...,2,0,4,General,Dresses,Dresses
6,858,39,Cagrcoal shimmer fun,i aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits
7,858,39,"Shimmer, surprisingly goes with lots",i ordered this in carbon for store pick up and...,4,1,4,General Petite,Tops,Knits
8,1077,24,Flattering,i love this dress i usually get an xs but it r...,5,1,0,General,Dresses,Dresses
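
The sigmoid output is a probability, so a threshold is needed to turn it into a recommended / not recommended label. A minimal sketch using the four probabilities printed above and a threshold of 0.5 (an assumed, conventional choice). The `true_labels` array is a made-up placeholder; in the notebook the matching labels would come from `Y_test`:

```python
import numpy as np

# Probabilities copied from the predict output above
probs = np.array([[0.99993086], [0.94569564], [0.9877014], [0.9994308]])

# Threshold the sigmoid output at 0.5 to get a class label (1 = recommended)
pred_labels = (probs.ravel() >= 0.5).astype(int)

# Hypothetical true labels, standing in for Y_test.iloc[5:9].to_numpy()
true_labels = np.array([1, 1, 0, 1])

accuracy = float((pred_labels == true_labels).mean())
print(pred_labels, accuracy)
```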
