# Description

## Context

This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.

## Content

This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:

Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.

Age: Positive Integer variable of the reviewer's age.

Title: String variable for the title of the review.

Review Text: String variable for the review body.

Rating: Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.

Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

Positive Feedback Count: Non-negative Integer documenting the number of other customers who found this review positive.

Division Name: Categorical name of the product high level division.

Department Name: Categorical name of the product department name.

Class Name: Categorical name of the product class name.

## Problem approach

This problem can be framed as either a classification or a regression problem. Our approach is to solve it as a multi-class classification problem.

We have considered 'Rating' as the Target variable. The main objective is to predict the Women's clothing rating based on the customer reviews.

#### Load the required libraries

In [1]:
import os
import pandas as pd
import numpy as np

from sklearn.preprocessing import MinMaxScaler,LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

from keras.layers import Input,Embedding,Dense,Flatten,concatenate
from keras.models import Model

from IPython.display import Image

Using TensorFlow backend.


#### Read the data

In [37]:
data= pd.read_csv('Womens Clothing E-Commerce Reviews.csv')

### Understand the data

In [38]:
data.shape

(23486, 11)

In [39]:
data.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [40]:
data.tail()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses
23485,23485,1104,52,Please make more like this one!,This dress in a lovely platinum is feminine an...,5,1,22,General Petite,Dresses,Dresses


In [65]:
data.describe(include='all')

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
count,19662.0,19662.0,19662,19662,19662.0,19662.0,19662.0,19662,19662,19662
unique,1095.0,,13983,19656,5.0,2.0,,3,6,20
top,1078.0,,Love it!,Perfect fit and i've gotten so many compliment...,5.0,1.0,,General,Tops,Dresses
freq,871.0,,136,3,10858.0,16087.0,,11664,8713,5371
mean,,43.260808,,,,,2.652477,,,
std,,12.258122,,,,,5.834285,,,
min,,18.0,,,,,0.0,,,
25%,,34.0,,,,,0.0,,,
50%,,41.0,,,,,1.0,,,
75%,,52.0,,,,,3.0,,,


In [42]:
data.dtypes

Unnamed: 0                  int64
Clothing ID                 int64
Age                         int64
Title                      object
Review Text                object
Rating                      int64
Recommended IND             int64
Positive Feedback Count     int64
Division Name              object
Department Name            object
Class Name                 object
dtype: object

#### Remove the column `Unnamed: 0`, since it only holds a sequence of unique row numbers

In [43]:
data=data.drop(labels=['Unnamed: 0'], axis=1)

In [44]:
data.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


#### Identify the unique values for each of the attributes

In [45]:
for i in data.columns.values:
    print (i)
    #print (pd.value_counts(data[i].values,))
    print (len(data[i].unique()))

Clothing ID
1206
Age
77
Title
13994
Review Text
22635
Rating
5
Recommended IND
2
Positive Feedback Count
82
Division Name
4
Department Name
7
Class Name
21


In [46]:
for i in data.columns.values:
    print (i)
    print (pd.value_counts(data[i].values))

Clothing ID
1078    1024
862      806
1094     756
1081     582
872      545
829      527
1110     480
868      430
895      404
936      358
867      351
850      338
1095     327
863      306
1077     297
1059     294
1086     291
1080     289
860      288
1083     249
861      244
873      238
828      225
1092     220
1033     220
927      214
1056     213
820      211
836      205
1022     205
        ... 
88         1
72         1
56         1
1191       1
1175       1
1183       1
1127       1
887        1
600        1
648        1
680        1
712        1
137        1
105        1
89         1
73         1
57         1
41         1
25         1
9          1
1176       1
1160       1
1032       1
856        1
808        1
792        1
776        1
744        1
728        1
0          1
Length: 1206, dtype: int64
Age
39    1269
35     909
36     842
34     804
38     780
37     766
41     741
33     725
46     713
42     651
32     631
48     626
44     617
40     617
43     579

#### Change the data types accordingly

In [47]:
numerical = ['Age','Positive Feedback Count']
categorical =['Rating','Recommended IND','Division Name','Department Name','Class Name','Clothing ID']
string = ['Review Text','Title']

In [48]:
for num in numerical:
    data[num] = data[num].astype('int64')
    
for cat in categorical:
    data[cat] = data[cat].astype('category')

In [49]:
data.describe(include='all')

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
count,23486.0,23486.0,19676,22641,23486.0,23486.0,23486.0,23472,23472,23472
unique,1206.0,,13993,22634,5.0,2.0,,3,6,20
top,1078.0,,Love it!,Perfect fit and i've gotten so many compliment...,5.0,1.0,,General,Tops,Dresses
freq,1024.0,,136,3,13131.0,19314.0,,13850,10468,6319
mean,,43.198544,,,,,2.535936,,,
std,,12.279544,,,,,5.702202,,,
min,,18.0,,,,,0.0,,,
25%,,34.0,,,,,0.0,,,
50%,,41.0,,,,,1.0,,,
75%,,52.0,,,,,3.0,,,


In [50]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 10 columns):
Clothing ID                23486 non-null category
Age                        23486 non-null int64
Title                      19676 non-null object
Review Text                22641 non-null object
Rating                     23486 non-null category
Recommended IND            23486 non-null category
Positive Feedback Count    23486 non-null int64
Division Name              23472 non-null category
Department Name            23472 non-null category
Class Name                 23472 non-null category
dtypes: category(6), int64(2), object(2)
memory usage: 945.4+ KB


Observe that Review Text, Title, Division Name, Department Name and Class Name have null values.

#### Verify missing values

In [51]:
data.isnull().sum()

Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

#### Drop all rows that contain missing values

In [52]:
data = data.dropna(axis=0)
data.isnull().sum()

Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64
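
Dropping every incomplete row also removes the reviews whose only gap is a missing title (3810 titles are missing above). As a hedged alternative sketch, not what this notebook does, missing titles could be filled with an empty string so that only rows lacking review text or a category are dropped. The small frame below is made-up illustration data.

```python
import pandas as pd

# Toy frame (made-up rows) with the same kind of gaps as the real data
toy = pd.DataFrame({
    'Title': [None, 'Nice', 'Great fit'],
    'Review Text': ['Love it', None, 'Fits well'],
    'Class Name': ['Dresses', 'Knits', 'Pants'],
})

# Strategy used in this notebook: drop any row with a missing value
dropped = toy.dropna(axis=0)

# Alternative (assumption): keep reviews without a title by filling it with ''
filled = toy.copy()
filled['Title'] = filled['Title'].fillna('')
filled = filled.dropna(axis=0)

print(len(dropped), len(filled))
```

Whether the extra rows help depends on how informative the review text is on its own, so this is a choice to validate rather than a recommendation.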

#### Get the unique levels in Clothing ID

In [53]:
# Number of distinct Clothing ID values that remain after dropping rows
clothing_ID_levels = data['Clothing ID'].nunique()
clothing_ID_levels

1095

#### Remove ClothingID and Target attribute from Categorical data for further processing

In [54]:
categorical_attr = data.select_dtypes('category').columns
categorical_attr = categorical_attr.drop(['Rating','Clothing ID'])
categorical_attr 

Index(['Recommended IND', 'Division Name', 'Department Name', 'Class Name'], dtype='object')

In [55]:
target_attr = 'Rating'

### Pre-processing of numerical variables

#### Convert integers to float (useful for scaling later)

In [56]:
numerical_attr = data.select_dtypes('int64').columns
numerical_df = data[numerical_attr]

In [57]:
numerical_df=numerical_df.astype('float')
numerical_df.head()

Unnamed: 0,Age,Positive Feedback Count
2,60.0,0.0
3,50.0,0.0
4,47.0,6.0
5,49.0,4.0
6,39.0,1.0


#### Split the data into train and test

In [58]:
data_categorical_train, data_categorical_test, \
data_numerical_train, data_numerical_test, \
data_string_train, data_string_test, \
data_clothingID_train, data_clothingID_test, \
Y_train, Y_test = train_test_split(data[categorical_attr],
                                   numerical_df,
                                   data[string],
                                   data['Clothing ID'],
                                   data[target_attr],
                                   test_size=0.33, random_state=123) 

In [64]:
data_numerical_train

Unnamed: 0,Age,Positive Feedback Count
3394,46.0,1.0
3720,48.0,0.0
1036,82.0,0.0
19794,48.0,0.0
4534,30.0,0.0
22350,44.0,29.0
8586,26.0,1.0
3536,34.0,0.0
15640,45.0,4.0
4663,48.0,2.0
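
Passing several arrays to a single `train_test_split` call, as above, applies the same shuffle to all of them, so the feature blocks and the target stay row-aligned. A minimal sketch on toy arrays (the array names and values are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins (hypothetical) for two feature blocks and the target
features_a = np.arange(10).reshape(-1, 1)      # row i holds the value i
features_b = np.arange(10) * 10                # row i holds the value 10 * i
target = np.arange(10) % 2

a_train, a_test, b_train, b_test, y_train, y_test = train_test_split(
    features_a, features_b, target, test_size=0.33, random_state=123)

# Every array was shuffled with the same permutation, so rows still line up
rows_aligned = bool(np.all(a_test[:, 0] * 10 == b_test))
print(len(a_train), len(a_test), rows_aligned)
```

With a fractional `test_size`, the test set size is rounded up, which is why 19662 rows with `test_size=0.33` give the 6489 test rows seen later in this notebook.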


### Preprocessing of categorical variables

#### Convert categorical attributes to numeric

The `handle_unknown='ignore'` option makes the encoder ignore a category that was not seen during fitting when it appears at transform time (it is encoded as all zeros), instead of raising an error.

In [59]:
onehotencoder = OneHotEncoder(handle_unknown='ignore')

In [60]:
# Keep the fitted encoder under its own name; assigning it to `OneHotEncoder`
# would shadow the imported class.
# Note: encoding string categories directly requires scikit-learn 0.20 or newer.
categorical_encoder = onehotencoder.fit(data_categorical_train)

In [28]:
OneHotEncoder_train = categorical_encoder.transform(data_categorical_train).toarray()
OneHotEncoder_test = categorical_encoder.transform(data_categorical_test).toarray()

In [30]:
OneHotEncoder_test.shape

(6489, 30)
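
The effect of `handle_unknown='ignore'` can be seen on a toy example. This is a minimal sketch with made-up class names, where 'Jackets' stands in for a category that appears only in the test split:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_classes = np.array([['Tops'], ['Dresses'], ['Tops']])
test_classes = np.array([['Dresses'], ['Jackets']])   # 'Jackets' never seen in training

encoder = OneHotEncoder(handle_unknown='ignore').fit(train_classes)
encoded_test = encoder.transform(test_classes).toarray()

print(encoder.categories_[0])   # learned categories, in sorted order
print(encoded_test)             # the unseen category becomes an all-zero row
```

Because columns are created only for categories present at fit time, a level missing from the training split does not get a column of its own.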

### Preprocessing of Target variables

In [31]:
data['Rating'].unique()

[3, 5, 2, 4, 1]
Categories (5, int64): [3, 5, 2, 4, 1]

In [32]:
no_of_levels=len(data['Rating'].unique())

Since the target Rating has 5 different levels, we one-hot encode it so that no order is implied between the classes.

In [33]:
from sklearn.preprocessing import OneHotEncoder

In [34]:
onehotencoder = OneHotEncoder(handle_unknown='ignore')

In [35]:
# `to_numpy()` replaces `.values.get_values()`, which was removed in pandas 1.0.
# The fitted encoder gets its own name so the `OneHotEncoder` class is not shadowed.
target_encoder = onehotencoder.fit(Y_train.to_numpy().reshape(-1, 1))

In [36]:
OneHotEncoder_target_train = target_encoder.transform(Y_train.to_numpy().reshape(-1, 1)).toarray()
OneHotEncoder_target_test = target_encoder.transform(Y_test.to_numpy().reshape(-1, 1)).toarray()

In [37]:
OneHotEncoder_target_test.shape

(6489, 5)
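
As a minimal sketch of what one-hot encoding does to the target, the snippet below encodes a few example ratings with NumPy only. It assumes the five ratings 1 to 5 map to columns 0 to 4, which is the sorted order an encoder would use when all five ratings are present in the training split.

```python
import numpy as np

ratings = np.array([3, 5, 1, 4])          # example ratings on the 1-5 scale
num_classes = 5

# Row r of the identity matrix is the one-hot vector for class index r
one_hot = np.eye(num_classes)[ratings - 1]

# argmax maps a one-hot (or softmax) row back to a rating
recovered = one_hot.argmax(axis=1) + 1

print(one_hot)
print(recovered)
```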

#### Min Max Scaling

In [38]:
scaler = MinMaxScaler()
# Fit on the training split only, then apply the same scaling to both splits
scaler.fit(data_numerical_train)
scaled_attr_train = scaler.transform(data_numerical_train)
scaled_attr_test = scaler.transform(data_numerical_test)

In [39]:
scaled_attr_test

array([[0.60493827, 0.        ],
       [0.20987654, 0.        ],
       [0.08641975, 0.        ],
       ...,
       [0.27160494, 0.        ],
       [0.18518519, 0.07407407],
       [0.51851852, 0.02777778]])
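
Min-max scaling maps each value with x' = (x - min) / (max - min), where min and max are taken from the training split. The sketch below reproduces that arithmetic by hand on toy ages (the values are made up for illustration):

```python
import numpy as np

train_age = np.array([18.0, 45.0, 99.0])
test_age = np.array([67.0, 18.0])

# Statistics come from the training split only
age_min, age_max = train_age.min(), train_age.max()

def min_max_scale(values):
    """Apply x' = (x - min) / (max - min) with the training min and max."""
    return (values - age_min) / (age_max - age_min)

scaled_train = min_max_scale(train_age)
scaled_test = min_max_scale(test_age)
print(scaled_test)
```

A test value outside the training range would land outside [0, 1], which is expected when the scaler is fitted on the training split alone.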

#### Stack both numerical and categorical features

In [40]:
X_train = np.hstack((scaled_attr_train, OneHotEncoder_train))
X_train.shape

(13173, 32)

### Pre-Processing of Text

#### Preprocessing of Review Text

#### Get the length of the text having the maximum number of occurrences

#### Get the unique text lengths and their counts

In [41]:
unique_elements, counts_elements = np.unique(data_string_train['Review Text'].apply(len),return_counts=True)

In [42]:
unique_elements

array([  9,  12,  13,  15,  16,  17,  20,  22,  24,  25,  26,  27,  28,
        29,  30,  31,  32,  33,  35,  36,  37,  38,  39,  40,  41,  42,
        43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
        56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,  68,
        69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,
        82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,
        95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107,
       108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120,
       121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133,
       134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
       147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
       160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172,
       173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185,
       186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 19

In [43]:
counts_elements

array([   1,    1,    1,    1,    1,    3,    2,    1,    3,    3,    1,
          4,    2,    2,    2,    4,    1,    1,    2,    1,    4,    1,
          4,    2,    3,    2,    3,    4,    2,    6,    4,    2,    4,
         16,   11,    7,    8,   12,   13,    9,    9,    9,   10,   12,
         14,    9,    9,   12,   12,   19,   15,   16,   20,   15,   17,
         24,   25,   11,   10,   18,   11,   12,   13,   18,   15,   23,
         22,   18,   10,   14,   17,   16,   12,   26,   16,   17,   23,
         18,   18,   12,   18,   19,   15,   30,   29,   19,   22,   15,
         17,   20,   15,   23,   23,   16,   30,   15,   29,   16,   21,
         24,   16,   21,   22,   24,   23,   24,   25,   26,   25,   25,
         22,   21,   28,   23,   26,   28,   27,   24,   21,   19,   24,
         29,   28,   24,   28,   26,   23,   26,   22,   27,   26,   20,
         23,   33,   17,   26,   23,   18,   39,   18,   21,   39,   17,
         28,   30,   28,   26,   24,   31,   31,   

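#### We observe that the highest count is 1927 and the corresponding text length is 500 characters, hence we choose 500 as the maximum sequence length. A review never has more tokens than characters, so 500 is a safe, if generous, upper bound on the number of tokens.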

In [44]:
# Position of the most frequent text length
max_text_count_length = np.argmax(counts_elements)
REVIEW_TEXT_MAX_SEQUENCE_LENGTH = unique_elements[max_text_count_length]
REVIEW_TEXT_MAX_SEQUENCE_LENGTH

500
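
The length above is measured in characters, while padded sequences are measured in tokens. As a hedged sketch of an alternative (an assumption, not the notebook's method), the cap could be taken from a high percentile of the per-review token counts. The toy reviews below are made up, and whitespace splitting only approximates the Keras tokenizer.

```python
import numpy as np

reviews = [
    'love this dress',
    'nice soft top!!',
    'perfect',
    'great fit and very comfortable for work',
]

char_lengths = np.array([len(text) for text in reviews])
token_lengths = np.array([len(text.split()) for text in reviews])

# Mode of the character lengths (the approach used above)
values, counts = np.unique(char_lengths, return_counts=True)
mode_char_length = values[np.argmax(counts)]

# Alternative (assumption): a high percentile of the token counts
token_cap = int(np.percentile(token_lengths, 95))

print(mode_char_length, token_cap)
```

A token-based cap is usually much smaller than a character-based one, which shrinks the flattened embedding output and the dense layer that follows it.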

#### Tokenize the words

In [45]:
tokenizer = Tokenizer(oov_token='None')
tokenizer.fit_on_texts(data_string_train['Review Text'])
review_text_train = tokenizer.texts_to_sequences(data_string_train['Review Text'])
review_text_test = tokenizer.texts_to_sequences(data_string_test['Review Text'])

word_index_review_text = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index_review_text))
NUM_WORDS_REVIEW_TEXT = len(word_index_review_text)+1

review_text_seq_train = pad_sequences(review_text_train, maxlen=REVIEW_TEXT_MAX_SEQUENCE_LENGTH)
review_text_seq_test = pad_sequences(review_text_test, maxlen=REVIEW_TEXT_MAX_SEQUENCE_LENGTH)

Found 11986 unique tokens.
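
To make the two steps above concrete, here is a simplified pure-Python sketch of what the tokenizer and `pad_sequences` do with their default settings: words are indexed by descending frequency, index 1 is reserved for the out-of-vocabulary token, and sequences are left-padded with zeros and truncated from the front. The helper names are hypothetical, and unlike the Keras tokenizer this sketch does not strip punctuation.

```python
from collections import Counter

def fit_word_index(texts, oov_token='None'):
    """Index words by descending frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_padded_sequence(text, word_index, maxlen, oov_token='None'):
    """Map words to indices, keep the last `maxlen` tokens, left-pad with zeros."""
    seq = [word_index.get(w, word_index[oov_token]) for w in text.lower().split()]
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

word_index = fit_word_index(['great dress', 'great fit'])
padded = to_padded_sequence('great new dress', word_index, maxlen=5)
print(word_index, padded)
```

The word 'new' was not in the fitted texts, so it maps to index 1, the same slot the notebook's `oov_token='None'` occupies.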


###### Load the GloVe word embedding file into memory as a dictionary of word to embedding array.

__Note__: Filter the embedding for the unique words in the training data.

In [46]:
# load the whole embedding into memory
embeddings_index = dict()
# An explicit encoding avoids platform-dependent decoding errors,
# and the `with` block closes the file automatically.
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.
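
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines in that layout from an in-memory buffer, so it runs without the real file; the words and numbers are illustrative only.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: word followed by its vector
fake_glove_file = io.StringIO(
    "dress 0.1 0.2 0.3\n"
    "shirt -0.4 0.5 0.6\n"
)

embeddings = {}
for line in fake_glove_file:
    values = line.split()
    embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

print(embeddings['shirt'])
```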


#### Next, create a matrix of one embedding for each word in the training dataset. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the embedding weight vector from the loaded GloVe embedding.

#### The result is a matrix of weights only for words we will see during training.

#### Also count the words that are not present in GloVe, to decide whether the embedding layer needs to be trained


In [47]:
# create a weight matrix for words in training docs
review_embedding_matrix = np.zeros((NUM_WORDS_REVIEW_TEXT,50))
review_word_not_in_glove_count = 0
review_word_not_in_glove =[]
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        review_embedding_matrix[i] = embedding_vector
    else:
        review_word_not_in_glove.append(word)
        review_word_not_in_glove_count = review_word_not_in_glove_count+1

In [48]:
print(review_embedding_matrix)

[[ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.41800001  0.24968    -0.41242    ... -0.18411    -0.11514
  -0.78580999]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [-0.17242    -0.086107   -0.78268999 ... -1.3312      0.36011001
  -0.26620001]
 [ 0.40204    -2.33529997  0.22948    ...  0.81226999 -0.63955998
   0.76555002]]


In [49]:
print(review_word_not_in_glove)

['None', "it's", "i'm", '\r', "don't", "didn't", "doesn't", "can't", "i've", "wasn't", "5'4", "isn't", "i'd", "5'3", "5'5", "couldn't", "that's", "5'2", "5'7", "i'll", 'xxs', "you're", "5'8", "wouldn't", "5'6", "they're", 'pilcro', "5'", "won't", "5'1", "there's", "5'9", "haven't", '34d', "5'10", "aren't", '36d', "you'll", 'xsp', '0p', '34dd', '36dd', "would've", '135lbs', 'xxsp', '120lbs', '30dd', "weren't", "you'd", "it'll", "retailer's", '32dd', '00p', 'xsmall', '130lbs', '140lbs', 'tshirt', '125lbs', "5'0", 'skinnies', '115lbs', "5'11", "shouldn't", "they'd", "could've", "model's", "what's", "hadn't", '110lbs', 'pxs', 'jsut', "it'd", '145lbs', "70's", "5'4''", 'armhole', "you've", '34ddd', 'cartonnier', "they'll", 'deletta', "she's", '36ddd', 'skort', 'jeggings', 'heathered', 'bralette', "one's", "year's", "here's", '34f', '100lbs', '34g', "dind't", "6'", '128lbs', "let's", '150lbs', 'skinnys', 'pxxs', '105lbs', 'pilcros', '32ddd', "they've", "60's", 'snugger', '34aa', 'stevies', "

In [50]:
print(review_word_not_in_glove_count)

1943
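
In the run above, 1943 of the 11986 review tokens (about 16%) have no GloVe vector, many of them contractions such as "it's". The sketch below repeats the matrix-building logic on a toy vocabulary and reports the share of missing words; the word index and the 2-dimensional vectors are made up for illustration.

```python
import numpy as np

word_index = {'None': 1, 'dress': 2, "it's": 3, 'fit': 4}   # toy tokenizer index
pretrained = {                                              # toy 2-d "GloVe" vectors
    'dress': np.array([0.1, 0.2]),
    'fit': np.array([0.3, 0.4]),
}

matrix = np.zeros((len(word_index) + 1, 2))   # row 0 is reserved for padding
missing = []
for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        matrix[idx] = vector
    else:
        missing.append(word)                  # row stays all zeros

missing_share = len(missing) / len(word_index)
print(missing, missing_share)
```

Rows left at zero carry no information until the embedding layer is trained, which is the reason for counting the missing words before deciding on `trainable`.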


#### Similar text preprocessing for Title below

#### Get the lines having unique length size

In [51]:
unique_elements, counts_elements = np.unique(data_string_train['Title'].apply(len),return_counts=True)

In [52]:
unique_elements

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52])

In [53]:
counts_elements

array([  9,  36, 170, 172, 163, 301, 518, 500, 602, 586, 582, 619, 656,
       719, 663, 524, 565, 518, 547, 432, 370, 355, 353, 310, 322, 232,
       228, 216, 186, 205, 155, 151, 136, 115, 109, 102, 116,  75,  79,
        63,  54,  59,  46,  41,  47,  30,  38,  42,  54,   2])

#### Select the Max Sequence length

In [54]:
# Position of the most frequent title length
max_text_count_length = np.argmax(counts_elements)
TITLE_MAX_SEQUENCE_LENGTH = unique_elements[max_text_count_length]
TITLE_MAX_SEQUENCE_LENGTH

15

#### Tokenize the Title text

In [55]:
tokenizer = Tokenizer(oov_token='None')
tokenizer.fit_on_texts(data_string_train['Title'])
title_train = tokenizer.texts_to_sequences(data_string_train['Title'])
title_test = tokenizer.texts_to_sequences(data_string_test['Title'])

word_index_title = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index_title))
NUM_WORDS_TITLE = len(word_index_title)+1

title_text_seq_train = pad_sequences(title_train, maxlen=TITLE_MAX_SEQUENCE_LENGTH)
title_text_seq_test = pad_sequences(title_test, maxlen=TITLE_MAX_SEQUENCE_LENGTH)

Found 3117 unique tokens.


#### Also count the words that are not present in GloVe, to decide whether the embedding layer needs to be trained

In [56]:
# create a weight matrix for words in training docs
title_embedding_matrix = np.zeros((NUM_WORDS_TITLE,50))
title_word_not_in_glove_count = 0
title_word_not_in_glove =[]
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        title_embedding_matrix[i] = embedding_vector
    else:
        title_word_not_in_glove.append(word)
        title_word_not_in_glove_count = title_word_not_in_glove_count+1

In [57]:
print(title_word_not_in_glove_count)

286


### Attribute Clothing ID is treated as categorical rather than continuous, since its values are repeating identifiers
Since there are 1095 different values in 'Clothing ID', dummifying it would produce a very sparse matrix (mostly 0's).

Therefore we choose to use a categorical embedding.

In [58]:
clothing_id_levels_encoded=LabelEncoder().fit(data['Clothing ID'])

In [59]:
clothing_id_levels_encoded_train=clothing_id_levels_encoded.transform(data_clothingID_train)
clothing_id_levels_encoded_test=clothing_id_levels_encoded.transform(data_clothingID_test)
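
A categorical embedding replaces the wide one-hot vector with a short dense vector per level. The NumPy sketch below illustrates the idea: looking up a row of the weight matrix gives the same result as multiplying a one-hot row by that matrix. The weights here are random stand-ins for what the model would learn, and the encoded IDs are toy values.

```python
import numpy as np

num_levels, embed_dim = 1095, 50              # sizes used in this notebook
rng = np.random.default_rng(0)
weights = rng.normal(size=(num_levels, embed_dim))   # random stand-in for learned weights

encoded_ids = np.array([3, 1094, 3])          # label-encoded Clothing IDs (toy values)

# An embedding layer is a row lookup ...
looked_up = weights[encoded_ids]

# ... which equals multiplying one-hot rows by the weight matrix
one_hot = np.eye(num_levels)[encoded_ids]
same_result = np.allclose(one_hot @ weights, looked_up)

print(looked_up.shape, weights.size, same_result)
```

The weight matrix has 1095 × 50 = 54750 entries, which matches the parameter count reported for the Clothing ID embedding in the model summary.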

### Build the model using functional api

#### Categorical embedding of Clothing ID

In [60]:
clothing_id_input = Input(shape=(1, ), name="Clothing_ID")
clothing_id_embed = Embedding(input_dim=clothing_ID_levels, output_dim=50)(clothing_id_input)
clothing_id_embed_flat = Flatten()(clothing_id_embed)

#### Dense layer for numerical features

In [61]:
X_test = np.hstack((scaled_attr_test, OneHotEncoder_test))

In [62]:
num_cat_inputs = Input(shape=(X_train.shape[1],),name='num_cat_inputs')
out_num_cat = Dense(64, activation='relu')(num_cat_inputs)

#### Embedding layer for Review Text

#### If more than one word in the training data is not present in GloVe, train the embedding layer; otherwise keep it frozen

In [63]:
review_text_input = Input(shape=(REVIEW_TEXT_MAX_SEQUENCE_LENGTH,), name='review_text_input')
# Fine-tune the GloVe weights only if more than one training word is missing from GloVe
text_embed = Embedding(input_dim=NUM_WORDS_REVIEW_TEXT, output_dim=50,
                       weights=[review_embedding_matrix],
                       trainable=(review_word_not_in_glove_count > 1))(review_text_input)
review_out_text = Flatten()(text_embed)

#### Embedding layer for Title

In [64]:
title_text_input = Input(shape=(TITLE_MAX_SEQUENCE_LENGTH,), name='title_text_input')
# Fine-tune the GloVe weights only if more than one training word is missing from GloVe
text_embed = Embedding(input_dim=NUM_WORDS_TITLE, output_dim=50,
                       weights=[title_embedding_matrix],
                       trainable=(title_word_not_in_glove_count > 1))(title_text_input)
title_out_text = Flatten()(text_embed)

#### Concatenate the output of above layers.

In [65]:
concatenated = concatenate([clothing_id_embed_flat,out_num_cat,review_out_text,title_out_text],axis=-1)
X = Dense(8, activation='relu')(concatenated)
final_out = Dense(no_of_levels, activation='softmax')(X)

In [66]:
model = Model(inputs=[clothing_id_input,num_cat_inputs,review_text_input,title_text_input], outputs=final_out)
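
The width of the concatenated layer, and the size of the dense layers on top of it, can be worked out by hand from the shapes used in this notebook. The arithmetic below is a sketch under the settings shown above (500 and 15 token sequence lengths, 50-dimensional embeddings, 5 rating classes).

```python
# Widths of the four branches that feed `concatenate` (values from this notebook)
clothing_id_width = 1 * 50        # Flatten of a (1, 50) embedding
num_cat_width = 64                # Dense(64) on the 32 numeric + one-hot columns
review_width = 500 * 50           # Flatten of a (500, 50) embedding
title_width = 15 * 50             # Flatten of a (15, 50) embedding

concat_width = clothing_id_width + num_cat_width + review_width + title_width

# Dense layers have (inputs * units) weights plus one bias per unit
hidden_params = concat_width * 8 + 8
output_params = 8 * 5 + 5

print(concat_width, hidden_params, output_params)
```

Almost all of the concatenated width comes from the flattened review text, so the choice of maximum sequence length largely determines the size of the first dense layer.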

In [67]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Clothing_ID (InputLayer)        (None, 1)            0                                            
__________________________________________________________________________________________________
review_text_input (InputLayer)  (None, 500)          0                                            
__________________________________________________________________________________________________
title_text_input (InputLayer)   (None, 15)           0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 1, 50)        54750       Clothing_ID[0][0]                
__________________________________________________________________________________________________
num_cat_in

In [68]:
model.compile(loss='categorical_crossentropy', optimizer='adagrad', metrics=['accuracy'])

In [69]:
model.fit([clothing_id_levels_encoded_train,X_train,
           review_text_seq_train,title_text_seq_train], 
          y=OneHotEncoder_target_train, 
          epochs=10,validation_split=0.20)

Train on 10538 samples, validate on 2635 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x136d3e828>
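
The sample counts in the training log follow from `validation_split=0.20`: Keras holds out the last fraction of the arrays passed to `fit`, taken before any shuffling, and fits on the rest. A short sketch of that arithmetic, using the row count from this notebook:

```python
n_train_rows = 13173
validation_split = 0.20

# The first part is used for fitting, the tail is held out for validation
n_fit = int(n_train_rows * (1 - validation_split))
n_val = n_train_rows - n_fit
print(n_fit, n_val)
```

Because the held-out rows are simply the tail of the arrays, the validation set is only representative if the rows are in random order, which is the case here after `train_test_split`.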

In [70]:
model.evaluate([clothing_id_levels_encoded_train,X_train,review_text_seq_train,title_text_seq_train], 
               y=OneHotEncoder_target_train, )



[0.47434727256303316, 0.8296515600453076]

In [71]:
model.evaluate([clothing_id_levels_encoded_test,X_test,review_text_seq_test,title_text_seq_test], 
               y=OneHotEncoder_target_test, )



[0.9200684088351896, 0.6437047311109283]

In [72]:
model.predict([clothing_id_levels_encoded_test,X_test,review_text_seq_test,title_text_seq_test])

array([[6.22871893e-11, 5.96595706e-09, 2.86042516e-04, 5.86297363e-03,
        9.93850946e-01],
       [3.20166582e-03, 5.61929606e-02, 6.42572790e-02, 6.69453621e-01,
        2.06894472e-01],
       [2.46309161e-01, 3.54032487e-01, 3.51960301e-01, 1.21067669e-02,
        3.55912820e-02],
       ...,
       [2.33408791e-05, 9.01696563e-04, 1.01095371e-01, 5.98917484e-01,
        2.99062163e-01],
       [1.40292281e-02, 8.35757732e-01, 6.50851149e-03, 1.30809769e-01,
        1.28948325e-02],
       [2.00589781e-10, 8.42359640e-08, 1.12169808e-04, 9.48648620e-03,
        9.90401328e-01]], dtype=float32)
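
Each row of the prediction array is a probability distribution over the five ratings. A minimal sketch of turning such rows into predicted ratings and a confusion matrix follows; the three probability rows are rounded from the output above, while the true ratings are hypothetical values for illustration. It assumes column k corresponds to rating k + 1, which is the sorted order the target encoder uses when all five ratings occur in the training split.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Three probability rows rounded from the output above
probabilities = np.array([
    [0.000, 0.000, 0.000, 0.006, 0.994],
    [0.003, 0.056, 0.064, 0.669, 0.207],
    [0.246, 0.354, 0.352, 0.012, 0.036],
])
# Hypothetical true ratings for these rows, for illustration only
true_ratings = np.array([5, 5, 2])

# Column k holds the probability of rating k + 1
predicted_ratings = probabilities.argmax(axis=1) + 1

# Rows are true ratings, columns are predicted ratings
matrix = confusion_matrix(true_ratings, predicted_ratings, labels=[1, 2, 3, 4, 5])
print(predicted_ratings)
print(matrix)
```

Applied to the full test set, such a matrix would show which ratings the model confuses, which overall accuracy alone does not reveal.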