<h1>Modeling<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-in-dataset" data-toc-modified-id="Loading-in-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading in dataset</a></span></li><li><span><a href="#Assigning-predictor-variable-and-target-variable" data-toc-modified-id="Assigning-predictor-variable-and-target-variable-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Assigning predictor variable and target variable</a></span></li><li><span><a href="#Modeling---preprocessing" data-toc-modified-id="Modeling---preprocessing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Modeling - preprocessing</a></span></li><li><span><a href="#Modeling---cross-validation-and-performance-evaluation" data-toc-modified-id="Modeling---cross-validation-and-performance-evaluation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Modeling - cross-validation and performance evaluation</a></span></li><li><span><a href="#Modeling---final-model" data-toc-modified-id="Modeling---final-model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Modeling - final model</a></span></li><li><span><a href="#Evaluate-the-model-using-your-test-dataset" data-toc-modified-id="Evaluate-the-model-using-your-test-dataset-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Evaluate the model using your test dataset</a></span></li></ul></div>

Note: please load and preprocess your test dataset along with the original (training) dataset until the final model training part.

In [1]:
# import packages
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
import re
from collections import Counter
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer 
from nltk.corpus import stopwords, wordnet
from nltk import punkt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold


# Loading in dataset

In [2]:
# load in the notebook that contains all extracted features we created for the products
product = pd.read_csv('all_features.csv')

# load in the original notebook for all products
product_original = pd.read_excel('Behold+product+data+04262021.xlsx')

In [3]:
# appending the brand of each product to the dataframe that has extracted features 
product['brand'] = product_original['brand']

The final dataframe contains lemmatized description of products (in one column), each of the 30 features that are extracted from the texts, all features combined (in one column), and the brand of each product

In [4]:
product.head(2)

Unnamed: 0,product_id,lemm_total,detailed_category,general_category,gender,season,class,closure,color,dry_clean_only,...,toe_style,trend,wash,width,location,material_percent,material,brand_specific,all_features,brand
0,01EX0PN4J9WRNZH5F93YEX6QAF,unknown khadi stripe shirt our signature shirt...,shirt,top,,spring,shirt,,black white,,...,,,black white,,,,,,shirt top spring shirt black white ...,Two
1,01F0C4SKZV6YXS3265JMC39NXW,unknown ruffle market dress loopy pink sistine...,dress,onepiece,woman,,dress,strap zipper,pink,,...,,,,,ny,,,organic,dress onepiece woman dress strap zipper pink...,Collina Strada


# Assigning predictor variable and target variable

- For the predictor variable, we chose to combine all extracted features, and append it with lemmatized product descriptions&details in case there are not many features extracted. In this way, we can make sure that most products' predictor variable will have more than 64 words when we later pad the documents with max length 64. 

    
- For the target variable, we chose to only include the top 30 appearing brands in the dataset as well as an 'other' category

In [5]:
# We use 'X' to denote the column that represents the predictor variable we are going to use in the model
# It contains all the features of the product, followed by lemmatized description/details 

product['X'] = product['all_features'] + product['lemm_total']

In [6]:
# We use 'target' to denote the column that represents the target variable of the dataset, which contains a total
# of 31 classes 

top30 = product.brand.value_counts()[:30].index.to_list()
def assign_brand(name):
    '''Assigns the brand to each record (either the top 30 brands or Other)'''
    if name in top30:
        return name
    else:
        return 'Other'
product['target'] = product.brand.apply(assign_brand)

# Modeling - preprocessing

- Remove stopwords for the texts in the predictor variable


In [7]:
# importing nltk stopwords
import nltk
nltk.download('punkt')
nltk.download('stopwords') 

from nltk.corpus import stopwords
english_stopwords = set(stopwords.words("english"))

# adding 'unknown' as a stopword
english_stopwords.add('unknown')


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/BarbaraLiao/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/BarbaraLiao/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
def remove_stopwords(title):
    '''remove stopwords for a document'''
    if isinstance(title, str):
        tokens = nltk.word_tokenize(title)
        filtered_tokens = []
        for token in tokens:
            if token in english_stopwords:
                continue
            filtered_tokens.append(token)
            
        return " ".join(filtered_tokens)

In [9]:
# removing stopwords for the predictor variable in the data set

product["X"] = product["X"].apply(remove_stopwords)

# Modeling - cross-validation and performance evaluation

In [11]:
# formatting the predictor variable and target variable into 2 separate lists
X = product['X'].to_list()
Y = product['target'].to_list()

In [12]:
# Tokenize the top 5000 appearing words, and mark the rest as UNKNOWN_TOKEN
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000, oov_token="UNKNOWN_TOKEN")
tokenizer.fit_on_texts(X)

In [13]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
def integer_encode_documents(docs, tokenizer):
    '''apply the input tokenizer on the input docs and return sequences'''
    return tokenizer.texts_to_sequences(docs)

# integer encode the documents
encoded_docs = integer_encode_documents(X, tokenizer)
# see some lengths of the documents
list(map(len, encoded_docs))[:5]

[18, 141, 598, 386, 26]

In [15]:
# set MAX_SEQUENCE_LENGTH to 64
MAX_SEQUENCE_LENGTH = 64

# This is a list of lists, the numbers represent the index position of each word;
# for instance, 33 means the 33rd word in the vocabulary

# This step makes sure that each document in the predictor variable has a fixed length of 64 

padded_docs = pad_sequences(encoded_docs, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
padded_docs

array([[  95,    7,  189, ...,    0,    0,    0],
       [   7,    1,    7, ...,   55,    1,    7],
       [2903,  437,  391, ..., 1329,  456,    1],
       ...,
       [  20,   32,  282, ...,   12, 2589,  279],
       [2621,  292,  418, ...,    7, 2621,  292],
       [   6,  125,   43, ...,    6,  186, 1120]], dtype=int32)

In [16]:
# This step encodes the 31 brand category to 31 labels, and makes 31 binary columns for them

from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
labels = to_categorical(encoder.fit_transform(Y),31)

<b> Note: please do not include your test dataset at this point<b>

In [19]:
# making 5-fold cross validation data sets
# for each item in 'cv', it contains 4 lists that represent X_train, X_test,Y_train, and Y_test, respectively

kf = KFold(n_splits=5)
cv = []
for train_index, test_index in kf.split(padded_docs):
    X_train, X_test = padded_docs[train_index], padded_docs[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]
    cv += [[X_train, X_test,Y_train, Y_test]]
    

In [20]:
from random import randint
from numpy import array, argmax, asarray, zeros
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding

In [21]:
VOCAB_SIZE = int(len(tokenizer.word_index) * 1.1)

In [22]:
def load_glove_vectors():
    '''load in the glove vectors and return embeddings index'''
    embeddings_index = {}
    with open('glove.6B.100d.txt') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print('Loaded %s word vectors.' % len(embeddings_index))
    return embeddings_index


embeddings_index = load_glove_vectors()

Loaded 400000 word vectors.


In [23]:
# create a weight matrix for words in training docs

embedding_matrix = zeros((VOCAB_SIZE, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: # check that it is an actual word that we have embeddings for
        embedding_matrix[i] = embedding_vector

In [24]:
# define lstm model

import keras
from keras.layers.recurrent import SimpleRNN, LSTM
from keras.layers import Flatten, Masking

def make_lstm_classification_model(plot=False):
    model =  keras.models.Sequential()
    model.add(Embedding(VOCAB_SIZE, 100, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False))
    model.add(Masking(mask_value=0.0)) # masking layer, masks any words that don't have an embedding as 0s.
    model.add(LSTM(units=32, input_shape=(1, MAX_SEQUENCE_LENGTH)))
    model.add(Dense(16))
    model.add(Dense(31, activation='softmax'))
    
    # Compile the model
    model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    # summarize the model
    model.summary()
    
    if plot:
        plot_model(model, to_file='model.png', show_shapes=True)
    return model

In [25]:
# create an instance of the lstm model

model = make_lstm_classification_model()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 64, 100)           3740000   
_________________________________________________________________
masking (Masking)            (None, 64, 100)           0         
_________________________________________________________________
lstm (LSTM)                  (None, 32)                17024     
_________________________________________________________________
dense (Dense)                (None, 16)                528       
_________________________________________________________________
dense_1 (Dense)              (None, 31)                527       
Total params: 3,758,079
Trainable params: 18,079
Non-trainable params: 3,740,000
_________________________________________________________________


<b> Note: You can comment out the next 2 cells because it will take extremely long time to run <b>

In [26]:
# evaluating model performance with cv and recording the accuracy in the dictionary 'cv_results'

cv_results = {}
for i in range(5):

    # train the model
    history = model.fit(cv[i][0], cv[i][2],validation_split = 0.1, epochs=20, verbose=1)

    # evaluate the model
    loss, accuracy = model.evaluate(cv[i][1], cv[i][3], verbose=1)
    cv_results[i] = accuracy

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20


Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [27]:
cv_results

{0: 0.9291999936103821,
 1: 0.9506000280380249,
 2: 0.9599000215530396,
 3: 0.9684000015258789,
 4: 0.9545000195503235}

# Modeling - final model
- Here we use all the data to train the lstm model so that its performance can be improved

- <b> Note: please do not include your test dataset at this point<b>

In [28]:
model = make_lstm_classification_model()
history = model.fit(padded_docs, labels,validation_split = 0.1, epochs=20, verbose=1)


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 64, 100)           3740000   
_________________________________________________________________
masking_1 (Masking)          (None, 64, 100)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                17024     
_________________________________________________________________
dense_2 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_3 (Dense)              (None, 31)                527       
Total params: 3,758,079
Trainable params: 18,079
Non-trainable params: 3,740,000
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 

# Evaluate the model using your test dataset 

In [29]:
# Please replace X_test with your transformed predictor variable
# Please replace y_test with your transformed target variable
# And run the following code

#loss, accuracy = model.evaluate(X_test, y_test, verbose=1)
#print('Accuracy: %f' % (accuracy*100))