## 1. Define Training Data

We are provided with two files of pre-labeled data. The labels were created by industry experts and detail the tags associated with certain online fashion items. Using these pre-labeled products as training data, we can train the model to learn the characteristics associated with each product's size and predict the size label of new products.

In [1]:
import pandas as pd
import numpy as np
import string
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
## Read the two pre-labeled tagged data sets.

df1 = pd.read_csv("tagged_product_attribute.csv")
df2 = pd.read_csv("usc_additional_tags.csv")
df = pd.concat([df1, df2], ignore_index = True)
df_size = df[df.attribute_name == "sizing"].reset_index(drop = True)

In [3]:
## Use regex to unify the capitalization of the category labels

df_size.attribute_value = [re.sub(r'(Regular)', 'regular', i) for i in df_size.attribute_value]
df_size.attribute_value = [re.sub(r'(Plus)', 'plus', i) for i in df_size.attribute_value] 
df_size.attribute_value = [re.sub(r'(Maternity)', 'maternity', i) for i in df_size.attribute_value]
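
As an alternative sketch (not what the notebook runs), the three substitutions can be collapsed into one vectorized normalization. This assumes the label variants differ only in capitalization and surrounding whitespace; the sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-sample of raw sizing labels with inconsistent capitalization.
raw = pd.Series(["Regular", "regular", "Plus", "Maternity", "plus"])

# Trimming whitespace and lowercasing collapses the spelling variants in one pass.
normalised = raw.str.strip().str.lower()

print(normalised.value_counts().to_dict())
```

Unlike the targeted `re.sub` calls, this lowercases every label, so it is only appropriate when no label relies on capitalization.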

In [4]:
df_size = df_size.drop(["product_color_id", "attribute_name"], axis = 1)

In [5]:
df_size.head()

| | product_id | attribute_value |
|---|---|---|
| 0 | 01E2M39XZ7JKEDP7AG4N0DK2N7 | regular |
| 1 | 01DPGVF16NCKCEGM24V3RQ7JZG | regular |
| 2 | 01DT0DK7AV6D33GRSR0SFXDH24 | regular |
| 3 | 01DS1BTV804QFC91G3R9Q7D792 | regular |
| 4 | 01DTJCD2QG8RKQ4HJKT4GBW888 | regular |


### Merge with product data

To learn these characteristics, we need text descriptions of each product, such as its brand, name, and category. We are provided with two product datasets, which we first combine into one master data set.

In [6]:
data1 = pd.read_csv("full_data.csv")
data2 = pd.read_csv("extra_data.csv")

In [7]:
data1=data1[["product_id","brand","product_full_name","description","brand_category","brand_canonical_url","details"]]
data1=data1.rename(columns = {"product_full_name":"name"})
data2=data2[["product_id","brand","name","description","brand_category","brand_canonical_url","details"]]
data = pd.concat([data1, data2], ignore_index = True)

In [8]:
data["details"] = data['details'].str.replace('\n',' ')

In [9]:
## Drop duplicate entries
data = data.drop_duplicates(subset="product_id", keep='last', inplace=False)
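
A small sketch of what `keep='last'` does may help here. Because the two product files are concatenated in order, the last occurrence of a repeated `product_id` is the one that appears later in the combined frame. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical frame in which product "A" appears twice.
toy = pd.DataFrame({
    "product_id": ["A", "B", "A"],
    "name": ["old name", "other", "new name"],
})

# keep="last" retains the final occurrence of each product_id.
deduped = toy.drop_duplicates(subset="product_id", keep="last")

print(deduped)
```

In this sketch the row with `"new name"` survives for product `A`, while the earlier `"old name"` row is dropped.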

Next, we merge the data set with the df_size table which contains the size labels for certain products.

In [10]:
## Merge

my_df = data.merge(df_size, how = "outer", left_on = "product_id", right_on = "product_id")
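
To illustrate why most rows end up with an empty size, here is a minimal sketch of an outer merge on two hypothetical frames. The `indicator=True` option is used only for inspection and is not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical product and label tables that only partly overlap.
products = pd.DataFrame({"product_id": ["A", "B"], "brand": ["x", "y"]})
sizes = pd.DataFrame({"product_id": ["B", "C"], "attribute_value": ["regular", "plus"]})

# An outer merge keeps every product_id from both sides; missing values become NaN.
merged = products.merge(sizes, how="outer", on="product_id", indicator=True)

print(merged)
```

Product `A` has no label, so its `attribute_value` is NaN; product `C` has a label but no product information, so its `brand` is NaN.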


In [11]:
## Drop duplicate entries
my_df = my_df.drop_duplicates(subset="product_id", keep='last', inplace=False).reset_index(drop = True)

In [12]:
my_df=my_df.rename(columns = {"attribute_value":"size"})

In [13]:
my_df.head()

| | product_id | brand | name | description | brand_category | brand_canonical_url | details | size |
|---|---|---|---|---|---|---|---|---|
| 0 | 01DSE9TC2DQXDG6GWKW9NMJ416 | Banana Republic | Ankle-Strap Pump | A modern pump, in a rounded silhouette with an... | Unknown | https://bananarepublic.gap.com/browse/product.... | A modern pump, in a rounded silhouette with an... | |
| 1 | 01DSE9SKM19XNA6SJP36JZC065 | Banana Republic | Petite Tie-Neck Top | Dress it down with jeans and sneakers or dress... | Unknown | https://bananarepublic.gap.com/browse/product.... | Dress it down with jeans and sneakers or dress... | |
| 2 | 01DSJX8GD4DSAP76SPR85HRCMN | Loewe | 52MM Padded Leather Round Sunglasses | Padded leather covers classic round sunglasses. | JewelryAccessories/SunglassesReaders/RoundOval... | https://www.saksfifthavenue.com/loewe-52mm-pad... | 100% UV protection Case and cleaning cloth inc... | |
| 3 | 01DSJVKJNS6F4KQ1QM6YYK9AW2 | Converse | Baby's & Little Kid's All-Star Two-Tone Mid-To... | The iconic mid-top design gets an added dose o... | JustKids/Shoes/Baby024Months/BabyGirl,JustKids... | https://www.saksfifthavenue.com/converse-babys... | Canvas upper Round toe Lace-up vamp SmartFOAM ... | |
| 4 | 01DSK15ZD4D5A0QXA8NSD25YXE | Alexander McQueen | 64MM Rimless Sunglasses | Hexagonal shades offer a rimless view with int... | JewelryAccessories/SunglassesReaders/RoundOval | https://www.saksfifthavenue.com/alexander-mcqu... | 100% UV protection Gradient lenses Adjustable ... | |


### Preprocess text

Before we build our model, we need to preprocess the text to make it easier for the model to learn and predict. In this step, we lowercase the product information (excluding product_id), create a column that combines the brand name, product name, description, and details, and remove stopwords.

In [14]:
## Make everything lowercase so that we won't have to worry about capitalization in analysis

for i in my_df.columns[1:-1]:
    my_df[i] = my_df[i].str.lower()
    
## Replace the slashes with space in brand_category to find key words easily
my_df["brand_category"] = my_df["brand_category"].str.replace("/"," ")
my_df["brand_category"] = my_df["brand_category"].fillna("unknown")


## The prediction should be based on brand, name, description, and details, so we combine them into one column
product = []
for i in range(len(my_df)):
    product.append(str(my_df["brand"][i]) + ' ' + str(my_df["name"][i]) + ' ' 
                   + str(my_df["description"][i]) + ' ' + str(my_df["details"][i]))
    
my_df["product"] = product

In [15]:
## Remove stopwords

stop = list(set(stopwords.words('english')))
stop.extend(["nan",",","","'",".","@"])

In [16]:
product_token = []
for i in my_df["product"]:
    product_token.append(word_tokenize(i))

my_df["product_token"] = product_token

In [17]:
stop_removed_product = []

for product in my_df["product_token"]:
    words = []
    for word in product:
        if word not in stop:
            words.append(word)
    stop_removed_product.append(words)

info = []
for i in stop_removed_product:
    temp = " ".join(i)
    info.append(temp)
    
my_df["stop_removed_product"] = info
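
The lowercase, tokenize, and filter steps above can be sketched in a single dependency-free helper. This is only an illustration: the tiny `STOPWORDS` set stands in for NLTK's English list, the regular expression is a simplified substitute for `word_tokenize`, and the sample sentence is made up.

```python
import re

# A tiny hand-written stopword set standing in for NLTK's English list (assumption).
STOPWORDS = {"a", "an", "the", "in", "with", "and", "nan"}

def clean(text: str) -> str:
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    # Keep hyphenated and apostrophized words such as "tie-neck" in one piece.
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean("The Petite Tie-Neck Top, in a soft fabric"))
```

Lowercasing before the stopword check matters: a capitalized "The" would otherwise slip past a lowercase stopword list.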

### Find maternity and petite size

The sizing category contains two very specific size labels, "maternity" and "petite". In the tagged data, very few products carry the "maternity" label and none carry the "petite" label. For the model to pick up on characteristics related to petite products, the training data must contain examples with this label. Therefore, we manually search for items that explicitly state these tags in the product name. Although this is not the most rigorous approach, we add this step for two reasons:
1. These tags are usually stated explicitly in the product name and refer directly to the product attribute (i.e. a product whose name states "maternity" is usually made explicitly for pregnant women; regular clothing does not typically have "maternity" in its name).
2. The model needs every tag to appear in its training data in order to generate predictions that include each of the tags.

In [18]:
sizing = ["maternity", "petite"]

for i in range(len(my_df)):
    label = word_tokenize(my_df["name"][i])
    
    for word in label:
        if word in sizing:
            my_df.loc[i, "size"] = word
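
The keyword search can also be sketched without an explicit loop. The example below runs on a hypothetical three-row frame and uses a regular expression in place of `word_tokenize`, so it takes the first keyword found in a name and treats hyphens as word boundaries; it is an approximation of the cell above, not a drop-in replacement.

```python
import pandas as pd

# Hypothetical frame: lowercase product names with no size label yet.
toy = pd.DataFrame({
    "name": ["petite tie-neck top", "maternity jeans", "ankle-strap pump"],
    "size": [None, None, None],
})

# Capture the first whole-word occurrence of either keyword; NaN when absent.
found = toy["name"].str.extract(r"\b(maternity|petite)\b", expand=False)

# Overwrite the label only where a keyword was found; keep existing labels otherwise.
toy["size"] = toy["size"].mask(found.notna(), found)

print(toy)
```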

### Label non-clothing items

The sizing tag only pertains to clothing items. Therefore, we need a check that picks up non-clothing items such as jewelry, shoes, and accessories.

In [19]:
non_cloth = ["sunglasse?", "shoe", "sneaker", "pump", "boot", "earring", "necklace", "bracelet", "belt", "bag", "wallet",
             "backpack", "flat", "sandal", "makeup", "case", "scarf", "loafer", "hat"]

## Each keyword may take an optional plural "s"
non_cloth_exp = "|".join(word + "s?" for word in non_cloth)

## Group the alternatives so the word boundaries apply to every keyword, not just the first and last
non_cloth_token = r'\b(?:' + non_cloth_exp + r')\b'
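
One detail worth checking when building a pattern like this: in a regular expression, `\b` binds only to the alternative it touches unless the whole alternation is wrapped in a group. The sketch below uses a hypothetical two-keyword list to show the difference.

```python
import re

keywords = ["bag", "boot"]  # hypothetical short list for illustration
alternation = "|".join(k + "s?" for k in keywords)

# Without a group, the pattern reads as (\bbags?) OR (boots?\b).
ungrouped = re.compile(r"\b" + alternation + r"\b")

# With a non-capturing group, both boundaries apply to every keyword.
grouped = re.compile(r"\b(?:" + alternation + r")\b")

text = "a baggy shirt"
print(bool(ungrouped.search(text)))
print(bool(grouped.search(text)))
```

The ungrouped pattern matches the "bag" inside "baggy", while the grouped one only matches whole words such as "bag" or "boots".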

## 2. Train Embedding

Next, we build the word embeddings for our data set, which means assigning a vector to each word in the product information. We initialize the embeddings from pre-trained GloVe vectors and keep them frozen during training. The word embedding process references the functions and procedures outlined in week 7's "Using RNNs and LSTMs.ipynb" notebook.

In [20]:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.text import text_to_word_sequence
from typing import List


Using TensorFlow backend.


### Define training data

We separate out the training data, which consists of the products whose sizing tags were labeled by industry experts, plus the maternity and petite products that we labeled through the keyword search.

In [21]:
train_df = my_df[my_df["size"].notnull()]

In [22]:
train_df["size"].value_counts()

regular      2931
petite        638
maternity       8
plus            2
tall            1
Name: size, dtype: int64
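
These counts are heavily skewed toward "regular". As a rough sketch of how that skew could be quantified, the snippet below computes the share of the majority class and inverse-frequency class weights. The weighting is one common heuristic and an assumption on our part; the notebook does not apply it.

```python
# Label counts copied from the value_counts() output above.
counts = {"regular": 2931, "petite": 638, "maternity": 8, "plus": 2, "tall": 1}

total = sum(counts.values())
n_classes = len(counts)

# Share of the majority class: the accuracy of always predicting "regular".
majority_share = counts["regular"] / total

# Inverse-frequency weights: total / (n_classes * count), so rare labels weigh more.
class_weight = {label: total / (n_classes * n) for label, n in counts.items()}

print(round(majority_share, 3))
print({label: round(w, 3) for label, w in class_weight.items()})
```

Always predicting "regular" would already be right for about 82% of the training rows, which is a useful baseline when reading the accuracy metric later on.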

In [23]:
## We assign the labels and document

labels = train_df["size"]
docs = train_df["stop_removed_product"]

We use the keras and sklearn libraries to encode the labels. Since we only have a small number of labels (5), we first map them to integers with sklearn's LabelEncoder and then one-hot encode them with keras's to_categorical.

In [25]:
encoder = LabelEncoder()
labels = to_categorical(encoder.fit_transform(labels))

To obtain word embeddings, we must first tokenize the product information.

In [26]:
tokenizer = Tokenizer(num_words=10000, oov_token = "UNKNOWN_TOKEN")
tokenizer.fit_on_texts(train_df["stop_removed_product"].values)

In [27]:
## Define a function that converts each document into a sequence of integer word indices.

def integer_encode_documents(docs, tokenizer):
    return tokenizer.texts_to_sequences(docs)

In [28]:
## Define a function that returns the maximum length of a product information in terms of number of tokens.

def get_max_token_length_per_doc(docs):
    return max(list(map(lambda x: len(x.split()), docs)))

max_length = get_max_token_length_per_doc(docs)

In [29]:
## Define the fixed sequence length: shorter documents are padded and longer ones truncated to this many tokens.

MAX_SEQUENCE_LENGTH = 300

## We encode the training data, where the documents are each product's information (name, brand, description 
## and details)

encoded_docs = integer_encode_documents(train_df["stop_removed_product"].values, tokenizer)

padded_docs = pad_sequences(encoded_docs, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
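
To make the encoding and padding step concrete, here is a NumPy-only sketch of the same idea. The helper `pad_post`, the word index, and the documents are all hypothetical; the helper imitates `pad_sequences` with `padding='post'` and its default behaviour of truncating from the front of sequences that are too long.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` ids of longer ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]  # mirrors the default truncating='pre'
        out[row, :len(kept)] = kept
    return out

# Hypothetical word index; id 1 is reserved for the out-of-vocabulary token.
word_index = {"UNKNOWN_TOKEN": 1, "petite": 2, "top": 3, "silk": 4}
docs = ["petite silk top", "petite cashmere top silk top silk"]

# Unknown words fall back to the out-of-vocabulary id.
encoded = [[word_index.get(w, 1) for w in d.split()] for d in docs]
padded = pad_post(encoded, maxlen=4)

print(padded)
```

The first document is padded with a trailing 0, and the second loses its first two tokens, so with front truncation the start of a long description is what gets dropped.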

In [30]:
X_train = padded_docs
y_train = labels

In [31]:
## Set a maximum vocabulary size as 1.1 times the number of unique words in the training data.

VOCAB_SIZE = int(len(tokenizer.word_index) * 1.1)

In [32]:
## Load in the GloVe vectors.

def load_glove_vectors():
    embeddings_index = {}
    with open('glove.6B.100d.txt', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print('Loaded %s word vectors.' % len(embeddings_index))
    return embeddings_index


embeddings_index = load_glove_vectors()

Loaded 400000 word vectors.


In [33]:
## Create a weight matrix for words in training docs.

embedding_matrix = zeros((VOCAB_SIZE, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: 
        embedding_matrix[i] = embedding_vector
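
The weight-matrix construction can be sketched on toy data to show what happens to words that have no pre-trained vector. The three-dimensional vectors and the word index below are made up; the notebook uses 100-dimensional GloVe vectors.

```python
import numpy as np

EMBED_DIM = 3  # toy dimension for illustration only

# Hypothetical pre-trained vectors and tokenizer word index.
glove = {"petite": np.array([0.1, 0.2, 0.3]), "top": np.array([0.4, 0.5, 0.6])}
word_index = {"petite": 1, "top": 2, "shirttail": 3}

# Row 0 is left for padding; rows for words without a vector stay all-zero.
matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, idx in word_index.items():
    vec = glove.get(word)
    if vec is not None:
        matrix[idx] = vec

missing = [w for w in word_index if w not in glove]
print(matrix)
print(missing)
```

Listing the missing words is a cheap way to see how much of the vocabulary the pre-trained vectors cover. Note that any layer that masks all-zero timesteps will treat these uncovered words the same way as padding.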

## 3. Define Model

Once we have the training data set represented with word embeddings, we can train an LSTM model that takes the embeddings as input. 

In [34]:
def make_lstm_classification_model():
    from keras.layers import LSTM, Masking
    model = Sequential()
    # Frozen GloVe embeddings: the weights are loaded, not trained
    model.add(Embedding(VOCAB_SIZE, 100, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False))
    # Skip timesteps whose embedding is all zeros (padding and words without a GloVe vector)
    model.add(Masking(mask_value=0.0))
    model.add(LSTM(units=32))
    model.add(Dense(16))
    model.add(Dense(5, activation='softmax'))

    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    # Summarize the model
    model.summary()

    return model

In [35]:
model = make_lstm_classification_model()
model.fit(X_train, y_train, epochs=5, verbose=1)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 100)          775100    
_________________________________________________________________
masking_1 (Masking)          (None, 300, 100)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                17024     
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 85        
Total params: 792,737
Trainable params: 17,637
Non-trainable params: 775,100
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x1596cd210>
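
The parameter counts in the summary can be re-derived by hand, which is a quick sanity check on the architecture. In the sketch below the vocabulary size of 7,751 is inferred from the embedding's 775,100 parameters divided by the 100 embedding dimensions; the remaining sizes are read from the model definition.

```python
vocab_size, embed_dim = 7751, 100  # 775,100 / 100, inferred from the summary
lstm_units, dense_units, n_classes = 32, 16, 5

# Embedding: one vector per vocabulary entry (frozen, so non-trainable).
embedding = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense layers: weights plus one bias per output unit.
dense_1 = lstm_units * dense_units + dense_units
dense_2 = dense_units * n_classes + n_classes

trainable = lstm + dense_1 + dense_2
print(embedding, lstm, dense_1, dense_2, trainable, embedding + trainable)
```

Only the LSTM and the two dense layers are updated during training, which is why the trainable count is a small fraction of the total.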

## 4. Define Function

Now that the model is trained, we define a function that takes a product's information as a list of strings and uses the trained model to predict the product's sizing. The function first preprocesses the input by converting the words to lowercase and removing stopwords. It then checks whether any part of the product information indicates a non-clothing item. If not, it uses the model to predict the sizing.

In [36]:
def pred_size(doc):

    stop = set(stopwords.words('english'))
    stop.update(["nan", ",", "", "'", ".", "@", "/", ":", ";", "_"])

    cleaned_txts = []
    for text in doc:
        # Lowercase before the stopword check so capitalized stopwords ("Our", "The") are removed too
        words = [word.lower() for word in word_tokenize(text)]
        cleaned_txts.append(" ".join(word for word in words if word not in stop))

    prod = [" ".join(cleaned_txts)]

    # Short-circuit for non-clothing items, which have no sizing tag
    if re.search(non_cloth_token, prod[0]):
        item = "Not clothing item"

    else:
        encoded_text = integer_encode_documents(prod, tokenizer)
        padded_text = pad_sequences(encoded_text, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

        # Pick the most probable class and map it back to its label
        prediction = np.argmax(model.predict(padded_text), axis=-1)
        item = encoder.inverse_transform(prediction)

    return item
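
The final decoding step, from softmax probabilities back to a size label, can be sketched in NumPy alone. The probabilities below are hypothetical, and the class order assumes the alphabetical ordering that `LabelEncoder` produces for the five labels in the training data.

```python
import numpy as np

# Classes in the sorted order LabelEncoder assigns to the five training labels.
classes = np.array(["maternity", "petite", "plus", "regular", "tall"])

# Hypothetical softmax output for a single product.
probs = np.array([[0.01, 0.10, 0.01, 0.87, 0.01]])

# argmax picks the most probable class index, which indexes back into the labels.
predicted = classes[np.argmax(probs, axis=-1)]

print(predicted)
```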

Test on a product that's not a clothing item.

In [37]:
pred_size([
    "description: metal sunglasses in translucent tortoise pattern   140mm lens width; 54mm bridge width; 22mm temple length  100% uv protection  tinted lenses  adjustable nose pads  metal  made in italy",
    "brand: prada",
    "brand_category: translucent tortoise sunglasses"
])
    

'Not clothing item'

Test on a product that is a clothing item.

In [38]:
pred_size([
    "description: Our easy-fit shirt that skims your shape for a relaxed silhouette. Now in an updated, slimmer fit and a shorter length that’s perfect with high-waisted bottoms. Made with a soft satin fabric that uses a jacquard weaving technique to create a tonal leopard print. Spread collar. Long sleeves with buttoned cuffs. Button front placket. Back yoke seam with box pleats. Shirttail hem.",
    "brand: Banana Republic",
    "brand_category: Dillon classic-fit utility shirt"
])

array(['regular'], dtype=object)