<p style="text-align: center;"> <span style="color:firebrick"> <font size="5"> <b> USC Marshall School of Business </b> </font> </span> </p>

<p style="text-align: center;"> <font size="5"> <b> DSO 560 - Final Project </b> </font> </p>

<p style="text-align: center;"> <b> Spring 2021 </b> </p>

## <span style="color:black"> <font size="3">Group Maroon: Andrew Chuang, Janicia Chang, Ningchuan Peng, Zijing Wu</font> </span>

In [11]:
# import relevant libraries
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
import spacy
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
import xgboost as xgb
import warnings
warnings.filterwarnings("ignore")

## Question 1: Classification

### Executive Summary
For this question, we build a classification model to predict the brand that each product belongs to.
- We keep only the top 30 brands and label all other brands as "Other".
- We utilize all of the columns that contain relevant text for prediction.
- Basic preprocessing: Lower case, Stopwords removal, Lemmatization with part of speech tagging.
- Two different vectorization techniques are tried: Document Embeddings and TF-IDF Vectorization.
- XGBoost classifier is used for modeling and 5-fold cross-validation is used to evaluate different vectorization methods.
- TF-IDF Vectorization technique achieves higher accuracy and is chosen as our final model.

In [2]:
# read datasets
# read_excel takes no encoding argument in recent pandas versions; Excel files store their own encoding
product_df = pd.read_excel('Behold+product+data+04262021.xlsx')
additional_tags = pd.read_csv('usc_additional_tags USC.csv', encoding = 'latin1')
outfit_df = pd.read_csv('outfit_combinations USC.csv')

In [3]:
# group additional tags by product_id
tags = additional_tags.groupby(['product_id']).agg(' '.join)

In [4]:
# join additional tags with products dataset using product_id
df = pd.merge(product_df, tags, on = 'product_id', how = 'left')
df = df.fillna('Unknown').astype(str)

In [5]:
# keep the top 30 brands and label all other brands as 'Other'
top_30 = df['brand'].value_counts().nlargest(30).keys()
df['brand_top30'] = df['brand'].apply(lambda x: x if x in top_30 else 'Other')
# df was cast to str above, so compare against the string form of the flag
df['product_active'] = df['product_active'].apply(lambda x: 1 if str(x).lower() == 'true' else 0)

In [6]:
# make a new column that contains all relevant text
df['text'] = df[['brand_category', 'name', 'details', 'description', 'attribute_value']].agg(' '.join, axis = 1)

In [7]:
# convert all text into lower case
df['text'] = df['text'].str.lower()

In [8]:
# define a function to lemmatize sentences with Part of Speech tagging
lemmatizer = WordNetLemmatizer()
# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [9]:
# define a function to remove stopwords and lemmatize a sentence
# build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english')) | {'unknown'}

def text_cleaning(x):
    # drop stopwords and the 'unknown' placeholder introduced by fillna
    new_words = [word for word in x.split() if word not in stop_words]
    cleaned_text = " ".join(new_words)
    return lemmatize_sentence(cleaned_text)

In [12]:
# create a new column after text cleaning
df['cleaned_text'] = df['text'].apply(text_cleaning)

In [15]:
df['cleaned_text'][1]

'ruffle market dress loopy pink sistine tomato mid-length dress ruffle adjustable strap . bias cut . side seam invisible zipper make new york model wear size small 100 % rise sylk rise sylk organic cellulose fiber make natural waste rise bush stem .'

In [16]:
# load pre-trained word embeddings in spaCy and apply them to the dataset
nlp = spacy.load('en_core_web_md')
doc = df['cleaned_text'].astype(str).apply(nlp)

In [17]:
# get the document embeddings 
# each document is represented as a vector
emb_array = np.array(list(doc.apply(lambda x: list(x.vector))))
embeddings = pd.DataFrame(emb_array, columns = range(300))
embeddings.shape

(61355, 300)
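spaCy documents that `Doc.vector` defaults to the average of the token vectors, which is how each product description becomes a single 300-dimensional row above. The following is a minimal sketch of that averaging step, using made-up 3-dimensional vectors; the helper `mean_vector` and the toy values are illustrative assumptions, not part of the project code.

```python
def mean_vector(token_vectors):
    """Average a list of equal-length token vectors component-wise."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Toy vectors invented for illustration; en_core_web_md vectors have 300 dimensions
toy_vectors = {
    "pink": [1.0, 0.0, 2.0],
    "dress": [3.0, 4.0, 0.0],
}

# A "document" is represented by the mean of its token vectors
doc_embedding = mean_vector([toy_vectors[word] for word in "pink dress".split()])
print(doc_embedding)  # [2.0, 2.0, 1.0]
```

Because every token contributes equally to the average, frequent but uninformative words can dilute the signal, which may partly explain why TF-IDF performs slightly better in the comparison below.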

In [18]:
# use 5-fold CV to evaluate model performance
kfolds_classification = StratifiedKFold(n_splits = 5, random_state = 0, shuffle = True) 
# fit an XGBoost classification model with document embeddings as x and brand names as y
xgb_classification = xgb.XGBClassifier(eval_metric = 'merror')
xgb_accuracy_cv = cross_val_score(xgb_classification, embeddings, df['brand_top30'], cv = kfolds_classification)
print("Document Embeddings: \n")
print("Classification error of 5-folds: ",1-xgb_accuracy_cv)
print("Mean classification error:",1-np.mean(xgb_accuracy_cv))

Document Embeddings: 

Classification error of 5-folds:  [0.06372749 0.06299405 0.06364599 0.06494988 0.06160867]
Mean classification error: 0.06338521717871404


In [19]:
# using tf-idf vectorizer to vectorize the dataset
idf_vectorizer = TfidfVectorizer(ngram_range=(1,2),
                                 max_features=1000,
                                 min_df=5)
tfidf = idf_vectorizer.fit_transform(df['cleaned_text'].astype(str))
y = df['brand_top30']

In [20]:
# fit an XGBoost classification model with tfidf scores as x and brand names as y
xgb_classification = xgb.XGBClassifier(eval_metric = 'merror')
xgb_accuracy_cv = cross_val_score(xgb_classification, tfidf, y, cv = kfolds_classification)
print("TF-IDF Vectorization: \n")
print("Classification error of 5-folds: ",1-xgb_accuracy_cv)
print("Mean classification error:",1-np.mean(xgb_accuracy_cv))

TF-IDF Vectorization: 

Classification error of 5-folds:  [0.05574118 0.05696357 0.05631163 0.05834895 0.05345938]
Mean classification error: 0.05616494173254016


### Conclusion and Reflection
- Overall, the two models perform similarly (mean classification error of 6.3% with document embeddings vs. 5.6% with TF-IDF), and both beat our expectations. We achieved the 5.6% mean classification error with an XGBoost classification model without any hyperparameter tuning.
- With more time and computing power, we would try deep learning models such as RNNs and LSTMs to see whether we can further improve our results.

## Question 2: Recommendation

### Executive Summary
To solve the problem of recommending an outfit to a customer, our group has come up with an algorithm:
- We use Regex to classify the products into five categories: Top, Bottom, One piece, Shoes, and Accessory. 
- We get document embeddings of the query. 
- We calculate the cosine distance between the query and each product and find the product with the lowest cosine distance.
- If the product is in the expert-defined outfit dataset, we use its outfit combination directly.
- If the product is not in the outfit dataset, we will use the product with the lowest cosine distance in each category to form an outfit combination.
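The cosine distance used in the steps above is one minus the cosine similarity, which matches the definition of `scipy.spatial.distance.cosine`. Below is a minimal pure-Python sketch of the distance and of the nearest-product lookup; the function names and the toy 2-dimensional vectors are assumptions made for illustration only.

```python
import math

def cosine_distance(u, v):
    """Return 1 - (u . v) / (||u|| * ||v||) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest(query_vec, candidates):
    """Return the key of the candidate vector closest to the query."""
    return min(candidates, key=lambda k: cosine_distance(candidates[k], query_vec))

# Hypothetical product embeddings (2 dimensions instead of 300)
toy_products = {
    "prod_a": [1.0, 0.0],
    "prod_b": [0.0, 1.0],
    "prod_c": [1.0, 1.0],
}

best = nearest([2.0, 0.1], toy_products)
print(best)  # prod_a
```

Cosine distance ignores vector length, so a short query and a long product description can still be close as long as their embeddings point in a similar direction.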

In [21]:
# create a function to replace words
def replace(title, word, reword):
    # substitute every case-insensitive match of `word` with `reword`
    if isinstance(title, str):
        title = re.sub(word, reword, title, flags=re.IGNORECASE)
    return title

# create a function to find the category
def define_cate(title, word):
    count = 'None'
    if isinstance(title, str):
        count = re.findall(word, title, re.IGNORECASE)
        if len(count) == 0:
            count = 'None'
        else:
            count = count[-1]
    return count.capitalize()

In [22]:
# create words for replacement
top_replace = r'\b(top|shirt|t-shirt|hoodie|jacket|sweatshirt|coat|sweater|tee)\b'
bot_replace = r'\b(bottom|shorts|dress|skirt|pants?|trousers?|jeans?|leg)\b'
one_replace = r'\b(one piece|gown|swimsuit|overcoat|blouse|parka|blazer|jumpsuit|romper)\b'
shoe_replace = r'\b(shoes?|heels?|flip flops?|boots?|sneakers?|sandals?|loafers?|flats?|mules?)\b'
acce_replace = r'\b(handbag|bag|backpack|purse|tote|hat)\b'

# replace the words in different categories
df['name_cleaned'] = df['name'].apply(lambda x: x.lower())
df['name_cleaned'] = df['name_cleaned'].apply(lambda x: replace(x, top_replace, r'TOP'))
df['name_cleaned'] = df['name_cleaned'].apply(lambda x: replace(x, bot_replace, r'Bottom'))
df['name_cleaned'] = df['name_cleaned'].apply(lambda x: replace(x, one_replace, r'One Piece'))
df['name_cleaned'] = df['name_cleaned'].apply(lambda x: replace(x, shoe_replace, r'Shoe'))
df['name_cleaned'] = df['name_cleaned'].apply(lambda x: replace(x, acce_replace, r'Accessory'))

# classify the products to 5 big categories
big_category = r'\bTop|Bottom|One Piece|Shoe|Accessory\b'
df['product_category'] = df['name_cleaned'].apply(lambda x: define_cate(x, big_category))
df['product_category'].value_counts()

None         30598
Bottom       12279
Top          11985
Shoe          2851
One piece     2094
Accessory     1548
Name: product_category, dtype: int64

From the value counts above, we can see that about 30,000 products (30,598) still have no defined category. This is because we only used the name column for our regex matching. Since we rely on product categories for our recommendations, we cannot afford to miscategorize a top as a bottom. Hence, we decided to optimize for precision and sacrifice recall.
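One detail that affects precision is how word boundaries interact with alternation. In a pattern such as `\bTop|Bottom|One Piece|Shoe|Accessory\b`, the leading `\b` binds only to the first alternative and the trailing `\b` only to the last, so the middle alternatives can match inside longer words. The sketch below compares that pattern with a grouped variant on a few made-up product names; the names and the helper `last_category` are hypothetical, and we have not measured how many real product names would be affected.

```python
import re

# Boundaries attach only to the first and last alternatives
loose = re.compile(r'\bTop|Bottom|One Piece|Shoe|Accessory\b', re.IGNORECASE)
# A non-capturing group applies both boundaries to every alternative
strict = re.compile(r'\b(?:Top|Bottom|One Piece|Shoe|Accessory)\b', re.IGNORECASE)

def last_category(pattern, name):
    """Return the last category keyword found in the name, or 'None'."""
    matches = pattern.findall(name)
    return matches[-1].capitalize() if matches else 'None'

# Made-up names for illustration only
for name in ['silk TOP', 'topaz ring', 'horseshoe charm', 'wool scarf']:
    print(f'{name}: loose={last_category(loose, name)}, strict={last_category(strict, name)}')
```

With the grouped pattern, names such as "topaz ring" or "horseshoe charm" fall back to `None` instead of being assigned to Top or Shoe, which is in line with the precision-first choice described above.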

In [23]:
# create a function for giving back the outfit 
def give_back_result(product_id, cos_dis):
    
    # if the product is in the outfit combination dataset
    if product_id in list(outfit_df['product_id']):
        print('We find an outfit.')
        
        # get the best outfit combination
        outfit_id = list(outfit_df[outfit_df['product_id'] == product_id]['outfit_id'])[0]
        final_outfit = outfit_df[outfit_df['outfit_id'] == outfit_id]
        
        # print out the final result
        for i in range(final_outfit.shape[0]):
            out_type = final_outfit.iloc[i, 2].capitalize()
            out_brand = final_outfit.iloc[i, 3]
            out_prod = final_outfit.iloc[i, 4]
            out_id = final_outfit.iloc[i, 1]
            print(f'{out_type}: {out_brand} {out_prod} ({out_id})')
    
    # if the product is not in the outfit combination dataset
    else:
        print("We use products combination instead of an outfit.")
        
        # get the products with the lowest cosine distance in 5 categories
        # materialize the dict view as a list so it aligns with df row by row
        df['cos_dis'] = list(cos_dis.values())
        big_cate_list = ['Top', 'Bottom', 'One piece', 'Shoe', 'Accessory']
        best_comb = {}
        for i in range(len(big_cate_list)):
            cate = df[df['product_category']== big_cate_list[i]].\
                   sort_index().sort_values('cos_dis', kind='mergesort').reset_index(drop=True)        
            best_comb[cate.loc[0, 'product_category']] = [cate.loc[0, 'cos_dis'], 
                                                          cate.loc[0, 'brand'],
                                                          cate.loc[0, 'name'],
                                                          cate.loc[0, 'product_id']]
            
        # choose one piece or top and bottom based on the average cosine distance
        if best_comb['One piece'][0] > (best_comb['Top'][0]+best_comb['Bottom'][0])/2:
            del best_comb["One piece"]
        else:
            del best_comb["Top"]
            del best_comb["Bottom"]
        
        # print out the final result
        for key in best_comb.keys():
            print(f'{key}: {best_comb[key][1]} {best_comb[key][2]} ({best_comb[key][3]})')

In [24]:
# create a function for searching for products
def search(query):
    
    # get document embeddings of the query
    query_nlp = nlp(query).vector
    
    # calculate the cosine distance between query and other products
    cos_dis = {}
    for i in range(embeddings.shape[0]):
        cos_dis[i] = cosine(embeddings.loc[i,], query_nlp) 
    
    # find the product with the lowest cosine distance
    product_id = df.iloc[min(cos_dis, key=cos_dis.get),0]
    
    # run the function for giving back the outfit 
    give_back_result(product_id, cos_dis)

In [25]:
# run the final test; leave the input empty to use the example query
test_query = input('Please input your search query: ')
if test_query == '':
    test_query = 'slim fitting, straight leg pant with a center back zipper and slightly cropped leg'
    print('Example: slim fitting, straight leg pant with a center back zipper and slightly cropped leg')
    print()
search(test_query)

Please input your search query: 
Example: slim fitting, straight leg pant with a center back zipper and slightly cropped leg

We use products combination instead of an outfit.
One piece: A.L.C. Joana Jumpsuit (01EDYCA0E5S1N651MTXRA7AX60)
Shoe: Citizens of Humanity Emannuelle Slim Boot (01EB2DRYX1B0QGMRBC1D3GS1C9)
Accessory: A.L.C. Sadie Handbag (01EDYE843YM31CHVY8ED79DQYW)


In this example, we use the query 'slim fitting, straight leg pant with a center back zipper and slightly cropped leg'. The product with the lowest cosine distance to this query, based on document embeddings, is not in the outfit combinations dataset. Thus, we take the product with the lowest cosine distance in each category. Since the best One piece product's cosine distance is not greater than the average of the best Top and Bottom distances, the function returns a product combination of One piece, Shoe, and Accessory.
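The choice between a one-piece outfit and a top-and-bottom outfit can be isolated into a small function. This is a minimal sketch of that decision rule with made-up distances; `choose_outfit` and the toy product labels are illustrative assumptions rather than the notebook's own code.

```python
def choose_outfit(best_by_category):
    """Keep either the one piece or the top/bottom pair.

    best_by_category maps a category name to (cosine_distance, product_label).
    """
    chosen = dict(best_by_category)  # copy so the input is not modified
    split_avg = (chosen['Top'][0] + chosen['Bottom'][0]) / 2
    if chosen['One piece'][0] > split_avg:
        # the top/bottom pair is closer to the query on average
        del chosen['One piece']
    else:
        # the one piece is at least as close as the pair's average
        del chosen['Top']
        del chosen['Bottom']
    return chosen

# Toy distances invented for illustration
toy_best = {
    'Top': (0.30, 'toy top'),
    'Bottom': (0.20, 'toy bottom'),
    'One piece': (0.22, 'toy jumpsuit'),
    'Shoe': (0.40, 'toy boot'),
    'Accessory': (0.50, 'toy handbag'),
}

outfit = choose_outfit(toy_best)
print(list(outfit))  # ['One piece', 'Shoe', 'Accessory']
```

Here the one piece (0.22) beats the top/bottom average (0.25), so the pair is dropped. Ties also go to the one piece because the comparison is a strict greater-than.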