# Outfit Recommendation System (Part 2)


For Part-B, our objective was to build an algorithm that recommends an outfit given a customer’s search query. To do this, we created a funnel through which the user’s query is passed.

At each stage of the funnel, we use cosine similarity plus a rule-based heuristic for edge cases to produce a complete outfit comprising at least 3 of the following categories: top, bottom, one-piece, shoe, and accessories.

**Steps to get a recommendation:**
    
- Step 1: Run the sections from the Loading Stage to the Main Function (3.2).
- Step 2: Call the `outfit( )` function on the new query. This will produce an outfit recommendation.

## Importing Packages and Loading Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from tqdm import tqdm
import re
import sys
import string
from collections import Counter
import spacy
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cosine
from fuzzywuzzy import fuzz
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/sneharaj/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sneharaj/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
product=pd.read_excel('Behold+product+data+04262021.XLSX')
combo=pd.read_csv('outfit_combinations USC.csv')

## Text Cleaning and Preprocessing

This section uses regex and pandas to clean and prepare the data. The goal of this step is to set the stage for our outfit recommendation script. Please note that some of the preprocessing steps included below were not used in our final function; we left them in the notebook for exploration purposes.

### Data type and text case cleaning

In [3]:
# Datatype and case check
combo['brand']=combo['brand'].apply(lambda x: x.lower())
combo['product_full_name']=combo['product_full_name'].apply(lambda x: x.lower())
product['brand']=product['brand'].astype(str)
product['brand_name']=product['brand_name'].astype(str)
product['brand']=product['brand'].apply(lambda x: x.lower())
product['brand_name']=product['brand_name'].apply(lambda x: x.lower())
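# Note: this rebinds the imported `stopwords` module to a plain list, so the cell can only be run once per session
stopwords=list(stopwords.words('english'))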

### Regex cleaning

#### Replacing html characters

In [4]:
# Check for html and other processing tags
product.fillna('None',inplace=True)
product.description=product.description.str.replace('\n',' ')
product.description=product.description.str.replace('\r',' ')
# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching
product.description=product.description.str.replace(r'[^\w\s]|_',' ',regex=True)
product.description=product.description.str.lower()

#### Processing cities

In [5]:
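# Capturing different versions of place names
# (regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching)
product['description']=product['description'].str.\
                                replace(r'\bnew\b\s\byork\b\s(?:\bcity\b)?','new_york_city ',regex=True)
# Note: 'usa?' also matches the standalone word "us"
product['description']=product['description'].str.\
                                replace(r'\b(?:the\s)?usa?\b','USA',regex=True)
product['description']=product['description'].str.\
                                replace(r'\b(?:the\s)?united\sstates\b','USA',regex=True)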
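
The cleaning steps above can be tried on a single string with the standard `re` module. This is a minimal sketch, not the notebook's pandas code: `clean_description` and the sample sentence are made up for illustration, and the `usa?` pattern is left out because it also rewrites the ordinary word "us".

```python
import re

def clean_description(text):
    """Apply the notebook's cleaning patterns to one string (illustrative helper)."""
    text = text.replace('\n', ' ').replace('\r', ' ')
    text = re.sub(r'[^\w\s]|_', ' ', text)   # punctuation and underscores become spaces
    text = text.lower()
    # Place-name normalization, using the same patterns as the cell above
    text = re.sub(r'\bnew\b\s\byork\b\s(?:\bcity\b)?', 'new_york_city ', text)
    text = re.sub(r'\b(?:the\s)?united\sstates\b', 'USA', text)
    return text

sample = "Made in New York City.\nShips across the United States!"
print(clean_description(sample))
```

Note that the order matters: the text is lowercased before the place-name patterns run, which is why those patterns are written in lowercase.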

#### Processing product categories

In [6]:
# Capturing category
bottom_seq=r'\b(capri?|leggings?|bottoms?|skirts?|sweatpants?|pants?|jeans?|midi|trousers?|shorts|trunks?)\b'
one_piece_seq=r'\b(kimono|jumpsuit|dress(?:es)?|gowns?|swimsuit?|onesies?|unitards?|bodysuits?|rompers?|one ?piece)\b'
shoe_seq=r'\b(shoes?|sneakers?|flats|boots?|heels?|sandals?|mules?|loafers?|pumps?)\b'
top=r'\b(tank|caftan|hoodies?|tshirts?|tees?|tops?|blouses?|jackets?|blazers?|shirts?|coats?|suits|sweaters?|sweatshirts?)\b'
acc=r'\b(capes?|socks?|earrings?|belts?|gloves?|headbands?|ties?|hats?|caps?|bags?|handbags?|clutch(?:es)?|wallets?|purses?|duffels?|scar(?:f|ves)?|wraps?|stoles?|shawl)\b'
linen=r'\b(linens?)\b'
lingerie=r'\b(bras?)\b'

dict_seq={'one_piece':one_piece_seq,
          'shoe':shoe_seq,
          'top':top,'acc':acc,'linen':linen,
          'bottom':bottom_seq,'lingerie':lingerie}

# Category flag
for d in dict_seq:
    product[f'{d}_check']=0
    for col in ['name','description','details','brand_category']:
        product[f'{d}_check']=product[f'{d}_check']+product[col].str.contains(dict_seq[d],case=False)
         
product['max_value_cat']=product[["bottom_check","shoe_check","one_piece_check",'acc_check','top_check','linen_check','lingerie_check']].max(axis=1)

# Function to identify the category with the most matches.
# Ties are broken by the key order of dict_seq (the first tied key wins).
def max_presence(row):
    if row['max_value_cat']==0:
        return 'None'
    for d in dict_seq.keys():
        if row['max_value_cat']==row[f'{d}_check']:
            return d
    
product['final_category']=product.apply(max_presence,axis=1)
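
To make the category flagging and tie-breaking concrete, here is a small stdlib sketch. The three patterns are shortened stand-ins for the full lists above, and `categorize` is a hypothetical helper, not a function from the notebook. It counts how many text fields mention each category and returns the first category, in dictionary order, that reaches the maximum count.

```python
import re

# Shortened stand-ins for the notebook's patterns (assumption: not the full lists)
PATTERNS = {
    'one_piece': r'\b(dress(?:es)?|jumpsuits?)\b',
    'top': r'\b(shirts?|blouses?|tops?)\b',
    'bottom': r'\b(skirts?|jeans?|pants?)\b',
}

def categorize(fields):
    """Return (category, counts) for a list of text fields describing one product."""
    counts = {
        cat: sum(bool(re.search(pat, field, re.IGNORECASE)) for field in fields)
        for cat, pat in PATTERNS.items()
    }
    best = max(counts.values())
    if best == 0:
        return 'None', counts
    # Dictionary order decides ties, mirroring the loop in max_presence
    for cat in PATTERNS:
        if counts[cat] == best:
            return cat, counts

print(categorize(['Marion Skirt', 'pairs with a white top']))
```

In this example "top" and "bottom" each match one field, and "top" wins only because it comes first in the dictionary. A skirt whose description mentions a top can therefore be labelled as a top, so the key order is worth choosing deliberately.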

#### Processing colors

In [7]:
def findColors(txt):
    """ Function to determine the color of item """
    colors_re=r'\b(beige|light brown|black|blue ?green|blue|brown|umber|burgundy|gold(?:en)?|gray|grey|green|navy|neutral|orange|aurantia|pink|purple|violet|red|scarlet|silver|teal|white|yellow|(?:multi(?:ple)?|several|different|many|more than one) ?colou?rs?)\b'
    return re.findall(colors_re,str(txt),re.IGNORECASE)

product['color_list']=product['description'].apply(findColors)+product['name'].apply(findColors)
product['colors']=product['color_list'].apply(lambda x: set(y.lower() for y in x))

product['n_colors']=product['colors'].apply(len)
product['final_color']=np.nan
product.loc[product['n_colors']>1,'final_color']='multi'
product.loc[product['n_colors']==1,'final_color']=product.loc[product['n_colors']==1,'colors'].\
                                                                        apply(lambda x:list(x)[0])

# Products with multiple colors
multi_pattern=r'\b(?:multi(?:ple)?|several|different|many|more than one\b)\s? ?colou?rs?'
for i in tqdm(range(len(product))):
    match=re.search(multi_pattern,
                       str(product.loc[i,'final_color']))
    if match:
        product.loc[i,'final_color']='multi'

100%|██████████| 61355/61355 [00:03<00:00, 19428.43it/s]
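
The color logic above reduces to "collect distinct color words, then collapse to a single label". A minimal stdlib sketch follows, assuming a shortened color list; `final_color` here is an illustrative function and not the notebook's column of the same name.

```python
import re

# Shortened color list for illustration (assumption: the notebook's pattern is longer)
COLORS_RE = r'\b(black|blue|white|pink|red)\b'

def final_color(*texts):
    """Return one color, 'multi' for several distinct colors, or None if none is found."""
    found = set()
    for text in texts:
        found.update(c.lower() for c in re.findall(COLORS_RE, str(text), re.IGNORECASE))
    if len(found) > 1:
        return 'multi'
    if len(found) == 1:
        return next(iter(found))
    return None

print(final_color('blue and white stripes', 'white tee'))
print(final_color('PINK dress'))
```

Using a set means a color repeated across the name and the description counts once, so only genuinely different colors produce the 'multi' label.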


#### Combining data

In [8]:
# We consider only the top 30 brands within the scope of this project
topbrands=product['brand'].value_counts()[:30].index.tolist()
product=product[product['brand'].isin(topbrands)]

# Dealing with null values
product=product[product['brand_name'].notnull()]
product=product[product['final_category'].notnull()]
product=product[product['final_color'].notnull()]
product.reset_index(drop=True,inplace=True)

In [9]:
# Concatenating columns
combo['new_brand_product']=combo['brand'].astype(str)+' '+combo['product_full_name'].astype(str)
product['new_brand_product']=product['brand'].astype(str)+' '+product['final_color'].astype(str)+' '+product['brand_name'].astype(str)

**In summary:**

> _The combo dataset corpus for TFIDF includes the concatenation of the "brand" column and "product_full_name" column._

> _The product dataset corpus for TFIDF includes the concatenation of the "brand" column, "brand_name" column, and "final_color" column._

##### A quick peek into the final processed DataFrames

In [10]:
combo.head(2)

| | outfit_id | product_id | outfit_item_type | brand | product_full_name | new_brand_product |
|---|---|---|---|---|---|---|
| 0 | 01DDBHC62ES5K80P0KYJ56AM2T | 01DMBRYVA2P5H24WK0HTK4R0A1 | bottom | eileen fisher | slim knit skirt | eileen fisher slim knit skirt |
| 1 | 01DDBHC62ES5K80P0KYJ56AM2T | 01DMBRYVA2PEPWFTT7RMP5AA1T | top | eileen fisher | rib mock neck tank | eileen fisher rib mock neck tank |


Below is the final product dataset.

In [11]:
product.head(2)

| | product_id | brand | brand_category | name | details | created_at | brand_canonical_url | description | brand_description | brand_name | ... | linen_check | bottom_check | lingerie_check | max_value_cat | final_category | color_list | colors | n_colors | final_color | new_brand_product |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01F0C4SKZV6YXS3265JMC39NXW | collina strada | Unknown | RUFFLE MARKET DRESS LOOPY PINK SISTINE TOMATO | | 2021-03-09 18:43:10.457 UTC | https://collina-strada-2.myshopify.com/product... | mid length dress with ruffles and adjustable s... | Mid-length dress with ruffles and adjustable s... | ruffle market dress loopy pink sistine tomato | ... | 0 | 0 | 0 | 2.0 | one_piece | [PINK] | {pink} | 1 | pink | collina strada pink ruffle market dress loopy ... |
| 1 | 01EW16A4J6YS4XS4JEWXW3Q46C | misa | Unknown | MARION SKIRT | | 2021-01-14 19:35:14.245 UTC | https://misa-los-angeles.myshopify.com/product... | one of our signature silhouettes is back in a ... | One of our signature silhouettes is back in a ... | marion skirt | ... | 0 | 1 | 0 | 1.0 | top | [blue, white, white] | {blue, white} | 2 | multi | misa multi marion skirt |


# Recommendation System

The outfit recommendation system is a function called `outfit( )`:
- The user passes the search terms to the function as its argument
- The function prints the recommended outfit as category : product pairs, normally with at least 3 products (the edge-case queries handled by Part B return two)

> _Our recommendation system is a rule-based algorithm driven by TFIDF and cosine similarity._


**The function comprises two parts:**
- **Part A:** Uses a similarity measure to make recommendations. Part A comprises three stages:
    - **part_a_1** extracts a predefined outfit directly from the combinations dataset when the cosine similarity between the query and a product in the **combinations dataset** is at least 0.5.
    - **part_a_2** handles queries that fall below the 0.5 similarity threshold in part_a_1. This stage searches the **Behold products dataset** (product) for the product with the highest similarity to the query. That product then becomes the "new search term", and we aim to return an outfit in the combinations dataset that contains a product similar to the "new search term". A checkpoint, described above the function definition, is applied here.
    - **part_a_3** processes the queries that fail to meet the requirements of both part_a_1 and part_a_2. This stage searches the product dataset for the highest cosine similarity match. We then extract the brand of this product to create a **brand-based** outfit.
    
- **Part B:** Designed to deal with edge cases where the user enters a very specific query like "skirt goes with blue shirt". In this case, we parse out both the product the user is looking for and the product they want to match it with, and return an outfit with maximum similarity.

_Note: We chose a sequential process because we wanted to maximize the chance of returning an outfit curated by a fashion expert._
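
The sequential process can be sketched as a generic fallback loop. This is an illustration under stated assumptions: `run_funnel` and the three stand-in stages are hypothetical, each stage returns a `(recommendation, score)` pair like the supporting functions, and the sketch leaves out Part B and the early exit for queries with no match at all.

```python
def run_funnel(query, stages, threshold=0.5):
    """Try each stage in order and keep the first result whose score clears the threshold.

    The last stage acts as the fallback and is accepted regardless of its score.
    """
    result, score = {}, 0
    for position, stage in enumerate(stages):
        result, score = stage(query)
        is_last = position == len(stages) - 1
        if score >= threshold or is_last:
            break
    return result, score

# Hypothetical stand-ins for the three Part A stages
def curated_lookup(query):
    return {'top': 'curated tee'}, 0.3

def catalog_then_curated(query):
    return {'top': 'catalog-matched tee'}, 0.7

def brand_based(query):
    return {'top': 'brand tee'}, 1.0

print(run_funnel('tee', [curated_lookup, catalog_then_curated, brand_based]))
```

Ordering the stages from "most curated" to "least curated" is what gives expert-built outfits priority: a later stage only runs when every earlier stage scores below the threshold.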


## Supporting Functions

**Part A - Phase 1 (part_a_1)** checks the outfit combinations dataset (combo) and matches the query with the product with the highest similarity score. Since there is a designated outfit for that product in the combo dataset, we simply return all the items with that outfit ID. 

For this to happen, the cosine similarity score must be at least **0.5**.
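
For readers who want to see the scoring itself, here is a dependency-free sketch of TF-IDF weighting and cosine similarity. It is a simplification and not scikit-learn's implementation: tokenization is a plain whitespace split with no stop-word removal, and `tfidf_vectors`, `cosine`, and `best_match` are illustrative names. As in the notebook, the query is appended to the corpus before the weights are computed.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_match(query, corpus):
    """Return (index, score) of the corpus entry most similar to the query."""
    vectors = tfidf_vectors(corpus + [query])   # the query is the last document
    query_vec = vectors[-1]
    scores = [cosine(vec, query_vec) for vec in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

corpus = [
    'eileen fisher slim knit skirt',
    'eileen fisher rib mock neck tank',
    'acme leather ankle boots',
]
index, score = best_match('slim knit skirt', corpus)
print(index, round(score, 3))
```

Only the query's similarity to each corpus entry is computed here, which avoids building the full pairwise similarity matrix when just one row of it is needed.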

In [12]:
def part_a_1(query):
    vectorizer=TfidfVectorizer(ngram_range=(1,1),token_pattern=r'\b[a-zA-Z]{3,}',stop_words=stopwords)
    corpus=list(combo['new_brand_product'].values)
    corpus.append(query)
    X=vectorizer.fit_transform(corpus)
    terms=vectorizer.get_feature_names_out()
    tf_idf=pd.DataFrame(X.toarray(),columns=terms)
    df_sim=pd.DataFrame(cosine_similarity(tf_idf))
    # Finding product with maximum similarity
    maxvalue=0
    length=len(df_sim)
    for i in range(length-1):
        if df_sim.loc[i,length-1]>maxvalue:
            maxvalue=df_sim.loc[i,length-1]
            loc_i=i
    if maxvalue==0:
        rec_dict={}
        return rec_dict,maxvalue
    # Finding the outfit ID of the product with the highest similarity score
    outfits=combo[combo['product_id']==combo.loc[loc_i,'product_id']]['outfit_id']
    temp=combo[combo['outfit_id']==list(outfits)[0]][['outfit_item_type','product_full_name','product_id']]
    temp['recommendation']=temp['product_full_name'].str.cat(temp['product_id'],sep=' (')+')'
    temp.set_index('outfit_item_type',inplace=True)
    temp.drop(columns=['product_full_name','product_id'],inplace=True)
    rec_dict=temp['recommendation'].to_dict()
    # We return both the recommendation and the similarity value
    return rec_dict,maxvalue

**Part A - Phase 2 (part_a_2)** 

If the cosine similarity score from the above step is below 0.5, this phase searches the Behold products dataset (product) for the product with the highest similarity. That product then becomes the "new search term", and we aim to return an outfit in the combo dataset that contains a product similar to the "new search term". We then apply a checkpoint to see whether this similarity is at least 0.5. If it is, this outfit combination is returned; if it isn't, we proceed to the final phase of "Part A".
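
The two-hop idea can be shown with a toy example. This sketch swaps in Jaccard token overlap for TF-IDF cosine similarity to stay dependency-free, and the `catalog`, `curated`, and function names are all made up for illustration.

```python
def jaccard(a, b):
    """Token-overlap similarity, used here as a simple stand-in for TF-IDF cosine."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def best(query, names):
    """Return (name, score) of the entry most similar to the query."""
    scores = [jaccard(query, name) for name in names]
    top = max(range(len(names)), key=scores.__getitem__)
    return names[top], scores[top]

# Hypothetical data: a large product catalog and a small set of curated outfits
catalog = ['blue silk wrap blouse', 'black leather ankle boots']
curated = {'silk wrap blouse': 'outfit_1', 'leather ankle boots': 'outfit_2'}

def two_hop(query, threshold=0.5):
    """Hop 1: query -> catalog product. Hop 2: that product -> curated outfit."""
    new_term, first_score = best(query, catalog)
    if first_score == 0:
        return None, 0.0
    match, second_score = best(new_term, list(curated))
    outfit_id = curated[match] if second_score >= threshold else None
    return outfit_id, second_score

print(two_hop('blue blouse'))
```

Searched directly against the curated names, "blue blouse" overlaps with "silk wrap blouse" on one token out of four. Routing it through the catalog entry "blue silk wrap blouse" first raises the overlap to three out of four, which is the motivation for the intermediate hop.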

In [13]:
def part_a_2(query):
    vectorizer=TfidfVectorizer(ngram_range=(1,1),token_pattern=r'\b[a-zA-Z]{3,}',stop_words=stopwords)
    corpus=list(product['new_brand_product'].values)
    corpus.append(query)
    corpus=[str(item) for item in corpus]
    X=vectorizer.fit_transform(corpus)
    terms=vectorizer.get_feature_names_out()
    tf_idf=pd.DataFrame(X.toarray(),columns=terms)
    df_sim=pd.DataFrame(cosine_similarity(tf_idf))
    # Finding product with maximum similarity
    maxvalue=0
    length=len(df_sim)
    for i in range(length-1):
        if df_sim.loc[i,length-1]>maxvalue:
            maxvalue=df_sim.loc[i,length-1]
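            loc_i=i
    # No overlap with any product: there is no new search term to continue with
    if maxvalue==0:
        return {},maxvalue
    # "item1" becomes the new search term
    item1=product.loc[loc_i,'brand_name']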
    vectorizer=TfidfVectorizer(ngram_range=(1,1),token_pattern=r'\b[a-zA-Z]{3,}',stop_words=stopwords)
    corpus=list(combo['new_brand_product'].values)
    corpus.append(item1)
    X=vectorizer.fit_transform(corpus)
    terms=vectorizer.get_feature_names_out()
    tf_idf=pd.DataFrame(X.toarray(),columns=terms)
    df_sim=pd.DataFrame(cosine_similarity(tf_idf))
    maxvalue=0
    length=len(df_sim)
    for i in range(length-1):
        if df_sim.loc[i,length-1]>maxvalue:
            maxvalue=df_sim.loc[i,length-1]
            loc_i=i
    # No overlap with any product in the combination dataset
    if maxvalue==0:
        return {},maxvalue
    # Extracting the outfit from the combination dataset based on maximum similarity
    outfits=combo[combo['product_id']==combo.loc[loc_i,'product_id']]['outfit_id']
    temp=combo[combo['outfit_id']==list(outfits)[0]][['outfit_item_type','product_full_name','product_id']]
    temp['recommendation']=temp['product_full_name'].str.cat(temp['product_id'],sep=' (')+')'
    temp.set_index('outfit_item_type',inplace=True)
    temp.drop(columns=['product_full_name','product_id'],inplace=True)
    rec_dict=temp['recommendation'].to_dict()
    # We return both the recommendation and the similarity value
    return rec_dict,maxvalue

**Part A - Phase 3 (part_a_3)** searches the product dataset for the highest cosine similarity match. We then extract out the brand of this product to create a **brand-based** outfit. 

The next step in this process is to return an outfit including a top, a bottom, an accessory, and a shoe from the extracted brand. However, if the category of the product with the highest cosine similarity match is a "one_piece", we only return an accessory and a shoe alongside it, because it wouldn't make sense to add a top and a bottom to an outfit built around a one-piece.
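
The brand-based assembly rule can be sketched on plain dictionaries. The `catalog` entries and the `brand_outfit` helper below are hypothetical, and the sketch picks the first item per category, as the notebook's function also does.

```python
def brand_outfit(matched, catalog):
    """Build an outfit around `matched` using other items from the same brand."""
    if matched['category'] == 'one_piece':
        wanted = ['acc', 'shoe']                     # a one-piece replaces top and bottom
    else:
        wanted = ['top', 'bottom', 'acc', 'shoe']
    outfit = {category: None for category in wanted}
    for category in wanted:
        for item in catalog:
            if item['brand'] == matched['brand'] and item['category'] == category:
                outfit[category] = item['name']      # first match wins
                break
    # The matched product always fills (or overrides) its own category
    outfit[matched['category']] = matched['name']
    return outfit

# Hypothetical catalog for illustration
catalog = [
    {'brand': 'acme', 'category': 'one_piece', 'name': 'red mini dress'},
    {'brand': 'acme', 'category': 'top', 'name': 'white tee'},
    {'brand': 'acme', 'category': 'acc', 'name': 'gold earrings'},
    {'brand': 'acme', 'category': 'shoe', 'name': 'black sandals'},
    {'brand': 'other', 'category': 'shoe', 'name': 'blue clogs'},
]
print(brand_outfit(catalog[0], catalog))
```

A category the brand does not stock stays `None` in this sketch, which makes gaps in the outfit visible instead of silently dropping them.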

In [14]:
def part_a_3(query):
    vectorizer=TfidfVectorizer(ngram_range=(1,1),token_pattern=r'\b[a-zA-Z]{3,}',stop_words=stopwords)
    # Potential components of an outfit
    keyList=['top','bottom','acc','shoe','one_piece']
    product1=product[product['final_category'].isin(keyList)].reset_index(drop=True)
    df=product1.groupby('brand')['final_category'].nunique()
    filt=list(df[df>4].index)
    product1=product1[product1['brand'].isin(filt)].reset_index(drop=True)
    corpus=list(product1['new_brand_product'].values)
    corpus.append(query)
    corpus=[str(item) for item in corpus]
    X=vectorizer.fit_transform(corpus)
    terms=vectorizer.get_feature_names_out()
    tf_idf=pd.DataFrame(X.toarray(),columns=terms)
    df_sim=pd.DataFrame(cosine_similarity(tf_idf))
    maxvalue=0
    length=len(df_sim)
    for i in range(length-1):
        if df_sim.loc[i,length-1]>maxvalue:
            maxvalue=df_sim.loc[i,length-1]
            loc_i=i
    if maxvalue==0:
        rec_dict={}
        return rec_dict,maxvalue
    # Extracting out the category and brand
    orig_cat=product1.loc[loc_i,'final_category']
    orig_prod=product1.loc[loc_i,'brand_name']
    new_brand=product1.loc[loc_i,'brand']
    # This path will give a "one_piece", "shoe", and "acc"
    if orig_cat=='one_piece':
        keyList=['acc','shoe']
        rec_dict={}
        for i in keyList:
            rec_dict[i]=None
        for i in keyList:
            if len(product1[(product1['brand']==new_brand) & (product1['final_category']==i)].reset_index(drop=True))>0:
                temp=product1[(product1['brand']==new_brand) & (product1['final_category']==i)].reset_index(drop=True)
                val=temp.loc[0,'brand_name']+' ('+temp.loc[0,'product_id']+')'
                rec_dict[i]=val
        rec_dict[orig_cat]=orig_prod+' ('+product1.loc[loc_i,'product_id']+')'
        maxvalue='Not Applicable'
        print('Brand:',new_brand)
        return rec_dict,maxvalue
    # This path will give a "top", "bottom", "shoe", and "acc"
    else:
        keyList=['top','bottom','acc','shoe']
        rec_dict={}
        for i in keyList:
            rec_dict[i]=None
        for i in keyList:
            if len(product1[(product1['brand']==new_brand) & (product1['final_category']==i)].reset_index(drop=True))>0:
                temp=product1[(product1['brand']==new_brand) & (product1['final_category']==i)].reset_index(drop=True)
                val=temp.loc[0,'brand_name']+' ('+temp.loc[0,'product_id']+')'
                rec_dict[i]=val
        rec_dict[orig_cat]=orig_prod+' ('+product1.loc[loc_i,'product_id']+')'
        maxvalue='Not Applicable'
        print('Brand:',new_brand)
        return rec_dict,maxvalue

**Part B (part_b)** was created to handle edge cases where the user enters something very specific. We look for the following connecting phrases: "goes well with", "goes with", "allows for", "pairs well with", and "pairs with". The function captures the text after the phrase (`querya`) and the text before it (`queryb`). Each is then used independently to return the highest-similarity product from the product dataset, with the second search restricted to categories other than the one already matched.
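
The phrase splitting can also be done with a single pattern. This is an alternative sketch and not the notebook's approach, which runs one `re.findall` per phrase; `CONNECTORS` and `split_query` are illustrative names.

```python
import re

# Longer phrases are listed before their shorter variants
CONNECTORS = r'goes well with|goes with|allows for|pairs well with|pairs with'

def split_query(query):
    """Return (text before the phrase, text after the phrase), or None if no phrase is found."""
    match = re.search(rf'^(.+?)\s+(?:{CONNECTORS})\s+(.+)$', query.lower())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()

print(split_query('Skinny jeans goes well with black shirt'))
print(split_query('blue shirt'))
```

Returning `None` when no connecting phrase is present gives the caller an explicit signal to route the query through Part A instead.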

In [15]:
def part_b(query):
    # matching product user mentions in query
    query_1=re.findall(r'goes well with ([\w @;\n\.,]+)',query)
    query_2=re.findall(r'goes with ([\w @;\n\.,]+)',query)
    query_3=re.findall(r'allows for ([\w @;\n\.,]+)',query)
    query_4=re.findall(r'pairs well with ([\w @;\n\.,]+)',query)
    query_5=re.findall(r'pairs with ([\w @;\n\.,]+)',query)
    # Actual product user is searching for
    query_a2=re.findall(r'([\w @;\n\.,]+) goes well with',query)
    query_b2=re.findall(r'([\w @;\n\.,]+)goes with',query)
    query_c2=re.findall(r'([\w @;\n\.,]+)allows for',query)
    query_d2=re.findall(r'([\w @;\n\.,]+)pairs well with',query)
    query_e2=re.findall(r'([\w @;\n\.,]+)pairs with',query)
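    query1=query2=None
    for i in [query_1,query_2,query_3,query_4,query_5]:
        if len(i)==1:
            query1=i
    for i in [query_a2,query_b2,query_c2,query_d2,query_e2]:
        if len(i)==1:
            query2=i
    # If either side of the connecting phrase is missing, no recommendation can be made
    if query1 is None or query2 is None:
        return {},0
    querya=query1[0]
    queryb=query2[0]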
    vectorizer=TfidfVectorizer(ngram_range=(1,1),token_pattern=r'\b[a-zA-Z]{3,}',stop_words=stopwords)
    corpus1=list(product['new_brand_product'].values)
    corpus1=[str(item) for item in corpus1]
    corpus1.append(querya)
    X=vectorizer.fit_transform(corpus1)
    terms=vectorizer.get_feature_names_out()
    tf_idf=pd.DataFrame(X.toarray(),columns=terms)
    df_sim=pd.DataFrame(cosine_similarity(tf_idf))
    maxvalue=0
    length=len(df_sim)
    for i in range(length-1):
        if df_sim.loc[i,length-1]>maxvalue:
            maxvalue=df_sim.loc[i,length-1]
            loc_i=i
    if maxvalue==0:
        rec_dict={}
        return rec_dict,maxvalue
    rec_dict={}
    column1=product.loc[loc_i,'final_category']
    rec_dict[column1]=product.loc[loc_i,'brand_name']+' ('+product.loc[loc_i,'product_id']+')'
    filter1=list(rec_dict.keys())
    temp=product[product['final_category'].isin(filter1)==False]
    temp.reset_index(drop=True,inplace=True)
    vectorizer=TfidfVectorizer(ngram_range=(1,1),token_pattern=r'\b[a-zA-Z]{3,}',stop_words=stopwords)
    corpus2=list(temp['new_brand_product'].values)
    corpus2=[str(item) for item in corpus2]
    corpus2.append(queryb)
    X=vectorizer.fit_transform(corpus2)
    terms=vectorizer.get_feature_names_out()
    tf_idf=pd.DataFrame(X.toarray(),columns=terms)
    df_sim=pd.DataFrame(cosine_similarity(tf_idf))
    maxvalue=0
    length=len(df_sim)
    for j in range(length-1):
        if df_sim.loc[j,length-1]>maxvalue:
            maxvalue=df_sim.loc[j,length-1]
            loc_j=j
    if maxvalue==0:
        rec_dict={}
        return rec_dict,maxvalue
    column2=temp.loc[loc_j,'final_category']
    rec_dict[column2]=temp.loc[loc_j,'brand_name']+' ('+temp.loc[loc_j,'product_id']+')'
    maxvalue='Not Applicable'
    return rec_dict,maxvalue

## Main Function

This final recommender function called `outfit( )` is what will provide the final outfit recommendation. This function uses the cleaned product and combo dataset and the supporting functions declared above. 

_Please run the entire notebook once and then the function will be ready to use._


In [16]:
def outfit(query):
    query=query.lower()
    query_1=re.findall(r'goes well with ([\w @;\n\.,]+)',query)
    query_2=re.findall(r'goes with ([\w @;\n\.,]+)',query)
    query_3=re.findall(r'allows for ([\w @;\n\.,]+)',query)
    query_4=re.findall(r'pairs well with ([\w @;\n\.,]+)',query)
    query_5=re.findall(r'pairs with ([\w @;\n\.,]+)',query)
    # Makes sure we don't need to utilize "part_b"
    if (len(query_1)==0) & (len(query_2)==0) & (len(query_3)==0) & (len(query_4)==0) & (len(query_5)==0):
        # Function call to "part_a_1"
        rec_dict,maxvalue=part_a_1(query)
        if maxvalue==0:
            rec_dict={}
            return print(rec_dict)
        # Threshold set at 0.5, goes to "part_a_2" if not met
        if maxvalue<0.5:
            rec_dict,maxvalue=part_a_2(query)
            # Threshold set at 0.5, goes to "part_a_3" if not met
            if maxvalue<0.5:
                rec_dict,maxvalue=part_a_3(query)
                if maxvalue==0:
                    rec_dict={}
                    return print(rec_dict)
    else:
        # Deals with our edge cases
        rec_dict,maxvalue=part_b(query)
        if maxvalue==0:
            rec_dict={}
            return print(rec_dict)
    for key, value in rec_dict.items():
        print(key,':',value)

## Example Searches

In [17]:
# Eileen Fisher is a brand 
outfit('eileen fisher slim knit')

bottom : slim knit skirt (01DMBRYVA2P5H24WK0HTK4R0A1)
top : rib mock neck tank (01DMBRYVA2PEPWFTT7RMP5AA1T)
accessory1 : medium margaux leather satchel (01DMBRYVA2S5T9W793F4CY41HE)
shoe : penelope mid cap toe pump (01DMBRYVA2ZFDYRYY5TRQZJTBD)


In [18]:
# Both phrases need to be valid to return a match
outfit('skinny jeans goes well with black shirt')

top : pj shirt in black (01EP63JYKXXTFQKMKNN3RNNVDS)
bottom : b(air) ankle skinny jeans (01E6070892T9HB6NZ6APHTZ70N)


In [19]:
# Falls through to part_a_3, which prints the brand it matched
outfit('blue shirt')

Brand: 6397
top : lori shirt in blue (01F39R5Q9H6YBSEDE1NAAN7XWM)
bottom : pull-on trouser in navy (01EP63R8A0KE3MZ819N5C4T3G1)
acc : grey plaid summer v-neck shirtdress (01ESCC0YDWYFRYMHVEP4YWBNT9)
shoe : 6397 clogs in blue denim (01EP62NW4TV1ND02A2EHMJTM4H)


In [21]:
outfit('red dress')

Brand: chufy
acc : talisman harmony earrings (01EEZT37R6FZFKA4MGZKQQRHKY)
shoe : aroa nappa' black sandals (01ED4MVFN5XYJZZXH0ECTZ40FY)
one_piece : re red mini dress (01EKJJ5D9SBBT68RRP30TYT2PK)


In [22]:
outfit('shirt')

bottom : dazzler distressed slim ankle jeans (01DT0DHSDTQ34YYDQZ856T6WP9)
accessory2 : you're cute classic cotton tee (01DT0DJMGB47PSW7H695CJHXNT)
accessory1 : mia printed leather shoulder bag (01DT8NCT6C6734WAWTYZX94SE8)
top : checked wool shirt (01DVCT2ANYEJBJ8EKD8XJCBB7P)
shoe : cabria leather ankle boots (01DVCTFR5MA1ZDKTAFS4VG4VW4)


In [23]:
# Both phrases need to be valid to return a match
outfit('skinny jeans goes well with xxx')

{}


In [24]:
# Returns an empty dictionary when given a poor request
outfit('abc')

{}


# For Exploration Purposes

We explored fuzzy matching and Doc2Vec in conjunction with cosine similarity. We ultimately decided to go with TFIDF and cosine similarity, but we still wanted to provide simple examples of each for context.

## Option 1: Fuzzy Matching

In [25]:
q='Rib Mock Neck Tank'
max_sim=0
for i in combo.index:
    if fuzz.token_set_ratio(combo.loc[i,'product_full_name'],q)>max_sim:
        res_i=i
        max_sim=fuzz.token_set_ratio(combo.loc[i,'product_full_name'],q)
outfits=combo[combo['product_id']==combo.loc[res_i,'product_id']]['outfit_id']
temp=combo[combo['outfit_id']==list(outfits)[0]][['outfit_item_type','product_full_name','product_id']]
temp['recommendation']=temp['product_full_name'].str.cat(temp['product_id'], sep=' (')+')'
temp.set_index('outfit_item_type',inplace=True)
temp.drop(columns=['product_full_name','product_id'],inplace=True)
rec_dict=temp['recommendation'].to_dict()
rec_dict

{'bottom': 'slim knit skirt (01DMBRYVA2P5H24WK0HTK4R0A1)',
 'top': 'rib mock neck tank (01DMBRYVA2PEPWFTT7RMP5AA1T)',
 'accessory1': 'medium margaux leather satchel (01DMBRYVA2S5T9W793F4CY41HE)',
 'shoe': 'penelope mid cap toe pump (01DMBRYVA2ZFDYRYY5TRQZJTBD)'}

## Option 2: Doc2Vec

In [26]:
q='leather satchel'
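# Note: Doc2Vec expects tokenized text. Here each product name is passed as a raw string
# (and the query below as a one-element list), so this cell is only a rough illustration.
documents=[TaggedDocument(doc, [i]) for i, doc in enumerate(combo['product_full_name'])]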
model=Doc2Vec(documents)
new_q=model.infer_vector([q]).reshape(1,-1)
max_d2v=0
for i in combo.index:
    doc_vec=model.infer_vector([combo.loc[i,'product_full_name']]).reshape(1,-1)
    if cosine_similarity(doc_vec,new_q)>max_d2v:
        res_i=i
        max_d2v=cosine_similarity(doc_vec,new_q)
outfits=combo[combo['product_id']==combo.loc[res_i,'product_id']]['outfit_id']
temp=combo[combo['outfit_id']==list(outfits)[0]][['outfit_item_type','product_full_name','product_id']]
temp['recommendation']=temp['product_full_name'].str.cat(temp['product_id'], sep=' (')+')'
temp.set_index('outfit_item_type',inplace=True)
temp.drop(columns=['product_full_name','product_id'],inplace=True)
rec_dict=temp['recommendation'].to_dict()
rec_dict

{'accessory1': 'lilleth metal cage clutch (01DT0DJSS1Y0SYS1G2H13CC3Y5)',
 'accessory2': 'jane silk-satin camisole (01DT512VDZ03SMBTPNCRYPXZZX)',
 'bottom': 'manu cropped leather wide-leg pants (01DT517935D6N4Z2G1HFB4DT2A)',
 'top': 'tiger-print lurex cardigan (01DVP7RVX271PQ54TKKWQ8DGYD)',
 'shoe': 'tanya metallic leather mules (01DVP8HWCYG3AP0Z0ZJFVME4W7)'}

## Future Improvements

- If given unlimited resources, we would have liked to explore the potential of "Knowledge Graphs". Based on our research, this is a large undertaking that requires many hours of devotion to implement.


- We also would like to further research the feasibility of using weighted TF-IDF document vectors. We tried doing this but it blew up the dimensionality. To reduce dimensionality, we explored using SVD or PCA but the cosine similarity was adversely affected. We did, however, utilize more than one column of information when constructing our TF-IDF corpus.


- Another method we considered was to weight the similarity by brand instead of concatenating columns. We calculated the similarities from brand_name and brand_description separately and then combined them, giving 80% weight to the description and 20% to brand_name. With this technique we got higher similarity scores, but the recommendations were less intuitive, which seemed counter-intuitive. Our suspicion is that assigning weight to the brand name makes other products from the same brand more similar, which takes the recommendation in a different direction from the actual product. We were unable to confirm this suspicion; given more time, we would have explored it further. Within the scope of this project, we decided to go ahead with the concatenation logic.

```python
    # weighing by brand name
    weight_corpus=list(product['brand_name'].values)
    weight_corpus=[str(item) for item in weight_corpus]
    weights=vectorizer.fit_transform(weight_corpus)
    weight_df=pd.DataFrame(weights.toarray())
    df_sim_brand=pd.DataFrame(cosine_similarity(weight_df)).fillna(1e-10)
    
    # Description gets a weight of .8 and brand name gets a weight of .2
    df_sim=0.8*df_sim_desc+0.2*df_sim_brand
```

- In addition, we also would like to increase our scope to design an outfit when a user enters in a brand name and nothing else.