# Fashion compatibility - Data Exploration and Analysis

---
## Author Information
- **Author:** Francesco Tedesco
- **Email:** francescotedesco7d2@gmail.com
- [**LinkedIn**](https://www.linkedin.com/in/francescotedesco7d2/)

---
## Overview
This notebook explores the dataset to gain insight into the different types of outfits, in order to build a set of valid product combinations for outfits.

---


In [1]:
import pandas as pd
from collections import Counter

from pathlib import Path
import os

os.chdir(str(Path.cwd().parent))

from src.utils.setup_utilities import load_config, setup_logging
from src.data_processing.group_manipulation import (
    create_combinations,
    create_configurations,
    get_grouped_counts,
    get_grouped_counts_feature_values,
    get_unique_sets_features,
)
from src.data_processing.product_processing import get_des_product_class

config = load_config()

In [2]:
df_outfits = pd.read_csv(config['data']['outfits_path'])
df_products = pd.read_csv(config['data']['products_path'])
df_outfit_products = pd.merge(df_outfits, df_products, on='cod_modelo_color', how='outer')

## Initial exploration

In [3]:
df_outfits.head()

Unnamed: 0,cod_outfit,cod_modelo_color
0,1,51000622-02
1,1,43067759-01
2,1,53060518-02
3,1,53030594-08
4,1,43077762-01


The `df_outfits` dataframe contains the code of each outfit and the codes of the products associated with it.

In [4]:
df_products.head()

Unnamed: 0,cod_modelo_color,cod_color_code,des_color_specification_esp,des_agrup_color_eng,des_sex,des_age,des_line,des_fabric,des_product_category,des_product_aggregated_family,des_product_family,des_product_type,des_filename
0,41085800-02,02,OFFWHITE,WHITE,Female,Adult,SHE,P-PLANA,Bottoms,Trousers & leggings,Trousers,Trousers,datathon/images/2019_41085800_02.jpg
1,53000586-TO,TO,TEJANO OSCURO,BLUE,Female,Adult,SHE,J-JEANS,Bottoms,Jeans,Jeans,Jeans,datathon/images/2019_53000586_TO.jpg
2,53030601-81,81,ROSA PASTEL,PINK,Female,Adult,SHE,P-PLANA,"Dresses, jumpsuits and Complete set",Dresses and jumpsuits,Dresses,Dress,datathon/images/2019_53030601_81.jpg
3,53050730-15,15,MOSTAZA,YELLOW,Female,Adult,SHE,P-PLANA,"Dresses, jumpsuits and Complete set",Dresses and jumpsuits,Dresses,Dress,datathon/images/2019_53050730_15.jpg
4,53070773-70,70,ROJO,RED,Female,Adult,SHE,P-PLANA,Tops,Shirts,Shirt,Shirt,datathon/images/2019_53070773_70.jpg


The `df_products` dataframe contains information about each product, including the path to the corresponding image.

Some important things to consider: 

In [5]:
print("Total number of outfits:", len(df_outfits['cod_outfit'].unique()))

Total number of outfits: 7842


In [6]:
print("Total number of products:", len(df_products['cod_modelo_color'].unique()))

Total number of products: 9222


In [7]:
existing_products = set(df_products['cod_modelo_color'].unique())
products_w_outfit = set(df_outfits['cod_modelo_color'].unique())

products_without_outfit = existing_products - products_w_outfit

print("Number of products without an outfit:", len(products_without_outfit))

Number of products without an outfit: 0
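
The same check can also be written as an anti-join. The sketch below is only an illustration on toy data (the product codes are invented); it uses `pd.merge` with `indicator=True` to flag products that never appear in any outfit.

```python
import pandas as pd

# Toy data: codes are invented for illustration only.
products = pd.DataFrame({'cod_modelo_color': ['A-01', 'B-02', 'C-03']})
outfits = pd.DataFrame({
    'cod_outfit': [1, 1, 2],
    'cod_modelo_color': ['A-01', 'B-02', 'A-01'],
})

# Left-join products against the distinct product codes used in outfits.
used = outfits[['cod_modelo_color']].drop_duplicates()
flagged = products.merge(used, on='cod_modelo_color', how='left', indicator=True)

# Rows marked 'left_only' are products that appear in no outfit.
products_without_outfit = flagged.loc[
    flagged['_merge'] == 'left_only', 'cod_modelo_color'
].tolist()
print(products_without_outfit)  # ['C-03']
```

With the real data, an empty result here would agree with the set difference computed above.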


In [8]:
name_set = 'outfit'
code_sets = 'cod_outfit'
get_grouped_counts(df_outfits, code_sets, name_set)

Unnamed: 0,outfit_size,grouped_outfit_size
0,2,30
1,3,142
2,4,758
3,5,3875
4,6,1737
5,7,674
6,8,309
7,9,174
8,10,95
9,11,26


Looking at the number of outfits for each size, we can see that some sizes are rare (for example, sizes 2 and 11, with 30 and 26 outfits respectively), so outfits of those sizes could potentially be excluded from the training set.
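
For reference, a size distribution like the one above can be computed with plain pandas. This is a minimal sketch on toy data, not the implementation of `get_grouped_counts`; the threshold `min_outfits_per_size` is a hypothetical parameter introduced only for illustration.

```python
import pandas as pd

# Toy outfit table: one row per (outfit, product) pair; codes are invented.
toy_outfits = pd.DataFrame({
    'cod_outfit': [1, 1, 2, 2, 2, 3, 3],
    'cod_modelo_color': ['A', 'B', 'A', 'C', 'D', 'B', 'C'],
})

# Number of products in each outfit.
outfit_sizes = toy_outfits.groupby('cod_outfit').size()

# Number of outfits for each size, sorted by size.
size_distribution = outfit_sizes.value_counts().sort_index()

# Hypothetical threshold: sizes with fewer outfits than this are considered rare.
min_outfits_per_size = 2
rare_sizes = size_distribution[size_distribution < min_outfits_per_size].index.tolist()
print(rare_sizes)
```

In the toy data two outfits have size 2 and one has size 3, so only size 3 falls below the threshold.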

In [9]:
df_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9222 entries, 0 to 9221
Data columns (total 13 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   cod_modelo_color               9222 non-null   object
 1   cod_color_code                 9222 non-null   object
 2   des_color_specification_esp    9222 non-null   object
 3   des_agrup_color_eng            9222 non-null   object
 4   des_sex                        9222 non-null   object
 5   des_age                        9222 non-null   object
 6   des_line                       9222 non-null   object
 7   des_fabric                     9222 non-null   object
 8   des_product_category           9222 non-null   object
 9   des_product_aggregated_family  9222 non-null   object
 10  des_product_family             9222 non-null   object
 11  des_product_type               9222 non-null   object
 12  des_filename                   9222 non-null   object
dtypes: 

## Product description

Since not all outfits will be considered in the training process, we should start by examining the product descriptions to determine which descriptors are useful. Later, we can use this information to establish criteria for defining valid combinations.

In [10]:
df_products['des_product_category'].unique()

array(['Bottoms', 'Dresses, jumpsuits and Complete set', 'Tops',
       'Accesories, Swim and Intimate', 'Outerwear', 'Beauty', 'Home'],
      dtype=object)

In [11]:
df_products['des_product_family'].unique()

array(['Trousers', 'Jeans', 'Dresses', 'Shirt', 'Sweater', 'Skirts',
       'Jewellery', 'Bags', 'Glasses', 'Wallets & cases', 'Shorts',
       'Tops', 'Belts and Ties', 'Jumpsuit', 'Jackets', 'Coats',
       'Footwear', 'Hats, scarves and gloves', 'T-shirt', 'Blazers',
       'Gadgets', 'Swimwear', 'Vest', 'Fragances', 'Cardigans',
       'Trenchcoats', 'Puffer coats', 'Outer Vest',
       'Leggings and joggers', 'Deco Accessories', 'Poloshirts',
       'Intimate', 'Sweatshirts', 'Deco Textiles', 'Bedding', 'Bodysuits',
       'Leather jackets', 'Parkas', 'Glassware'], dtype=object)

In [12]:
df_products['des_product_type'].unique()

array(['Trousers', 'Jeans', 'Dress', 'Shirt', 'Sweater', 'Skirt',
       'Earrings', 'Totes bag', 'Sunglasses', 'Card holder', 'Wallet',
       'Shorts', 'Top', 'Belt', 'Crossbody bag', 'Jumpsuit', 'Jacket',
       'Coat', 'Sandals', 'Kerchief', 'Shoes', 'Blouse', 'T-Shirt',
       'Blazer', 'Umbrella', 'Citybag', 'Bikini top', 'Vest',
       'Shoulder bag', 'Bodymist', 'Beanie', 'Handbag', 'Cardigan',
       'Glasses', 'Trenchcoat', 'Puffer coat', 'Necklace',
       'Bikini pantie', 'Outer vest', 'Scarf', 'Ankle Boots', 'Leggings',
       'Basket', 'Cosmetic bag', 'Ring', 'Poloshirt', 'Pyjama',
       'Sweatshirt', 'Plaid', 'Boots', 'Hat', 'Duvet Covers',
       'Beach Towel', 'Gloves', 'Bodysuit', 'Fragance', 'Leather Jacket',
       'Hairband', 'Bermudas', 'Cap', 'Parka', 'Pyjama Trousers',
       'Pyjama Shirt', 'Bras', 'Trainers', 'Foulard', 'Hairclip', 'Case',
       'Bracelet', 'Pyjama Shorts', 'Sweater Vest', 'Pyjiama Sweater',
       'Bucket bag', 'Jacket (Cazadora)', 'Purse',

We can observe that as the description becomes more specific, the number of distinct labels increases. Let's examine the counts for the values of 'des_product_category' (the least specific descriptor):

In [13]:
df_outfit_products['des_product_category'].value_counts()

des_product_category
Accesories, Swim and Intimate          27315
Tops                                    6313
Bottoms                                 5839
Outerwear                               2080
Dresses, jumpsuits and Complete set     1777
Home                                     169
Beauty                                    89
Name: count, dtype: int64

As the Home and Beauty labels are rare, we can exclude them when defining valid outfit combinations. The other labels generalize better and are less likely to contain 'strange' products. The exception is the most common label (Accesories, Swim and Intimate), whose sublabels may have very low frequency and/or make less sense as part of an outfit.
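
One way to apply such an exclusion is sketched below on toy data. It assumes that an outfit is dropped entirely as soon as it contains a product from an excluded category; that policy is an assumption made for this example, not something this notebook fixes.

```python
import pandas as pd

# Toy merged outfit/product rows; values are invented for illustration.
toy = pd.DataFrame({
    'cod_outfit': [1, 1, 2, 2, 3],
    'des_product_category': ['Tops', 'Bottoms', 'Home', 'Tops', 'Beauty'],
})

excluded_categories = ['Home', 'Beauty']

# Outfits that contain at least one product from an excluded category.
is_excluded = toy['des_product_category'].isin(excluded_categories)
outfits_to_drop = toy.loc[is_excluded, 'cod_outfit'].unique()

# Keep only the rows of the remaining outfits.
filtered = toy[~toy['cod_outfit'].isin(outfits_to_drop)]
print(sorted(filtered['cod_outfit'].unique().tolist()))  # [1]
```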

In [14]:
df_accesories = df_outfit_products[df_outfit_products['des_product_category'] == 'Accesories, Swim and Intimate']
df_accesories['des_product_family'].value_counts()

des_product_family
Jewellery                   9686
Footwear                    7976
Bags                        6152
Glasses                      910
Belts and Ties               767
Wallets & cases              611
Hats, scarves and gloves     551
Intimate                     473
Swimwear                     158
Gadgets                       31
Name: count, dtype: int64

The same observation applies to the Jewellery family. Let's examine its product types.

In [15]:
df_accesories = df_outfit_products[df_outfit_products['des_product_category'] == 'Accesories, Swim and Intimate']
df_jewellery = df_accesories[df_accesories['des_product_family'] == 'Jewellery']
df_jewellery['des_product_type'].value_counts()

des_product_type
Earrings    5858
Ring        1832
Necklace    1363
Bracelet     614
Hairclip      19
Name: count, dtype: int64

We can now create a personalized product class by combining the three column descriptors we just examined.

In [16]:
df_outfit_products = df_outfit_products.copy()
df_outfit_products['des_product_class'] = df_outfit_products.apply(get_des_product_class, axis=1)

In [17]:
df_outfit_products['des_product_class'].unique()

array(['Accesories, Swim and Intimate', 'Tops', 'Earrings', 'Bottoms',
       'Outerwear', 'Dresses, jumpsuits and Complete set', 'Necklace',
       'Bracelet', 'Ring', 'Home', 'Beauty', 'Hairclip'], dtype=object)
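
The implementation of `get_des_product_class` lives in `src/` and is not shown here. The unique values above are consistent with a rule along the following lines, where jewellery is split into its product types and everything else keeps its broad category. This is a hypothetical reconstruction (`sketch_product_class` is an invented name), and the real function may differ.

```python
def sketch_product_class(row):
    """Hypothetical stand-in for get_des_product_class; an assumption, not the real code."""
    # Assumption: jewellery items are labelled by their specific product type.
    if row['des_product_family'] == 'Jewellery':
        return row['des_product_type']
    # Assumption: every other item keeps its broad product category.
    return row['des_product_category']


# Toy rows using labels that appear in the dataset.
rows = [
    {'des_product_category': 'Accesories, Swim and Intimate',
     'des_product_family': 'Jewellery', 'des_product_type': 'Earrings'},
    {'des_product_category': 'Tops',
     'des_product_family': 'Shirt', 'des_product_type': 'Shirt'},
    {'des_product_category': 'Accesories, Swim and Intimate',
     'des_product_family': 'Footwear', 'des_product_type': 'Sandals'},
]

classes = [sketch_product_class(row) for row in rows]
print(classes)  # ['Earrings', 'Tops', 'Accesories, Swim and Intimate']
```

Because the function only indexes the row by column name, it works on plain dictionaries as well as on the rows passed by `DataFrame.apply(..., axis=1)`.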

## Outfits structure 


Now that we've created a new class to describe the products, let's take a closer look at how the outfits are actually constructed. 


In [18]:
size = 2
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

Unnamed: 0,des_product_class_set,des_product_class_outfit_codes,des_product_class_count
1,"{'Accesories, Swim and Intimate': 2}","[1571, 1572, 1573, 1574, 1580, 1620, 1621, 203...",20
3,"{'Tops': 1, 'Bottoms': 1}","[3712, 3755, 3818, 7373]",4
0,"{'Tops': 1, 'Earrings': 1}","[1214, 3872]",2
2,"{'Outerwear': 1, 'Dresses, jumpsuits and Compl...",[2036],1
4,"{'Necklace': 1, 'Outerwear': 1}",[5448],1
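
To make the idea behind these tables concrete, here is a minimal sketch of grouping outfits by their class composition. It uses only the standard library and toy data, and it is not the implementation of `get_grouped_counts_feature_values` or `get_unique_sets_features`.

```python
from collections import Counter, defaultdict

# Toy (outfit code, product class) pairs; values are invented for illustration.
rows = [
    (1, 'Tops'), (1, 'Bottoms'),
    (2, 'Bottoms'), (2, 'Tops'),
    (3, 'Tops'), (3, 'Earrings'),
]

# Collect the classes of each outfit.
classes_by_outfit = defaultdict(list)
for cod_outfit, product_class in rows:
    classes_by_outfit[cod_outfit].append(product_class)

# Group outfits by composition. A frozenset of (class, count) pairs is an
# order-independent, hashable key for a multiset of classes.
outfits_by_composition = defaultdict(list)
for cod_outfit, classes in classes_by_outfit.items():
    key = frozenset(Counter(classes).items())
    outfits_by_composition[key].append(cod_outfit)

# Most frequent compositions first.
ranked = sorted(outfits_by_composition.items(), key=lambda kv: len(kv[1]), reverse=True)
for key, codes in ranked:
    print(dict(key), codes, len(codes))
```

Outfits 1 and 2 list the same classes in a different order, so they end up under the same composition.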


In [19]:
size = 3
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

Unnamed: 0,des_product_class_set,des_product_class_outfit_codes,des_product_class_count
4,"{'Accesories, Swim and Intimate': 3}","[591, 810, 813, 990, 1141, 1161, 1308, 1318, 1...",38
10,"{'Dresses, jumpsuits and Complete set': 1, 'Ea...","[942, 1333, 1504, 1582, 1594, 1809, 2068, 2094...",18
15,"{'Accesories, Swim and Intimate': 2, 'Ring': 1}","[1085, 1151, 1311, 1486, 1688, 1753, 2489, 249...",11
6,"{'Earrings': 1, 'Accesories, Swim and Intimate...","[725, 889, 1215, 1471, 1568, 3131, 4788, 5306,...",9
14,"{'Dresses, jumpsuits and Complete set': 1, 'Ac...","[1019, 1593, 3368, 5464, 5479, 6421, 6949, 696...",9


In [20]:
size = 4
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

Unnamed: 0,des_product_class_set,des_product_class_outfit_codes,des_product_class_count
3,"{'Dresses, jumpsuits and Complete set': 1, 'Ea...","[55, 160, 191, 229, 231, 232, 249, 251, 252, 2...",220
2,"{'Bottoms': 1, 'Tops': 1, 'Earrings': 1, 'Acce...","[39, 185, 288, 350, 352, 369, 425, 428, 463, 4...",101
12,"{'Accesories, Swim and Intimate': 3, 'Earrings...","[595, 612, 718, 878, 883, 888, 892, 949, 964, ...",59
9,"{'Tops': 1, 'Bottoms': 1, 'Accesories, Swim an...","[312, 366, 484, 570, 574, 688, 701, 739, 789, ...",55
13,"{'Accesories, Swim and Intimate': 3, 'Ring': 1}","[609, 709, 714, 880, 890, 895, 982, 987, 988, ...",53


In [21]:
size = 5
feature_name = 'des_product_type'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

Unnamed: 0,des_product_type_set,des_product_type_outfit_codes,des_product_type_count
3,"{'Dress': 1, 'Handbag': 1, 'Sandals': 1, 'Earr...","[15, 22, 23, 24, 33, 37, 153, 163, 228, 353, 3...",54
444,"{'Crossbody bag': 1, 'Dress': 1, 'Ring': 1, 'S...","[2600, 2601, 2804, 2914, 3336, 3538, 4052, 426...",35
7,"{'Necklace': 1, 'Dress': 1, 'Handbag': 1, 'Ear...","[32, 57, 324, 516, 645, 825, 862, 1016, 1231, ...",35
440,"{'Shoulder bag': 1, 'Earrings': 1, 'Dress': 1,...","[2592, 2631, 2821, 3405, 3448, 3480, 3685, 401...",32
459,"{'Sunglasses': 1, 'Sandals': 1, 'Earrings': 1,...","[2627, 2644, 2699, 2776, 2778, 2817, 2823, 346...",31


In [22]:
size = 6
feature_name = 'des_product_category'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

Unnamed: 0,des_product_category_set,des_product_category_outfit_codes,des_product_category_count
1,"{'Accesories, Swim and Intimate': 4, 'Tops': 1...","[7, 27, 28, 30, 44, 45, 51, 58, 60, 62, 70, 82...",587
3,"{'Bottoms': 1, 'Outerwear': 1, 'Accesories, Sw...","[40, 67, 69, 76, 79, 80, 81, 83, 84, 105, 114,...",558
2,"{'Dresses, jumpsuits and Complete set': 1, 'Ac...","[13, 77, 98, 200, 212, 213, 221, 239, 267, 274...",181
4,"{'Bottoms': 1, 'Tops': 2, 'Accesories, Swim an...","[50, 65, 137, 408, 417, 495, 712, 770, 790, 85...",169
13,"{'Bottoms': 2, 'Tops': 1, 'Accesories, Swim an...","[1029, 1321, 2183, 2320, 2580, 2814, 3022, 304...",39


In [23]:
size = 7
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

Unnamed: 0,des_product_class_set,des_product_class_outfit_codes,des_product_class_count
9,"{'Bottoms': 1, 'Tops': 1, 'Earrings': 1, 'Acce...","[72, 99, 202, 219, 244, 333, 382, 389, 572, 65...",45
5,"{'Earrings': 1, 'Outerwear': 1, 'Bottoms': 1, ...","[35, 48, 68, 87, 111, 188, 237, 276, 342, 413,...",42
6,"{'Tops': 1, 'Outerwear': 1, 'Bottoms': 1, 'Acc...","[43, 49, 130, 364, 423, 682, 746, 1551, 1948, ...",38
39,"{'Bottoms': 1, 'Outerwear': 1, 'Tops': 2, 'Ear...","[478, 1934, 2072, 2539, 2717, 3147, 3201, 3264...",27
145,"{'Home': 5, 'Accesories, Swim and Intimate': 2}","[4065, 4702, 4703, 5245, 5246, 5355, 5356, 543...",20


In [24]:
size = 8
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

Unnamed: 0,des_product_class_set,des_product_class_outfit_codes,des_product_class_count
86,"{'Tops': 2, 'Accesories, Swim and Intimate': 3...","[2504, 2915, 3241, 3242, 4118, 4387, 4970, 527...",18
9,"{'Bottoms': 1, 'Outerwear': 1, 'Accesories, Sw...","[172, 455, 898, 963, 1360, 1451, 1477, 1495, 1...",12
100,"{'Accesories, Swim and Intimate': 5, 'Tops': 1...","[3815, 4324, 4325, 5938, 6084, 6085, 6514, 651...",10
44,"{'Bottoms': 2, 'Accesories, Swim and Intimate'...","[838, 3421, 5206, 5512, 5888, 5969, 6370, 6670...",10
88,"{'Accesories, Swim and Intimate': 8}","[2638, 3415, 3994, 4006, 4445, 4460, 4952, 565...",9


In [25]:
size = 9
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

Unnamed: 0,des_product_class_set,des_product_class_outfit_codes,des_product_class_count
31,"{'Earrings': 4, 'Ring': 3, 'Necklace': 2}","[1045, 1046, 1047, 1264, 1417, 1849, 2050, 226...",9
65,"{'Accesories, Swim and Intimate': 4, 'Tops': 1...","[3239, 3359, 4650, 6251, 7597]",5
61,"{'Accesories, Swim and Intimate': 4, 'Tops': 2...","[3032, 5665, 7057, 7302, 7511]",5
73,"{'Accesories, Swim and Intimate': 4, 'Tops': 2...","[4074, 5230, 5231, 7067, 7506]",5
39,"{'Bottoms': 2, 'Tops': 1, 'Outerwear': 1, 'Acc...","[1463, 1974, 2995, 3879]",4


In [26]:
size = 10
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

Unnamed: 0,des_product_class_set,des_product_class_outfit_codes,des_product_class_count
39,"{'Accesories, Swim and Intimate': 10}","[2606, 2637, 4002, 4454, 4456, 4960, 5661, 5662]",8
2,"{'Bottoms': 1, 'Earrings': 1, 'Necklace': 2, '...","[92, 204, 607, 628]",4
12,"{'Earrings': 1, 'Outerwear': 1, 'Bottoms': 1, ...","[761, 1686, 4262]",3
14,"{'Bottoms': 1, 'Earrings': 1, 'Necklace': 2, '...","[820, 1190, 1795]",3
0,"{'Necklace': 2, 'Accesories, Swim and Intimate...","[4, 2182]",2


## Defining valid outfit combinations

Upon reviewing all the outfits, we can establish some rules for selecting valid combinations:

In addition to these observations, we can define a rule for constructing the outfits:

- The foundation of the outfit must consist of one of the following sets:
    - Tops + Bottoms + Footwear + Earrings + Accessories
    - Dress + Footwear + Earrings + Accessories
- The outfit can optionally include the following complements:
    - Outerwear, Bags, Glasses, Ring, and Necklace

A valid outfit is therefore a base configuration, optionally extended with a configuration of complements in which each complement appears at most once. This gives 42 possible configurations, and an outfit can contain up to 10 products of different classes.

NOTE: Outfits of size 2, and outfits of size 9 or larger, do not meet the conditions for a valid outfit.
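
The construction rule can be sketched as follows. This is a minimal illustration, not the project's `create_combinations` or `create_configurations` helpers, whose implementation is not shown in this notebook. The name `sketch_configurations` is hypothetical, and the sketch assumes that every non-empty subset of the five complements is allowed. Under that assumption there are 31 optional sets, whereas the helpers used below report 20, so they evidently apply a restriction or a complement list that is not visible here. In both cases the total follows the same pattern, number of bases × (number of optional sets + 1), which for the reported figures gives 2 × (20 + 1) = 42.

```python
from itertools import combinations


def sketch_configurations(bases, optional_sets):
    """Return each base alone, plus each base extended with every optional set."""
    configs = []
    for base in bases:
        configs.append(list(base))
        for extra in optional_sets:
            configs.append(list(base) + list(extra))
    return configs


# Class names as written in the rule above.
bases = [
    ['Tops', 'Bottoms', 'Footwear', 'Earrings', 'Accessories'],
    ['Dress', 'Footwear', 'Earrings', 'Accessories'],
]
complements = ['Outerwear', 'Bags', 'Glasses', 'Ring', 'Necklace']

# Assumption: every non-empty subset of the complements, each used at most once.
all_subsets = [
    subset
    for r in range(1, len(complements) + 1)
    for subset in combinations(complements, r)
]

configs = sketch_configurations(bases, all_subsets)
print(len(all_subsets), len(configs))  # 31 64
print(max(len(c) for c in configs))    # 10
```

The largest configuration is the five-item base plus all five complements, which matches the maximum of 10 products mentioned above.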

In [27]:
configurations_base = config['data']['configurations_base']

optional_products = config['data']['optional_products']

optional_configurations = create_combinations(optional_products, root = False)

all_configurations = create_configurations(configurations_base, optional_products)

print("Number of Total possible configurations of BASE products:", len(configurations_base))
print("Number of Total possible configurations of OPTIONAL products:", len(optional_configurations))
print("Number of Total possible configurations:", len(all_configurations))

Number of Total possible configurations of BASE products: 2
Number of Total possible configurations of OPTIONAL products: 20
Number of Total possible configurations: 42


In [34]:
feature_name = 'des_product_class'
code_sets = 'cod_outfit'
name_set = 'outfit'
result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)

selected_outfits = []
configurations_count = {}


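# Compare the class multiset of each outfit (sizes 3 to 8) with the candidate configurations.
for size in range(3, 9):
    outfit_products_list = result_df[result_df['outfit_size'] == size][f'{code_sets}_{feature_name}_tuple']
    for out_prod in list(outfit_products_list)[0]:
        # NOTE: the [:1] slice restricts the comparison to the first configuration only;
        # drop the slice to test every configuration.
        for i, conf in enumerate(all_configurations[:1]):
            if Counter(out_prod[1]) == Counter(conf):
                configurations_count[i] = configurations_count.get(i, 0) + 1
                selected_outfits.append(out_prod[0])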

print("Total number of outfits selected:", len(selected_outfits))

columns = ['cod_outfit','cod_modelo_color','des_product_class' ,'des_filename']
df_outfit_products_sel = df_outfit_products[df_outfit_products['cod_outfit'].isin(selected_outfits)][columns]
df_outfit_products_sel.head()

configurations_count_df = pd.DataFrame(list(configurations_count.items()), columns=['conf_id', 'size_conf'])
configurations_count_df['conf_id'] = configurations_count_df['conf_id'].astype(int)
configurations_count_df['prob_conf'] = configurations_count_df['size_conf'] / sum(configurations_count_df['size_conf'])


Total number of outfits selected: 0
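
The matching step relies on comparing two `Counter` objects, which are equal when they contain the same items with the same multiplicities, regardless of order. The sketch below illustrates this on toy configurations (the lists are invented, and `matches` is a hypothetical helper name). Note that the cell above only compares against `all_configurations[:1]`, and that class names must match the configuration entries exactly for any outfit to be selected.

```python
from collections import Counter


def matches(outfit_classes, configuration):
    """True when both contain the same classes with the same multiplicities."""
    return Counter(outfit_classes) == Counter(configuration)


# Toy configurations, invented for illustration.
configurations = [
    ['Tops', 'Bottoms', 'Earrings'],
    ['Dress', 'Earrings'],
]

outfit = ['Earrings', 'Dress']  # same items as configurations[1], different order

# Index of every configuration the outfit matches.
matched = [i for i, conf in enumerate(configurations) if matches(outfit, conf)]
print(matched)  # [1]

# Multiplicity matters: two 'Tops' do not match a single 'Tops'.
print(matches(['Tops', 'Tops'], ['Tops']))  # False
```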


In [35]:
df_outfit_products_sel

Unnamed: 0,cod_outfit,cod_modelo_color,des_product_class,des_filename
