# Fashion compatibility - Data Exploration and Analysis

---
## Author Information
- **Author:** Francesco Tedesco
- **Email:** francescotedesco7d2@gmail.com
- [**LinkedIn**](https://www.linkedin.com/in/francescotedesco7d2/)

---
## Overview
This notebook explores the dataset to understand how outfits are composed, in order to define the set of product combinations that count as valid outfits.

---


In [1]:
from pathlib import Path
import os

import pandas as pd

# Move to the parent directory (assumed to be the project root) so the `src` imports resolve.
os.chdir(str(Path.cwd().parent))

from src.utils.setup_utilities import load_config
from src.data_processing.group_manipulation import get_grouped_counts
from src.data_processing.group_manipulation import get_grouped_counts_feature_values
from src.data_processing.group_manipulation import get_unique_sets_features
from src.data_processing.group_manipulation import create_combinations
from src.data_processing.group_manipulation import create_configurations
from src.data_processing.group_manipulation import select_valid_outfits
from src.data_processing.product_processing import get_des_product_class

config = load_config()

In [2]:
df_outfits = pd.read_csv(config['data']['outfits_path'])
df_products = pd.read_csv(config['data']['products_path'])
# Outer join keeps products that belong to no outfit as well.
df_outfit_products = pd.merge(df_outfits, df_products, on='cod_modelo_color', how='outer')

## Initial exploration

In [None]:
df_outfits.head()

The `df_outfits` dataframe lists each outfit code together with the codes of the products it contains.

In [None]:
df_products.head()

The `df_products` dataframe contains information about each product, including the path to the corresponding image.

Some important things to consider: 

In [None]:
print("Total number of outfits:", len(df_outfits['cod_outfit'].unique()))

In [None]:
print("Total number of products:", len(df_products['cod_modelo_color'].unique()))

In [None]:
existing_products = set(df_products['cod_modelo_color'].unique())
products_w_outfit = set(df_outfits['cod_modelo_color'].unique())

products_without_outfit = existing_products - products_w_outfit

print("Number of products without an outfit:", len(products_without_outfit))

In [None]:
name_set = 'outfit'
code_sets = 'cod_outfit'
get_grouped_counts(df_outfits, code_sets, name_set)

If we plot the number of outfits for each size, we can observe that some sizes are rare, so the outfits of those sizes could be excluded from the training set.
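`get_grouped_counts` is a project helper whose source is not shown in this notebook. As a minimal sketch, assuming it counts the products in each outfit and then tallies how many outfits share each size, the same distribution could be computed with plain pandas. The toy data and the name `outfit_size_distribution` are hypothetical.

```python
import pandas as pd


def outfit_size_distribution(df: pd.DataFrame, code_col: str = "cod_outfit") -> pd.DataFrame:
    """Return one row per outfit size with the number of outfits of that size."""
    # Number of rows (products) per outfit code.
    sizes = df.groupby(code_col).size().rename("outfit_size")
    # Number of outfits sharing each size, ordered by size.
    counts = sizes.value_counts().sort_index()
    return counts.rename_axis("outfit_size").reset_index(name="n_outfits")


# Hypothetical toy data: three outfits with 2, 3 and 2 products.
toy = pd.DataFrame({
    "cod_outfit": [1, 1, 2, 2, 2, 3, 3],
    "cod_modelo_color": ["a", "b", "a", "c", "d", "b", "e"],
})
distribution = outfit_size_distribution(toy)
print(distribution)
```

Sizes with a very small `n_outfits` are the candidates for exclusion mentioned above.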

In [None]:
df_products.info()

## Product description

Since not all outfits will be considered in the training process, we should start by examining the product descriptions to determine which descriptors are useful. Later, we can use this information to establish criteria for defining valid combinations.

In [None]:
df_products['des_product_category'].unique()

In [None]:
df_products['des_product_family'].unique()

In [None]:
df_products['des_product_type'].unique()

We can observe that as the description becomes more specific, the number of different labels increases. Let's examine the counts for the values of 'des_product_category', the least specific descriptor:

In [None]:
df_outfit_products['des_product_category'].value_counts()

Because the Home and Beauty labels are rare, we can exclude them when defining valid outfit combinations. The remaining labels generalize better and are less likely to contain 'strange' products. The exception is the most common label (`Accesories, Swim and Intimate`), whose sublabels may have very low frequency or make little sense as outfit products.

In [None]:
df_accesories = df_outfit_products[df_outfit_products['des_product_category'] == 'Accesories, Swim and Intimate']
df_accesories['des_product_family'].value_counts()

The same observation applies to the `Jewellery` family. Let's examine the corresponding product types.

In [None]:
df_accesories = df_outfit_products[df_outfit_products['des_product_category'] == 'Accesories, Swim and Intimate']
df_jewellery = df_accesories[df_accesories['des_product_family'] == 'Jewellery']
df_jewellery['des_product_type'].value_counts()

We can now create a personalized product class by combining the three column descriptors we just examined.

In [None]:
df_outfit_products = df_outfit_products.copy()
df_outfit_products['des_product_class'] = df_outfit_products.apply(get_des_product_class, axis=1)

In [None]:
df_outfit_products['des_product_class'].unique()
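The source of `get_des_product_class` is not shown here, so its actual rules are unknown. The sketch below only illustrates the general idea of combining the three descriptors: use the coarse category where it is homogeneous, and fall back to the family or type where it is not. The function name, the `"Excluded"` label, and the toy rows are all hypothetical.

```python
import pandas as pd

# Categories the text proposes to leave out of valid combinations.
EXCLUDED_CATEGORIES = {"Home", "Beauty"}


def sketch_product_class(row: pd.Series) -> str:
    """Pick the descriptor level used as the class of one product row (illustrative rules)."""
    category = row["des_product_category"]
    family = row["des_product_family"]
    if category in EXCLUDED_CATEGORIES:
        return "Excluded"
    if family == "Jewellery":
        # Heterogeneous family: use the most specific descriptor.
        return row["des_product_type"]
    if category == "Accesories, Swim and Intimate":
        # Heterogeneous category: use the family instead.
        return family
    # Homogeneous category: the category itself serves as the class.
    return category


# Hypothetical toy rows, not taken from the dataset.
toy_products = pd.DataFrame({
    "des_product_category": ["Tops", "Accesories, Swim and Intimate",
                             "Accesories, Swim and Intimate", "Home"],
    "des_product_family": ["Shirts", "Jewellery", "Footwear", "Candles"],
    "des_product_type": ["Shirt", "Earrings", "Sandals", "Candle"],
})
toy_classes = toy_products.apply(sketch_product_class, axis=1)
print(toy_classes.tolist())
```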

## Outfits structure 


Now that we've created a new class to describe the products, let's take a closer look at how the outfits are actually constructed. 


In [None]:
size = 2
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

In [None]:
size = 3
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

In [None]:
size = 4
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

In [None]:
size = 5
feature_name = 'des_product_type'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

In [None]:
size = 6
feature_name = 'des_product_category'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

In [None]:
size = 7
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

In [None]:
size = 8
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

In [None]:
size = 9
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()

In [None]:
size = 10
feature_name = 'des_product_class'
name_set = 'outfit'
code_sets = 'cod_outfit'

result_df = get_grouped_counts_feature_values(df_outfit_products, code_sets, name_set, feature_name)
tuple_set = result_df[result_df['outfit_size'] == size][f'cod_outfit_{feature_name}_tuple'].iloc[0]
features_sets = get_unique_sets_features(tuple_set, feature_name)
features_sets.head()
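The cells above rely on `get_grouped_counts_feature_values` and `get_unique_sets_features`, which are project helpers not shown in this notebook. As a rough sketch of the underlying idea, assuming the goal is to list which combinations of classes occur among outfits of a given size, one could proceed as follows. The function name and toy data are hypothetical.

```python
from collections import Counter

import pandas as pd


def class_combinations(df: pd.DataFrame, code_col: str, feature_col: str, size: int) -> Counter:
    """Count the feature-value combinations seen among outfits with `size` products."""
    # One sorted tuple of feature values per outfit, so product order is ignored.
    combos = {
        code: tuple(sorted(values))
        for code, values in df.groupby(code_col)[feature_col]
    }
    # Keep only outfits with the requested number of products.
    return Counter(combo for combo in combos.values() if len(combo) == size)


# Hypothetical toy data: outfits 1 and 2 share the same pair of classes.
toy_outfits = pd.DataFrame({
    "cod_outfit": [1, 1, 2, 2, 3, 3, 4, 4, 4],
    "des_product_class": ["Tops", "Bottoms", "Bottoms", "Tops",
                          "Dress", "Footwear", "Tops", "Bottoms", "Footwear"],
})
pairs = class_combinations(toy_outfits, "cod_outfit", "des_product_class", size=2)
print(pairs.most_common())
```

Calling the function in a loop over the sizes of interest would replace the repeated cells with a single one.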

## Defining valid outfit combinations

Upon reviewing all the outfits, we can establish some rules for selecting valid combinations:

In addition to these observations, we can define a rule for constructing the outfits:

- The foundation of the outfit must consist of one of the following sets:
    - Tops + Bottoms + Footwear + Earrings + Accessories
    - Dress + Footwear + Earrings + Accessories
- The outfit can optionally include the following complements:
    - Outerwear, Bags, Glasses, Ring, and Necklace

A valid outfit is therefore a base configuration, optionally extended with a configuration of complements in which each complement appears at most once. This gives 42 possible outfit configurations, and an outfit can contain up to 10 products of different classes.

NOTE: Outfits of size 2, and outfits of size 9 or larger, do not meet the conditions for a valid outfit.
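The rule above can be sketched as a small enumeration: every base set is combined with every subset of the complements, and an outfit is valid when its classes match one of the resulting configurations. This is only an illustration under stated assumptions. The functions `enumerate_configurations` and `is_valid_outfit` and the shortened toy lists are hypothetical, and the project's own `create_configurations` and `select_valid_outfits` may work differently. With b bases and n complements this scheme yields b · 2^n configurations; the figure reported above comes from the project's own helpers and config lists, which are not shown here.

```python
from collections import Counter
from itertools import combinations


def enumerate_configurations(bases, optional):
    """Return every base extended with every subset of the optional classes."""
    configurations = []
    for base in bases:
        # Subsets of size 0..len(optional); each complement appears at most once.
        for k in range(len(optional) + 1):
            for extras in combinations(optional, k):
                configurations.append(tuple(base) + extras)
    return configurations


def is_valid_outfit(product_classes, configurations):
    """An outfit is valid if its classes match a configuration, ignoring order."""
    wanted = Counter(product_classes)
    return any(Counter(conf) == wanted for conf in configurations)


# Hypothetical, shortened lists; the real ones are read from the project config.
toy_bases = [("Tops", "Bottoms", "Footwear"), ("Dress", "Footwear")]
toy_optional = ["Bags", "Glasses"]
toy_configurations = enumerate_configurations(toy_bases, toy_optional)

print(len(toy_configurations))
print(is_valid_outfit(["Footwear", "Dress", "Bags"], toy_configurations))
```

Comparing with `Counter` rather than `set` means that an outfit containing the same complement twice is rejected.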

In [None]:
configurations_base = config['data']['configurations_base']

optional_products = config['data']['optional_products']

optional_configurations = create_combinations(optional_products, root = False)

all_configurations = create_configurations(configurations_base, optional_products)

print("Total number of possible configurations of BASE products:", len(configurations_base))
print("Total number of possible configurations of OPTIONAL products:", len(optional_configurations))
print("Total number of possible configurations:", len(all_configurations))

In [None]:
feature_name = 'des_product_class'
code_sets = 'cod_outfit'
name_set = 'outfit'

df_outfit_products_sel, configurations_count = select_valid_outfits(df_outfit_products, feature_name, code_sets, name_set, all_configurations)

print("Total number of outfits selected:", len(df_outfit_products_sel['cod_outfit'].unique()))

df_outfit_products_sel.head()


After selecting the valid outfits, we observe that only a minority of them can be considered trainable. Therefore, the model's performance will be evaluated based on its ability to distinguish good outfits from randomly created outfits using unseen products (unseen images). 


When randomly creating new outfits for evaluating the model later, the products will be selected at random based on the types of configurations considered during training:

In [None]:
configurations_count_df = pd.DataFrame(list(configurations_count.items()), columns=['conf_id', 'size_conf'])
configurations_count_df['conf_id'] = configurations_count_df['conf_id'].astype(int)
configurations_count_df['prob_conf'] = configurations_count_df['size_conf'] / configurations_count_df['size_conf'].sum()
configurations_count_df
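The sampling step described above could look roughly like the sketch below, which draws configuration ids in proportion to `prob_conf` using NumPy's random generator. The function name `sample_configurations`, the seed, and the toy counts are assumptions for illustration; the notebook does not show how the random outfits are actually generated.

```python
import numpy as np
import pandas as pd


def sample_configurations(conf_df: pd.DataFrame, n_samples: int, seed: int = 0) -> np.ndarray:
    """Draw configuration ids with probability proportional to their training frequency."""
    rng = np.random.default_rng(seed)
    return rng.choice(
        conf_df["conf_id"].to_numpy(),
        size=n_samples,
        p=conf_df["prob_conf"].to_numpy(),
    )


# Hypothetical counts per configuration id, shaped like `configurations_count`.
toy_counts = {"0": 6, "1": 3, "2": 1}
toy_df = pd.DataFrame(list(toy_counts.items()), columns=["conf_id", "size_conf"])
toy_df["conf_id"] = toy_df["conf_id"].astype(int)
toy_df["prob_conf"] = toy_df["size_conf"] / toy_df["size_conf"].sum()

sampled = sample_configurations(toy_df, n_samples=1000)
print(pd.Series(sampled).value_counts(normalize=True))
```

Each sampled id would then be expanded into a random outfit by picking one unseen product per class in that configuration.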

Some additional information about the outfits and products considered in the training set is as follows:

In [None]:
name_set = 'outfit'
code_sets = 'cod_outfit'
get_grouped_counts(df_outfit_products_sel, code_sets, name_set)

In [None]:
df_outfit_products_sel['des_product_class'].value_counts(normalize=True)