# TECHNICAL TEST: EXPLORATORY DATA ANALYSIS

## Context

A large prospect recently contacted us about data quality issues.

The prospect receives “FEDAS codes” from its suppliers. However, these codes are often incorrect, and a team of 5 people currently works full time on checking them. Let’s automate this for them!

The prospect sent us a training dataset containing the original FEDAS codes (***incorrect_fedas_code***) and the manually corrected FEDAS codes (***correct_fedas_code***).

Our goal is to build an algorithm that corrects the FEDAS codes by predicting the column **correct_fedas_code**.

In [16]:
from importlib import reload
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
tqdm.pandas()
from utils import split_fedas_code

The cleaning functions that we'll test in this notebook will be implemented in a separate module `utils.py`. 

## Data exploration

### Train dataset

In [5]:
raw_train = pd.read_csv('data_technical_test/train_technical_test.csv', 
    na_values="",
    dtype={
        "incorrect_fedas_code": object, 
        "correct_fedas_code": object, 
    },
    parse_dates=["avalability_start_date", "avalability_end_date"])

In [6]:
train = raw_train.copy(deep=True)

In [7]:
train["incorrect_fedas_code"] = train["incorrect_fedas_code"].fillna("")

In [4]:
train.shape

(39322, 31)

In [5]:
train.columns

Index(['brand', 'model_code', 'model_label', 'commercial_label',
       'incorrect_fedas_code', 'article_main_category', 'article_type',
       'article_detail', 'comment', 'avalability_start_date',
       'avalability_end_date', 'length', 'width', 'height', 'color_code',
       'color_label', 'inaccurate_gender', 'country_of_origin',
       'country_of_manufacture', 'embakment_harbor', 'shipping_date',
       'eco_participation', 'eco_furniture', 'multiple_of_order',
       'minimum_multiple_of_order', 'net_weight', 'raw_weight', 'volume',
       'size', 'correct_fedas_code', 'accurate_gender'],
      dtype='object')

In [8]:
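# Note: `datetime_is_numeric` only exists in pandas 1.1 to 1.5; pandas 2.0 removed it
# and always summarizes datetime columns numerically.
train.describe(include='all', datetime_is_numeric=True)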

| | brand | model_code | model_label | commercial_label | incorrect_fedas_code | article_main_category | article_type | article_detail | comment | avalability_start_date | ... | eco_participation | eco_furniture | multiple_of_order | minimum_multiple_of_order | net_weight | raw_weight | volume | size | correct_fedas_code | accurate_gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 39322 | 39322 | 39322 | 6238 | 39322.0 | 38571 | 38402 | 29622 | 1552 | 24908 | ... | 39322.0 | 39322.0 | 39322.0 | 39322.0 | 39322.0 | 39322.0 | 39322.0 | 39322 | 39322.0 | 39322 |
| unique | 329 | 38715 | 29558 | 4772 | 2188.0 | 710 | 1221 | 4076 | 134 | | ... | | | | | | | | 812 | 1468.0 | 11 |
| top | brand_1 | 813271-40 | MAN JEANS | TBT_AP_MN TOP | | LOISIRS | HOMME | 09-SHOES (LOW) | VETEMENT | | ... | | | | | | | | L | 275124.0 | HO |
| freq | 6089 | 4 | 86 | 20 | 10854.0 | 2855 | 3906 | 1436 | 413 | | ... | | | | | | | | 7448 | 642.0 | 14775 |
| mean | | | | | | | | | | 2020-06-15 17:42:08.151597824 | ... | 0.005552 | 8e-05 | 2.780937 | 10.425055 | 5.021437 | 2.270721 | 1.545066 | | | |
| min | | | | | | | | | | 2000-01-01 00:00:00 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | | |
| 25% | | | | | | | | | | 2020-02-15 00:00:00 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | | | |
| 50% | | | | | | | | | | 2020-07-01 00:00:00 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | | | |
| 75% | | | | | | | | | | 2020-12-01 00:00:00 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.13 | 0.0 | 0.0 | | | |
| max | | | | | | | | | | 2021-05-25 00:00:00 | ... | 3.15 | 3.15 | 324.0 | 5000.0 | 12500.0 | 14000.0 | 1010.0 | | | |


### Test dataset

In [7]:
test = pd.read_csv('data_technical_test/test_technical_test.csv', 
    na_values="",
    dtype={
        "incorrect_fedas_code": object, 
        "correct_fedas_code": object, 
    },
    parse_dates=["avalability_start_date", "avalability_end_date"])

In [8]:
test.shape

(39323, 30)

Let's check that the test dataset only differs from the train dataset by the target column:

In [9]:
set(train.columns).difference(set(test.columns))

{'correct_fedas_code'}
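
Together with the two shapes (31 versus 30 columns), this one-directional difference is enough here. In general, a two-way comparison also catches columns that exist only in the test set. A minimal sketch, where `compare_columns` is a hypothetical helper and the column lists are shortened for illustration:

```python
def compare_columns(train_columns, test_columns):
    """Report the columns that appear in only one of the two datasets."""
    train_set, test_set = set(train_columns), set(test_columns)
    return {
        "only_in_train": sorted(train_set - test_set),
        "only_in_test": sorted(test_set - train_set),
    }


# Shortened, illustrative column lists (not the full datasets).
report = compare_columns(
    ["brand", "size", "correct_fedas_code"],
    ["brand", "size"],
)
print(report)
```

With the real data, this would be called as `compare_columns(train.columns, test.columns)`.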

## Business information

A quick Google search shows that a FEDAS code is built from the following parts ([source](https://www.sgidho.com/FR/SiteAssets/SitePages/Introduction/Introduction%20V%204.0.pdf)):

- Digit 1: Product type (Hardware, Footwear, Textile, Service, Rental)
- Digits 2 and 3: Activity code (type of sports)
- Digits 4 and 5: Product Maingroup of the activity
- Digit 6: Product Subgroup


But things are not that simple: the type, activity, etc. are not indicated clearly enough in the other columns for a few expert rules to be sufficient. Moreover, the data may contain errors.

So let's split the FEDAS code into 4 groups of digits (1, 23, 45, 6) and observe the correlations on each group.

We'll split the incorrect FEDAS code as well. This will ease the analysis and also be part of feature engineering: some parts of the incorrect code may already be right and could help predict the correct code.
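
The split into groups (1, 23, 45, 6) can be sketched as follows. The real `split_fedas_code` lives in `utils.py` and is not shown in this notebook, so `split_fedas_code_sketch` is only a hypothetical stand-in. In particular, returning integers and using `-1` for empty or malformed codes are assumptions.

```python
def split_fedas_code_sketch(code, missing=-1):
    """Split a 6-digit FEDAS code into (type, activity, main group, subgroup).

    Hypothetical stand-in for utils.split_fedas_code: codes that are empty
    or not exactly 6 ASCII digits yield `missing` for every part (assumption).
    """
    code = "" if code is None else str(code).strip()
    if len(code) != 6 or not (code.isascii() and code.isdigit()):
        return (missing, missing, missing, missing)
    return (int(code[0]), int(code[1:3]), int(code[3:5]), int(code[5]))


print(split_fedas_code_sketch("378101"))  # digits grouped as 3 | 78 | 10 | 1
print(split_fedas_code_sketch(""))        # empty code falls back to the sentinel
```

Keeping the codes as strings when reading the CSV (as done with `dtype=object`) matters here, because leading zeros inside a group would otherwise be lost.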


In [9]:
correct_fedas_columns = [f"correct_fedas_{i}" for i in range(1, 5)]
incorrect_fedas_columns = [f"incorrect_fedas_{i}" for i in range(1, 5)]
train[correct_fedas_columns] = train.correct_fedas_code.apply(split_fedas_code).apply(pd.Series)    
train[incorrect_fedas_columns] = train.incorrect_fedas_code.apply(split_fedas_code).apply(pd.Series)

In [10]:
train.drop(columns=["correct_fedas_code"], inplace=True)


In [None]:
# profile = ProfileReport(train, title="Pandas Profiling Report", explorative=True)
# profile.to_file("train_report.html")

The only noticeable linear correlation between the correct FEDAS parts and the other columns is between `correct_fedas_1` and `incorrect_fedas_1`. This is expected, since the product type is the part with the smallest number of classes (4).

In [11]:
np.corrcoef(train.correct_fedas_1, train.incorrect_fedas_1)

array([[1.        , 0.39386688],
       [0.39386688, 1.        ]])
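
Pearson correlation treats the codes as numbers, although they are really labels. A complementary check is the share of rows where each part of the incorrect code already equals the corresponding part of the correct code. The sketch below assumes the `correct_fedas_i` / `incorrect_fedas_i` column naming used above; `part_agreement` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def part_agreement(df, n_parts=4):
    """Fraction of rows where each incorrect part equals the correct part."""
    rates = {}
    for i in range(1, n_parts + 1):
        same = df[f"correct_fedas_{i}"] == df[f"incorrect_fedas_{i}"]
        rates[f"fedas_{i}"] = same.mean()
    return pd.Series(rates)


# Made-up toy rows with only the first two parts, for illustration.
toy = pd.DataFrame({
    "correct_fedas_1": [3, 3, 1, 2],
    "incorrect_fedas_1": [3, 1, 1, 2],
    "correct_fedas_2": [78, 64, 75, 24],
    "incorrect_fedas_2": [78, 64, 0, 0],
})
agreement = part_agreement(toy, n_parts=2)
print(agreement)
```

On the real data, `part_agreement(train)` would show which parts of the supplier code are most often already right.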

However, we'd like to be able to predict `correct_fedas_code` from the other features alone, so we'll drop `incorrect_fedas_code`.

## Feature engineering

Let's start by using only the columns that are fully populated. If the model's performance is not satisfactory, we'll progressively add the other columns.

### Drop columns

In [101]:
train = raw_train.copy(deep=True)

In [102]:
target = pd.DataFrame(train.correct_fedas_code)

In [103]:
train = train.drop(columns=["incorrect_fedas_code", "correct_fedas_code"])

In [104]:
train.columns

Index(['brand', 'model_code', 'model_label', 'commercial_label',
       'article_main_category', 'article_type', 'article_detail', 'comment',
       'avalability_start_date', 'avalability_end_date', 'length', 'width',
       'height', 'color_code', 'color_label', 'inaccurate_gender',
       'country_of_origin', 'country_of_manufacture', 'embakment_harbor',
       'shipping_date', 'eco_participation', 'eco_furniture',
       'multiple_of_order', 'minimum_multiple_of_order', 'net_weight',
       'raw_weight', 'volume', 'size', 'accurate_gender'],
      dtype='object')

In [105]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39322 entries, 0 to 39321
Data columns (total 29 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   brand                      39322 non-null  object        
 1   model_code                 39322 non-null  object        
 2   model_label                39322 non-null  object        
 3   commercial_label           6238 non-null   object        
 4   article_main_category      38571 non-null  object        
 5   article_type               38402 non-null  object        
 6   article_detail             29622 non-null  object        
 7   comment                    1552 non-null   object        
 8   avalability_start_date     24908 non-null  datetime64[ns]
 9   avalability_end_date       22004 non-null  datetime64[ns]
 10  length                     39322 non-null  float64       
 11  width                      39322 non-null  float64       
 12  heig

Among these features, we select those that seem most relevant:

In [117]:
train = train[[
       'article_main_category', 'article_type', 'article_detail', 
       'comment', 'accurate_gender']]

In [107]:
target[correct_fedas_columns] = target.correct_fedas_code.apply(split_fedas_code).apply(pd.Series)

In [108]:
target.drop(columns=["correct_fedas_code"], inplace=True)

In [109]:
target

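| | correct_fedas_1 | correct_fedas_2 | correct_fedas_3 | correct_fedas_4 |
|---|---|---|---|---|
| 0 | 3 | 78 | 10 | 1 |
| 1 | 3 | 64 | 30 | 8 |
| 2 | 1 | 75 | 89 | 0 |
| 3 | 2 | 24 | 11 | 8 |
| 4 | 1 | 15 | 94 | 4 |
| ... | ... | ... | ... | ... |
| 39317 | 2 | 64 | 70 | 1 |
| 39318 | 1 | 46 | 98 | 1 |
| 39319 | 2 | 0 | 12 | 5 |
| 39320 | 2 | 0 | 12 | 4 |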


In [72]:
for col in correct_fedas_columns:
    print(f'For {col}, there are {target[col].nunique()} unique values.')

For correct_fedas_1, there are 4 unique values.
For correct_fedas_2, there are 44 unique values.
For correct_fedas_3, there are 99 unique values.
For correct_fedas_4, there are 10 unique values.


Since these codes are categorical, there is no need for normalization / standardization.

We'll assume that the different parts of the FEDAS code are independent of each other, so we'll train one model per part.
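
The one-model-per-part idea can be sketched with a deliberately simple baseline: for each part, predict the most frequent value seen with a given feature value. This is not the model used later, only a stand-in showing the per-part structure. The function names, the choice of `article_type` as key, and the toy values are all assumptions.

```python
import pandas as pd


def fit_majority_lookup(features, target_part, key):
    """Map each value of `key` to the most frequent target value seen with it."""
    table = pd.DataFrame({"key": features[key].fillna("MISSING"), "y": target_part})
    lookup = table.groupby("key")["y"].agg(lambda s: s.mode().iloc[0])
    fallback = target_part.mode().iloc[0]  # used for unseen key values
    return lookup, fallback


def predict_majority_lookup(model, features, key):
    """Predict one code part from the lookup, falling back for unseen keys."""
    lookup, fallback = model
    keys = features[key].fillna("MISSING")
    return keys.map(lookup).fillna(fallback).astype(int)


# Made-up toy data with two of the four parts, for illustration.
features = pd.DataFrame({"article_type": ["HOMME", "HOMME", "FEMME", None]})
targets = pd.DataFrame({
    "correct_fedas_1": [1, 1, 2, 3],
    "correct_fedas_4": [8, 8, 1, 4],
})

# One independent model per part of the code.
models = {
    part: fit_majority_lookup(features, targets[part], "article_type")
    for part in targets.columns
}
preds = pd.DataFrame({
    part: predict_majority_lookup(model, features, "article_type")
    for part, model in models.items()
})
print(preds)
```

Any classifier could replace the lookup, as long as one instance is fitted per `correct_fedas_i` column.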

Now we need to encode the categorical features.
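
One possible encoding, sketched under assumptions: missing values get an explicit label, rare levels are collapsed so that columns with thousands of distinct values stay manageable, and the result is one-hot encoded with `pd.get_dummies`. The helper name, the `MISSING` / `OTHER` labels, and the `min_count` threshold are hypothetical choices, not the notebook's actual method.

```python
import pandas as pd


def encode_categoricals(df, min_count=2, missing_label="MISSING", other_label="OTHER"):
    """One-hot encode string columns after handling missing and rare levels."""
    encoded = df.copy()
    for col in encoded.columns:
        values = encoded[col].fillna(missing_label).astype(str)
        counts = values.value_counts()
        rare = counts[counts < min_count].index
        # Keep frequent levels, replace rare ones with a shared label.
        encoded[col] = values.where(~values.isin(rare), other_label)
    return pd.get_dummies(encoded, columns=list(encoded.columns), dtype=int)


# Made-up toy rows, for illustration.
toy = pd.DataFrame({
    "article_main_category": ["LOISIRS", "LOISIRS", "FOOTBALL", None],
    "accurate_gender": ["HO", "FE", "HO", "HO"],
})
features = encode_categoricals(toy)
print(features)
```

The threshold trades detail for dimensionality: a higher `min_count` gives fewer columns but merges more levels into `OTHER`.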

In [118]:
train.describe()

| | article_main_category | article_type | article_detail | comment | accurate_gender |
|---|---|---|---|---|---|
| count | 38571 | 38402 | 29622 | 1552 | 39322 |
| unique | 710 | 1221 | 4076 | 134 | 11 |
| top | LOISIRS | HOMME | 09-SHOES (LOW) | VETEMENT | HO |
| freq | 2855 | 3906 | 1436 | 413 | 14775 |


In [115]:
train.article_main_category.value_counts()

LOISIRS                       2855
FOOTBALL                      2518
TRAINING                      2498
SPORTSTYLE                    2310
LOISIR                        1770
                              ... 
BAGS WAIST BAG                   1
GOODMORNING                      1
TRICOLORE                        1
TERRE                            1
CLAVAS (BALA, HELMA, THIN)       1
Name: article_main_category, Length: 710, dtype: int64

In [119]:
train

| | article_main_category | article_type | article_detail | comment | accurate_gender |
|---|---|---|---|---|---|
| 0 | TRAINING | HOMME | 09-SHOES (LOW) | | HO |
| 1 | GARDEN | RUBBER BOOTS | BOOTS | | FE |
| 2 | SAC | HOMME | N1FARROW | MATERIEL RANDONNEE | UA |
| 3 | RACKET SPORTS | FEMME | 21-TANK | | FE |
| 4 | | | | SNO | UA |
| ... | ... | ... | ... | ... | ... |
| 39317 | OUTDOOR ADVENTURE | SHORT | | | HO |
| 39318 | RUNNING | UNISEX | SEMELLE | | UA |
| 39319 | TRAINING | FEMME | KATAKANA GRAPHIC T | | FE |
| 39320 | SPORTSTYLE | HOMME | 27-T-SHIRT (SHORT SLEEVE) | | HO |
