<a id="contents"></a>
# Session 2 - The Machine Learning Workflow



### [Preparing a "rich" dataset](#rich)
- [Importing with pandas](#import)
- [Prices variable](#prices)
- [Pandas exercise](#pandas_exercise)

### [Recoding Categorical Data](#recoding)
- [Recoding categorical variables with OneHotEncoder](#ohe)
- [Recoding categorical variables with pandas `get_dummies`](#dummy)
- [Importing `mini_victoria.txt` data](#mini_victoria)

### [Handling Missing Data](#missing_data)
- [Importing datasets](#import_datasets)
- [Preparing datasets](#prepare_data)
- [Imputation with the median](#median)
- [Imputation with the mean](#mean)
- [Imputation with linear interpolation](#linear)
- [Simple imputation](#simple)
- [Multiple imputation](#multiple)
- [K Nearest Neighbors](#neighbors)

<a id="import"></a>
### Importing with pandas

- Save the `mini_victoria.txt` file
- Check the data in a text editor such as Notepad++ or Visual Studio Code
- Import it using pandas
- Print a comprehensive summary

In [2]:
import pandas as pd
import os

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
DATA_PATH = r'D:\Dropbox\EM Lyon\7MPMLS_Introduction_Machine_Learning\data'
os.chdir(DATA_PATH)

In [4]:
df = pd.read_csv('mini_victoria.txt', delimiter='*', encoding='ansi')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45339 non-null  object 
 1   mrp               45339 non-null  object 
 2   price             45339 non-null  object 
 3   pdp_url           45339 non-null  object 
 4   brand_name        45339 non-null  object 
 5   product_category  45339 non-null  object 
 6   retailer          45339 non-null  object 
 7   description       45339 non-null  object 
 8   rating            13662 non-null  float64
 9   review_count      13662 non-null  float64
 10  style_attributes  0 non-null      float64
 11  total_sizes       45339 non-null  object 
 12  available_size    45339 non-null  object 
 13  color             45339 non-null  object 
dtypes: float64(3), object(11)
memory usage: 4.8+ MB
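
The import step can be tried without the original file. The sketch below is only an illustration: the three sample rows are made up, and `cp1252` is an assumption about the file's encoding (the `ansi` codec name is an alias that Python only provides on Windows, so it may fail elsewhere).

```python
import io
import pandas as pd

# Hypothetical sample standing in for mini_victoria.txt ('*' is the delimiter).
raw_text = (
    "product_name*mrp*price\n"
    "Demi Bra*$54.50*$19.99\n"
    "Easy Plunge Bra*$29.50*$29.50\n"
    "Café Lace Bralette*€20.00*€20.00\n"
)

# Encode explicitly so the example does not depend on the platform default.
buffer = io.BytesIO(raw_text.encode("cp1252"))

# Assumption: the real file is Windows-1252 encoded; adjust if it is not.
sample = pd.read_csv(buffer, sep="*", encoding="cp1252")
sample.info()
```

As in the real dataset, the two price columns are read as text because of the currency symbols.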


In [6]:
df.head(3)

Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,retailer,description,rating,review_count,style_attributes,total_sizes,available_size,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,$36.50,$36.50,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Victoriassecret US,Game-changer: your favorite maximum-support sp...,3.6,25.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D3,White
1,Body by Victoria Demi Bra,$54.50,$19.99,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Demi Bra,Victoriassecret US,Sexy comfort and a sleek shape start with low-...,,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",38C,cadette green
2,Easy Plunge Bra,$29.50,$29.50,https://www.victoriassecret.com/bras/bralette/...,Victoria's Secret,Easy Plunge Bra,Victoriassecret US,This supersoft bra is easy to love with fully ...,4.4,260.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",34DD,Black


[Table of Contents](#contents)

<a id="prices"></a>
### The price variables are not recognized as quantitative
- Pre-process them so that pandas reads them as numeric
- Write a function that strips the `$` symbol from USD prices and replaces prices in any other currency with missing values
- Apply it to each price column
- Check the result


Let us first have a look at one price column

In [7]:
df['mrp'].value_counts()

$10.50       4871
$36.50       4537
$34.50       3600
$34.95       2696
$54.50       2292
$32.95       1975
$39.50       1966
$44.50       1911
$20.00       1822
$29.50       1818
$49.50       1721
$59.50       1418
$24.50       1189
$32.50       1028
$46.50        982
$62.50        915
$42.50        803
$56.50        774
$14.50        714
$58.50        675
$48.50        668
$52.50        644
$8.50         586
$35.00        535
$16.50        502
$36.00        430
$38.00        345
$36.95        323
$64.50        298
$24.95        281
$48.00        271
$25.00        255
$58.00        211
$15.00        206
$55.50        179
$30.00        161
$49.95        154
$52.00        148
$38.50        129
$42.00        128
$68.00        126
$32.00        120
$22.95        104
$39.95         93
$66.50         76
$26.95         72
$68.50         71
$29.95         62
$40.00         50
$54.00         44
$22.50         38
$42.95         37
$28.00         31
$54.95         29
$78.00         27
$12.50    

In [8]:
import numpy as np

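def remove(symbol, var):
    """Strip `symbol` from each price string; return NaN when it is absent."""
    value = []
    for row in var:
        if symbol in row:
            # Keep the text after the symbol, e.g. '$36.50' -> 36.5
            value.append(float(row.split(symbol)[1]))
        else:
            # Price in another currency: treat it as missing
            value.append(np.nan)
    return value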

Check how many missing values we have

In [9]:
res = remove('$',df['mrp'])
pd.DataFrame(res).isnull().sum()

0    39
dtype: int64
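
The same cleaning can be written without an explicit loop. This is a minimal sketch on made-up prices, not the notebook's method: the helper name `usd_to_float` is hypothetical, and it assumes USD prices always start with `$`, which is slightly stricter than the `remove` function above.

```python
import pandas as pd

def usd_to_float(col: pd.Series) -> pd.Series:
    """Return prices as floats; anything not starting with '$' becomes NaN."""
    is_usd = col.str.startswith("$", na=False)
    # Keep USD strings only, drop the leading '$', then convert to numbers.
    return pd.to_numeric(col.where(is_usd).str.slice(1), errors="coerce")

# Hypothetical values: two USD prices, one euro price, one junk entry.
prices = pd.Series(["$36.50", "€20.00", "$19.99", "n/a"])
parsed = usd_to_float(prices)
print(parsed)
print(parsed.isna().sum())
```

Returning a Series rather than a list keeps the original index, which matters if the column is later assigned back into a filtered dataframe.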

Now, replace the two non-numerical price columns by numerical price columns (quantitative data)

In [10]:
df['mrp'] = remove('$',df['mrp'])

In [11]:
df['price'] = remove('$',df['price'])

Count the number of unique modalities in each variable of the dataframe

In [12]:
df.nunique()

product_name         599
mrp                   72
price                 89
pdp_url             1410
brand_name             2
product_category     445
retailer               1
description          536
rating                31
review_count         333
style_attributes       0
total_sizes           30
available_size        44
color               1300
dtype: int64

Check this particular binary variable

In [13]:
df['brand_name'].value_counts()

Victoria's Secret         34240
Victoria's Secret Pink    11099
Name: brand_name, dtype: int64

If we were to continue analysing this dataset, we would drop the following columns:
- `retailer`: it takes a single value, so it carries no information
- `style_attributes`: it contains no values at all (every entry is missing)
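
Both kinds of column can be detected with `nunique()`, since it ignores missing values: a constant column has one distinct value and an all-missing column has none. The sketch below uses a small made-up frame that mimics the two cases.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one constant column, one empty column, one useful column.
toy = pd.DataFrame({
    "retailer": ["Victoriassecret US"] * 3,
    "style_attributes": [np.nan] * 3,
    "price": [36.5, 19.99, 29.5],
})

n_unique = toy.nunique()                          # NaN is not counted as a value
to_drop = n_unique[n_unique <= 1].index.tolist()  # constant or entirely missing
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```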

[Table of Contents](#contents)

<a id="pandas_exercise"></a>
### Pandas Exercise

1. Write the code that returns the name of the cheapest product
2. Write the code that counts the products whose available size equals '38A'
3. Write the code that lists and counts the type and color of the most expensive products whose name contains 'sport bra'

In [14]:
# Write the lines of code to provide the name of the cheapest product 
df.loc[df['price']==df['price'].min(),'product_name'].unique()

array(['Cotton Lingerie Lace-waist Brief Panty',
       'Cotton Lingerie Mesh Thong Panty', 'Seamless Cheekini Panty',
       'Cotton Lingerie String Bikini Panty'], dtype=object)

In [15]:
# Write the lines of code to count the number of products with available size equal to '38A'
df.loc[df['available_size'].str.contains('38A'), 'product_name'].value_counts()

Series([], Name: product_name, dtype: int64)
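
Note that `str.contains` performs a substring match, so it would also pick up a size such as '38AA'. If strict equality is wanted, a comparison with `==` is the more direct test. A small sketch on made-up sizes shows the difference:

```python
import pandas as pd

# Hypothetical size labels.
sizes = pd.Series(["38A", "38AA", "34B", "38A", "32D"], name="available_size")

substring_hits = sizes.str.contains("38A", regex=False).sum()  # also counts '38AA'
exact_hits = (sizes == "38A").sum()                            # strict equality

print(substring_hits, exact_hits)
```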

In [16]:
# Write the lines of code to list and count the type and color of the most expensive products containing 'sport bra'
# The price comparison is parenthesised because `&` binds more tightly than `==`
df.loc[df['product_name'].str.lower().str.contains('sport bra') & (df['price'] == df['price'].max()),
       ['product_category', 'color']].value_counts()

Series([], dtype: int64)
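
An empty result is expected whenever the most expensive product in the whole dataset is not a sport bra. One possible reading of the exercise is "the most expensive among the sport bras"; the sketch below follows that reading on made-up rows (all product names, categories and prices are hypothetical).

```python
import pandas as pd

# Hypothetical catalogue extract.
toy = pd.DataFrame({
    "product_name": ["Knockout Sport Bra", "Lace Demi Bra",
                     "Player Sport Bra", "Angel Max Sport Bra"],
    "product_category": ["Knockout", "Demi Bra", "Player", "Angel Max"],
    "color": ["Black", "White", "Black", "Red"],
    "price": [54.5, 78.0, 54.5, 30.0],
})

# Step 1: restrict to sport bras.
is_sport = toy["product_name"].str.lower().str.contains("sport bra", regex=False)
sport = toy[is_sport]

# Step 2: take the maximum price within that subset only.
top = sport[sport["price"] == sport["price"].max()]
counts = top[["product_category", "color"]].value_counts()
print(counts)

# For comparison: requiring the overall maximum price selects nothing here.
overall = (is_sport & (toy["price"] == toy["price"].max())).sum()
print(overall)
```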

[Table of Contents](#contents)

<a id="recoding"></a>
## Recoding Categorical Data

### Import the `Credit.csv` dataset
- Recode all the categorical variables using scikit-learn's `OneHotEncoder` and pandas `get_dummies`
- Compare your results

In [17]:
import pandas as pd
import os
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_info_columns', 300)

import warnings
warnings.filterwarnings('ignore')

In [18]:
DATA_PATH = r'D:\Dropbox\EM Lyon\7MPMLS_Introduction_Machine_Learning\data'
os.chdir(DATA_PATH)

In [19]:
df = pd.read_csv('Credit.csv', delimiter=',', encoding='utf-8')

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Income     400 non-null    float64
 1   Limit      400 non-null    int64  
 2   Rating     400 non-null    int64  
 3   Cards      400 non-null    int64  
 4   Age        400 non-null    int64  
 5   Education  400 non-null    int64  
 6   Own        400 non-null    object 
 7   Student    400 non-null    object 
 8   Married    400 non-null    object 
 9   Region     400 non-null    object 
 10  Balance    400 non-null    int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 34.5+ KB


In [21]:
df.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Own,Student,Married,Region,Balance
0,14.891,3606,283,2,34,11,No,No,Yes,South,333
1,106.025,6645,483,3,82,15,Yes,Yes,Yes,West,903
2,104.593,7075,514,4,71,11,No,No,No,West,580
3,148.924,9504,681,3,36,11,Yes,No,No,West,964
4,55.882,4897,357,2,68,16,No,No,Yes,South,331


[Table of Contents](#contents)

<a id="ohe"></a>
### Recode all the categorical variables using scikit-learn's `OneHotEncoder`

In [22]:
df_cat = df.select_dtypes(include=['object'])
df_num = df.select_dtypes(exclude=['object'])

In [23]:
from sklearn.preprocessing import OneHotEncoder
OHE = OneHotEncoder(sparse_output=False)

In [24]:
df_cat_ohe = pd.DataFrame(OHE.fit_transform(df_cat))
df_cat_ohe.columns = OHE.get_feature_names_out()

In [25]:
df_cat_ohe.info()
display(df_cat_ohe.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Own_No        400 non-null    float64
 1   Own_Yes       400 non-null    float64
 2   Student_No    400 non-null    float64
 3   Student_Yes   400 non-null    float64
 4   Married_No    400 non-null    float64
 5   Married_Yes   400 non-null    float64
 6   Region_East   400 non-null    float64
 7   Region_South  400 non-null    float64
 8   Region_West   400 non-null    float64
dtypes: float64(9)
memory usage: 28.3 KB


Unnamed: 0,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [26]:
df_ohe = pd.concat([df_num, df_cat_ohe], axis=1)
df_ohe.info()
df_ohe.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Income        400 non-null    float64
 1   Limit         400 non-null    int64  
 2   Rating        400 non-null    int64  
 3   Cards         400 non-null    int64  
 4   Age           400 non-null    int64  
 5   Education     400 non-null    int64  
 6   Balance       400 non-null    int64  
 7   Own_No        400 non-null    float64
 8   Own_Yes       400 non-null    float64
 9   Student_No    400 non-null    float64
 10  Student_Yes   400 non-null    float64
 11  Married_No    400 non-null    float64
 12  Married_Yes   400 non-null    float64
 13  Region_East   400 non-null    float64
 14  Region_South  400 non-null    float64
 15  Region_West   400 non-null    float64
dtypes: float64(10), int64(6)
memory usage: 50.1 KB


Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,3606,283,2,34,11,333,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645,483,3,82,15,903,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075,514,4,71,11,580,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504,681,3,36,11,964,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897,357,2,68,16,331,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


<a id="dummy"></a>
### Recode all the categorical variables using pandas `get_dummies`

In [27]:
df_dummy = pd.get_dummies(df)
df_dummy.info()
df_dummy.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Income        400 non-null    float64
 1   Limit         400 non-null    int64  
 2   Rating        400 non-null    int64  
 3   Cards         400 non-null    int64  
 4   Age           400 non-null    int64  
 5   Education     400 non-null    int64  
 6   Balance       400 non-null    int64  
 7   Own_No        400 non-null    uint8  
 8   Own_Yes       400 non-null    uint8  
 9   Student_No    400 non-null    uint8  
 10  Student_Yes   400 non-null    uint8  
 11  Married_No    400 non-null    uint8  
 12  Married_Yes   400 non-null    uint8  
 13  Region_East   400 non-null    uint8  
 14  Region_South  400 non-null    uint8  
 15  Region_West   400 non-null    uint8  
dtypes: float64(1), int64(6), uint8(9)
memory usage: 25.5 KB


Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,3606,283,2,34,11,333,1,0,1,0,0,1,0,1,0
1,106.025,6645,483,3,82,15,903,0,1,0,1,0,1,0,0,1
2,104.593,7075,514,4,71,11,580,1,0,1,0,1,0,0,0,1
3,148.924,9504,681,3,36,11,964,0,1,1,0,1,0,0,0,1
4,55.882,4897,357,2,68,16,331,1,0,1,0,0,1,0,1,0


Check equivalence of the two dataframes

In [28]:
df_ohe.astype(float).equals(df_dummy.astype(float))

True

[Table of Contents](#contents)

<a id="mini_victoria"></a>
### Import the `mini_victoria.txt` dataset
- Which categorical variables should be one-hot encoded?
- Which categorical variables should be label-encoded?


In [29]:
import pandas as pd
import os

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


In [30]:
DATA_PATH = r'D:\Dropbox\EM Lyon\7MPMLS_Introduction_Machine_Learning\data'
os.chdir(DATA_PATH)

In [31]:
df = pd.read_csv('mini_victoria.txt', delimiter='*', encoding='ansi')

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45339 non-null  object 
 1   mrp               45339 non-null  object 
 2   price             45339 non-null  object 
 3   pdp_url           45339 non-null  object 
 4   brand_name        45339 non-null  object 
 5   product_category  45339 non-null  object 
 6   retailer          45339 non-null  object 
 7   description       45339 non-null  object 
 8   rating            13662 non-null  float64
 9   review_count      13662 non-null  float64
 10  style_attributes  0 non-null      float64
 11  total_sizes       45339 non-null  object 
 12  available_size    45339 non-null  object 
 13  color             45339 non-null  object 
dtypes: float64(3), object(11)
memory usage: 4.8+ MB


[Table of Contents](#contents)

In [33]:
import numpy as np

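def remove(symbol, var):
    """Strip `symbol` from each price string; return NaN when it is absent."""
    value = []
    for row in var:
        if symbol in row:
            # Keep the text after the symbol, e.g. '$36.50' -> 36.5
            value.append(float(row.split(symbol)[1]))
        else:
            # Price in another currency: treat it as missing
            value.append(np.nan)
    return value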

Now, replace the two non-numerical price columns by numerical price columns (quantitative data)

In [34]:
df['mrp'] = remove('$',df['mrp'])

In [35]:
df['price'] = remove('$',df['price'])

In [36]:
df.info()
df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45339 non-null  object 
 1   mrp               45300 non-null  float64
 2   price             45300 non-null  float64
 3   pdp_url           45339 non-null  object 
 4   brand_name        45339 non-null  object 
 5   product_category  45339 non-null  object 
 6   retailer          45339 non-null  object 
 7   description       45339 non-null  object 
 8   rating            13662 non-null  float64
 9   review_count      13662 non-null  float64
 10  style_attributes  0 non-null      float64
 11  total_sizes       45339 non-null  object 
 12  available_size    45339 non-null  object 
 13  color             45339 non-null  object 
dtypes: float64(5), object(9)
memory usage: 4.8+ MB


Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,retailer,description,rating,review_count,style_attributes,total_sizes,available_size,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Victoriassecret US,Game-changer: your favorite maximum-support sp...,3.6,25.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D3,White
1,Body by Victoria Demi Bra,54.5,19.99,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Demi Bra,Victoriassecret US,Sexy comfort and a sleek shape start with low-...,,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",38C,cadette green
2,Easy Plunge Bra,29.5,29.5,https://www.victoriassecret.com/bras/bralette/...,Victoria's Secret,Easy Plunge Bra,Victoriassecret US,This supersoft bra is easy to love with fully ...,4.4,260.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",34DD,Black


[Table of Contents](#contents)

In [37]:
df_cat = df.select_dtypes(include=['object'])
df_cat.info()
count_modality = df_cat.nunique()
count_modality

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   product_name      45339 non-null  object
 1   pdp_url           45339 non-null  object
 2   brand_name        45339 non-null  object
 3   product_category  45339 non-null  object
 4   retailer          45339 non-null  object
 5   description       45339 non-null  object
 6   total_sizes       45339 non-null  object
 7   available_size    45339 non-null  object
 8   color             45339 non-null  object
dtypes: object(9)
memory usage: 3.1+ MB


product_name         599
pdp_url             1410
brand_name             2
product_category     445
retailer               1
description          536
total_sizes           30
available_size        44
color               1300
dtype: int64

In [38]:
df_num = df.select_dtypes(exclude=['object'])
df_num.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   mrp               45300 non-null  float64
 1   price             45300 non-null  float64
 2   rating            13662 non-null  float64
 3   review_count      13662 non-null  float64
 4   style_attributes  0 non-null      float64
dtypes: float64(5)
memory usage: 1.7 MB


[Table of Contents](#contents)

In [39]:
label_list = count_modality[count_modality > 50].index
df_label = df[label_list].copy()  # copy, so the encoding below does not write into a slice of df
df_label.info()
display(df_label.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   product_name      45339 non-null  object
 1   pdp_url           45339 non-null  object
 2   product_category  45339 non-null  object
 3   description       45339 non-null  object
 4   color             45339 non-null  object
dtypes: object(5)
memory usage: 1.7+ MB


Unnamed: 0,product_name,pdp_url,product_category,description,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,https://www.victoriassecret.com/bras/shop-all-...,Incredible by Victoria Sport Front-close Sport...,Game-changer: your favorite maximum-support sp...,White
1,Body by Victoria Demi Bra,https://www.victoriassecret.com/bras/shop-all-...,Demi Bra,Sexy comfort and a sleek shape start with low-...,cadette green
2,Easy Plunge Bra,https://www.victoriassecret.com/bras/bralette/...,Easy Plunge Bra,This supersoft bra is easy to love with fully ...,Black
3,The T-Shirt Perfect Shape Bra,https://www.victoriassecret.com/bras/shop-all-...,Perfect Shape Bra,The everyday go-to bra pairs sexy lift and the...,Coconut White Matte Print
4,PINK NEW! Wear Everywhere Super Push,https://www.victoriassecret.com/pink/panties/w...,Wear Everywhere Super Push,"A super flirty new style, with more push than ...",bayberry


Here, any categorical variable with more than 50 modalities is label-encoded rather than one-hot encoded. <br>
Why 50 and not more or fewer? The threshold is a judgment call: one-hot encoding adds one column per modality, so the more features the dataset already has, the lower the threshold should be.
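
The split by number of modalities can be sketched on a small made-up frame. The threshold of 3 is a hypothetical value chosen so that the toy data falls on both sides of it. `pd.factorize(..., sort=True)` is used here as a pandas-only way to obtain integer codes; keep in mind that such codes impose an arbitrary order on categories that have none.

```python
import pandas as pd

# Hypothetical data: one low-cardinality and one high-cardinality column.
toy = pd.DataFrame({
    "brand_name": ["A", "B", "A", "B", "A", "B"],
    "color": ["red", "blue", "green", "black", "white", "pink"],
})
THRESHOLD = 3  # hypothetical cut-off; the notebook uses 50

n_modalities = toy.nunique()
label_cols = n_modalities[n_modalities > THRESHOLD].index.tolist()
ohe_cols = n_modalities[n_modalities <= THRESHOLD].index.tolist()

# Integer codes follow the sorted order of the distinct values.
codes, uniques = pd.factorize(toy["color"], sort=True)

print(label_cols, ohe_cols)
print(codes.tolist())
```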

[Table of Contents](#contents)

In [40]:
from sklearn.preprocessing import LabelEncoder
LBE = LabelEncoder()

In [41]:
for col in label_list :
    df_label[col] = LBE.fit_transform(df_label[col])

In [42]:
df_label.info()
df_label.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   product_name      45339 non-null  int32
 1   pdp_url           45339 non-null  int32
 2   product_category  45339 non-null  int32
 3   description       45339 non-null  int32
 4   color             45339 non-null  int32
dtypes: int32(5)
memory usage: 885.7 KB


Unnamed: 0,product_name,pdp_url,product_category,description,color
0,567,338,122,170,553
1,4,264,67,302,705
2,165,89,73,475,23
3,439,491,283,406,159
4,324,1247,427,68,610
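
Because a single `LabelEncoder` instance is refitted on every column, only the mapping of the last column remains available afterwards. If the original labels need to be recovered later, one option is to keep one encoder per column. This is a sketch on made-up rows, and the `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical extract with two text columns.
toy = pd.DataFrame({
    "color": ["White", "Black", "White"],
    "product_category": ["Demi Bra", "Easy Plunge Bra", "Demi Bra"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# The stored encoder can map the integer codes back to the labels.
decoded_colors = encoders["color"].inverse_transform(encoded["color"])

print(encoded)
print(list(decoded_colors))
```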


[Table of Contents](#contents)

In [43]:
ohe_list = count_modality[count_modality <= 50].index
df_ohe = df[ohe_list].copy()  # copy, so the cleaning below does not write into a slice of df
df_ohe.info()
display(df_ohe.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   brand_name      45339 non-null  object
 1   retailer        45339 non-null  object
 2   total_sizes     45339 non-null  object
 3   available_size  45339 non-null  object
dtypes: object(4)
memory usage: 1.4+ MB


Unnamed: 0,brand_name,retailer,total_sizes,available_size
0,Victoria's Secret,Victoriassecret US,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D3
1,Victoria's Secret,Victoriassecret US,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",38C
2,Victoria's Secret,Victoriassecret US,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",34DD
3,Victoria's Secret,Victoriassecret US,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D
4,Victoria's Secret Pink,Victoriassecret US,"[""30AA"", ""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""...",32D


In [44]:
import re

def clean(row):
    """Return the size tokens (runs of capital letters and digits) found in `row`."""
    # e.g. '["32A", "32B"]' -> ['32A', '32B']
    return re.findall(r'[A-Z0-9]+', row)

In [45]:
df_ohe['total_sizes'] = df_ohe['total_sizes'].apply(clean)

In [46]:
df_ohe.info()
df_ohe.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   brand_name      45339 non-null  object
 1   retailer        45339 non-null  object
 2   total_sizes     45339 non-null  object
 3   available_size  45339 non-null  object
dtypes: object(4)
memory usage: 1.4+ MB


Unnamed: 0,brand_name,retailer,total_sizes,available_size
0,Victoria's Secret,Victoriassecret US,"[32A, 32B, 32C, 32D, 32DD, 32DDD, 34A, 34B, 34...",32D3
1,Victoria's Secret,Victoriassecret US,"[30A, 30B, 30C, 30D, 30DD, 30DDD, 32A, 32B, 32...",38C
2,Victoria's Secret,Victoriassecret US,"[32A, 32B, 32C, 32D, 32DD, 34A, 34B, 34C, 34D,...",34DD
3,Victoria's Secret,Victoriassecret US,"[32A, 32B, 32C, 32D, 32DD, 32DDD, 34A, 34B, 34...",32D
4,Victoria's Secret Pink,Victoriassecret US,"[30AA, 30A, 30B, 30C, 30D, 30DD, 32AA, 32A, 32...",32D


[Table of Contents](#contents)

In [47]:
df_exp = df_ohe.explode('total_sizes')

In [48]:
df_exp.info()
df_exp.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 885689 entries, 0 to 45338
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   brand_name      885689 non-null  object
 1   retailer        885689 non-null  object
 2   total_sizes     885671 non-null  object
 3   available_size  885689 non-null  object
dtypes: object(4)
memory usage: 33.8+ MB


Unnamed: 0,brand_name,retailer,total_sizes,available_size
0,Victoria's Secret,Victoriassecret US,32A,32D3
0,Victoria's Secret,Victoriassecret US,32B,32D3
0,Victoria's Secret,Victoriassecret US,32C,32D3
0,Victoria's Secret,Victoriassecret US,32D,32D3
0,Victoria's Secret,Victoriassecret US,32DD,32D3


In [49]:
df_exp.nunique()

brand_name         2
retailer           1
total_sizes       52
available_size    44
dtype: int64
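
Exploding `total_sizes` produces one row per (product, size) pair, which is why the frame grows from 45,339 to 885,689 rows. If one row per product is preferred, the exploded dummies can be collapsed back onto the original index. The following is a sketch of that alternative on two made-up products, not the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical products with their list of sizes.
toy = pd.DataFrame({
    "brand_name": ["Victoria's Secret", "Victoria's Secret Pink"],
    "total_sizes": [["32A", "32B", "34A"], ["32A", "XS"]],
})

# One row per size, with the product's index label repeated.
exploded = toy["total_sizes"].explode()

# One indicator column per size, then one row per product again.
multi_hot = (
    pd.get_dummies(exploded, prefix="total_sizes")
    .groupby(level=0)
    .max()
    .astype(int)
)

print(len(exploded))
print(multi_hot)
```

Each product keeps a single row, and a 1 marks every size it is offered in.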

[Table of Contents](#contents)

In [50]:
df_right = df_num.join(df_label)
df_right.info()
df_right.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   mrp               45300 non-null  float64
 1   price             45300 non-null  float64
 2   rating            13662 non-null  float64
 3   review_count      13662 non-null  float64
 4   style_attributes  0 non-null      float64
 5   product_name      45339 non-null  int32  
 6   pdp_url           45339 non-null  int32  
 7   product_category  45339 non-null  int32  
 8   description       45339 non-null  int32  
 9   color             45339 non-null  int32  
dtypes: float64(5), int32(5)
memory usage: 2.6 MB


Unnamed: 0,mrp,price,rating,review_count,style_attributes,product_name,pdp_url,product_category,description,color
0,36.5,36.5,3.6,25.0,,567,338,122,170,553
1,54.5,19.99,,,,4,264,67,302,705
2,29.5,29.5,4.4,260.0,,165,89,73,475,23
3,39.5,39.5,,,,439,491,283,406,159
4,32.95,32.95,,,,324,1247,427,68,610


[Table of Contents](#contents)

In [51]:
df_left = pd.get_dummies(df_exp)
df_left.info()
df_left.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 885689 entries, 0 to 45338
Data columns (total 99 columns):
 #   Column                             Non-Null Count   Dtype
---  ------                             --------------   -----
 0   brand_name_Victoria's Secret       885689 non-null  uint8
 1   brand_name_Victoria's Secret Pink  885689 non-null  uint8
 2   retailer_Victoriassecret US        885689 non-null  uint8
 3   total_sizes_30A                    885689 non-null  uint8
 4   total_sizes_30AA                   885689 non-null  uint8
 5   total_sizes_30B                    885689 non-null  uint8
 6   total_sizes_30C                    885689 non-null  uint8
 7   total_sizes_30D                    885689 non-null  uint8
 8   total_sizes_30DD                   885689 non-null  uint8
 9   total_sizes_30DDD                  885689 non-null  uint8
 10  total_sizes_32A                    885689 non-null  uint8
 11  total_sizes_32AA                   885689 non-null  uint8
 12  tot

Unnamed: 0,brand_name_Victoria's Secret,brand_name_Victoria's Secret Pink,retailer_Victoriassecret US,total_sizes_30A,total_sizes_30AA,total_sizes_30B,total_sizes_30C,total_sizes_30D,total_sizes_30DD,total_sizes_30DDD,total_sizes_32A,total_sizes_32AA,total_sizes_32B,total_sizes_32C,total_sizes_32D,total_sizes_32DD,total_sizes_32DDD,total_sizes_34A,total_sizes_34AA,total_sizes_34B,total_sizes_34C,total_sizes_34D,total_sizes_34DD,total_sizes_34DDD,total_sizes_36A,total_sizes_36AA,total_sizes_36B,total_sizes_36C,total_sizes_36D,total_sizes_36DD,total_sizes_36DDD,total_sizes_38A,total_sizes_38AA,total_sizes_38B,total_sizes_38C,total_sizes_38D,total_sizes_38DD,total_sizes_38DDD,total_sizes_40A,total_sizes_40B,total_sizes_40C,total_sizes_40D,total_sizes_40DD,total_sizes_40DDD,total_sizes_A,total_sizes_AA,total_sizes_B,total_sizes_C,total_sizes_D,total_sizes_DD,total_sizes_L,total_sizes_M,total_sizes_S,total_sizes_XL,total_sizes_XS,available_size_30A,available_size_30B,available_size_30C,available_size_32A,available_size_32AA,available_size_32B,available_size_32C,available_size_32D,available_size_32D3,available_size_32DD,available_size_34A,available_size_34AA,available_size_34B,available_size_34C,available_size_34D,available_size_34D3,available_size_34DD,available_size_36A,available_size_36AA,available_size_36B,available_size_36C,available_size_36D,available_size_36D3,available_size_36DD,available_size_38B,available_size_38C,available_size_38D,available_size_38D3,available_size_38DD,available_size_40C,available_size_40D,available_size_40D3,available_size_40DD,available_size_AA/A,available_size_B/C,available_size_D/DD,available_size_L,available_size_M,available_size_M/L,available_size_OS,available_size_S,available_size_XL,available_size_XS,available_size_XS/S
0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [52]:
df_recoded = df_left.join(df_right)

In [53]:
df_recoded.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 885689 entries, 0 to 45338
Data columns (total 109 columns):
 #    Column                             Non-Null Count   Dtype  
---   ------                             --------------   -----  
 0    brand_name_Victoria's Secret       885689 non-null  uint8  
 1    brand_name_Victoria's Secret Pink  885689 non-null  uint8  
 2    retailer_Victoriassecret US        885689 non-null  uint8  
 3    total_sizes_30A                    885689 non-null  uint8  
 4    total_sizes_30AA                   885689 non-null  uint8  
 5    total_sizes_30B                    885689 non-null  uint8  
 6    total_sizes_30C                    885689 non-null  uint8  
 7    total_sizes_30D                    885689 non-null  uint8  
 8    total_sizes_30DD                   885689 non-null  uint8  
 9    total_sizes_30DDD                  885689 non-null  uint8  
 10   total_sizes_32A                    885689 non-null  uint8  
 11   total_sizes_32AA         
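
The join keeps 885,689 rows because `df_left` carries the repeated index created by `explode`, and every repeated index label receives a copy of the matching row of `df_right`. The sketch below reproduces this on made-up values; it also shows that statistics computed on the joined frame are weighted by the number of sizes per product.

```python
import pandas as pd

# Hypothetical exploded dummies: product 0 has two sizes, product 1 has one.
left = pd.DataFrame(
    {"total_sizes_32A": [1, 0, 1], "total_sizes_32B": [0, 1, 0]},
    index=[0, 0, 1],
)
# Hypothetical product-level data: one row per product.
right = pd.DataFrame({"price": [36.5, 19.99]}, index=[0, 1])

joined = left.join(right)  # aligns on the index, repeating product-level values

print(joined)
print(right["price"].mean(), joined["price"].mean())
```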

[Table of Contents](#contents)

<a id="missing_data"></a>
## Handling Missing Data 

In [54]:
import pandas as pd
import os
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_info_columns', 300)

import warnings
warnings.filterwarnings('ignore')

In [55]:
DATA_PATH = r'D:\Dropbox\EM Lyon\7MPMLS_Introduction_Machine_Learning\data'
os.chdir(DATA_PATH)

[Table of Contents](#contents)

<a id="import_datasets"></a>
### Importing datasets

In [17]:
df_miss = pd.read_csv('Credit.dat', sep='\t')
df_miss.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Income     314 non-null    float64
 1   Limit      321 non-null    float64
 2   Rating     310 non-null    float64
 3   Cards      314 non-null    float64
 4   Age        316 non-null    float64
 5   Education  314 non-null    float64
 6   Own        317 non-null    object 
 7   Student    330 non-null    object 
 8   Married    325 non-null    object 
 9   Region     314 non-null    object 
 10  Balance    355 non-null    float64
dtypes: float64(7), object(4)
memory usage: 34.5+ KB


In [18]:
df = pd.read_csv('Credit.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Income     400 non-null    float64
 1   Limit      400 non-null    int64  
 2   Rating     400 non-null    int64  
 3   Cards      400 non-null    int64  
 4   Age        400 non-null    int64  
 5   Education  400 non-null    int64  
 6   Own        400 non-null    object 
 7   Student    400 non-null    object 
 8   Married    400 non-null    object 
 9   Region     400 non-null    object 
 10  Balance    400 non-null    int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 34.5+ KB


In [19]:
df.nunique()

Income       399
Limit        387
Rating       283
Cards          9
Age           68
Education     16
Own            2
Student        2
Married        2
Region         3
Balance      284
dtype: int64

[Table of Contents](#contents)

<a id="prepare_data"></a>
### Preparing datasets


In [7]:
var_cat = [var for var in df.columns if df[var].dtypes == 'object']
var_cat

['Own', 'Student', 'Married', 'Region']

In [58]:
df_dummy = pd.get_dummies(df, dummy_na=True)
df_dummy.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Own_nan,Student_No,Student_Yes,Student_nan,Married_No,Married_Yes,Married_nan,Region_East,Region_South,Region_West,Region_nan
0,14.891,3606,283,2,34,11,333,1,0,0,1,0,0,0,1,0,0,1,0,0
1,106.025,6645,483,3,82,15,903,0,1,0,0,1,0,0,1,0,0,0,1,0
2,104.593,7075,514,4,71,11,580,1,0,0,1,0,0,1,0,0,0,0,1,0
3,148.924,9504,681,3,36,11,964,0,1,0,1,0,0,1,0,0,0,0,1,0
4,55.882,4897,357,2,68,16,331,1,0,0,1,0,0,0,1,0,0,1,0,0


In [9]:
# Replace each missing value in a categorical variable with NaN in the corresponding dummy coded variables

for col in var_cat :
    df_dummy.loc[df[col].isnull(), df_dummy.columns.str.startswith(col+"_")] = np.nan
    df_dummy.drop(columns = [col+"_nan"], inplace=True)
    
df_dummy.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,3606,283,2,34,11,333,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645,483,3,82,15,903,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075,514,4,71,11,580,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504,681,3,36,11,964,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897,357,2,68,16,331,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


[Table of Contents](#contents)

In [60]:
df_miss_dummy = pd.get_dummies(df_miss, dummy_na=True)
df_miss_dummy.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Own_nan,Student_No,Student_Yes,Student_nan,Married_No,Married_Yes,Married_nan,Region_East,Region_South,Region_West,Region_nan
0,14.891,,283.0,2.0,34.0,11.0,333.0,0,0,1,1,0,0,0,1,0,0,1,0,0
1,106.025,6645.0,483.0,3.0,,15.0,903.0,0,1,0,0,1,0,0,1,0,0,0,1,0
2,104.593,7075.0,,4.0,71.0,11.0,580.0,0,0,1,1,0,0,1,0,0,0,0,1,0
3,148.924,9504.0,681.0,3.0,36.0,11.0,964.0,0,0,1,1,0,0,1,0,0,0,0,1,0
4,55.882,4897.0,357.0,2.0,,16.0,331.0,1,0,0,1,0,0,0,1,0,0,1,0,0


In [11]:
# Replace each missing value in a categorical variable with NaN in the corresponding dummy coded variables

for col in var_cat :
    df_miss_dummy.loc[df[col].isnull(), df_miss_dummy.columns.str.startswith(col+"_")] = np.nan
    df_miss_dummy.drop(columns = [col+"_nan"], inplace=True)
    
df_miss_dummy.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,,283.0,2.0,34.0,11.0,333.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645.0,483.0,3.0,,15.0,903.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075.0,,4.0,71.0,11.0,580.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504.0,681.0,3.0,36.0,11.0,964.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897.0,357.0,2.0,,16.0,331.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
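
Note that the loop above builds its row mask from `df[col].isnull()`. Since `df` is the complete dataset (400 non-null values in every column), that mask appears to select no rows, which is consistent with row 0 still showing `0.0` in both `Own_No` and `Own_Yes` instead of `NaN`. A minimal sketch of the recoding described in the comment, taking the mask from the incomplete frame itself, could look as follows. The frame `toy`, its values, and the name `toy_dummy` are illustrative assumptions, not the Credit data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing category (not the Credit data)
toy = pd.DataFrame({"Income": [14.9, 106.0, 104.6],
                    "Own": ["No", np.nan, "Yes"]})

# Non-numeric columns are treated as categorical
var_cat = toy.select_dtypes(exclude="number").columns.tolist()

# dtype=float so that NaN can be stored in the dummy columns
toy_dummy = pd.get_dummies(toy, columns=var_cat, dtype=float)

for col in var_cat:
    dummy_cols = [c for c in toy_dummy.columns if c.startswith(col + "_")]
    # The mask comes from the frame that actually contains the missing values
    toy_dummy.loc[toy[col].isnull(), dummy_cols] = np.nan

print(toy_dummy)
```

With this variant the dummy columns of a row with a missing category are `NaN`, so a later imputer would treat them as values to fill.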


[Table of Contents](#contents)

In [12]:
from sklearn.metrics import mean_squared_error as mse

def compare_df(df, df_impute):
    """Return the MSE of each column of df_impute against the complete df."""
    # One value per column, in the column order of df
    return [mse(df[col], df_impute[col]) for col in df.columns]

In [None]:
# def compare_df(df, df_impute):
#     from sklearn.metrics import mean_squared_error as mse
#     from sklearn.metrics import accuracy_score
#     reg_error = []
#     cl_error = []
#     for col in df.columns:
#         if df[col].dtypes != 'object' :
#             reg_error.append(mse(df[col],df_impute[col]))
#         else :
#             cl_error.append(1-accuracy_score(df[col], df_impute[col]))
#     return reg_error, cl_error

[Table of Contents](#contents)

<a id="median"></a>
### Imputation with the median

In [13]:
df_median = df_miss_dummy.apply(lambda col: col.fillna(col.median()), axis=0)
df_median.info()
display(df_median.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Income        400 non-null    float64
 1   Limit         400 non-null    float64
 2   Rating        400 non-null    float64
 3   Cards         400 non-null    float64
 4   Age           400 non-null    float64
 5   Education     400 non-null    float64
 6   Balance       400 non-null    float64
 7   Own_No        400 non-null    float64
 8   Own_Yes       400 non-null    float64
 9   Student_No    400 non-null    float64
 10  Student_Yes   400 non-null    float64
 11  Married_No    400 non-null    float64
 12  Married_Yes   400 non-null    float64
 13  Region_East   400 non-null    float64
 14  Region_South  400 non-null    float64
 15  Region_West   400 non-null    float64
dtypes: float64(16)
memory usage: 50.1 KB


Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,4612.0,283.0,2.0,34.0,11.0,333.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645.0,483.0,3.0,56.0,15.0,903.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075.0,339.5,4.0,71.0,11.0,580.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504.0,681.0,3.0,36.0,11.0,964.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897.0,357.0,2.0,56.0,16.0,331.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [16]:
mse_median = compare_df(df_dummy, df_median)
mse_median

[261.56832420874997,
 740372.535,
 4933.76625,
 0.5225,
 64.5425,
 2.19,
 17411.8275,
 0.1,
 0.1075,
 0.165,
 0.01,
 0.0725,
 0.115,
 0.045,
 0.11,
 0.06]

In [20]:
np.mean(mse_median)

47690.48356713805

[Table of Contents](#contents)

<a id="mean"></a>
### Imputation with the mean

In [17]:
df_mean = df_miss_dummy.apply(lambda col: col.fillna(col.mean()), axis=0)
df_mean.info()
display(df_mean.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Income        400 non-null    float64
 1   Limit         400 non-null    float64
 2   Rating        400 non-null    float64
 3   Cards         400 non-null    float64
 4   Age           400 non-null    float64
 5   Education     400 non-null    float64
 6   Balance       400 non-null    float64
 7   Own_No        400 non-null    float64
 8   Own_Yes       400 non-null    float64
 9   Student_No    400 non-null    float64
 10  Student_Yes   400 non-null    float64
 11  Married_No    400 non-null    float64
 12  Married_Yes   400 non-null    float64
 13  Region_East   400 non-null    float64
 14  Region_South  400 non-null    float64
 15  Region_West   400 non-null    float64
dtypes: float64(16)
memory usage: 50.1 KB


Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,4778.953271,283.0,2.0,34.0,11.0,333.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645.0,483.0,3.0,56.044304,15.0,903.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075.0,350.490323,4.0,71.0,11.0,580.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504.0,681.0,3.0,36.0,11.0,964.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897.0,357.0,2.0,56.044304,16.0,331.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [18]:
mse_mean = compare_df(df_dummy, df_mean)
mse_mean

[235.4431424759965,
 749343.4803611669,
 4808.782351716962,
 0.5362177978822671,
 64.57547548469796,
 2.2135303663434622,
 17094.94492144416,
 0.1,
 0.1075,
 0.165,
 0.01,
 0.0725,
 0.115,
 0.045,
 0.11,
 0.06]

In [21]:
np.mean(mse_mean)

48221.92256252831

[Table of Contents](#contents)

<a id="linear"></a>
### Imputation with linear interpolation

In [33]:
df_linear = df_miss_dummy.interpolate(axis=0)
df_linear.info()
display(df_linear.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Income        400 non-null    float64
 1   Limit         399 non-null    float64
 2   Rating        400 non-null    float64
 3   Cards         400 non-null    float64
 4   Age           400 non-null    float64
 5   Education     400 non-null    float64
 6   Balance       400 non-null    float64
 7   Own_No        400 non-null    float64
 8   Own_Yes       400 non-null    float64
 9   Student_No    400 non-null    float64
 10  Student_Yes   400 non-null    float64
 11  Married_No    400 non-null    float64
 12  Married_Yes   400 non-null    float64
 13  Region_East   400 non-null    float64
 14  Region_South  400 non-null    float64
 15  Region_West   400 non-null    float64
dtypes: float64(16)
memory usage: 50.1 KB


Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,,283.0,2.0,34.0,11.0,333.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645.0,483.0,3.0,52.5,15.0,903.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075.0,582.0,4.0,71.0,11.0,580.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504.0,681.0,3.0,36.0,11.0,964.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897.0,357.0,2.0,56.5,16.0,331.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [34]:
mse_linear = compare_df(df_dummy[1:], df_linear[1:])
mse_linear

[379.1613821655876,
 1230354.040685046,
 7422.979664438875,
 0.6432400445558341,
 75.96494708994709,
 3.3259885825675295,
 30470.02074631022,
 0.09774436090225563,
 0.10776942355889724,
 0.16541353383458646,
 0.010025062656641603,
 0.07268170426065163,
 0.11528822055137844,
 0.045112781954887216,
 0.11027568922305764,
 0.06015037593984962]

In [35]:
np.mean(mse_linear)

79294.18256967691
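
The `Limit` column keeps one missing value after interpolation (399 non-null) because, by default, `interpolate` only fills forward: a gap in the very first row has no earlier value to start from, which is why the comparison above skips row 0. The sketch below, on hypothetical values, shows the default behavior next to `limit_direction='both'`, which would also fill such a leading gap with the nearest valid value.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a leading gap and an interior gap
toy = pd.DataFrame({"Limit": [np.nan, 6645.0, np.nan, 9504.0]})

# Default: forward direction only, so the leading NaN stays
forward_only = toy.interpolate(axis=0)

# Both directions: the leading NaN takes the first valid value
both = toy.interpolate(axis=0, limit_direction="both")

print(forward_only["Limit"].tolist())
print(both["Limit"].tolist())
```

Linear interpolation treats neighboring rows as related, an assumption that suits ordered data such as time series better than a cross-sectional table like this one.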

[Table of Contents](#contents)

<a id="simple"></a>
### Simple imputation

Using the column mean (the `SimpleImputer` default, `strategy='mean'`)

In [40]:
from sklearn.impute import SimpleImputer
df_simple = pd.DataFrame(SimpleImputer().fit_transform(df_miss_dummy), columns = df_miss_dummy.columns)

In [41]:
df_simple.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,4778.953271,283.0,2.0,34.0,11.0,333.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645.0,483.0,3.0,56.044304,15.0,903.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075.0,350.490323,4.0,71.0,11.0,580.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504.0,681.0,3.0,36.0,11.0,964.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897.0,357.0,2.0,56.044304,16.0,331.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [42]:
mse_simple = compare_df(df_dummy, df_simple)
mse_simple

[235.4431424759965,
 749343.4803611669,
 4808.782351716962,
 0.5362177978822671,
 64.57547548469796,
 2.2135303663434622,
 17094.94492144416,
 0.1,
 0.1075,
 0.165,
 0.01,
 0.0725,
 0.115,
 0.045,
 0.11,
 0.06]

In [43]:
np.mean(mse_simple)

48221.92256252831

[Table of Contents](#contents)

Using the column mode (`strategy='most_frequent'`)

In [44]:
from sklearn.impute import SimpleImputer
df_simple = pd.DataFrame(SimpleImputer(strategy='most_frequent').fit_transform(df_miss_dummy), columns = df_miss_dummy.columns)

In [45]:
df_simple.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,855.0,283.0,2.0,34.0,11.0,333.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645.0,483.0,3.0,44.0,15.0,903.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075.0,344.0,4.0,71.0,11.0,580.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504.0,681.0,3.0,36.0,11.0,964.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897.0,357.0,2.0,44.0,16.0,331.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [46]:
mse_simple = compare_df(df_dummy, df_simple)
mse_simple

[322.3631665175,
 3450099.3925,
 4876.02,
 0.8325,
 85.9625,
 3.25,
 46085.595,
 0.1,
 0.1075,
 0.165,
 0.01,
 0.0725,
 0.115,
 0.045,
 0.11,
 0.06]

In [47]:
np.mean(mse_simple)

218842.13754165734
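
The much larger error of the mode is plausibly due to the continuous columns: when values rarely repeat, the most frequent value says little about the center of the distribution (and, when several values tie, `SimpleImputer` returns the smallest of them). A small sketch on a hypothetical column compares the three strategies; the array `X` and the dictionary `fills` are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single column whose last entry is missing
X = np.array([[1.0], [1.0], [2.0], [8.0], [np.nan]])

fills = {}
for strategy in ("mean", "median", "most_frequent"):
    filled = SimpleImputer(strategy=strategy).fit_transform(X)
    # Value used to replace the missing entry
    fills[strategy] = float(filled[-1, 0])

print(fills)
```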

[Table of Contents](#contents)

<a id="multiple"></a>
### Multiple imputation

In [36]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
df_iterate = pd.DataFrame(IterativeImputer().fit_transform(df_miss_dummy), columns = df_miss_dummy.columns)

In [37]:
df_iterate.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,3679.822199,283.0,2.0,34.0,11.0,333.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645.0,483.0,3.0,56.980844,15.0,903.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075.0,515.208219,4.0,71.0,11.0,580.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504.0,681.0,3.0,36.0,11.0,964.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897.0,357.0,2.0,57.949089,16.0,331.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [38]:
mse_iterate = compare_df(df_dummy, df_iterate)
mse_iterate

[126.0077432870651,
 397440.9521898587,
 1774.5124609612851,
 0.5992764963562621,
 65.31326125288993,
 2.2393742121359685,
 17149.772538558464,
 0.1,
 0.1075,
 0.165,
 0.01,
 0.0725,
 0.115,
 0.045,
 0.11,
 0.06]

In [39]:
np.mean(mse_iterate)

26035.01136528918
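
With its default settings, `IterativeImputer` returns a single completed dataset. One way to approximate multiple imputation is to draw several completed datasets with `sample_posterior=True` and different seeds, then pool them. The sketch below does this on a small synthetic frame; `toy`, `n_draws`, and the pooling by simple averaging are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: y depends linearly on x, and 40 values of y are removed
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.1, size=200)})
toy.loc[rng.choice(200, size=40, replace=False), "y"] = np.nan

n_draws = 5  # hypothetical number of imputed datasets
draws = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(toy),
        columns=toy.columns,
    )
    for seed in range(n_draws)
]

# Pool the draws by averaging them cell by cell
pooled = sum(draws) / n_draws
print(pooled.isna().sum())
```

In a full multiple-imputation analysis, the downstream model would usually be fitted on each draw separately and the estimates combined afterwards; averaging the imputed cells is only the simplest pooling choice.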

[Table of Contents](#contents)

<a id="neighbors"></a>
### K-Nearest Neighbors

With the default of 5 neighbors (`n_neighbors=5`)

In [48]:
from sklearn.impute import KNNImputer
df_knn = pd.DataFrame(KNNImputer().fit_transform(df_miss_dummy), columns = df_miss_dummy.columns)

In [49]:
df_knn.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,3677.0,283.0,2.0,34.0,11.0,333.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645.0,483.0,3.0,65.2,15.0,903.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075.0,373.0,4.0,71.0,11.0,580.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504.0,681.0,3.0,36.0,11.0,964.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897.0,357.0,2.0,49.0,16.0,331.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [50]:
mse_knn = compare_df(df_dummy, df_knn)
mse_knn

[165.21741681530003,
 419521.8051,
 2657.1564000000003,
 0.6762999999999999,
 71.9763,
 2.4398,
 18098.069000000003,
 0.1,
 0.1075,
 0.165,
 0.01,
 0.0725,
 0.115,
 0.045,
 0.11,
 0.06]

In [51]:
np.mean(mse_knn)

27532.382832301
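
To make the mechanics concrete, the sketch below applies `KNNImputer` to a tiny hypothetical array and varies `n_neighbors`. Each missing entry is replaced by the average of that feature over the nearest rows where it is observed, with distances computed on the features both rows have in common. The array `X` and the dictionary `results` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second feature is missing in the third row
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, np.nan],
              [8.0, 80.0]])

results = {}
for k in (1, 2, 3):
    imputed = KNNImputer(n_neighbors=k).fit_transform(X)
    # Imputed value for the missing cell
    results[k] = float(imputed[2, 1])

print(results)
```

With one neighbor the value comes from the closest row alone; with three neighbors the distant fourth row pulls the estimate upward, which illustrates why the choice of `n_neighbors` matters.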

[Table of Contents](#contents)

## Conclusion

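**On average, the multiple (iterative) and the KNN imputation methods are clearly the best: the iterative imputer reaches a mean MSE of about 26,035, against roughly 47,700 for the median, 48,200 for the mean, 79,300 for linear interpolation, and 218,800 for the mode.**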
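
The plain average of per-column MSEs used above is dominated by columns measured in large units: the `Limit` MSE is in the hundreds of thousands, while the dummy columns stay below 1. One possible refinement is to divide each column's MSE by the variance of the true column before averaging. The sketch below illustrates the idea on hypothetical frames; `normalized_mse`, `truth`, and `imputed` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def normalized_mse(df_true, df_imputed):
    """Per-column MSE divided by the variance of the true column."""
    mse = ((df_true - df_imputed) ** 2).mean(axis=0)
    return mse / df_true.var(axis=0, ddof=0)

# Hypothetical complete data on two very different scales
truth = pd.DataFrame({"Limit": [1000.0, 3000.0, 5000.0, 7000.0],
                      "Cards": [1.0, 2.0, 3.0, 4.0]})

# Pretend the first row was missing and filled with each column's mean
imputed = truth.copy()
imputed.loc[0, "Limit"] = 4000.0
imputed.loc[0, "Cards"] = 2.5

scores = normalized_mse(truth, imputed)
print(scores)
print(scores.mean())
```

Both columns get the same normalized score here even though their raw MSEs differ by several orders of magnitude, so neither column dominates the average.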