<a id="contents"></a>
# Session 2 - The Machine Learning Workflow



### [Preparing a "rich" dataset](#rich)
- [Importing with pandas](#import)
- [Prices variable](#prices)
- [Pandas exercise](#pandas_exercise)

### [Recoding Categorical Data](#recoding)
- [Recoding categorical variables with OneHotEncoder](#ohe)
- [Recoding categorical variables with pandas's get_dummies](#dummy)
- [Importing `mini_victoria.txt` data](#mini_victoria)

### [Handling Missing Data](#missing_data)
- [Importing datasets](#import_datasets)
- [Preparing datasets](#prepare_data)
- [Imputation with the median](#median)
- [Imputation with the mean](#mean)
- [Imputation with linear interpolation](#linear)
- [Simple imputation](#simple)
- [Multiple imputation](#multiple)
- [K Nearest Neighbors](#neighbors)

<a id="import"></a>
### Importing with pandas

- Save the `mini_victoria.txt` file
- Check the data in a text editor such as Notepad++ or Visual Studio Code
- Import it using pandas
- Print a comprehensive summary

In [28]:
import pandas as pd
import os
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [29]:
# Candidate encodings for this file: ISO-8859-1 (Latin-1) or cp1252 (Windows-1252)

In [None]:
df = pd.read_csv('./data/mini_victoria.txt',sep='*', header=0 ,encoding='ISO-8859-1')
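
To see what the `sep` and `encoding` arguments do without needing the real file, the sketch below reads a small in-memory sample. The sample rows and column values are made up for illustration; only `pd.read_csv` and its `sep`, `header` and `encoding` parameters come from the cell above.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout: '*' as separator, Latin-1 bytes
raw = (
    "product_name*price*color\n"
    "Demi Bra*$54.50*rosé\n"
    "Easy Plunge Bra*£29.50*Black\n"
).encode("ISO-8859-1")

# Same arguments as the cell above, applied to the in-memory bytes
sample = pd.read_csv(io.BytesIO(raw), sep="*", header=0, encoding="ISO-8859-1")
print(sample)
```

Decoding with the wrong encoding is a common source of garbled accented characters or currency symbols, so it is worth checking a few such values right after the import.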

In [None]:
df.head()

Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,retailer,description,rating,review_count,style_attributes,total_sizes,available_size,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,$36.50,$36.50,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Victoriassecret US,Game-changer: your favorite maximum-support sp...,3.6,25.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D3,White
1,Body by Victoria Demi Bra,$54.50,$19.99,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Demi Bra,Victoriassecret US,Sexy comfort and a sleek shape start with low-...,,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",38C,cadette green
2,Easy Plunge Bra,$29.50,$29.50,https://www.victoriassecret.com/bras/bralette/...,Victoria's Secret,Easy Plunge Bra,Victoriassecret US,This supersoft bra is easy to love with fully ...,4.4,260.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",34DD,Black
3,The T-Shirt Perfect Shape Bra,$39.50,$39.50,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Perfect Shape Bra,Victoriassecret US,The everyday go-to bra pairs sexy lift and the...,,,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D,Coconut White Matte Print
4,PINK NEW! Wear Everywhere Super Push,$32.95,$32.95,https://www.victoriassecret.com/pink/panties/w...,Victoria's Secret Pink,Wear Everywhere Super Push,Victoriassecret US,"A super flirty new style, with more push than ...",,,,"[""30AA"", ""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""...",32D,bayberry


[Table of Contents](#contents)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45339 non-null  object 
 1   mrp               45339 non-null  object 
 2   price             45339 non-null  object 
 3   pdp_url           45339 non-null  object 
 4   brand_name        45339 non-null  object 
 5   product_category  45339 non-null  object 
 6   retailer          45339 non-null  object 
 7   description       45339 non-null  object 
 8   rating            13662 non-null  float64
 9   review_count      13662 non-null  float64
 10  style_attributes  0 non-null      float64
 11  total_sizes       45339 non-null  object 
 12  available_size    45339 non-null  object 
 13  color             45339 non-null  object 
dtypes: float64(3), object(11)
memory usage: 4.8+ MB


In [None]:
df.describe()

Unnamed: 0,rating,review_count,style_attributes
count,13662.0,13662.0,0.0
mean,4.173371,9.311044999999999e+35,
std,0.484358,1.17109e+37,
min,0.0,2.0,
25%,4.0,39.0,
50%,4.3,149.0,
75%,4.5,410.0,
max,5.0,1.5600000000000001e+38,


<a id="prices"></a>
### The price variables (`mrp` and `price`) are not recognized as quantitative
- Pre-process them so that they can be read as numbers
- Create a function that removes the `$` symbol from prices in USD and replaces every other value with a missing value
- Apply it to each of the price columns
- Check again


In [None]:
## your code here ##
victoria = df.copy()

In [None]:
def remove_dollar(value):
    # Keep USD prices without the '$' symbol; anything else becomes a missing value
    if '$' in str(value):
        return str(value).replace('$', '')
    return np.nan

# Apply the same cleaning to both price columns, then drop the rows without a USD price
for col in ['price', 'mrp']:
    victoria[col] = victoria[col].apply(remove_dollar)
victoria.dropna(subset=['price', 'mrp'], inplace=True)

Check how many missing values we have

In [None]:
## your code here ##
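
One possible way to count missing values, shown here on a made-up miniature frame rather than on `victoria` itself (the `toy` name and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the `victoria` frame
toy = pd.DataFrame({
    "price": [36.5, np.nan, 29.5, 39.5],
    "rating": [3.6, np.nan, 4.4, np.nan],
    "color": ["White", "Black", "Black", "buff"],
})

# Number of missing values per column
missing_counts = toy.isna().sum()
# Share of missing values per column, in percent
missing_share = toy.isna().mean() * 100

print(missing_counts)
print(missing_share.round(1))
```

The same two lines should work unchanged on the real frame by replacing `toy` with `victoria`.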

Now, replace the two non-numerical price columns by numerical price columns (quantitative data)

In [None]:
## your code here ##
victoria['mrp'] = victoria['mrp'].astype(float)

In [None]:
## your code here ##
victoria['price'] = victoria['price'].astype(float)

Count the number of unique modalities in each variable of the dataframe

In [None]:
## your code here ##
victoria.nunique()

product_name         599
mrp                   72
price                 89
pdp_url             1410
brand_name             2
product_category     445
retailer               1
description          536
rating                31
review_count         333
style_attributes       0
total_sizes           30
available_size        44
color               1300
dtype: int64

Check the modalities of the `brand_name` variable

In [None]:
## your code here ##
victoria['brand_name'].value_counts()

Victoria's Secret         34208
Victoria's Secret Pink    11092
Name: brand_name, dtype: int64

Were we to continue the analysis of this dataset, we would certainly remove the following columns:
- `retailer`: it has no variability, so it is useless
- `style_attributes`: it does not have any values (all data missing)

In [None]:
victoria.drop(columns=['retailer', 'style_attributes'], inplace=True)

In [None]:
victoria.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45300 entries, 0 to 45338
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45300 non-null  object 
 1   mrp               45300 non-null  float64
 2   price             45300 non-null  float64
 3   pdp_url           45300 non-null  object 
 4   brand_name        45300 non-null  object 
 5   product_category  45300 non-null  object 
 6   description       45300 non-null  object 
 7   rating            13641 non-null  float64
 8   review_count      13641 non-null  float64
 9   total_sizes       45300 non-null  object 
 10  available_size    45300 non-null  object 
 11  color             45300 non-null  object 
dtypes: float64(4), object(8)
memory usage: 4.5+ MB


[Table of Contents](#contents)

<a id="pandas_exercise"></a>
### Pandas Exercise

1. Write the lines of code to provide the name of the cheapest product 
2. Write the lines of code to count the number of products with available size equal to '38A'
3. Write the lines of code to list and count the type and color of the most expensive products containing 'sport bra'

In [None]:
# Write the lines of code to provide the name of the cheapest product 
## your code here ##
display(victoria.loc[victoria['price'] == victoria['price'].min(), 'product_name'].unique())
display(victoria['price'].min())

array(['Cotton Lingerie Lace-waist Brief Panty',
       'Cotton Lingerie Mesh Thong Panty', 'Seamless Cheekini Panty',
       'Cotton Lingerie String Bikini Panty'], dtype=object)

2.99

In [None]:
victoria.head(5)

Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,description,rating,review_count,total_sizes,available_size,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Game-changer: your favorite maximum-support sp...,3.6,25.0,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D3,White
1,Body by Victoria Demi Bra,54.5,19.99,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Demi Bra,Sexy comfort and a sleek shape start with low-...,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",38C,cadette green
2,Easy Plunge Bra,29.5,29.5,https://www.victoriassecret.com/bras/bralette/...,Victoria's Secret,Easy Plunge Bra,This supersoft bra is easy to love with fully ...,4.4,260.0,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",34DD,Black
3,The T-Shirt Perfect Shape Bra,39.5,39.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Perfect Shape Bra,The everyday go-to bra pairs sexy lift and the...,,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D,Coconut White Matte Print
4,PINK NEW! Wear Everywhere Super Push,32.95,32.95,https://www.victoriassecret.com/pink/panties/w...,Victoria's Secret Pink,Wear Everywhere Super Push,"A super flirty new style, with more push than ...",,,"[""30AA"", ""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""...",32D,bayberry


In [None]:
# Write the lines of code to count the number of products with available size equal to '38A'
## your code here ##
display(victoria.loc[victoria['available_size'] == '38A', 'available_size'].count())

In [None]:
# Write the lines of code to list and count the type and color of the most expensive products containing 'sport bra'
## your code here ##
# Restrict to sport bras first, then keep those priced at the maximum within that subset
sport = victoria[victoria['product_category'].str.lower().str.contains('sport bra')]
display(sport.loc[sport['price'] == sport['price'].max(), ['product_category', 'color']].value_counts())

[Table of Contents](#contents)

<a id="recoding"></a>
## Recoding Categorical Data

### Import the `Credit.csv` dataset
- Recode all the categorical variables using sklearn's `OneHotEncoder` and pandas's `get_dummies`
- Compare your results

In [None]:
import pandas as pd
import os
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_info_columns', 300)

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('./data/Credit.csv', sep=',', header=0)

In [None]:
credit = df.copy()

In [None]:
credit.head(10)

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Own,Student,Married,Region,Balance
0,14.891,3606,283,2,34,11,No,No,Yes,South,333
1,106.025,6645,483,3,82,15,Yes,Yes,Yes,West,903
2,104.593,7075,514,4,71,11,No,No,No,West,580
3,148.924,9504,681,3,36,11,Yes,No,No,West,964
4,55.882,4897,357,2,68,16,No,No,Yes,South,331
5,80.18,8047,569,4,77,10,No,No,No,South,1151
6,20.996,3388,259,2,37,12,Yes,No,No,East,203
7,71.408,7114,512,2,87,9,No,No,No,West,872
8,15.125,3300,266,5,66,13,Yes,No,No,South,279
9,71.061,6819,491,3,41,19,Yes,Yes,Yes,East,1350


### sklearn OneHotEncoder

In [None]:
from sklearn.preprocessing import OneHotEncoder
# Note: from scikit-learn 1.2 onwards this argument is named sparse_output (sparse was renamed)
encoder = OneHotEncoder(sparse=False)
encoded_data = encoder.fit_transform(credit[['Own', 'Student', 'Married', 'Region']])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Own', 'Student', 'Married', 'Region']))

result = pd.concat([credit, encoded_df], axis=1)

In [None]:
result.head(10)


Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Own,Student,Married,Region,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,3606,283,2,34,11,No,No,Yes,South,333,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645,483,3,82,15,Yes,Yes,Yes,West,903,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075,514,4,71,11,No,No,No,West,580,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504,681,3,36,11,Yes,No,No,West,964,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897,357,2,68,16,No,No,Yes,South,331,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
5,80.18,8047,569,4,77,10,No,No,No,South,1151,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
6,20.996,3388,259,2,37,12,Yes,No,No,East,203,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
7,71.408,7114,512,2,87,9,No,No,No,West,872,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
8,15.125,3300,266,5,66,13,Yes,No,No,South,279,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
9,71.061,6819,491,3,41,19,Yes,Yes,Yes,East,1350,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0


In [None]:
encoded_data2 = pd.get_dummies(credit, columns=['Own', 'Student', 'Married', 'Region'])

In [None]:
encoded_data2.head(10)

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,3606,283,2,34,11,333,1,0,1,0,0,1,0,1,0
1,106.025,6645,483,3,82,15,903,0,1,0,1,0,1,0,0,1
2,104.593,7075,514,4,71,11,580,1,0,1,0,1,0,0,0,1
3,148.924,9504,681,3,36,11,964,0,1,1,0,1,0,0,0,1
4,55.882,4897,357,2,68,16,331,1,0,1,0,0,1,0,1,0
5,80.18,8047,569,4,77,10,1151,1,0,1,0,1,0,0,1,0
6,20.996,3388,259,2,37,12,203,0,1,1,0,1,0,1,0,0
7,71.408,7114,512,2,87,9,872,1,0,1,0,1,0,0,0,1
8,15.125,3300,266,5,66,13,279,0,1,1,0,1,0,0,1,0
9,71.061,6819,491,3,41,19,1350,0,1,0,1,0,1,1,0,0


[Table of Contents](#contents)

<a id="ohe"></a>
### Recode all the categorical variables using sklearn's `OneHotEncoder`

In [None]:
from sklearn.preprocessing import OneHotEncoder


df_cat = credit[['Own', 'Student', 'Married', 'Region']]
df_num = credit[['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Balance']]


# Note: from scikit-learn 1.2 onwards this argument is named sparse_output (sparse was renamed)
encoder = OneHotEncoder(sparse=False)
encoded_data = encoder.fit_transform(df_cat)
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(df_cat.columns))

encoder_rlt = pd.concat([credit, encoded_df], axis=1)

In [None]:
encoder_rlt.head(10)

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Own,Student,Married,Region,Balance,Own_No,Own_Yes,Student_No,Student_Yes,Married_No,Married_Yes,Region_East,Region_South,Region_West
0,14.891,3606,283,2,34,11,No,No,Yes,South,333,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,106.025,6645,483,3,82,15,Yes,Yes,Yes,West,903,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,104.593,7075,514,4,71,11,No,No,No,West,580,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,148.924,9504,681,3,36,11,Yes,No,No,West,964,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,55.882,4897,357,2,68,16,No,No,Yes,South,331,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
5,80.18,8047,569,4,77,10,No,No,No,South,1151,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
6,20.996,3388,259,2,37,12,Yes,No,No,East,203,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
7,71.408,7114,512,2,87,9,No,No,No,West,872,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
8,15.125,3300,266,5,66,13,Yes,No,No,South,279,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
9,71.061,6819,491,3,41,19,Yes,Yes,Yes,East,1350,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0


<a id="dummy"></a>
### Recode all the categorical variables using pandas's `get_dummies`

In [None]:
encoded_data2 = pd.get_dummies(credit, columns=df_cat.columns, dummy_na=True)
encoded_data2.head(10)

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Own_No,Own_Yes,Own_nan,Student_No,Student_Yes,Student_nan,Married_No,Married_Yes,Married_nan,Region_East,Region_South,Region_West,Region_nan
0,14.891,3606,283,2,34,11,333,1,0,0,1,0,0,0,1,0,0,1,0,0
1,106.025,6645,483,3,82,15,903,0,1,0,0,1,0,0,1,0,0,0,1,0
2,104.593,7075,514,4,71,11,580,1,0,0,1,0,0,1,0,0,0,0,1,0
3,148.924,9504,681,3,36,11,964,0,1,0,1,0,0,1,0,0,0,0,1,0
4,55.882,4897,357,2,68,16,331,1,0,0,1,0,0,0,1,0,0,1,0,0
5,80.18,8047,569,4,77,10,1151,1,0,0,1,0,0,1,0,0,0,1,0,0
6,20.996,3388,259,2,37,12,203,0,1,0,1,0,0,1,0,0,1,0,0,0
7,71.408,7114,512,2,87,9,872,1,0,0,1,0,0,1,0,0,0,0,1,0
8,15.125,3300,266,5,66,13,279,0,1,0,1,0,0,1,0,0,0,1,0,0
9,71.061,6819,491,3,41,19,1350,0,1,0,0,1,0,0,1,0,1,0,0,0


Check equivalence of the two dataframes

In [None]:
equivalent = encoder_rlt.equals(encoded_data2)

if equivalent:
    print("The DataFrames are equivalent.")
else:
    print("The DataFrames are not equivalent.")

The DataFrames are not equivalent.


In [None]:
df_cat = credit[['Own', 'Student', 'Married', 'Region']]
categorical_cols = df_cat.columns

# Create a new DataFrame for one-hot encoding
df_onehot = credit.copy()

# Initialize the OneHotEncoder
# Note: from scikit-learn 1.2 onwards this argument is named sparse_output (sparse was renamed)
encoder = OneHotEncoder(sparse=False)

# Iterate through each categorical column and encode it
for col in categorical_cols:
    encoded_cols = encoder.fit_transform(credit[[col]])
    # Create column names for the encoded features
    col_names = [f'{col}_{cat}' for cat in encoder.categories_[0]]
    # Add the encoded features to the DataFrame
    encoded_df = pd.DataFrame(encoded_cols, columns=col_names)
    df_onehot = pd.concat([df_onehot, encoded_df], axis=1)
    # Drop the original column
    df_onehot.drop(col, axis=1, inplace=True)

# Recode using pandas get_dummies
df_dummies = pd.get_dummies(credit, columns=categorical_cols)

# Check equivalence
equivalent = df_onehot.equals(df_dummies)

if equivalent:
    print("The two DataFrames are equivalent.")
else:
    print("The two DataFrames are not equivalent.")

The two DataFrames are not equivalent.


[Table of Contents](#contents)

<a id="mini_victoria"></a>
### Import the `mini_victoria.txt` dataset
- Which categorical variables should be one-hot encoded?
- Which categorical variables should be label encoded?


In [None]:
df = pd.read_csv('./data/mini_victoria.txt',sep='*', header=0 ,encoding='ISO-8859-1')

In [None]:
victoria = df.copy()

[Table of Contents](#contents)

Now, replace the two non-numerical price columns by numerical price columns (quantitative data)

In [None]:
def remove_dollar(value):
    # Keep USD prices without the '$' symbol; anything else becomes a missing value
    if '$' in str(value):
        return str(value).replace('$', '')
    return np.nan

# Apply the same cleaning to both price columns, then drop the rows without a USD price
for col in ['price', 'mrp']:
    victoria[col] = victoria[col].apply(remove_dollar)
victoria.dropna(subset=['price', 'mrp'], inplace=True)

In [None]:
victoria['mrp'] = victoria['mrp'].astype(float)
victoria['price'] = victoria['price'].astype(float)

[Table of Contents](#contents)

### Count the number of modalities for each categorical variable

In [None]:
victoria.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45300 entries, 0 to 45338
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45300 non-null  object 
 1   mrp               45300 non-null  float64
 2   price             45300 non-null  float64
 3   pdp_url           45300 non-null  object 
 4   brand_name        45300 non-null  object 
 5   product_category  45300 non-null  object 
 6   retailer          45300 non-null  object 
 7   description       45300 non-null  object 
 8   rating            13641 non-null  float64
 9   review_count      13641 non-null  float64
 10  style_attributes  0 non-null      float64
 11  total_sizes       45300 non-null  object 
 12  available_size    45300 non-null  object 
 13  color             45300 non-null  object 
dtypes: float64(5), object(9)
memory usage: 5.2+ MB


In [None]:
victoria.head(5)

Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,retailer,description,rating,review_count,style_attributes,total_sizes,available_size,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Victoriassecret US,Game-changer: your favorite maximum-support sp...,3.6,25.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D3,White
1,Body by Victoria Demi Bra,54.5,19.99,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Demi Bra,Victoriassecret US,Sexy comfort and a sleek shape start with low-...,,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",38C,cadette green
2,Easy Plunge Bra,29.5,29.5,https://www.victoriassecret.com/bras/bralette/...,Victoria's Secret,Easy Plunge Bra,Victoriassecret US,This supersoft bra is easy to love with fully ...,4.4,260.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",34DD,Black
3,The T-Shirt Perfect Shape Bra,39.5,39.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Perfect Shape Bra,Victoriassecret US,The everyday go-to bra pairs sexy lift and the...,,,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D,Coconut White Matte Print
4,PINK NEW! Wear Everywhere Super Push,32.95,32.95,https://www.victoriassecret.com/pink/panties/w...,Victoria's Secret Pink,Wear Everywhere Super Push,Victoriassecret US,"A super flirty new style, with more push than ...",,,,"[""30AA"", ""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""...",32D,bayberry


In [None]:
victoria.drop(columns=['retailer', 'style_attributes'], inplace=True)

In [None]:
cat_col = ['brand_name', 'product_category', 'color']
for col in cat_col:
    display(victoria[col].value_counts())

Victoria's Secret         34208
Victoria's Secret Pink    11092
Name: brand_name, dtype: int64

Demi Bra                                              3969
Push-Up Bra                                           3618
Perfect Coverage Bra                                  2548
Incredible by Victoria Sport Bra                      1953
Wear Everywhere Push-Up Bra                           1776
Perfect Shape Bra                                     1548
Knockout by Victoria Sport Front-Close Sport Bra      1413
Wear Everywhere T-Shirt Bra                            987
Incredible by Victoria Sport Front-close Sport Bra     981
Lightweight by Victoria Sport Bra                      721
Multi-Way Bra                                          707
Wear Everywhere Lightly Lined Bra                      688
Add-2-Cups Push-Up Bra                                 611
Wear Everywhere T-Back Push-Up Bra                     595
Easy Push-Up Bra                                       590
Unlined Demi Bra                                       584
Strappy Push-Up Bra                                    4

Black                                                  1789
black                                                  1330
White                                                  1096
pure black                                              688
Ensign                                                  679
Almost Nude                                             655
bayberry                                                543
Hello Lovely                                            496
Radiating Aztec                                         484
Sheer Pink                                              478
Black Marl                                              477
buff                                                    466
Burnished Lilac                                         440
triumph white                                           408
Black Lace                                              373
coconut white                                           340
white                                   

[Table of Contents](#contents)

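Any categorical variable with more than 20 modalities should be label-encoded <br>
Why 20 modalities, and not more or fewer? The threshold is only a rule of thumb: it depends on the number of remaining features. One-hot encoding adds one column per modality, so the more features there already are, the less one-hot encoding we can afford.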

In [None]:
## your code here ##
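
A minimal sketch of label encoding for a high-cardinality column such as `color`, using `pd.factorize`. The sample values are made up, and lower-casing before encoding is an assumption motivated by the value counts above, where `Black` and `black` appear as separate modalities.

```python
import pandas as pd

# Hypothetical sample of a high-cardinality column such as `color`
toy = pd.DataFrame({"color": ["Black", "black", "White", "bayberry", "white", "Black"]})

# Normalise case and whitespace first so that 'Black' and 'black' share one code
normalised = toy["color"].str.strip().str.lower()

# factorize assigns one integer per distinct value, in order of first appearance
codes, uniques = pd.factorize(normalised)
toy["color_code"] = codes

print(toy)
print(list(uniques))
```

Integer codes impose an arbitrary order on the modalities, which tree-based models usually tolerate better than linear models do.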

[Table of Contents](#contents)

In [None]:
## your code here ##

[Table of Contents](#contents)

Any categorical variable with fewer than 20 modalities should be one-hot encoded <br>
Why 20 modalities, and not more or fewer? Again, it depends on the number of remaining features: the more features, the less one-hot encoding...
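
The rule of thumb can be sketched as a split of the categorical columns by their number of modalities. Everything below is illustrative: the `toy` frame is made up, and the threshold is lowered to 3 so that the split is visible on four rows. Sending a column with exactly `threshold` modalities to label encoding is also an arbitrary choice, since the text does not cover that case.

```python
import pandas as pd

# Hypothetical miniature of the cleaned frame
toy = pd.DataFrame({
    "brand_name": ["Victoria's Secret", "Victoria's Secret Pink",
                   "Victoria's Secret", "Victoria's Secret Pink"],
    "color": ["Black", "White", "bayberry", "buff"],
    "price": [36.5, 19.99, 29.5, 32.95],
})

threshold = 3  # stands in for the rule-of-thumb value of 20
cat_cols = ["brand_name", "color"]

# Count the modalities of each categorical column and split around the threshold
n_modalities = toy[cat_cols].nunique()
onehot_cols = n_modalities[n_modalities < threshold].index.tolist()
label_cols = n_modalities[n_modalities >= threshold].index.tolist()

# One-hot encode only the low-cardinality columns
encoded = pd.get_dummies(toy, columns=onehot_cols)

print(onehot_cols, label_cols)
print(encoded.columns.tolist())
```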

Create a function that cleans and formats the `total_sizes` column

In [None]:
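import re

def clean(row):
    # Replace every character that is not an uppercase letter or a digit with a space
    row = re.sub(r'[^A-Z0-9]', ' ', row)
    # Split on whitespace and drop the empty strings
    return [item for item in re.split(r'\s+', row) if item != '']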

In [None]:
victoria['total_sizes'] = victoria['total_sizes'].apply(clean)

In [None]:
victoria['total_sizes']

0        [32A, 32B, 32C, 32D, 32DD, 32DDD, 34A, 34B, 34...
1        [30A, 30B, 30C, 30D, 30DD, 30DDD, 32A, 32B, 32...
2        [32A, 32B, 32C, 32D, 32DD, 34A, 34B, 34C, 34D,...
3        [32A, 32B, 32C, 32D, 32DD, 32DDD, 34A, 34B, 34...
4        [30AA, 30A, 30B, 30C, 30D, 30DD, 32AA, 32A, 32...
5                                        [XS, S, M, L, XL]
6        [30AA, 30A, 30B, 30C, 30D, 30DD, 32AA, 32A, 32...
7                                        [XS, S, M, L, XL]
8        [30A, 30B, 30C, 30D, 30DD, 30DDD, 32A, 32B, 32...
9        [32A, 32B, 32C, 32D, 32DD, 34A, 34B, 34C, 34D,...
10                                       [XS, S, M, L, XL]
11       [32A, 32B, 32C, 32D, 32DD, 32DDD, 34A, 34B, 34...
12       [32AA, 32A, 32B, 32C, 32D, 32DD, 34AA, 34A, 34...
13       [32B, 32C, 32D, 32DD, 34B, 34C, 34D, 34DD, 36B...
14       [32B, 32C, 32D, 32DD, 32DDD, 34B, 34C, 34D, 34...
15                                       [XS, S, M, L, XL]
16                                       [XS, S, M, L, X

[Table of Contents](#contents)

### Explore the .explode() method with the `total_sizes` column

In [None]:
# df_ohe is not defined at this point in the notebook, so explode the cleaned victoria frame
df_exp = victoria.explode('total_sizes')
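
A small sketch of what `explode` does, on a made-up frame (the product names and sizes are illustrative). Each element of a list becomes its own row, the original index label is repeated, and an empty list yields a single row with a missing value.

```python
import pandas as pd

# Hypothetical frame with one list of sizes per product
toy = pd.DataFrame({
    "product_name": ["Demi Bra", "Sport Top", "No Size"],
    "total_sizes": [["32A", "32B", "34A"], ["XS", "S"], []],
})

# One row per (product, size) pair; the index labels are repeated
exploded = toy.explode("total_sizes")
print(exploded)
```

The empty-list behaviour would be consistent with the few non-null values missing from `total_sizes` in the `df_exp.info()` output.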

In [None]:
df_exp.info()
df_exp.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 884916 entries, 0 to 45338
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   product_name      884916 non-null  object 
 1   mrp               884916 non-null  float64
 2   price             884916 non-null  float64
 3   pdp_url           884916 non-null  object 
 4   brand_name        884916 non-null  object 
 5   product_category  884916 non-null  object 
 6   description       884916 non-null  object 
 7   rating            272277 non-null  float64
 8   review_count      272277 non-null  float64
 9   total_sizes       884898 non-null  object 
 10  available_size    884916 non-null  object 
 11  color             884916 non-null  object 
dtypes: float64(4), object(8)
memory usage: 87.8+ MB


Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,description,rating,review_count,total_sizes,available_size,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Game-changer: your favorite maximum-support sp...,3.6,25.0,32A,32D3,White
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Game-changer: your favorite maximum-support sp...,3.6,25.0,32B,32D3,White
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Game-changer: your favorite maximum-support sp...,3.6,25.0,32C,32D3,White
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Game-changer: your favorite maximum-support sp...,3.6,25.0,32D,32D3,White
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Game-changer: your favorite maximum-support sp...,3.6,25.0,32DD,32D3,White


In [None]:
df_exp.nunique()

product_name         599
mrp                   72
price                 89
pdp_url             1410
brand_name             2
product_category     445
description          536
rating                31
review_count         333
total_sizes           52
available_size        44
color               1300
dtype: int64

[Table of Contents](#contents)

<a id="missing_data"></a>
## Handling Missing Data 

In [None]:
import pandas as pd
import os
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_info_columns', 300)

import warnings
warnings.filterwarnings('ignore')

[Table of Contents](#contents)

<a id="import_datasets"></a>
### Importing datasets

Import the `Credit.dat` dataset

In [None]:
df_miss = pd.read_csv('./data/Credit.dat', sep='\t')

Import the `Credit.csv` dataset

In [None]:
df = pd.read_csv('./data/Credit.csv', sep=',', header=0)

In [None]:
df_miss.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Own,Student,Married,Region,Balance
0,14.891,,283.0,2.0,34.0,11.0,,No,Yes,South,333.0
1,106.025,6645.0,483.0,3.0,,15.0,Yes,Yes,Yes,West,903.0
2,104.593,7075.0,,4.0,71.0,11.0,,No,No,West,580.0
3,148.924,9504.0,681.0,3.0,36.0,11.0,,No,No,West,964.0
4,55.882,4897.0,357.0,2.0,,16.0,No,No,Yes,South,331.0


In [None]:
df_miss.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Income     314 non-null    float64
 1   Limit      321 non-null    float64
 2   Rating     310 non-null    float64
 3   Cards      314 non-null    float64
 4   Age        316 non-null    float64
 5   Education  314 non-null    float64
 6   Own        317 non-null    object 
 7   Student    330 non-null    object 
 8   Married    325 non-null    object 
 9   Region     314 non-null    object 
 10  Balance    355 non-null    float64
dtypes: float64(7), object(4)
memory usage: 34.5+ KB


[Table of Contents](#contents)

<a id="median"></a>
### Imputation with the median

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
def impute_with_median(dataframe, column_name):
    # Fill the missing values of one column with that column's median
    median = dataframe[column_name].median()
    dataframe[column_name] = dataframe[column_name].fillna(median)
    return dataframe

def calculate_mse(original, imputed):
    # Both inputs must be free of NaN, otherwise mean_squared_error raises a ValueError
    return mean_squared_error(original, imputed)


In [30]:
# Work on the incomplete dataset only, so that df keeps the complete Credit.csv data
# Convert columns to appropriate data types for calculations
for col in ['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Balance']:
    df_miss[col] = pd.to_numeric(df_miss[col], errors='coerce')

# Display the first few rows to confirm
df_miss.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Own,Student,Married,Region,Balance
0,14.891,,283.0,2.0,34.0,11.0,,No,Yes,South,333.0
1,106.025,6645.0,483.0,3.0,,15.0,Yes,Yes,Yes,West,903.0
2,104.593,7075.0,,4.0,71.0,11.0,,No,No,West,580.0
3,148.924,9504.0,681.0,3.0,36.0,11.0,,No,No,West,964.0
4,55.882,4897.0,357.0,2.0,,16.0,No,No,Yes,South,331.0


In [31]:
from sklearn.metrics import mean_squared_error
import numpy as np

# Function to impute missing values with the median of the column
def impute_with_median(dataframe, column_name):
    median = dataframe[column_name].median()
    dataframe[column_name] = dataframe[column_name].fillna(median)
    return dataframe

# Function to calculate the Mean Squared Error (MSE) between the complete and the imputed data
def calculate_mse(original, imputed):
    return mean_squared_error(original, imputed)

# Choose the column to impute
column_to_impute = 'Age'

# Perform the imputation on a copy so that df_miss keeps its missing values
imputed_data = impute_with_median(df_miss.copy(), column_to_impute)

# Compare the imputed column with the true values from the complete dataset (df).
# This assumes that both files list the same rows in the same order.
mse = calculate_mse(df[column_to_impute], imputed_data[column_to_impute])
print(f'Mean Squared Error (MSE) for imputation: {mse}')

In [None]:
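# Impute every numeric column that still has missing values with its median
imputed_all = df_miss.copy()
num_cols = imputed_all.select_dtypes(include='number').columns
imputed_all[num_cols] = imputed_all[num_cols].fillna(imputed_all[num_cols].median())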

Compute the overall imputation error using the MSE
- Suggestion: use a function...

In [None]:
## your code here ##
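
One way to write such a function is sketched below. It is an assumption-laden illustration, not the expected solution: the helper name `imputation_mse` is hypothetical, the frames are made up, and the error is measured only on the cells that were originally missing, with the complete and incomplete frames assumed to share the same rows in the same order.

```python
import numpy as np
import pandas as pd

def imputation_mse(complete, incomplete, imputed, columns):
    """MSE between true and imputed values, restricted to originally missing cells."""
    errors = {}
    for col in columns:
        mask = incomplete[col].isna()
        if mask.any():
            diff = complete.loc[mask, col] - imputed.loc[mask, col]
            errors[col] = float((diff ** 2).mean())
    return pd.Series(errors, dtype=float)

# Hypothetical complete data and a copy with two values removed
complete = pd.DataFrame({"Age": [34.0, 82.0, 71.0, 36.0], "Cards": [2.0, 3.0, 4.0, 3.0]})
incomplete = complete.copy()
incomplete.loc[1, "Age"] = np.nan
incomplete.loc[2, "Cards"] = np.nan

# Median imputation, column by column
imputed = incomplete.fillna(incomplete.median())

per_column = imputation_mse(complete, incomplete, imputed, ["Age", "Cards"])
print(per_column)
print(per_column.mean())
```

Averaging the per-column errors gives a single number, but columns on large scales dominate it, so comparing methods column by column may be more informative.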

[Table of Contents](#contents)

<a id="mean"></a>
### Imputation with the mean

In [None]:
## your code here ##

Compute the overall imputation error using the MSE
- Suggestion: use a function...

In [None]:
## your code here ##

[Table of Contents](#contents)

<a id="linear"></a>
### Imputation with linear interpolation

In [None]:
## your code here ##
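
A minimal sketch of linear interpolation with `Series.interpolate`, on a made-up series. It also shows how gaps at the edges are handled, which matters when the first or last rows of a column are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps at the start, in the middle and at the end
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

# Default behaviour: gaps between known values are filled linearly,
# the leading NaN stays missing and the trailing NaN repeats the last known value
default = s.interpolate(method="linear")

# limit_direction='both' also fills the leading gap with the first known value
both = s.interpolate(method="linear", limit_direction="both")

print(default.tolist())
print(both.tolist())
```

Linear interpolation relies on the order of the rows. If the rows have no natural order, as is likely for a table of customers, the interpolated values depend on an arbitrary ordering.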

Compute the overall imputation error using the MSE
- Suggestion: use a function...

In [None]:
## your code here ##

[Table of Contents](#contents)

<a id="simple"></a>
### Simple imputation

Using the mean as constant

In [None]:
## your code here ##
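
A possible starting point with scikit-learn's `SimpleImputer` and the `mean` strategy, on a made-up numeric frame (the `toy` name and its values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns with one missing value each
toy = pd.DataFrame({"Age": [34.0, np.nan, 71.0, 36.0], "Cards": [2.0, 3.0, np.nan, 4.0]})

# Learn one mean per column, then fill the missing cells with it
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(imputer.statistics_)
print(filled)
```

`fit_transform` returns a NumPy array, hence the wrapping in a `DataFrame` to keep the column names and the index.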

In [None]:
## your code here ##

[Table of Contents](#contents)

Using the mode as constant

In [None]:
## your code here ##
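
For categorical columns, the same class can be used with the `most_frequent` strategy. The sketch below uses a made-up frame; casting to `object` is a precaution taken here so that the missing values reach scikit-learn as plain `np.nan`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical columns with one missing value each
toy = pd.DataFrame({
    "Own": ["No", np.nan, "Yes", "No"],
    "Region": ["South", "West", np.nan, "West"],
})

# Fill each column with its most frequent value (the mode)
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(
    imputer.fit_transform(toy.astype(object)), columns=toy.columns, index=toy.index
)

print(filled)
```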

In [None]:
## your code here ##

[Table of Contents](#contents)

<a id="multiple"></a>
### Multiple imputation

In [None]:
## your code here ##
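
A hedged sketch using scikit-learn's `IterativeImputer`, which models each feature with missing values as a function of the other features. The `Limit` and `Rating` values are taken from the first rows of the credit table above, but the positions of the missing cells are made up. Strictly speaking, a single run gives one imputed dataset; drawing several with `sample_posterior=True` and different seeds would be closer to multiple imputation in the statistical sense.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer

# Two correlated columns with one hypothetical missing value each
toy = pd.DataFrame({
    "Limit": [3606.0, 6645.0, 7075.0, np.nan, 4897.0, 8047.0],
    "Rating": [283.0, 483.0, np.nan, 681.0, 357.0, 569.0],
})

# Each incomplete column is regressed on the other one, iteratively
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled)
```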

In [None]:
## your code here ##

[Table of Contents](#contents)

<a id="neighbors"></a>
### K-Nearest Neighbors

With the default 5 neighbors

In [None]:
## your code here ##
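
A minimal sketch with scikit-learn's `KNNImputer` and its default of 5 neighbors. The values come from the first rows of the credit table above, with one `Rating` removed for illustration. The missing value is replaced by the average `Rating` of the 5 rows closest in `Limit`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Seven rows, one hypothetical missing Rating
toy = pd.DataFrame({
    "Limit": [3606.0, 6645.0, 7075.0, 9504.0, 4897.0, 8047.0, 3388.0],
    "Rating": [283.0, 483.0, np.nan, 681.0, 357.0, 569.0, 259.0],
})

# Distances are computed on the features that are observed in both rows
imputer = KNNImputer(n_neighbors=5)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled)
```

Because the method relies on distances, features on very different scales are usually standardized first; that step is left out here to keep the sketch short.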

In [None]:
## your code here ##

[Table of Contents](#contents)

## Conclusion

**On average, the multiple (iterative) and the KNN imputation methods are clearly the best**