# Hot Deck Imputation Trials Preparation

The data fed to the imputer should be label encoded and then one-hot encoded.

However, it must still contain some missing values when it reaches the imputer, in both the training and the testing phase.

The data therefore needs to be processed into that state before it can be used in the trials.

In [1]:
%cd ..

c:\Users\nick\OneDrive\Desktop\Prospect 33\Mini_DIVA


In [2]:
# from google.colab import drive
# drive.mount('/content/drive')
# %cd '/content/drive/My Drive/Mini_DIVA'

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings

from imputers.hotDeckImputer import hotDeckImputer
from sklearn.metrics import accuracy_score, mean_squared_error

pd.set_option("display.max_columns", None)
warnings.filterwarnings("ignore")

## Loading data

In [4]:
# read the data
file_dir = "../Mini_DIVA/datasets/Automobile.csv"

df = pd.read_csv(file_dir)
df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,num_of_cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
1,2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
2,1,158,audi,gas,std,four,sedan,fwd,front,105.8,192.7,71.4,55.7,2844,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
3,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,192.7,71.4,55.9,3086,ohc,five,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875
4,2,192,bmw,gas,std,two,sedan,rwd,front,101.2,176.8,64.8,54.3,2395,ohc,four,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16430


In [5]:
# function for determining a categorical variable vs not (from utils.py)
def iscategorical(x, threshold=0.12):
    """
    Determine if x is a categorical variable.

    Inputs:
    ------------------------------------------------------------
    x: pd.Series, pd.DataFrame or np.ndarray, a vector

    Outputs:
    ------------------------------------------------------------
    Bool value
    """
    # convert x to np.ndarray
    if isinstance(x, (pd.DataFrame, pd.Series)):
        x = x.to_numpy()

    if x.dtype in ["object", "bool", "str"]:
        return True
    elif len(np.unique(x[~np.isnan(x)])) < threshold * len(
        x[~np.isnan(x)]
    ):
        return True
    else:
        return False
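
The threshold rule can be illustrated on two toy columns. This is a minimal sketch: the helper `unique_ratio` and the arrays are hypothetical and only reproduce the ratio that is compared against the threshold of 0.12.

```python
import numpy as np

def unique_ratio(values):
    """Share of distinct values among the non-missing entries."""
    values = np.asarray(values, dtype=float)
    observed = values[~np.isnan(values)]
    return len(np.unique(observed)) / len(observed)

# 50 observed values drawn from only 3 levels, plus one missing entry
few_levels = np.array([0, 1, 2] * 16 + [0, 1, np.nan], dtype=float)
# 50 distinct observed values
many_levels = np.arange(50, dtype=float)

ratio_few = unique_ratio(few_levels)    # 3 / 50 = 0.06
ratio_many = unique_ratio(many_levels)  # 50 / 50 = 1.0

threshold = 0.12
print(ratio_few < threshold, ratio_many < threshold)
```

A numeric column is treated as categorical only when its ratio falls below the threshold, so the first column would qualify and the second would not.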

In [6]:
# creating a mask for categorical variables, then filtering the columns using it
categorical_mask = {col: iscategorical(df[col]) for col in df.columns}
cat_vars = [col for col, is_cat in categorical_mask.items() if is_cat]

# making a copy of the original dataframe to keep it unaltered
df_le = df.copy()


## Label encoding & setting the missing fraction

In [7]:
# label encode the data
for col in cat_vars:
    le = LabelEncoder()
    df_le[col] = le.fit_transform(df[col])

# copy of the label encoded dataframe; a fraction of its values will be set to nan
df_frac = df_le.copy()
random_state = 20

for col in df_frac.columns:
    # sample 8% of the rows of this column and set them to nan;
    # the seed changes per column so that each column gets its own sample
    missing = df_frac[col].sample(frac=0.08, random_state=random_state, replace=False).index
    # .loc selects by index label, which is what .sample() returns
    df_frac.loc[missing, col] = np.nan
    random_state += 2
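
To see what the per-column sampling does, here is a minimal sketch on a made-up frame of 100 rows (the frame `toy` and its column names are hypothetical). With `frac=0.08`, each column should lose 8 values.

```python
import numpy as np
import pandas as pd

# hypothetical complete data: 100 rows, 2 numeric columns
toy = pd.DataFrame({
    "a": np.arange(100, dtype=float),
    "b": np.arange(100, 200, dtype=float),
})

masked = toy.copy()
seed = 20
for col in masked.columns:
    # pick 8% of the rows for this column and blank them out
    drop_idx = masked[col].sample(frac=0.08, random_state=seed).index
    masked.loc[drop_idx, col] = np.nan
    seed += 2  # new seed per column, as in the cell above

missing_per_column = masked.isna().sum()
print(missing_per_column)
```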

In [8]:
# recording the row positions of the missing values in each column
missing_idx = {
    col: np.flatnonzero(df_frac[col].isna()).tolist() for col in df_frac.columns
}

## One-hot encoding

First we have to impute the data with the mean (numerical columns) and/or the mode (categorical columns) so that it works with the one-hot encoder.

In [9]:
# dictionary to save the values to be imputed per column
imputed_value = {}

# obtaining the mode/mean to use
for col in df_le.columns:
    if col in cat_vars:
        # mode() returns a Series (ties are possible), so take its first entry
        imputed_value[col] = float(df_le[col].mode().iloc[0])
    else:
        imputed_value[col] = float(df_le[col].mean())
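
As a minimal sketch of how such mean/mode values can be used to fill gaps, consider the made-up frame below. The frame `toy`, its column names and `toy_cat_vars` are hypothetical; the fill uses `DataFrame.fillna` with a per-column dictionary.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "colour_code": [0, 1, 1, np.nan, 1],          # label encoded categorical
    "weight": [10.0, np.nan, 14.0, 12.0, 12.0],   # continuous
})
toy_cat_vars = ["colour_code"]

# mode for categorical columns, mean for numerical columns
fill_values = {
    col: toy[col].mode().iloc[0] if col in toy_cat_vars else toy[col].mean()
    for col in toy.columns
}

# fillna accepts a {column: value} mapping
filled = toy.fillna(fill_values)
print(filled)
```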

In [10]:
# instantiate the ohe encoder
# (scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions)
ohe = OneHotEncoder(drop="first", sparse=False)

# fitting to df_le because it is the last version of data that is complete
ohe.fit(df_le[cat_vars])

# one-hot encode the data
cat_transComp = ohe.transform(df_le[cat_vars])
cat_transNames = ohe.get_feature_names_out()

# switch them back to dataframes
cat_oheComp = pd.DataFrame(cat_transComp, columns=cat_transNames, index=df_le.index)
df_oheComp = cat_oheComp.join(df_le[[col for col in df_le.columns if col not in cat_vars]])

df_oheComp.sample(5)

Unnamed: 0,symboling_1,symboling_2,symboling_3,symboling_4,symboling_5,make_1,make_2,make_3,make_4,make_5,make_6,make_7,make_8,make_9,make_10,make_11,make_12,make_13,make_14,make_15,make_16,make_17,fuel_type_1,aspiration_1,num_of_doors_1,body_style_1,body_style_2,body_style_3,body_style_4,drive_wheels_1,drive_wheels_2,engine_type_1,engine_type_2,engine_type_3,engine_type_4,num_of_cylinders_1,num_of_cylinders_2,num_of_cylinders_3,num_of_cylinders_4,fuel_system_1,fuel_system_2,fuel_system_3,fuel_system_4,fuel_system_5,normalized_losses,wheel_base,length,width,height,curb_weight,engine_size,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
5,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,192,101.2,176.8,64.8,54.3,2395,108,3.5,2.8,8.8,101,5800,23,29,16925
36,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,113,93.1,166.8,64.2,54.1,1945,91,3.03,3.15,9.0,68,5000,31,38,6695
61,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,128,94.5,165.3,63.8,54.5,1918,97,3.15,3.29,9.4,69,5200,31,37,6649
122,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,168,94.5,168.7,64.0,52.6,2169,98,3.19,3.03,9.0,70,4800,29,34,8058
157,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,95,109.1,188.8,68.9,55.5,3217,145,3.01,3.4,23.0,106,4800,26,27,22470
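
To make the effect of dropping the first level concrete, here is a minimal sketch on a made-up frame. It uses `pandas.get_dummies(drop_first=True)` as a stand-in for `OneHotEncoder(drop="first")`, which is an assumption made only to keep the example short; the frame `toy` is hypothetical.

```python
import pandas as pd

# hypothetical label encoded categorical columns
toy = pd.DataFrame({
    "fuel_type": [0, 1, 1, 0],    # 2 levels -> 1 dummy column
    "body_style": [0, 1, 2, 1],   # 3 levels -> 2 dummy columns
})

dummies = pd.get_dummies(
    toy, columns=["fuel_type", "body_style"], drop_first=True, dtype=float
)
dummy_names = list(dummies.columns)
print(dummy_names)
```

Level 0 of each variable has no column of its own; it is represented by all of that variable's dummy columns being zero, which is why the table above starts at `symboling_1`, `make_1`, and so on.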


In [11]:
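# setting the records that were sampled as missing back to null/nan
df_oheMiss = df_oheComp.copy()

for col, idx_list in missing_idx.items():
    # every column derived from `col`: the column itself or its one-hot dummies
    target_cols = [missCol for missCol in df_oheMiss.columns if col in missCol]
    # .loc avoids chained assignment, which may silently write to a copy
    df_oheMiss.loc[df_oheMiss.index[idx_list], target_cols] = np.nan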

The data is now ready to be fed to the imputer.

# HotDeck Script Implementation Tests

New lists of numerical and categorical variables have to be created for the imputer: the data has been one-hot encoded since `cat_vars` was built, so the categorical columns have changed.

It is important to note that the imputer makes and outputs predictions in both its training and its testing phase. The imputer can therefore be considered tested even when only the training phase is run. For this reason, our test only calls the imputer's `.fit` method.

In [12]:
# new cat vars since ohe was done on the data
num_vars_ = [col for col in df_le.columns if col not in cat_vars]
cat_vars_ = [col for col in df_oheComp.columns if col not in num_vars_]

In [13]:
# instantiating the imputer
hot = hotDeckImputer(num_vars_, cat_vars_, 6)
# fitting to data with missing values
fitted_df = hot.fit(df_oheMiss)

In [14]:
print(f"Train data fed to the imputer had an average of {df_oheMiss.isna().sum().mean()} missing values per column.\n")
print(f"The resulting data after imputation has an average of {fitted_df.isna().sum().mean()} missing values per column")

Train data fed to the imputer had an average of 13.0 missing values per column.

The resulting data after imputation has an average of 0.0 missing values per column


## Investigating model performance

We will be using accuracy score for categorical variables and RMSE for numerical variables.
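
As a quick illustration of the two metrics, here is a minimal sketch on made-up values (the arrays are hypothetical and not taken from the Automobile data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# hypothetical true vs. imputed values for one one-hot column
true_cat = np.array([1.0, 0.0, 0.0, 1.0])
imputed_cat = np.array([1.0, 0.0, 1.0, 1.0])
acc = accuracy_score(true_cat, imputed_cat)  # share of exact matches

# hypothetical true vs. imputed values for one numerical column
true_num = np.array([100.0, 200.0, 300.0])
imputed_num = np.array([110.0, 190.0, 300.0])
rmse = np.sqrt(mean_squared_error(true_num, imputed_num))

print(f"accuracy = {acc:.2f}, RMSE = {rmse:.3f}")
```

Accuracy is bounded between 0 and 1, whereas RMSE is expressed in the units of the column, so the two kinds of score are not directly comparable.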

In [15]:
# variable to store the scores
score_dict = {}

# calculate the error scores for the imputation
for col, idxList in missing_idx.items():
    original = df_oheComp.iloc[idxList][[_ for _ in df_oheComp.columns if col in _]]
    predicted = fitted_df.iloc[idxList][[_ for _ in df_oheComp.columns if col in _]]
    
    if col in num_vars_:
        score = np.sqrt(mean_squared_error(original, predicted))
        score_dict[col] = score
    else:
        for var in original.columns:
            score = accuracy_score(original[var], predicted[var])
            score_dict[var] = score

# create a score dataframe
scores = pd.DataFrame.from_dict(score_dict, orient="index", columns=['Score']).reset_index()
scores.columns = ["Column", "Score"]

# export the scores to a csv file for future reference
scores.to_csv("../MINI_DIVA/ModelIterations/HotDeckPCAIter.csv")

## Original/Baseline imputer

These are the results of imputation using the basic (first iteration) imputer.

In [16]:
# check the scores of the original hotdeck model
baseline_res = pd.read_csv("../MINI_DIVA/ModelIterations/HotDeckBaseIterRes.csv", index_col="Unnamed: 0")
baseline_res.sample(5, random_state=40)

Unnamed: 0,Column,Score
4,symboling_5,0.923077
33,length,5.503932
20,make_15,0.692308
58,price,7303.721224
40,engine_type_4,0.923077


The imputation error is noticeably higher for the numerical variables.

    To note: price, peak_rpm, and curb_weight each have an RMSE greater than 400.

It is clear that the imputer struggles with continuous numerical data. We will look into improving the performance by tweaking the model parameters.

After trying various hyperparameter combinations, we found no significant change in the imputer's results. We therefore rule out further hyperparameter tuning as a way of improving the imputer's performance.


## Removing outliers in continuous numerical data and Balancing imbalanced categorical data

In this section we look at the results of a version of the imputer in which outliers and imbalanced categorical variables have been dealt with.

We use SMOTE to oversample the minority class in an imbalanced categorical variable and use a threshold of `2 * IQR` to drop outliers when building the underlying models. 
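
The outlier rule can be sketched as follows. This is only an assumption about how a `2 * IQR` threshold might be applied (values outside `[Q1 - 2*IQR, Q3 + 2*IQR]` are dropped); the helper name `drop_iqr_outliers` and the toy values are hypothetical, and the SMOTE step is not shown.

```python
import pandas as pd

def drop_iqr_outliers(series, k=2.0):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    # between() is inclusive and returns False for NaN entries
    keep = series.between(q1 - k * iqr, q3 + k * iqr)
    return series[keep]

# hypothetical values with one obvious outlier
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 100.0])
kept = drop_iqr_outliers(prices)
print(kept.tolist())
```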

In [17]:
# check the scores of hotdeck imputer with outliers removed and balanced classes
outlier_res = pd.read_csv("../MINI_DIVA/ModelIterations/HotDeckOutlierIter.csv", index_col="Unnamed: 0")
outlier_res.sample(5, random_state=40)

Unnamed: 0,Column,Score
4,symboling_5,1.0
33,length,5.50393
20,make_15,0.846154
58,price,7279.550447
40,engine_type_4,0.692308


There is a minor reduction in the imputation error (in the sample above, the RMSE for price drops from 7303.72 to 7279.55). It is also worth noting that the imputation accuracy across the categorical variables is more uniform (less variance in accuracies).

We will stick with the current version and continue to improve on it.

## Applying Principal Component Analysis (PCA) to the data

With outliers and imbalances out of the way, we need to look at reducing model complexity.

This should improve the performance of the models, as there is a very high chance that the test data (the data with missing values) will have significantly fewer samples than the train data. In other words, the imputer is very likely to overfit.
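
A minimal sketch of the idea, on synthetic data: the features below are noisy combinations of two latent factors, so PCA can represent them with far fewer columns. Standardising first and keeping 95% of the variance are illustrative assumptions, not necessarily the settings used inside the imputer.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# synthetic data: 100 samples, 6 features built from 2 latent factors plus noise
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 6))

# PCA is scale sensitive, so the features are standardised first
X_scaled = StandardScaler().fit_transform(X)

# a float n_components keeps enough components to explain that share of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
```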

In [18]:
# check the scores of the hotdeck imputer with PCA applied
pca_res = pd.read_csv("../MINI_DIVA/ModelIterations/HotDeckPCAIter.csv", index_col="Unnamed: 0")
pca_res.sample(5, random_state=40)

Unnamed: 0,Column,Score
4,symboling_5,1.0
33,length,5.252267
20,make_15,0.846154
58,price,7371.083197
40,engine_type_4,0.692308


The tests and trials of the hot deck imputer were successful.