# Multiple Imputation by Chained Equations (MICE) Algorithm

### Notes

Choosing an appropriate imputation method is crucial to preserving the distribution of the data. Missing data often occurs in independent variables for reasons such as human error, data entry problems, or equipment malfunctions during data collection.

Most machine learning models expect complete data with no null values. Rows with missing data points can hurt a model’s predictive power by introducing bias and reducing accuracy, and they can make some statistical analyses inapplicable. This is why it is essential to handle missing data effectively.

Replacing missing values with estimates is called missing data imputation.

MICE (Multiple Imputation by Chained Equations) is a powerful technique for handling missing data in datasets. Instead of simply deleting rows with missing values or replacing them with a single estimate (like the mean), MICE creates multiple plausible versions of the complete dataset, reflecting the uncertainty around the missing values.
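With its default settings, the `IterativeImputer` used in this notebook returns a single completed dataset. As a rough sketch of the "multiple" part of MICE, one option is to run the imputer several times with `sample_posterior=True` and different seeds, then compare the completed datasets. The toy array and all variable names below are illustrative assumptions, not part of the customer data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: three correlated numeric columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Remove every fifth value of the last column.
X_missing = X.copy()
X_missing[::5, 2] = np.nan

# One completed dataset per seed. sample_posterior=True draws each imputed
# value from the estimator's predictive distribution instead of its mean.
completed = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(5)
]

stacked = np.stack(completed)        # shape: (5, 50, 3)
pooled_mean = stacked.mean(axis=0)   # average over the five imputations
between_std = stacked.std(axis=0)    # spread across imputations
```

Observed entries are identical in every completed dataset, so `between_std` is non-zero only where values were imputed; that spread is what a single imputation hides.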

In [114]:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import KNNImputer, SimpleImputer, IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler


In [214]:
#data = pd.read_csv('https://raw.githubusercontent.com/Impact-Insights/Data-Engineering-Projects/refs/heads/main/Customer%20Purchases%20Project/Customer_Purchase_Data.csv?token=GHSAT0AAAAAAC7OUMXN6GPU7MWDK5RBLW3GZ6G5I4Q')
data = pd.read_csv(r'Customer_Purchase_Data.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,Customer_ID,Age,Salary,City,Gender,Purchase_Amount,Signup_Date,Marital_Status,Education
0,0,1,40.0,70000.0,New York,,619,2022-01-01,Widowed,PhD
1,1,2,45.0,30000.0,Houston,Female,1945,2022-01-02,Single,PhD
2,2,3,35.0,30000.0,New York,Male,2783,2022-01-03,,Master
3,3,4,45.0,30000.0,,Other,3913,2022-01-04,Married,Bachelor
4,4,5,45.0,30000.0,New York,Male,4771,2022-01-05,Married,


In [215]:
data = data.drop(columns = ['Unnamed: 0'])

## Cleaning, Encoding and Imputing values in the Gender Column.

### Extracting Data to work on.

In [216]:
g_data = data[['Age', 'Salary', 'Purchase_Amount', 'Gender']].copy()  # explicit copy avoids SettingWithCopyWarning later
g_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Gender
0,40.0,70000.0,619,
1,45.0,30000.0,1945,Female
2,35.0,30000.0,2783,Male
3,45.0,30000.0,3913,Other
4,45.0,30000.0,4771,Male


### Standardizing and Normalizing the data in categorical columns.

In [217]:
g_data['Gender'].unique()

array([nan, 'Female', 'Male', 'Other', 'male'], dtype=object)

In [218]:
g_data.loc[:, 'Gender'] = g_data.loc[:, 'Gender'].replace('male', 'Male')
g_data['Gender'].unique()

array([nan, 'Female', 'Male', 'Other'], dtype=object)

In [219]:
#creating columns for encoding
impute_col = 'Gender'
impute_col_encoded = 'Gender_Encoded'

### Encoding the data into numerical representation for the categories.

In [220]:
le = LabelEncoder()

Fitting the encoder to the data.

In [221]:
le.fit(g_data[impute_col])

Mapping each category to its numerical class and collecting the result in a dictionary. Note that `NaN` is encoded as its own class.

In [222]:
dict(zip(le.classes_, le.transform(le.classes_)))

{'Female': np.int64(0),
 'Male': np.int64(1),
 'Other': np.int64(2),
 nan: np.int64(3)}
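Because the encoder assigns `NaN` a class of its own, that class has to be converted back to `NaN` before imputation. An alternative sketch, assuming the same kind of column, fits the encoder on the observed values only so that no placeholder class is created. The names `gender`, `le_sketch` and `encoded` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the Gender column (first five rows shown above).
gender = pd.Series([np.nan, 'Female', 'Male', 'Other', 'Male'], name='Gender')

observed = gender.notna()

# Fit and transform only the observed categories.
le_sketch = LabelEncoder()
codes = le_sketch.fit_transform(gender[observed])

# Start from an all-NaN float column and fill in the codes where data exists.
encoded = pd.Series(np.nan, index=gender.index, name='Gender_Encoded')
encoded.loc[observed] = codes

print(dict(zip(le_sketch.classes_, le_sketch.transform(le_sketch.classes_))))
print(encoded.tolist())
```

The result is already a float column with `NaN` where `Gender` was missing, so it can be passed to the imputer directly.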

### Applying Label Transformation
Writing the numeric codes for the categories into the `Gender_Encoded` column.

In [223]:
g_data[impute_col_encoded] = le.transform(g_data[impute_col])
#g_data[['Age', 'Salary', 'Purchase_Amount', 'Gender', 'Gender_Encoded']].head()


In [224]:
g_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Gender,Gender_Encoded
0,40.0,70000.0,619,,3
1,45.0,30000.0,1945,Female,0
2,35.0,30000.0,2783,Male,1
3,45.0,30000.0,3913,Other,2
4,45.0,30000.0,4771,Male,1


### Converting the placeholder class `3` back to `NaN` in the `Gender_Encoded` column using `map` with a `lambda` function.

In [225]:
# Cast to float first so the column can hold NaN, then map the placeholder class 3 to NaN.
g_data[impute_col_encoded] = g_data[impute_col_encoded].astype('float').map(lambda x: np.nan if x == 3 else x)
g_data.head()


Unnamed: 0,Age,Salary,Purchase_Amount,Gender,Gender_Encoded
0,40.0,70000.0,619,,
1,45.0,30000.0,1945,Female,0.0
2,35.0,30000.0,2783,Male,1.0
3,45.0,30000.0,3913,Other,2.0
4,45.0,30000.0,4771,Male,1.0


In [226]:
g_data[impute_col_encoded].value_counts()

Gender_Encoded
1.0    33
0.0    21
2.0    20
Name: count, dtype: int64

### Starting the MICE Imputation

In [227]:
g_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Gender,Gender_Encoded
0,40.0,70000.0,619,,
1,45.0,30000.0,1945,Female,0.0
2,35.0,30000.0,2783,Male,1.0
3,45.0,30000.0,3913,Other,2.0
4,45.0,30000.0,4771,Male,1.0


In [228]:
imputer = IterativeImputer(random_state=100, max_iter=10)

Creating a copy of the data we are working on so that the original stays available for comparison.

In [229]:
g_data_train = g_data[['Age', 'Salary', 'Purchase_Amount', 'Gender_Encoded']].copy(deep=True)   #Deep copy of the data
g_data_train.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Gender_Encoded
0,40.0,70000.0,619,
1,45.0,30000.0,1945,0.0
2,35.0,30000.0,2783,1.0
3,45.0,30000.0,3913,2.0
4,45.0,30000.0,4771,1.0


Fitting the imputer to the data and transforming it.

In [230]:
imputer.fit(g_data_train)

In [231]:
g_data_imputed = imputer.transform(g_data_train)
g_data_imputed[: 5].round()

array([[4.000e+01, 7.000e+04, 6.190e+02, 1.000e+00],
       [4.500e+01, 3.000e+04, 1.945e+03, 0.000e+00],
       [3.500e+01, 3.000e+04, 2.783e+03, 1.000e+00],
       [4.500e+01, 3.000e+04, 3.913e+03, 2.000e+00],
       [4.500e+01, 3.000e+04, 4.771e+03, 1.000e+00]])

Getting only the values from our `Gender_Encoded` column.

In [232]:
g_data_imputed[:, 3].round()

array([1., 0., 1., 2., 1., 1., 0., 1., 0., 0., 0., 0., 2., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2., 1., 1., 1., 1., 1.,
       2., 2., 0., 2., 0., 0., 0., 0., 0., 1., 2., 0., 1., 2., 2., 0., 1.,
       0., 1., 1., 2., 1., 1., 1., 1., 1., 1., 2., 1., 2., 1., 1., 1., 1.,
       1., 1., 0., 1., 0., 1., 0., 2., 1., 1., 1., 1., 1., 1., 0., 1., 2.,
       2., 1., 1., 1., 2., 1., 1., 1., 2., 1., 1., 2., 1., 0., 1., 1., 0.,
       1., 2., 1.])

#### Writing the imputed values from `g_data_imputed` back into our `g_data` dataset.

In [233]:
g_data.loc[:, [impute_col_encoded]] = g_data_imputed[:, 3].round().astype('int')
g_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Gender,Gender_Encoded
0,40.0,70000.0,619,,1.0
1,45.0,30000.0,1945,Female,0.0
2,35.0,30000.0,2783,Male,1.0
3,45.0,30000.0,3913,Other,2.0
4,45.0,30000.0,4771,Male,1.0


#### Inverting our encoded categories back into the original form.

In [234]:
g_data['Gender_Encoded'].unique()

array([1., 0., 2.])

In [235]:
gender_imputed = le.inverse_transform(g_data['Gender_Encoded'].astype('int').round())
gender_imputed[:10]

array(['Male', 'Female', 'Male', 'Other', 'Male', 'Male', 'Female',
       'Male', 'Female', 'Female'], dtype=object)

#### Replacing the values in the actual column

In [236]:
gender_imputed[10:]

array(['Female', 'Female', 'Other', 'Male', 'Male', 'Male', 'Male',
       'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male',
       'Male', 'Male', 'Other', 'Other', 'Male', 'Male', 'Male', 'Male',
       'Male', 'Other', 'Other', 'Female', 'Other', 'Female', 'Female',
       'Female', 'Female', 'Female', 'Male', 'Other', 'Female', 'Male',
       'Other', 'Other', 'Female', 'Male', 'Female', 'Male', 'Male',
       'Other', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Other',
       'Male', 'Other', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male',
       'Female', 'Male', 'Female', 'Male', 'Female', 'Other', 'Male',
       'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Other',
       'Other', 'Male', 'Male', 'Male', 'Other', 'Male', 'Male', 'Male',
       'Other', 'Male', 'Male', 'Other', 'Male', 'Female', 'Male', 'Male',
       'Female', 'Male', 'Other', 'Male'], dtype=object)

In [240]:
g_data.loc[:,'Gender_Encoded'] = gender_imputed
g_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Gender,Gender_Encoded
0,40.0,70000.0,619,,Male
1,45.0,30000.0,1945,Female,Female
2,35.0,30000.0,2783,Male,Male
3,45.0,30000.0,3913,Other,Other
4,45.0,30000.0,4771,Male,Male


In [241]:
g_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Age              88 non-null     float64
 1   Salary           90 non-null     float64
 2   Purchase_Amount  105 non-null    int64  
 3   Gender           74 non-null     object 
 4   Gender_Encoded   105 non-null    object 
dtypes: float64(2), int64(1), object(2)
memory usage: 4.2+ KB


#### Null values Imputed in the `Gender` column.

In [242]:
data['Gender'] = g_data['Gender_Encoded']

In [243]:
data['Gender'].unique()

array(['Male', 'Female', 'Other'], dtype=object)

In [244]:
data.head()

Unnamed: 0,Customer_ID,Age,Salary,City,Gender,Purchase_Amount,Signup_Date,Marital_Status,Education
0,1,40.0,70000.0,New York,Male,619,2022-01-01,Widowed,PhD
1,2,45.0,30000.0,Houston,Female,1945,2022-01-02,Single,PhD
2,3,35.0,30000.0,New York,Male,2783,2022-01-03,,Master
3,4,45.0,30000.0,,Other,3913,2022-01-04,Married,Bachelor
4,5,45.0,30000.0,New York,Male,4771,2022-01-05,Married,
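The same encode, mask, impute, round and decode steps are repeated for the remaining categorical columns. As a sketch only, they could be bundled into one function. `impute_categorical` is a hypothetical helper name, and the clipping step is an added safeguard that does not appear in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import LabelEncoder


def impute_categorical(df, target, predictors, random_state=100, max_iter=10):
    """Return df[target] with missing values filled (hypothetical helper)."""
    observed = df[target].notna()
    enc = LabelEncoder().fit(df.loc[observed, target])

    # Numeric work frame: predictors plus the encoded target (NaN where missing).
    work = df[predictors].astype('float').copy()
    work['_code'] = np.nan
    work.loc[observed, '_code'] = enc.transform(df.loc[observed, target])

    imputer = IterativeImputer(random_state=random_state, max_iter=max_iter)
    imputed = imputer.fit_transform(work)

    # Round to the nearest class code and clip into the valid range before decoding.
    codes = np.clip(imputed[:, -1].round(), 0, len(enc.classes_) - 1).astype(int)
    return pd.Series(enc.inverse_transform(codes), index=df.index, name=target)


# Toy frame built from the first five rows shown above.
toy = pd.DataFrame({
    'Age': [40.0, 45.0, 35.0, 45.0, 45.0],
    'Salary': [70000.0, 30000.0, 30000.0, 30000.0, 30000.0],
    'Purchase_Amount': [619, 1945, 2783, 3913, 4771],
    'Gender': [np.nan, 'Female', 'Male', 'Other', 'Male'],
})
filled = impute_categorical(toy, 'Gender', ['Age', 'Salary', 'Purchase_Amount'])
```

Observed categories pass through unchanged; only the missing entries receive a predicted class. With five rows the prediction itself is not meaningful, so the toy frame only demonstrates the mechanics.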


## Cleaning, Encoding and Imputing values in the City Column.

#### Extracting Data to work on.

In [245]:
data.head()

Unnamed: 0,Customer_ID,Age,Salary,City,Gender,Purchase_Amount,Signup_Date,Marital_Status,Education
0,1,40.0,70000.0,New York,Male,619,2022-01-01,Widowed,PhD
1,2,45.0,30000.0,Houston,Female,1945,2022-01-02,Single,PhD
2,3,35.0,30000.0,New York,Male,2783,2022-01-03,,Master
3,4,45.0,30000.0,,Other,3913,2022-01-04,Married,Bachelor
4,5,45.0,30000.0,New York,Male,4771,2022-01-05,Married,


In [246]:
c_data = data[['Age', 'Salary', 'Purchase_Amount', 'City']].copy(deep=True)
c_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,City
0,40.0,70000.0,619,New York
1,45.0,30000.0,1945,Houston
2,35.0,30000.0,2783,New York
3,45.0,30000.0,3913,
4,45.0,30000.0,4771,New York


#### Standardizing and Normalizing the data in the `City` column.

In [247]:
c_data['City'].unique()

array(['New York', 'Houston', nan, 'Los Angeles', 'New-York', 'Phoenix',
       'Chicago'], dtype=object)

In [248]:
c_data['City'] = c_data['City'].replace('New-York', 'New York')

In [249]:
c_data['City'].unique()

array(['New York', 'Houston', nan, 'Los Angeles', 'Phoenix', 'Chicago'],
      dtype=object)
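Spelling variants such as `'New-York'` were found here by reading the output of `unique()`. For columns with many categories, a possible shortcut is to group the raw labels by a normalised key and look for keys with more than one spelling. The sketch below uses a small hypothetical series; the normalisation rules (lower-case, hyphens to spaces) are assumptions that would need adapting to the data.

```python
import pandas as pd

# Hypothetical sample of raw city labels, including one variant and one gap.
city = pd.Series(['New York', 'Houston', None, 'New-York', 'Chicago'])

# Normalised key: lower-case, hyphens replaced by spaces, whitespace trimmed.
key = city.str.lower().str.replace('-', ' ', regex=False).str.strip()

# Raw spellings observed for each key; missing values are dropped by groupby.
variants = city.groupby(key).unique()

# Keys with more than one raw spelling point at inconsistent labels.
suspects = variants[variants.map(len) > 1]
print(suspects)
```

Each flagged key can then be resolved with an explicit `replace`, as done for `'New-York'` above.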

#### Encoding the categories as numerical labels.


In [250]:
encoder = LabelEncoder()

In [251]:
c_data['City_Encoded'] = encoder.fit_transform(c_data['City'])

In [252]:
c_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,City,City_Encoded
0,40.0,70000.0,619,New York,3
1,45.0,30000.0,1945,Houston,1
2,35.0,30000.0,2783,New York,3
3,45.0,30000.0,3913,,5
4,45.0,30000.0,4771,New York,3


Getting the encoded classes/categories.

In [253]:
dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))

{'Chicago': np.int64(0),
 'Houston': np.int64(1),
 'Los Angeles': np.int64(2),
 'New York': np.int64(3),
 'Phoenix': np.int64(4),
 nan: np.int64(5)}

#### Converting the `5` class back to `NaN` values.

In [254]:
# Cast to float first so the column can hold NaN, then map the placeholder class 5 to NaN.
c_data['City_Encoded'] = c_data['City_Encoded'].astype('float').map(lambda x: np.nan if x == 5 else x)


In [255]:
c_data['City_Encoded'].value_counts()

City_Encoded
3.0    28
1.0    23
0.0    19
2.0    13
4.0     7
Name: count, dtype: int64

### Starting the MICE Imputation.

Instantiating the `IterativeImputer` model.

In [256]:
imputer = IterativeImputer(random_state=100, max_iter=10)

Getting the data to use for imputation.

In [257]:
c_data_train = c_data[['Age', 'Salary', 'Purchase_Amount', 'City_Encoded']].copy(deep=True)

Fitting the imputer to the data and transforming it.

In [259]:
c_data_imputed = imputer.fit_transform(c_data_train)

In [260]:
c_data_imputed[:4].round()

array([[4.000e+01, 7.000e+04, 6.190e+02, 3.000e+00],
       [4.500e+01, 3.000e+04, 1.945e+03, 1.000e+00],
       [3.500e+01, 3.000e+04, 2.783e+03, 3.000e+00],
       [4.500e+01, 3.000e+04, 3.913e+03, 2.000e+00]])

Imputed `City_Encoded` column.

In [261]:
c_data_imputed[:, 3].round()

array([3., 1., 3., 2., 3., 2., 1., 1., 2., 2., 3., 3., 4., 3., 3., 0., 3.,
       2., 2., 1., 2., 4., 3., 3., 0., 2., 2., 4., 1., 2., 2., 1., 0., 0.,
       3., 2., 4., 1., 2., 2., 2., 0., 3., 3., 1., 0., 2., 4., 0., 1., 1.,
       0., 1., 0., 2., 0., 0., 1., 1., 3., 3., 2., 3., 0., 1., 3., 3., 2.,
       2., 2., 0., 1., 2., 3., 1., 1., 3., 2., 3., 2., 1., 4., 4., 0., 3.,
       3., 0., 0., 0., 1., 2., 3., 1., 0., 3., 1., 2., 1., 2., 0., 3., 1.,
       3., 2., 3.])

#### Writing the imputed values from `c_data_imputed` back into our `c_data` dataset.

In [262]:
c_data.loc[:, 'City_Encoded'] = c_data_imputed[:, 3].round().astype('int')


In [263]:
c_data['City_Encoded'].unique()

array([3., 1., 2., 4., 0.])

#### Inverting our encoded categories back into the original form.

In [264]:
c_data_imputed = encoder.inverse_transform(c_data['City_Encoded'].astype('int').round())

In [265]:
c_data_imputed[:10] 

array(['New York', 'Houston', 'New York', 'Los Angeles', 'New York',
       'Los Angeles', 'Houston', 'Houston', 'Los Angeles', 'Los Angeles'],
      dtype=object)

In [266]:
c_data['City_Encoded'] = c_data_imputed

In [267]:
c_data.head(10)

Unnamed: 0,Age,Salary,Purchase_Amount,City,City_Encoded
0,40.0,70000.0,619,New York,New York
1,45.0,30000.0,1945,Houston,Houston
2,35.0,30000.0,2783,New York,New York
3,45.0,30000.0,3913,,Los Angeles
4,45.0,30000.0,4771,New York,New York
5,30.0,1000000.0,4064,Los Angeles,Los Angeles
6,35.0,50000.0,2951,Houston,Houston
7,35.0,50000.0,1736,Houston,Houston
8,35.0,30000.0,3059,,Los Angeles
9,45.0,50000.0,204,Los Angeles,Los Angeles


#### Null values Imputed in the `City` column.

In [268]:
data['City'] = c_data['City_Encoded']

In [269]:
data.head(10)

Unnamed: 0,Customer_ID,Age,Salary,City,Gender,Purchase_Amount,Signup_Date,Marital_Status,Education
0,1,40.0,70000.0,New York,Male,619,2022-01-01,Widowed,PhD
1,2,45.0,30000.0,Houston,Female,1945,2022-01-02,Single,PhD
2,3,35.0,30000.0,New York,Male,2783,2022-01-03,,Master
3,4,45.0,30000.0,Los Angeles,Other,3913,2022-01-04,Married,Bachelor
4,5,45.0,30000.0,New York,Male,4771,2022-01-05,Married,
5,6,30.0,1000000.0,Los Angeles,Male,4064,2022-01-06,,
6,7,35.0,50000.0,Houston,Female,2951,2022-01-07,Widowed,Master
7,8,35.0,50000.0,Houston,Male,1736,2022-01-08,Married,PhD
8,9,35.0,30000.0,Los Angeles,Female,3059,2022-01-09,Widowed,High School
9,10,45.0,50000.0,Los Angeles,Female,204,2022-01-10,Married,PhD


## Cleaning, Encoding and Imputing values in the `Marital_Status` Column.

#### Extracting Data to work on.

In [270]:
data.head()

Unnamed: 0,Customer_ID,Age,Salary,City,Gender,Purchase_Amount,Signup_Date,Marital_Status,Education
0,1,40.0,70000.0,New York,Male,619,2022-01-01,Widowed,PhD
1,2,45.0,30000.0,Houston,Female,1945,2022-01-02,Single,PhD
2,3,35.0,30000.0,New York,Male,2783,2022-01-03,,Master
3,4,45.0,30000.0,Los Angeles,Other,3913,2022-01-04,Married,Bachelor
4,5,45.0,30000.0,New York,Male,4771,2022-01-05,Married,


In [271]:
ms_data = data[['Age', 'Salary', 'Purchase_Amount', 'Marital_Status']].copy(deep=True)
ms_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Marital_Status
0,40.0,70000.0,619,Widowed
1,45.0,30000.0,1945,Single
2,35.0,30000.0,2783,
3,45.0,30000.0,3913,Married
4,45.0,30000.0,4771,Married


#### Standardizing and Normalizing the data in the `Marital_Status` column.

In [272]:
ms_data['Marital_Status'].unique()

array(['Widowed', 'Single', nan, 'Married', 'Divorced'], dtype=object)

#### Encoding the categories as numerical labels.

In [273]:
ms_data['MS_Encoded'] = encoder.fit_transform(ms_data['Marital_Status'])
ms_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Marital_Status,MS_Encoded
0,40.0,70000.0,619,Widowed,3
1,45.0,30000.0,1945,Single,2
2,35.0,30000.0,2783,,4
3,45.0,30000.0,3913,Married,1
4,45.0,30000.0,4771,Married,1


In [274]:
ms_data['MS_Encoded'].unique()

array([3, 2, 4, 1, 0])

In [275]:
dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))

{'Divorced': np.int64(0),
 'Married': np.int64(1),
 'Single': np.int64(2),
 'Widowed': np.int64(3),
 nan: np.int64(4)}

#### Converting the class back to the `NaN` values.

In [276]:
# Cast to float first so the column can hold NaN, then map the placeholder class 4 to NaN.
ms_data['MS_Encoded'] = ms_data['MS_Encoded'].astype('float').map(lambda x: np.nan if x == 4 else x)
ms_data['MS_Encoded'].value_counts()


MS_Encoded
2.0    24
1.0    22
3.0    21
0.0    17
Name: count, dtype: int64

### Starting the MICE Imputation.

Getting the data to work on for imputation.

In [277]:
ms_data_train = ms_data[['Age', 'Salary', 'Purchase_Amount', 'MS_Encoded']].copy(deep=True)

Fitting the `IterativeImputer` to the data and transforming it.

In [278]:
ms_data_imputed = imputer.fit_transform(ms_data_train)
ms_data_imputed[:5].round()

array([[4.000e+01, 7.000e+04, 6.190e+02, 3.000e+00],
       [4.500e+01, 3.000e+04, 1.945e+03, 2.000e+00],
       [3.500e+01, 3.000e+04, 2.783e+03, 2.000e+00],
       [4.500e+01, 3.000e+04, 3.913e+03, 1.000e+00],
       [4.500e+01, 3.000e+04, 4.771e+03, 1.000e+00]])

Getting the imputed values

In [279]:
ms_data_imputed[:, 3].round()

array([3., 2., 2., 1., 1., 1., 3., 1., 3., 1., 1., 0., 1., 2., 2., 1., 3.,
       1., 2., 3., 0., 3., 3., 1., 0., 3., 2., 2., 2., 0., 0., 1., 3., 0.,
       2., 2., 1., 0., 3., 2., 2., 3., 1., 1., 0., 1., 0., 2., 1., 2., 1.,
       1., 1., 1., 0., 2., 3., 1., 2., 1., 2., 0., 1., 3., 2., 2., 2., 2.,
       3., 1., 2., 2., 0., 2., 0., 3., 1., 2., 2., 3., 2., 2., 2., 0., 2.,
       2., 2., 3., 2., 2., 0., 0., 2., 3., 3., 1., 2., 0., 3., 2., 3., 2.,
       2., 1., 1.])

In [280]:
ms_data.loc[:, 'MS_Encoded'] = ms_data_imputed[:, 3].round().astype('int')
ms_data['MS_Encoded'].unique()

array([3., 2., 1., 0.])

#### Inverting our encoded categories back into the original categories.

In [281]:
ms_data_imputed = encoder.inverse_transform(ms_data['MS_Encoded'].astype('int').round())
ms_data_imputed[:10]

array(['Widowed', 'Single', 'Single', 'Married', 'Married', 'Married',
       'Widowed', 'Married', 'Widowed', 'Married'], dtype=object)

#### Writing the imputed values from `ms_data_imputed` back into our `ms_data` dataset.

In [282]:
ms_data['MS_Encoded'] = ms_data_imputed
ms_data.head(5)

Unnamed: 0,Age,Salary,Purchase_Amount,Marital_Status,MS_Encoded
0,40.0,70000.0,619,Widowed,Widowed
1,45.0,30000.0,1945,Single,Single
2,35.0,30000.0,2783,,Single
3,45.0,30000.0,3913,Married,Married
4,45.0,30000.0,4771,Married,Married


#### Null values imputed in the `Marital_Status` column.

In [283]:
data['Marital_Status'] = ms_data['MS_Encoded']
data.head(5)

Unnamed: 0,Customer_ID,Age,Salary,City,Gender,Purchase_Amount,Signup_Date,Marital_Status,Education
0,1,40.0,70000.0,New York,Male,619,2022-01-01,Widowed,PhD
1,2,45.0,30000.0,Houston,Female,1945,2022-01-02,Single,PhD
2,3,35.0,30000.0,New York,Male,2783,2022-01-03,Single,Master
3,4,45.0,30000.0,Los Angeles,Other,3913,2022-01-04,Married,Bachelor
4,5,45.0,30000.0,New York,Male,4771,2022-01-05,Married,


## Cleaning, Encoding and Imputing values in the `Education` Column.

In [302]:
ed_data = data[['Age', 'Salary', 'Purchase_Amount', 'Education']].copy(deep=True)
ed_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Education
0,40.0,70000.0,619,PhD
1,45.0,30000.0,1945,PhD
2,35.0,30000.0,2783,Master
3,45.0,30000.0,3913,Bachelor
4,45.0,30000.0,4771,


#### Standardizing and Normalizing the data in the `Education` column.

In [310]:
data['Education'] = data['Education'].replace('bachelor', 'Bachelor')

In [311]:
ed_data = data[['Age', 'Salary', 'Purchase_Amount', 'Education']].copy(deep=True)
ed_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Education
0,40.0,70000.0,619,PhD
1,45.0,30000.0,1945,PhD
2,35.0,30000.0,2783,Master
3,45.0,30000.0,3913,Bachelor
4,45.0,30000.0,4771,


In [312]:
ed_data['Education'].unique()

array(['PhD', 'Master', 'Bachelor', nan, 'High School'], dtype=object)

In [313]:
ed_data['Ed_Encoded'] = encoder.fit_transform(ed_data['Education'])
ed_data.head()

Unnamed: 0,Age,Salary,Purchase_Amount,Education,Ed_Encoded
0,40.0,70000.0,619,PhD,3
1,45.0,30000.0,1945,PhD,3
2,35.0,30000.0,2783,Master,2
3,45.0,30000.0,3913,Bachelor,0
4,45.0,30000.0,4771,,4


In [314]:
dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))

{'Bachelor': np.int64(0),
 'High School': np.int64(1),
 'Master': np.int64(2),
 'PhD': np.int64(3),
 nan: np.int64(4)}

In [315]:
ed_data['Ed_Encoded'].value_counts()

Ed_Encoded
4    29
3    22
0    19
1    18
2    17
Name: count, dtype: int64

In [316]:
ed_data['Ed_Encoded'] = ed_data['Ed_Encoded'].map(lambda x: np.nan if x == 4 else x)
ed_data.head(10)

Unnamed: 0,Age,Salary,Purchase_Amount,Education,Ed_Encoded
0,40.0,70000.0,619,PhD,3.0
1,45.0,30000.0,1945,PhD,3.0
2,35.0,30000.0,2783,Master,2.0
3,45.0,30000.0,3913,Bachelor,0.0
4,45.0,30000.0,4771,,
5,30.0,1000000.0,4064,,
6,35.0,50000.0,2951,Master,2.0
7,35.0,50000.0,1736,PhD,3.0
8,35.0,30000.0,3059,High School,1.0
9,45.0,50000.0,204,PhD,3.0


In [317]:
ed_data_train = ed_data[['Age', 'Salary', 'Purchase_Amount', 'Ed_Encoded']].copy(deep=True)

In [318]:
ed_data_imputed = imputer.fit_transform(ed_data_train)

In [319]:
ed_data_imputed[:10]

array([[4.00000000e+01, 7.00000000e+04, 6.19000000e+02, 3.00000000e+00],
       [4.50000000e+01, 3.00000000e+04, 1.94500000e+03, 3.00000000e+00],
       [3.50000000e+01, 3.00000000e+04, 2.78300000e+03, 2.00000000e+00],
       [4.50000000e+01, 3.00000000e+04, 3.91300000e+03, 0.00000000e+00],
       [4.50000000e+01, 3.00000000e+04, 4.77100000e+03, 1.09381848e+00],
       [3.00000000e+01, 1.00000000e+06, 4.06400000e+03, 1.55275977e+01],
       [3.50000000e+01, 5.00000000e+04, 2.95100000e+03, 2.00000000e+00],
       [3.50000000e+01, 5.00000000e+04, 1.73600000e+03, 3.00000000e+00],
       [3.50000000e+01, 3.00000000e+04, 3.05900000e+03, 1.00000000e+00],
       [4.50000000e+01, 5.00000000e+04, 2.04000000e+02, 3.00000000e+00]])
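In the output above, the row with a `Salary` of 1,000,000 receives an imputed code of about 15.5, far outside the valid range of 0 to 3. The default estimator behind `IterativeImputer` is a linear model (`BayesianRidge`), so an extreme predictor value can push the prediction well beyond the observed codes. One possible guard, sketched here on hypothetical toy data, is to pass `min_value` and `max_value` so that the imputer clips its predictions for each column.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: column 0 is a numeric feature with one extreme value,
# column 1 holds category codes 0..3 and is missing on the extreme row.
X = np.array([
    [1.0, 0.0],
    [2.0, 1.0],
    [3.0, 2.0],
    [4.0, 3.0],
    [100.0, np.nan],
])

# No bounds: the linear estimator is free to extrapolate.
unbounded = IterativeImputer(random_state=100, max_iter=10).fit_transform(X)

# Per-column bounds: leave column 0 unbounded, keep column 1 inside 0..3.
bounded = IterativeImputer(
    random_state=100,
    max_iter=10,
    min_value=[-np.inf, 0],
    max_value=[np.inf, 3],
).fit_transform(X)

print(unbounded[-1, 1], bounded[-1, 1])
```

Bounding keeps the codes decodable, but the imputed category for such a row is still driven by the outlier, so reviewing or transforming the extreme `Salary` value may be the better fix.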

In [320]:
ed_data_imputed[:, 3].round().astype('int')

array([ 3,  3,  2,  0,  1, 16,  2,  3,  1,  3,  2,  1,  3,  2,  1,  1,  1,
        0,  0,  0,  2,  2,  2,  2,  2,  0,  3,  1,  0,  0,  0,  1,  2,  2,
        0,  1,  0,  2,  0,  0,  2,  1,  2,  2,  1,  1,  3,  1,  1,  2,  3,
        3,  3,  2,  2,  3,  2,  0,  0,  2,  2,  2,  2,  0,  3,  0,  3,  3,
        2,  1,  1,  2,  2,  3,  1,  3,  1,  1,  1,  1,  0,  3,  2,  2,  1,
        2,  2,  2,  3,  2,  3,  2,  2,  3,  1,  0,  1,  1,  1,  2,  3,  3,
        2,  0,  1])

In [321]:
ed_data.loc[:, 'Ed_Encoded'] = ed_data_imputed[:, 3].round().astype('int')


#### Inverting our encoded categories back into the original form.

In [322]:
ed_data['Ed_Encoded'] = encoder.inverse_transform(ed_data['Ed_Encoded'].astype('int').round())

ValueError: y contains previously unseen labels: [16]
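The `ValueError` above arises because the rounded code 16 has no matching class in the encoder, which only knows the codes 0 to 3. A minimal sketch of a guard, assuming it is acceptable to map out-of-range codes to the nearest valid class, clips the codes before decoding. The names `enc`, `raw_codes` and `valid_codes` are illustrative; the numbers are copied from the imputed output shown earlier.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Encoder with the four observed Education classes.
enc = LabelEncoder().fit(['Bachelor', 'High School', 'Master', 'PhD'])

# Imputed codes taken from the array above; the last one is out of range.
raw_codes = np.array([3.0, 0.0, 1.09381848, 15.5275977])

# Round, then clip into the range of codes the encoder actually knows.
valid_codes = np.clip(raw_codes.round(), 0, len(enc.classes_) - 1).astype(int)

labels = enc.inverse_transform(valid_codes)
print(labels)  # ['PhD' 'Bachelor' 'High School' 'PhD']
```

Clipping turns the code 16 into `'PhD'` only because `'PhD'` happens to be the last class alphabetically, so the affected row deserves a manual check rather than blind trust in the decoded label.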

In [300]:
ed_data['Ed_Encoded'].unique()

array(['PhD', 'Master', 'Bachelor', nan, 'High School'], dtype=object)