# Data imputation

## Importing the needed libraries

In [51]:
# General libraries
import pandas as pd
import numpy as np

# KNN Imputer
from sklearn.impute import KNNImputer


Types of Imputation:
- Mean/Mode/Median Imputation
- KNN Imputation
- Random Forest Imputation (advanced method, not discussed in this notebook)

Uses of Imputation:
- Handling missing values in datasets
- Handling outliers (to be discussed in the outliers section)
- Data generation for handling imbalanced data (to be discussed in the imbalancing section)

## Mean/Mode/Median Imputation

In [80]:
# Create a sample DataFrame with missing values
data = {
    'Age': [25, np.nan, 22, 30, np.nan, 40],
    'Salary': [50000, 54000, np.nan, 62000, 58000, np.nan],
    'Gender': ['Male', 'Female', 'Female', np.nan, 'Male', 'Female']
}

sample_df = pd.DataFrame(data)

sample_df

    Age   Salary  Gender
0  25.0  50000.0    Male
1   NaN  54000.0  Female
2  22.0      NaN  Female
3  30.0  62000.0     NaN
4   NaN  58000.0    Male
5  40.0      NaN  Female


In [81]:
# Mean Imputation for 'Age' and 'Salary'

# Make a copy of the main dataframe in order not to change values in the dataframe itself
cpy_df = sample_df.copy()

# Assign the result back to the column: calling fillna(..., inplace=True) on a
# selected column is chained assignment, which raises a FutureWarning and will
# stop working in pandas 3.0
cpy_df['Age'] = cpy_df['Age'].fillna(cpy_df['Age'].mean())
cpy_df['Salary'] = cpy_df['Salary'].fillna(cpy_df['Salary'].mean())

cpy_df


     Age   Salary  Gender
0   25.0  50000.0    Male
1  29.25  54000.0  Female
2   22.0  56000.0  Female
3   30.0  62000.0     NaN
4  29.25  58000.0    Male
5   40.0  56000.0  Female


In [82]:
# Median Imputation for 'Age' and 'Salary'

# Make a copy of the main dataframe in order not to change values in the dataframe itself
cpy_df = sample_df.copy()

# Perform imputation (assign back instead of using inplace=True on a column,
# which raises a FutureWarning and will stop working in pandas 3.0)
cpy_df['Age'] = cpy_df['Age'].fillna(cpy_df['Age'].median())
cpy_df['Salary'] = cpy_df['Salary'].fillna(cpy_df['Salary'].median())

cpy_df


    Age   Salary  Gender
0  25.0  50000.0    Male
1  27.5  54000.0  Female
2  22.0  56000.0  Female
3  30.0  62000.0     NaN
4  27.5  58000.0    Male
5  40.0  56000.0  Female


In [83]:
# Mode Imputation for 'Gender'

# Make a copy of the main dataframe in order not to change values in the dataframe itself
cpy_df = sample_df.copy()

# mode() returns a Series (there can be several modes), so take the first one.
# Assign back instead of using inplace=True on a column, which raises a
# FutureWarning and will stop working in pandas 3.0
cpy_df['Gender'] = cpy_df['Gender'].fillna(cpy_df['Gender'].mode()[0])

cpy_df


    Age   Salary  Gender
0  25.0  50000.0    Male
1   NaN  54000.0  Female
2  22.0      NaN  Female
3  30.0  62000.0  Female
4   NaN  58000.0    Male
5  40.0      NaN  Female
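
The three cells above can also be combined into a single call by passing a dictionary to `DataFrame.fillna`, which is the form the pandas warning message recommends. The following is a minimal sketch, not part of the original notebook; the names `fill_values` and `filled_df` and the choice of statistic per column are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Same sample data as in the notebook
sample_df = pd.DataFrame({
    'Age': [25, np.nan, 22, 30, np.nan, 40],
    'Salary': [50000, 54000, np.nan, 62000, 58000, np.nan],
    'Gender': ['Male', 'Female', 'Female', np.nan, 'Male', 'Female'],
})

# One fill value per column (the statistic chosen per column is illustrative)
fill_values = {
    'Age': sample_df['Age'].median(),        # robust to extreme values
    'Salary': sample_df['Salary'].mean(),
    'Gender': sample_df['Gender'].mode()[0], # most frequent category
}

# fillna with a dict returns a new DataFrame and leaves sample_df untouched
filled_df = sample_df.fillna(fill_values)
print(filled_df)
```

Because `fillna` returns a new DataFrame here, no explicit `.copy()` is needed to protect the original data.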


## KNN Imputation

In [79]:
print("Original DataFrame:")
print(sample_df)

knn_cpy = sample_df.copy()

# Convert the categorical variable into numerical format for KNN (categorical encoding will be discussed later)
knn_cpy['Gender'] = knn_cpy['Gender'].map({'Male': 0, 'Female': 1})

# Initialize KNNImputer with the desired number of neighbors
# (an odd number avoids a 0.5 tie when the imputed binary 'Gender' value is rounded below)
imputer = KNNImputer(n_neighbors=3)

# Perform imputation
df_imputed = pd.DataFrame(imputer.fit_transform(knn_cpy), columns=knn_cpy.columns)

# Convert 'Gender' back to original categorical values
df_imputed['Gender'] = df_imputed['Gender'].round().map({0: 'Male', 1: 'Female'})

print("\nDataFrame after KNN Imputation:")
print(df_imputed)

Original DataFrame:
    Age   Salary  Gender
0  25.0  50000.0    Male
1   NaN  54000.0  Female
2  22.0      NaN  Female
3  30.0  62000.0     NaN
4   NaN  58000.0    Male
5  40.0      NaN  Female

DataFrame after KNN Imputation:
         Age   Salary  Gender
0  25.000000  50000.0    Male
1  29.000000  54000.0  Female
2  22.000000  54000.0  Female
3  30.000000  62000.0  Female
4  30.666667  58000.0    Male
5  40.000000  58000.0  Female


#### **EXPLANATION:**

#### **KNNImputer(n_neighbors)**

It is a class from the **sklearn.impute** module used to perform imputation of missing values using the k-nearest neighbors algorithm. 

**Parameters:**

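n_neighbors: (Type: Integer.) This parameter specifies the number of neighboring samples to use for imputation. For each missing value, the algorithm finds the k nearest samples that have a value for that feature and, by default, fills the gap with the mean of those values.

Why an odd number? For numeric columns such as Age and Salary the imputed value is an average, so no voting takes place and odd or even makes no difference. An odd n_neighbors matters only for an encoded binary column such as Gender that is rounded back to a category: with an odd count, the mean of 0s and 1s can never be exactly 0.5, which avoids a tie, in the same way that an odd k avoids tied votes in KNN classification.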
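
To make the neighbor averaging concrete, the sketch below fills one missing value by hand and compares it with `KNNImputer`. It assumes the default settings (uniform weights and the NaN-aware Euclidean distance); the small matrix `X` and the helper names `donors`, `nearest`, `manual_value` and `library_value` are illustrative and do not come from the notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Small illustrative matrix with missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Pairwise distances that skip missing coordinates
distances = nan_euclidean_distances(X)

# Impute the missing entry at row 0, column 2 by hand
row, col, k = 0, 2, 2
# Candidate neighbors: other rows that have a value in the target column
donors = [i for i in range(len(X)) if i != row and not np.isnan(X[i, col])]
# Keep the k closest candidates and average their values
nearest = sorted(donors, key=lambda i: distances[row, i])[:k]
manual_value = X[nearest, col].mean()

# The same entry as computed by the library
library_value = KNNImputer(n_neighbors=k).fit_transform(X)[row, col]

print(manual_value, library_value)
```

Here rows 1 and 2 are the two nearest donors, so both approaches fill the gap with the mean of 3.0 and 5.0.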

#### **fit_transform(X)**

It is a method of the KNNImputer class that performs two actions:
- Fit: It stores the samples in X as the pool of candidate neighbors. No distances are computed at this stage.
- Transform: It computes the distances between each sample that has missing values and the stored samples, then replaces each missing value with the mean of that feature over the nearest neighbors.

**Parameters:** 

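X: The input data to fit the model and perform the transformation. This should be a numerical matrix (DataFrame or ndarray) where missing values are to be imputed.

It returns a NumPy array with the missing values imputed based on the nearest neighbors. Column names are lost by default, which is why the code above wraps the result in pd.DataFrame with columns=knn_cpy.columns.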
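
The two steps can also be called separately, which is useful when the imputer should learn from one dataset and fill gaps in another. The following is a minimal sketch under that assumption; the arrays `train` and `new` are made-up illustrative data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative training data without gaps: column 1 is 10 times column 0
train = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
])

# A new sample with a missing second feature
new = np.array([[2.2, np.nan]])

imputer = KNNImputer(n_neighbors=2)
imputer.fit(train)               # store the training samples
result = imputer.transform(new)  # look up neighbors and fill the gap

# The two closest training rows on the first feature are [2.0, 20.0] and
# [3.0, 30.0], so the gap is filled with the mean of 20.0 and 30.0
print(type(result), result)
```

Calling `fit_transform(X)` is equivalent to calling `fit(X)` followed by `transform(X)` on the same data.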