# Missing Value Imputation & Data Preprocessing Pipeline

This notebook demonstrates **strategies for handling missing values** and **feature preprocessing** with scikit-learn.
It covers **KNN Imputation**, **Iterative Imputation**, **Categorical Encoding**, and **Pipeline construction** for real-world datasets.

## Objective

- Understand missing value handling techniques
- Apply KNN and Iterative Imputation
- Encode categorical variables properly
- Build a clean preprocessing pipeline
- Prepare data for machine learning models

In [None]:
from sklearn.impute import SimpleImputer, KNNImputer
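# IterativeImputer is still experimental: this enabling import must run before it can be imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401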
from sklearn.impute import IterativeImputer

In [None]:
import pandas as pd

### Creating a Sample Dataset

We start with a **small synthetic dataset** containing:
- Numerical features: `height`, `weight`
- Categorical feature: `gender`

This dataset is used to **demonstrate imputation techniques clearly**.


In [None]:
cls_data = pd.DataFrame({'height': [180, 165, 170, 185, 160, 175],
        'weight': [80, 55, 65, 90, 50, 75],
        'gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male']})
cls_data

In [None]:
import numpy as np

### Introducing Missing Values
- Missing values are intentionally introduced into the dataset to simulate
real-world data issues and observe how different imputers behave.
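
Before imputing, it helps to confirm where the gaps are. The following is a minimal sketch on a hypothetical three-row frame (`demo` and its values are illustrative, not the notebook's data) that blanks out one entry and counts missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; floats are used so that NaN fits the column dtype
demo = pd.DataFrame({'height': [180.0, 165.0, 170.0],
                     'weight': [80.0, 55.0, 65.0]})

# Blank out the weight in the third row
demo.iloc[2, 1] = np.nan

# Count missing entries per column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```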


In [None]:
cls_data.iloc[2, 1] = np.nan

In [None]:
cls_data

In [None]:
knn_imputer = KNNImputer().set_output(transform = 'pandas')
knn_imputer

In [None]:
knn_imputer.fit_transform(cls_data[['height', 'weight']])

In [None]:
cls_data = pd.DataFrame({'height': [180, 165, 170, 185, 160, 175],
        'weight': [80, 55, 65, 90, 50, 75],
        'gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male']})
cls_data

In [None]:
cls_data.iloc[3, 0]=np.nan

In [None]:
cls_data

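### KNN Imputation (Numerical Features)

**KNN Imputer** replaces each missing value with the mean of that feature across the
`n_neighbors` nearest rows (5 by default), where nearness is a NaN-aware Euclidean distance over the features both rows have observed.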

This method works best when:
- Data is numeric
- Features are on similar scales
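
To make the averaging concrete, here is a small sketch that reuses the toy heights and weights with one weight removed. The choice of `n_neighbors=2` is an assumption made for illustration; the notebook itself keeps the default of 5.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: height, weight. The weight in the third row is missing.
X_demo = np.array([[180.0, 80.0],
                   [165.0, 55.0],
                   [170.0, np.nan],
                   [185.0, 90.0],
                   [160.0, 50.0],
                   [175.0, 75.0]])

# With two neighbours, the rows closest in height (165 and 175) donate their weights
imputed_k2 = KNNImputer(n_neighbors=2).fit_transform(X_demo)

# With the default five neighbours, all other rows contribute to the average
imputed_k5 = KNNImputer().fit_transform(X_demo)

print(imputed_k2[2, 1], imputed_k5[2, 1])
```

With two neighbours the imputed weight is the mean of 55 and 75; with five it is the mean of every other row's weight, so a small dataset with a large `n_neighbors` behaves much like plain mean imputation.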


In [None]:
knn_imputer.fit_transform(cls_data[['height','weight']])

### Encoding Categorical Variables

Most scikit-learn estimators, including the imputers used here, require numerical inputs.
The categorical variable `gender` is encoded into numerical values:

- Male → 0
- Female → 1
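
One thing to watch with a dictionary mapping is that any value missing from the dictionary silently becomes `NaN`. A minimal sketch, using a hypothetical series that contains an unmapped label (`'Other'`) and an already-missing entry:

```python
import pandas as pd

# Hypothetical values; 'Other' is not in the mapping and None is already missing
gender = pd.Series(['Male', 'Female', 'Other', None])

# Same mapping as in the text: Male -> 0, Female -> 1
encoded = gender.map({'Male': 0, 'Female': 1})

# Both the unmapped label and the missing entry end up as NaN
print(encoded)
print(encoded.isna().sum())
```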


In [None]:
cls_data['gender'] = cls_data['gender'].map({'Male': 0,'Female':1})
cls_data

In [None]:
cls_data.iloc[2, 2] = np.nan

In [None]:
cls_data

In [None]:
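# Note: KNNImputer averages neighbours, so the imputed `gender` may be a fraction rather than 0 or 1
knn_imputer.fit_transform(cls_data)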

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

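### Iterative Imputation

**Iterative Imputer** models each feature with missing values as a function
of the other features and predicts the missing entries in round-robin fashion,
repeating for up to `max_iter` rounds or until the imputed values stabilise.

Advantages:
- Can capture relationships between features that a per-column statistic ignores
- Accepts a custom `estimator`, so the model can be matched to the data

All inputs must be numeric, so categorical features have to be encoded first.
The default estimator is a regressor (`BayesianRidge`); passing a classifier, as done below, restricts every imputed value to one already observed in that column.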
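
As a minimal sketch of the idea, the example below uses a hypothetical two-column array in which the second column is exactly twice the first. `LinearRegression` is chosen here only because it makes the result easy to check by hand; it is an assumption for illustration, not the estimator used in this notebook.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Second column equals 2 * first column; one entry is missing
X_linear = np.array([[1.0, 2.0],
                     [2.0, 4.0],
                     [3.0, 6.0],
                     [4.0, np.nan],
                     [5.0, 10.0]])

# Each feature with gaps is regressed on the other features using the observed rows
linear_imputer = IterativeImputer(estimator=LinearRegression(),
                                  max_iter=10,
                                  random_state=0)
X_linear_imputed = linear_imputer.fit_transform(X_linear)

# The missing entry is predicted from the fitted relationship
print(X_linear_imputed[3, 1])
```

Because the observed rows follow the relationship exactly, the regression recovers it and the gap is filled with a value of about 8.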


In [None]:
iter_imputer = IterativeImputer(estimator= DecisionTreeClassifier(),
                                max_iter=100,
                                initial_strategy='most_frequent').set_output(transform = 'pandas')
iter_imputer

In [None]:
iter_imputer.fit_transform(cls_data)

### Loading the Real Dataset

A real-world dataset is loaded to apply preprocessing techniques
on practical data and prepare it for machine learning.

In [None]:
data = pd.read_csv(r"C:\Users\sande\Downloads\credit_approval_uci.csv")

In [None]:
data.head()

In [None]:
data.shape

### Feature–Target Separation

- `X` → Input features
- `y` → Target variable

This separation is required before model training.


In [None]:
# Segregate data into X, y
X = data.drop('target', axis = 1)
y = data['target']

### Train–Test Split

The dataset is split into training and testing sets using **stratified sampling**
to preserve class distribution in the target variable.
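
The effect of `stratify` can be checked on a hypothetical imbalanced target. In this sketch, `X_demo` and `y_demo` are made-up arrays with an 80/20 class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 80 samples of class 0 and 20 samples of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep the 20% share of class 1
print(y_tr.mean(), y_te.mean())
```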


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

### Identifying Categorical Features

Categorical columns are identified by their `object` dtype and then
encoded with `OrdinalEncoder`.
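
A minimal sketch of dtype-based column selection on a hypothetical frame (`demo`, `label`, `score`, and `count` are illustrative names). Note that the second positional parameter of `select_dtypes` is `exclude`, so several dtypes to include should be passed as one list or via a generic selector such as `'number'`:

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns
demo = pd.DataFrame({
    'label': pd.Series(['a', 'b', 'a'], dtype=object),
    'score': [1.5, 2.5, 3.5],
    'count': [1, 2, 3],
})

# Text columns stored with the object dtype
cat_cols = demo.select_dtypes(include='object').columns.tolist()

# All numeric columns, regardless of integer or float width
num_cols = demo.select_dtypes(include='number').columns.tolist()

print(cat_cols, num_cols)
```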


In [None]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

In [None]:
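# Despite the name `le`, this is an OrdinalEncoder (for feature columns), not a LabelEncoder (for targets)
le = OrdinalEncoder()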
le

In [None]:
X_train.info()

In [None]:
X_train.select_dtypes('object')

In [None]:
cat_col = X_train.select_dtypes('object').columns
cat_col

In [None]:
from sklearn.compose import ColumnTransformer

### ColumnTransformer for Encoding

`ColumnTransformer` allows applying different transformations
to different columns while keeping the dataset structure intact.


In [None]:
ct = ColumnTransformer([('Cat_enc', le, cat_col)],
                       remainder = 'passthrough',
                       verbose_feature_names_out= False).set_output(transform='pandas')
ct

In [None]:
ct.fit_transform(X_train)

In [None]:
X_train.select_dtypes('float64')

In [None]:
# Pass both dtypes as one list: a second positional argument would be read as `exclude`
num_col = X_train.select_dtypes(include=['float64', 'int64']).columns
num_col

In [None]:
cat_col

In [None]:
X_train = ct.fit_transform(X_train)
X_train.head()

### Advanced Imputation Strategy

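- **Categorical features** → Iterative Imputer with `RandomForestClassifier`
- **Numerical features** → Iterative Imputer with `RandomForestRegressor`

Matching the estimator to the feature type keeps imputed category codes valid, since a classifier only predicts codes it has seen, while numerical features receive continuous predictions.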


In [None]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

In [None]:
iter_imputer_cat = IterativeImputer(estimator = RandomForestClassifier(),
                                    max_iter = 100,
                                    initial_strategy = 'most_frequent')
iter_imputer_cat

In [None]:
iter_imputer_num = IterativeImputer(estimator = RandomForestRegressor(),
                                    max_iter = 100,
                                    initial_strategy = 'mean')
iter_imputer_num

In [None]:
ct2 = ColumnTransformer([('cat_imp', iter_imputer_cat, cat_col),
                         ('num_imp', iter_imputer_num, num_col)],
                        remainder = 'passthrough',
                        verbose_feature_names_out=False).set_output(transform = 'pandas')
ct2

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
pl = Pipeline([('ct1',ct),('ct2',ct2)])
pl

In [None]:
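# Fit on the raw training rows only (X_train was encoded above), then apply the fitted pipeline to the test set
pl.fit(X.loc[X_train.index])
X_test_transformed = pl.transform(X_test)
X_test_transformed.head()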