## Automatic Selection of Imputation Technique

Instead of blindly applying a single imputation method, we can **automatically choose the best technique** based on data behavior.

### Why Automatic Selection?
- Different features have different missing patterns
- One imputation method does not fit all cases
- Reduces manual bias and can improve model performance

---

## How the Selection is Done

### 1. Analyze Missing Percentage
- **Low missing (< 5%)**
  - Mean / Median (numerical)
  - Mode (categorical)

- **Moderate to High missing**
  - Random sample imputation
  - Missing indicator + imputation
  - `"missing"` category for categorical data
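The thresholds above can be turned into a small helper. This is a minimal sketch, not a definitive rule: the function name `suggest_imputation`, the 5% cutoff constant, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical cutoff taken from the "< 5%" rule of thumb above.
LOW_MISSING_THRESHOLD = 0.05


def suggest_imputation(df: pd.DataFrame, low: float = LOW_MISSING_THRESHOLD) -> pd.DataFrame:
    """Return the missing fraction and a suggested strategy for each column."""
    rows = []
    for col in df.columns:
        frac = df[col].isna().mean()
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        if frac == 0:
            suggestion = "none needed"
        elif frac < low:
            suggestion = "median" if is_numeric else "mode"
        elif is_numeric:
            suggestion = "random sample or indicator + imputation"
        else:
            suggestion = "'missing' category"
        rows.append({"column": col, "missing_frac": frac, "suggestion": suggestion})
    return pd.DataFrame(rows).set_index("column")


# Toy data (made up for illustration, not the Titanic file).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan, 35.0] * 20,                    # 40% missing
    "fare": [7.25, 71.28, 7.92, 53.1, 8.05] * 19
            + [np.nan, 1.0, 2.0, 3.0, 4.0],                            # 1% missing
    "port": ["S", "C", None, "S", "S"] * 20,                           # 20% missing
    "sex": ["male", "female", "female", "female", "male"] * 20,        # complete
})

report = suggest_imputation(toy)
print(report)
```

The suggestion is only a starting point; the later checks (target relationship, distribution, model type) can still override it.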

---

### 2. Check Relationship with Target
- Compare target distribution for:
  - Missing values
  - Non-missing values
- If distributions differ → missingness is **informative**
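One way to run this comparison is to compare the target rate in the two groups and back it with a chi-square test of independence. The sketch below assumes a binary target and uses made-up data; the helper name `missingness_vs_target` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def missingness_vs_target(df: pd.DataFrame, feature: str, target: str) -> dict:
    """Compare the target between rows where `feature` is missing and present."""
    is_missing = df[feature].isna()
    # Contingency table: missing flag (rows) vs. target class (columns).
    table = pd.crosstab(is_missing, df[target])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return {
        "rate_missing": df.loc[is_missing, target].mean(),
        "rate_present": df.loc[~is_missing, target].mean(),
        "p_value": p_value,
    }


# Made-up data: 80 rows with missing age (25% positive target),
# 120 rows with observed age (50% positive target).
toy = pd.DataFrame({
    "age": [np.nan] * 80 + [30.0] * 120,
    "survived": [1] * 20 + [0] * 60 + [1] * 60 + [0] * 60,
})

result = missingness_vs_target(toy, "age", "survived")
print(result)
```

A small p-value together with a visible gap between the two rates suggests the missingness carries signal, in which case a missing indicator is worth keeping.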

---

### 3. Preserve Data Distribution
- If variance reduction is unacceptable:
  - Avoid mean/median
  - Prefer random sample imputation
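The variance effect is easy to check empirically. The following sketch uses synthetic, normally distributed data with 30% of values removed at random; `random_sample_impute` is a hypothetical helper, not a scikit-learn API.

```python
import numpy as np
import pandas as pd


def random_sample_impute(s: pd.Series, random_state: int = 0) -> pd.Series:
    """Fill NaNs by sampling (with replacement) from the observed values."""
    out = s.copy()
    missing = out.isna()
    sampled = s.dropna().sample(
        n=int(missing.sum()), replace=True, random_state=random_state
    )
    out.loc[missing] = sampled.to_numpy()
    return out


rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=30, scale=10, size=1000))
drop_idx = rng.choice(1000, size=300, replace=False)
values.iloc[drop_idx] = np.nan  # 30% missing completely at random

mean_filled = values.fillna(values.mean())
sample_filled = random_sample_impute(values)

print(f"observed std:        {values.std():.2f}")
print(f"mean-imputed std:    {mean_filled.std():.2f}")
print(f"sample-imputed std:  {sample_filled.std():.2f}")
```

Mean imputation stacks every filled value on a single point, so the spread shrinks; sampling from the observed values keeps the spread close to the original.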

---

### 4. Consider Model Type
- **Linear Models**
  - Random sample imputation
  - Missing indicator

- **Tree-based Models**
  - Missing category
  - Missing indicator
  - Simple imputation often sufficient
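For either model family, scikit-learn's `SimpleImputer` can append a missing indicator through `add_indicator=True`. A minimal sketch on a made-up single column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with two missing entries (made-up values).
X = np.array([[22.0], [np.nan], [26.0], [np.nan], [35.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

# Column 0: values with NaN replaced by the median of the observed values.
# Column 1: 1.0 where the original value was missing, else 0.0.
print(X_out)
```

With the indicator column, a linear model can learn a separate offset for imputed rows, and a tree can split on the flag directly.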

---

### 5. Production Constraints
- Large datasets → avoid memory-heavy techniques
- Pipelines → prefer sklearn imputers
- Deployment → reproducibility matters

---

## Key Insight
There is no “best” imputation method —  
the best method is **data-driven, model-aware, and scalable**.

---

### Final Rule
> Let the data decide the imputation technique, not assumptions.


In [17]:

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

In [3]:
df = pd.read_csv('train.csv')

In [4]:
df.drop(columns=['PassengerId','Name','Ticket','Cabin'],inplace=True)

In [5]:
df.head()

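|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |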


In [6]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [9]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [8]:
X_train.head()

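|   | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 331 | 1 | male | 45.5 | 0 | 0 | 28.5 | S |
| 733 | 2 | male | 23.0 | 0 | 0 | 13.0 | S |
| 382 | 3 | male | 32.0 | 0 | 0 | 7.925 | S |
| 704 | 3 | male | 26.0 | 1 | 0 | 7.8542 | S |
| 813 | 3 | female | 6.0 | 4 | 2 | 31.275 | S |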


In [10]:
numerical_features = ['Age', 'Fare']
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_features = ['Embarked', 'Sex']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe',OneHotEncoder(handle_unknown='ignore'))
])

## Preprocessing Pipelines for Numerical and Categorical Features

To handle missing values and prepare data correctly for machine learning models, we apply **separate preprocessing pipelines** for numerical and categorical features. This ensures each feature type is treated with the most appropriate strategy.

---

## Numerical Features Pipeline
**Features:** `Age`, `Fare`

### Steps
1. **Imputation (Median)**
   - Replaces missing values with the median.
   - Robust to outliers and skewed distributions.

2. **Scaling (StandardScaler)**
   - Standardizes features to zero mean and unit variance.
   - Helps regularized linear, distance-based, and gradient-based models; tree-based models generally do not need it.

### Why This Pipeline?
- Preserves robustness against extreme values.
- Ensures numerical stability during model training.

---

## Categorical Features Pipeline
**Features:** `Embarked`, `Sex`

### Steps
1. **Imputation (Most Frequent)**
   - Replaces missing values with the most common category.
   - Simple and effective baseline method.

2. **Encoding (OneHotEncoder)**
   - Converts categories into numeric format.
   - `handle_unknown='ignore'` encodes categories not seen during fitting as all zeros at transform time instead of raising an error.

### Why This Pipeline?
- Maintains dataset size without dropping rows.
- Makes categorical data usable for ML models.
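The behavior of `handle_unknown='ignore'` can be seen on a tiny example. The sketch below fits on made-up port codes and then transforms a value that was not present during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["S"], ["C"], ["S"]], dtype=object)
new = np.array([["C"], ["Q"]], dtype=object)  # "Q" was not seen during fit

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
encoded = enc.transform(new).toarray()

print(enc.categories_)  # learned categories, sorted
print(encoded)          # the unseen "Q" row is all zeros

# With the default handle_unknown="error", the same input raises instead.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    strict_raised = False
except ValueError:
    strict_raised = True
print("default encoder raised ValueError:", strict_raised)
```

The all-zero row means the model treats an unseen category as "none of the known ones", which is usually safer in production than a failed prediction.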

---

## Key Benefits of This Approach
- Feature-type–specific preprocessing
- Prevents data leakage: imputation and scaling statistics are learned from the training data (or training folds) only
- Fully compatible with sklearn pipelines
- Scalable and production-ready
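The leakage point can be illustrated with a hand-made split: the imputer is fitted on the training rows only, so the fill value for test rows comes from training statistics. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column; the first six rows play the role of the training set.
X = np.array([[1.0], [2.0], [3.0], [np.nan], [100.0], [200.0], [np.nan], [4.0]])
X_train, X_test = X[:6], X[6:]

imputer = SimpleImputer(strategy="median").fit(X_train)
X_test_filled = imputer.transform(X_test)

train_median = imputer.statistics_[0]   # median of 1, 2, 3, 100, 200 -> 3.0
full_median = np.nanmedian(X)           # median over all rows -> 3.5 (would leak)

print("fill value learned from train:", train_median)
print("median over train + test:     ", full_median)
print(X_test_filled.ravel())
```

Inside a `Pipeline` passed to cross-validation, the same fit-on-train, transform-on-validation pattern is applied to every fold automatically.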

---

In [11]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

In [12]:
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

In [13]:
from sklearn import set_config
set_config(display='diagram')

In [14]:
clf

[Pipeline diagram: `preprocessor` (ColumnTransformer, remainder='drop') → `classifier` (LogisticRegression, penalty='l2', C=1.0, solver='lbfgs', max_iter=100)]
- `num`: SimpleImputer(strategy='median') → StandardScaler()
- `cat`: SimpleImputer(strategy='most_frequent') → OneHotEncoder(handle_unknown='ignore')


In [15]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'preprocessor__cat__imputer__strategy': ['most_frequent', 'constant'],
    'classifier__C': [0.1, 1.0, 10, 100]
}
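A note on the `'constant'` option in this grid: when no `fill_value` is given, `SimpleImputer` fills string data with the literal `"missing_value"`, which then becomes its own one-hot category. The sketch below shows this on made-up data; the `extended_grid` dictionary is a hypothetical extension (not part of the original notebook) that also lets the search decide whether to add a missing indicator for the numerical features.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up categorical column with one missing entry.
embarked = np.array([["S"], [np.nan], ["C"], ["S"]], dtype=object)

default_fill = SimpleImputer(strategy="constant").fit_transform(embarked)
explicit_fill = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(embarked)

print(default_fill.ravel())   # missing entry replaced by "missing_value"
print(explicit_fill.ravel())  # missing entry replaced by "missing"

# Hypothetical extension of the grid: add_indicator is a real SimpleImputer
# parameter, but searching over it is an assumption, not the author's setup.
extended_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__num__imputer__add_indicator": [False, True],
    "preprocessor__cat__imputer__strategy": ["most_frequent", "constant"],
    "classifier__C": [0.1, 1.0, 10, 100],
}
```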

In [19]:
grid_search = GridSearchCV(
    clf,
    param_grid,
    cv=10
)

In [20]:
grid_search.fit(X_train, y_train)

[GridSearchCV diagram: cv=10, refit=True, default scoring; the refitted best estimator uses the parameters below]
- `num`: SimpleImputer(strategy='mean') → StandardScaler()
- `cat`: SimpleImputer(strategy='most_frequent') → OneHotEncoder(handle_unknown='ignore')
- `classifier`: LogisticRegression(penalty='l2', C=0.1, solver='lbfgs', max_iter=100)


In [21]:
print("Best params:")
print(grid_search.best_params_)

Best params:
{'classifier__C': 0.1, 'preprocessor__cat__imputer__strategy': 'most_frequent', 'preprocessor__num__imputer__strategy': 'mean'}


In [22]:
print(f"Internal CV score: {grid_search.best_score_:.3f}")

Internal CV score: 0.784
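The internal CV score was used to pick the hyperparameters, so it tends to be slightly optimistic. A common follow-up is to score the refitted best estimator on the held-out test split. Since `train.csv` is not bundled here, the sketch below uses a synthetic stand-in with made-up columns and a simplified pipeline; only the final `score` call is the point.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic columns (made-up data, illustration only).
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
    "Sex": rng.choice(["male", "female"], n),
})
is_female = (demo["Sex"] == "female").to_numpy()
label_noise = rng.random(n) < 0.2
demo["Survived"] = np.logical_xor(is_female, label_noise).astype(int)
demo.loc[rng.random(n) < 0.2, "Age"] = np.nan  # inject roughly 20% missing ages

X = demo.drop(columns="Survived")
y = demo["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

numeric = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
pre = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
model = Pipeline([("preprocessor", pre), ("classifier", LogisticRegression())])

search = GridSearchCV(
    model,
    {
        "preprocessor__num__imputer__strategy": ["mean", "median"],
        "classifier__C": [0.1, 1.0],
    },
    cv=5,
)
search.fit(X_train, y_train)

cv_score = search.best_score_
# With refit=True (the default), score() uses the best estimator refitted on all of X_train.
test_score = search.score(X_test, y_test)
print(f"Internal CV score: {cv_score:.3f}")
print(f"Held-out test score: {test_score:.3f}")
```

In the notebook itself, the equivalent call would be `grid_search.score(X_test, y_test)`.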


In [23]:
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)
cv_results[['param_classifier__C','param_preprocessor__cat__imputer__strategy','param_preprocessor__num__imputer__strategy','mean_test_score']]

|   | param_classifier__C | param_preprocessor__cat__imputer__strategy | param_preprocessor__num__imputer__strategy | mean_test_score |
|---|---|---|---|---|
| 0 | 0.1 | most_frequent | mean | 0.783725 |
| 1 | 0.1 | most_frequent | median | 0.783725 |
| 2 | 0.1 | constant | mean | 0.783725 |
| 3 | 0.1 | constant | median | 0.783725 |
| 4 | 1.0 | most_frequent | mean | 0.782316 |
| 5 | 1.0 | most_frequent | median | 0.782316 |
| 6 | 1.0 | constant | mean | 0.782316 |
| 7 | 1.0 | constant | median | 0.782316 |
| 8 | 10.0 | most_frequent | mean | 0.782316 |
| 9 | 10.0 | most_frequent | median | 0.782316 |
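In the rows shown above, the score changes only with `C`, not with the imputation strategy. A quick way to check this is to measure the score spread within each `C` value. The sketch below re-enters the displayed rows by hand (shortened column names are an assumption for readability); on the real notebook the same `groupby` could be run on `cv_results` directly.

```python
import pandas as pd

# The ten rows displayed above, copied by hand with shortened column names.
results = pd.DataFrame({
    "C": [0.1] * 4 + [1.0] * 4 + [10.0] * 2,
    "cat_strategy": ["most_frequent", "most_frequent", "constant", "constant"] * 2
                    + ["most_frequent", "most_frequent"],
    "num_strategy": ["mean", "median"] * 5,
    "mean_test_score": [0.783725] * 4 + [0.782316] * 6,
})

# Spread of the CV score across imputation strategies, for each value of C.
spread = results.groupby("C")["mean_test_score"].agg(lambda s: s.max() - s.min())
print(spread)
```

A spread of zero within each `C` suggests that, for this dataset and model, the choice between these imputation strategies made no measurable difference, so the "best" strategy reported by the search is likely just the first of several tied candidates.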
