### Handling Missing Values with Imputation

**Objective:**

By the end of this lesson, students will be able to:

- Explain why missing values occur in datasets.
- Describe different methods for handling missing values.
- Apply imputation techniques to fill missing values in datasets using scikit-learn.


#### Imputation Techniques
There are several strategies for filling in missing values. Common imputation techniques include:

1. Mean/Median/Mode Imputation (for numerical/categorical data)

    - **Mean**: Replace missing numerical values with the mean of the column. The mean is sensitive to outliers.
    - **Median**: Replace missing numerical values with the median of the column, which is more robust to outliers and skewed distributions.
    - **Mode**: Replace missing categorical values with the most frequent category.

2. Predictive Imputation

    - Predict each missing value from the other features using a machine learning model (for example, k-nearest neighbors or a regression model) fitted on the rows where that value is observed.

3. Custom Methods

    - Sometimes domain knowledge dictates how a gap should be filled, and custom logic is required instead of a generic statistic.

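The three statistics in item 1 can be tried out with `SimpleImputer`. The following is a minimal sketch on hypothetical toy columns (the values are made up so the arithmetic is easy to check); it is not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns; 100 acts as an outlier in Age.
ages = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 100.0]})
cities = pd.DataFrame({'City': ['Paris', 'Paris', 'London', np.nan]})

# Each imputer learns one statistic per column during fit and uses it to fill the gaps.
mean_filled = SimpleImputer(strategy='mean').fit_transform(ages)
median_filled = SimpleImputer(strategy='median').fit_transform(ages)
mode_filled = SimpleImputer(strategy='most_frequent').fit_transform(cities)

print(mean_filled.ravel())    # missing Age becomes 50.0, the mean of 20, 30, 100
print(median_filled.ravel())  # missing Age becomes 30.0, the median of 20, 30, 100
print(mode_filled.ravel())    # missing City becomes 'Paris', the most frequent value
```

The outlier pulls the mean up to 50 while the median stays at 30, which is why the median is often preferred for skewed columns.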

In [12]:
# Numerical columns: impute with the mean or median
# Categorical columns: impute with the mode (most frequent value)
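Beyond per-column statistics, predictive imputation (item 2 in the list of techniques) estimates a missing value from similar rows. One possible way to do this in scikit-learn is `KNNImputer`; the sketch below uses hypothetical `[Age, Salary]` rows and an assumed `n_neighbors=2`, and it is only one of several predictive approaches.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical [Age, Salary] rows; the last row is missing its salary.
X = np.array([
    [25.0, 50000.0],
    [26.0, 52000.0],
    [50.0, 90000.0],
    [27.0, np.nan],
])

# For each missing entry, average that feature over the 2 nearest rows,
# where distance is measured on the features both rows have observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[3])  # salary is the average of the two rows closest in Age (ages 25 and 26)
```

Here the filled salary is 51000, the average of 50000 and 52000, rather than the column mean of 64000 that mean imputation would have produced.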

In [2]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [4]:
# Small example dataset with missing values (np.nan) in Age, Salary, and City
data = pd.DataFrame({
    'Age': [25, np.nan, 35, 40, np.nan],
    'Salary': [50000, 54000, 58000, np.nan, 60000],
    'City': ['New York', 'Paris', 'Paris', 'London', np.nan],
    'Purchased': [0, 1, 0, 1, 0]})

In [6]:
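# Separate the features from the target
X = data.drop(columns='Purchased')
y = data['Purchased']

# Split before fitting any imputer so that imputation statistics come from the training rows only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)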

In [8]:
numeric_features = ['Age', 'Salary']
categorical_features = ['City']

In [24]:
# Numeric pipeline: Impute missing values and scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Replace missing with the column mean
    ('scaler', StandardScaler())                  # Standardize to zero mean and unit variance
])

In [26]:
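# Categorical pipeline: Impute missing values and one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Replace missing with mode
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros instead of raising an error
    ('encoder', OneHotEncoder(handle_unknown='ignore'))    # One-hot encode categories
])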

In [28]:
# Route each group of columns through its own transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
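To see what the combined preprocessor produces, it can be fitted on its own and its output inspected. The sketch below rebuilds the same transformers so that it runs standalone; it fits on the full example table purely for inspection (an assumption of this sketch, not something to do before a real train/test split), and `handle_unknown='ignore'` is an added safeguard for unseen categories.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

features = pd.DataFrame({
    'Age': [25, np.nan, 35, 40, np.nan],
    'Salary': [50000, 54000, 58000, np.nan, 60000],
    'City': ['New York', 'Paris', 'Paris', 'London', np.nan],
})

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), ['Age', 'Salary']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), ['City']),
])

transformed = preprocessor.fit_transform(features)
# ColumnTransformer may return a sparse matrix; convert it for inspection.
if hasattr(transformed, 'toarray'):
    transformed = transformed.toarray()

print(transformed.shape)            # (5, 5): 2 scaled numeric columns + 3 one-hot city columns
print(np.isnan(transformed).any())  # False: every missing value has been filled
```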

In [30]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit the imputers, scaler, encoder, and classifier on the training data only
pipeline.fit(X_train, y_train)

In [32]:
# Missing values in X_test are filled with the statistics learned from X_train
pipeline.predict(X_test)

array([0])
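It can be useful to check which values the fitted pipeline actually uses to fill gaps. The sketch below rebuilds the lesson's pipeline so that it runs standalone, then reads the learned statistics through `named_steps` and `named_transformers_`. The variable names `num_imputer` and `cat_imputer` are introduced here for illustration, and `handle_unknown='ignore'` is an added safeguard.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    'Age': [25, np.nan, 35, 40, np.nan],
    'Salary': [50000, 54000, 58000, np.nan, 60000],
    'City': ['New York', 'Paris', 'Paris', 'London', np.nan],
    'Purchased': [0, 1, 0, 1, 0]})
X = data.drop(columns='Purchased')
y = data['Purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_features = ['Age', 'Salary']
categorical_features = ['City']

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), numeric_features),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_features),
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)

# Reach into the fitted pipeline to read the fill values learned from X_train.
fitted_preprocessor = pipeline.named_steps['preprocessor']
num_imputer = fitted_preprocessor.named_transformers_['num'].named_steps['imputer']
cat_imputer = fitted_preprocessor.named_transformers_['cat'].named_steps['imputer']

print(dict(zip(numeric_features, num_imputer.statistics_)))  # training-set means
print(cat_imputer.statistics_)                               # training-set mode for City
```

Because the statistics are stored at fit time, `pipeline.predict(X_test)` reuses them and never recomputes means or modes from the test rows.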