# KNN Imputation:

* Overview: KNN (K-Nearest Neighbors) imputation fills in missing values by finding the 'k' most similar rows (neighbors) in the dataset and averaging their values.
* Key Point: Simple and intuitive, but may struggle with high-dimensional data or datasets with outliers.
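
To make the averaging concrete, here is a minimal sketch on a toy array (the values are made up for illustration). With `n_neighbors=2` and the default uniform weights, each missing entry is replaced by the mean of that feature over the two nearest rows in which it is observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (illustrative values only); np.nan marks the missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Uniform weights: a plain average over the 2 nearest neighbours
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
# Row 0, column 2: the nearest rows are 1 and 2, so the fill is (3 + 5) / 2 = 4.0
# Row 2, column 0: the nearest rows are 1 and 3, so the fill is (3 + 8) / 2 = 5.5
```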

**KNN Imputation General Guidelines**

The authors of the KNN imputation algorithm suggest the following:

* Small Amount of Missing Data: KNN imputation tends to be more accurate when less than 20% of the data is missing.
* Value of K: The exact number of neighbors (K) is not critical. As long as K is between 10 and 20, the results will generally be good.
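
A quick way to check a dataset against these guidelines is to compute the share of missing values per column. The sketch below uses a hypothetical toy DataFrame; the 20% threshold is simply the figure quoted above, not a hard rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan, 84.0],
    "MasVnrArea": [196.0, 0.0, 162.0, 0.0, np.nan],
    "GarageYrBlt": [2003.0, 1976.0, 2001.0, 1998.0, 2000.0],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_fraction = df.isnull().mean()

# Columns whose missing share exceeds the 20% guideline
above_guideline = missing_fraction[missing_fraction > 0.20].index.tolist()

print(missing_fraction)
print("Above 20% missing:", above_guideline)
```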

**Link for additional references**
* [KNN](https://www.dropbox.com/scl/fo/le47x1fez8y7g7akw9bo9/AItqo18_QDZMGI39Lc8vdo0/Section-07-Multivariate-Imputation?dl=0&rlkey=7257ih8lct4v0nkroy7if74i1&subfolder_nav_tracking=1)

### KNN imputation with scikit-learn

**Additional notes**

* Same K for All Variables: `KNNImputer` applies a single number of neighbors (K) to every variable with missing data. K cannot be set separately for each feature.

* Optimizing K for Prediction: Because K is shared, you can't easily tune it to improve the imputation of a single variable. What you can do is treat K as a hyperparameter of a supervised pipeline and tune it to improve the prediction of the target variable.


**Why is it a regression problem?**

In the context of imputation with separate KNN models, it becomes a regression problem because each variable with missing values is treated as a target variable that needs to be predicted. In simple terms, when you impute missing values, you are trying to predict what those values should be, using the information available in the other variables of the dataset.

**Alternative Approach:**

If the goal is to predict missing values as accurately as possible, it’s better to use separate KNN models for each variable with missing data. This approach involves treating each variable's imputation as a separate regression problem, where each variable is predicted based on other variables.

* Single KNN Imputer: Uses one imputer for all missing data, treating all variables together.
* Separate KNN Imputers: Uses distinct imputers for each variable, modeling each one individually based on the remaining variables.
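
The "separate imputers" idea can be sketched with a plain `KNeighborsRegressor`: the variable with missing values becomes the target, and the remaining variables are the predictors. The helper `impute_with_knn_regressor` and the toy columns are hypothetical, and the sketch assumes the predictor columns are complete.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: 'x1' and 'x2' are complete, 'y_var' has missing values
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "y_var": [10.0, 20.0, np.nan, 40.0, np.nan, 60.0],
})

def impute_with_knn_regressor(frame, target, predictors, n_neighbors=2):
    """Fill NaNs in `target` by regressing it on `predictors` (assumed complete)."""
    out = frame.copy()
    missing = out[target].isnull()
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    # Train on the rows where the target is observed
    model.fit(out.loc[~missing, predictors], out.loc[~missing, target])
    # Predict the target for the rows where it is missing
    out.loc[missing, target] = model.predict(out.loc[missing, predictors])
    return out

imputed = impute_with_knn_regressor(df, "y_var", ["x1", "x2"])
print(imputed)
```

Because each variable gets its own model, `n_neighbors` could be tuned per variable here, which the single shared imputer does not allow.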

### Template

In [None]:
# Step 1: Import the libraries
# KNNImputer fills missing values using the K-nearest neighbors algorithm
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Step 2: Identify columns with missing data
# Assumes `data` is a DataFrame that has already been loaded
# Prints the name and the count of missing values for every column that has any
for var in data.columns:
    n_missing = data[var].isnull().sum()
    if n_missing > 0:
        print(var, n_missing)

In [None]:
# Step 3: Separate the data into training and testing sets
# Assumes `cols_to_use` is a list of numeric columns that includes 'SalePrice'
# Remove 'SalePrice' from the feature list because it is the target variable
cols_to_use.remove('SalePrice')

# Split the data: 30% of the rows go to the test set, the rest to the training set
X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],  # Features for training
    data['SalePrice'],  # Target variable
    test_size=0.3,  # 30% test data
    random_state=0)  # Seed for reproducibility

# Step 4: Print the shape of the training and testing sets
# print() is needed because this is not the last expression in the cell
print(X_train.shape, X_test.shape)

# Step 5: Reset index for training and testing sets
# This ensures the indices are sequential and aligned, which is useful for comparison later
X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)

In [None]:
# Step 1: Initialize the KNNImputer
# This imputer replaces missing values using the K-nearest neighbors method
# n_neighbors: Number of neighbors to use (K)
# weights: How to weight the neighbors ('distance' means closer neighbors have more weight)
# metric: Distance metric to use (nan_euclidean handles missing values)
# add_indicator: Whether to add a column indicating missing values (False here)
imputer = KNNImputer(
    n_neighbors=5,  # Number of neighbors to consider
    weights='distance',  # Weight neighbors by their distance
    metric='nan_euclidean',  # Metric used to compute distance, handling NaNs
    add_indicator=False  # Do not add a missing indicator column
)

# Step 2: Fit the imputer on the training data
# The imputer learns the structure of the data to make imputations
imputer.fit(X_train)

# Step 3: Transform the training and testing data
# Replaces missing values using the KNN algorithm
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

# Step 4: Convert the results to DataFrames
# The imputer returns a NumPy array, so we convert it to DataFrame for easier manipulation
train_t = pd.DataFrame(train_t, columns=X_train.columns)
test_t = pd.DataFrame(test_t, columns=X_test.columns)

# Step 5: Display the first few rows of the transformed training data
print(train_t.head())

# Step 6: Confirm that no missing values remain after imputation
# Counts the remaining missing values in specific columns of the transformed data
print(train_t[['LotFrontage', 'MasVnrArea', 'GarageYrBlt']].isnull().sum())

# Step 7: View the rows where 'MasVnrArea' was originally missing
# The mask also works on train_t because X_train's index was reset above
missing_mask = X_train['MasVnrArea'].isnull()
print(X_train.loc[missing_mask, 'MasVnrArea'])

# Step 8: View the imputed values for those same rows
train_t.loc[missing_mask, 'MasVnrArea']
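
The effect of the `weights` and `add_indicator` parameters is easier to see on a tiny array. The sketch below uses made-up values: with `weights='distance'` the closer neighbour counts more, and with `add_indicator=True` one extra 0/1 column is appended for every feature that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (illustrative values only)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

# Row 0, column 2 borrows from row 1 (value 3, closer) and row 2 (value 5, farther)
print(uniform[0, 2])   # plain mean of 3 and 5
print(distance[0, 2])  # pulled towards 3, the value of the closer neighbour

# Columns 0 and 2 contain NaNs, so two indicator columns are appended
with_flags = KNNImputer(n_neighbors=2, add_indicator=True).fit_transform(X)
print(with_flags.shape)
```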

### KNN imputation - Feature engine

* We can use Feature-engine to apply the KNNImputer to a slice of the dataframe.

In [None]:
# Step 1: Initialize the SklearnTransformerWrapper with KNNImputer
# The wrapper applies a scikit-learn transformer to the selected variables only
# and returns a DataFrame instead of a NumPy array
from feature_engine.wrappers import SklearnTransformerWrapper

imputer = SklearnTransformerWrapper(
    transformer=KNNImputer(weights='distance'),  # Use KNNImputer with distance-based weighting
    variables=cols_to_use  # Apply imputation to these variables
)

# Step 2: Fit the wrapper and KNNImputer on the training data
# This prepares the imputer by learning from the training data
imputer.fit(X_train)

# Step 3: Transform the training and testing data
# Replaces missing values using the trained KNNImputer
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

# Step 4: Display the first few rows of the transformed training data
# The wrapper returns a DataFrame, so no conversion is needed
print(train_t.head())

# Step 5: Check for remaining missing values in 'MasVnrArea' after imputation
# Verifies that there are no more missing values in the specified column
train_t['MasVnrArea'].isnull().sum()
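
If Feature-engine is not available, a comparable way to impute only a slice of the DataFrame is scikit-learn's `ColumnTransformer`. This is a different tool from the wrapper above, shown here only as a rough equivalent; the toy columns `a`, `b` and `c` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer

# Hypothetical toy data; column 'c' is deliberately left out of the imputation
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
    "c": [np.nan, 5.0, 6.0, 7.0],
})

# Impute 'a' and 'b' only; every other column is passed through unchanged
ct = ColumnTransformer(
    transformers=[("knn", KNNImputer(n_neighbors=2), ["a", "b"])],
    remainder="passthrough",
)
out = ct.fit_transform(df)

# The output is a NumPy array: transformed columns first, then the passthrough ones
result = pd.DataFrame(out, columns=["a", "b", "c"], index=df.index)
print(result)
```

Unlike the Feature-engine wrapper, this returns an array with the columns reordered, so the column names have to be restored by hand.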

### Automatically find best imputation parameters
* We can optimise the parameters of the KNN imputation to better predict our outcome.

In [None]:
# import extra classes for modelling
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],  # just the features
    data['SalePrice'],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0)  # for reproducibility

X_train.shape, X_test.shape

In [None]:
# Step 1: Define the pipeline
# The pipeline includes an imputer, a scaler, and a regressor
pipe = Pipeline(steps=[
    ('imputer', KNNImputer(
        n_neighbors=5,  # Number of neighbors for imputation
        weights='distance',  # Weighting for neighbors
        add_indicator=False  # Do not add missing indicator columns
    )),

    ('scaler', StandardScaler()),  # Standardize features
    ('regressor', Lasso(max_iter=2000))  # Lasso regression with specified max iterations
])

# Step 2: Set up the parameter grid for GridSearchCV
# Define the range of parameters to test for each step in the pipeline
param_grid = {
    'imputer__n_neighbors': [3, 5, 10],  # Different numbers of neighbors to test
    'imputer__weights': ['uniform', 'distance'],  # Different weighting methods
    'imputer__add_indicator': [True, False],  # Whether to add missing indicators
    'regressor__alpha': [10, 100, 200],  # Regularization strength for Lasso regression
}

# Step 3: Initialize GridSearchCV
# Perform an exhaustive search over the parameter grid with cross-validation
grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='r2')

# Step 4: Fit GridSearchCV on the training data
# Train the pipeline with all combinations of parameters to find the best model
grid_search.fit(X_train, y_train)

# Step 5: Print the score on the training set
# R^2 of the best pipeline (refit on the full training data) on the training data
print(("Best pipeline R^2 on the train set: %.3f"
       % grid_search.score(X_train, y_train)))

# Step 6: Print the performance on the test set
# R^2 of the best pipeline on the held-out test data
print(("Best pipeline R^2 on the test set: %.3f"
       % grid_search.score(X_test, y_test)))

# Step 7: Display the best parameters
# Show the parameter combination that resulted in the best model
grid_search.best_params_
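
For a version of this search that runs without the housing data, the sketch below swaps in a synthetic regression problem with values removed at random. The dataset, the missing rate and the `alpha` grid are assumptions chosen for illustration. It also prints `best_score_`, the mean cross-validated R^2 of the winning combination, which is usually a fairer number than the score on the training set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Remove roughly 10% of the entries at random
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

pipe = Pipeline(steps=[
    ("imputer", KNNImputer()),
    ("scaler", StandardScaler()),
    ("regressor", Lasso(max_iter=2000)),
])

param_grid = {
    "imputer__n_neighbors": [3, 5, 10],
    "imputer__weights": ["uniform", "distance"],
    "regressor__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2")
search.fit(X_train, y_train)

print("Best CV R^2:", round(search.best_score_, 3))
print("Test R^2:", round(search.score(X_test, y_test), 3))
print("Best params:", search.best_params_)
```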

**Key Differences**

* Imputation-only snippets (scikit-learn and Feature-engine): Focus solely on KNN imputation. They apply the imputer directly to the data, without additional steps like scaling or regression.
* Pipeline snippet (grid search): Integrates KNN imputation within a pipeline that also includes scaling and regression. It performs a grid search to optimize hyperparameters for both imputation and regression.

# MICE (Multiple Imputation by Chained Equations):

**Additional notes**
* Overview: MICE creates multiple datasets where missing values are imputed using models based on other variables. Each dataset reflects different plausible values for the missing data, capturing uncertainty.
* Key Point: Generates multiple versions of the dataset and combines results for robust analysis.
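
One way to obtain several plausible datasets with scikit-learn is to run `IterativeImputer` with `sample_posterior=True` under different seeds. The sketch below is an approximation of the multiple-imputation idea on synthetic data, not a full MICE analysis; the data and the choice of five repetitions are assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: column 2 depends on columns 0 and 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Remove about 20% of column 2
mask = rng.random(100) < 0.2
X_missing = X.copy()
X_missing[mask, 2] = np.nan

# Each seed draws the imputed values from the predictive distribution,
# so every run yields a slightly different completed dataset
imputed_sets = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    imputed_sets.append(imputer.fit_transform(X_missing))

stacked = np.stack(imputed_sets)

# Analyse each dataset separately, then combine the results (here: a simple mean)
column_means = stacked[:, :, 2].mean(axis=1)
print("Mean of column 2 in each imputed dataset:", column_means.round(3))
print("Pooled estimate:", round(float(column_means.mean()), 3))
```

The spread of the per-dataset estimates gives a rough sense of how much uncertainty the missing values introduce.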

**MICE: Assumptions**
* Data is MAR (Missing At Random): The missing data mechanism depends on observed data but not on the unobserved data.
* Modeling Capability: The missing values in a variable can be predicted using other variables in the dataset, and do not rely on information from external sources.

**Link for additional references**

***To check variable nature, variable relationship, which variables should we use as predictors and more considerations, click the link***

* [MICE](https://www.dropbox.com/scl/fo/le47x1fez8y7g7akw9bo9/AItqo18_QDZMGI39Lc8vdo0/Section-07-Multivariate-Imputation?dl=0&preview=02_MICE.pdf&rlkey=7257ih8lct4v0nkroy7if74i1&subfolder_nav_tracking=1)

**Side notes**
* We will implement MICE with scikit-learn's `IterativeImputer`, using various machine learning models to estimate the missing values.
* The same model is used to predict the missing values in all variables.
* As a result, we can't use classification for binary variables and regression for continuous variables.
* For a more sophisticated imputation, we would have to assemble the imputers / models manually.
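
To show what assembling the models manually could look like, here is a simplified chained-equations loop. It is a sketch only: it produces a single imputation, does no posterior sampling, and uses `LinearRegression` for every variable. The function name `chained_imputation` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def chained_imputation(X, n_rounds=5):
    """Simplified chained-equations loop (single imputation, no sampling)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: start from a simple guess, the mean of the observed values
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        filled[missing[:, j], j] = col_means[j]

    # Step 2: cycle through the variables, re-predicting each from the others
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this variable
            others = np.delete(filled, j, axis=1)
            # A different model per variable could be chosen here
            model = LinearRegression()
            model.fit(others[~missing[:, j]], X[~missing[:, j], j])
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled

# Synthetic data with a linear relationship between the columns
rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 3))
X_true[:, 2] = 2.0 * X_true[:, 0] - X_true[:, 1] + rng.normal(scale=0.1, size=200)

X_missing = X_true.copy()
X_missing[rng.random(200) < 0.2, 2] = np.nan
X_missing[rng.random(200) < 0.2, 0] = np.nan

X_filled = chained_imputation(X_missing)
print("Remaining NaNs:", int(np.isnan(X_filled).sum()))
```

The line that creates the model is where a classifier could be substituted for a binary variable, which a single shared estimator does not allow.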

# missForest:

**Additional notes**
* Overview: missForest is a machine learning-based method that uses random forests to predict and impute missing values. It iteratively improves predictions by refining the model with each iteration.
* Key Point: Handles both categorical and numerical data effectively and can capture complex relationships between variables.

**Link for additional references**
* [missForest](https://www.dropbox.com/scl/fo/le47x1fez8y7g7akw9bo9/AItqo18_QDZMGI39Lc8vdo0/Section-07-Multivariate-Imputation?rlkey=7257ih8lct4v0nkroy7if74i1&subfolder_nav_tracking=1&dl=0)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import BayesianRidge

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Step 1: Create a MICE-style imputer using Bayesian Ridge regression as the estimator
# IterativeImputer models each variable with missing values as a function of the others,
# in round-robin fashion; with the default sample_posterior=False it returns a single imputation
imputer = IterativeImputer(
    estimator=BayesianRidge(),  # Estimator used to predict missing values
    initial_strategy='mean',    # Initial imputation strategy before iterative process
    max_iter=10,                # Maximum number of imputation rounds
    imputation_order='ascending', # From the variable with the fewest missing values to the most
    n_nearest_features=None,    # Number of other features used to predict each variable (None = all)
    skip_complete=True,         # Skip variables that do not have missing values
    random_state=0              # Seed for reproducibility
)

# Step 2: Fit the imputer on the training data
# This process prepares the imputer by learning patterns to handle missing values
imputer.fit(X_train)

# Step 3: Transform the training and testing data
# Replaces missing values in the datasets using the fitted imputer
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

# Step 4: Verify that there are no missing values after imputation
# Counts remaining missing values in the transformed training data
pd.DataFrame(train_t, columns=X_train.columns).isnull().sum()

### Let's compare imputation with different models

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Step 1: Split the data into training and testing sets
# The target variable 'A16' is separated from the features
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1),  # Features
    data['A16'],                # Target
    test_size=0.3,              # Proportion of data to be used for testing
    random_state=0              # Seed for reproducibility
)

# Display the shape of the training and testing sets
# print() is needed because this is not the last expression in the cell
print(X_train.shape, X_test.shape)

# Step 2: Initialize multiple IterativeImputer instances with different estimators
# These imputers will replace missing values using different regression models

imputer_bayes = IterativeImputer(
    estimator=BayesianRidge(),  # Bayesian Ridge regression as the estimator
    max_iter=10,                # Number of iterations for the imputation process
    random_state=0              # Seed for reproducibility
)

imputer_knn = IterativeImputer(
    estimator=KNeighborsRegressor(n_neighbors=5),  # KNN regression as the estimator
    max_iter=10,                                   # Number of iterations
    random_state=0                                 # Seed for reproducibility
)

imputer_nonLin = IterativeImputer(
    estimator=DecisionTreeRegressor(max_features='sqrt', random_state=0),  # Decision Tree regression as the estimator
    max_iter=500,                                                          # Number of iterations
    random_state=0                                                         # Seed for reproducibility
)

imputer_missForest = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),  # Extra Trees regression as the estimator
    max_iter=100,                                                    # Number of iterations
    random_state=0                                                   # Seed for reproducibility
)

# Step 3: Fit each imputer on the training data
# The imputers learn the data structure to replace missing values
imputer_bayes.fit(X_train)
imputer_knn.fit(X_train)
imputer_nonLin.fit(X_train)
imputer_missForest.fit(X_train)

# Step 4: Transform the training data to replace missing values
# Each imputer is used to fill missing values in the training set
X_train_bayes = imputer_bayes.transform(X_train)
X_train_knn = imputer_knn.transform(X_train)
X_train_nonLin = imputer_nonLin.transform(X_train)
X_train_missForest = imputer_missForest.transform(X_train)

# Step 5: Convert the transformed NumPy arrays to DataFrames
# Reuse the column names of X_train ('A16' was already dropped when splitting)
predictors = list(X_train.columns)  # List of predictor variables

X_train_bayes = pd.DataFrame(X_train_bayes, columns=predictors)
X_train_knn = pd.DataFrame(X_train_knn, columns=predictors)
X_train_nonLin = pd.DataFrame(X_train_nonLin, columns=predictors)
X_train_missForest = pd.DataFrame(X_train_missForest, columns=predictors)
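
One possible way to compare imputers numerically is to hide values that are actually known, impute them, and measure the error against the truth. The sketch below does this on synthetic data with two of the estimators used above; the data, the 10% masking rate and the RMSE metric are assumptions, and the ranking it produces says nothing about the dataset in this notebook.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor

# Synthetic, fully observed data with a linear relationship
rng = np.random.default_rng(0)
X_full = rng.normal(size=(300, 4))
X_full[:, 3] = X_full[:, 0] - 2.0 * X_full[:, 1] + rng.normal(scale=0.2, size=300)

# Hide about 10% of the entries so the true values are known
mask = rng.random(X_full.shape) < 0.1
X_masked = X_full.copy()
X_masked[mask] = np.nan

estimators = {
    "bayes": BayesianRidge(),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

# RMSE between the imputed and the true values, on the hidden entries only
rmse = {}
for name, estimator in estimators.items():
    imputer = IterativeImputer(estimator=estimator, max_iter=10, random_state=0)
    X_hat = imputer.fit_transform(X_masked)
    rmse[name] = float(np.sqrt(np.mean((X_hat[mask] - X_full[mask]) ** 2)))

print(rmse)
```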