# Renaming Columns
**Use Case:** Shortening column names, fixing typos, removing spaces, etc.

In [None]:
import pandas as pd
df = df.rename(columns={'old_name': 'new_name'})
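
For the "removing spaces" use case, it is often easier to clean every column name at once before renaming individual ones. A minimal sketch, assuming a small made-up DataFrame (the column names below are hypothetical and only for illustration):

```python
import pandas as pd

# Hypothetical data with messy column names (leading/trailing spaces, a typo)
df = pd.DataFrame({" First Name": ["Ada"], "Total Cholestrol ": [180]})

# Strip surrounding whitespace and replace inner spaces in all column names
df.columns = df.columns.str.strip().str.replace(" ", "_")

# Fix the typo and shorten names with an explicit mapping
df = df.rename(columns={"First_Name": "first_name",
                        "Total_Cholestrol": "total_cholesterol"})

print(list(df.columns))
```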

# Dropping Columns
**Use Case:** Removing unnecessary columns.

In [None]:
import pandas as pd
df = df.drop(columns=['column_to_drop'])

# Pipelines
**Use Case:** \
Pipelines keep code simple by chaining several steps, such as preprocessing and modeling, \
into a single object that can be fit and applied with minimal code.

**Code Logic:** \
First, define the steps of your pipeline (ensure each step is compatible with pipelines: every step except the last must implement `fit` and `transform`). \
Second, create the pipeline with the aforementioned steps. \
Third, use the pipeline. Common methods are `fit`, `transform`, `fit_transform`, and `predict`.

**Actionable Next Steps:** \
Feature selection \
Model Evaluation \
Feature Engineering

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Define the steps of the pipeline
steps = [
    ('scaler', StandardScaler()),  # Transformer
    ('model', LogisticRegression())  # Estimator
]

# Create the pipeline
pipeline = Pipeline(steps)

# Use the pipeline to fit and predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

# Custom Transformer Pipelines
**Use Case:** \
Create your own preprocessing steps when you have a transformation \
unique to your dataset (e.g. string transformations).

**Code Logic:** \
First, define a class that inherits from `BaseEstimator` (provides `get_params`/`set_params` for parameter tuning) \
and `TransformerMixin` (provides `fit_transform`). \
Second, define the constructor (\_\_init\_\_). \
Third, define the `fit` method. \
Fourth, define the `transform` method.

Afterwards, your transformer class can be added to `steps`, similar to the previous code cell.

**Actionable Next Steps:** \
Feature selection \
Model Evaluation \
Feature Engineering

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param1):
        self.param1 = param1

    def fit(self, X, y=None):
        # Learn anything needed from the training data here; fit must always return self
        return self

    def transform(self, X):
        # Depending on the data type of 'X', you might need to return a DataFrame, a Series or a numpy array
        X_transformed = X.copy()  # creating a copy to avoid changes to original dataset
        X_transformed = X_transformed + self.param1  # an example operation using 'param1'
        return X_transformed

pipeline = Pipeline([
    ('custom', CustomTransformer(param1=1)),  # example value; replace with your own
    # ... other steps in the pipeline ...
])
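
The use case above mentions string transformations, so here is a small sketch of a custom transformer that cleans text columns. The `StringCleaner` class, its `lowercase` parameter, and the `city` column are hypothetical names made up for this example; it assumes the input is a pandas DataFrame of string columns.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class StringCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: strips whitespace and optionally lowercases strings."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def fit(self, X, y=None):
        # Nothing is learned from the data, so just return self
        return self

    def transform(self, X):
        X_transformed = X.copy()  # avoid changing the original DataFrame
        for column in X_transformed.columns:
            X_transformed[column] = X_transformed[column].str.strip()
            if self.lowercase:
                X_transformed[column] = X_transformed[column].str.lower()
        return X_transformed


raw = pd.DataFrame({'city': ['  Paris', 'LONDON ', 'paris']})
cleaning_pipeline = Pipeline([('clean', StringCleaner(lowercase=True))])
cleaned = cleaning_pipeline.fit_transform(raw)
print(cleaned['city'].tolist())
```

Because `lowercase` is stored unchanged in the constructor, it can later be tuned like any other pipeline parameter (here it would be named `clean__lowercase`).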

# GridSearchCV with Pipelines
**Use Case:** \
Wrapping preprocessing in a Pipeline lets you run cross-validation without data leakage, because every step is re-fit on the training folds only. \
Moreover, you can tune the hyperparameters of each step of the Pipeline.

**Code Logic:** \
First, initialize a dictionary whose keys name the parameters you want to tune, in the form `<step name>__<parameter name>`. \
Second, pass the pipeline and the dictionary into GridSearchCV.

**Actionable Next Steps:** \
Feature selection \
Model Evaluation \
Feature Engineering

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Define the steps of the pipeline
steps = [
    ('scaler', StandardScaler()),  # Transformer
    ('model', LogisticRegression())  # Estimator
]

# Create the pipeline
pipeline = Pipeline(steps)

# Define the parameter grid for the grid search
param_grid = {
    'scaler__with_mean': [True, False],
    'model__C': [0.1, 1.0, 10.0],
}

# Create the grid search object
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
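
The cell above only creates the grid search object. A minimal end-to-end sketch of fitting it and reading the results is below; it assumes synthetic data from `make_classification`, which is used only so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, replace with your own X and y)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
param_grid = {'scaler__with_mean': [True, False], 'model__C': [0.1, 1.0, 10.0]}

# The scaler is re-fit inside every CV fold, so the validation fold never leaks into it
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)           # best combination found
test_score = grid_search.score(X_test, y_test)  # accuracy of the refit best pipeline
print(test_score)
```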

# KBinsDiscretizer
**Use Case:** \
Good for reducing noise, handling outliers, or improving model performance/simplicity.

**Code Logic:**
* columns_to_bin should be initialized with a list of column names you want to discretize.
* adjust the parameters of KBinsDiscretizer as needed
* perform fit_transform on training data and transform on test data
* lastly, you can print the bin edges learned during fitting (`bin_edges_`); this is optional and not needed to perform the discretization

**Actionable Next Steps:** \
EDA \
Feature Engineering \
Modeling 

**Notes:** \
[Full documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

X1 = df1.copy()
X2 = df2.copy()

# Define the columns you want to bin
columns_to_bin = ['Age']

# Define the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('discretizer', KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform'), columns_to_bin)
    ])),
])

X1_transformed = pipeline.fit_transform(X1)
X2_transformed = pipeline.transform(X2)

# Write the binned values back to the 'Age' column
# (the ColumnTransformer drops every other column, so column 0 of the output is 'Age')
X1['Age'] = X1_transformed[:, 0]
X2['Age'] = X2_transformed[:, 0]

# Access the 'preprocessor' step in the pipeline
preprocessor = pipeline.named_steps['preprocessor']

# Access the 'discretizer' transformer in the preprocessor
discretizer = preprocessor.named_transformers_['discretizer']

# Print the bin edges
print(discretizer.bin_edges_[0])

[23.  31.8 40.6 49.4 58.2 67. ]
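
The printed edges follow directly from `strategy='uniform'`: the range between the smallest and largest value is split into `n_bins` equal-width intervals. A small sketch that reproduces the edges above, assuming made-up ages whose minimum is 23 and maximum is 67 (the individual values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages; only the min (23) and max (67) matter for uniform bin edges
ages = np.array([[23.0], [30.0], [41.0], [55.0], [67.0]])

discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
binned = discretizer.fit_transform(ages)

# Uniform edges are min + i * (max - min) / n_bins, i.e. 23 + i * 8.8
expected_edges = np.linspace(ages.min(), ages.max(), 5 + 1)

print(discretizer.bin_edges_[0])
print(expected_edges)
print(binned.ravel())  # bin index assigned to each age
```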


# Pipeline Example
**Use Case:** \
In this example, we use a Pipeline to preprocess our data, fit a model, and tune hyperparameters.

**Code Logic:** \
During the preprocessing, we handle features differently based on the feature's datatype. The preprocessing pipeline is then neatly wrapped up and fed into a modeling pipeline, where it utilizes GridSearchCV to tune the hyperparameters.

**Next Steps**: \
Model Evaluation

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Step 1: Identify the columns for each feature type
numeric_features = df.select_dtypes(include=['int64', 'float64']).columns
categorical_features = df.select_dtypes(include=['object']).columns
boolean_features = df.select_dtypes(include=['bool']).columns
datetime_features = df.select_dtypes(include=['datetime']).columns

# Step 2: Create preprocessing pipelines for each feature type
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

boolean_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

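# DateFeatures is a user-defined custom transformer (it is not part of scikit-learn)
# and must be defined before this cell runs
datetime_transformer = Pipeline(steps=[
    ('date_features', DateFeatures()),
    ('imputer', SimpleImputer(strategy='most_frequent'))])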

# Step 3: Use ColumnTransformer to apply the appropriate preprocessing to each subset of columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('bool', boolean_transformer, boolean_features),
        ('date', datetime_transformer, datetime_features)])

# Step 4: Add a model to the pipeline
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', RandomForestClassifier())])

# Now you can fit and predict using the model pipeline
model.fit(X_train, y_train)
predictions = model.predict(X_test)

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 5, 10],
}

# Create the grid search object
grid_search = GridSearchCV(model, param_grid, cv=5)

# Fit the grid search
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_

# Get the best model
best_model = grid_search.best_estimator_
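
`DateFeatures` in the cell above is not defined in this notebook and is not a scikit-learn class. A minimal sketch of what such a transformer could look like is below. The class body, the choice of year/month/day-of-week features, and the `signup` column are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DateFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: expands each datetime column into numeric parts."""

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        parts = {}
        for column in X.columns:
            dates = pd.to_datetime(X[column])
            parts[f"{column}_year"] = dates.dt.year
            parts[f"{column}_month"] = dates.dt.month
            parts[f"{column}_dayofweek"] = dates.dt.dayofweek  # Monday = 0
        return pd.DataFrame(parts, index=X.index)


example = pd.DataFrame({"signup": pd.to_datetime(["2024-06-17", None])})
features = DateFeatures().transform(example)
print(features)
```

Missing dates come out as NaN in every derived column, which is why an imputer placed after this step still has something to fill.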

# OrdinalEncoder
**Use Case:** \
Converts categorical data with ordinality into numerical data through ordinal encoding.

**Code Logic:**
1. Define a list of categories (in ascending order) for each column you want to encode. The order of these lists must match the order of the columns you are encoding.
2. Set the parameters for your ordinal encoder.
3. Define a list of columns you want to encode.
4. Use OrdinalEncoder in your pipeline.
5. Output the results

**Actionable Next Steps:** \
EDA \
Feature Engineering \
Modeling 

**Notes:** \
`remainder='passthrough'` ensures that features that aren't transformed are still kept in the output. \
`min_frequency` in `OrdinalEncoder` requires scikit-learn 1.3 or newer. At the time of writing (June 17th 2024), Kaggle notebooks use sklearn 1.2.2. I could not find any workarounds, so I switched to running the notebook locally on Jupyter Notebook. A workaround for Google Colab can be found in [this reddit post](https://www.reddit.com/r/GoogleColab/s/XxRXtUekR5).

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define the categories (in ascending order) for each column: 'Gender' first, then 'Gender2'
feature_categories = [
    ['M', 'F'],
    ['F','M']
]

# Define the OrdinalEncoder
ordinal_encoder = OrdinalEncoder(min_frequency=10,categories=feature_categories)

# Define the columns you want to apply the OrdinalEncoder to
columns_to_encode = ['Gender','Gender2']

# Define the ColumnTransformer
preprocessor = ColumnTransformer(
       transformers=[
        ('ord', ordinal_encoder, columns_to_encode)
    ],
    remainder='passthrough',
)

# Define your pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
])

output = pipeline.fit_transform(data2)
pd.DataFrame(output, columns=pipeline.named_steps['preprocessor'].get_feature_names_out())

Unnamed: 0,ord__Gender,ord__Gender2,remainder__Respondent,remainder__Age,remainder__Systolic Blood Pressure,remainder__Diastolic Blood Pressure,remainder__Height (cm),remainder__Weight (kg),remainder__BMI (kg/m2),remainder__Waist Circumference (cm),remainder__Fasting Glucose (mg/dL),remainder__Total Cholesterol (mg/dL),remainder__Triglycerides (mg/dL),remainder__Fat,remainder__Visceral Fat,remainder__Working Period
0,0.0,1.0,1.0,46.0,110.0,80.0,157.5,65.00,26.40,86.0,83.0,138.0,266.0,27.4,13.0,24.0
1,0.0,1.0,2.0,45.0,120.0,80.0,174.5,76.55,25.13,91.0,99.0,207.0,268.0,25.1,11.0,22.0
2,0.0,1.0,3.0,42.0,130.0,80.0,163.0,59.25,23.30,75.0,104.0,246.0,121.0,28.2,4.0,22.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
362,0.0,1.0,363.0,51.0,110.0,70.0,158.5,65.50,26.20,81.0,93.0,211.0,160.0,28.6,13.5,12.0
363,0.0,1.0,364.0,52.0,120.0,80.0,154.5,55.00,23.20,74.0,75.0,176.0,83.0,24.6,9.5,24.0
364,0.0,1.0,365.0,45.0,120.0,90.0,157.0,67.10,27.20,83.0,84.0,239.0,129.0,28.8,14.0,14.0
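
The output above shows the same value encoded as `0.0` in one column and `1.0` in the other, because each column has its own category order. A tiny sketch of that mapping, assuming a made-up three-row DataFrame (and leaving out `min_frequency` so it also runs on older scikit-learn versions):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: both columns hold the same values
toy = pd.DataFrame({'Gender': ['M', 'F', 'M'], 'Gender2': ['M', 'F', 'M']})

# First list applies to 'Gender' (M=0, F=1), second to 'Gender2' (F=0, M=1)
encoder = OrdinalEncoder(categories=[['M', 'F'], ['F', 'M']])
encoded = encoder.fit_transform(toy)

print(encoded)
```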


# One Hot Encoder
**Use Case:** \
One Hot Encoder is used when your categorical data has no ordinality.

**Code Logic:**
1. Set the parameters for OneHotEncoder
2. Incorporate it into your pipeline
3. Output results

**Actionable Next Steps:** \
EDA \
Feature Engineering \
Modeling 

**Notes:** \
`drop` parameter in OneHotEncoder can be added to remove redundant columns.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define the OneHotEncoder
one_hot_encoder = OneHotEncoder(min_frequency=10)

# Define the columns you want to apply the OneHotEncoder to
columns_to_encode = ['Gender']

inner_pipeline = Pipeline([
    ('onehot', one_hot_encoder),
])

# Define the ColumnTransformer
preprocessor = ColumnTransformer(
       transformers=[
        ('pipe', inner_pipeline, columns_to_encode),
    ],
    remainder='passthrough',
)

# Define your pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
])

output = pipeline.fit_transform(data2,data2['Total Cholesterol (mg/dL)'])
pd.DataFrame(output, columns=pipeline.named_steps['preprocessor'].get_feature_names_out())

Unnamed: 0,pipe__Gender_F,pipe__Gender_M,remainder__Respondent,remainder__Age,remainder__Systolic Blood Pressure,remainder__Diastolic Blood Pressure,remainder__Height (cm),remainder__Weight (kg),remainder__BMI (kg/m2),remainder__Waist Circumference (cm),remainder__Fasting Glucose (mg/dL),remainder__Total Cholesterol (mg/dL),remainder__Triglycerides (mg/dL),remainder__Fat,remainder__Visceral Fat,remainder__Working Period
0,0.0,1.0,1.0,46.0,110.0,80.0,157.5,65.00,26.40,86.0,83.0,138.0,266.0,27.4,13.0,24.00
1,0.0,1.0,2.0,45.0,120.0,80.0,174.5,76.55,25.13,91.0,99.0,207.0,268.0,25.1,11.0,22.00
2,0.0,1.0,3.0,42.0,130.0,80.0,163.0,59.25,23.30,75.0,104.0,246.0,121.0,28.2,4.0,22.00
3,0.0,1.0,4.0,35.0,130.0,80.0,165.0,57.80,21.20,76.0,84.0,270.0,170.0,25.6,6.0,8.00
4,0.0,1.0,5.0,46.0,150.0,100.0,153.0,61.60,26.30,90.0,89.0,173.0,114.0,25.8,13.0,24.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
360,0.0,1.0,361.0,49.0,110.0,78.0,161.5,60.90,23.50,78.0,79.0,190.0,167.0,25.5,9.0,5.94
361,0.0,1.0,362.0,43.0,130.0,80.0,152.5,77.40,33.28,102.0,86.0,246.0,86.0,33.0,22.0,25.00
362,0.0,1.0,363.0,51.0,110.0,70.0,158.5,65.50,26.20,81.0,93.0,211.0,160.0,28.6,13.5,12.00
363,0.0,1.0,364.0,52.0,120.0,80.0,154.5,55.00,23.20,74.0,75.0,176.0,83.0,24.6,9.5,24.00


# Target Encoder
**Use Case:** \
This is useful when your categorical data has ordinality but you do not know the correct order. \
`TargetEncoder` implements K-Fold Target encoding to minimize data leakage.

**Code Logic:**
1. Set the parameters for TargetEncoder
2. Incorporate it into your pipeline
3. Output results

**Actionable Next Steps:** \
EDA \
Feature Engineering \
Modeling 

**Notes:** \
Target Encoding is prone to overfitting if your training data isn't a good representation of your testing data. \
Be careful when you have multiple preprocessing steps for a single column: you will need to create a separate pipeline for that column to prevent duplicate columns from being generated. \
OrdinalEncoder isn't strictly necessary, it is just for example purposes.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define the OrdinalEncoder
ordinal_encoder = OrdinalEncoder(min_frequency=10)
# Define the TargetEncoder
target_encoder = TargetEncoder(target_type='continuous', random_state=314)

# Define the columns you want to apply the encoders to
columns_to_encode = ['Gender']

inner_pipeline = Pipeline([
    ('ordinal', ordinal_encoder),
    ('target', target_encoder)
])

# Define the ColumnTransformer
preprocessor = ColumnTransformer(
       transformers=[
        ('pipe', inner_pipeline, columns_to_encode),
    ],
    remainder='passthrough',
)

# Define your pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
])

output = pipeline.fit_transform(data2,data2['Total Cholesterol (mg/dL)'])
pd.DataFrame(output, columns=pipeline.named_steps['preprocessor'].get_feature_names_out())

Unnamed: 0,pipe__Gender,remainder__Respondent,remainder__Age,remainder__Systolic Blood Pressure,remainder__Diastolic Blood Pressure,remainder__Height (cm),remainder__Weight (kg),remainder__BMI (kg/m2),remainder__Waist Circumference (cm),remainder__Fasting Glucose (mg/dL),remainder__Total Cholesterol (mg/dL),remainder__Triglycerides (mg/dL),remainder__Fat,remainder__Visceral Fat,remainder__Working Period
0,200.295974,1.0,46.0,110.0,80.0,157.5,65.00,26.40,86.0,83.0,138.0,266.0,27.4,13.0,24.0
1,198.020538,2.0,45.0,120.0,80.0,174.5,76.55,25.13,91.0,99.0,207.0,268.0,25.1,11.0,22.0
2,198.020538,3.0,42.0,130.0,80.0,163.0,59.25,23.30,75.0,104.0,246.0,121.0,28.2,4.0,22.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
362,198.020538,363.0,51.0,110.0,70.0,158.5,65.50,26.20,81.0,93.0,211.0,160.0,28.6,13.5,12.0
363,199.475641,364.0,52.0,120.0,80.0,154.5,55.00,23.20,74.0,75.0,176.0,83.0,24.6,9.5,24.0
364,199.475641,365.0,45.0,120.0,90.0,157.0,67.10,27.20,83.0,84.0,239.0,129.0,28.8,14.0,14.0


# KNN Imputation
**Use Case:** \
Uses K Nearest Neighbors to impute missing data.

**Code Logic:**
1. Standard Scale
2. Impute using `KNNImputer` (adjust the parameters accordingly)
3. Reverse Standard Scaling

**Actionable Next Steps:** \
EDA \
Feature Engineering \
Modeling 

**Notes:** \
Although Standard Scaling isn't strictly necessary, it is highly advised. \
Reversing standard scaling is optional.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

# Define a custom transformer for StandardScaler that stores mean and std
class CustomStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.scaler = StandardScaler()
        self.mean_ = None
        self.scale_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X)
        self.mean_ = self.scaler.mean_
        self.scale_ = self.scaler.scale_
        return self

    def transform(self, X, y=None):
        return self.scaler.transform(X)

# Define a custom transformer to reverse standardization
class InverseStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self, scaler):
        self.scaler = scaler

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X * self.scaler.scale_ + self.scaler.mean_

# Define the transformers
scaler = CustomStandardScaler()
knn_imputer = KNNImputer(n_neighbors=1)
inverse_scaler = InverseStandardScaler(scaler)

# Define your pipeline
pipeline = Pipeline(steps=[
    ('scaler', scaler),
    ('knn', knn_imputer),
    ('inverse_scaler', inverse_scaler)
])

# Apply the pipeline to the data
output = pipeline.fit_transform(data4)

# Convert the output to a DataFrame
output_df = pd.DataFrame(output, columns=data4.columns)
display(data4)
display(output_df)

Unnamed: 0,A,B,C
0,3.0,2,20
1,2.0,3,15
2,,1,10


Unnamed: 0,A,B,C
0,3.0,2.0,20.0
1,2.0,3.0,15.0
2,3.0,1.0,10.0
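
As an alternative to the two custom classes, `StandardScaler` already provides `inverse_transform`, so the same scale, impute, reverse flow can be written outside a pipeline. A minimal sketch, assuming a small made-up array (not `data4`) in which the nearest neighbor of the incomplete row is unambiguous:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: row 2 is missing its first value and is closest to row 1
data = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [np.nan, 21.0, 210.0],
    [5.0, 50.0, 500.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)            # NaNs are ignored in fit and kept in transform

imputed_scaled = KNNImputer(n_neighbors=1).fit_transform(scaled)
restored = scaler.inverse_transform(imputed_scaled)  # back to the original units

print(restored)
```

The custom classes above remain useful when the reverse step has to live inside a single `Pipeline` object.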


# Custom KNN Imputation (EXPERIMENTAL)
**Use Case:** \
K Nearest Neighbors generally assigns equal weights to all features when calculating distance. This can lead to irrelevant features affecting the total distance between rows. In this proposed Custom KNN Imputation, the distance of each feature is multiplied by a weight before calculating the total distance between rows. In this case, the weights are set to the absolute value of Pearson's correlation coefficient, ensuring that the features most strongly correlated with the feature being imputed affect the total distance the most.

**Code Logic:**
1. Define a function to handle missing values
2. Define a custom distance function
3. Define a custom imputer class that calculates weights and assigns them to a KNN Imputer
4. Implement the custom imputer in a pipeline

**Actionable Next Steps:** \
EDA \
Feature Engineering \
Modeling 

**Notes:** \
This method fits a separate KNN Imputer for each feature, which can be computationally expensive. \
[Reference](https://www.mdpi.com/2073-8994/15/12/2185)

In [None]:
import numpy as np
from scipy.spatial.distance import minkowski as scipy_minkowski
import math
import numbers

def is_scalar_nan(x):
    return (
        not isinstance(x, numbers.Integral)
        and isinstance(x, numbers.Real)
        and math.isnan(x)
    )


def minkowski(u, v, p=2, w=None, missing_values=np.nan):
    """Calculate the Minkowski distance between two 1D arrays.

    This function ignores coordinates with a missing value in u or v.

    Parameters
    ----------
    u, v : array-like of shape (n_features,)
        Input arrays.

    p : integer, default=2
        The power of the Minkowski distance.

    w : array-like of shape (n_features,), default=None
        The weights for each value in the Minkowski calculation. If `None`, then equal weights are assumed.

    missing_values : np.nan, float or int, default=np.nan
        Representation of missing value.

    Returns
    -------
    distance : float
        The Minkowski distance between vectors `u` and `v`.
    """

    # Ensure that u and v are numpy arrays
    u = np.asarray(u)
    v = np.asarray(v)

    # Create a boolean mask for missing values
    missing_u = np.isnan(u) if is_scalar_nan(missing_values) else u == missing_values
    missing_v = np.isnan(v) if is_scalar_nan(missing_values) else v == missing_values

    # Calculate the Minkowski distance for non-missing coordinates
    mask = ~(missing_u | missing_v)
    u = u[mask]
    v = v[mask]
    if w is not None:
        w = np.asarray(w)
        w = w[mask]

    return scipy_minkowski(u, v, p, w)

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import KNNImputer
from functools import partial
import pandas as pd

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, n_neighbors=1):
        self.n_neighbors = n_neighbors
        self.imputers_ = None
        self.weights_ = None

    def fit(self, X, y=None):
        self.imputers_ = {}
        self.weights_ = {}
        for column in X.columns:
            # Calculate weights based on correlation coefficients
            weights = X.corrwith(X[column]).abs().fillna(0).values
            # Store the weights
            self.weights_[column] = weights
            # Create a KNNImputer that uses the weighted distance and fit it
            custom_minkowski = partial(minkowski, w=weights)
            imputer = KNNImputer(n_neighbors=self.n_neighbors, metric=custom_minkowski)
            imputer.fit(X)
            self.imputers_[column] = imputer
        return self

    def transform(self, X):
        X_imputed = pd.DataFrame()
        for column in X.columns:
            # Use the fitted KNNImputer for this column to transform it
            imputer = self.imputers_[column]
            X_imputed[column] = pd.DataFrame(imputer.transform(X), columns=X.columns, index=X.index)[column]
        return X_imputed

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define the CustomImputer
custom_imputer = CustomImputer(n_neighbors=1)

# Define your pipeline
pipeline = Pipeline(steps=[
    ('knn',custom_imputer)
])

output = pipeline.fit_transform(data4)
output

Unnamed: 0,A,B,C
0,3.0,2.0,20.0
1,2.0,3.0,15.0
2,2.0,1.0,10.0


# KNN Imputer with Custom Metric
**Use Case:** \
Define your own distance metric for KNN imputation.

**Code Logic:**
1. Set the parameters for KNN Imputer (set `metric` for custom distance calculations)
2. Implement KNN in a pipeline
3. Use output

**Actionable Next Steps:** \
EDA \
Feature Engineering \
Modeling 

**Notes:** \
The Minkowski metric that can handle NaN values is defined in the `Custom KNN Imputation (EXPERIMENTAL)` section.

In [None]:
from sklearn.impute import KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define the KNNImputer
knn_imputer = KNNImputer(n_neighbors=1,metric=minkowski)

# Define your pipeline
pipeline = Pipeline(steps=[
    ('knn',knn_imputer)
])

output = pipeline.fit_transform(data4)
pd.DataFrame(output, columns=pipeline.get_feature_names_out())

Unnamed: 0,A,B,C
0,3.0,2.0,20.0
1,2.0,3.0,15.0
2,2.0,1.0,10.0


# Iterative Imputer
**Use Case:** \
Iterative imputer allows you to fit a model for imputation. It starts by using simple imputation, then iteratively predicts missing values using a chosen model.

**Code Logic:**
1. `estimator` is the imputation model, default is `BayesianRidge()`
2. `max_iter` is the maximum iterations of imputation
3. `initial_strategy` is the initial imputation method before using a model (`fill_value` only sets the constant used when `initial_strategy='constant'`)
4. `random_state` is the random seed
5. `verbose` allows you to view the changes and debug the imputation iterations.

**Actionable Next Steps:** \
EDA \
Feature Engineering \
Modeling 

**Notes:** \
A nice [visual explanation](https://youtu.be/zX-pacwVyvU?si=DQXhsZqSzEyfU_vf) on how this imputation works. \
Depending on the model used, you might need to scale the data before imputation.

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
import xgboost as xgb

# Define the IterativeImputer
iterative_imputer = IterativeImputer(
    estimator=xgb.XGBRegressor(n_estimators=100),
    max_iter=10,
    initial_strategy='median',  # simple imputation used before the first model round
    random_state=314,
    verbose=3,
)

# Define your pipeline
pipeline = Pipeline(steps=[
    ('iterative',iterative_imputer)
])

output = pipeline.fit_transform(data4)
pd.DataFrame(output, columns=pipeline.get_feature_names_out())

[IterativeImputer] Completing matrix with shape (3, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.11
[IterativeImputer] Change: 0.4991161823272705, scaled tolerance: 0.02 
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.14
[IterativeImputer] Change: 0.0, scaled tolerance: 0.02 
[IterativeImputer] Early stopping criterion reached.


Unnamed: 0,A,B,C
0,3.0,2.0,20.0
1,2.0,3.0,15.0
2,2.999116,1.0,10.0
