<font color="red" size=6>8_Missing_data</font>
<p><font color="yellow" size=5>Imputation Techniques</font></p>

In scikit-learn, imputation techniques are used to handle missing values in datasets. The library provides various tools to perform imputation efficiently, including strategies in the SimpleImputer and IterativeImputer classes.

<font color="pink" size=4>1.Mean Imputation</font>

Mean imputation replaces missing values in a dataset with the mean of the non-missing values for that feature (column). It is a simple and commonly used imputation technique for numerical data.

<font color="pink" size=4>Steps</font>
<ol>
   <li> Calculate the mean of the feature (excluding missing values).</li>
     <li>Replace all missing values in the feature with the calculated mean.</li></ol>

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Example Dataset with Missing Values
data = {
    'Feature1': [1, 2, np.nan, 4],
    'Feature2': [5, np.nan, np.nan, 8],
    'Feature3': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize SimpleImputer with 'mean' strategy
imputer = SimpleImputer(strategy='mean')

# Apply imputation
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nDataFrame after Mean Imputation:")
print(df_imputed)


Original DataFrame:
   Feature1  Feature2  Feature3
0       1.0       5.0         9
1       2.0       NaN        10
2       NaN       NaN        11
3       4.0       8.0        12

DataFrame after Mean Imputation:
   Feature1  Feature2  Feature3
0  1.000000       5.0       9.0
1  2.000000       6.5      10.0
2  2.333333       6.5      11.0
3  4.000000       8.0      12.0


<font color="orange">For Feature1:</font>
<ol>
    <li>Mean = (1 + 2 + 4) / 3 ≈ 2.3333</li>
    <li>Missing value replaced with 2.3333</li></ol>
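As a cross-check of the arithmetic above, the same fill values can be reproduced with plain pandas and compared against the means the imputer stores in its `statistics_` attribute. This is a minimal sketch on the same toy data; the names `manual` and `sk_result` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'Feature1': [1, 2, np.nan, 4],
    'Feature2': [5, np.nan, np.nan, 8],
    'Feature3': [9, 10, 11, 12],
})

# pandas skips NaN when computing column means, which matches step 1
manual = df.fillna(df.mean())

imputer = SimpleImputer(strategy='mean')
sk_result = imputer.fit_transform(df)

# statistics_ holds the per-column fill values learned during fit
print(imputer.statistics_)
# Both approaches should produce the same filled table
print(np.allclose(manual.to_numpy(), sk_result))
```

The `fillna` version is handy for quick exploration, while the fitted imputer can be reused to apply the same training-set means to new data.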

<font color="pink" size=4>2.Median Imputation</font>

Median imputation replaces missing values in a dataset with the median of the non-missing values for that feature (column). It is particularly useful for numerical data that is skewed or has outliers, as the median is more robust than the mean in such cases.

In [2]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Example Dataset with Missing Values
data = {
    'Feature1': [1, 2, np.nan, 100],  # Skewed due to an outlier (100)
    'Feature2': [5, np.nan, np.nan, 8],
    'Feature3': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize SimpleImputer with 'median' strategy
imputer = SimpleImputer(strategy='median')

# Apply imputation
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nDataFrame after Median Imputation:")
print(df_imputed)


Original DataFrame:
   Feature1  Feature2  Feature3
0       1.0       5.0         9
1       2.0       NaN        10
2       NaN       NaN        11
3     100.0       8.0        12

DataFrame after Median Imputation:
   Feature1  Feature2  Feature3
0       1.0       5.0       9.0
1       2.0       6.5      10.0
2       2.0       6.5      11.0
3     100.0       8.0      12.0


<font color="orange">Explanation</font>
<ol>
  <li><b>For Feature1:</b>
       <ol><li>Median = 2.0 (sorted values: 1, 2, 100).</li>
       <li>Missing value replaced with 2.0.</li></ol></li>
</ol>
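To see why the median is the safer choice here, it helps to compare it with the mean of the same column. The sketch below uses only pandas on the Feature1 values from the example; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

feature1 = pd.Series([1, 2, np.nan, 100])

# Both statistics ignore the NaN entry
mean_fill = feature1.mean()      # (1 + 2 + 100) / 3, pulled up by the outlier
median_fill = feature1.median()  # middle of the sorted values 1, 2, 100

print(f"mean fill:   {mean_fill:.2f}")
print(f"median fill: {median_fill:.2f}")
```

The mean lands far from the two typical values because of the single outlier, whereas the median stays at 2.0.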

<font color="pink" size=4>3.Most Frequent (Mode) Imputation</font>

Most frequent imputation replaces missing values with the most frequently occurring value (mode) in the column. This technique is particularly suitable for categorical data, but it can also be applied to numerical data with repeating values.

In [4]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Example Dataset with Missing Values
data = {
    'Feature1': ['red', 'blue', np.nan, 'blue', 'red', 'blue'],  # Categorical
    'Feature2': [1, 2, 2, np.nan, 2, 2],  # Numerical
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize SimpleImputer with 'most_frequent' strategy
imputer = SimpleImputer(strategy='most_frequent')

# Apply imputation
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nDataFrame after Most Frequent Imputation:")
print(df_imputed)


Original DataFrame:
  Feature1  Feature2
0      red       1.0
1     blue       2.0
2      NaN       2.0
3     blue       NaN
4      red       2.0
5     blue       2.0

DataFrame after Most Frequent Imputation:
  Feature1 Feature2
0      red      1.0
1     blue      2.0
2     blue      2.0
3     blue      2.0
4      red      2.0
5     blue      2.0


<font color="orange">Explanation</font>
<ol>
     <li><font color="sky blue">For Feature1 (categorical):</font>
       <ol><li>Most frequent value = blue (appears 3 times).</li>
        <li>Missing value replaced with blue.</li></ol></li>
     <li><font color="sky blue">For Feature2 (numerical):</font>
       <ol><li>Most frequent value = 2.0 (appears 4 times).</li>
       <li>Missing value replaced with 2.0.</li></ol></li>
</ol>
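The counts quoted above can be checked directly with `value_counts()`, which ignores NaN by default, and compared with the fill values the imputer learned. This is a small sketch on the same toy data, not part of the original notebook run.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'Feature1': ['red', 'blue', np.nan, 'blue', 'red', 'blue'],
    'Feature2': [1, 2, 2, np.nan, 2, 2],
})

# Frequency of each observed value per column
counts1 = df['Feature1'].value_counts()
counts2 = df['Feature2'].value_counts()
print(counts1)
print(counts2)

# The fitted imputer stores the chosen fill value for each column
imputer = SimpleImputer(strategy='most_frequent').fit(df)
print(imputer.statistics_)
```

Because the frame mixes strings and numbers, the imputer works on object data, which is likely why Feature2 is printed as an object column in the output above.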

<font color="pink" size=4>4.Constant Value Imputation</font>

Constant value imputation involves replacing missing values with a user-specified constant. This method can be used for both categorical and numerical data. It is particularly useful when you want to assign a meaningful default value to missing entries, such as "Unknown" for categorical data or 0 for numerical data.

In [5]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Example Dataset with Missing Values
data = {
    'Feature1': ['apple', 'banana', np.nan, 'orange'],  # Categorical
    'Feature2': [10, 20, np.nan, 40],  # Numerical
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize SimpleImputer with 'constant' strategy
imputer = SimpleImputer(strategy='constant', fill_value='missing')  # For categorical
df_categorical_imputed = pd.DataFrame(imputer.fit_transform(df[['Feature1']]), columns=['Feature1'])

imputer = SimpleImputer(strategy='constant', fill_value=0)  # For numerical
df_numerical_imputed = pd.DataFrame(imputer.fit_transform(df[['Feature2']]), columns=['Feature2'])

# Combine the results
df_imputed = pd.concat([df_categorical_imputed, df_numerical_imputed], axis=1)

print("\nDataFrame after Constant Value Imputation:")
print(df_imputed)


Original DataFrame:
  Feature1  Feature2
0    apple      10.0
1   banana      20.0
2      NaN       NaN
3   orange      40.0

DataFrame after Constant Value Imputation:
  Feature1  Feature2
0    apple      10.0
1   banana      20.0
2  missing       0.0
3   orange      40.0
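For quick exploration, the same per-column constants can also be applied with pandas alone by passing a dictionary to `fillna`. This is an alternative sketch rather than the scikit-learn approach shown above; the `fill_values` mapping is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Feature1': ['apple', 'banana', np.nan, 'orange'],
    'Feature2': [10, 20, np.nan, 40],
})

# One constant per column: a label for the categorical, 0 for the numerical
fill_values = {'Feature1': 'missing', 'Feature2': 0}
df_filled = df.fillna(value=fill_values)

print(df_filled)
```

Unlike `SimpleImputer`, this keeps each column's dtype and needs no concatenation step, but it cannot be dropped into a scikit-learn pipeline.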


<font color="pink" size=4>5.Iterative Imputer</font>

The IterativeImputer in scikit-learn is a multivariate imputation method that models each feature with missing values as a function of the other features. It iteratively predicts missing values by fitting a regression model for each feature with missing values, based on the other features.

**How It Works**

<ol>
<li>Initialize missing values using a simple method (e.g., mean or median).</li>
<li><font color = "yellow">For each feature with missing values:</font>
<ol>
<li>Treat it as the target variable.</li>
<li>Use the other features as predictors in a regression model to predict the missing values.</li>
</ol></li>
<li>Repeat this process for all features with missing values; one pass over them is one iteration.</li>
<li>Continue iterating until convergence or until the specified maximum number of iterations is reached.</li></ol>
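The steps above can be sketched as a plain loop. This is a simplified illustration of the round-robin idea, not scikit-learn's actual implementation: it assumes `LinearRegression` as the per-feature model (the library default is BayesianRidge), uses a hypothetical toy array, and stops on a simple absolute-change threshold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; the columns are roughly linearly related
X = np.array([
    [1.0, 2.0, 3.1],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, np.nan],
    [np.nan, 8.0, 12.2],
    [5.0, 10.0, 15.0],
    [6.0, 12.0, 18.1],
])
missing = np.isnan(X)

# Initialization: fill every gap with its column mean
X_filled = np.where(missing, np.nanmean(X, axis=0), X)

for iteration in range(10):              # cap on the number of passes
    X_prev = X_filled.copy()
    for j in range(X.shape[1]):          # visit each feature in turn
        rows_missing = missing[:, j]
        if not rows_missing.any():
            continue
        # Feature j is the target; all other (currently filled) features predict it
        predictors = np.delete(X_filled, j, axis=1)
        model = LinearRegression().fit(
            predictors[~rows_missing], X_filled[~rows_missing, j]
        )
        X_filled[rows_missing, j] = model.predict(predictors[rows_missing])
    # Stop once a full pass changes the estimates only marginally
    if np.max(np.abs(X_filled - X_prev)) < 1e-3:
        break

print(np.round(X_filled, 2))
```

Only the originally missing cells are ever overwritten; observed values act as fixed training targets throughout.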

**Key Parameters**

<ol>
<li><font color = "yellow">estimator:</font>
    The model used to predict missing values (default: BayesianRidge).</li>
<li><font color = "yellow">max_iter:</font>
    Maximum number of iterations to run (default: 10).</li>
<li><font color = "yellow">random_state:</font>
    Seed that ensures reproducibility of results.</li>
<li><font color = "yellow">tol:</font>
    Convergence threshold; iterations stop early if changes are smaller than this value.</li>
</ol>
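The sketch below shows these parameters set explicitly and then reads back `n_iter_`, the number of rounds the fitted imputer actually ran. The synthetic data and the specific values (`max_iter=25`, `tol=1e-3`) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic data where the third column depends on the first two
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=50)

# Knock out roughly 10% of the entries at random
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

imputer = IterativeImputer(
    estimator=BayesianRidge(),  # model fitted for each feature
    max_iter=25,                # upper bound on imputation rounds
    tol=1e-3,                   # early-stopping tolerance
    random_state=0,             # reproducible results
)
X_imputed = imputer.fit_transform(X_missing)

print("rounds performed:", imputer.n_iter_)
```

If `n_iter_` equals `max_iter`, the imputer may not have converged and a larger `max_iter` or looser `tol` may be worth trying.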

In [7]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # Required to enable IterativeImputer
from sklearn.impute import IterativeImputer

# Example Dataset with Missing Values
data = {
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [np.nan, 3.0, 6.0, 8.0],
    'Feature3': [7.0, 8.0, 9.0, np.nan],
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)

# Apply the IterativeImputer
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nDataFrame after Iterative Imputation:")
print(df_imputed)


Original DataFrame:
   Feature1  Feature2  Feature3
0       1.0       NaN       7.0
1       2.0       3.0       8.0
2       NaN       6.0       9.0
3       4.0       8.0       NaN

DataFrame after Iterative Imputation:
   Feature1  Feature2  Feature3
0  1.000000  0.746042  7.000000
1  2.000000  3.000000  8.000000
2  3.162385  6.000000  9.000000
3  4.000000  8.000000  9.792037


**Explanation**


The imputer uses all available data to iteratively estimate missing values.
<font color = "yellow">For example:</font>
<ol>
<li>Missing value in Feature1 was estimated using relationships with Feature2 and Feature3.</li>
<li>Missing value in Feature3 was estimated using Feature1 and Feature2.</li></ol>

<font color = "yellow" size="5">5.1_IterativeImputer with Random Forest Regressor</font>

The IterativeImputer can be customized to use a Random Forest Regressor as the estimator for predicting missing values. This can improve imputation accuracy when the data has non-linear relationships or complex patterns, as Random Forest is a robust and flexible non-linear model.

In [8]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # Required to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Example Dataset with Missing Values
data = {
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [np.nan, 3.0, 6.0, 8.0],
    'Feature3': [7.0, 8.0, 9.0, np.nan],
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize IterativeImputer with RandomForestRegressor
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100, random_state=0), max_iter=10, random_state=0)

# Apply the imputer
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nDataFrame after Iterative Imputation with Random Forest Regressor:")
print(df_imputed)


Original DataFrame:
   Feature1  Feature2  Feature3
0       1.0       NaN       7.0
1       2.0       3.0       8.0
2       NaN       6.0       9.0
3       4.0       8.0       NaN

DataFrame after Iterative Imputation with Random Forest Regressor:
   Feature1  Feature2  Feature3
0      1.00       4.4      7.00
1      2.00       3.0      8.00
2      2.62       6.0      9.00
3      4.00       8.0      8.53


<font color = "yellow" size="5">5.2_IterativeImputer with Decision Tree Regressor</font>

The IterativeImputer can be paired with a Decision Tree Regressor to impute missing values in a dataset. A Decision Tree Regressor is particularly useful when the data has non-linear patterns or interactions between features. Note that scikit-learn's decision trees treat every feature as numeric, so categorical features must be encoded first; with ordinal or label encoding, the tree splits on the codes as if they were ordered values.

In [9]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # Required to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

# Example Dataset with Missing Values
data = {
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [np.nan, 3.0, 6.0, 8.0],
    'Feature3': [7.0, 8.0, 9.0, np.nan],
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize IterativeImputer with DecisionTreeRegressor
imputer = IterativeImputer(
    estimator=DecisionTreeRegressor(max_depth=5, random_state=0), 
    max_iter=10, 
    random_state=0
)

# Apply the imputer
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nDataFrame after Iterative Imputation with Decision Tree Regressor:")
print(df_imputed)


Original DataFrame:
   Feature1  Feature2  Feature3
0       1.0       NaN       7.0
1       2.0       3.0       8.0
2       NaN       6.0       9.0
3       4.0       8.0       NaN

DataFrame after Iterative Imputation with Decision Tree Regressor:
   Feature1  Feature2  Feature3
0       1.0       6.0       7.0
1       2.0       3.0       8.0
2       1.0       6.0       9.0
3       4.0       8.0       8.0


<font color = "yellow" size="4">Customizing the Decision Tree</font>

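You can fine-tune the Decision Tree Regressor by specifying parameters such as max_depth, min_samples_split, and min_samples_leaf. For example:

from sklearn.experimental import enable_iterative_imputer  # Required to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor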

imputer = IterativeImputer(
    estimator=DecisionTreeRegressor(max_depth=3, min_samples_split=5, random_state=42),
    max_iter=15,
    random_state=42
)


**Best Practices**

<ol>
<li><font color = "yellow">Preprocessing:</font>
        Encode categorical variables (e.g., using one-hot or label encoding).</li>
<li><font color = "yellow">Regularization:</font>
        Use hyperparameters like max_depth and min_samples_leaf to prevent overfitting.</li>
<li><font color = "yellow">Validation:</font>
        Check the imputation quality using cross-validation or by examining downstream model performance.</li></ol>
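One way to carry out the validation step is to put each candidate imputer in a pipeline with the same downstream model and compare cross-validated scores. The sketch below is an illustration under assumptions: the synthetic regression data, the 15% missing rate, the `Ridge` model, and the candidate settings are all hypothetical choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with about 15% of the feature values removed at random
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan

candidates = {
    'mean baseline': SimpleImputer(strategy='mean'),
    'tree, depth 3': IterativeImputer(
        estimator=DecisionTreeRegressor(max_depth=3, random_state=0),
        max_iter=5, random_state=0),
    'tree, depth 8': IterativeImputer(
        estimator=DecisionTreeRegressor(max_depth=8, random_state=0),
        max_iter=5, random_state=0),
}

scores = {}
for name, imputer in candidates.items():
    # The imputer is refit inside every fold, so no test-fold data leaks in
    model = make_pipeline(imputer, Ridge())
    scores[name] = cross_val_score(model, X_missing, y, cv=5, scoring='r2').mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

Tree-based iterative imputation may emit convergence warnings with few iterations; the point of the comparison is the downstream score, not convergence of the imputer itself.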

<font color = "yellow" size="5">6_KNNImputer</font>

The KNNImputer replaces missing values in a dataset by using the values of the k-nearest neighbors of each data point with a missing value. The key idea is that data points that are similar (i.e., near each other in feature space) are likely to have similar values.

In [16]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Example Dataset with Missing Values
data = {
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [np.nan, 3.0, 6.0, 8.0],
    'Feature3': [7.0, 8.0, 9.0, np.nan],
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize KNNImputer
imputer = KNNImputer(n_neighbors=2, weights='uniform')

# Apply the KNNImputer
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nDataFrame after KNN Imputation:")
print(df_imputed)


Original DataFrame:
   Feature1  Feature2  Feature3
0       1.0       NaN       7.0
1       2.0       3.0       8.0
2       NaN       6.0       9.0
3       4.0       8.0       NaN

DataFrame after KNN Imputation:
   Feature1  Feature2  Feature3
0       1.0       4.5       7.0
1       2.0       3.0       8.0
2       2.5       6.0       9.0
3       4.0       8.0       8.0
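To make the neighbor search concrete, the distances behind the first imputed value can be reproduced with `nan_euclidean_distances`, the NaN-aware metric that `KNNImputer` uses by default. The sketch below handles only row 0 and relies on the fact that every other row has Feature2 observed, so no donor filtering is needed for this particular row.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [1.0, np.nan, 7.0],
    [2.0, 3.0, 8.0],
    [np.nan, 6.0, 9.0],
    [4.0, 8.0, np.nan],
])

# Pairwise distances computed on the coordinates both rows have observed,
# rescaled to account for the coordinates that were skipped
dist = nan_euclidean_distances(X)
print(np.round(dist[0], 3))

# Two nearest rows to row 0, skipping row 0 itself (distance 0)
donors = np.argsort(dist[0])[1:3]
feature2_fill = X[donors, 1].mean()

print("donor rows:", donors)
print("Feature2 fill for row 0:", feature2_fill)
```

Rows 1 and 2 are the closest, and averaging their Feature2 values (3.0 and 6.0) gives the 4.5 shown in the imputed table.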


<font color = "yellow" size="5">7_MissingIndicator in Scikit-learn</font>

The MissingIndicator class in scikit-learn is a transformer used to create an indicator matrix indicating where missing values (NaN) are located in the original dataset. It is useful when you want to retain the information about the presence of missing values as part of the features for a machine learning model.

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Example dataset with missing values
data = {
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [np.nan, 3.0, 6.0, 8.0],
    'Feature3': [7.0, 8.0, 9.0, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize the MissingIndicator
indicator = MissingIndicator()

# Fit and transform the data
indicator_matrix = indicator.fit_transform(df)

# Create a DataFrame for the indicator matrix
df_indicator = pd.DataFrame(indicator_matrix, columns=[f"{col}_missing" for col in df.columns])

print("\nMissing Value Indicator Matrix:")
print(df_indicator)


Original DataFrame:
   Feature1  Feature2  Feature3
0       1.0       NaN       7.0
1       2.0       3.0       8.0
2       NaN       6.0       9.0
3       4.0       8.0       NaN

Missing Value Indicator Matrix:
   Feature1_missing  Feature2_missing  Feature3_missing
0             False              True             False
1             False             False             False
2              True             False             False
3             False             False              True
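When the indicator columns are meant to sit alongside the imputed features, `SimpleImputer` can produce both in one step through its `add_indicator` parameter. The sketch below uses the same toy data; the name `combined` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [np.nan, 3.0, 6.0, 8.0],
    'Feature3': [7.0, 8.0, 9.0, np.nan],
})

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy='mean', add_indicator=True)
combined = imputer.fit_transform(df)

print(combined.shape)  # three imputed columns followed by three indicator columns
print(combined)
```

This keeps the "was missing" signal available to the model even after the gaps themselves have been filled.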


<font color = "yellow" size="5">8_Using Pipelines for Imputation in Scikit-learn</font>

In scikit-learn, Pipelines provide a convenient way to organize and streamline the process of transforming data and applying models. A pipeline allows you to chain multiple steps together, such as data preprocessing, imputation, feature scaling, and model training.

By combining imputation into a pipeline, you can ensure that missing data is handled consistently before training your model. This is particularly useful for creating robust workflows in machine learning and ensuring the steps are applied in the correct order.

In [24]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Example Dataset with Missing Values
data = {
    'Feature1': [1.0, 2.0, np.nan, 4.0],
    'Feature2': [np.nan, 3.0, 6.0, 8.0],
    'Feature3': [7.0, 8.0, 9.0, np.nan],
    'Target': [0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Split the dataset into features and target
X = df.drop('Target', axis=1)
y = df['Target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Define a pipeline with imputation and model
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Imputation step
    ('classifier', RandomForestClassifier())     # Model step
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")


Model Accuracy: 0.00%
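With only four rows, the test set above holds a single sample, so the reported accuracy can only be 0% or 100% and says little about the workflow. The sketch below repeats the same pipeline on a larger synthetic dataset (an assumption made purely for illustration) and checks that the imputer inside the pipeline learned its means from the training split only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical synthetic data: the label depends on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # remove about 10% of the values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X_train, y_train)

# The fill values come from the training rows alone
learned_means = pipeline.named_steps['imputer'].statistics_
print(np.allclose(learned_means, np.nanmean(X_train, axis=0)))

# At predict time the same training means are reused on the test rows
accuracy = pipeline.score(X_test, y_test)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
```

Fitting the imputer inside the pipeline is what prevents test-set statistics from leaking into training.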


<font color = "yellow" size="5">9_ColumnTransformer</font>
ColumnTransformer allows you to apply different preprocessing techniques to different subsets of columns in your dataset. It’s typically used in a pipeline to apply transformations like imputation, scaling, or encoding to the relevant features.

We'll use a small dataset with both numerical and categorical columns. We'll:
<ol>
    <li>Impute missing values in numerical columns.</li>
    <li>Scale the numerical columns.</li>
    <li>One-hot encode the categorical columns.</li></ol>

In [25]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

# Example dataset with missing values
data = {
    'Age': [25, 30, np.nan, 35],
    'Salary': [50000, 60000, 65000, np.nan],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York']
}

df = pd.DataFrame(data)

# Define the columns for transformation
numerical_features = ['Age', 'Salary']
categorical_features = ['Gender', 'City']

# Create the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values
            ('scaler', StandardScaler())                 # Scale numerical columns
        ]), numerical_features),                         # Apply to numerical columns
        
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values for categorical
            ('onehot', OneHotEncoder())                            # One-hot encode categorical columns
        ]), categorical_features)                             # Apply to categorical columns
    ],
    remainder='passthrough'  # Columns not specified will be passed through unchanged
)

# Apply the transformations
X_transformed = preprocessor.fit_transform(df)

# Dynamically generate column names for the transformed data
# Get the column names for OneHotEncoding
ohe_columns = preprocessor.transformers_[1][1].named_steps['onehot'].get_feature_names_out(categorical_features)

# Combine column names for numerical features and the one-hot encoded columns
columns = numerical_features + ohe_columns.tolist()

# Convert the result into a DataFrame for better visualization
X_transformed_df = pd.DataFrame(X_transformed, columns=columns)

print(X_transformed_df)


        Age    Salary  Gender_Female  Gender_Male  City_Chicago  \
0 -1.414214 -1.543033            0.0          1.0           0.0   
1  0.000000  0.308607            1.0          0.0           0.0   
2  0.000000  1.234427            1.0          0.0           1.0   
3  1.414214  0.000000            0.0          1.0           0.0   

   City_Los Angeles  City_New York  
0               0.0            1.0  
1               1.0            0.0  
2               0.0            0.0  
3               0.0            1.0  


**Explanation of Key Parts:**

<ol>
<li><font color = "yellow">Imputation:</font>
        We use SimpleImputer(strategy='mean') for the numerical columns (Age, Salary) to replace missing values with the column mean.
        For the categorical columns (Gender, City), we use SimpleImputer(strategy='most_frequent') to replace missing values with the most frequent category.</li>

<li><font color = "yellow">Scaling:</font>
        We apply StandardScaler to scale the numerical columns to have zero mean and unit variance. Because the missing Age and Salary entries were filled with the column mean, they become 0 after scaling.</li>

<li><font color = "yellow">One-Hot Encoding:</font>
        OneHotEncoder is used for the categorical columns (Gender, City) to create binary columns representing each category.</li>

<li><font color = "yellow">Dynamic Column Names:</font>
        For the categorical columns, we retrieve the column names generated by OneHotEncoder using get_feature_names_out().
        The final column names are combined from the numerical columns and the one-hot encoded categorical columns.</li>

<li><font color = "yellow">remainder='passthrough':</font>
        This ensures that any columns not specified in the ColumnTransformer are passed through unchanged. In this case, all columns are processed, so it doesn’t affect the result.</li>
</ol>