# **Machine Learning**

# Simple to Advanced Methods to Handle Missing Values

# Mean Imputation

> **Definition**: Mean imputation replaces missing values with the mean (average) of the available values in a feature.
>
> **Use Case**: Best used for continuous data without outliers, as outliers can skew the mean.

In [114]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [115]:
# Let's Load the Penguins Dataset using Seaborn.
df = sns.load_dataset("penguins")
print(df.head())

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female  


In [116]:
# Let's first check for missing values and their percentage in the dataset
print(df.isnull().sum().sort_values(ascending=False))
print("Percentage of Missing Values")
print(df.isnull().sum().sort_values(ascending=False) / len(df) * 100)

sex                  11
bill_depth_mm         2
bill_length_mm        2
flipper_length_mm     2
body_mass_g           2
island                0
species               0
dtype: int64
Percentage of Missing Values
sex                  3.197674
bill_depth_mm        0.581395
bill_length_mm       0.581395
flipper_length_mm    0.581395
body_mass_g          0.581395
island               0.000000
species              0.000000
dtype: float64


In [117]:
# Let's impute bill_depth_mm, bill_length_mm using mean of them
df["bill_depth_mm"] = df["bill_depth_mm"].fillna(df["bill_depth_mm"].mean()) 
df["bill_length_mm"] = df["bill_length_mm"].fillna(df["bill_length_mm"].mean())
# Let's again check for missing values after imputation
print("Missing Values after Imputation")
print(df.isnull().sum().sort_values(ascending=False))
print("Percentage of Missing Values after Imputation")
print(df.isnull().sum().sort_values(ascending=False) / len(df) * 100)


Missing Values after Imputation
sex                  11
flipper_length_mm     2
body_mass_g           2
bill_length_mm        0
island                0
species               0
bill_depth_mm         0
dtype: int64
Percentage of Missing Values after Imputation
sex                  3.197674
flipper_length_mm    0.581395
body_mass_g          0.581395
bill_length_mm       0.000000
island               0.000000
species              0.000000
bill_depth_mm        0.000000
dtype: float64


# Median Imputation

> **Definition**: Median imputation replaces missing values with the median (the middle value) of the available values in a feature.
>
> **Use Case**: Ideal for continuous data with outliers, as the median is less affected by extreme values.
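
To make the difference between the two use cases concrete, here is a minimal sketch on a small made-up series (the numbers are illustrative and are not taken from the penguins data). It shows how a single extreme value drags the mean fill value away from the bulk of the data while the median stays put.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one extreme value (1000.0) and one missing entry
s = pd.Series([10.0, 12.0, 11.0, 13.0, 1000.0, np.nan])

mean_fill = s.mean()      # pulled upward by the outlier
median_fill = s.median()  # stays near the bulk of the data

print(f"Mean fill value:   {mean_fill}")
print(f"Median fill value: {median_fill}")

# Fill the missing entry both ways for comparison
filled_mean = s.fillna(mean_fill)
filled_median = s.fillna(median_fill)
```

Both `mean()` and `median()` skip missing values by default, so the fill value is computed from the observed entries only.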

In [118]:
# Let's impute flipper_length_mm, body_mass_g using median of them
df["flipper_length_mm"] = df["flipper_length_mm"].fillna(df["flipper_length_mm"].median())
df["body_mass_g"] = df["body_mass_g"].fillna(df["body_mass_g"].median())
# Let's again check for missing values after imputation
print("Missing Values after Imputation")
print(df.isnull().sum().sort_values(ascending=False))
print("Percentage of Missing Values after Imputation")
print(df.isnull().sum().sort_values(ascending=False) / len(df) * 100)

Missing Values after Imputation
sex                  11
island                0
species               0
bill_length_mm        0
bill_depth_mm         0
flipper_length_mm     0
body_mass_g           0
dtype: int64
Percentage of Missing Values after Imputation
sex                  3.197674
island               0.000000
species              0.000000
bill_length_mm       0.000000
bill_depth_mm        0.000000
flipper_length_mm    0.000000
body_mass_g          0.000000
dtype: float64


# Mode Imputation

> **Definition**: Mode imputation replaces missing values with the mode (the most frequently occurring value) of the available values in a feature.
> 
> **Use Case**: Commonly used for categorical data, where the most common category is substituted for missing values.
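
One detail worth knowing before using `mode()[0]`: pandas returns a Series from `mode()` because several values can tie for most frequent. The sketch below uses a made-up column (an assumption for illustration) in which two categories tie, to show what indexing with `[0]` picks.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with a tie between "Female" and "Male"
s = pd.Series(["Male", "Female", "Male", "Female", np.nan])

# mode() ignores NaN and returns every most-frequent value, sorted
modes = s.mode()
print(modes.tolist())

# Taking the first entry resolves the tie by sort order
filled = s.fillna(modes[0])
print(filled.tolist())
```

When ties are likely, it may be worth checking `len(modes)` and choosing the fill category deliberately instead of relying on sort order.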

In [119]:
# Let's Impute missing values of sex column using mode
df["sex"] = df["sex"].fillna(df["sex"].mode()[0])
# Let's again check for missing values after imputation
print("Missing Values after Imputation")
print(df.isnull().sum().sort_values(ascending=False))
print("Percentage of Missing Values after Imputation")
print(df.isnull().sum().sort_values(ascending=False) / len(df) * 100)

Missing Values after Imputation
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
Percentage of Missing Values after Imputation
species              0.0
island               0.0
bill_length_mm       0.0
bill_depth_mm        0.0
flipper_length_mm    0.0
body_mass_g          0.0
sex                  0.0
dtype: float64


------

# K-Nearest Neighbors (KNN) Imputer

> **Definition**: KNN imputation uses the K-Nearest Neighbors algorithm to fill in missing values by looking at the *k* closest instances in the dataset.

> **How It Works**:
> 1. **Identify Neighbors**: For a data point with missing values, the algorithm finds the *k* nearest neighbors based on available features.
> 2. **Impute Values**: The missing values are replaced with the average (for continuous data) or the mode (for categorical data) of the neighbors' corresponding values.
>
> **Justification:**  
> - For continuous (numeric) data, the average of the nearest neighbors provides a reasonable estimate that maintains the overall distribution.  
> - For categorical data, using the mode ensures the imputed value is the most common among similar records, preserving the categorical nature and reducing bias from rare categories.

> **Advantages**:
> - **Contextual Imputation**: Takes into account the relationships between features, potentially leading to more accurate imputations.
> - **Flexibility**: Can handle both continuous and categorical data.
> 
> **Disadvantages**:
> - **Computationally Intensive**: Can be slow for large datasets, as it requires calculating distances to all other points.
> - **Sensitivity to Noise**: Performance can be affected by irrelevant features and outliers.

> **Use Case**: KNN imputation is particularly useful in datasets where the missing values are related to the values of other features.
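
The two steps described above can be sketched on a tiny made-up matrix (the values are assumptions chosen so the result is easy to check by hand). Note that scikit-learn's `KNNImputer` works on numeric input and averages the neighbors' values, so categorical columns would need to be encoded or handled separately.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix: two features, one missing value in the second column
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

# Distances are computed on the features both rows have observed,
# so the row [3, NaN] is compared to the others on the first feature only
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two closest rows are [2, 20] and [1, 10], so the fill is their average
print(X_filled)
```

Because the neighbors are found from the other features, passing a single column to the imputer leaves it nothing to measure distance on.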

In [120]:
# Load the titanic dataset using seaborn library
df = sns.load_dataset("titanic")
print(df.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


In [121]:
# Let's first check for missing values and their percentage in the dataset
print("Missing Values")
print(df.isnull().sum().sort_values(ascending=False))
print("Percentage of Missing Values")
print(df.isnull().sum().sort_values(ascending=False) / len(df) * 100)

Missing Values
deck           688
age            177
embarked         2
embark_town      2
sex              0
pclass           0
survived         0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64
Percentage of Missing Values
deck           77.216611
age            19.865320
embarked        0.224467
embark_town     0.224467
sex             0.000000
pclass          0.000000
survived        0.000000
fare            0.000000
parch           0.000000
sibsp           0.000000
class           0.000000
adult_male      0.000000
who             0.000000
alive           0.000000
alone           0.000000
dtype: float64


In [122]:
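# Import KNN Imputer from sklearn
from sklearn.impute import KNNImputer
# Initialize KNN Imputer
imputer = KNNImputer(n_neighbors=5)
# KNN needs other features to find neighbors, so fit on several numeric columns
# (this column choice is an assumption) and keep only the imputed age column.
# With age alone, the imputer has no distances to use and falls back to the mean.
num_cols = ["age", "fare", "pclass", "sibsp", "parch"]
df["age"] = imputer.fit_transform(df[num_cols])[:, 0]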
# Let's again check for missing values after imputation
print("Missing Values after Imputation")
print(df.isnull().sum().sort_values(ascending=False))
print("Percentage of Missing Values after Imputation")
print(df.isnull().sum().sort_values(ascending=False) / len(df) * 100)

Missing Values after Imputation
deck           688
embarked         2
embark_town      2
age              0
survived         0
pclass           0
sex              0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64
Percentage of Missing Values after Imputation
deck           77.216611
embarked        0.224467
embark_town     0.224467
age             0.000000
survived        0.000000
pclass          0.000000
sex             0.000000
fare            0.000000
parch           0.000000
sibsp           0.000000
class           0.000000
adult_male      0.000000
who             0.000000
alive           0.000000
alone           0.000000
dtype: float64


----


# Regression Imputer for Imputation

> **Definition**: A Regression Imputer fills in missing values by predicting them based on other features using a regression model.
>
> **How It Works**:
> 1. **Train a Regression Model**: Use available data to train a regression model where the target is the feature with missing values.
> 2. **Predict Missing Values**: Apply the model to predict and fill in the missing values based on other features.
>

> **Advantages**:
> - Utilizes relationships in the data for more accurate imputations.
>
> **Disadvantages**:
> - Assumes linearity and can be complex to implement.
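
The two steps above can be written out by hand as a minimal sketch. The toy DataFrame, its column names `x` and `y`, and the choice of `LinearRegression` are assumptions for illustration; the data is built so that `y = 2 * x + 1` exactly, which makes the imputed values easy to verify.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data where y is exactly 2 * x + 1, with two missing y values
df_toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [3.0, 5.0, np.nan, 9.0, np.nan, 13.0],
})

# Step 1: train on the rows where the target is observed
observed = df_toy["y"].notna()
model = LinearRegression()
model.fit(df_toy.loc[observed, ["x"]], df_toy.loc[observed, "y"])

# Step 2: predict the target for the rows where it is missing
df_toy.loc[~observed, "y"] = model.predict(df_toy.loc[~observed, ["x"]])
print(df_toy)
```

The predictor columns must themselves be complete (or imputed first) for this to work, which is one reason iterative approaches are often used in practice.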

In [123]:
# Load the titanic dataset using seaborn library
df = sns.load_dataset("titanic")
# Let's first check for missing values and their percentage in the dataset
print("Missing Values")
print(df.isnull().sum().sort_values(ascending=False))
print("Percentage of Missing Values")
print(df.isnull().sum().sort_values(ascending=False) / len(df) * 100)

Missing Values
deck           688
age            177
embarked         2
embark_town      2
sex              0
pclass           0
survived         0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64
Percentage of Missing Values
deck           77.216611
age            19.865320
embarked        0.224467
embark_town     0.224467
sex             0.000000
pclass          0.000000
survived        0.000000
fare            0.000000
parch           0.000000
sibsp           0.000000
class           0.000000
adult_male      0.000000
who             0.000000
alive           0.000000
alone           0.000000
dtype: float64


In [124]:
# Import Regression imputer to impute missing values
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Call the IterativeImputer class with max_iter = 20
imputer = IterativeImputer(max_iter=20, random_state=42)

# Regress age on other numeric columns (this column choice is an assumption);
# with age alone the imputer has no predictors and falls back to the column mean
num_cols = ["age", "fare", "pclass", "sibsp", "parch"]
df["age"] = imputer.fit_transform(df[num_cols])[:, 0]

# Check the number of missing values in each column after imputation
print("Missing Values after Imputation")
print(df.isnull().sum().sort_values(ascending=False))

Missing Values after Imputation
deck           688
embarked         2
embark_town      2
age              0
survived         0
pclass           0
sex              0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64


---

# Random Forest Imputer for Imputation

> **Definition**: A Random Forest Imputer fills in missing values by using a Random Forest model to predict them based on other features.
>
> **How It Works**:
> 1. **Train a Random Forest Model**: A Random Forest is trained on the dataset, using the feature with missing values as the target variable.
> 2. **Impute Missing Values**: The model predicts the missing values by aggregating predictions from multiple decision trees within the forest.

> **Advantages**:
> - **Handles Non-Linearity**: Effectively captures complex relationships in the data.
> - **Robust to Overfitting**: The ensemble approach helps mitigate overfitting, making it reliable for imputation.
>
> **Disadvantages**:
> - **Computationally Intensive**: Requires more resources and time compared to simpler imputation methods.
> - **Complexity**: More challenging to implement and tune than basic imputation techniques.
>
> **Use Case**: Suitable for datasets with complex interactions between features and when high accuracy in imputation is required.
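
As a compact alternative to training the forest by hand, scikit-learn's `IterativeImputer` accepts any regressor through its `estimator` parameter. The sketch below is one possible setup under stated assumptions, not the approach used in the cells that follow: the synthetic data, the quadratic relationship, and the hyperparameters are all made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data with a non-linear relationship: x2 = x1 ** 2
rng = np.random.default_rng(0)
x1 = rng.uniform(-3, 3, size=200)
x2 = x1 ** 2
X = np.column_stack([x1, x2])

# Hide the first 20 values of the second feature
X_missing = X.copy()
X_missing[:20, 1] = np.nan

# Use a random forest as the model that predicts each incomplete feature
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = rf_imputer.fit_transform(X_missing)

# Compare imputed entries with the values that were hidden
mae = np.abs(X_filled[:20, 1] - X[:20, 1]).mean()
print(f"Mean absolute error on the imputed entries: {mae:.3f}")
```

Hiding known values and measuring the error on them, as done here, is a simple way to check whether an imputer is doing better than a plain mean fill.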

In [125]:
# Import important libraries for advanced imputation techniques
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [126]:
# Let's load the Planets dataset using Seaborn
df = sns.load_dataset("planets")
print(df.head())

            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009


In [127]:
# Let's first check for missing values and their percentage in the dataset
print("Missing Values")
print(df.isnull().sum().sort_values(ascending=False))
print("Percentage of Missing Values")
print(df.isnull().sum().sort_values(ascending=False) / len(df) * 100)

Missing Values
mass              522
distance          227
orbital_period     43
method              0
number              0
year                0
dtype: int64
Percentage of Missing Values
mass              50.434783
distance          21.932367
orbital_period     4.154589
method             0.000000
number             0.000000
year               0.000000
dtype: float64


In [128]:
# Check the shape of Dataset
print("Shape of Dataset:", df.shape)

Shape of Dataset: (1035, 6)


In [129]:
# Getting info of our dataset
print("Info of Dataset:")
df.info()

Info of Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035 entries, 0 to 1034
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          1035 non-null   object 
 1   number          1035 non-null   int64  
 2   orbital_period  992 non-null    float64
 3   mass            513 non-null    float64
 4   distance        808 non-null    float64
 5   year            1035 non-null   int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 48.6+ KB


In [130]:
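# Columns to encode (obj_cols must be defined before the loop below)
obj_cols = df.select_dtypes(include=['object']).columns.tolist()
# Encode and store encoders for each object column
encoders = {}
for col in obj_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    encoders[col] = le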

In [131]:
# Dataset after Encoding
print(df.head())

   method  number  orbital_period   mass  distance  year
0       7       1         269.300   7.10     77.40  2006
1       7       1         874.774   2.21     56.95  2008
2       7       1         763.000   2.60     19.84  2011
3       7       1         326.030  19.40    110.62  2007
4       7       1         516.220  10.50    119.47  2009


In [132]:
# Let's split the dataset into two parts one with missing values and the other one without missing values
df_with_missing = df[df["mass"].isna()]
df_without_missing = df.dropna()
# Let's check the shape of both datasets
print("Shape of original Dataset:", df.shape)
print("Shape of Dataset with Missing Values:", df_with_missing.shape)
print("Shape of Dataset without Missing Values:", df_without_missing.shape)


Shape of original Dataset: (1035, 6)
Shape of Dataset with Missing Values: (522, 6)
Shape of Dataset without Missing Values: (498, 6)


In [133]:
# Let's have a look on the df_with_missing dataset
print(df_with_missing.head())

    method  number  orbital_period  mass  distance  year
7        7       1       798.50000   NaN     21.41  1996
20       7       5         0.73654   NaN     12.53  2011
25       7       1       116.68840   NaN     18.11  1996
26       7       1       691.90000   NaN     81.50  2012
29       2       1             NaN   NaN     45.52  2005


In [134]:
# Let's have a look on the df_without_missing dataset
print(df_without_missing.head())

   method  number  orbital_period   mass  distance  year
0       7       1         269.300   7.10     77.40  2006
1       7       1         874.774   2.21     56.95  2008
2       7       1         763.000   2.60     19.84  2011
3       7       1         326.030  19.40    110.62  2007
4       7       1         516.220  10.50    119.47  2009


In [135]:
# Find and remove outliers in the mass column of dataset
outliers = df_without_missing[(df_without_missing['mass'] < 0.1) | (df_without_missing['mass'] > 10)]
df_without_missing = df_without_missing[~df_without_missing.index.isin(outliers.index)]
# Prepare features and target, using only rows without missing values
features = df_without_missing.drop(columns=['mass'])
target = df_without_missing['mass']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
# Train Random Forest Regressor for imputation
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Predict and evaluate
predictions = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
print(f"RMSE: {rmse}")
print(f"R2 Score: {r2}")
print(f"MAE: {mae}")



RMSE: 2.3118523032875506
R2 Score: 0.08624307992542535
MAE: 1.7529973770270268


In [136]:
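# Work on a copy so the assignment below does not trigger a SettingWithCopyWarning
df_with_missing = df_with_missing.copy()
# Predict missing values
# Note: these rows can still contain NaNs in other columns (e.g. orbital_period),
# so this call depends on a scikit-learn version whose random forests accept NaN input
y_pred = rf.predict(df_with_missing.drop(['mass'], axis=1))
# Add the predicted values to the df_with_missing DataFrame
df_with_missing['mass'] = y_pred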
# Combine the datasets back together
df_imputed = pd.concat([df_without_missing, df_with_missing], ignore_index=True)
# Let's check the shape of the final dataset
print("Shape of Final Dataset after Imputation:", df_imputed.shape)
# Let's check for missing values in the final dataset
print("Missing Values in Final Dataset:")
print(df_imputed.isnull().sum().sort_values(ascending=False))

Shape of Final Dataset after Imputation: (889, 6)
Missing Values in Final Dataset:
distance          212
orbital_period     43
number              0
method              0
mass                0
year                0
dtype: int64


In [137]:
# Let's decode the encoded columns back to their original values
for col in obj_cols:
    df_imputed[col] = encoders[col].inverse_transform(df_imputed[col])

print("Final Imputed Dataset:")
print(df_imputed.head())

Final Imputed Dataset:
            method  number  orbital_period  mass  distance  year
0  Radial Velocity       1         269.300  7.10     77.40  2006
1  Radial Velocity       1         874.774  2.21     56.95  2008
2  Radial Velocity       1         763.000  2.60     19.84  2011
3  Radial Velocity       1         185.840  4.80     76.39  2008
4  Radial Velocity       1        1773.400  4.64     18.15  2002


# MICE (Multiple Imputation by Chained Equations) for Imputation

> **Definition**: MICE is an advanced statistical method that performs multiple imputations by modeling each feature with missing values as a function of other features in the dataset.
>
> **How It Works**:
> 1. **Iterative Process**: MICE creates multiple datasets by imputing missing values several times, iteratively updating the imputations.
> 2. **Chained Equations**: Each feature with missing values is modeled using regression or other methods, conditioned on the other features.
> 3. **Multiple Datasets**: Generates several complete datasets, allowing for variability in the imputations.

> **Advantages**:
> - **Captures Uncertainty**: Produces multiple imputations, reflecting the uncertainty around missing values.
> - **Flexible**: Can handle different types of variables (continuous, categorical) and relationships.
>
> **Disadvantages**:
> - **Complex Implementation**: More complicated to set up and requires careful tuning.
> - **Computationally Intensive**: Can be resource-heavy, especially with large datasets.
>
> **Use Case**: Ideal for datasets with significant missing values and where the relationships between variables are complex and important.
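
One practical caveat: scikit-learn's `IterativeImputer` returns a single completed dataset per call. A common way to approximate the "multiple datasets" step described above is to set `sample_posterior=True` and repeat the imputation with different seeds. The following is a minimal sketch under that assumption; the synthetic data and the number of repetitions are made up for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy matrix: the second feature is roughly 2x the first
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2 * x1 + rng.normal(scale=0.1, size=50)
X = np.column_stack([x1, x2])
X[:10, 1] = np.nan  # hide ten values of the second feature

# Draw several completed datasets; sample_posterior=True samples each
# imputed value from the predictive distribution, so seeds give different fills
completed = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed).fit_transform(X)
    for seed in range(5)
]

# Spread of the imputed value for the first row across the five datasets
first_row_draws = np.array([Xc[0, 1] for Xc in completed])
print("Draws:", first_row_draws)
print("Std across imputations:", first_row_draws.std())
```

The spread across the completed datasets is what reflects the uncertainty around each missing value; downstream analyses would typically be run on every dataset and their results pooled.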

In [155]:
# Import important libraries for advanced imputation techniques
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [156]:
# Load the Planet Dataset using Seaborn
df = sns.load_dataset("planets")
print(df.head())

            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009


In [157]:
# Let's first check for missing values and their percentage in the dataset
print("Missing Values")
print(df.isnull().sum().sort_values(ascending=False))
print("Percentage of Missing Values")
print(df.isnull().sum().sort_values(ascending=False) / len(df) * 100)

Missing Values
mass              522
distance          227
orbital_period     43
method              0
number              0
year                0
dtype: int64
Percentage of Missing Values
mass              50.434783
distance          21.932367
orbital_period     4.154589
method             0.000000
number             0.000000
year               0.000000
dtype: float64


In [158]:
# Getting the info of our dataset
print("Info of Dataset:")
df.info()

Info of Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035 entries, 0 to 1034
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   method          1035 non-null   object 
 1   number          1035 non-null   int64  
 2   orbital_period  992 non-null    float64
 3   mass            513 non-null    float64
 4   distance        808 non-null    float64
 5   year            1035 non-null   int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 48.6+ KB


In [159]:
# Columns to Encode
obj_cols = df.select_dtypes(include=['object']).columns.tolist()
# Encode and store encoders for each object column
encoders = {}
for col in obj_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    encoders[col] = le
# Dataset after Encoding
print(df.head())    


   method  number  orbital_period   mass  distance  year
0       7       1         269.300   7.10     77.40  2006
1       7       1         874.774   2.21     56.95  2008
2       7       1         763.000   2.60     19.84  2011
3       7       1         326.030  19.40    110.62  2007
4       7       1         516.220  10.50    119.47  2009


In [160]:
# Call the IterativeImputer class with max_iter = 20
imputer = IterativeImputer(max_iter=20, random_state=42)
# Columns to impute
cols_to_impute = ["mass", "distance", "orbital_period"]
# Fit on all columns together so each feature is modeled from the others
# (fitting one column at a time would reduce to mean imputation)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
df[cols_to_impute] = imputed[cols_to_impute]
# Dataset after imputation
print("Dataset after Imputation:")
print(df.isnull().sum().sort_values(ascending=False))


Dataset after Imputation:
method            0
number            0
orbital_period    0
mass              0
distance          0
year              0
dtype: int64


In [161]:
# Let's Decode the encoded columns back to their original values
for col in encoders.keys():
    df[col] = encoders[col].inverse_transform(df[col])
print("Final Imputed Dataset:")
print(df.head())

Final Imputed Dataset:
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
