This section focuses on identifying and addressing missing values (often represented as `NaN`) in your dataset, which is a crucial step as most machine learning algorithms cannot handle them directly.

## Handling Missing Data in Python


This notebook walks through handling missing data hands-on:

* **Identifying:** Counting missing values per column with pandas `.isnull().sum()` and converting the counts to percentages. The `missingno` library can visualize missingness patterns.
* **Deletion:** Removing rows with `.dropna()` and sparse columns with `.dropna(axis=1, thresh=...)`, where `thresh` is the minimum number of non-missing values a column needs in order to be kept.
* **Simple Imputation:** Using `SimpleImputer` with the `mean` or `median` strategy for numerical columns, `most_frequent` for categorical columns, and `constant` for either.
* **Advanced Imputation:** Using `KNNImputer` and `IterativeImputer`, which require numerical input and cost more to compute.
* **Missing Indicators:** Using `SimpleImputer(add_indicator=True)` to add binary features that flag which values were imputed.
* **Implementation Notes:** Fitting imputers only on the training data to prevent data leakage.
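
The last point is easy to state and easy to get wrong, so here is a minimal sketch of it. It assumes a plain `train_test_split` and a toy frame whose values are made up for illustration; the names `X_train_imp` and `X_test_imp` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the column names mirror the sample used in this notebook.
X = pd.DataFrame({
    "Age": [25, 45, np.nan, 55, 22, 38, 42, np.nan, 29, 61],
    "Salary": [50000, 80000, 60000, 95000, 48000, np.nan, 72000, 85000, 52000, 99000],
})

# Split first, so the test rows never influence the imputation statistics.
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)  # medians are learned from the training rows only

# Reuse the same fitted imputer for both splits.
X_train_imp = pd.DataFrame(imputer.transform(X_train), columns=X.columns, index=X_train.index)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X.columns, index=X_test.index)

print("Medians learned from train:", imputer.statistics_)
print("Remaining NaNs in test:", int(X_test_imp.isnull().sum().sum()))
```

Calling `fit_transform` on the full dataset before splitting would let test-set values shift the medians, which is the leakage the note above warns about.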

---

This covers the core techniques for addressing missing values in your datasets.
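
Because numerical and categorical columns need different strategies, it can help to see them combined in one object. The following is a sketch only, assuming scikit-learn's `ColumnTransformer` is an acceptable way to route columns; the frame `df_demo` and the output column name `Age_was_missing` are hypothetical labels chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one numerical and one categorical column.
df_demo = pd.DataFrame({
    "Age": [25.0, np.nan, 55.0, 22.0, 38.0],
    "Gender": pd.Series(["Male", "Female", np.nan, "Female", "Female"], dtype=object),
})

preprocess = ColumnTransformer([
    # Median fill, plus a 0/1 flag column marking rows that were missing.
    ("num", SimpleImputer(strategy="median", add_indicator=True), ["Age"]),
    # Most frequent category for the text column.
    ("cat", SimpleImputer(strategy="most_frequent"), ["Gender"]),
])

# Output columns follow the transformer order: imputed Age, its indicator, imputed Gender.
filled = preprocess.fit_transform(df_demo)
result = pd.DataFrame(filled, columns=["Age", "Age_was_missing", "Gender"])
print(result)
```

In a real workflow the same object would be fitted on the training split only and then used to transform the test split.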

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np

# IterativeImputer is still experimental: this enabling import must run before IterativeImputer is imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (imported for its side effect)
# Now import imputers
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Optional: for visualizing missing data (install with: pip install missingno)
# import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns

# --- 1. Create Sample Data with Missing Values ---
data = {
    'Age': [25, 45, np.nan, 55, 22, 38, 42, np.nan, 29],
    'Salary': [50000, 80000, 60000, 95000, 48000, np.nan, 72000, 85000, 52000],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male', np.nan, 'Male', 'Female'],
    'Experience': [2, 20, 10, 30, 1, 12, 15, 25, 5],
    'Education': ['Bachelor', 'Master', 'Bachelor', 'PhD', 'High School', 'Master', 'PhD', 'Master', np.nan]
}
df = pd.DataFrame(data)

print("--- Original DataFrame with Missing Values ---")
print(df)
df.info() # Initial check for non-null counts and dtypes
print("-" * 30)


# --- 2. Identifying Missing Values ---

print("--- Identifying Missing Values ---")

# a) Count missing values per column
print("Missing values count per column (.isnull().sum()):\n", df.isnull().sum())

# b) Percentage of missing values per column
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("\nMissing values percentage per column:\n", missing_percentage.round(2))

# c) Visualizing missingness (using missingno - optional)
# Provides visual patterns of missing data.
# print("\nVisualizing missing data patterns (requires 'missingno' library)...")
# try:
#     import missingno as msno
#     msno.matrix(df)
#     plt.title("Missing Data Matrix", fontsize=16)
#     plt.show()
#     msno.heatmap(df)
#     plt.title("Missing Data Correlation Heatmap", fontsize=16)
#     plt.show()
# except ImportError:
#     print("Install 'missingno' library (`pip install missingno`) to visualize patterns.")

print("-" * 30)


# --- 3. Strategy 1: Deletion ---
# Removing rows or columns with missing data. Use with caution due to data loss.

print("--- Strategy 1: Deletion ---")
# Create copies to avoid modifying the original df for later examples
df_copy_del = df.copy()

# a) Listwise Deletion (Row Removal)
# Remove rows containing *any* NaN values.
df_dropped_rows = df_copy_del.dropna() # Default axis=0, how='any'
print(f"Shape before dropping rows: {df_copy_del.shape}")
print(f"Shape after dropping rows with any NaN: {df_dropped_rows.shape}")
# print("\nDataFrame after dropping rows:\n", df_dropped_rows)

# b) Column Deletion (Feature Removal)
# Remove columns where the percentage of missing values exceeds a threshold.
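threshold_percent = 50 # Example: Drop columns with more than 50% missing
# thresh is documented as an int: the minimum number of non-NaN values a column needs to be kept
max_missing = len(df_copy_del) * threshold_percent // 100 # Most NaNs a column may have and still be kept
threshold_count = len(df_copy_del) - max_missing
df_dropped_cols = df_copy_del.dropna(axis=1, thresh=threshold_count) # Keep cols with at least threshold_count non-NaNs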
print(f"\nShape before dropping columns: {df_copy_del.shape}")
print(f"Shape after dropping columns > {threshold_percent}% missing: {df_dropped_cols.shape}")
# print("\nDataFrame after dropping columns:\n", df_dropped_cols)

# Note: Deletion is simple but often not ideal due to information loss.
print("-" * 30)


# --- 4. Strategy 2: Imputation (Filling Values) ---
# Replacing missing values. Often preferred over deletion.
# IMPORTANT: Fit imputers ONLY on training data in a real ML workflow.
# For demonstration, we apply to the whole sample DataFrame here.

print("--- Strategy 2: Imputation ---")
df_copy_imp = df.copy()

# Separate columns by type for appropriate imputation
numerical_cols = df_copy_imp.select_dtypes(include=np.number).columns
categorical_cols = df_copy_imp.select_dtypes(include='object').columns
print(f"Numerical columns: {list(numerical_cols)}")
print(f"Categorical columns: {list(categorical_cols)}")

# a) Simple Imputation (using SimpleImputer)
print("\n--- a) Simple Imputation ---")

# Mean Imputation (Numerical) - Sensitive to outliers
imputer_mean = SimpleImputer(strategy='mean')
# Work on a separate copy so each strategy starts from the same un-imputed data
df_copy_imp_mean = df_copy_imp.copy()
df_copy_imp_mean[numerical_cols] = imputer_mean.fit_transform(df_copy_imp_mean[numerical_cols])
print("DataFrame after Mean Imputation (Numerical):\n", df_copy_imp_mean[numerical_cols].head())


# Median Imputation (Numerical) - Robust to outliers
imputer_median = SimpleImputer(strategy='median')
# Separate copy again, so the mean-imputed result is left untouched
df_copy_imp_median = df_copy_imp.copy()
df_copy_imp_median[numerical_cols] = imputer_median.fit_transform(df_copy_imp_median[numerical_cols])
print("\nDataFrame after Median Imputation (Numerical):\n", df_copy_imp_median[numerical_cols].head())


# Mode Imputation (Categorical)
imputer_mode = SimpleImputer(strategy='most_frequent')
# Separate copy for the categorical example
df_copy_imp_mode = df_copy_imp.copy()
df_copy_imp_mode[categorical_cols] = imputer_mode.fit_transform(df_copy_imp_mode[categorical_cols])
print("\nDataFrame after Mode Imputation (Categorical):\n", df_copy_imp_mode[categorical_cols].head())


# Constant Imputation
imputer_constant_num = SimpleImputer(strategy='constant', fill_value=-99)
imputer_constant_cat = SimpleImputer(strategy='constant', fill_value='Unknown')
# Example application:
# df_copy_imp_const = df_copy_imp.copy()
# df_copy_imp_const['Salary'] = imputer_constant_num.fit_transform(df_copy_imp_const[['Salary']])
# df_copy_imp_const['Education'] = imputer_constant_cat.fit_transform(df_copy_imp_const[['Education']])
# print("\nDataFrame after Constant Imputation (Example):\n", df_copy_imp_const[['Salary', 'Education']].head())

# Reset for next examples
df_copy_imp = df.copy()
print("-" * 20)

# b) Advanced Imputation
print("\n--- b) Advanced Imputation ---")

# KNN Imputation (Uses k-Nearest Neighbors)
# Requires all data to be numerical. Need to encode categoricals first (covered in Section III).
# For demonstration, let's impute only numerical columns.
print("KNN Imputation (on numerical columns):")
df_copy_knn = df.copy() # Use a fresh copy
try:
    knn_imputer = KNNImputer(n_neighbors=3) # Use 3 neighbors
    df_knn_imputed_num = knn_imputer.fit_transform(df_copy_knn[numerical_cols])
    df_copy_knn[numerical_cols] = df_knn_imputed_num
    print("DataFrame after KNN Imputation (Numerical):\n", df_copy_knn[numerical_cols].head())
except Exception as e:
    print(f"KNN Imputation failed (might need encoding first): {e}")


# Multivariate Imputation (IterativeImputer - e.g., MICE)
# Models each feature with missing values as a function of other features.
# Also requires numerical data.
print("\nIterative Imputation (on numerical columns):")
df_copy_iter = df.copy() # Use a fresh copy
try:
    # Note: enable_iterative_imputer was imported at the top
    iter_imputer = IterativeImputer(max_iter=10, random_state=42) # max_iter controls iterations
    df_iter_imputed_num = iter_imputer.fit_transform(df_copy_iter[numerical_cols])
    df_copy_iter[numerical_cols] = df_iter_imputed_num
    print("DataFrame after Iterative Imputation (Numerical):\n", df_copy_iter[numerical_cols].head().round(2))
except Exception as e:
    print(f"Iterative Imputation failed (might need encoding first): {e}")

print("-" * 30)


# --- 5. Strategy 3: Missing Indicator Feature ---
# Add binary columns indicating where data was originally missing.
# Can be used alongside imputation.

print("--- Strategy 3: Missing Indicator Feature ---")
df_copy_indicator = df.copy() # Use a fresh copy

# Using SimpleImputer with add_indicator=True
imputer_indicator = SimpleImputer(strategy='median', add_indicator=True)

# Apply to numerical features
df_imputed_with_indicator_num = imputer_indicator.fit_transform(df_copy_indicator[numerical_cols])
# Get feature names (original + indicator names)
indicator_names = imputer_indicator.get_feature_names_out(numerical_cols)
df_processed_num = pd.DataFrame(df_imputed_with_indicator_num, columns=indicator_names)

print("DataFrame after Median Imputation with Missing Indicators (Numerical):\n", df_processed_num.head())

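# Note: The indicator columns (named 'missingindicator_<column>') hold 1.0 where the original value was NaN, 0.0 otherwise.
# They are only created for columns that had missing values during fit (here Age and Salary, not Experience).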
# This allows the model to potentially learn from the pattern of missingness.

print("-" * 30)

# --- 6. Implementation Notes (Recap) ---
print("--- Implementation Notes ---")
print("1. Identify and understand the extent and pattern of missing data.")
print("2. Choose a strategy (deletion, imputation, indicators) based on the data and problem.")
print("3. CRITICAL: In ML workflows, fit imputers ONLY on the training data.")
print("4. Apply the *fitted* imputer to transform both training and test data.")
print("5. Consider using Pipelines (Section VIII) to manage imputation correctly within cross-validation.")
print("-" * 30)

--- Original DataFrame with Missing Values ---
    Age   Salary  Gender  Experience    Education
0  25.0  50000.0    Male           2     Bachelor
1  45.0  80000.0  Female          20       Master
2   NaN  60000.0  Female          10     Bachelor
3  55.0  95000.0    Male          30          PhD
4  22.0  48000.0  Female           1  High School
5  38.0      NaN    Male          12       Master
6  42.0  72000.0     NaN          15          PhD
7   NaN  85000.0    Male          25       Master
8  29.0  52000.0  Female           5          NaN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Age         7 non-null      float64
 1   Salary      8 non-null      float64
 2   Gender      8 non-null      object 
 3   Experience  9 non-null      int64  
 4   Education   8 non-null      object 
dtypes: float64(2), int64(1), object(2)
memory usage: 492.0+ bytes