# DA5401 A6: Imputation via Regression for Missing Data
## Objective: 

This assignment challenges you to apply linear and non-linear regression to impute
missing values in a dataset. The effectiveness of your imputation methods will be measured
indirectly by assessing the performance of a subsequent classification task, comparing the
regression-based approach against simpler imputation strategies.

## 1. Problem Statement
You are a machine learning engineer working on a credit risk assessment project. You have
been provided with the UCI Credit Card Default Clients Dataset. This dataset has missing
values in several important feature columns. The presence of missing data prevents the
immediate application of many classification algorithms.

Your task is to implement three different strategies for handling the missing data and then use
the resulting clean datasets to train and evaluate a classification model. This will demonstrate
how the choice of imputation technique significantly impacts final model performance.

You will submit a Jupyter Notebook with your complete code, visualizations, and a plausible
story that explains your findings. The notebook should be well-commented, reproducible, and
easy to follow.

### Dataset:
- UCI Credit Card Default Clients Dataset (with missing values): Kaggle - Credit Card Default Clients Dataset (https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset)
    - Note: While the original UCI dataset is relatively clean, for this assignment, you should artificially introduce Missing At Random (MAR) values (e.g., replace 5% of the values in the 'AGE' and 'BILL_AMT' columns with NaN) before starting Part A, to simulate a real-world scenario with a substantial missing data problem.

## 2. Tasks
### Part A: Data Preprocessing and Imputation

In [1]:
import pandas as pd
import numpy as np

In [4]:
# 1. Load and Prepare Data [4]: Load the dataset and, as instructed in the note above,
# artificially introduce MAR missing values (5-10% in 2-3 numerical feature columns).
# The target variable is 'default payment next month'; in the Kaggle CSV this column
# is named 'default.payment.next.month'.

# Adjust this path to point at your local copy of the CSV
df = pd.read_csv('/Users/navaneethakrishnan/Desktop/DAL/assignment_6_Navaneeth272001/UCI_Credit_Card.csv')
print("Initial shape:", df.shape)
print(df[['AGE', 'BILL_AMT1']].describe())

np.random.seed(42)

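# 1️⃣ Compute conditional subsets: missingness is driven by other columns, not by the
#    value being removed, which is what makes the mechanism MAR
cond_bill = df['AGE'] > 50        # BILL_AMT1 can go missing only where AGE > 50
cond_age = df['EDUCATION'] == 1   # AGE can go missing only where EDUCATION == 1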

# 2️⃣ Calculate fraction needed for ~5% overall missing
frac_bill = 0.05 / cond_bill.mean()
frac_age = 0.05 / cond_age.mean()

# 3️⃣ Apply MAR missingness scaled to achieve ~5% total
mask_mar_bill = cond_bill & (np.random.rand(len(df)) < frac_bill)
mask_mar_age = cond_age & (np.random.rand(len(df)) < frac_age)

df.loc[mask_mar_bill, 'BILL_AMT1'] = np.nan
df.loc[mask_mar_age, 'AGE'] = np.nan

# Check result again
missing_summary = df[['AGE', 'BILL_AMT1']].isna().mean() * 100
print("\nAdjusted missing percentage after MAR injection:")
print(missing_summary.round(2))

# Define target column
target_col = "default.payment.next.month"

# Create target variable
y = df[target_col]

# Create feature matrix by dropping the target column
X = df.drop(columns=[target_col])

# Confirm shapes
print("\nFeature matrix and target vector created successfully.")
print("X shape:", X.shape)
print("y shape:", y.shape)

# Optional sanity check
print("\nTarget variable distribution:")
print(y.value_counts(normalize=True).round(3))

Initial shape: (30000, 25)
                AGE      BILL_AMT1
count  30000.000000   30000.000000
mean      35.485500   51223.330900
std        9.217904   73635.860576
min       21.000000 -165580.000000
25%       28.000000    3558.750000
50%       34.000000   22381.500000
75%       41.000000   67091.000000
max       79.000000  964511.000000

Adjusted missing percentage after MAR injection:
AGE          5.03
BILL_AMT1    4.99
dtype: float64

Feature matrix and target vector created successfully.
X shape: (30000, 24)
y shape: (30000,)

Target variable distribution:
default.payment.next.month
0    0.779
1    0.221
Name: proportion, dtype: float64
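
The injection step above can be wrapped in a small helper and checked on synthetic data. The sketch below is illustrative only: `inject_mar` is a hypothetical helper name, the toy values are made up, and it uses `np.random.default_rng` rather than the global seed used in the cell above.

```python
import numpy as np
import pandas as pd

def inject_mar(frame, column, condition, overall_rate, rng):
    """Set `column` to NaN in a random subset of the rows where `condition` holds,
    sized so that roughly `overall_rate` of all rows end up missing."""
    # Fraction of eligible rows to blank out; capped at 1 if the condition is rare
    frac = min(1.0, overall_rate / condition.mean())
    mask = condition.to_numpy() & (rng.random(len(frame)) < frac)
    out = frame.copy()
    out.loc[mask, column] = np.nan
    return out

rng = np.random.default_rng(0)

# Synthetic stand-in for the credit data (values are made up)
toy = pd.DataFrame({
    "AGE": rng.integers(21, 80, size=10_000).astype(float),
    "BILL_AMT1": rng.normal(50_000, 20_000, size=10_000),
})
toy_mar = inject_mar(toy, "BILL_AMT1", toy["AGE"] > 50, 0.05, rng)

missing_rate = toy_mar["BILL_AMT1"].isna().mean()
# Every blanked row satisfies the driving condition on AGE
only_old_affected = bool((toy_mar.loc[toy_mar["BILL_AMT1"].isna(), "AGE"] > 50).all())
print(f"missing rate: {missing_rate:.3f}, only AGE > 50 affected: {only_old_affected}")
```

Capping the fraction at 1 means the helper silently delivers less than the requested rate when the driving condition covers fewer rows than `overall_rate`; that is a design choice of this sketch, not something the assignment specifies.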


In [5]:
# 2. Imputation Strategy 1: Simple Imputation (Baseline):

#Create a clean dataset copy (Dataset A). For each column with missing values, fill the missing values with the median of that column. 
# 3️⃣ Create a clean dataset copy
df_clean = df.copy()

# 4️⃣ Identify columns with missing values
missing_cols = df_clean.columns[df_clean.isna().any()]
print("\nColumns with missing values:", list(missing_cols))

# 5️⃣ Fill missing values with column median (Dataset A)
df_A = df_clean.copy()
for col in missing_cols:
    median_value = df_A[col].median()
    df_A[col] = df_A[col].fillna(median_value)

# 6️⃣ Verify that all missing values are handled
print("\nMissing values after median imputation:")
print(df_A[missing_cols].isna().sum())

# Optional: check if dataset shape and stats remain consistent
print("\nShape of Dataset A:", df_A.shape)
print(df_A[['AGE', 'BILL_AMT1']].describe())



Columns with missing values: ['AGE', 'BILL_AMT1']

Missing values after median imputation:
AGE          0
BILL_AMT1    0
dtype: int64

Shape of Dataset A: (30000, 25)
                AGE      BILL_AMT1
count  30000.000000   30000.000000
mean      35.483300   49635.322700
std        9.034407   71505.228811
min       21.000000 -165580.000000
25%       28.000000    4150.250000
50%       34.000000   22399.000000
75%       41.000000   62774.250000
max       79.000000  964511.000000


Explain why the median is often preferred over the mean for imputation.

- Robustness to Outliers
    - The mean is sensitive to extreme values (outliers).
    - Example: if one person has a bill amount of ₹10,00,000 while others have around ₹10,000, the mean will be pulled upward.
    - The median, on the other hand, is not affected by outliers — it only depends on the middle value of the sorted data.
    - So it provides a more stable and representative imputation value when the data are skewed or contain outliers.
- Better for Skewed Distributions
    - Many real-world variables (e.g., income, bill amounts, age) are right-skewed — meaning there are a few very large values.
    - The mean in such cases doesn’t represent the “typical” observation, but the median does.
    - Median imputation preserves the central tendency better for non-normal (skewed) distributions.
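- Effect on Spread
    - Neither the mean nor the median preserves the spread: filling every gap with one value adds a spike at that value and shrinks the variance. In the output above, the standard deviation of BILL_AMT1 is about 71,505 after median imputation, against 73,636 in the original data.
    - The advantage of the median is that this spike sits at a value typical of the bulk of the data, rather than at one pulled toward the tail.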
- Simplicity and Interpretability
    - Median imputation is simple, quick, and computationally efficient.
    - It doesn’t require complex modeling assumptions — it just replaces missing values with a robust measure of central tendency.
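
The robustness point can be checked with a few made-up numbers. This is a minimal illustration with invented bill amounts, not values from the dataset.

```python
import pandas as pd

# Made-up bill amounts: nine typical values and one extreme outlier
bills = pd.Series(
    [9_000, 9_500, 10_000, 10_000, 10_500, 11_000, 11_500, 12_000, 12_500, 1_000_000],
    dtype=float,
)

mean_fill = bills.mean()      # 109600.0: larger than nine of the ten values
median_fill = bills.median()  # 10750.0: in the middle of the typical values

# How many observations lie below each candidate fill value
below_mean = int((bills < mean_fill).sum())
below_median = int((bills < median_fill).sum())

print("mean fill value:  ", mean_fill, "| values below it:", below_mean)
print("median fill value:", median_fill, "| values below it:", below_median)
```

A gap filled with the mean here would be assigned a bill roughly ten times larger than any typical customer's, while the median fill lands among the typical values.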

In [6]:
#3. Imputation Strategy 2: Regression Imputation (Linear):

#Create a second clean dataset copy (Dataset B). For a single column (your choice) with missing values, use a Linear Regression model to predict the missing values based on all other non-missing features.
from sklearn.linear_model import LinearRegression

# Make a fresh copy
df_B = df.copy()

# Choose the column with missing values (example: 'BILL_AMT1')
target_col_missing = 'BILL_AMT1'

# Split data into rows with and without missing target values
df_not_missing = df_B[df_B[target_col_missing].notna()]
df_missing = df_B[df_B[target_col_missing].isna()]

print(f"\nNumber of missing rows in '{target_col_missing}': {len(df_missing)}")

# Define features (all other columns) and target
X_train = df_not_missing.drop(columns=[target_col_missing])
y_train = df_not_missing[target_col_missing]

# Prepare rows where target_col_missing is NaN
X_pred = df_missing.drop(columns=[target_col_missing])

# Fill remaining missing numeric values (excluding the target column) with medians
X_train = X_train.fillna(X_train.median())
X_pred = X_pred.fillna(X_pred.median())

# Fit Linear Regression model on all available (non-missing) data
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict missing BILL_AMT1 values
predicted_values = reg.predict(X_pred)

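# Replace missing values with predicted values
df_B.loc[df_B[target_col_missing].isna(), target_col_missing] = predicted_values

# AGE still contains NaNs in this copy; fill any remaining gaps with column medians
# so that Dataset B is fully clean for the classifier in Part B
df_B = df_B.fillna(df_B.median())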

# Verify the imputation
print("\nMissing values after Linear Regression imputation:")
print(df_B[target_col_missing].isna().sum())

print("\nShape of Dataset B:", df_B.shape)
print(df_B[[target_col_missing]].describe())



Number of missing rows in 'BILL_AMT1': 1497

Missing values after Linear Regression imputation:
0

Shape of Dataset B: (30000, 25)
           BILL_AMT1
count   30000.000000
mean    51235.041649
std     73384.538489
min   -165580.000000
25%      3933.000000
50%     22460.291531
75%     66979.750000
max    964511.000000


Explain the underlying assumption of this method (Missing At Random).

- Missing At Random (MAR) Definition:
    - Missingness of a variable depends on other observed variables, but not on the value of the variable itself.
    - Mathematically:
        - P(missing in X_miss | X_miss, X_obs) = P(missing in X_miss | X_obs)
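- How it relates to Linear Regression Imputation:
    - Under MAR, the relationship between the target column and the observed features is the same in rows where the target is missing as in rows where it is observed, so a model fit on the complete rows can be applied to the incomplete ones.
    - Regression predicts the conditional mean of the missing value given the observed variables. A linear model additionally assumes that this relationship is approximately linear.
    - Because every imputed value lies exactly on the fitted line, the imputed column has less variance than the true one.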
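
The effect of the MAR assumption can be seen on synthetic data. In the sketch below, all values are simulated and the coefficients and missingness rule are arbitrary choices for illustration: `y` goes missing more often when an observed feature `x` is large, and the column mean is compared after a median fill and after a regression fill.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Synthetic data: y depends linearly on an always-observed feature x
x = rng.normal(0.0, 1.0, size=n)
y = 3.0 * x + rng.normal(0.0, 1.0, size=n)

# MAR: y is more likely to be missing when x is large, regardless of y itself
missing = (x > 0.5) & (rng.random(n) < 0.6)
observed = ~missing

# Baseline: fill with the median of the observed y values
y_median = y.copy()
y_median[missing] = np.median(y[observed])

# Regression imputation: fit y ~ x on complete rows, predict for incomplete rows
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
y_reg = y.copy()
y_reg[missing] = slope * x[missing] + intercept

true_mean = y.mean()
err_median = abs(y_median.mean() - true_mean)
err_reg = abs(y_reg.mean() - true_mean)
print(f"error in column mean, median fill:     {err_median:.3f}")
print(f"error in column mean, regression fill: {err_reg:.3f}")
```

The median fill ignores the fact that the missing rows have large `x`, so it pulls the column mean down; the regression fill uses `x` and stays close to the true mean. If missingness depended on `y` itself, neither method would be expected to recover it.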


In [12]:
# 4. Imputation Strategy 3: Regression Imputation (Non-Linear):

#Create a third clean dataset copy (Dataset C). For the same column as in Strategy 2, use a non-linear regression model (e.g., K-Nearest Neighbors Regression or Decision Tree Regression) to predict the missing values.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Make a fresh copy
df_C = df.copy()

# Column with missing values
target_col_missing = 'BILL_AMT1'

# Split data into rows with and without missing target values
df_not_missing = df_C[df_C[target_col_missing].notna()]
df_missing = df_C[df_C[target_col_missing].isna()]

print(f"\nNumber of missing rows in '{target_col_missing}': {len(df_missing)}")

# Define features (all other columns) and target
X_train = df_not_missing.drop(columns=[target_col_missing])
y_train = df_not_missing[target_col_missing]

# Prepare rows where target_col_missing is NaN
X_pred = df_missing.drop(columns=[target_col_missing])

# Fill remaining missing numeric values (excluding the target column) with medians
X_train = X_train.fillna(X_train.median())
X_pred = X_pred.fillna(X_pred.median())

# --- Choose one non-linear regression model ---
# Option 1: K-Nearest Neighbors Regressor
model = KNeighborsRegressor(n_neighbors=5)

# Option 2: Decision Tree Regressor (uncomment the next line to use it instead)
# model = DecisionTreeRegressor(max_depth=10, random_state=42)

# Fit model on all available (non-missing) data
model.fit(X_train, y_train)

# Predict missing BILL_AMT1 values
predicted_values = model.predict(X_pred)

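# Replace missing values with predicted values
df_C.loc[df_C[target_col_missing].isna(), target_col_missing] = predicted_values

# AGE still contains NaNs in this copy; fill any remaining gaps with column medians
# so that Dataset C is fully clean for the classifier in Part B
df_C = df_C.fillna(df_C.median())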

# Verify the imputation
print("\nMissing values after non-linear regression imputation:")
print(df_C[target_col_missing].isna().sum())

print("\nShape of Dataset C:", df_C.shape)
print(df_C[[target_col_missing]].describe())



Number of missing rows in 'BILL_AMT1': 1497

Missing values after non-linear regression imputation:
0

Shape of Dataset C: (30000, 25)
           BILL_AMT1
count   30000.000000
mean    51138.641393
std     73273.520809
min   -165580.000000
25%      3699.250000
50%     22430.500000
75%     66982.000000
max    964511.000000
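
KNN relies on distances, so features on very different scales (credit limits in the hundreds of thousands next to small category codes) can dominate the neighbor search. The sketch below is one possible variant, not the method used in the cell above: it standardizes features, learns the fill medians from the training rows only, and leaves an identifier and the classification label out of the predictors. `knn_impute_column` is a hypothetical helper, the toy data is made up, and the `"ID"` and `"label"` column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def knn_impute_column(frame, column, exclude=(), n_neighbors=5):
    """Return a copy of `frame` with NaNs in `column` replaced by KNN predictions."""
    features = [c for c in frame.columns if c != column and c not in exclude]
    is_missing = frame[column].isna()
    model = make_pipeline(
        SimpleImputer(strategy="median"),  # medians learned from the training rows only
        StandardScaler(),                  # put all features on a comparable scale
        KNeighborsRegressor(n_neighbors=n_neighbors),
    )
    model.fit(frame.loc[~is_missing, features], frame.loc[~is_missing, column])
    out = frame.copy()
    out.loc[is_missing, column] = model.predict(frame.loc[is_missing, features])
    return out

# Tiny synthetic frame with made-up values
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "ID": np.arange(500),
    "LIMIT_BAL": rng.uniform(10_000, 500_000, size=500),
    "AGE": rng.integers(21, 80, size=500).astype(float),
    "label": rng.integers(0, 2, size=500),
})
toy["BILL_AMT1"] = 0.2 * toy["LIMIT_BAL"] + rng.normal(0, 5_000, size=500)
toy.loc[toy.sample(25, random_state=0).index, "BILL_AMT1"] = np.nan
toy.loc[toy.sample(25, random_state=1).index, "AGE"] = np.nan

toy_C = knn_impute_column(toy, "BILL_AMT1", exclude=("ID", "label"))
print("remaining NaNs in BILL_AMT1:", toy_C["BILL_AMT1"].isna().sum())
```

The helper imputes only the chosen column; other columns with gaps (AGE in this toy frame) are filled inside the pipeline for prediction purposes but remain NaN in the returned frame and still need their own treatment.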


### Part B: Model Training and Performance Assessment

In [None]:
# 1. Data Split: For each of the three imputed datasets (A, B, C), split the data into training and testing sets. Also, create a fourth dataset (Dataset D) by simply removing all rows that contain any missing values (Listwise Deletion). Split Dataset D into training and testing sets.
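
One way to set up this step is sketched below on synthetic data. The 80/20 ratio, the stratification on the target, the seed, the `split_dataset` helper name and the `"target"` column are all assumptions made for illustration; the assignment text does not fix them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(frame, target, test_size=0.2, seed=42):
    """Stratified train/test split returning X_train, X_test, y_train, y_test."""
    X = frame.drop(columns=[target])
    y = frame[target]
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)

# Synthetic stand-in with made-up values
rng = np.random.default_rng(3)
raw = pd.DataFrame({
    "AGE": rng.integers(21, 80, size=1_000).astype(float),
    "BILL_AMT1": rng.normal(50_000, 20_000, size=1_000),
    "target": rng.integers(0, 2, size=1_000),
})
raw.loc[raw.sample(50, random_state=0).index, "AGE"] = np.nan
raw.loc[raw.sample(50, random_state=1).index, "BILL_AMT1"] = np.nan

# Dataset D: listwise deletion drops every row that has at least one NaN
df_D_toy = raw.dropna()

# An imputed dataset (here a simple median fill) keeps all rows
df_A_toy = raw.fillna(raw.median())

splits = {
    name: split_dataset(frame, "target")
    for name, frame in {"A": df_A_toy, "D": df_D_toy}.items()
}
for name, (X_tr, X_te, y_tr, y_te) in splits.items():
    print(name, "train:", X_tr.shape, "test:", X_te.shape)
```

Because listwise deletion removes rows, Dataset D ends up with a different and smaller test set than the imputed datasets, which is worth keeping in mind when comparing metrics across A, B, C and D.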

