# DA5401 A6: Imputation via Regression for Missing Data

By Muhammad Ruhaib, BE22B005



In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

file_path = 'UCI_Credit_Card.csv'

# A: Data Preprocessing and Imputation

## Part A.1 — Load and prepare data

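Here, we load the dataset and artificially introduce missing values in three numerical columns, at a rate of about 5% each; I've chosen the columns 'AGE', 'BILL_AMT1' and 'BILL_AMT2'. We treat the result as Missing At Random (MAR). Because the mask below is drawn independently of every column, the mechanism is strictly Missing Completely At Random (MCAR), a special case that also satisfies the MAR assumption.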
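For reference, the difference between a mask drawn completely at random and one that depends on an observed column can be sketched as follows. This is a minimal illustration on a synthetic frame, not the notebook's injection code: the toy values, the `inject_mcar` and `inject_mar` helpers, and the choice of `LIMIT_BAL` as the column that drives missingness are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def inject_mcar(df, col, rate, rng):
    """Blank out `col` with the same probability for every row (MCAR)."""
    out = df.copy()
    mask = rng.random(len(out)) < rate
    out.loc[mask, col] = np.nan
    return out

def inject_mar(df, col, driver, rate, rng):
    """Blank out `col` with a probability that depends on an observed column (MAR).

    Rows whose `driver` value is above its median are three times as likely
    to lose `col` as the remaining rows; the overall rate stays near `rate`.
    """
    out = df.copy()
    high = (out[driver] > out[driver].median()).to_numpy()
    share_high = high.mean()
    # Solve p_low * (1 - share_high) + 3 * p_low * share_high = rate for p_low.
    p_low = rate / (1 + 2 * share_high)
    p = np.where(high, 3 * p_low, p_low)
    mask = rng.random(len(out)) < p
    out.loc[mask, col] = np.nan
    return out

rng = np.random.default_rng(0)
# Hypothetical stand-in data; not the credit card dataset.
toy = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10_000, 500_000, size=20_000).astype(float),
    "AGE": rng.integers(21, 70, size=20_000).astype(float),
})
mcar = inject_mcar(toy, "AGE", 0.05, rng)
mar = inject_mar(toy, "AGE", "LIMIT_BAL", 0.05, rng)

# Compare the missing rate of AGE between high-limit and low-limit rows.
high = toy["LIMIT_BAL"] > toy["LIMIT_BAL"].median()
mcar_high, mcar_low = mcar["AGE"].isna()[high].mean(), mcar["AGE"].isna()[~high].mean()
mar_high, mar_low = mar["AGE"].isna()[high].mean(), mar["AGE"].isna()[~high].mean()
print(f"MCAR: high-limit rows {mcar_high:.3f}, low-limit rows {mcar_low:.3f}")
print(f"MAR : high-limit rows {mar_high:.3f}, low-limit rows {mar_low:.3f}")
```

Under MCAR the two groups lose AGE at about the same rate, whereas under MAR the rate differs by group but can still be explained entirely by an observed column.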

In [2]:
df_orig = pd.read_csv(file_path)
df = df_orig.copy()

#The dataset has an index column named 'ID'; we drop it because it is not a feature
if 'ID' in df.columns:
    df = df.drop(columns=['ID'])

#We rename the target column for convenience
if 'default.payment.next.month' in df.columns:
    df = df.rename(columns={'default.payment.next.month':'default'})

print('Dataset shape:', df.shape)
print('Columns:', df.columns.tolist())

Dataset shape: (30000, 24)
Columns: ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default']


In [3]:
# Introducing MAR missingness into AGE, BILL_AMT1 and BILL_AMT2 (5% each):
rng = np.random.default_rng(seed=42)
df_mar = df.copy()
n = len(df_mar)
for col in ['AGE','BILL_AMT1','BILL_AMT2']:
    if col in df_mar.columns:
        mask = rng.choice([False, True], size=n, p=[0.95, 0.05])
        df_mar.loc[mask, col] = np.nan
    else:
        print(f"Warning: column {col} not found — skipping missing injection for it.")

print('Missing values per column after MAR injection:')
print(df_mar.isna().sum().loc[lambda x: x>0])
df_mar.head()

Missing values per column after MAR injection:
AGE          1493
BILL_AMT1    1507
BILL_AMT2    1538
dtype: int64


Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000.0,2,2,1,24.0,2,2,-1,-1,-2,-2,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,120000.0,2,2,2,26.0,-1,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,90000.0,2,2,2,34.0,0,0,0,0,0,0,,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,50000.0,2,2,1,37.0,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,50000.0,1,2,1,57.0,-1,0,-1,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


## Part A.2 — Imputation Strategy 1: Simple (Median) Imputation — Dataset A

Our first imputation method replaces missing values in each numeric column with the median of that column.
This approach is simple, fast, and effective, especially when the data are skewed or contain outliers.

In [4]:
# Cell: Median imputation (Dataset A)
dataset_A = df_mar.copy()
median_values = dataset_A.median(numeric_only=True)
dataset_A = dataset_A.fillna(median_values)

print('Any missing left in Dataset A?', dataset_A.isna().any().any())
dataset_A.head()

Any missing left in Dataset A? False


Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000.0,2,2,1,24.0,2,2,-1,-1,-2,-2,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,120000.0,2,2,2,26.0,-1,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,90000.0,2,2,2,34.0,0,0,0,0,0,0,22391.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,50000.0,2,2,1,37.0,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,50000.0,1,2,1,57.0,-1,0,-1,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


**Why median is often preferred over mean for imputation:**

The median is robust to outliers and skewed distributions; a few extreme values do not shift the median as much as they shift the mean. When a numerical feature is skewed (common in billing/amount features), median imputation produces more representative central estimates and avoids injecting unrealistic values.
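A small numeric illustration of this point, using made-up bill amounts (the values below are hypothetical and are not taken from the dataset):

```python
import numpy as np

# Hypothetical bill amounts: mostly moderate, with one extreme account.
bills = np.array([1200.0, 1500.0, 1800.0, 2100.0, 2500.0, 3000.0, 950000.0])

mean_fill = bills.mean()
median_fill = np.median(bills)

print(f"mean   = {mean_fill:,.1f}")
print(f"median = {median_fill:,.1f}")

# Share of observed values that lie below each candidate fill value.
print("share below mean:  ", (bills < mean_fill).mean())
print("share below median:", (bills < median_fill).mean())
```

The single extreme account pulls the mean above six of the seven observed values, so a mean-filled entry would look unlike almost every real record, while the median (2,100) stays in the middle of the typical range.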

## Part A.3 — Imputation Strategy 2: Regression Imputation (Linear) — Dataset B

Next, we move to Regression Imputation, where missing values are estimated using relationships between variables.

We will impute missing values in the AGE column using a Linear Regression model trained on other features in the dataset.
The key steps are:

1. Select rows where AGE is observed to train the model

2. Temporarily fill missing values in the other predictors with their medians so that the model can be fitted

3. Predict the missing AGE values using the trained regression model

This method assumes that the relationship between predictors and the target (AGE) is approximately linear.

In [5]:
dataset_B = df_mar.copy()

target_col = 'AGE'
if target_col not in dataset_B.columns:
    raise KeyError(f"Target column {target_col} not found in dataset.")

#First, we prepare training data for regression (rows where AGE is not null)
train_mask = dataset_B[target_col].notna()
X_train = dataset_B.loc[train_mask].drop(columns=[target_col])
y_train = dataset_B.loc[train_mask, target_col]

#For predictors, we fill their missing values with median for the regression training stage
X_train_imputed = X_train.copy()
med = X_train_imputed.median(numeric_only=True)
X_train_imputed = X_train_imputed.fillna(med)

#This ensures we use only numeric predictors
X_train_imputed = X_train_imputed.select_dtypes(include=[np.number])

lr = LinearRegression()
lr.fit(X_train_imputed, y_train)

#Now, we can predict missing AGE values:
missing_mask = dataset_B[target_col].isna()
X_missing = dataset_B.loc[missing_mask].drop(columns=[target_col])
X_missing_imputed = X_missing.copy().fillna(med)  # same median used!
X_missing_imputed = X_missing_imputed.select_dtypes(include=[np.number])

imputed_values = lr.predict(X_missing_imputed)
dataset_B.loc[missing_mask, target_col] = imputed_values

#For any remaining missing values in other columns (we only imputed AGE), fill with median
dataset_B = dataset_B.fillna(dataset_B.median(numeric_only=True))

print('Any missing left in Dataset B?', dataset_B.isna().any().any())
dataset_B.head()

Any missing left in Dataset B? False


Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000.0,2,2,1,24.0,2,2,-1,-1,-2,-2,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,120000.0,2,2,2,26.0,-1,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,90000.0,2,2,2,34.0,0,0,0,0,0,0,22391.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,50000.0,2,2,1,37.0,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,50000.0,1,2,1,57.0,-1,0,-1,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


**Assumption behind regression imputation:**

Regression imputation assumes that data are Missing At Random (MAR): the probability that AGE is missing depends only on observed data (predictors) and not on the true (missing) AGE value itself. If missingness depends on the unobserved AGE itself (MNAR), regression imputation can be biased.
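The effect of this assumption can be illustrated with a small simulation. This is a sketch on synthetic data, not an analysis of the credit dataset: the predictor `x`, the linear relationship, the noise level, and the two missingness rules are all assumptions chosen for the example, and the helper `impute_mean_bias` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic stand-ins: x is an always-observed predictor, age depends linearly on it.
x = rng.normal(size=n)
age = 35 + 5 * x + rng.normal(scale=5, size=n)

def impute_mean_bias(missing):
    """Fit age ~ x on observed rows, impute the rest, return (imputed mean - true mean)."""
    slope, intercept = np.polyfit(x[~missing], age[~missing], deg=1)
    filled = age.copy()
    filled[missing] = intercept + slope * x[missing]
    return filled.mean() - age.mean()

# MAR: missingness depends only on the observed predictor x.
mar_mask = (x > 0.5) & (rng.random(n) < 0.8)
# MNAR: missingness depends on the unobserved age value itself.
mnar_mask = (age > 40) & (rng.random(n) < 0.8)

bias_mar = impute_mean_bias(mar_mask)
bias_mnar = impute_mean_bias(mnar_mask)
print(f"bias in mean age under MAR : {bias_mar:+.3f}")
print(f"bias in mean age under MNAR: {bias_mnar:+.3f}")
```

In this setup the MAR bias stays close to zero, while the MNAR case underestimates the mean because the regression never sees most of the high ages it is asked to reconstruct.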

## Part A.4 — Imputation Strategy 3: Regression Imputation (Non-Linear) — Dataset C

While linear regression captures only straight-line relationships, real-world data often exhibit non-linear dependencies.
To model this complexity, we use a Decision Tree Regressor to predict missing values in the AGE column.

This model can split the data space into regions where different patterns exist, potentially improving imputation accuracy if AGE depends on other features in a non-linear fashion.
However, this flexibility also increases the risk of overfitting, especially when training data are limited. The procedure below mirrors the linear case but uses a nonlinear regressor instead.
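One way to check whether the extra flexibility helps or merely overfits is to hide some known target values and score each imputer on them. The sketch below does this on synthetic data with a deliberately non-linear (step-shaped) relationship. The data, the candidate models, and the depth settings are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 3))
# Hypothetical non-linear target: a step in the first predictor plus noise.
y = 30 + 10 * (X[:, 0] > 0) + rng.normal(scale=3, size=n)

# Hide 20% of the known targets to play the role of "missing" values.
X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "tree(max_depth=3)": DecisionTreeRegressor(max_depth=3, random_state=0),
    "tree(unbounded)": DecisionTreeRegressor(random_state=0),
}
rmse = {}
for name, model in candidates.items():
    model.fit(X_fit, y_fit)
    # Root mean squared error on the hidden targets.
    rmse[name] = mean_squared_error(y_hold, model.predict(X_hold)) ** 0.5
    print(f"{name:18s} hold-out RMSE = {rmse[name]:.2f}")
```

On data like this, a shallow tree tends to beat both the linear model and a fully grown tree, which memorises noise. Capping the depth, as the cell below does with `max_depth=8`, is one guard against that.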

In [6]:
#Cell: Non-linear regression (Decision Tree) imputation for 'AGE' (Dataset C)
dataset_C = df_mar.copy()
target_col = 'AGE'

#Prepare training data
train_mask = dataset_C[target_col].notna()
X_train = dataset_C.loc[train_mask].drop(columns=[target_col])
y_train = dataset_C.loc[train_mask, target_col]

#Impute predictor missing values with median for training
X_train_imp = X_train.fillna(X_train.median(numeric_only=True)).select_dtypes(include=[np.number])

dt = DecisionTreeRegressor(random_state=42, max_depth=8)
dt.fit(X_train_imp, y_train)

#Predict missing AGE
missing_mask = dataset_C[target_col].isna()
X_missing = dataset_C.loc[missing_mask].drop(columns=[target_col])
X_missing_imp = X_missing.fillna(X_train.median(numeric_only=True)).select_dtypes(include=[np.number])

imputed_values_dt = dt.predict(X_missing_imp)
dataset_C.loc[missing_mask, target_col] = imputed_values_dt

#Fill any remaining missing with median
dataset_C = dataset_C.fillna(dataset_C.median(numeric_only=True))

print('Any missing left in Dataset C?', dataset_C.isna().any().any())
dataset_C.head()

Any missing left in Dataset C? False


Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000.0,2,2,1,24.0,2,2,-1,-1,-2,-2,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,120000.0,2,2,2,26.0,-1,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,90000.0,2,2,2,34.0,0,0,0,0,0,0,22391.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,50000.0,2,2,1,37.0,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,50000.0,1,2,1,57.0,-1,0,-1,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


# B: Model Training and Performance Assessment

## Part B.1 — Data Split

We now prepare all four datasets for modeling:

* Dataset A — Median Imputation

* Dataset B — Linear Regression Imputation

* Dataset C — Non-linear Regression Imputation

* Dataset D — Listwise Deletion (rows with missing values dropped)

Each dataset is split into training (80%) and testing (20%) sets using stratified sampling to preserve the target variable distribution.
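As a quick illustration of what stratification buys, the sketch below splits a synthetic imbalanced target and compares class rates. The 22% positive rate is an assumption chosen to resemble a default flag, and the feature matrix is random filler.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical imbalanced target with roughly 22% positives.
y = (rng.random(10_000) < 0.22).astype(int)
X = rng.normal(size=(10_000, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"positive rate: full={y.mean():.4f}  train={y_tr.mean():.4f}  test={y_te.mean():.4f}")
```

With `stratify=y`, the train and test positive rates match the full-data rate almost exactly, so minority-class metrics are not distorted by an unlucky split.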

Listwise deletion is included as a benchmark, representing a traditional but potentially wasteful approach to handling missing data.
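How wasteful listwise deletion is here can be estimated with a line of arithmetic, assuming (as in the injection cell) that the three masks are drawn independently with probability 0.05 each:

```python
n_rows = 30_000      # rows in the dataset (from the shape printed earlier)
p_missing = 0.05     # per-column missingness probability
n_cols = 3           # AGE, BILL_AMT1, BILL_AMT2

# With independent masks, a row survives only if all three columns are observed.
p_complete = (1 - p_missing) ** n_cols
expected_complete = n_rows * p_complete

print(f"P(row complete)    = {p_complete:.6f}")
print(f"expected rows kept = {expected_complete:.0f}")
print(f"expected rows lost = {n_rows - expected_complete:.0f}")
```

So although each column is only about 5% missing, roughly 14% of rows are expected to be discarded, about 25,721 rows kept, which is in line with the shape of Dataset D reported below.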

In [7]:
from sklearn.model_selection import train_test_split

#Ensuring the target column exists and is named 'default'
for name, dset in [('A',dataset_A), ('B',dataset_B), ('C',dataset_C)]:
    if 'default' not in dset.columns:
        raise KeyError('Target column "default" not found in dataset ' + name)

#We also create Dataset D: listwise deletion from the MAR dataset (drop rows with any NaN)
dataset_D = df_mar.dropna().copy()
print('Dataset D shape after listwise deletion:', dataset_D.shape)

#Function to split dataset into X_train, X_test, y_train, y_test
def split_dataset(df, target='default', test_size=0.2, random_state=42):
    X = df.drop(columns=[target])
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)

A_X_train, A_X_test, A_y_train, A_y_test = split_dataset(dataset_A)
B_X_train, B_X_test, B_y_train, B_y_test = split_dataset(dataset_B)
C_X_train, C_X_test, C_y_train, C_y_test = split_dataset(dataset_C)
D_X_train, D_X_test, D_y_train, D_y_test = split_dataset(dataset_D)

print('Train/Test sizes (A):', A_X_train.shape, A_X_test.shape)
print('Train/Test sizes (D):', D_X_train.shape, D_X_test.shape)

Dataset D shape after listwise deletion: (25719, 24)
Train/Test sizes (A): (24000, 23) (6000, 23)
Train/Test sizes (D): (20575, 23) (5144, 23)


## Part B.2 — Classifier Setup

Since we are using Logistic Regression as our predictive model, it is essential to scale all numeric features to a comparable range.
We apply StandardScaler, which transforms features to have zero mean and unit variance based on the training data statistics, and then use the same transformation on the test set.

This puts all features on a comparable scale, so that the regularization penalty treats their coefficients evenly and the solver converges more reliably.
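The fit-on-train, transform-on-test rule can be sketched on toy data as follows. The two synthetic features and the label rule are assumptions for illustration. The `make_pipeline` variant at the end is an alternative way to package the same rule, not what the notebook itself does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales (a credit limit and an age).
X_train = np.column_stack([rng.normal(150_000, 100_000, 800), rng.normal(35, 9, 800)])
X_test = np.column_stack([rng.normal(150_000, 100_000, 200), rng.normal(35, 9, 200)])
y_train = (X_train[:, 1] > 35).astype(int)

scaler = StandardScaler().fit(X_train)   # statistics come from the training split only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # test split reuses the training mean and std

print("train mean:", X_train_s.mean(axis=0).round(3), "std:", X_train_s.std(axis=0).round(3))
print("test  mean:", X_test_s.mean(axis=0).round(3), "std:", X_test_s.std(axis=0).round(3))

# The same rule, bundled so the scaler cannot be refit on test data by accident.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The test columns are close to that but not exact, which is expected, because no test statistics were used.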

In [8]:
scalers = {}
def scale_data(X_train, X_test):
    scaler = StandardScaler()
    num_cols = X_train.select_dtypes(include=[np.number]).columns
    X_train_num = X_train.copy()
    X_test_num = X_test.copy()
    X_train_num[num_cols] = scaler.fit_transform(X_train[num_cols])
    X_test_num[num_cols] = scaler.transform(X_test[num_cols])
    return X_train_num, X_test_num, scaler

A_X_train_s, A_X_test_s, scalers['A'] = scale_data(A_X_train, A_X_test)
B_X_train_s, B_X_test_s, scalers['B'] = scale_data(B_X_train, B_X_test)
C_X_train_s, C_X_test_s, scalers['C'] = scale_data(C_X_train, C_X_test)
D_X_train_s, D_X_test_s, scalers['D'] = scale_data(D_X_train, D_X_test)

print('Scaling complete. Example scaled feature (A):')
A_X_train_s.iloc[:, :5].head()

Scaling complete. Example scaled feature (A):


| | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE |
|---|---|---|---|---|---|
| 22788 | -0.056866 | 0.80844 | 0.184523 | 0.856739 | -0.262533 |
| 29006 | -0.134081 | 0.80844 | -1.077532 | 0.856739 | -0.15087 |
| 16950 | -1.21509 | -1.23695 | 0.184523 | -1.059367 | 1.635734 |
| 22280 | 0.406423 | 0.80844 | -1.077532 | 0.856739 | -0.709184 |
| 11346 | 1.101358 | 0.80844 | -1.077532 | 0.856739 | -0.374196 |


## Part B.3 — Model Evaluation

To evaluate how each imputation strategy affects downstream model performance, we train a Logistic Regression classifier on each dataset (A, B, C, and D).

Logistic Regression is well-suited here because:

* It provides interpretable results

* It models the log-odds of default as a linear function of the features, which gives a linear decision boundary

* It is sensitive to differences in data preprocessing and imputation quality

We will compare accuracy, precision, recall, and F1-score for each imputation method to understand their relative effectiveness.

In [9]:
from sklearn.metrics import classification_report

results = {}

def train_and_report(X_train, X_test, y_train, y_test, label):
    clf = LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)
    clf.fit(X_train.select_dtypes(include=[np.number]), y_train)
    y_pred = clf.predict(X_test.select_dtypes(include=[np.number]))
    report = classification_report(y_test, y_pred, output_dict=True)
    print(f"\nClassification report for: {label}")
    print(classification_report(y_test, y_pred))
    return report, clf

results['A'], clf_A = train_and_report(A_X_train_s, A_X_test_s, A_y_train, A_y_test, 'Model A (Median Imputation)')
results['B'], clf_B = train_and_report(B_X_train_s, B_X_test_s, B_y_train, B_y_test, 'Model B (Linear Regression Imputation)')
results['C'], clf_C = train_and_report(C_X_train_s, C_X_test_s, C_y_train, C_y_test, 'Model C (Decision Tree Imputation)')
results['D'], clf_D = train_and_report(D_X_train_s, D_X_test_s, D_y_train, D_y_test, 'Model D (Listwise Deletion)')


Classification report for: Model A (Median Imputation)
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.68      0.24      0.35      1327

    accuracy                           0.81      6000
   macro avg       0.75      0.60      0.62      6000
weighted avg       0.79      0.81      0.77      6000


Classification report for: Model B (Linear Regression Imputation)
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.69      0.24      0.35      1327

    accuracy                           0.81      6000
   macro avg       0.75      0.60      0.62      6000
weighted avg       0.79      0.81      0.77      6000


Classification report for: Model C (Decision Tree Imputation)
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.69      0.24      0.35      13

# C: Comparative Analysis

## Part C.1 — Results Comparison

After obtaining classification reports for all models, we summarize the key performance metrics, particularly the F1-score for the positive class (default = 1).

This summary table allows us to visually compare how different imputation methods influence predictive accuracy and balance between false positives and false negatives.

In [10]:
summary_rows = []
for key, rep in results.items():
    acc = rep.get('accuracy', None)
    #For binary classification, sklearn uses labels '0' and '1' in the report dict
    f1_class1 = rep.get('1', {}).get('f1-score', None)
    precision_class1 = rep.get('1', {}).get('precision', None)
    recall_class1 = rep.get('1', {}).get('recall', None)
    summary_rows.append({'Model': key, 'Accuracy': acc, 'Precision_class1': precision_class1, 'Recall_class1': recall_class1, 'F1_class1': f1_class1})

summary_df = pd.DataFrame(summary_rows).set_index('Model')
summary_df

| Model | Accuracy | Precision_class1 | Recall_class1 | F1_class1 |
|---|---|---|---|---|
| A | 0.807167 | 0.684783 | 0.237378 | 0.352546 |
| B | 0.8075 | 0.686147 | 0.238885 | 0.354388 |
| C | 0.807333 | 0.685466 | 0.238131 | 0.353468 |
| D | 0.813375 | 0.72 | 0.253521 | 0.375 |


All four models performed similarly. Models A–C (median, linear regression, and decision tree imputation) reached near-identical accuracy of about 0.807, and Model D (listwise deletion) was slightly higher at 0.813. Precision and recall for the majority class (no default) remained high across all models, while recall for the minority class (default) was low (~0.24–0.25), resulting in modest F1-scores of about 0.35–0.38. The near-identical performance of the three imputation methods suggests that the choice of imputation technique had minimal impact on predictive performance for this dataset. Although listwise deletion achieved marginally higher scores, it was trained and evaluated on a smaller set of complete rows (5,144 test rows versus 6,000), so the figures are not strictly comparable, and the gain comes at the cost of losing data. The imputation methods retain the full dataset without degrading model performance, making them more practical overall.
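To make the low minority-class recall concrete, the confusion-matrix counts for Model A can be backed out from the figures in the summary table and the class support in its classification report. These counts are an inference from rounded metrics, not an output of the notebook.

```python
# Model A figures taken from the summary table and its classification report.
n_test = 6000
n_pos = 1327             # support of class 1 (default)
accuracy = 0.807167
precision_1 = 0.684783
recall_1 = 0.237378

tp = round(recall_1 * n_pos)          # defaulters correctly flagged
fn = n_pos - tp                       # defaulters missed
pred_pos = round(tp / precision_1)    # everything flagged as default
fp = pred_pos - tp                    # non-defaulters wrongly flagged
tn = round(accuracy * n_test) - tp    # correct predictions that are not true positives

# F1 recomputed from the counts as a consistency check against the table.
f1 = 2 * tp / (2 * tp + fp + fn)
print(f"TP={tp} FN={fn} FP={fp} TN={tn} F1={f1:.6f}")
```

This gives TP=315, FN=1012, FP=145, TN=4528, and the recomputed F1 of 0.352546 matches the table. In other words, the model misses roughly three out of every four actual defaulters, which is what keeps the F1-score low despite 81% accuracy.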

## Part C.2 — Efficacy Discussion

Listwise deletion (Model D) achieved slightly higher accuracy (0.813) and F1-score (0.375) compared to the imputed models (A–C, all around 0.81 accuracy and 0.35 F1). While this might seem like better performance, it comes with a key trade-off: listwise deletion discards all rows containing missing values, reducing the dataset size and potentially introducing bias if the missingness is not completely random. The model is trained on fewer, cleaner samples, which may simplify the learning task but reduce its generalizability to the full data distribution. In contrast, imputation methods (A–C) preserve all observations, maintaining statistical power and representativeness even if their immediate performance metrics appear slightly lower.

Both regression-based imputation methods, Model B (linear regression) and Model C (decision tree regression), performed almost identically, with Model B showing a marginally higher F1-score (0.354 vs. 0.353). This negligible difference is consistent with the relationship between the imputed variable (AGE) and the other predictors being largely linear or weak, in which case a more flexible non-linear model provides little advantage. It should be read with caution, though: only about 5% of AGE values were imputed, and AGE is one of 23 features, so any difference between the two imputers is heavily diluted by the time it reaches the classifier. Either way, the decision tree brings no measurable benefit here, so linear regression is sufficient and computationally simpler.

Given the small performance differences and conceptual trade-offs, median or linear regression imputation emerges as the best overall strategy for handling missing data in this dataset. These methods keep the complete dataset, avoid information loss, and achieve classification results nearly identical to those of listwise deletion. Although listwise deletion yields slightly higher metrics, its reduced sample size risks bias and limits generalizability. Regression imputation, particularly the linear approach, balances simplicity, data retention, and interpretability, making it the most practical and theoretically sound choice for managing missing data under the MAR assumption in this credit default prediction context.