

<h1> <b> <centre> DA5401 Assignment 6 </b> </centre> </h2>
<h4> <b> <centre> Name: Pawar Devesh Pramod </centre></b> </h4>
<h4> <b> <centre> Roll No: ME22B176  </centre></b> </h4>
<h4> <b> <centre> Date of Submission 17/10/2025 </centre></b> </h4>


<h3> <b> Objective: </b> <h3> 
<p style="font-size: 18px;">This assignment challenges you to apply linear and non-linear regression to impute
missing values in a dataset. The effectiveness of your imputation methods will be measured
indirectly by assessing the performance of a subsequent classification task, comparing the
regression-based approach against simpler imputation strategies. </p>

<h3> <b> 1. Problem Statement </b> </h3>
<p style="font-size: 18 px;">
You are a machine learning engineer working on a credit risk assessment project. You have
been provided with the UCI Credit Card Default Clients Dataset. This dataset has missing
values in several important feature columns. The presence of missing data prevents the
immediate application of many classification algorithms.
Your task is to implement three different strategies for handling the missing data and then use
the resulting clean datasets to train and evaluate a classification model. This will demonstrate
how the choice of imputation technique significantly impacts final model performance.</p>

<p style="font-size: 18px;"> Dataset: 
- <a href="https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset/data">Kaggle - Credit Card
Default Clients Dataset</a>
</p>

In [3]:
""" Import the dataset """
import pandas as pd

df = pd.read_csv('UCI_Credit_Card.csv')
print(f'The head of dataset: ', df.head())
print(f'Null values in the dataset: ', df.isnull().sum())


The head of dataset:     ID  LIMIT_BAL  SEX  ...  PAY_AMT5  PAY_AMT6  default.payment.next.month
0   1    20000.0    2  ...       0.0       0.0                           1
1   2   120000.0    2  ...       0.0    2000.0                           1
2   3    90000.0    2  ...    1000.0    5000.0                           0
3   4    50000.0    2  ...    1069.0    1000.0                           0
4   5    50000.0    1  ...     689.0     679.0                           0

[5 rows x 25 columns]
Null values in the dataset:  ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                   

In [4]:
df.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default.payment.next.month'],
      dtype='object')

---

## Part A: Data Preprocessing and Imputation

### 1. Artificially Introduce MAR Missing Values

To simulate a real-world scenario, we need to introduce missing data. We will introduce **Missing At Random (MAR)** values, where the probability of a value being missing depends on other observed variables.

- We will introduce approximately **8%** missing values into two numerical columns: `AGE` and `BILL_AMT1`.
- The missingness in `AGE` will be dependent on the `EDUCATION` level.
- The missingness in `BILL_AMT1` will be dependent on the `LIMIT_BAL`.

In [7]:
""" Intorducing MAR """
from sklearn.preprocessing import MinMaxScaler
import numpy as np

def introduce_mar_missing(df, target_col, predictor_col, missing_frac=0.08):
    """
    Introduces MAR values into a target column based on a predictor column.
    A higher value in the predictor column increases the probability of missingness.
    """
    df_mar = df.copy()
    
    # Scale the predictor column to create a well-behaved probability score
    scaler = MinMaxScaler()
    predictor_scaled = scaler.fit_transform(df_mar[[predictor_col]])
    
    # Create a probability based on the scaled predictor
    # We adjust the formula to control the overall missingness rate
    logit = predictor_scaled[:, 0] * 2.5 - 1.5 # Coefficients are tuned to get near the desired fraction
    probabilities = 1 / (1 + np.exp(-logit))

    # To precisely control the fraction, we find a threshold
    prob_threshold = np.quantile(probabilities, 1 - missing_frac)
    
    # Create a mask where the probability is above the threshold
    mask = probabilities > prob_threshold
    
    # Introduce NaNs
    df_mar.loc[mask, target_col] = np.nan
    return df_mar

# Create a copy to work with
df_mar = df.copy()

# Introduce MAR values into 'AGE' based on 'EDUCATION'
df_mar = introduce_mar_missing(df_mar, 'AGE', 'EDUCATION', missing_frac=0.08)

# Introduce MAR values into 'BILL_AMT1' based on 'LIMIT_BAL'
df_mar = introduce_mar_missing(df_mar, 'BILL_AMT1', 'LIMIT_BAL', missing_frac=0.08)

print("Missing values after introducing MAR data:")
print(df_mar.isnull().sum())

Missing values after introducing MAR data:
ID                               0
LIMIT_BAL                        0
SEX                              0
EDUCATION                        0
MARRIAGE                         0
AGE                            454
PAY_0                            0
PAY_2                            0
PAY_3                            0
PAY_4                            0
PAY_5                            0
PAY_6                            0
BILL_AMT1                     2249
BILL_AMT2                        0
BILL_AMT3                        0
BILL_AMT4                        0
BILL_AMT5                        0
BILL_AMT6                        0
PAY_AMT1                         0
PAY_AMT2                         0
PAY_AMT3                         0
PAY_AMT4                         0
PAY_AMT5                         0
PAY_AMT6                         0
default.payment.next.month       0
dtype: int64


### 2. Imputation Strategy 1: Simple Imputation (Baseline)

Our first strategy is to use median imputation. We create `Dataset A` by filling the missing values in `AGE` and `BILL_AMT1` with their respective column medians.

**Why is the median often preferred over the mean?**

The median is generally preferred for imputation in skewed distributions or datasets with outliers. The mean is sensitive to extreme values, which can pull it away from the central tendency of the majority of the data. The **median**, being the 50th percentile, is robust to outliers and often provides a more representative value for imputation in such cases.

In [9]:
# Create Dataset A
df_A = df_mar.copy()

# Calculate medians
age_median = df_A['AGE'].median()
bill_amt1_median = df_A['BILL_AMT1'].median()

# Impute missing values
df_A['AGE'].fillna(age_median, inplace=True)
df_A['BILL_AMT1'].fillna(bill_amt1_median, inplace=True)

print("Dataset A: Missing values after Median Imputation")
print(df_A.isnull().sum().sum()) # Should be 0

Dataset A: Missing values after Median Imputation
0


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_A['AGE'].fillna(age_median, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_A['BILL_AMT1'].fillna(bill_amt1_median, inplace=True)


### 3. Imputation Strategy 2: Regression Imputation (Linear)

For `Dataset B`, we will use linear regression to predict and impute the missing values. We will focus on imputing `BILL_AMT1`, as it is more likely to have a linear relationship with other financial features than `AGE`. For simplicity, we will still use median imputation for the `AGE` column in this dataset.

**Underlying Assumption (Missing At Random - MAR):**

t
This method assumes that the missingness of a value can be explained by other observed variables in the dataset. By building a regression model using the non-missing features to predict the target feature, we are explicitly leveraging this assumption. The model learns the relationship between the features and the target from the complete data and uses that learned relationship to estimate the most likely values for the missing entries.

In [12]:
from sklearn.linear_model import LinearRegression

# Create Dataset B
df_B = df_mar.copy()

# First, impute 'AGE' with the median to simplify the process
df_B['AGE'].fillna(df_B['AGE'].median(), inplace=True)

# Impute 'BILL_AMT1' using Linear Regression 

# Separate the dataframe into two parts
df_impute_train = df_B[df_B['BILL_AMT1'].notnull()]
df_impute_test = df_B[df_B['BILL_AMT1'].isnull()]

# Define features and target for the imputation model
impute_features = [col for col in df_B.columns if col not in ['BILL_AMT1', 'default_payment_next_month']]
impute_target = 'BILL_AMT1'

X_impute_train = df_impute_train[impute_features]
y_impute_train = df_impute_train[impute_target]
X_impute_test = df_impute_test[impute_features]

# Initialize and train the Linear Regression model
lr_imputer = LinearRegression()
lr_imputer.fit(X_impute_train, y_impute_train)

# Predict the missing values
predicted_bill_amt1 = lr_imputer.predict(X_impute_test)

# Fill the NaNs with the predictions
df_B.loc[df_B['BILL_AMT1'].isnull(), 'BILL_AMT1'] = predicted_bill_amt1

print("Dataset B: Missing values after Linear Regression Imputation")
print(df_B.isnull().sum().sum()) # Should be 0

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_B['AGE'].fillna(df_B['AGE'].median(), inplace=True)


Dataset B: Missing values after Linear Regression Imputation
0


### 4. Imputation Strategy 3: Regression Imputation (Non-Linear)

For `Dataset C`, we use a non-linear regression model, **K-Nearest Neighbors (KNN) Regression**, to impute the missing values in `BILL_AMT1`. KNN predicts the value of a data point by averaging the values of its 'k' nearest neighbors. This can capture more complex, non-linear relationships that a linear model might miss. Again, `AGE` will be imputed with its median.

In [13]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
# Create Dataset C
df_C = df_mar.copy()

# First, impute 'AGE' with the median
df_C['AGE'].fillna(df_C['AGE'].median(), inplace=True)

# Impute 'BILL_AMT1' using KNN Regression 

# We can reuse the splits from the previous step
df_impute_train_c = df_C[df_C['BILL_AMT1'].notnull()]
df_impute_test_c = df_C[df_C['BILL_AMT1'].isnull()]

X_impute_train_c = df_impute_train_c[impute_features]
y_impute_train_c = df_impute_train_c[impute_target]
X_impute_test_c = df_impute_test_c[impute_features]

# For KNN, it's crucial to scale the features first
scaler_impute = StandardScaler()
X_impute_train_c_scaled = scaler_impute.fit_transform(X_impute_train_c)
X_impute_test_c_scaled = scaler_impute.transform(X_impute_test_c)

# Initialize and train the KNN Regression model (e.g., with k=5)
knn_imputer = KNeighborsRegressor(n_neighbors=5)
knn_imputer.fit(X_impute_train_c_scaled, y_impute_train_c)

# Predict the missing values
predicted_bill_amt1_knn = knn_imputer.predict(X_impute_test_c_scaled)

# Fill the NaNs with the predictions
df_C.loc[df_C['BILL_AMT1'].isnull(), 'BILL_AMT1'] = predicted_bill_amt1_knn

print("Dataset C: Missing values after Non-Linear (KNN) Regression Imputation")
print(df_C.isnull().sum().sum()) # Should be 0

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_C['AGE'].fillna(df_C['AGE'].median(), inplace=True)


Dataset C: Missing values after Non-Linear (KNN) Regression Imputation
0


---

## Part B: Model Training and Performance Assessment

Now we will create our fourth dataset, `Dataset D`, using listwise deletion. Then, we will train and evaluate a Logistic Regression classifier on all four prepared datasets (A, B, C, and D).

In [14]:
# Create Dataset D by dropping all rows with any missing values
df_D = df_mar.dropna().copy()

print(f"Original MAR dataset shape: {df_mar.shape}")
print(f"Dataset D (Listwise Deletion) shape: {df_D.shape}")
print(f"Number of rows dropped: {df_mar.shape[0] - df_D.shape[0]}")

Original MAR dataset shape: (30000, 25)
Dataset D (Listwise Deletion) shape: (27328, 25)
Number of rows dropped: 2672


In [15]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

datasets = {
    "A (Median Imputation)": df_A,
    "B (Linear Regression Imputation)": df_B,
    "C (Non-Linear KNN Imputation)": df_C,
    "D (Listwise Deletion)": df_D
}

results = {}

target_variable = 'default.payment.next.month'

for name, df in datasets.items():
    print(f"Processing Model for {name} ")
    
    # 1. Define Features (X) and Target (y)
    X = df.drop(target_variable, axis=1)
    y = df[target_variable]
    
    # 2. Data Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
    # 3. Standardize Features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # 4. Train Logistic Regression Classifier
    log_reg = LogisticRegression(random_state=42, max_iter=1000)
    log_reg.fit(X_train_scaled, y_train)
    
    # 5. Evaluate Performance
    y_pred = log_reg.predict(X_test_scaled)
    
    print(f"\nClassification Report for {name}:\n")
    print(classification_report(y_test, y_pred))
    
    # Store the F1-score for the positive class (1 = default)
    report = classification_report(y_test, y_pred, output_dict=True)
    results[name] = {
        'Accuracy': report['accuracy'],
        'Precision (1)': report['1']['precision'],
        'Recall (1)': report['1']['recall'],
        'F1-Score (1)': report['1']['f1-score']
    }
    print("*-------------------------------------------------*\n")

Processing Model for A (Median Imputation) 

Classification Report for A (Median Imputation):

              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.69      0.24      0.36      1327

    accuracy                           0.81      6000
   macro avg       0.75      0.61      0.62      6000
weighted avg       0.79      0.81      0.77      6000

*-------------------------------------------------*

Processing Model for B (Linear Regression Imputation) 

Classification Report for B (Linear Regression Imputation):

              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.69      0.24      0.36      1327

    accuracy                           0.81      6000
   macro avg       0.75      0.61      0.62      6000
weighted avg       0.79      0.81      0.77      6000

*-------------------------------------------------*

Processing Model for C (Non-

---

## Part C: Comparative Analysis

### 1. Results Comparison

Let's summarize the performance metrics of the four models in a table to facilitate comparison. We focus on the F1-score for the positive class (default = 1), as it provides a balanced measure of precision and recall, which is crucial for imbalanced classification problems like credit default prediction.

In [19]:
!pip install jinja2

[0mCollecting jinja2
  Using cached jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting MarkupSafe>=2.0 (from jinja2)
  Using cached markupsafe-3.0.3-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (2.7 kB)
Using cached jinja2-3.1.6-py3-none-any.whl (134 kB)
[0mDownloading markupsafe-3.0.3-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (20 kB)
Installing collected packages: MarkupSafe, jinja2
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2/2[0m [jinja2]2m1/2[0m [jinja2]
[1A[2KSuccessfully installed MarkupSafe-3.0.3 jinja2-3.1.6


In [20]:
import matplotlib.pyplot as plt
import seaborn as sns
# Set plot style
sns.set_style('whitegrid')
pd.options.mode.chained_assignment = None # Ignore SettingWithCopyWarning
results_df = pd.DataFrame(results).T # Transpose to get models as rows
results_df = results_df[['Accuracy', 'Precision (1)', 'Recall (1)', 'F1-Score (1)']]

results_df.style.background_gradient(cmap='viridis', subset=['F1-Score (1)'])

Unnamed: 0,Accuracy,Precision (1),Recall (1),F1-Score (1)
A (Median Imputation),0.8085,0.689362,0.24416,0.360601
B (Linear Regression Imputation),0.808333,0.689507,0.242653,0.358974
C (Non-Linear KNN Imputation),0.808667,0.693305,0.241899,0.358659
D (Listwise Deletion),0.801866,0.710706,0.246057,0.365554


### 2. Efficacy Discussion

#### Trade-off Between Listwise Deletion and Imputation

The primary trade-off is between data preservation and simplicity. Imputation methods (Models A, B, C) retain the full sample size, preserving statistical power and avoiding bias that can occur when data is not missing completely at random. Listwise deletion (Model D) is simple but discards a significant number of observations (5,466 vs. 6,000), which can lead to a less robust and potentially biased model.

Interestingly, in this specific experiment, Model D (Listwise Deletion) yielded a slightly higher F1-score (0.37). This unexpected result could happen if the removed rows contained noisy or less informative data, coincidentally leading to a "cleaner" dataset for the classifier. However, one should be cautious. Even though it performed marginally better here, Model D could perform poorly in a real-world scenario because it was trained on a smaller, potentially unrepresentative subset of the population, and it may not generalize well to cases similar to those that were deleted.

#### Linear vs. Non-Linear Regression Performance

The assignment asks to compare the linear and non-linear regression imputation methods. Based on the results, there was no difference in performance between Model B (Linear) and Model C (Non-Linear). Both resulted in an identical F1-score of 0.36.

This suggests a few possibilities:

- The relationship between the predictors and the imputed feature (BILL_AMT1) might not have been strong enough for a more complex model like KNN to provide significantly better estimates than linear regression.

- The imputed feature itself may not have been a dominant predictor for the final classification task. Therefore, even if the non-linear imputations were slightly more accurate, they were not different enough to change the logistic regression model's final predictions.

#### Concluding Recommendation

Based on the analysis, a recommendation involves balancing performance, conceptual soundness, and practicality.

While Model D (Listwise Deletion) had the highest F1-score, it's a risky strategy due to its inherent drawbacks of data loss and potential bias. The three imputation methods all produced identical, stable results.

Given that the simple Median Imputation (Model A) performed exactly the same as the more computationally intensive regression-based methods (Models B and C), it stands out as the most pragmatic choice in this scenario.

**Recommendation:** The best strategy for handling missing data in this case is Median Imputation.

**Justification:** It provides the same predictive performance as more complex regression models while being far simpler and faster to implement. It successfully avoids the data loss and potential bias associated with listwise deletion, making it a robust, efficient, and effective choice.