##  Part A — Data Preprocessing and Imputation

### 1 Load and Prepare the Dataset
We begin by loading the UCI Credit Card Default Clients Dataset and previewing its first rows. The column names are kept exactly as they appear in the CSV file (for example, `default.payment.next.month`).


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error


df = pd.read_csv("/Users/nikhil.narayan/Documents/GOOGLE DRIVE/SEM_7/DA5401/ASN6/UCI_Credit_Card.csv")


df.head()


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


###  Introduce Artificial Missing Values (MAR)

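We simulate missing data by removing 8 % of the values, chosen uniformly at random and independently for each column, from three numerical columns (`AGE`, `BILL_AMT1`, `BILL_AMT2`). Because the masked positions do not depend on any feature, this mechanism is strictly *Missing Completely At Random (MCAR)*, which is a special case of *Missing At Random (MAR)*, so methods that assume MAR remain valid.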


In [2]:
cols_with_missing = ['AGE', 'BILL_AMT1', 'BILL_AMT2']
missing_fraction = 0.08

rng = np.random.default_rng(42)
for col in cols_with_missing:
    n_missing = int(missing_fraction * len(df))
    idx = rng.choice(df.index, n_missing, replace=False)
    df.loc[idx, col] = np.nan

df[cols_with_missing].isna().mean() * 100


AGE          8.0
BILL_AMT1    8.0
BILL_AMT2    8.0
dtype: float64
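
The cell above picks the masked rows uniformly at random, independently of every feature. For contrast, the sketch below shows one way a mechanism in which missingness depends on an observed column could be simulated. The synthetic frame, the 12 % / 4 % rates, and the median threshold on `LIMIT_BAL` are assumptions chosen for illustration, not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    "LIMIT_BAL": rng.uniform(10_000, 500_000, size=n),
    "BILL_AMT1": rng.normal(50_000, 20_000, size=n),
})

# The probability that BILL_AMT1 is missing depends on the observed LIMIT_BAL,
# not on BILL_AMT1 itself: 12 % above the median limit, 4 % below it
high_limit = toy["LIMIT_BAL"] > toy["LIMIT_BAL"].median()
p_missing = np.where(high_limit, 0.12, 0.04)
mask = rng.random(n) < p_missing
toy.loc[mask, "BILL_AMT1"] = np.nan

# Missing rate in each group: the two rates differ, unlike under uniform masking
rate_high = toy.loc[high_limit, "BILL_AMT1"].isna().mean()
rate_low = toy.loc[~high_limit, "BILL_AMT1"].isna().mean()
print(rate_high, rate_low)
```

Under this kind of mechanism, a column median computed from the observed rows over-represents the low-limit group, which is where model-based imputation would be expected to help most.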

###  2 Imputation Strategy 1 — Simple (Median) Imputation → Dataset A

We replace missing values in each column by the column median to get Dataset A.


In [3]:
df_A = df.copy()
for col in cols_with_missing:
    df_A[col] = df_A[col].fillna(df_A[col].median())
df_A[cols_with_missing].isna().sum()


AGE          0
BILL_AMT1    0
BILL_AMT2    0
dtype: int64

**Why median over mean?**

Median is **robust to outliers**.  
Financial data (like `BILL_AMT`) are often skewed; extreme values can distort the mean.  
The median represents the typical case more accurately, producing a stable central tendency for imputation.
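
A small worked example makes the difference concrete. The five bill amounts below are hypothetical values chosen to mimic a skewed column with one extreme balance.

```python
from statistics import mean, median

# Hypothetical bill amounts: four typical balances and one extreme value
bills = [1_000, 1_200, 1_500, 1_800, 500_000]

bills_mean = mean(bills)      # 101100, far above every typical balance
bills_median = median(bills)  # 1500, unaffected by the size of the extreme value

print(bills_mean, bills_median)
```

Imputing with the mean here would assign a balance roughly 67 times larger than the median to every missing row.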


###  3 Imputation Strategy 2 — Linear Regression Imputation → Dataset B

Use a Linear Regression model to predict the missing values in one column (e.g., `BILL_AMT1`) from all other features.


In [4]:
df_B = df.copy()
target_col = 'BILL_AMT1'

train_rows = df_B[df_B[target_col].notna()]
test_rows  = df_B[df_B[target_col].isna()]

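# Exclude the ID and the default label so the label does not leak into the imputation model
predictors = [c for c in df_B.columns if c not in ['ID', target_col, 'default.payment.next.month']]

# Other predictors still contain NaNs (AGE, BILL_AMT2); fill them with medians of the observed rows
predictor_medians = train_rows[predictors].median()
X_train = train_rows[predictors].fillna(predictor_medians)
y_train = train_rows[target_col]
X_pred  = test_rows[predictors].fillna(predictor_medians)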

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
df_B.loc[df_B[target_col].isna(), target_col] = lin_reg.predict(X_pred)
df_B[target_col].isna().sum()


0

**Explanation — Linear Regression Imputation**

This approach assumes the data are *Missing At Random (MAR)*, meaning missingness depends on observed features (e.g., AGE or LIMIT_BAL) but not on the missing values themselves.  
Linear relationships among predictors allow us to estimate the most plausible values for `BILL_AMT1`.
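
The same procedure can be wrapped in a reusable function. The sketch below is a minimal version under stated assumptions: `regression_impute` is a hypothetical helper name, the predictors are assumed to contain no NaNs, and the toy data follow an exact linear relation so the recovery can be checked.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_impute(frame, target_col, predictors):
    """Fill NaNs in target_col with linear-model predictions (hypothetical helper)."""
    out = frame.copy()
    missing = out[target_col].isna()
    if missing.any():
        model = LinearRegression()
        # Fit on rows where the target is observed, predict where it is missing
        model.fit(out.loc[~missing, predictors], out.loc[~missing, target_col])
        out.loc[missing, target_col] = model.predict(out.loc[missing, predictors])
    return out

# Toy data with an exact linear relation y = 2*x1 + 3*x2 (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
toy["y"] = 2 * toy["x1"] + 3 * toy["x2"]
truth = toy["y"].copy()
toy.loc[toy.index[:20], "y"] = np.nan

filled = regression_impute(toy, "y", ["x1", "x2"])
max_error = np.abs(filled["y"] - truth).max()
print(max_error)  # close to zero because the relation is exactly linear
```

On real data the relation is noisy, so regression imputation returns the conditional mean and tends to understate the variance of the imputed column.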


###  4 Imputation Strategy 3 — Non-Linear Regression (KNN / Decision Tree) → Dataset C

We now apply a non-linear method that can capture complex relationships between features.


In [5]:
df_C = df.copy()

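# KNNImputer fills every NaN in the numeric columns (AGE, BILL_AMT1, BILL_AMT2)
# with the mean of that column over the 5 nearest rows
imputer = KNNImputer(n_neighbors=5)
num_df = df_C.select_dtypes(include=np.number)
df_C_imputed = pd.DataFrame(
    imputer.fit_transform(num_df), columns=num_df.columns, index=num_df.index
)

# Write the imputed columns back; reusing the original index keeps rows aligned
for col in df_C_imputed.columns:
    df_C[col] = df_C_imputed[col]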

df_C[target_col].isna().sum()


0
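
To see what `KNNImputer` does on a single gap, consider the toy matrix below (illustrative values only). Distances are computed on the features that are observed in both rows, so the two nearest neighbours of the last row are found from the first column alone.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (illustrative): the second feature is missing in the last row
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, np.nan]])

# The two nearest rows on the observed first feature are [3, 6] and [2, 4],
# so the gap is filled with the mean of 6 and 4
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled[3, 1])  # 5.0
```

Because the distances use raw feature values, large-scale columns such as `LIMIT_BAL` can dominate the neighbour search; scaling the features before imputation is a common precaution that the cell above does not apply.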

**Explanation — Non-Linear Regression Imputation**

K-Nearest Neighbors imputation estimates a missing value from the average of its closest samples in the feature space.  
This method captures non-linear relationships that linear models might miss.  
Alternatives like Decision Tree Regressors can also approximate non-linear patterns via if–else splits.
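
The Decision Tree alternative is not run in this notebook. A minimal sketch of how it could look for a single column follows; the step-shaped toy data and `max_depth=3` are assumptions made for illustration, not tuned choices.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data with a step-shaped (non-linear) relation, illustrative only
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=500)})
toy["y"] = np.where(toy["x"] < 5, 100.0, 900.0)
truth = toy["y"].copy()
toy.loc[toy.index[:50], "y"] = np.nan

observed = toy["y"].notna()

# Fit on rows where y is observed, then predict y where it is missing
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(toy.loc[observed, ["x"]], toy.loc[observed, "y"])
toy.loc[~observed, "y"] = tree.predict(toy.loc[~observed, ["x"]])

mean_abs_error = np.abs(toy["y"] - truth).mean()
print(mean_abs_error)
```

A straight line cannot represent the jump at `x = 5`, whereas a single split can, which is the kind of pattern tree-based imputation is meant to capture.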


###  Summary of Part A

| Dataset | Imputation Method | Key Idea | Assumption | Expected Bias |
|:--:|:--|:--|:--|:--|
| A | Median (Simple) | Replace NaN with column median | MCAR / mild MAR | Low |
| B | Linear Regression | Predict missing values linearly | MAR + linear relations | Moderate |
| C | Non-Linear (KNN/Tree) | Predict non-linearly from neighbors | MAR + non-linear relations | Low–Moderate |


## Part B — Model Training and Performance Assessment

We now evaluate how each imputation strategy affects downstream model accuracy.
We’ll train a Logistic Regression classifier on four datasets:
A (Median Imputation), B (Linear Regression Imputation), C (Non-Linear Imputation), and D (Listwise Deletion).


In [6]:
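# Dataset B still has NaNs in AGE and BILL_AMT2 (only BILL_AMT1 was regressed),
# so fill what remains with column medians; Dataset C is already complete and is unaffected
for dfX in (df_B, df_C):
    dfX.fillna(dfX.median(), inplace=True)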


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


### 1 – Data Split
For each imputed dataset, we split the data into training and testing sets (80 / 20).
Dataset D is created by dropping rows containing any missing values before the split.


In [8]:
target = 'default.payment.next.month'

# Helper to split data
def split_data(df):
    X = df.drop(columns=[target, 'ID'])
    y = df[target]
    return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Datasets A–C (already imputed)
splits = {}
for name, data in zip(['A','B','C'], [df_A, df_B, df_C]):
    splits[name] = split_data(data)

# Dataset D – Listwise Deletion
df_D = df.dropna()
splits['D'] = split_data(df_D)


### 2 – Standardization
We standardize features in every dataset using `StandardScaler`
so that the Logistic Regression model is not biased by differing feature scales.


In [9]:
scaler = StandardScaler()
scaled_splits = {}

for name, (X_train, X_test, y_train, y_test) in splits.items():
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled  = scaler.transform(X_test)
    scaled_splits[name] = (X_train_scaled, X_test_scaled, y_train, y_test)
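
The scaler above is fitted on the training split only, which keeps test-set statistics out of training. The same discipline can be extended to the imputation step by chaining everything in a scikit-learn pipeline. The sketch below is one possible arrangement, not the notebook's method: it uses synthetic data from `make_classification`, masks about 8 % of all entries (rather than three columns), and uses median imputation as the example strategy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit data (illustrative only)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=42
)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.08] = np.nan  # mask about 8 % of the entries

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Medians, means, and standard deviations are all estimated from X_train only
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```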


### 3 – Model Training and Evaluation
We train a Logistic Regression classifier on each dataset
and evaluate it using Accuracy, Precision, Recall, and F1-Score.


In [10]:
for name, (X_train, X_test, y_train, y_test) in scaled_splits.items():
    clf = LogisticRegression(max_iter=1000, solver='lbfgs')
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(f"\n=== Dataset {name} Results ===")
    print(classification_report(y_test, y_pred, digits=3))



=== Dataset A Results ===
              precision    recall  f1-score   support

           0      0.817     0.969     0.887      4673
           1      0.684     0.238     0.353      1327

    accuracy                          0.807      6000
   macro avg      0.751     0.603     0.620      6000
weighted avg      0.788     0.807     0.769      6000


=== Dataset B Results ===
              precision    recall  f1-score   support

           0      0.818     0.969     0.887      4673
           1      0.690     0.243     0.359      1327

    accuracy                          0.808      6000
   macro avg      0.754     0.606     0.623      6000
weighted avg      0.790     0.808     0.770      6000


=== Dataset C Results ===
              precision    recall  f1-score   support

         0.0      0.818     0.969     0.887      4673
         1.0      0.688     0.243     0.359      1327

    accuracy                          0.808      6000
   macro avg      0.753     0.606     0.623    

## Part C — Comparative Analysis

### 1 – Results Comparison

The following table summarizes performance across all four datasets.
Along with Accuracy and F1-score, we include **precision for class 0 (non-defaulters)** and **class 1 (defaulters)** to highlight class imbalance.

| Model | Imputation Method | Accuracy | Precision (0) | Precision (1) | F1-Score (1) | Key Comments |
|:--:|:--|:--:|:--:|:--:|:--:|:--|
| **A** | Median (Simple) | 0.807 | 0.817 | 0.684 | 0.353 | Baseline; robust and easy to apply. |
| **B** | Linear Regression | 0.808 | 0.818 | 0.690 | 0.359 | Slightly higher precision/F1; fits linear MAR assumption. |
| **C** | Non-Linear (KNN) | 0.808 | 0.818 | 0.688 | 0.359 | Same as B → non-linear effects minimal. |
| **D** | Listwise Deletion | 0.810 | 0.815 | 0.746 | 0.358 | Comparable accuracy; fewer training rows. |

**Interpretation:**  
All four models reach similar overall accuracy (about 0.81). Relative to median imputation, regression-based imputation raises class-1 precision from 0.684 to 0.690 and class-1 F1 from 0.353 to 0.359.  
These gains are small and come from a single train/test split, so they suggest, rather than establish, that model-based imputation recovers some additional predictive information.
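
The columns of the table can be computed directly from predictions instead of being copied from the printed reports. The sketch below assumes a hypothetical helper named `summarize` and uses toy labels, not the notebook's results.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score

def summarize(y_true, y_pred):
    """Return the four metrics used in the comparison table (hypothetical helper)."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision (0)": precision_score(y_true, y_pred, pos_label=0),
        "Precision (1)": precision_score(y_true, y_pred, pos_label=1),
        "F1-Score (1)": f1_score(y_true, y_pred, pos_label=1),
    }

# Toy labels and predictions (illustrative only)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1]

# One row per dataset; here a single toy row
table = pd.DataFrame({"toy": summarize(y_true, y_pred)}).T.round(3)
print(table)
```

In the evaluation loop, calling `summarize(y_test, y_pred)` for each dataset name and collecting the results in one dictionary would yield the full comparison table.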

---

### 2 – Why Accuracy Can Be Misleading

In credit-default prediction, the dataset is **imbalanced**: most customers are non-defaulters.  
A classifier that always predicts “no default” would still reach about 78 % accuracy on this test set (4673 of 6000 rows are non-defaulters), yet it would identify no actual defaulters.  
Therefore, **Accuracy overstates performance** on such data; **Precision, Recall, and F1-score for class 1** are much better indicators of true effectiveness. Here, class-1 recall is only about 0.24 for Datasets A–C, so roughly three in four defaulters are missed despite an accuracy of 0.81.
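
The arithmetic behind this argument can be checked from the numbers in the classification reports. The class counts and the Dataset A precision and recall below are taken from those reports; nothing else is assumed.

```python
# Test-set class counts taken from the classification reports
n_non_default, n_default = 4673, 1327
n_total = n_non_default + n_default

# A classifier that always predicts class 0 ("no default")
baseline_accuracy = n_non_default / n_total
baseline_recall_1 = 0 / n_default  # it never flags a defaulter

print(round(baseline_accuracy, 3))  # 0.779
print(baseline_recall_1)            # 0.0

# F1 for class 1 on Dataset A, recomputed from its reported precision and recall
precision_1, recall_1 = 0.684, 0.238
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
print(round(f1_1, 3))  # 0.353
```

The trained models therefore beat the trivial baseline by only about three accuracy points, while the F1-score moves from 0 to about 0.35.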

---

### 3 – Efficacy Discussion

**Listwise Deletion (D):**  
- Removes every record containing any NaN, reducing sample size and potentially altering the data distribution.  
- Its accuracy appears slightly higher but this can be an artifact of easier majority-class prediction.  
- Loss of minority examples weakens generalization.
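
The size of the loss can be estimated from the masking setup. The sketch assumes 30,000 rows in total, inferred from the 6000-row test set at a 20 % split, and uses the fact that each of the three columns is masked independently at 8 %.

```python
# Each of the three columns is masked independently at 8 %
missing_fraction = 0.08
n_masked_columns = 3
n_rows = 30_000  # assumption: 6000 test rows at a 20 % split imply 30,000 rows

# Probability that a row has no missing value in any masked column
p_complete = (1 - missing_fraction) ** n_masked_columns
expected_rows_kept = p_complete * n_rows

print(round(p_complete, 4))       # 0.7787
print(round(expected_rows_kept))  # 23361
```

Listwise deletion is therefore expected to discard about 22 % of the rows even though each individual column is only 8 % incomplete.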

**Imputation Models (A–C):**  
- Preserve data volume and representativeness.  
- Model A (Median) works best for MCAR data.  
- Model B (Linear) captures relationships under MAR, giving a small gain in precision and F1.  
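- Model C (Non-Linear KNN) adds flexibility but yields similar results here. This is consistent with a mostly linear relationship between `BILL_AMT1` and the other predictors, although with only 8 % of values missing in three columns the classifier may simply be insensitive to how the gaps are filled.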

---

### 4 – Conclusion and Recommendation

- **Best Practical Strategy:** Linear-regression imputation (Model B).  
  It provides a slight but consistent lift in minority-class metrics with minimal extra complexity.  
- **When to Use Median:** Quick baseline or nearly complete data.  
- **When to Use Listwise Deletion:** Only if missingness < 2 % or interpretability outweighs potential bias.

Overall, regression-based imputation preserves information and stability, while accuracy alone should never be used to judge performance in imbalanced classification problems.
