# DA5401 - ASSIGNMENT 6
## Imputation via Regression for Missing Data

#### Importing the useful libraries

In [65]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

### Part A: Data Preprocessing and Imputation 
1. Loading and Preparing Data: Loading the dataset and artificially introducing MAR missing values (5-10% in 2-3 numerical feature columns). The target variable is 'default payment next month'. 
2. Imputation Strategy 1: Simple Imputation (Baseline): 
    - Creating a clean dataset copy (Dataset A). 
    - For each column with missing values, filling the missing values with the median of that column.
3. Imputation Strategy 2: Regression Imputation (Linear): 
    - Creating a second clean dataset copy (Dataset B). 
    - For a single column with missing values, using a Linear Regression model to predict the missing values based on all other non-missing features.  
4. Imputation Strategy 3: Regression Imputation (Non-Linear): 
    - Creating a third clean dataset copy (Dataset C). 
    - For the same column as in Strategy 2, using a non-linear regression model (e.g., K-Nearest Neighbors Regression or Decision Tree Regression) to predict the missing values.

#### Loading the Dataset - UCI_Credit_Card.csv

In [66]:
df = pd.read_csv('UCI_Credit_Card.csv')

#### Artificially Introducing MAR (Missing At Random) missing values - 8% in 3 feature columns with target column - default payment next month

In [67]:
# Selecting columns to introduce missing values
cols_with_missing = ['AGE', 'BILL_AMT1', 'BILL_AMT2']

np.random.seed(42)

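# Introducing 8% missing values per column; rows are drawn uniformly at random,
# independent of any feature (strictly MCAR, a special case of MAR)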
for col in cols_with_missing:
    n_missing = int(0.08 * len(df))
    missing_indices = np.random.choice(df.index, n_missing, replace=False)
    # Introduce NaN in selected positions
    df.loc[missing_indices, col] = np.nan

print("\nMissing values introduced successfully!")
print(df[cols_with_missing].isna().sum())

# Verifying target column presence
print("\nTarget variable check:")
print(df['default.payment.next.month'].value_counts())

# Saving a copy for further steps
df.to_csv('UCI_Credit_Card_MAR.csv', index=False)
print("\nSaved as 'UCI_Credit_Card_MAR.csv'")


Missing values introduced successfully!
AGE          2400
BILL_AMT1    2400
BILL_AMT2    2400
dtype: int64

Target variable check:
default.payment.next.month
0    23364
1     6636
Name: count, dtype: int64

Saved as 'UCI_Credit_Card_MAR.csv'


##### After masking, exactly 8% of the entries (2,400 records) in each of the columns AGE, BILL_AMT1, and BILL_AMT2 are NaN, simulating real-world data imperfections. Because the rows are drawn uniformly at random, independent of any feature, the mechanism is strictly MCAR (Missing Completely At Random), which is a special case of MAR. The target variable, default.payment.next.month, remains unchanged with 23,364 non-defaulters and 6,636 defaulters, so all imputation methods are evaluated against the same labels.
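
##### The cell above samples rows with `np.random.choice`, so missingness does not depend on any feature. A minimal sketch of how a MAR mechanism in the stricter sense could be simulated is shown below, where the chance that AGE is missing depends on an observed column. The toy DataFrame, the choice of LIMIT_BAL as the driving column, and the probabilities 0.12 and 0.04 are assumptions for illustration, not part of the assignment code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for the credit card data (hypothetical values)
n = 10_000
toy = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10_000, 500_000, size=n),
    "AGE": rng.integers(21, 70, size=n).astype(float),
})

# MAR: the chance that AGE is missing depends on the observed LIMIT_BAL,
# not on AGE itself. Low-limit customers are masked more often.
low_limit = toy["LIMIT_BAL"] < toy["LIMIT_BAL"].median()
p_missing = np.where(low_limit, 0.12, 0.04)   # averages to roughly 8%
mask = rng.random(n) < p_missing
toy.loc[mask, "AGE"] = np.nan

overall_rate = toy["AGE"].isna().mean()
rate_low = toy.loc[low_limit, "AGE"].isna().mean()
rate_high = toy.loc[~low_limit, "AGE"].isna().mean()
print(f"overall: {overall_rate:.3f}, low limit: {rate_low:.3f}, high limit: {rate_high:.3f}")
```

##### Under this kind of mechanism, listwise deletion would remove low-limit customers disproportionately, which is the situation where imputation is expected to matter most.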

----------------------------------------------------------------

#### Imputation 1 - Simple Imputation (Baseline) - Replacing missing values with median of the respective columns

In [68]:
# Creating a copy so the original remains unchanged
df_A = df.copy()

cols_with_missing = ['AGE', 'BILL_AMT1', 'BILL_AMT2']

# Replace NaN with median of each column
for col in cols_with_missing:
    median_value = df_A[col].median()
    # Assign the result back; in-place fillna on a column selection is deprecated in recent pandas
    df_A[col] = df_A[col].fillna(median_value)

print("\nMissing values after imputation:")
print(df_A[cols_with_missing].isna().sum())

df_A.to_csv('UCI_Credit_Card_MedianImputed.csv', index=False)
print("\nSaved as 'UCI_Credit_Card_MedianImputed.csv'")


Missing values after imputation:
AGE          0
BILL_AMT1    0
BILL_AMT2    0
dtype: int64

Saved as 'UCI_Credit_Card_MedianImputed.csv'


##### After performing imputation, all missing values in the columns AGE, BILL_AMT1, BILL_AMT2 have been successfully filled, leaving no gaps in the dataset. This means the data is now complete and ready for further analysis or modeling. 

##### The median was chosen over the mean because it is more robust to outliers and skewed distributions, both of which are common in financial data such as bill amounts. The mean can be pulled far from typical values by a few extremely high or low entries, which would distort the replacements for the missing data. The median, being the middle value, is largely unaffected by such extremes. The trade-off is that every missing entry in a column receives the same value, which reduces that column's variance and ignores its relationships with other features.
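
##### The robustness argument can be illustrated with a small sketch. The bill amounts below are hypothetical and chosen only to show the effect of a single extreme value.

```python
import numpy as np

# Hypothetical bill amounts: mostly moderate, one extreme balance
bills = np.array([1_200, 1_500, 1_800, 2_000, 2_300, 2_600, 950_000], dtype=float)

mean_fill = bills.mean()
median_fill = np.median(bills)

print(f"mean:   {mean_fill:,.0f}")
print(f"median: {median_fill:,.0f}")
```

##### Here the mean is far larger than six of the seven values, while the median stays at 2,000, a value typical of the bulk of the data.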

----------------------------------------------------------------

#### Imputation 2 - Regression Imputation (Linear) - Using a linear regression model to predict missing values

In [69]:
target_col = 'AGE'
df_B = df.copy()

# Rows where AGE is missing
missing_mask = df_B[target_col].isna()
train_data = df_B.loc[~missing_mask]
pred_data = df_B.loc[missing_mask]

# Selecting numeric columns except the target
numeric_cols = df_B.select_dtypes(include=np.number).columns.tolist()
numeric_cols.remove(target_col)

# Using only columns with no missing values for predictors
predictors = [col for col in numeric_cols if df_B[col].isna().sum() == 0]

X_train = train_data[predictors]
y_train = train_data[target_col]
X_pred = pred_data[predictors]

# Training Linear Regression
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict missing AGE values
predicted_values = reg.predict(X_pred)
df_B.loc[missing_mask, target_col] = predicted_values

print("\nMissing values in AGE column after imputation:")
print(df_B[target_col].isna().sum())

# Filling remaining missing values with median in other columns
remaining_missing_cols = df_B.columns[df_B.isna().sum() > 0].tolist()

if remaining_missing_cols:
    print("\nColumns still with missing values:", remaining_missing_cols)
    for col in remaining_missing_cols:
        median_value = df_B[col].median()
        # Assign back instead of in-place fillna on a column selection
        df_B[col] = df_B[col].fillna(median_value)
else:
    print("\nNo remaining missing columns found.")

print("\nRemaining missing values after median imputation:", df_B.isna().sum().sum())

# Saving clean dataset
df_B.to_csv('UCI_Credit_Card_RegressionImputed.csv', index=False)
print("\nSaved as 'UCI_Credit_Card_RegressionImputed.csv'")


Missing values in AGE column after imputation:
0

Columns still with missing values: ['BILL_AMT1', 'BILL_AMT2']

Remaining missing values after median imputation: 0

Saved as 'UCI_Credit_Card_RegressionImputed.csv'


##### After running the regression-based imputation, no missing values remain in the AGE column: the model filled every previously missing entry with a prediction based on the other complete numerical features. A zero missing count only confirms that the gaps were filled; it does not by itself show how close the predicted ages are to the true ones. The remaining gaps in BILL_AMT1 and BILL_AMT2 were filled with column medians, as in Dataset A.

##### This method assumes that the missing AGE values can be predicted from other known features through a straight-line (linear) relationship. It also relies on the missingness being Missing At Random (MAR), meaning the reason AGE is missing depends on other observed data, not on AGE itself. When both hold, linear regression gives consistent estimates for the missing values, although the imputed values lie exactly on the fitted regression surface and therefore vary less than real ages would.
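
##### The fit-on-observed, predict-on-missing steps used above can be wrapped in a small reusable helper. The sketch below is one possible arrangement on toy data; the function name `impute_with_model` and the explicit `predictors` argument are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_with_model(frame, target_col, predictors, model):
    """Return a copy of `frame` with NaNs in `target_col` replaced by model predictions."""
    out = frame.copy()
    missing = out[target_col].isna()
    # Fit only on rows where the target is observed
    model.fit(out.loc[~missing, predictors], out.loc[~missing, target_col])
    # Predict only for rows where the target is missing
    if missing.any():
        out.loc[missing, target_col] = model.predict(out.loc[missing, predictors])
    return out

# Toy data with a known linear relation (hypothetical, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
toy["y"] = 3.0 * toy["x1"] - 2.0 * toy["x2"] + rng.normal(scale=0.1, size=500)
true_y = toy["y"].copy()
hidden = rng.choice(toy.index, size=40, replace=False)
toy.loc[hidden, "y"] = np.nan

filled = impute_with_model(toy, "y", ["x1", "x2"], LinearRegression())
max_error = (filled.loc[hidden, "y"] - true_y.loc[hidden]).abs().max()
print("remaining NaNs:", filled["y"].isna().sum(), "| max abs error:", round(max_error, 3))
```

##### Passing the predictors explicitly also makes it easy to leave out columns such as ID or the classification target, both of which the automatic numeric-column selection in the cell above includes.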

----------------------------------------------------------------

#### Imputation 3 - Regression Imputation (Non-Linear) - Using a non-linear regression model (K-Nearest Neighbors)

In [70]:
df_C = df.copy()

target_col = 'AGE'

# Identifying rows where AGE is missing
missing_mask = df_C[target_col].isna()

# Training and prediction subsets
train_data = df_C.loc[~missing_mask]
pred_data = df_C.loc[missing_mask]

# Selecting numeric columns except the target
numeric_cols = df_C.select_dtypes(include=np.number).columns.tolist()
numeric_cols.remove(target_col)

# Keep only columns with no missing values
predictors = [col for col in numeric_cols if df_C[col].isna().sum() == 0]

# Define training and prediction data
X_train = train_data[predictors]
y_train = train_data[target_col]
X_pred = pred_data[predictors]

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

# Predicting missing AGE values
predicted_values = knn_reg.predict(X_pred)

# Filling missing AGE
df_C.loc[missing_mask, target_col] = predicted_values

print("\nMissing values in 'AGE' after KNN regression imputation:",
      df_C[target_col].isna().sum())

# Filling remaining missing values with median in other columns
remaining_missing_cols = df_C.columns[df_C.isna().sum() > 0].tolist()

if remaining_missing_cols:
    print("\nColumns still with missing values:", remaining_missing_cols)
    for col in remaining_missing_cols:
        median_value = df_C[col].median()
        # Assign back instead of in-place fillna on a column selection
        df_C[col] = df_C[col].fillna(median_value)
else:
    print("\nNo remaining missing columns found.")

print("\nRemaining missing values after median imputation:", df_C.isna().sum().sum())

# Saving cleaned dataset
df_C.to_csv('UCI_Credit_Card_NonLinearImputed.csv', index=False)
print("\nSaved as 'UCI_Credit_Card_NonLinearImputed.csv'")



Missing values in 'AGE' after KNN regression imputation: 0

Columns still with missing values: ['BILL_AMT1', 'BILL_AMT2']

Remaining missing values after median imputation: 0

Saved as 'UCI_Credit_Card_NonLinearImputed.csv'


##### The result shows that after applying KNN regression imputation, there are no missing values left in the AGE column - which means the algorithm successfully estimated and filled all the previously missing entries. This worked by finding other records in the dataset that were most similar to the rows with missing AGE values - based on other numerical features - and then using the average of those neighbors' ages to fill the gaps.

##### The underlying assumption behind this non-linear method is that data points that are similar in their other features (like bill amounts, payments, or credit limits) will also have similar AGE values. KNN doesn't assume a straight-line (linear) relationship between variables; instead, it assumes that local patterns and proximity in the data space can capture complex, possibly non-linear relationships. Essentially, it relies on the idea that 'similar people behave similarly', making it a flexible, data-driven way to handle missing values when linear regression might not capture the full pattern. Because the predictors are used unscaled here, features with large magnitudes (such as bill amounts and credit limits) dominate the distance computation.
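
##### KNN depends on distances, so it is sensitive to feature scale. The sketch below compares KNN regression with and without standardization on toy data in which the informative feature has a small scale and an uninformative feature has a large one. The data, column names, and the use of `make_pipeline` are assumptions for illustration; the cell above fits KNN on the raw features.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 2_000
# Hypothetical features: one informative small-scale column, one uninformative large-scale column
toy = pd.DataFrame({
    "small_scale": rng.normal(size=n),
    "large_scale": rng.normal(scale=100_000, size=n),
})
toy["target"] = np.sin(2 * toy["small_scale"]) + rng.normal(scale=0.05, size=n)

train, test = toy.iloc[:1_500], toy.iloc[1_500:]
features = ["small_scale", "large_scale"]

raw_knn = KNeighborsRegressor(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

def rmse(model):
    # Fit on the training rows and score on the held-out rows
    model.fit(train[features], train["target"])
    pred = model.predict(test[features])
    return float(np.sqrt(np.mean((pred - test["target"]) ** 2)))

rmse_raw, rmse_scaled = rmse(raw_knn), rmse(scaled_knn)
print(f"RMSE without scaling: {rmse_raw:.3f}")
print(f"RMSE with scaling:    {rmse_scaled:.3f}")
```

##### In this toy setup the unscaled model picks neighbors almost entirely by the large-scale column, so standardizing first gives a clearly lower error. Whether the same holds for AGE in the credit card data would need to be checked on that data.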

----------------------------------------------------------------
----------------------------------------------------------------

### Part B: Model Training and Performance Assessment 
1. Data Split: For each of the three imputed datasets (A, B, C), splitting the data into training and testing sets. Also, creating a fourth dataset (Dataset D) by simply removing all rows that contain any missing values (Listwise Deletion). Splitting Dataset D into training and testing sets. 
2. Classifier Setup: Standardizing the features in all four datasets (A, B, C, D) using StandardScaler. 
3. Model Evaluation: Training a Logistic Regression classifier on the training set of each of the four datasets (A, B, C, D). Evaluating the performance of each model on its respective test set using a full Classification Report (Accuracy, Precision, Recall, F1-score).

#### Splitting the 3 imputed datasets into training and testing sets, and creating a fourth dataset by removing all rows with missing values and splitting that too

In [58]:
target_col = 'default.payment.next.month' 

# For df_A
X_A = df_A.drop(columns=[target_col])
y_A = df_A[target_col]

X_A_train, X_A_test, y_A_train, y_A_test = train_test_split(X_A, y_A, test_size=0.2, random_state=42, stratify=y_A)

# For df_B
X_B = df_B.drop(columns=[target_col])
y_B = df_B[target_col]
X_B_train, X_B_test, y_B_train, y_B_test = train_test_split(X_B, y_B, test_size=0.2, random_state=42, stratify=y_B)

# For df_C
X_C = df_C.drop(columns=[target_col])
y_C = df_C[target_col]
X_C_train, X_C_test, y_C_train, y_C_test = train_test_split(X_C, y_C, test_size=0.2, random_state=42, stratify=y_C)

# Drop any row that has at least one NaN
df_D = df.dropna()

print("Shape after listwise deletion:", df_D.shape)
print("Missing values in df_D:", df_D.isna().sum().sum())  

X_D = df_D.drop(columns=[target_col])
y_D = df_D[target_col]

X_D_train, X_D_test, y_D_train, y_D_test = train_test_split(X_D, y_D, test_size=0.2, random_state=42, stratify=y_D)


Shape after listwise deletion: (23364, 25)
Missing values in df_D: 0


##### The code prepares four different versions of the credit card dataset for model training and testing. Datasets A, B, and C, which were imputed using median, linear regression, and non-linear regression respectively, are each split 80/20 into training and testing sets, with stratification preserving the proportion of default and non-default cases in both parts. The fourth version, df_D, is created by listwise deletion, which removes any row containing a missing value. The output shows that 23,364 complete records remain with no missing data, so 6,636 of the original 30,000 rows (about 22%) were dropped. The dataset is fully clean but smaller, and the information in the incomplete rows is lost.
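
##### The size of df_D can be sanity-checked with a short calculation. Since the three columns were masked independently at 8% each, a row survives listwise deletion only if it was spared in all three, which happens with probability 0.92 cubed. The variable names below are introduced only for this check.

```python
n_rows = 30_000          # 23,364 + 6,636 rows in the original data
missing_rate = 0.08      # fraction masked per column
n_cols = 3               # AGE, BILL_AMT1, BILL_AMT2

keep_prob = (1 - missing_rate) ** n_cols
expected_kept = n_rows * keep_prob
observed_kept = 23_364   # from the shape printed above

print(f"P(row complete)   = {keep_prob:.4f}")
print(f"expected complete = {expected_kept:.0f}")
print(f"observed dropped  = {n_rows - observed_kept} ({(n_rows - observed_kept) / n_rows:.1%})")
```

##### The expected count of roughly 23,361 complete rows agrees closely with the observed 23,364, so masking 8% in each of three columns costs about 22% of the rows under listwise deletion.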

----------------------------------------------------------------

#### Standardizing all the 4 datasets

In [59]:
scaler_A = StandardScaler()

# Fit scaler on training set only
X_A_train_scaled = scaler_A.fit_transform(X_A_train)
X_A_test_scaled = scaler_A.transform(X_A_test)

scaler_B = StandardScaler()
X_B_train_scaled = scaler_B.fit_transform(X_B_train)
X_B_test_scaled = scaler_B.transform(X_B_test)

scaler_C = StandardScaler()
X_C_train_scaled = scaler_C.fit_transform(X_C_train)
X_C_test_scaled = scaler_C.transform(X_C_test)

scaler_D = StandardScaler()
X_D_train_scaled = scaler_D.fit_transform(X_D_train)
X_D_test_scaled = scaler_D.transform(X_D_test)

#### Training Logistic Regression classifier on all datasets and evaluating the performance

In [60]:
print("--- Logistic Regression on Dataset A (Median Imputation) ---")

lr_A = LogisticRegression(max_iter=1000, random_state=42)
lr_A.fit(X_A_train_scaled, y_A_train)

y_A_pred = lr_A.predict(X_A_test_scaled)

print(classification_report(y_A_test, y_A_pred))

print("--- Logistic Regression on Dataset B (Linear Regression Imputation) ---")

lr_B = LogisticRegression(max_iter=1000, random_state=42)
lr_B.fit(X_B_train_scaled, y_B_train)

y_B_pred = lr_B.predict(X_B_test_scaled)

print(classification_report(y_B_test, y_B_pred))

print("--- Logistic Regression on Dataset C (Non-Linear Regression Imputation) ---")

lr_C = LogisticRegression(max_iter=1000, random_state=42)
lr_C.fit(X_C_train_scaled, y_C_train)

y_C_pred = lr_C.predict(X_C_test_scaled)

print(classification_report(y_C_test, y_C_pred))

print("--- Logistic Regression on Dataset D (Listwise Deletion) ---")

lr_D = LogisticRegression(max_iter=1000, random_state=42)
lr_D.fit(X_D_train_scaled, y_D_train)

y_D_pred = lr_D.predict(X_D_test_scaled)

print(classification_report(y_D_test, y_D_pred))


--- Logistic Regression on Dataset A (Median Imputation) ---
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.69      0.24      0.35      1327

    accuracy                           0.81      6000
   macro avg       0.75      0.60      0.62      6000
weighted avg       0.79      0.81      0.77      6000

--- Logistic Regression on Dataset B (Linear Regression Imputation) ---
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.69      0.24      0.35      1327

    accuracy                           0.81      6000
   macro avg       0.75      0.60      0.62      6000
weighted avg       0.79      0.81      0.77      6000

--- Logistic Regression on Dataset C (Non-Linear Regression Imputation) ---
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.69     

##### The results show that all four approaches - median imputation, linear regression imputation, non-linear regression imputation, and listwise deletion - produced very similar model performance. The Logistic Regression classifier was able to identify the majority class (non-default, labeled 0) with high precision and recall across all datasets, achieving around 82% precision and 97-98% recall for that class. However, for the minority class (default, labeled 1), the model struggled, with lower recall (around 24%) despite a precision of about 69-74%.

##### Overall accuracy remained around 81% for all datasets, and the weighted-average F1-scores were consistent at approximately 0.77. The models predict the majority class well but have difficulty correctly identifying defaults, which is expected on imbalanced datasets. The type of imputation (median, linear regression, or KNN-based non-linear regression) did not noticeably change the classifier's performance. This is unsurprising: Datasets A, B, and C differ only in the 2,400 imputed AGE values, since BILL_AMT1 and BILL_AMT2 were median-filled in all three. Listwise deletion also performed similarly despite using about 22% fewer rows (23,364 instead of 30,000), so removing rows with missing values did not drastically alter the overall model behavior.

------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------

### Part C: Comparative Analysis 
1. Results Comparison: Creating a summary table comparing the performance metrics (especially F1-score) of the four models: 
    - Model A (Median Imputation) 
    - Model B (Linear Regression Imputation) 
    - Model C (Non-Linear Regression Imputation) 
    - Model D (Listwise Deletion) 
2. Efficacy Discussion: 
    - Discussing the trade-off between Listwise Deletion (Model D) and Imputation (Models A, B, C). Why might Model D perform poorly even if the imputed models perform worse?
    - Which regression method (Linear vs. Non-Linear) performed better and why? Relating this to the assumed relationship between the imputed feature and the predictors. 
    - Concluding with a recommendation on the best strategy for handling missing data in this scenario, justifying by referencing both the classification performance metrics and the conceptual implications of each method. 


#### Results Comparison

In [64]:
datasets_preds = {
    "Model A (Median Imputation)": (y_A_test, y_A_pred),
    "Model B (Linear Regression Imputation)": (y_B_test, y_B_pred),
    "Model C (Non-Linear Regression Imputation)": (y_C_test, y_C_pred),
    "Model D (Listwise Deletion)": (y_D_test, y_D_pred)
}

results = []

# Loop through models and generate metrics
for model_name, (y_test, y_pred) in datasets_preds.items():
    report = classification_report(y_test, y_pred, output_dict=True)
    
    results.append({
        "Model": model_name,
        "Accuracy": report["accuracy"],
        "Precision": report["weighted avg"]["precision"],
        "Recall": report["weighted avg"]["recall"],
        "F1-Score": report["weighted avg"]["f1-score"]
    })

# Create a summary DataFrame
summary_df = pd.DataFrame(results)
print("\nModel Performance Summary:")
print(summary_df.round(4))


Model Performance Summary:
                                        Model  Accuracy  Precision  Recall  \
0                 Model A (Median Imputation)    0.8077     0.7891  0.8077   
1      Model B (Linear Regression Imputation)    0.8073     0.7884  0.8073   
2  Model C (Non-Linear Regression Imputation)    0.8078     0.7893  0.8078   
3                 Model D (Listwise Deletion)    0.8125     0.8008  0.8125   

   F1-Score  
0    0.7690  
1    0.7687  
2    0.7694  
3    0.7731  


##### Looking at the performance summary, all 4 models show very similar results, with overall accuracy at around 81%. Model D (listwise deletion) slightly outperforms the others on every metric, with the highest accuracy (0.8125), precision (0.8008), recall (0.8125), and F1-score (0.7731). Models A, B, and C perform almost identically, with every metric differing by less than 0.001. Two caveats apply when reading the table. First, precision, recall, and F1 are support-weighted averages, so Recall equals Accuracy by construction and the scores are dominated by the majority class. Second, Model D is evaluated on a different and smaller test set (a split of the 23,364 complete rows) than Models A, B, and C (6,000 rows each), so its small advantage is not a like-for-like comparison. Overall, the models are consistent, and the choice of imputation strategy does not drastically change the outcomes.
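
##### In the summary table the Recall column matches the Accuracy column in every row. This follows from how the support-weighted average is defined: weighting each class's recall by its share of the samples adds up to the overall fraction of correct predictions. A small sketch with hypothetical labels shows the identity, alongside the class-1 F1-score that the weighted average tends to hide.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical imbalanced labels and predictions
y_true = np.array([0] * 80 + [1] * 20)
y_pred = np.array([0] * 76 + [1] * 4 + [0] * 15 + [1] * 5)

report = classification_report(y_true, y_pred, output_dict=True)
weighted_recall = report["weighted avg"]["recall"]
minority_f1 = report["1"]["f1-score"]
accuracy = accuracy_score(y_true, y_pred)

print(f"accuracy:        {accuracy:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
print(f"class-1 F1:      {minority_f1:.4f}")
```

##### For comparing imputation strategies on an imbalanced target, the class-1 or macro-averaged F1-score may therefore be more informative than the weighted averages.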

----------------------------------------------------------------

#### Trade-off between Model D and other 3 Models

##### Listwise deletion (Model D) removes any row that contains a missing value, which guarantees a complete dataset for modeling. This simplifies processing and avoids the risk of incorrectly estimated values. In this case, Model D slightly outperformed the imputed models in accuracy, precision, recall, and F1-score. The trade-off is that it discards data: here 6,636 of 30,000 rows (about 22%) were removed, which reduces the amount of training information and can bias the model whenever the missingness is not completely random. Model D can therefore perform poorly in practice even when it scores well here, because rows with missing values may differ systematically from complete rows, and a model built only on complete cases cannot score new records that have gaps without some imputation step. Imputation strategies, on the other hand, preserve all rows and use available patterns to estimate missing values. So although the imputed models (A, B, C) scored slightly lower, they retain the full dataset and are often more robust in real-world scenarios where losing data could hide important patterns.

----------------------------------------------------------------

#### Better regression method

##### Between the two regression-based imputations, the non-linear method (Model C using KNN) performed marginally better than the linear regression method (Model B) across all performance metrics - showing a slightly higher accuracy (0.8078 vs 0.8073), precision (0.7893 vs 0.7884), recall (0.8078 vs 0.8073), and F1-score (0.7694 vs 0.7687).

##### The difference is very small: on a 6,000-row test set, 0.8078 versus 0.8073 accuracy corresponds to about three predictions. It is therefore consistent with, but does not by itself establish, a non-linear relationship between the imputed feature (AGE) and its predictors. Linear regression assumes a straight-line relationship where changes in predictors affect the target proportionally. However, human-related variables like age often interact with socio-economic and occupational factors in more complex, curved, or clustered patterns.

##### The non-linear approach (KNN-based imputation) does not impose a linear structure; instead, it relies on similarity across the multidimensional feature space, capturing localized patterns and subtle dependencies that a linear model would oversimplify. This can yield imputations that are closer to the true underlying distribution of the missing values, which would explain the slightly more accurate model observed here.
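
##### Because the AGE values were masked artificially, the true values are still available in the original CSV, so imputation quality can be measured directly instead of being inferred from the downstream classifier. The sketch below shows the idea on toy data with a deliberately curved relationship; the data and the helper name `imputation_rmse` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(7)
n = 3_000
# Hypothetical data with a curved relation between predictor and target
full = pd.DataFrame({"x": rng.uniform(-3, 3, size=n)})
full["target"] = full["x"] ** 2 + rng.normal(scale=0.2, size=n)

# Mask 8% of the target and remember the true values
masked_idx = rng.choice(full.index, size=int(0.08 * n), replace=False)
observed = full.drop(index=masked_idx)
truth = full.loc[masked_idx, "target"]

def imputation_rmse(model):
    # Fit on observed rows, then compare predictions for masked rows with the truth
    model.fit(observed[["x"]], observed["target"])
    guess = model.predict(full.loc[masked_idx, ["x"]])
    return float(np.sqrt(np.mean((guess - truth.to_numpy()) ** 2)))

median_rmse = float(np.sqrt(np.mean((observed["target"].median() - truth) ** 2)))
linear_rmse = imputation_rmse(LinearRegression())
knn_rmse = imputation_rmse(KNeighborsRegressor(n_neighbors=5))
print(f"median: {median_rmse:.3f} | linear: {linear_rmse:.3f} | KNN: {knn_rmse:.3f}")
```

##### On the notebook's data, the same comparison would use the AGE column reloaded from 'UCI_Credit_Card.csv' together with `missing_mask`. The ranking on real data may differ from this toy case, where the relationship was constructed to be non-linear.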

----------------------------------------------------------------

#### Best recommended strategy 

##### Considering both the classification metrics and conceptual implications, imputation using the non-linear regression method (Model C) is the recommended strategy for this scenario. Although listwise deletion showed a slightly higher F1-score and accuracy, it comes at the cost of discarding about 22% of the rows. Non-linear imputation retains the full dataset, respects potential complex relationships among features, and delivers nearly the same predictive performance as listwise deletion. This makes it a safer and more generalizable approach, particularly in datasets where missingness is MAR and maintaining data integrity is important. Median or linear regression imputation also works reasonably well here, but KNN-based non-linear imputation can adapt better to the data's inherent structure and preserve subtle patterns that might be lost with simpler methods.

----------------------------------------------------------------
----------------------------------------------------------------