Upload the Data

In [2]:

from google.colab import files
uploaded = files.upload()


Saving train_data.csv to train_data (1).csv


This dataset contains demographic and financial data about clients. We will use it to predict loan repayment behavior.

Load the Data


In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv('train_data.csv')  # Use your actual file name
df.head()


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,410704,0,Cash loans,F,N,Y,1,157500.0,900000.0,26446.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,381230,0,Cash loans,F,N,Y,1,90000.0,733176.0,21438.0,...,0,0,0,0,0.0,0.0,0.0,0.0,2.0,1.0
2,450177,0,Cash loans,F,Y,Y,0,189000.0,1795500.0,62541.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,332445,0,Cash loans,M,Y,N,0,175500.0,494550.0,45490.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
4,357429,0,Cash loans,F,Y,Y,0,270000.0,1724688.0,54283.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


We load the dataset so that it can be prepared for feature engineering and modeling.

Clean the Data

In [4]:
# Remove extreme outliers in income
df = df[df['AMT_INCOME_TOTAL'] < 1e6]

# Drop unnecessary ID columns
df = df.drop(columns=['SK_ID_CURR'], errors='ignore')

# Fill missing values with median
df = df.fillna(df.median(numeric_only=True))

# Quick data check
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 153611 entries, 0 to 153754
Columns: 121 entries, TARGET to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 143.0+ MB


Next we clean the data. We remove rows where 'AMT_INCOME_TOTAL' is 1 million or more to reduce noise from extreme outliers, and drop the 'SK_ID_CURR' column because it is an identifier and not useful for modeling. Lastly, we fill missing numeric values with each column's median so the model receives complete inputs. These steps help the model learn from clean, consistent patterns in the data.
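As a rough illustration, the same three cleaning steps can be run on a tiny toy frame to see how many rows the income filter drops and which columns still contain missing values afterwards. The values below are made up, and only column names that appear in the dataset preview are used.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) mimicking a few columns of the real dataset
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "AMT_INCOME_TOTAL": [90000.0, 2500000.0, 157500.0, 120000.0],
    "AMT_ANNUITY": [21438.0, 50000.0, np.nan, 26446.5],
    "NAME_CONTRACT_TYPE": ["Cash loans", None, "Cash loans", None],
})

rows_before = len(toy)
toy = toy[toy["AMT_INCOME_TOTAL"] < 1e6]                 # drop extreme incomes
toy = toy.drop(columns=["SK_ID_CURR"], errors="ignore")  # drop the identifier
toy = toy.fillna(toy.median(numeric_only=True))          # fills numeric columns only

rows_dropped = rows_before - len(toy)
remaining_missing = toy.isna().sum()
print("rows dropped:", rows_dropped)
print(remaining_missing)
```

Because the medians are computed with `numeric_only=True`, text columns keep their missing values. With its default settings, `pd.get_dummies` encodes such rows as all zeros in that column's dummy variables.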

Encoding Categorical Columns

In [8]:
# One-hot encode object (categorical) columns
X = pd.get_dummies(df.drop(columns='TARGET'), drop_first=True)
y = df['TARGET']


Since XGBoost does not accept object-type (text) columns directly, we use one-hot encoding to convert them into binary numeric columns; `drop_first=True` drops one level per column to avoid redundant columns. This transformation allows the model to train without errors. I did have Gemini assist me with this step, as I was a bit lost.
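To make the effect of the encoding concrete, here is a minimal sketch on a three-row toy frame. The values are made up; the column names are taken from the dataset preview.

```python
import pandas as pd

# Toy frame with two text columns and one numeric column (made-up values)
toy = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F"],
    "FLAG_OWN_CAR": ["N", "Y", "Y"],
    "CNT_CHILDREN": [1, 0, 2],
})

# Numeric columns pass through unchanged; each text column becomes
# (number of levels - 1) indicator columns because of drop_first=True
encoded = pd.get_dummies(toy, drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Here the first level of each text column ("F" and "N") is dropped, so the result has one indicator per column, `CODE_GENDER_M` and `FLAG_OWN_CAR_Y`, alongside the untouched `CNT_CHILDREN`.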


Splitting the Dataset

In [11]:
from sklearn.model_selection import train_test_split

# Perform train/test/validation split with stratified sampling
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)


The dataset is split into training (70%), validation (15%), and test (15%) sets using stratified sampling to preserve the original distribution of the target variable ('TARGET'). This ensures that both defaulted and non-defaulted loans are fairly represented in each split.
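One way to sanity-check a split like this is to compare the sizes and positive-class rates of the three parts. The sketch below does so on synthetic labels with roughly 8% positives, an assumption chosen only to resemble the imbalance in this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels (assumption: about 8% positives)
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.08).astype(int)
X_demo = rng.normal(size=(10_000, 3))

# Same two-step 70/15/15 split as in the notebook
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

split_sizes = {"train": len(y_tr), "val": len(y_va), "test": len(y_te)}
positive_rates = {"train": y_tr.mean(), "val": y_va.mean(), "test": y_te.mean()}
print(split_sizes)
print(positive_rates)
```

With stratification, the three positive rates should agree with the overall rate to within rounding.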

Baseline Model Training

In [12]:
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Train XGBoost model on baseline data
baseline_model = XGBClassifier(eval_metric='logloss', random_state=42)
baseline_model.fit(X_train, y_train)

# Predict on validation set
y_pred_base = baseline_model.predict(X_val)
y_prob_base = baseline_model.predict_proba(X_val)[:, 1]

# Evaluate baseline performance
print("=== Baseline Model Performance ===")
print(classification_report(y_val, y_pred_base))
print("ROC AUC Score:", roc_auc_score(y_val, y_prob_base))


=== Baseline Model Performance ===
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     21182
           1       0.40      0.03      0.06      1860

    accuracy                           0.92     23042
   macro avg       0.66      0.51      0.51     23042
weighted avg       0.88      0.92      0.88     23042

ROC AUC Score: 0.7328865145188196


The baseline XGBoost model achieved high overall accuracy (92%), but this is misleading because of class imbalance: only 1,860 of the 23,042 validation cases (about 8%) are defaults. The model correctly identified just 3% of the default cases (recall = 0.03), which is not acceptable in a loan default prediction scenario. This confirms the need for feature engineering to give the model better signals, and for resampling techniques like SMOTE to address the class imbalance.
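To see why accuracy is a weak yardstick here, the short sketch below scores a hypothetical model that never predicts a default, using the class counts from the validation report above.

```python
import numpy as np

# Validation-set class counts taken from the classification report above
n_non_default, n_default = 21182, 1860

y_true = np.concatenate([
    np.zeros(n_non_default, dtype=int),
    np.ones(n_default, dtype=int),
])
y_always_zero = np.zeros_like(y_true)  # hypothetical model: never predicts default

trivial_accuracy = (y_always_zero == y_true).mean()
trivial_recall = (y_always_zero[y_true == 1] == 1).mean()
print(f"accuracy: {trivial_accuracy:.3f}, recall on defaults: {trivial_recall:.2f}")
```

This trivial predictor also reaches roughly 92% accuracy while catching no defaults at all, so class-1 recall, F1-score, and ROC AUC are the more informative metrics for this problem.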

Apply SMOTE to Address Class Imbalance

In [13]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE only to the training data
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Check class distribution after resampling
print("Original training set class distribution:")
print(y_train.value_counts())
print("\nAfter SMOTE:")
print(pd.Series(y_resampled).value_counts())


Original training set class distribution:
TARGET
0    98844
1     8683
Name: count, dtype: int64

After SMOTE:
TARGET
0    98844
1    98844
Name: count, dtype: int64


SMOTE successfully balanced the training dataset by generating synthetic examples of the minority class (defaults). This helps prevent the model from becoming biased toward the majority class (non-defaults) and improves its ability to recognize patterns in both groups.

After resampling, the class distribution in the training data is now 50/50, which provides a better foundation for learning and should improve recall and precision for class 1 in later evaluations.
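The core idea behind the synthetic examples can be sketched in a few lines of NumPy: pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. This is a simplified illustration of the interpolation step only, not imblearn's actual implementation; the function name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=3, rng=None):
    """Create one synthetic point by interpolating between a random minority
    point and one of its k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))
    distances = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # skip the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                           # position along the segment
    return minority[i] + lam * (minority[j] - minority[i])

# Toy minority-class points in two dimensions (made-up values)
minority_points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
rng = np.random.default_rng(42)
synthetic = np.array([smote_like_sample(minority_points, rng=rng) for _ in range(5)])
print(synthetic)
```

Because every synthetic point lies between two real minority points, resampling adds no information beyond what the original defaults already contain, which is one reason it is applied to the training data only.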


Create a CREDIT_INCOME_RATIO Feature

In [14]:
# Create CREDIT_INCOME_RATIO feature in both resampled training and validation sets
X_resampled['CREDIT_INCOME_RATIO'] = X_resampled['AMT_CREDIT'] / (X_resampled['AMT_INCOME_TOTAL'] + 1)
X_val['CREDIT_INCOME_RATIO'] = X_val['AMT_CREDIT'] / (X_val['AMT_INCOME_TOTAL'] + 1)



This feature gives the model insight into how much credit is being requested compared to a person’s total income. A higher ratio might indicate a greater risk of default, while a lower ratio suggests better repayment capacity. Adding this kind of derived feature is one way to enhance the model’s predictive power using domain knowledge.
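A small helper makes it easier to apply exactly the same transformation to every split. The function `add_credit_income_ratio` below is a hypothetical sketch rather than part of the notebook; the two toy rows reuse the credit and income values from the first and fourth rows of the dataset preview.

```python
import pandas as pd

def add_credit_income_ratio(frame):
    """Return a copy of `frame` with a CREDIT_INCOME_RATIO column added."""
    out = frame.copy()
    # The +1 in the denominator guards against division by zero
    out["CREDIT_INCOME_RATIO"] = out["AMT_CREDIT"] / (out["AMT_INCOME_TOTAL"] + 1)
    return out

toy = pd.DataFrame({
    "AMT_CREDIT": [900000.0, 494550.0],
    "AMT_INCOME_TOTAL": [157500.0, 175500.0],
})
toy_with_ratio = add_credit_income_ratio(toy)
print(toy_with_ratio)
```

Working on a copy leaves the input frame unchanged. The same call could also be applied to the held-out test split so that it has the same columns as the training data before any final evaluation.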


Create AGE_GROUP Feature by Binning Age

In [15]:
# Convert DAYS_BIRTH (which is in negative days) to age in years and bin it
X_resampled['AGE_GROUP'] = pd.cut(-X_resampled['DAYS_BIRTH'] / 365, bins=[20, 30, 40, 50, 60, 80], labels=False)
X_val['AGE_GROUP'] = pd.cut(-X_val['DAYS_BIRTH'] / 365, bins=[20, 30, 40, 50, 60, 80], labels=False)



We binned applicants’ ages into 10-year ranges (with a wider final bin for ages 60 to 80). This simplifies the representation of age and can help the model identify trends or risk profiles based on broader life stages (young adults, middle-aged, older clients). It may also help reduce overfitting by generalizing age-related behavior.
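The behaviour of `pd.cut` with these bin edges is easy to check on a handful of made-up `DAYS_BIRTH` values. Note that the intervals are right-closed by default, so any age outside (20, 80] gets no bin and comes back as NaN.

```python
import pandas as pd

# DAYS_BIRTH is stored as negative days; these values are made up
days_birth = pd.Series([-9000, -12775, -16000, -23000, -7000])
age_years = -days_birth / 365

# labels=False returns the integer bin index instead of an interval label
age_group = pd.cut(age_years, bins=[20, 30, 40, 50, 60, 80], labels=False)
print(pd.DataFrame({"age_years": age_years.round(1), "AGE_GROUP": age_group}))
```

The last value corresponds to an age of about 19, which falls below the first bin edge and is therefore left as NaN. XGBoost accepts missing values in its inputs, so such rows do not break training, but widening the first bin would be a reasonable alternative.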


Train Model on Feature-Engineered Data

In [16]:
# Retrain model with engineered features
fe_model = XGBClassifier(eval_metric='logloss', random_state=42)
fe_model.fit(X_resampled, y_resampled)

# Predict on the validation set (same rows as before, with the two new features added)
y_pred_fe = fe_model.predict(X_val)
y_prob_fe = fe_model.predict_proba(X_val)[:, 1]

# Evaluate performance
print("=== Feature Engineered Model Performance ===")
print(classification_report(y_val, y_pred_fe))
print("ROC AUC Score (FE):", roc_auc_score(y_val, y_prob_fe))


=== Feature Engineered Model Performance ===
              precision    recall  f1-score   support

           0       0.92      0.99      0.96     21182
           1       0.36      0.04      0.08      1860

    accuracy                           0.92     23042
   macro avg       0.64      0.52      0.52     23042
weighted avg       0.88      0.92      0.89     23042

ROC AUC Score (FE): 0.72775356790052


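Comparing the baseline and feature-engineered models, recall and F1-score for class 1 improved slightly, so the model is catching a few more defaulters than before (still not great, but it's learning). Precision for class 1 fell from 0.40 to 0.36, and ROC AUC dropped very slightly (0.7329 to 0.7278), which may be due to overfitting on synthetic data or to the added features not providing enough separation power. Note that this model was trained on the SMOTE-resampled data, so the change reflects both the resampling and the new features.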

Overall, the baseline XGBoost model, trained with no feature engineering, achieved high accuracy (92%) and a decent ROC AUC (0.7329). However, it failed to identify defaulters effectively, with a recall of just 0.03 for class 1.

After retraining the model on the resampled, feature-engineered data, the **recall for class 1 increased slightly from 0.03 to 0.04**, and the **F1-score improved from 0.06 to 0.08**. These are small but meaningful steps toward building a fairer and more effective model. ROC AUC remained relatively stable (0.7329 vs. 0.7278).

In conclusion, feature engineering (together with SMOTE) helped the model identify slightly more default cases without materially hurting overall performance. With further tuning, especially of the decision threshold or with more advanced sampling techniques, this model could become more robust in real-world settings.
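As a sketch of what threshold selection could look like, the snippet below computes class-1 recall and precision at a few cut-offs instead of the default 0.5. The helper `recall_precision_at` and the synthetic scores are assumptions for illustration; in the notebook the inputs would be `y_val` and `y_prob_fe`.

```python
import numpy as np

def recall_precision_at(y_true, y_prob, threshold):
    """Recall and precision for class 1 when predicting 1 if y_prob >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Synthetic scores (assumption): positives tend to score higher than negatives
rng = np.random.default_rng(42)
y_demo = (rng.random(5_000) < 0.08).astype(int)
p_demo = np.clip(rng.normal(0.15 + 0.2 * y_demo, 0.1), 0, 1)

results = {t: recall_precision_at(y_demo, p_demo, t) for t in (0.5, 0.3, 0.2)}
for t, (rec, prec) in results.items():
    print(f"threshold={t:.1f}  recall={rec:.2f}  precision={prec:.2f}")
```

Lowering the threshold can only keep or raise recall, usually at the cost of precision, so the cut-off should be chosen on the validation set according to how costly a missed default is relative to a false alarm.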

