# **INTRODUCTION**

In this notebook we study the Home Credit Default Risk competition on Kaggle and build models that estimate each applicant's default risk.

# Problem 1: Confirmation of competition details
_Learn & Predict_

We are tasked with predicting the likelihood of a loan applicant defaulting on a loan. The target is the column TARGET in the dataset where:

TARGET = 1: Client will default on the loan.

TARGET = 0: Client will repay the loan.

Our goal is to build a machine learning model that predicts the probability of default (TARGET = 1) for each loan applicant from structured data.

_File for Submission_

We submit a CSV file that contains, for each SK_ID_CURR in the test set, a predicted probability for the TARGET variable.

_Evaluation_

Our predictions will be evaluated using the Area Under the ROC Curve (AUC or ROC AUC) between the predicted probability and the observed target.
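
As a quick illustration of the metric, the sketch below scores four toy labels with scikit-learn's `roc_auc_score`. The labels and scores are made-up values for illustration, not competition data. ROC AUC depends only on how the predictions rank positives against negatives, so it can be read as the share of (positive, negative) pairs in which the positive receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Of the four (positive, negative) pairs, three rank the positive higher:
# (0.35 vs 0.1), (0.8 vs 0.1), (0.8 vs 0.4); only (0.35 vs 0.4) is wrong
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```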

# **Creating a Baseline Model**
First, we create a baseline model using a simple method. It serves as the benchmark against which later models are compared.

It doesn't need to be highly accurate. We just want it to run without errors and be able to submit estimates to Kaggle.

# Problem 2: Learning and verification
Create and run a series of steps that analyzes, preprocesses, trains on, and validates the data.

Use the competition's evaluation metric for validation. The training method is not specified.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Load training and test datasets
train = pd.read_csv('/kaggle/input/home-credit-default-risk/application_train.csv')
test = pd.read_csv('/kaggle/input/home-credit-default-risk/application_test.csv')

# Quick view
print("Train shape:", train.shape)
print("Test shape:", test.shape)
print("Missing values (top):\n", train.isnull().sum().sort_values(ascending=False).head())
print("Target Distribution:\n", train['TARGET'].value_counts(normalize=True))

Train shape: (307511, 122)
Test shape: (48744, 121)
Missing values (top):
 COMMONAREA_MEDI             214865
COMMONAREA_AVG              214865
COMMONAREA_MODE             214865
NONLIVINGAPARTMENTS_MODE    213514
NONLIVINGAPARTMENTS_AVG     213514
dtype: int64
Target Distribution:
 TARGET
0    0.919271
1    0.080729
Name: proportion, dtype: float64


# Preprocessing
For learning purposes, we take a different approach from the Housing Credit Analysis notebook: instead of dropping rows with nulls, we fill missing numeric values with the column median.

In [3]:
# Save target & drop from train
y = train['TARGET']
train = train.drop(columns=['TARGET'])

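# Fill missing numeric values with each column's median
train = train.fillna(train.median(numeric_only=True))
test = test.fillna(test.median(numeric_only=True))

# Encode categorical features with one encoder per column, fitted on the
# values of both frames so a category seen only in test cannot raise an error
for col in train.select_dtypes('object').columns:
    le = LabelEncoder()
    le.fit(pd.concat([train[col], test[col]]).astype(str))
    train[col] = le.transform(train[col].astype(str))
    test[col] = le.transform(test[col].astype(str))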

# Align train and test
train, test = train.align(test, join='inner', axis=1)
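
One alternative to the preprocessing above, sketched on toy data: learn the medians from the training frame only and reuse them for the test frame, and encode categories with scikit-learn's `OrdinalEncoder`, which can map categories it has never seen to a sentinel value. The toy frames, their column names, and the `-1` sentinel are assumptions for illustration and are not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for application_train / application_test
toy_train = pd.DataFrame({
    "income": [100.0, np.nan, 300.0],
    "contract": ["Cash", "Revolving", "Cash"],
})
toy_test = pd.DataFrame({
    "income": [np.nan, 250.0],
    "contract": ["Cash", "Leasing"],  # "Leasing" never appears in toy_train
})

# Impute both frames with medians learned from the training data only
medians = toy_train.median(numeric_only=True)
toy_train = toy_train.fillna(medians)
toy_test = toy_test.fillna(medians)

# One encoder covers every categorical column; unseen categories become -1
cat_cols = ["contract"]
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
toy_train[cat_cols] = encoder.fit_transform(toy_train[cat_cols])
toy_test[cat_cols] = encoder.transform(toy_test[cat_cols])

print(toy_test)
```

Fitting the imputer and encoder on the training frame alone keeps the test set out of every learned statistic, which makes the validation score a more faithful estimate of test behaviour.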

In [4]:
# Split Train & Validate
X_train, X_val, y_train, y_val = train_test_split(train, y, test_size=0.2,
                                                  random_state=42, stratify=y)
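
The `stratify=y` argument matters here because only about 8% of the rows are positive. The sketch below uses synthetic labels (an assumption, chosen to mimic that rate) to show that a stratified split keeps the positive rate the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an 8% positive rate, similar to TARGET
y_toy = np.array([1] * 80 + [0] * 920)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Compare the share of positives in the two splits
print("train positive rate:", y_tr.mean())
print("validation positive rate:", y_va.mean())
```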

In [5]:
# Train logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the validation set and score with ROC AUC
val_probs = model.predict_proba(X_val)[:, 1]
roc_auc = roc_auc_score(y_val, val_probs)
print("Validation ROC AUC:", roc_auc)

Validation ROC AUC: 0.6238852803755546
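
A single 80/20 split yields one estimate of the score. As a hedged sketch of a sturdier check, the block below runs stratified 5-fold cross-validation with `scoring="roc_auc"`. It uses synthetic data from `make_classification` as a stand-in for the application table, and the pipeline with `StandardScaler` is our assumption rather than the baseline used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the application table
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=42
)

# Scaling sits inside the pipeline so it is re-fitted on each training fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"ROC AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```

The spread across folds gives a rough sense of how much of a difference between two models is noise.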


# Problem 3: Estimation on test data
Perform estimation on the test data (application_test.csv) and submit it to Kaggle.

As long as the submission is correct, it doesn't matter if the accuracy is low.

In [6]:
# Predict probabilities of default (TARGET = 1)
test_probs = model.predict_proba(test)[:, 1]

# Pair each test ID with its predicted probability
submission = pd.DataFrame({
    'SK_ID_CURR': test['SK_ID_CURR'],
    'TARGET': test_probs
})

# Save to CSV for Kaggle submission
submission.to_csv('submission_home_credit.csv', index=False)

# Feature Engineering
Based on the baseline model, we will improve accuracy by making various adjustments to the input features.

# Problem 4: Feature engineering
To improve accuracy, perform feature engineering with the following in mind:

- Which features to use
- How to pre-process

We record which change produced which validation score. Train and validate at least five patterns.

For the patterns with the highest scores, we predict on the test data and submit the result to Kaggle.
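
As one example of a "which features to use" pattern, the sketch below derives ratio features from the amount columns of the application table (`AMT_CREDIT`, `AMT_INCOME_TOTAL`, `AMT_ANNUITY`). The helper name `add_ratio_features` and the new column names are hypothetical, and this notebook does not evaluate these features; treat it as a starting point for one of the five patterns.

```python
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with hand-crafted ratio columns (hypothetical helper)."""
    out = df.copy()
    # Loan size relative to income
    out["CREDIT_INCOME_RATIO"] = out["AMT_CREDIT"] / out["AMT_INCOME_TOTAL"]
    # Annuity payment relative to income
    out["ANNUITY_INCOME_RATIO"] = out["AMT_ANNUITY"] / out["AMT_INCOME_TOTAL"]
    # Rough loan term: how many annuity payments cover the credit amount
    out["CREDIT_TERM"] = out["AMT_CREDIT"] / out["AMT_ANNUITY"]
    return out

# Toy rows with illustrative values only
toy = pd.DataFrame({
    "AMT_CREDIT": [200000.0, 450000.0],
    "AMT_INCOME_TOTAL": [100000.0, 150000.0],
    "AMT_ANNUITY": [20000.0, 30000.0],
})
toy_fe = add_ratio_features(toy)
print(toy_fe[["CREDIT_INCOME_RATIO", "ANNUITY_INCOME_RATIO", "CREDIT_TERM"]])
```

Applying the same function to both the training and test frames keeps their columns aligned.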

In [7]:
# StandardScaler, the tree/linear models, roc_auc_score and lightgbm were
# imported in the first cell; only the new classes are imported here
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

results = {}

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_scaled, y_train)
lr_probs = lr.predict_proba(X_val_scaled)[:, 1]
results['Logistic Regression'] = roc_auc_score(y_val, lr_probs)

# Decision Tree
dt = DecisionTreeClassifier(max_depth=6, random_state=42)
dt.fit(X_train, y_train)
dt_probs = dt.predict_proba(X_val)[:, 1]
results['Decision Tree'] = roc_auc_score(y_val, dt_probs)

# Random Forest
rf = RandomForestClassifier(n_estimators=50, max_depth=10, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
rf_probs = rf.predict_proba(X_val)[:, 1]
results['Random Forest'] = roc_auc_score(y_val, rf_probs)

# Linear SVM + Calibration for probability estimates
svm_base = LinearSVC(max_iter=5000)
svm = CalibratedClassifierCV(svm_base, cv=3)
svm.fit(X_train_scaled, y_train)
svm_probs = svm.predict_proba(X_val_scaled)[:, 1]
results['SVM (Calibrated Linear)'] = roc_auc_score(y_val, svm_probs)

# LightGBM
lgbm = lgb.LGBMClassifier(n_estimators=100, max_depth=7)
lgbm.fit(X_train, y_train)
lgbm_probs = lgbm.predict_proba(X_val)[:, 1]
results['LightGBM'] = roc_auc_score(y_val, lgbm_probs)

# Print results
print("Validation ROC AUC Scores:")
for name, score in results.items():  # 'name' avoids overwriting the baseline `model`
    print(f"{name}: {score:.4f}")



[LightGBM] [Info] Number of positive: 19860, number of negative: 226148
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.119996 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 11567
[LightGBM] [Info] Number of data points in the train set: 246008, number of used features: 117
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432482
[LightGBM] [Info] Start training from score -2.432482
Validation ROC AUC Scores:
Logistic Regression: 0.7468
Decision Tree: 0.7130
Random Forest: 0.7374
SVM (Calibrated Linear): 0.7465
LightGBM: 0.7584


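LightGBM performed best (0.7584). Gradient-boosted trees tend to do well on tabular data like this.

Logistic Regression (0.7468) and the calibrated linear SVM (0.7465) were close behind. Standardizing the features accounts for most of the gain over the unscaled Logistic Regression baseline (0.6239).

Random Forest reached 0.7374. The Decision Tree lags at 0.7130, but it remains the easiest model to interpret if that is needed.

Since LightGBM scored highest, we use it to predict on the test data and submit the result to Kaggle.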
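
The scores reported so far can be collected into one sorted table, sketched below with a pandas Series. The values are copied from the cell outputs in this notebook; the variable names and the label for the unscaled baseline are ours.

```python
import pandas as pd

# Validation ROC AUC values copied from the outputs above
val_scores = {
    "Logistic Regression (unscaled baseline)": 0.6239,
    "Logistic Regression": 0.7468,
    "Decision Tree": 0.7130,
    "Random Forest": 0.7374,
    "SVM (Calibrated Linear)": 0.7465,
    "LightGBM": 0.7584,
}

# Sort from best to worst so the ranking is visible at a glance
summary = pd.Series(val_scores, name="Validation ROC AUC").sort_values(ascending=False)
print(summary.to_string())
```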

In [8]:
# Recombine the training and validation splits
X_full = pd.concat([X_train, X_val])
y_full = pd.concat([y_train, y_val])

# Train the final model on all labelled data
final_model = lgb.LGBMClassifier(n_estimators=100, max_depth=7)
final_model.fit(X_full, y_full)

# The test set was already imputed and encoded in the preprocessing cell,
# so only the column order has to match the training data
test = test[X_full.columns]


[LightGBM] [Info] Number of positive: 24825, number of negative: 282686
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.139339 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 11461
[LightGBM] [Info] Number of data points in the train set: 307511, number of used features: 117
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432486
[LightGBM] [Info] Start training from score -2.432486


In [9]:
# Predict Probabilities
test_preds = final_model.predict_proba(test)[:, 1]

In [10]:
# Create submission file
submission = pd.read_csv("/kaggle/input/home-credit-default-risk/sample_submission.csv")
submission['TARGET'] = test_preds
submission.to_csv("lightgbm_submission.csv", index=False)
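
Assigning `test_preds` straight into the sample file assumes its rows are in the same order as `application_test.csv`. A minimal sketch of a safer assignment, on toy IDs and predictions that are assumptions for illustration, keys the predictions by `SK_ID_CURR` and looks them up in the sample file's order.

```python
import pandas as pd

# Toy stand-ins for sample_submission.csv and the test-set predictions
sample = pd.DataFrame({"SK_ID_CURR": [100001, 100005, 100013], "TARGET": 0.5})
test_ids = pd.Series([100005, 100013, 100001], name="SK_ID_CURR")  # different order
preds = [0.10, 0.20, 0.30]

# Key the predictions by ID, then look each one up in the sample file's order
pred_by_id = pd.Series(preds, index=test_ids)
sample["TARGET"] = sample["SK_ID_CURR"].map(pred_by_id)

print(sample)
```

If an ID in the sample file has no prediction, `map` leaves `NaN` in that row, so a quick `sample["TARGET"].isna().any()` check before saving would catch a mismatch.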
