
# Credit Card Default Prediction

## Business Objective
The objective of this project is to predict which customers are likely to default on their credit card payments.
This is critical for financial institutions to reduce credit risk and minimize financial losses.

## Evaluation Focus
Because a missed defaulter (a false negative) translates directly into financial loss, this project
prioritizes **Recall** on the default class, reported alongside Accuracy and F1-score.
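
To make the trade-off concrete, here is a minimal sketch of how the metrics relate to confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical and are not taken from this dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 100 customers, 10 of whom actually default.
# The model catches only 4 of the 10 defaulters.
metrics = classification_metrics(tp=4, fp=1, fn=6, tn=89)
print(metrics)
```

With these illustrative counts, accuracy looks high while recall stays low, which is why accuracy alone can hide missed defaulters.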


In [None]:

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score



## Data Loading
The dataset is loaded directly from a public repository to ensure reproducibility.


In [None]:

# Load dataset
default = pd.read_csv(
    "https://raw.githubusercontent.com/ybifoundation/Dataset/main/Credit%20Default.csv"
)

default.head()



## Exploratory Data Analysis (EDA)


In [None]:

# Target distribution
sns.countplot(x='Default', data=default)
plt.title('Distribution of Credit Card Defaults')
plt.show()

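# Correlation heatmap (numeric columns only, so a non-numeric column cannot break corr())
plt.figure(figsize=(10,6))
sns.heatmap(default.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')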
plt.title('Feature Correlation Matrix')
plt.show()



### EDA Insights
- The target is imbalanced (see the count plot), so accuracy alone is not a sufficient metric.
- Loan-related features show a strong association with default in the correlation matrix.
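
The imbalance noted above can be quantified rather than read off the plot. The sketch below uses only the standard library; the label list is hypothetical, and `class_shares` is an illustrative helper. On the actual DataFrame, `default['Default'].value_counts(normalize=True)` should give the equivalent shares.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical target column: 17 non-defaulters (0) and 3 defaulters (1).
labels = [0] * 17 + [1] * 3
shares = class_shares(labels)

# Ratio of the largest class share to the smallest one.
imbalance_ratio = max(shares.values()) / min(shares.values())
print(shares, imbalance_ratio)
```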



## Feature Engineering


In [None]:

# Loan to Income Ratio
if 'Loan' in default.columns and 'Income' in default.columns:
    default['Loan_to_Income'] = default['Loan'] / default['Income']
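
A plain division produces infinite or undefined values if an income is zero. Whether that occurs in this dataset is not known here, so the following is only a defensive sketch; the helper `loan_to_income` and the sample values are hypothetical.

```python
import math

def loan_to_income(loan, income):
    """Return loan / income, or NaN when income is missing or not positive."""
    if income is None or income <= 0:
        return math.nan
    return loan / income

# Hypothetical (loan, income) pairs; the second has zero income.
pairs = [(5000.0, 50000.0), (1200.0, 0.0)]
ratios = [loan_to_income(loan, income) for loan, income in pairs]
print(ratios)
```

Rows that end up as NaN would then need to be dropped or imputed before model training.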



## Data Preparation


In [None]:

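# Separate the features from the target column
X = default.drop('Default', axis=1)
y = default['Default']

# Stratified 80/20 split keeps the class ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then reuse it on the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)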
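
The scaler is fitted on the training split and only applied to the test split, so no test-set statistics leak into training. The sketch below reproduces that idea for a single feature in plain Python. `StandardScaler` uses the population standard deviation, which `statistics.pstdev` matches; the helper names and the numbers are hypothetical.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # constant feature: avoid division by zero
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply a previously learned mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values for the two splits.
train_income = [2.0, 4.0, 6.0]
test_income = [4.0, 8.0]

mu, sigma = fit_standardizer(train_income)           # statistics from train only
scaled_train = standardize(train_income, mu, sigma)
scaled_test = standardize(test_income, mu, sigma)    # reuse the train statistics
print(scaled_train, scaled_test)
```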



## Model Building


In [None]:

# Logistic Regression
lr_model = LogisticRegression(max_iter=500)
lr_model.fit(X_train, y_train)

# Random Forest
rf_model = RandomForestClassifier(n_estimators=200, random_state=42)
rf_model.fit(X_train, y_train)
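
Given the class imbalance and the focus on recall, one option to consider is passing `class_weight='balanced'` to either estimator; both `LogisticRegression` and `RandomForestClassifier` accept it. Whether it helps on this dataset is an assumption that would need checking. The sketch below shows how those weights are derived, using the formula `n_samples / (n_classes * class_count)` with hypothetical labels and an illustrative helper name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical target: 8 non-defaulters (0) and 2 defaulters (1).
weights = balanced_class_weights([0] * 8 + [1] * 2)
print(weights)
```

The minority class receives the larger weight, so errors on defaulters cost more during training.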



## Model Evaluation


In [None]:

# Logistic Regression Evaluation
y_pred_lr = lr_model.predict(X_test)
print('--- Logistic Regression ---')
print(confusion_matrix(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))
print('Accuracy:', accuracy_score(y_test, y_pred_lr))

# Random Forest Evaluation
y_pred_rf = rf_model.predict(X_test)
print('--- Random Forest ---')
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
print('Accuracy:', accuracy_score(y_test, y_pred_rf))
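
`predict` effectively applies a 0.5 cut-off to the predicted default probability. Since recall is the priority, a lower cut-off applied to the output of `predict_proba` may be worth exploring. The following sketch shows the effect on hypothetical labels and probabilities; none of these numbers come from the models above.

```python
def recall_and_precision(y_true, y_prob, threshold):
    """Compute recall and precision for the positive class at a given cut-off."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Hypothetical true labels and predicted default probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.90, 0.45, 0.32, 0.40, 0.35, 0.10, 0.05, 0.20]

default_cutoff = recall_and_precision(y_true, y_prob, threshold=0.5)
lower_cutoff = recall_and_precision(y_true, y_prob, threshold=0.3)
print(default_cutoff, lower_cutoff)
```

In this toy example the lower cut-off catches every defaulter at the cost of more false alarms, which is the trade-off a lender would have to price.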



### Model Performance Discussion
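The Random Forest achieved perfect accuracy on the test set. A perfect score on a single hold-out split of a
small dataset is a noisy estimate and should be treated with caution: it may reflect overfitting to this
particular split or an unusually easy test sample. Further validation is required for real-world use.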



## Cross-Validation


In [None]:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(n_estimators=200, random_state=42))
])

cv_recall = cross_val_score(
    rf_pipeline,
    X,
    y,
    cv=5,
    scoring='recall'
)

print("Cross-Validation Recall Scores:", cv_recall)
print("Mean Recall:", cv_recall.mean())
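
For a classifier, an integer `cv` makes scikit-learn use stratified folds, so each fold keeps roughly the same share of defaulters. The sketch below illustrates that idea with a simplified round-robin assignment. It is not scikit-learn's exact algorithm, and the labels and the helper `stratified_fold_ids` are hypothetical.

```python
from collections import defaultdict

def stratified_fold_ids(labels, n_splits):
    """Assign each sample a fold id, spreading every class evenly across folds."""
    fold_of = [None] * len(labels)
    seen = defaultdict(int)  # how many samples of each class were placed so far
    for idx, label in enumerate(labels):
        fold_of[idx] = seen[label] % n_splits
        seen[label] += 1
    return fold_of

# Hypothetical target: 15 non-defaulters (0) and 5 defaulters (1).
labels = [0] * 15 + [1] * 5
folds = stratified_fold_ids(labels, n_splits=5)

# Count the defaulters and the total samples that land in each fold.
positives_per_fold = [
    sum(1 for f, y in zip(folds, labels) if f == k and y == 1) for k in range(5)
]
samples_per_fold = [folds.count(k) for k in range(5)]
print(positives_per_fold, samples_per_fold)
```

Without stratification, a fold could contain no defaulters at all, which would make the recall for that fold undefined.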



## Final Conclusion & Business Insight

On the held-out test set, the Random Forest model outperformed Logistic Regression in identifying
high-risk customers. Because its perfect test accuracy was obtained on a limited dataset, the result
should be confirmed on larger datasets before it is relied on.

This project demonstrates how machine learning can support credit risk assessment
and assist banks in proactive decision-making.
