# 🏦 Credit Risk Assessment: Predicting Credit Card Defaults

## 📌 Project Overview

In this project, I explore a real-world dataset of credit card clients to build a machine learning model that predicts whether a customer will default on their payment. This type of model is commonly used by financial institutions to assess borrower risk and make informed lending decisions.

The focus is on understanding:
- Which customer attributes (e.g., payment history, credit limit, bill amounts) are most predictive of default.
- How well a logistic regression model can perform in identifying risky clients.
- The challenges of working with imbalanced datasets in financial prediction.

---

## 🗂️ Data Source

Dataset: **Default of Credit Card Clients**

- **Source**: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)
- **Direct Download**: [Credit Card Default XLS](https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls)

> Note: This file is in `.xls` format. Use `pandas.read_excel()` to load it; pandas reads `.xls` files through the `xlrd` package, so that package needs to be installed.

---

## 🔍 Project Objectives

1. **Data Cleaning & Preprocessing**
- Load the dataset and handle missing/null values (if any)
- Rename columns for clarity
- Convert categorical variables (e.g., education, gender) into readable labels

2. **Exploratory Data Analysis (EDA)**
- Distribution of default vs non-default clients
- Correlation between features and default status
- Visualizations: bar plots, histograms, heatmaps

3. **Feature Engineering**
- Convert categorical features into dummy/encoded variables
- Normalize or scale continuous variables
- Handle class imbalance if needed (e.g., SMOTE)

4. **Model Building**
- Train a logistic regression model (baseline)
- Evaluate with accuracy, precision, recall, F1-score
- Use a confusion matrix and ROC curve to visualize performance

5. **Insights & Interpretation**
- Which features have the strongest influence on default risk?
- What would a risk manager or fintech product owner take away?

6. **(Optional) Advanced Modeling**
- Try random forest, XGBoost, or a decision tree
- Use GridSearchCV for hyperparameter tuning
- Compare performance metrics
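
The baseline model and evaluation objectives above can be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_classification`, not the credit card dataset, and the roughly 80/20 class split is an assumption chosen only to mimic an imbalanced target. It also swaps SMOTE for `class_weight="balanced"`, a simpler reweighting option built into scikit-learn, so no extra package is needed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): about 20% of rows are the positive class.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.8, 0.2],
    random_state=0,
)

# Stratify so both splits keep roughly the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Accuracy of always predicting the majority class, for comparison.
baseline_accuracy = max(y_test.mean(), 1 - y_test.mean())

report = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print(f"majority-class accuracy: {baseline_accuracy:.3f}")
for name, value in report.items():
    print(f"{name}: {value:.3f}")
print(cm)
```

Comparing accuracy against the majority-class figure shows why accuracy alone can mislead on imbalanced data: a model that never predicts a default already scores close to the majority share, so precision, recall, F1, and ROC AUC carry more of the signal.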

---

## ✅ Key Skills Practiced

- Data wrangling and transformation (`pandas`, `numpy`)
- Exploratory Data Analysis (`matplotlib`, `seaborn`)
- Classification modeling (`scikit-learn`)
- Evaluation metrics for imbalanced data
- Feature importance interpretation

---

## 🚀 Next Steps

- Package the model into a simple Streamlit or Flask app that takes user input and returns a default risk score
- Explore SHAP values for better explainability
- Expose the model through a real-time scoring API or dashboard to simulate production use



## 🧹 Step 1: Load and Inspect the Dataset

- Download the dataset from the UCI repository (link above)
- Load the Excel file into a DataFrame
- Check the shape (rows x columns)
- Display the first few rows to get a feel for the data
- Verify column names and datatypes
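
One thing worth checking during inspection is whether the sheet's real column names were read as the header or as the first data row. The sketch below uses a tiny hand-built frame as a stand-in for the raw sheet; the column names in it (`ID`, `LIMIT_BAL`, `default payment next month`) and the helper `promote_header_row` are illustrative assumptions. If the file has this layout, passing `header=1` to `pd.read_excel` is likely the simpler fix.

```python
import pandas as pd


def promote_header_row(raw: pd.DataFrame) -> pd.DataFrame:
    """Use the first data row as column names, then re-infer numeric dtypes."""
    named = raw.iloc[1:].reset_index(drop=True)
    named.columns = [str(name) for name in raw.iloc[0]]
    # Columns that were read as `object` become numeric once the name row is gone.
    return named.apply(pd.to_numeric)


# Hypothetical stand-in for the raw sheet: generic labels on top,
# descriptive names sitting in the first data row.
raw = pd.DataFrame(
    {
        "Unnamed: 0": ["ID", 1, 2, 3],
        "X1": ["LIMIT_BAL", 20000, 120000, 90000],
        "Y": ["default payment next month", 1, 1, 0],
    }
)

df_demo = promote_header_row(raw)

print(df_demo.shape)   # rows x columns
print(df_demo.dtypes)  # dtypes after re-inference
print(df_demo.head())
```

A frame where every column is `object` and the row count is one higher than expected is a common sign that a header row was read as data.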

---

## 🔍 Step 2: Clean and Prepare the Data

- Rename columns for readability (optional, but helpful)
- Drop or handle any irrelevant or duplicate columns
- Check for missing values
  - If any exist, decide whether to fill, drop, or flag them
- Confirm data types (e.g., numeric, categorical, object)
  - Convert types where appropriate (e.g., categorical labels)
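
The cleaning steps above might look something like this. It is a sketch on a small made-up frame: the column names, the `clean` helper, and the code-to-label mappings are assumptions that should be confirmed against the dataset documentation before use.

```python
import pandas as pd

# Assumed code-to-label mappings; verify against the dataset documentation.
SEX_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {
    1: "graduate school",
    2: "university",
    3: "high school",
    4: "other",
}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Rename the target, drop exact duplicate rows, and label categorical codes."""
    out = df.rename(columns={"default payment next month": "default"})
    out = out.drop_duplicates()
    # Codes missing from a mapping become "unknown" rather than silently NaN.
    return out.assign(
        SEX=out["SEX"].map(SEX_LABELS).fillna("unknown").astype("category"),
        EDUCATION=out["EDUCATION"]
        .map(EDUCATION_LABELS)
        .fillna("unknown")
        .astype("category"),
    )


# Hypothetical rows: ID 2 appears twice, and EDUCATION code 0 is unmapped.
demo = pd.DataFrame(
    {
        "ID": [1, 2, 2, 3],
        "SEX": [1, 2, 2, 2],
        "EDUCATION": [2, 1, 1, 0],
        "default payment next month": [1, 0, 0, 0],
    }
)

cleaned = clean(demo)
missing_per_column = cleaned.isna().sum()

print(cleaned)
print(missing_per_column)
```

Flagging unmapped codes as `"unknown"` keeps those rows visible, so the decision to drop, merge, or keep them can be made deliberately during EDA.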

---

## 📊 Step 3: Explore the Data (EDA)

- Analyze the distribution of the target variable (default vs no default)
- Explore distributions of key features (e.g., age, bill amount, payment history)
- Visualize relationships:
  - Bar plots, histograms, boxplots, correlation heatmaps
- Look for class imbalance and feature skewness
- Start asking early questions:
  - Do payment delays correlate with defaults?
  - Are certain age groups riskier?
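
Before plotting, the same questions can be answered numerically with a few `groupby` calls. The example below is a rough sketch on randomly generated data: the column names `AGE`, `PAY_0`, and `default` are assumed, and the link between payment delay and default is built into the synthetic target, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in data; PAY_0 is a hypothetical repayment-delay code.
demo = pd.DataFrame(
    {
        "AGE": rng.integers(21, 70, size=n),
        "PAY_0": rng.integers(-1, 4, size=n),
    }
)
# By construction, longer delays raise the chance of default.
p_default = 0.1 + 0.15 * demo["PAY_0"].clip(lower=0)
demo["default"] = (rng.random(n) < p_default).astype(int)

# Class balance of the target.
class_share = demo["default"].value_counts(normalize=True)

# Default rate for each payment-delay code.
rate_by_delay = demo.groupby("PAY_0")["default"].mean()

# Default rate by age band.
age_band = pd.cut(demo["AGE"], bins=[20, 30, 40, 50, 60, 70])
rate_by_age = demo.groupby(age_band, observed=True)["default"].mean()

# Linear correlation of each feature with the target.
corr_with_target = demo[["AGE", "PAY_0", "default"]].corr()["default"]

print(class_share)
print(rate_by_delay)
print(rate_by_age)
print(corr_with_target)
```

These tables feed directly into bar plots (for example, `rate_by_delay.plot(kind="bar")`) and into a correlation heatmap.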

---

## 🛠️ Step 4: Prepare Data for Modeling (Coming Soon)

- Encode categorical features (if needed)
- Normalize or scale continuous features
- Split into training/testing sets
- Plan model evaluation strategy (accuracy, precision, recall, F1, ROC)
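
A minimal sketch of the preparation steps listed above, assuming a cleaned frame with numeric columns, one labeled categorical column, and a binary `default` target. The data here is randomly generated and the column names are placeholders. The scaler is fitted on the training split only, so no information from the test set leaks into the transformation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in for the cleaned dataset (column names are assumptions).
demo = pd.DataFrame(
    {
        "LIMIT_BAL": rng.integers(10_000, 500_000, size=n).astype(float),
        "AGE": rng.integers(21, 70, size=n).astype(float),
        "EDUCATION": rng.choice(
            ["graduate school", "university", "high school"], size=n
        ),
        "default": (rng.random(n) < 0.2).astype(int),
    }
)

# One-hot encode the categorical column; drop_first avoids a redundant dummy.
X = pd.get_dummies(
    demo.drop(columns="default"), columns=["EDUCATION"], drop_first=True
)
y = demo["default"]

# Stratified split keeps the default rate similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on training data only, then apply it to both splits.
numeric_cols = ["LIMIT_BAL", "AGE"]
scaler = StandardScaler().fit(X_train[numeric_cols])

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test_scaled[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train_scaled.shape, X_test_scaled.shape)
print(y_train.mean(), y_test.mean())
```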



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

In [6]:
# assign a variable to the raw data path
cc_data_path = '../data/raw/default_of_credit_card_clients.xls'
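# Load the data set
# Note: with the default header setting, the sheet's descriptive column names
# end up in the first data row (hence the 30001 rows and object dtypes below);
# passing header=1 to read_excel should use that row as the header instead.
df = pd.read_excel(cc_data_path)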

# check the data set shape
print("Data set shape: ", df.shape)

# check the data types (df.info() prints its own report and returns None,
# so it is called directly rather than wrapped in print)
df.info()

# check the first few rows of data
print(df.head())


Data set shape:  (30001, 25)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30001 entries, 0 to 30000
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  30001 non-null  object
 1   X1          30001 non-null  object
 2   X2          30001 non-null  object
 3   X3          30001 non-null  object
 4   X4          30001 non-null  object
 5   X5          30001 non-null  object
 6   X6          30001 non-null  object
 7   X7          30001 non-null  object
 8   X8          30001 non-null  object
 9   X9          30001 non-null  object
 10  X10         30001 non-null  object
 11  X11         30001 non-null  object
 12  X12         30001 non-null  object
 13  X13         30001 non-null  object
 14  X14         30001 non-null  object
 15  X15         30001 non-null  object
 16  X16         30001 non-null  object
 17  X17         30001 non-null  object
 18  X18         30001 non-null  object
 19  X19         30001