# 🏦 Credit Risk Assessment: Predicting Credit Card Defaults

## 📌 Project Overview

In this project, I explore a real-world dataset of credit card clients to build a machine learning model that predicts whether a customer will default on their payment. This type of model is commonly used by financial institutions to assess borrower risk and make informed lending decisions.

The focus is on understanding:
- Which customer attributes (e.g., payment history, credit limit, bill amounts) are most predictive of default.
- How well a logistic regression model can perform in identifying risky clients.
- The challenges of working with imbalanced datasets in financial prediction.

---

## 🗂️ Data Source

Dataset: **Default of Credit Card Clients**

- **Source**: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)
- **Direct Download**: [Credit Card Default XLS](https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls)

> Note: This file is in `.xls` format. Use `pandas.read_excel()` to load it.

---

## 🔍 Project Objectives

1. **Data Cleaning & Preprocessing**
   - Load the dataset and handle missing/null values (if any)
   - Rename columns for clarity
   - Convert categorical variables (e.g., education, gender) into readable labels

2. **Exploratory Data Analysis (EDA)**
   - Distribution of default vs non-default clients
   - Correlation between features and default status
   - Visualizations: bar plots, histograms, heatmaps

3. **Feature Engineering**
   - Convert categorical features into dummy/encoded variables
   - Normalize or scale continuous variables
   - Handle class imbalance if needed (e.g., SMOTE)

4. **Model Building**
   - Train a logistic regression model (baseline)
   - Evaluate with accuracy, precision, recall, F1-score
   - Use a confusion matrix and ROC curve to visualize performance

5. **Insights & Interpretation**
   - Which features have the strongest influence on default risk?
   - What would a risk manager or fintech product owner take away?

6. **(Optional) Advanced Modeling**
   - Try random forest, XGBoost, or a decision tree
   - Use GridSearchCV for hyperparameter tuning
   - Compare performance metrics
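
The baseline modeling and evaluation objectives can be sketched roughly as follows. This is only an illustration on synthetic data from `make_classification`, not the project's actual results: the 22% positive share, the feature counts, and the use of `class_weight="balanced"` (instead of SMOTE) are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data: about 22% positives (assumed ratio).
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.78, 0.22],
    random_state=42,
)

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" reweights classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_test, y_pred)
report = {
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
print(cm)
print(report)
```

With an imbalanced target, precision, recall, and ROC AUC usually say more than accuracy alone, which is why the sketch reports them side by side.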

---

## ✅ Key Skills Practiced

- Data wrangling and transformation (`pandas`, `numpy`)
- Exploratory Data Analysis (`matplotlib`, `seaborn`)
- Classification modeling (`scikit-learn`)
- Evaluation metrics for imbalanced data
- Feature importance interpretation

---

## 🚀 Next Steps

- Package this into a simple Streamlit or Flask app to allow user input and get a default risk score
- Explore SHAP values for better explainability
- Use a real-time scoring API or dashboard for production simulation



## 🧹 Step 1: Load and Inspect the Dataset

- Download the dataset from the UCI repository (link above)
- Load the Excel file into a DataFrame
- Check the shape (rows x columns)
- Display the first few rows to get a feel for the data
- Verify column names and datatypes

---

## 🔍 Step 2: Clean and Prepare the Data

- Rename columns for readability (optional, but helpful)
- Drop or handle any irrelevant or duplicate columns
- Check for missing values
  - If any exist, decide whether to fill, drop, or flag them
- Confirm data types (e.g., numeric, categorical, object)
  - Convert types where appropriate (e.g., categorical labels)
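
The checks in this step might look like the sketch below. The miniature frame and the `has_missing_limit` flag column are hypothetical, chosen only to show one missing value and one duplicated row being handled.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one missing value and one duplicated row.
sample = pd.DataFrame(
    {
        "LIMIT_BAL": [20000, 120000, np.nan, 120000],
        "AGE": [24, 26, 34, 26],
        "EDUCATION": [2, 2, 1, 2],
    }
)

# Count missing values per column and fully duplicated rows.
missing_per_column = sample.isna().sum()
n_duplicates = int(sample.duplicated().sum())

cleaned = (
    sample.drop_duplicates()
    # Flag the missing value instead of silently filling or dropping it.
    .assign(has_missing_limit=lambda d: d["LIMIT_BAL"].isna())
    # Treat the education code as a categorical label, not a quantity.
    .astype({"EDUCATION": "category"})
)

print(missing_per_column)
print("duplicate rows:", n_duplicates)
print(cleaned.dtypes)
```

Whether to fill, drop, or flag a missing value depends on the column; flagging keeps the decision visible for later modeling.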

---

## 📊 Step 3: Explore the Data (EDA)

- Analyze the distribution of the target variable (default vs no default)
- Explore distributions of key features (e.g., age, bill amount, payment history)
- Visualize relationships:
  - Bar plots, histograms, boxplots, correlation heatmaps
- Look for class imbalance and feature skewness
- Start asking early questions:
  - Do payment delays correlate with defaults?
  - Are certain age groups riskier?
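
A minimal sketch of how these early questions could be checked with `pandas`. The toy values are invented, the column names `PAY_SEPT` and `DEFAULT` are assumed to match the renamed columns, and treating a repayment status above 0 as "delayed" is an assumption about the coding scheme.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame(
    {
        "AGE": [23, 27, 35, 41, 52, 29, 38, 60],
        "PAY_SEPT": [2, 0, -1, 0, 1, 2, 0, -1],
        "DEFAULT": [1, 0, 0, 0, 1, 1, 0, 0],
    }
)

# Class balance: share of each target value.
class_share = toy["DEFAULT"].value_counts(normalize=True)

# Default rate for clients with a payment delay vs. without
# (assumes a status > 0 means the payment was late).
toy["DELAYED"] = toy["PAY_SEPT"] > 0
rate_by_delay = toy.groupby("DELAYED")["DEFAULT"].mean()

# Default rate per age band.
age_band = pd.cut(toy["AGE"], bins=[20, 30, 40, 50, 60])
rate_by_age = toy.groupby(age_band, observed=False)["DEFAULT"].mean()

print(class_share)
print(rate_by_delay)
print(rate_by_age)
```

The same group-wise means can feed directly into bar plots with `seaborn` or `matplotlib`.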

---

## 🛠️ Step 4: Prepare Data for Modeling (Coming Soon)

- Encode categorical features (if needed)
- Normalize or scale continuous features
- Split into training/testing sets
- Plan model evaluation strategy (accuracy, precision, recall, F1, ROC)
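
Since this step is still planned, the following is only one possible shape for it: a stratified split followed by a `ColumnTransformer` that one-hot encodes a categorical column and scales the numeric ones. The random toy frame and its column names are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Random toy frame standing in for the cleaned data (assumed column names).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "EDUCATION": rng.choice(
            ["Graduate School", "University", "High School"], size=n
        ),
        "CREDIT_AMOUNT": rng.integers(10_000, 500_000, size=n).astype(float),
        "AGE": rng.integers(21, 70, size=n).astype(float),
        "DEFAULT": rng.choice([0, 1], size=n, p=[0.78, 0.22]),
    }
)

X = toy.drop(columns="DEFAULT")
y = toy["DEFAULT"]

# Split first, so the scaler and encoder only ever see training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

preprocess = ColumnTransformer(
    [
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["EDUCATION"]),
        ("numeric", StandardScaler(), ["CREDIT_AMOUNT", "AGE"]),
    ]
)

X_train_prepared = preprocess.fit_transform(X_train)  # fit on train only
X_test_prepared = preprocess.transform(X_test)        # reuse fitted statistics

print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the transformer on the training split alone avoids leaking test-set statistics into the model.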



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

In [3]:
# assign a variable to the raw data path
cc_data_path = '../data/raw/default_of_credit_card_clients.xls'
# Load the data set
df = pd.read_excel(cc_data_path, header=0)

# check the data set shape
print("Data set shape: ", df.shape)

# check the data types (df.info() prints its report itself and returns None)
df.info()

# Verify the fix
print("New data types after proper loading:")
print(df.dtypes)

# Review the first 5 rows of the data
print("\nFirst few rows of fixed data:")
print(df.head())


Data set shape:  (30001, 25)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30001 entries, 0 to 30000
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  30001 non-null  object
 1   X1          30001 non-null  object
 2   X2          30001 non-null  object
 3   X3          30001 non-null  object
 4   X4          30001 non-null  object
 5   X5          30001 non-null  object
 6   X6          30001 non-null  object
 7   X7          30001 non-null  object
 8   X8          30001 non-null  object
 9   X9          30001 non-null  object
 10  X10         30001 non-null  object
 11  X11         30001 non-null  object
 12  X12         30001 non-null  object
 13  X13         30001 non-null  object
 14  X14         30001 non-null  object
 15  X15         30001 non-null  object
 16  X16         30001 non-null  object
 17  X17         30001 non-null  object
 18  X18         30001 non-null  object
 19  X19         30001

## Data Investigation

- For more detailed documentation, see docs/data_investigation.ipynb

In [4]:
# Let's look at the first few rows of numeric columns to see how they're stored
print("Sample of what should be numeric data:")
print(df.iloc[:5, :5])  # First 5 rows and columns as an example

# Check if we have any special characters or formatting in the numbers
print("\nUnique values in first numeric column:")
print(df.iloc[:, 0].unique()[:5])  # First 5 unique values from first column


Sample of what should be numeric data:
  Unnamed: 0         X1   X2         X3        X4
0         ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE
1          1      20000    2          2         1
2          2     120000    2          2         2
3          3      90000    2          2         2
4          4      50000    2          2         1

Unique values in first numeric column:
['ID' 1 2 3 4]


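## Problem identified
- The spreadsheet has two header rows. pandas uses the generic labels (`Unnamed: 0`, `X1` ... `X23`, `Y`) as column names and reads the real names (`ID`, `LIMIT_BAL`, `SEX`, ...) as the first data row, so every column is loaded as `object` and the frame has 30001 rows.
- See docs/data_investigation.ipynb for further details
- See fixed code above
- See further diagnostics below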
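
When the real column names end up in the first data row, one possible repair is to promote that row to the header and convert the rest to numbers. The sketch below uses a tiny hand-made frame that mimics the layout shown in the output above; `promote_first_row` is a hypothetical helper name. When reloading from disk, passing `header=1` to `pd.read_excel` should achieve a similar result, assuming the real names sit on the second row of the sheet.

```python
import pandas as pd

# Tiny stand-in for the raw sheet: generic labels on top, real names in row 0.
raw = pd.DataFrame(
    [["ID", "LIMIT_BAL", "SEX"], [1, 20000, 2], [2, 120000, 2]],
    columns=["Unnamed: 0", "X1", "X2"],
)


def promote_first_row(frame: pd.DataFrame) -> pd.DataFrame:
    """Use the first data row as column names and make the rest numeric."""
    fixed = frame.iloc[1:].reset_index(drop=True)
    fixed.columns = frame.iloc[0].tolist()
    # Each column now holds only numbers, so the conversion cannot fail here.
    return fixed.apply(pd.to_numeric)


fixed = promote_first_row(raw)
print(fixed.dtypes)
print(fixed)
```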

In [5]:
# %%
# Detailed column investigation
print("Column names:")
print(df.columns.tolist())

# Let's examine each column's unique values
for col in df.columns[:5]:  # Starting with first 5 columns
    print(f"\nColumn: {col}")
    print("Sample unique values:")
    print(df[col].unique()[:5])
    print("Value counts:")
    print(df[col].value_counts().head())
    print("Can convert to numeric?")
    print(f"NA count if converted: {pd.to_numeric(df[col], errors='coerce').isna().sum()}")


Column names:
['Unnamed: 0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'Y']

Column: Unnamed: 0
Sample unique values:
['ID' 1 2 3 4]
Value counts:
Unnamed: 0
ID       1
19997    1
20009    1
20008    1
20007    1
Name: count, dtype: int64
Can convert to numeric?
NA count if converted: 1

Column: X1
Sample unique values:
['LIMIT_BAL' 20000 120000 90000 50000]
Value counts:
X1
50000     3365
20000     1976
30000     1610
80000     1567
200000    1528
Name: count, dtype: int64
Can convert to numeric?
NA count if converted: 1

Column: X2
Sample unique values:
['SEX' 2 1]
Value counts:
X2
2      18112
1      11888
SEX        1
Name: count, dtype: int64
Can convert to numeric?
NA count if converted: 1

Column: X3
Sample unique values:
['EDUCATION' 2 1 3 5]
Value counts:
X3
2    14030
1    10585
3     4917
5      280
4      123
Name: count, dtype: int64
Can convert to numeric?
NA coun

## Diagnostics: why are the DataFrame columns still `object` dtype?

In [6]:
# Display initial information
print("DataFrame Info:")
df.info()  # prints its own report; wrapping it in print() would add "None"

print("\nFirst few rows of data:")
print(df.head())

# Check if we have any string values in supposedly numeric columns
# print("\nSample values from LIMIT_BAL:")
# print(df['LIMIT_BAL'].head())
# print("\nUnique values in LIMIT_BAL:")
# print(df['LIMIT_BAL'].unique()[:5])


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30001 entries, 0 to 30000
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  30001 non-null  object
 1   X1          30001 non-null  object
 2   X2          30001 non-null  object
 3   X3          30001 non-null  object
 4   X4          30001 non-null  object
 5   X5          30001 non-null  object
 6   X6          30001 non-null  object
 7   X7          30001 non-null  object
 8   X8          30001 non-null  object
 9   X9          30001 non-null  object
 10  X10         30001 non-null  object
 11  X11         30001 non-null  object
 12  X12         30001 non-null  object
 13  X13         30001 non-null  object
 14  X14         30001 non-null  object
 15  X15         30001 non-null  object
 16  X16         30001 non-null  object
 17  X17         30001 non-null  object
 18  X18         30001 non-null  object
 19  X19         30001 non-null  ob


# Data Cleaning Process
## 1. Initial Setup and Data Loading


In [7]:
# Create a dictionary for data type conversion
dtypes_dict = {
    'ID': 'int32',
    'LIMIT_BAL': 'float64',
    'SEX': 'int8',
    'EDUCATION': 'int8',
    'MARRIAGE': 'int8',
    'AGE': 'int32',
    'PAY_0': 'int8',
    'PAY_2': 'int8',
    'PAY_3': 'int8',
    'PAY_4': 'int8',
    'PAY_5': 'int8',
    'PAY_6': 'int8',
    'BILL_AMT1': 'float64',
    'BILL_AMT2': 'float64',
    'BILL_AMT3': 'float64',
    'BILL_AMT4': 'float64',
    'BILL_AMT5': 'float64',
    'BILL_AMT6': 'float64',
    'PAY_AMT1': 'float64',
    'PAY_AMT2': 'float64',
    'PAY_AMT3': 'float64',
    'PAY_AMT4': 'float64',
    'PAY_AMT5': 'float64',
    'PAY_AMT6': 'float64',
    'default.payment.next.month': 'int8'
}

# Try to convert datatypes and catch any errors
try:
    df = df.astype(dtypes_dict)
    print("Data types successfully converted!")
except Exception as e:
    print(f"Error converting data types: {e}")

# Display information about the dataframe
print("\nDataFrame Info after conversion attempt:")
df.info()  # prints its own report; wrapping it in print() would add "None"

# Show the first few rows
print("\nFirst few rows:")
print(df.head())


Error converting data types: "Only a column name can be used for the key in a dtype mappings argument. 'ID' not found in columns."

DataFrame Info after conversion attempt:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30001 entries, 0 to 30000
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  30001 non-null  object
 1   X1          30001 non-null  object
 2   X2          30001 non-null  object
 3   X3          30001 non-null  object
 4   X4          30001 non-null  object
 5   X5          30001 non-null  object
 6   X6          30001 non-null  object
 7   X7          30001 non-null  object
 8   X8          30001 non-null  object
 9   X9          30001 non-null  object
 10  X10         30001 non-null  object
 11  X11         30001 non-null  object
 12  X12         30001 non-null  object
 13  X13         30001 non-null  object
 14  X14         30001 non-null  object
 15  X15         30001 non-null  objec
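
The error above arises because every key passed to `DataFrame.astype` must be an existing column label, and the frame still carries the generic `X1` ... `X23` names. A small sketch of checking for this before converting, using a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame that still has the generic column labels.
frame = pd.DataFrame({"X1": [20000, 120000], "X2": [2, 2]})
wanted = {"LIMIT_BAL": "float64", "SEX": "int8"}

# Keys that astype() would reject because no such column exists yet.
missing = [name for name in wanted if name not in frame.columns]

try:
    frame.astype(wanted)
    raised = False
except KeyError:
    raised = True  # same kind of failure as in the cell above

# After renaming, the same dtype mapping applies cleanly.
converted = frame.rename(columns={"X1": "LIMIT_BAL", "X2": "SEX"}).astype(wanted)

print("missing keys:", missing)
print(converted.dtypes)
```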

## Checking data type in the columns


In [8]:
df['X6'].unique()
df['X6'].sample(10)  # peek at a few


4933      0
5595      1
12257     0
21574     2
5557      0
11513    -1
27850     2
18240    -2
24617    -1
3046     -1
Name: X6, dtype: object

In [9]:
df['X6'] = pd.to_numeric(df['X6'], errors='coerce')
df['X6'].dtype


dtype('float64')

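## The code above confirms the fix for the X6 column
- The column is now numeric. It comes out as `float64` rather than an integer type because the header text in the first row is coerced to NaN.
- Apply the same cleaning process to all columns.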
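
To illustrate why the result is `float64` even though the visible values are whole numbers: `errors='coerce'` replaces anything that is not a number with NaN, and NaN cannot be stored in a plain integer column. A minimal sketch with an invented column, where `'PAY_0'` stands in for the stray header text:

```python
import pandas as pd

# Invented column: one stray header string followed by integer codes.
column = pd.Series(["PAY_0", 2, -1, 0], dtype="object")

converted = pd.to_numeric(column, errors="coerce")

# The header text became NaN, which forces the float64 dtype.
n_coerced = int(converted.isna().sum())

print(converted.dtype)
print("values coerced to NaN:", n_coerced)
```

Counting the coerced values is a cheap way to confirm that only the expected header cell was lost.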

In [10]:
# Convert every object-dtype column to numeric; columns that are already numeric are left untouched
for col in df.select_dtypes(include='object'):
    df[col] = pd.to_numeric(df[col], errors='coerce')
# check conversions afterwards
df.dtypes  # After conversion


Unnamed: 0    float64
X1            float64
X2            float64
X3            float64
X4            float64
X5            float64
X6            float64
X7            float64
X8            float64
X9            float64
X10           float64
X11           float64
X12           float64
X13           float64
X14           float64
X15           float64
X16           float64
X17           float64
X18           float64
X19           float64
X20           float64
X21           float64
X22           float64
X23           float64
Y             float64
dtype: object
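
After a coercion like this, a row that held only header text is left as all-NaN. A possible follow-up, sketched on a hypothetical three-column frame, is to drop such rows and cast the whole-number columns back to integers:

```python
import numpy as np
import pandas as pd

# Hypothetical frame after coercion: row 0 held header text, now all NaN.
frame = pd.DataFrame(
    {
        "Unnamed: 0": [np.nan, 1.0, 2.0],
        "X1": [np.nan, 20000.0, 120000.0],
        "Y": [np.nan, 1.0, 0.0],
    }
)

# Rows where every column is NaN carry no data.
header_rows = frame.isna().all(axis=1)
cleaned = frame.loc[~header_rows].reset_index(drop=True)

# Cast only columns whose remaining values are all whole numbers.
whole = [c for c in cleaned.columns if (cleaned[c] % 1 == 0).all()]
cleaned = cleaned.astype({c: "int64" for c in whole})

print("rows dropped:", int(header_rows.sum()))
print(cleaned.dtypes)
```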

## Rename the columns to be more descriptive

In [11]:
# Create a dictionary mapping original column names to more descriptive names
column_mapping = {
    'Unnamed: 0': 'ID',
    'X1': 'CREDIT_AMOUNT',
    'X2': 'GENDER',
    'X3': 'EDUCATION',
    'X4': 'MARITAL_STATUS',
    'X5': 'AGE',
    'X6': 'PAY_SEPT',  # 'PAY_0' in the original file
    'X7': 'PAY_AUG',  # or 'PAY_2' for consistency
    'X8': 'PAY_JUL',  # or 'PAY_3' for consistency
    'X9': 'PAY_JUN',  # or 'PAY_4' for consistency
    'X10': 'PAY_MAY',  # or 'PAY_5' for consistency
    'X11': 'PAY_APR',  # or 'PAY_6' for consistency
    'X12': 'BILL_SEPT',  # or 'BILL_AMT1' for consistency
    'X13': 'BILL_AUG',  # or 'BILL_AMT2' for consistency
    'X14': 'BILL_JUL',  # or 'BILL_AMT3' for consistency
    'X15': 'BILL_JUN',  # or 'BILL_AMT4' for consistency
    'X16': 'BILL_MAY',  # or 'BILL_AMT5' for consistency
    'X17': 'BILL_APR',  # or 'BILL_AMT6' for consistency
    'X18': 'PAY_AMT_SEPT',  # or 'PAY_AMT1' for consistency
    'X19': 'PAY_AMT_AUG',  # or 'PAY_AMT2' for consistency
    'X20': 'PAY_AMT_JUL',  # or 'PAY_AMT3' for consistency
    'X21': 'PAY_AMT_JUN',  # or 'PAY_AMT4' for consistency
    'X22': 'PAY_AMT_MAY',  # or 'PAY_AMT5' for consistency
    'X23': 'PAY_AMT_APR',  # or 'PAY_AMT6' for consistency
    # The target column is loaded as 'Y' (see the column list above)
    'Y': 'DEFAULT'
}

# Apply the renaming to your dataframe
df = df.rename(columns=column_mapping)
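
`DataFrame.rename` silently ignores mapping keys that do not match any column, so it can be worth comparing the mapping against the actual columns. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame and mapping; one key does not match any column.
frame = pd.DataFrame({"X1": [20000.0], "Y": [1.0]})
mapping = {"X1": "CREDIT_AMOUNT", "default.payment.next.month": "DEFAULT"}

# Keys that match nothing, and columns that would keep their old name.
unused_keys = sorted(set(mapping) - set(frame.columns))
unmapped_columns = sorted(set(frame.columns) - set(mapping))

renamed = frame.rename(columns=mapping)  # unknown keys are skipped silently

# errors="raise" makes the same call fail loudly instead.
try:
    frame.rename(columns=mapping, errors="raise")
    strict_failed = False
except KeyError:
    strict_failed = True

print("unused keys:", unused_keys)
print("unmapped columns:", unmapped_columns)
print(list(renamed.columns))
```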


In [12]:
# Map numeric codes to readable labels.
# Note: Series.map() returns NaN for any code missing from the dict;
# the value counts above show EDUCATION also contains code 5.
df['GENDER'] = df['GENDER'].map({1: 'Male', 2: 'Female'})
df['EDUCATION'] = df['EDUCATION'].map({1: 'Graduate School', 2: 'University', 3: 'High School', 4: 'Other'})
df['MARITAL_STATUS'] = df['MARITAL_STATUS'].map({1: 'Married', 2: 'Single', 3: 'Other'})
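
Because `Series.map` with a dict yields NaN for values that are not keys, it may help to list the unmapped codes before relabeling. The sketch below uses an invented series; the `'Unknown'` fallback label is an assumption, and note that `fillna` would also relabel values that were already missing.

```python
import pandas as pd

# Invented education codes; 5 has no entry in the label dict.
education = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
labels = {1: "Graduate School", 2: "University", 3: "High School", 4: "Other"}

# Codes present in the data but absent from the mapping.
unmapped_codes = sorted(set(education.dropna().unique()) - set(labels))

mapped = education.map(labels)  # code 5 becomes NaN
mapped_with_fallback = education.map(labels).fillna("Unknown")  # assumed label

print("unmapped codes:", unmapped_codes)
print(mapped_with_fallback.tolist())
```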
