
# FICO Score Quantization for Probability of Default (PD) Modeling

## Objective
- Convert continuous FICO scores (300–850) into **categorical risk buckets**
- Lower bucket index ⇒ **better credit quality**
- Buckets optimized for **Probability of Default (PD) prediction**
- Method must generalize to future datasets

We implement **log-likelihood–optimal quantization** using **dynamic programming**.



## Why Quantization?
Some ML architectures require categorical inputs, so a continuous FICO score must be binned before it can be used.
Quantization compresses numeric FICO scores into buckets while preserving default information.

We optimize bucket boundaries to **maximize likelihood of observed defaults**.


In [None]:

import pandas as pd
import numpy as np


In [None]:

# Load data (contains FICO + default indicator)
df = pd.read_csv('/mnt/data/Task 3 and 4_Loan_Data (1).csv')

# Rename columns for clarity if needed
# Expect columns: 'fico_score', 'default'
df.head()


In [None]:

# Keep only required columns
fico = df['fico_score'].astype(int).values
default = df['default'].astype(int).values

# Sort by FICO (important for DP)
order = np.argsort(fico)
fico = fico[order]
default = default[order]

N = len(fico)



## Log-Likelihood for a Bucket
For bucket *i*:
- nᵢ = number of borrowers
- kᵢ = number of defaults
- pᵢ = kᵢ / nᵢ

Log-likelihood:
$$
LL_i = k_i \log(p_i) + (n_i - k_i) \log(1 - p_i)
$$

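Buckets with homogeneous default behavior score higher: LLᵢ is never positive, and it reaches its maximum of 0 when every borrower in the bucket has the same outcome (pᵢ = 0 or pᵢ = 1, with the convention 0 · log 0 = 0).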
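
As a quick numeric illustration of the formula, the sketch below compares one mixed bucket with the same borrowers split into a riskier and a safer bucket. The counts are made up for illustration and are not taken from the loan data, and `bucket_ll` is a hypothetical helper that takes counts directly.

```python
import math

def bucket_ll(n, k):
    """Binomial log-likelihood of k defaults among n borrowers at p = k / n."""
    if k == 0 or k == n:
        return 0.0  # limit of x * log(x) as x -> 0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Hypothetical counts: 100 borrowers with 20 defaults in a single bucket
mixed = bucket_ll(100, 20)

# Same borrowers split into a risky bucket (50, 18) and a safe one (50, 2)
split = bucket_ll(50, 18) + bucket_ll(50, 2)

print(f"one bucket : {mixed:.2f}")
print(f"two buckets: {split:.2f}")
```

The split scores higher because each of its buckets is closer to a single outcome than the mixed bucket is.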


In [None]:

# Precompute cumulative default counts for O(1) bucket statistics
cum_defaults = np.cumsum(default)

def bucket_log_likelihood(i, j):
    '''
    Log-likelihood of the bucket covering sorted positions [i, j), with i < j,
    evaluated at the bucket's own default rate p = k / n.
    '''
    n = j - i  # borrowers in the bucket
    k = cum_defaults[j-1] - (cum_defaults[i-1] if i > 0 else 0)  # defaults in the bucket

    if k == 0 or k == n:
        # p is 0 or 1, so the log-likelihood is exactly 0 (0 * log 0 -> 0);
        # returning early also avoids evaluating log(0)
        return 0.0

    p = k / n
    return k * np.log(p) + (n - k) * np.log(1 - p)



## Dynamic Programming Formulation

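Let:
- dp[b][j] = maximum total log-likelihood when the first *j* borrowers (sorted by FICO) are split into *b* contiguous buckets
- dp[0][0] = 0 (base case); every other state starts at −∞

Transition:
$$
dp[b][j] = \max_{i < j} \left( dp[b-1][i] + LL(i, j) \right)
$$

where LL(i, j) is the log-likelihood of the bucket covering sorted positions [i, j).

For a fixed number of buckets, this yields **globally optimal bucket boundaries** on the fitting data, at a cost of O(B · N²) bucket evaluations for *B* buckets and *N* borrowers.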
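
One way to build confidence in the recursion above is to compare it with exhaustive search on a tiny input. The sketch below is a self-contained check on made-up outcomes; `ll`, `seg`, and the toy list `y` are illustrative names and not part of the notebook.

```python
import math
from itertools import combinations

def ll(n, k):
    """Bucket log-likelihood for k defaults among n borrowers."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Toy outcomes already sorted by score (1 = default); values are illustrative
y = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
n_obs, n_buckets = len(y), 3

# Prefix sums of defaults, with a leading zero
prefix = [0]
for v in y:
    prefix.append(prefix[-1] + v)

def seg(i, j):
    """Log-likelihood of the bucket covering positions [i, j)."""
    return ll(j - i, prefix[j] - prefix[i])

# DP: dp[b][j] = best total log-likelihood for the first j points in b buckets
NEG = float("-inf")
dp = [[NEG] * (n_obs + 1) for _ in range(n_buckets + 1)]
dp[0][0] = 0.0
for b in range(1, n_buckets + 1):
    for j in range(b, n_obs + 1):
        dp[b][j] = max(dp[b - 1][i] + seg(i, j) for i in range(b - 1, j))

# Brute force: try every choice of (n_buckets - 1) interior cut points
best = max(
    sum(seg(a, c) for a, c in zip((0,) + cuts, cuts + (n_obs,)))
    for cuts in combinations(range(1, n_obs), n_buckets - 1)
)

print(dp[n_buckets][n_obs], best)
```

If the two printed values agree, the DP has found the same optimum as enumerating every contiguous partition.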


In [None]:

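def optimal_fico_buckets(num_buckets):
    # Valid bucket starts: index 0 and every index where the FICO value changes,
    # so borrowers with the same score are never split across two buckets
    starts = [i for i in range(N) if i == 0 or fico[i] != fico[i-1]]
    ends = starts[1:] + [N]  # matching exclusive end indices
    if num_buckets > len(starts):
        raise ValueError('num_buckets exceeds the number of distinct FICO scores')

    # dp[b][j]: best log-likelihood for the first j borrowers in b buckets
    dp = np.full((num_buckets + 1, N + 1), -np.inf)
    prev = np.zeros((num_buckets + 1, N + 1), dtype=int)

    dp[0][0] = 0

    for b in range(1, num_buckets + 1):
        for j in ends:
            for i in starts:
                if i >= j:
                    break
                score = dp[b-1][i] + bucket_log_likelihood(i, j)
                if score > dp[b][j]:
                    dp[b][j] = score
                    prev[b][j] = i

    # Backtrack bucket boundaries as inclusive (low, high) FICO ranges
    boundaries = []
    j = N
    for b in range(num_buckets, 0, -1):
        i = prev[b][j]
        boundaries.append((fico[i], fico[j-1]))
        j = i

    # Return buckets in ascending FICO order
    boundaries.reverse()
    return boundaries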


In [None]:

# Example: Create 5 FICO buckets
buckets = optimal_fico_buckets(num_buckets=5)
buckets
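
The triple loop above runs over individual borrowers, so its cost grows quadratically with the number of rows. A possible speed-up, sketched below under the assumption that bucket boundaries should only fall between distinct FICO values, is to aggregate counts per unique score first and run the same DP over those groups. The function name `optimal_buckets_grouped` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np

def optimal_buckets_grouped(scores, defaults, num_buckets):
    """Likelihood-optimal buckets, with the DP run over unique scores."""
    uniq, inverse = np.unique(scores, return_inverse=True)
    n_per = np.bincount(inverse)                    # borrowers per unique score
    k_per = np.bincount(inverse, weights=defaults)  # defaults per unique score
    cn = np.concatenate(([0], np.cumsum(n_per)))    # prefix sums, leading zero
    ck = np.concatenate(([0], np.cumsum(k_per)))
    U = len(uniq)
    if num_buckets > U:
        raise ValueError("more buckets than distinct scores")

    def ll(i, j):
        # Log-likelihood of the bucket covering unique scores [i, j)
        n, k = cn[j] - cn[i], ck[j] - ck[i]
        if k == 0 or k == n:
            return 0.0
        p = k / n
        return k * np.log(p) + (n - k) * np.log(1 - p)

    dp = np.full((num_buckets + 1, U + 1), -np.inf)
    prev = np.zeros((num_buckets + 1, U + 1), dtype=int)
    dp[0, 0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, U + 1):
            for i in range(b - 1, j):
                s = dp[b - 1, i] + ll(i, j)
                if s > dp[b, j]:
                    dp[b, j], prev[b, j] = s, i

    # Backtrack into inclusive (low, high) ranges, ascending by score
    out, j = [], U
    for b in range(num_buckets, 0, -1):
        i = prev[b, j]
        out.append((int(uniq[i]), int(uniq[j - 1])))
        j = i
    return out[::-1]

# Toy data (illustrative): low scores default, high scores do not
toy_scores = np.array([580, 580, 600, 600, 640, 700, 700, 760, 800, 800])
toy_defaults = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
print(optimal_buckets_grouped(toy_scores, toy_defaults, 2))
```

Since FICO scores take at most 551 distinct integer values on the 300–850 range, the grouped DP works on a few hundred groups regardless of how many borrowers the dataset contains.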



## Construct Rating Map
Lower rating number ⇒ better FICO score: rating 1 is the highest-FICO bucket, and the largest rating is the lowest-FICO bucket.


In [None]:

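def fico_to_rating(score, buckets):
    # buckets are in ascending FICO order, so the last bucket (highest scores)
    # gets rating 1 and the first bucket gets rating len(buckets)
    n = len(buckets)
    for idx, (low, high) in enumerate(buckets):
        # Comparing against the upper edge only also covers scores below the
        # lowest observed value and scores in gaps between observed ranges
        if score <= high:
            return n - idx
    return 1  # above the highest observed score: best rating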


In [None]:

# Test rating assignment
df['fico_rating'] = [fico_to_rating(s, buckets) for s in df['fico_score']]
df[['fico_score', 'fico_rating', 'default']].head(10)
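
For large frames, the per-row Python loop can be replaced by a vectorized lookup. The sketch below uses `np.searchsorted` on the buckets' upper edges and follows the convention stated in this section (rating 1 is the highest-FICO bucket). The helper name `ratings_from_buckets` and the example bucket edges are hypothetical, not the boundaries produced by the optimizer.

```python
import numpy as np

def ratings_from_buckets(scores, buckets):
    """Map scores to ratings, where rating 1 is the highest-FICO bucket."""
    # Upper edges of all buckets except the last act as cut points
    upper_edges = np.array([high for _, high in buckets[:-1]])
    # side="left": a score equal to an upper edge stays in that bucket
    idx = np.searchsorted(upper_edges, scores, side="left")
    return len(buckets) - idx

# Hypothetical buckets in ascending FICO order, as inclusive (low, high) pairs
example_buckets = [(300, 579), (580, 669), (670, 739), (740, 850)]
example_scores = np.array([300, 579, 580, 700, 850, 900])
print(ratings_from_buckets(example_scores, example_buckets))
```

Scores outside the fitted range fall into the nearest end bucket, which keeps the mapping defined for data the optimizer never saw.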



## Interpretation for Risk Team

- Buckets are **data-driven**, not arbitrary
- Each bucket groups borrowers with similar observed default rates
- Suitable for:
  - PD modeling
  - Scorecards
  - Regulatory reporting
  - Categorical ML architectures

## Advantages
✔ Maximizes the likelihood of observed defaults  
✔ Globally optimal for a fixed number of buckets, via dynamic programming  
✔ Boundaries can be re-fit on future datasets with the same procedure  

## Next Steps
- Fit PD per bucket
- Calibrate PDs
- Integrate into mortgage risk engine
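
As a starting point for the first step, the empirical default rate per rating can be read off with a group-by. This is a minimal sketch on a made-up frame that reuses the notebook's column names `fico_rating` and `default`; the output column names `n`, `defaults`, and `pd_hat` are illustrative, and no calibration or smoothing is applied.

```python
import pandas as pd

# Toy frame with the notebook's column names; the values are made up
toy = pd.DataFrame({
    "fico_rating": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "default":     [0, 0, 0, 0, 1, 0, 1, 1, 0, 1],
})

# Borrower count, default count, and observed default rate per rating
pd_table = (
    toy.groupby("fico_rating")["default"]
       .agg(n="count", defaults="sum", pd_hat="mean")
       .reset_index()
)
print(pd_table)
```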
