## Quantizing Using Dynamic Programming

The task was to build a general method for quantizing borrowers' FICO scores into a fixed number of rating buckets. The objective is to find the bucket boundaries that best summarize the data for a downstream model. Candidate criteria include minimizing mean squared error and maximizing a log-likelihood function. I focused on the log-likelihood, which depends on the bucket boundaries, the number of records and defaults in each bucket, and the default probability of each bucket.
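For reference, the objective can be written out as follows. This restates what the `log_likelihood` function in the first code cell computes for one bucket, summed over all buckets. The symbols $n_i$ (records in bucket $i$), $k_i$ (defaults in bucket $i$), $p_i$ and $r$ (number of buckets) are notation introduced here for illustration.

$$
LL = \sum_{i=1}^{r} \left[ k_i \ln p_i + (n_i - k_i) \ln (1 - p_i) \right], \qquad p_i = \frac{k_i}{n_i}
$$

A bucket with $p_i = 0$ or $p_i = 1$ contributes $0$, which is the limiting value of its term.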

To solve this problem, I developed a rating map that translates FICO scores into ratings, with a lower rating indicating a better credit score. I used dynamic programming to find the boundaries: the best split of the scores up to a given value into i buckets is built from the best splits into i - 1 buckets, so each subproblem is solved once and reused. The resulting boundaries can then be applied to discretize future data sets for the model.
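The recurrence behind the dynamic programming table can be sketched as below. Here $N_j$ and $D_j$ denote the cumulative record and default counts for scores from 300 up to $300 + j$, and $\ell(n, k)$ is the log-likelihood of one bucket with $n$ records and $k$ defaults. This notation is an assumption added for illustration.

$$
dp[1][j] = \ell(N_j, D_j), \qquad dp[i][j] = \max_{k < j,\; N_k < N_j} \left\{ dp[i-1][k] + \ell(N_j - N_k,\; D_j - D_k) \right\} \quad (i \ge 2)
$$

The value for $r$ buckets over the full score range is $dp[r][550]$, and the boundaries are recovered by following the stored maximizing $k$ backwards from $j = 550$.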

In [1]:
import pandas as pd
import numpy as np


# Load data
data = pd.read_csv('Loan_Data.csv')

# Select the FICO score and default columns
fico_scores = data['fico_score']
defaults = data['default']

# Initialize counters, one slot per FICO score from 300 to 850
default_counts = np.zeros(551)
total_counts = np.zeros(551)

# Calculate default and total counts for each FICO score
for fico, default_val in zip(fico_scores, defaults):
    default_counts[fico - 300] += default_val
    total_counts[fico - 300] += 1



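def log_likelihood(n, k):
    """Binomial log-likelihood of k defaults among n records, evaluated at p = k/n."""
    if n == 0:
        return 0  # an empty bucket contributes nothing
    p = k / n
    if p == 0 or p == 1:
        return 0  # limiting value of the expression below; avoids log(0)
    return k * np.log(p) + (n - k) * np.log(1 - p)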



In [23]:
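# Cumulative record and default counts, so any score range is a difference of two entries
total = np.cumsum(total_counts)
default = np.cumsum(default_counts)

# Initialize dynamic programming table
# dp[i][j] = [best log-likelihood for scores 300..300+j split into i buckets,
#             index k at which the previous bucket ends]
r = 10  # number of buckets
dp = [[[-np.inf, 0] for _ in range(551)] for _ in range(r + 1)]

for i in range(r + 1):  # i is the number of buckets
    for j in range(551):  # j is the FICO score offset (score - 300)
        if i == 0:
            dp[i][j][0] = 0  # handles the 0-bucket case
            continue
        for k in range(j):
            if total[j] == total[k]:
                continue  # no records between k and j, so no bucket can end here
            if i == 1:
                # a single bucket covers every score up to j
                dp[i][j][0] = log_likelihood(total[j], default[j])
            else:
                candidate = dp[i-1][k][0] + log_likelihood(total[j] - total[k], default[j] - default[k])
                if dp[i][j][0] < candidate:
                    dp[i][j][0] = candidate
                    dp[i][j][1] = k  # index of previous bucket

print(round(dp[r][550][0], 4))

# Backtrack from the top score to recover the bucket boundaries
k = 550
boundaries = []
for i in range(r, -1, -1):
    boundaries.append(k + 300)
    k = dp[i][k][1]
    print(boundaries)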
    


-4217.8245
[850]
[850, 753]
[850, 753, 752]
[850, 753, 752, 732]
[850, 753, 752, 732, 696]
[850, 753, 752, 732, 696, 649]
[850, 753, 752, 732, 696, 649, 611]
[850, 753, 752, 732, 696, 649, 611, 580]
[850, 753, 752, 732, 696, 649, 611, 580, 552]
[850, 753, 752, 732, 696, 649, 611, 580, 552, 520]
[850, 753, 752, 732, 696, 649, 611, 580, 552, 520, 300]
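
The boundaries in the last printed list can be turned into the rating map described at the top. The following is a minimal sketch, not part of the original notebook: `fico_to_rating` is a hypothetical helper, and it assumes each bucket includes its upper boundary (matching how the table treats a bucket ending at index `j`) and that rating 1 is the highest-score bucket.

```python
from bisect import bisect_left

# Boundaries copied from the final printed list, sorted in ascending order.
BOUNDARIES = [300, 520, 552, 580, 611, 649, 696, 732, 752, 753, 850]


def fico_to_rating(score, boundaries=BOUNDARIES):
    """Hypothetical helper: map a FICO score to a rating, where 1 is the best bucket.

    Assumes buckets are [300, 520], (520, 552], ..., (753, 850],
    i.e. every bucket includes its upper boundary.
    """
    if not boundaries[0] <= score <= boundaries[-1]:
        raise ValueError(f"score {score} is outside {boundaries[0]}..{boundaries[-1]}")
    upper_edges = boundaries[1:]
    # Index of the first upper edge >= score; 0 is the lowest-score bucket.
    bucket = bisect_left(upper_edges, score)
    # Flip the index so that a lower rating means a better score.
    return len(upper_edges) - bucket


for s in (300, 521, 700, 850):
    print(s, fico_to_rating(s))
```

Using `bisect_left` keeps each lookup logarithmic in the number of buckets. If the opposite boundary convention is wanted (upper boundary excluded), `bisect_right` on the lower edges would be the natural change.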
