## CHAVAN ADVAIT GURUNATH
## advaitchavan135@gmail.com
## Task 4: Bucket FICO scores

### Charlie wants to make her model work for future data sets, so she needs a general approach to generating the buckets. Given a set number of buckets corresponding to the number of input labels for the model, she would like to find out the boundaries that best summarize the data. You need to create a rating map that maps the FICO score of the borrowers to a rating where a lower rating signifies a better credit score.

### The process of doing this is known as quantization. You could solve the problem in many ways by optimizing different properties of the resulting buckets, such as the mean squared error or the log-likelihood.
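To make the mean squared error criterion concrete, here is a minimal sketch that scores a candidate set of boundaries. The helper name `bucket_mse` and the toy scores are assumptions made for illustration; they are not part of the task data.

```python
import numpy as np

def bucket_mse(scores, boundaries):
    """Mean squared error when every score is replaced by the mean of its bucket."""
    scores = np.asarray(scores, dtype=float)
    # Bucket index of each score = number of boundaries strictly below it
    idx = np.searchsorted(boundaries, scores)
    sq_err = 0.0
    for b in np.unique(idx):
        members = scores[idx == b]
        sq_err += np.sum((members - members.mean()) ** 2)
    return sq_err / len(scores)

# Illustrative scores only (not taken from Loan_Data.csv)
toy_scores = [520, 540, 600, 610, 690, 700]
good = bucket_mse(toy_scores, boundaries=[570, 650])  # cuts fall between the clusters
bad = bucket_mse(toy_scores, boundaries=[530, 545])   # cuts crowd the low end
print(good, bad)
```

Boundaries that separate the natural groups of scores give a much lower error than boundaries that lump most scores into a single bucket.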

In [4]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

In [2]:
df = pd.read_csv('Loan_Data.csv')

In [3]:
df.head()

| | customer_id | credit_lines_outstanding | loan_amt_outstanding | total_debt_outstanding | income | years_employed | fico_score | default |
|---|---|---|---|---|---|---|---|---|
| 0 | 8153374 | 0 | 5221.545193 | 3915.471226 | 78039.38546 | 5 | 605 | 0 |
| 1 | 7442532 | 5 | 1958.928726 | 8228.75252 | 26648.43525 | 2 | 572 | 1 |
| 2 | 2256073 | 0 | 3363.009259 | 2027.83085 | 65866.71246 | 4 | 602 | 0 |
| 3 | 4885975 | 0 | 4766.648001 | 2501.730397 | 74356.88347 | 5 | 612 | 0 |
| 4 | 4700614 | 1 | 1345.827718 | 1768.826187 | 23448.32631 | 6 | 631 | 0 |


## Define Quantization Function

In [6]:
def quantize_fico_kmeans(df, fico_col='fico_score', n_buckets=4, random_state=42):
    """
    Fits 1D k-means on FICO scores and returns:
      - boundaries: list of n_buckets-1 cut points
      - df_out: DataFrame with added 'bucket_idx' and 'fico_rating'
    """
    # Fit k-means
    X = df[[fico_col]].astype(float).values
    # n_init is set explicitly because its default differs across scikit-learn versions
    km = KMeans(n_clusters=n_buckets, n_init=10, random_state=random_state)
    km.fit(X)
    
    # Sorted cluster centers
    centers = np.sort(km.cluster_centers_.flatten())
    # Boundaries at midpoints
    boundaries = [(centers[i] + centers[i+1]) / 2 
                  for i in range(len(centers)-1)]
    
    # Assign bucket indices
    df_out = df.copy()
    df_out['bucket_idx'] = np.searchsorted(boundaries, df_out[fico_col].values)
    
    # Map to rating: 1 = best (highest scores), K = worst
    df_out['fico_rating'] = 1 + (n_buckets - 1 - df_out['bucket_idx'])
    return boundaries, df_out
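The function places boundaries at the midpoints between neighbouring cluster centers. In one dimension this should give the same buckets as assigning each score to its nearest center, which is the rule k-means itself uses. The sketch below checks that on every integer score from 300 to 850; the center values are hypothetical and chosen only so that no score lands exactly on a midpoint.

```python
import numpy as np

# Hypothetical sorted centers, chosen only for illustration
centers = np.array([540.0, 605.0, 660.0, 725.0])
midpoints = (centers[:-1] + centers[1:]) / 2  # 572.5, 632.5, 692.5

scores = np.arange(300, 851)  # every integer score from 300 to 850

# Nearest-center assignment, the rule k-means uses
nearest = np.abs(scores[:, None] - centers[None, :]).argmin(axis=1)
# Threshold assignment, the rule quantize_fico_kmeans uses
by_threshold = np.searchsorted(midpoints, scores)

print(np.array_equal(nearest, by_threshold))
```

A score that sits exactly on a midpoint is equidistant from two centers; with the default `side='left'`, `np.searchsorted` puts it in the lower bucket.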


## Apply Quantization and Show Results

In [7]:
# Number of buckets
K = 4

# Quantize
boundaries, df_out = quantize_fico_kmeans(df, n_buckets=K)

print("Bucket boundaries:", boundaries)
display(df_out[['customer_id','fico_score','bucket_idx','fico_rating']])


Bucket boundaries: [570.955984898498, 631.8466709054499, 691.7167693413633]


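| | customer_id | fico_score | bucket_idx | fico_rating |
|---|---|---|---|---|
| 0 | 8153374 | 605 | 1 | 3 |
| 1 | 7442532 | 572 | 1 | 3 |
| 2 | 2256073 | 602 | 1 | 3 |
| 3 | 4885975 | 612 | 1 | 3 |
| 4 | 4700614 | 631 | 1 | 3 |
| ... | ... | ... | ... | ... |
| 9995 | 3972488 | 697 | 3 | 1 |
| 9996 | 6184073 | 615 | 1 | 3 |
| 9997 | 6694516 | 596 | 1 | 3 |
| 9998 | 3942961 | 647 | 2 | 2 |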
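Since the goal is a rating map that also works on future data, the fitted boundaries can be stored and applied to new scores without refitting. The sketch below reuses the three boundaries printed above; `rate_fico` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np

# Boundaries printed above for K = 4
boundaries = [570.955984898498, 631.8466709054499, 691.7167693413633]

def rate_fico(scores, boundaries):
    """Map FICO scores to ratings, where 1 is the best (highest) bucket."""
    n_buckets = len(boundaries) + 1
    bucket_idx = np.searchsorted(boundaries, np.asarray(scores, dtype=float))
    return n_buckets - bucket_idx

print(rate_fico([500, 572, 647, 697], boundaries))  # [4 3 2 1]
```

The ratings for 572, 647 and 697 match the rows shown in the table above.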


## Per-Bucket Statistics

In [8]:
summary = (
    df_out
    .groupby('fico_rating')
    .agg(
        count=('fico_score','size'),
        avg_fico=('fico_score','mean'),
        default_rate=('default','mean')
    )
    .sort_index()
)
display(summary)


| fico_rating | count | avg_fico | default_rate |
|---|---|---|---|
| 1 | 1862 | 723.747046 | 0.053169 |
| 2 | 3598 | 659.686492 | 0.117565 |
| 3 | 3212 | 604.006849 | 0.224471 |
| 4 | 1328 | 537.90512 | 0.457831 |
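The default rates above differ clearly between buckets, which suggests the other objective named in the task: choosing boundaries that maximize the log-likelihood of the observed defaults. Below is a rough sketch of one way to do that with dynamic programming, treating each bucket as a binomial with default probability k/n. The function names and the toy data are assumptions for illustration. The search costs on the order of K·m² steps for m distinct scores, so it is not tuned for speed.

```python
import numpy as np

def bucket_log_likelihood(n, k):
    """Binomial log-likelihood of k defaults among n borrowers at p = k / n."""
    if k == 0 or k == n:
        return 0.0  # p is 0 or 1, so every observation has probability 1
    p = k / n
    return k * np.log(p) + (n - k) * np.log(1 - p)

def quantize_fico_loglik(scores, defaults, n_buckets):
    """Return n_buckets - 1 boundaries maximizing the summed log-likelihood."""
    scores = np.asarray(scores)
    defaults = np.asarray(defaults)
    values = np.unique(scores)              # sorted distinct scores
    pos = np.searchsorted(values, scores)   # position of each score in values
    m = len(values)
    # Prefix sums of borrower counts and default counts over the distinct scores
    cum_n = np.concatenate([[0], np.cumsum(np.bincount(pos, minlength=m))])
    cum_k = np.concatenate(
        [[0], np.cumsum(np.bincount(pos, weights=defaults, minlength=m))])
    # best[b][j]: best log-likelihood splitting the first j values into b buckets
    best = np.full((n_buckets + 1, m + 1), -np.inf)
    split = np.zeros((n_buckets + 1, m + 1), dtype=int)
    best[0][0] = 0.0
    for b in range(1, n_buckets + 1):
        for j in range(b, m + 1):
            for i in range(b - 1, j):
                ll = best[b - 1][i] + bucket_log_likelihood(
                    cum_n[j] - cum_n[i], cum_k[j] - cum_k[i])
                if ll > best[b][j]:
                    best[b][j], split[b][j] = ll, i
    # Walk the split table backwards to recover where each bucket starts
    cuts, j = [], m
    for b in range(n_buckets, 1, -1):
        j = split[b][j]
        cuts.append(j)
    cuts.reverse()
    # Place each boundary halfway between neighbouring distinct scores
    return [float(values[c - 1] + values[c]) / 2 for c in cuts]

# Illustrative data only: default rates of 80%, 30% and 0% in three score bands
toy_scores = np.repeat([500, 510, 600, 610, 700, 710], 10)
toy_defaults = np.concatenate([
    np.tile([1] * 8 + [0] * 2, 2),   # 500 and 510: 8 defaults in 10
    np.tile([1] * 3 + [0] * 7, 2),   # 600 and 610: 3 defaults in 10
    np.zeros(20, dtype=int),         # 700 and 710: no defaults
])
print(quantize_fico_loglik(toy_scores, toy_defaults, n_buckets=3))
```

On the real data, a call such as `quantize_fico_loglik(df['fico_score'], df['default'], K)` would return boundaries that could be compared with the k-means ones.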


## Alternative – Equal-Frequency (Quantile) Binning

In [9]:
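# Equal-frequency binning into K buckets
df_q = df.copy()
df_q['bucket_q'] = pd.qcut(df_q['fico_score'], q=K, labels=False, duplicates='drop')
# duplicates='drop' can return fewer than K bins, so count the bins actually produced
n_bins_q = df_q['bucket_q'].max() + 1
# Invert so 1=best
df_q['fico_rating_q'] = n_bins_q - df_q['bucket_q']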

display(df_q[['customer_id','fico_score','bucket_q','fico_rating_q']])


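| | customer_id | fico_score | bucket_q | fico_rating_q |
|---|---|---|---|---|
| 0 | 8153374 | 605 | 1 | 3 |
| 1 | 7442532 | 572 | 0 | 4 |
| 2 | 2256073 | 602 | 1 | 3 |
| 3 | 4885975 | 612 | 1 | 3 |
| 4 | 4700614 | 631 | 1 | 3 |
| ... | ... | ... | ... | ... |
| 9995 | 3972488 | 697 | 3 | 1 |
| 9996 | 6184073 | 615 | 1 | 3 |
| 9997 | 6694516 | 596 | 0 | 4 |
| 9998 | 3942961 | 647 | 2 | 2 |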
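The quantile approach above assigns bucket codes but does not keep the cut points. If the boundaries need to be reused later, `pd.qcut` can return them through `retbins=True`. The sketch below runs on synthetic scores (an assumption made so the block is self-contained) rather than on `Loan_Data.csv`.

```python
import numpy as np
import pandas as pd

K = 4
# Illustrative scores only: one borrower at every integer score from 300 to 849
toy = pd.Series(np.arange(300, 850), name='fico_score')

# retbins=True also returns the bin edges (K + 1 values when no edge is dropped)
codes, edges = pd.qcut(toy, q=K, labels=False, retbins=True, duplicates='drop')
n_bins = len(edges) - 1
ratings = n_bins - codes  # 1 = best (highest scores)

print(edges)
print(ratings.value_counts().sort_index())
```

Unlike the k-means buckets, which vary in size, equal-frequency buckets hold roughly the same number of borrowers by construction.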
