# Task 4: Bucket FICO Scores

## Problem Statement

Charlie aims to build a model that is applicable to future datasets. The challenge is to produce a general methodology for bucketing FICO scores, such that these buckets serve as input labels to the model. The objective is to delineate boundaries that best capture and summarize the FICO score distribution. This process is termed "quantization."

Given a FICO score, the target is to map it to a rating in which a lower rating implies a better credit score.
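
To make the rating convention concrete, here is a minimal sketch of the final mapping step, assuming the bucket boundaries have already been chosen. The function name `fico_to_rating` and the cut points used in the example are hypothetical and are not taken from the analysis below.

```python
import numpy as np

def fico_to_rating(scores, cut_points):
    """Map FICO scores to integer ratings, where rating 1 is the best bucket.

    `cut_points` is an ascending list of interior boundaries (hypothetical
    values in the example below, not boundaries derived from the data).
    """
    cut_points = np.asarray(cut_points)
    # np.digitize returns 0 for scores below the first cut point and
    # len(cut_points) for scores at or above the last one.
    bucket_index = np.digitize(scores, cut_points)
    # Flip the index so that the highest-score bucket gets rating 1.
    return len(cut_points) + 1 - bucket_index

ratings = fico_to_rating([550, 650, 720, 820], [600, 700, 800])
print(ratings.tolist())
```

With three cut points there are four buckets, so ratings run from 1 (scores of 800 and above) to 4 (scores below 600).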

## Initial Approach

The first step was a simple bucketing into equal-width intervals, implemented by the `create_buckets` function in the "Equally Split" section.

`create_buckets` takes the start and end values of the FICO score range, the number of desired buckets (`k`), and the dataframe (`df`), and splits the range into `k` equal-width buckets. For each bucket it fits a normal distribution to the scores that fall inside it and computes the resulting log-likelihood, which is used to evaluate how well the bucket fits the data.
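
For reference, the per-bucket quantity computed in the code can be written out as follows. This is a restatement of the expression in `create_buckets`, assuming the mean and standard deviation are the maximum-likelihood estimates (which is what `np.mean` and `np.std` return by default).

```latex
\ell(x_1,\dots,x_n)
  = -\frac{n}{2}\,\ln\!\left(2\pi\hat\sigma^{2}\right)
    - \frac{1}{2\hat\sigma^{2}} \sum_{i=1}^{n} \left(x_i - \hat\mu\right)^{2},
\qquad
\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\hat\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \hat\mu\right)^{2}
```

Because the sum of squared deviations equals $n\hat\sigma^{2}$ under these estimates, the second term reduces to $-n/2$, so the value depends only on the bucket's size and variance. The expression is undefined for an empty bucket and diverges when all scores in a bucket are identical ($\hat\sigma^{2} = 0$).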

## Optimised Solution

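To get a better fit than equal-width bucketing, the bucket boundaries were adjusted iteratively: starting from equal-width buckets, each pass tries shifting a bucket's lower or upper boundary by one point and keeps the shift if it lowers that bucket's log-likelihood. Because a better fit under this model corresponds to a higher (less negative) log-likelihood, a search aimed at fit quality would maximize the log-likelihood rather than minimize it.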


# Equally Split

As a baseline, compute the log-likelihood of each bucket when the score range is split into equal-width buckets.

In [52]:
import pandas as pd
import numpy as np

def create_buckets(start, end, k, df):
    bucket_size = (end - start) // k
    buckets = []
    
    for i in range(k):
        lower_bound = start + i * bucket_size
        if i == k - 1:  # For the last bucket, ensure we capture the end value
            upper_bound = end
        else:
            upper_bound = lower_bound + bucket_size - 1
        
        # Filter the dataframe for FICO scores within the current bucket
        bucket_data = df[(df['fico_score'] >= lower_bound) & (df['fico_score'] <= upper_bound)]['fico_score'].values
        
        # Log likelihood under a normal distribution fitted to this bucket
        # (MLE mean and std). An empty bucket has no mean or std, so the result is nan.
        mu = np.mean(bucket_data)
        sigma = np.std(bucket_data)
        n = len(bucket_data)
        log_likelihood = -0.5 * n * np.log(2 * np.pi * sigma**2) - (1 / (2 * sigma**2)) * np.sum((bucket_data - mu)**2)
        
        buckets.append((lower_bound, upper_bound, log_likelihood))
        
    return buckets

# Read the data
df = pd.read_csv('Loan_Data.csv')

buckets_0_600 = create_buckets(0, 600, 5, df)
buckets_600_850 = create_buckets(600, 850, 5, df)

print("Buckets for 0-600 (Format: (Start, End, Log Likelihood)):")
for bucket in buckets_0_600:
    print(bucket)

print("\nBuckets for 600-850 (Format: (Start, End, Log Likelihood)):")
for bucket in buckets_600_850:
    print(bucket)


Buckets for 0-600 (Format: (Start, End, Log Likelihood)):
(0, 119, nan)
(120, 239, nan)
(240, 359, nan)
(360, 479, -227.43410802662515)
(480, 600, -12527.639323549474)

Buckets for 600-850 (Format: (Start, End, Log Likelihood)):
(600, 649, -12652.791737625666)
(650, 699, -11147.419966664507)
(700, 749, -5026.754543115131)
(750, 799, -1009.2452004585322)
(800, 850, -133.0821644404997)


The first three buckets in the 0-600 range contain no scores, so NumPy emits runtime warnings when taking the mean and variance of an empty array, and their log-likelihood is reported as `nan`.


# Solution

In [56]:
import pandas as pd
import numpy as np

def log_likelihood(bucket_data):
    if len(bucket_data) == 0:
        return float('-inf')
    mu = np.mean(bucket_data)
    sigma = np.std(bucket_data)
    n = len(bucket_data)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - (1 / (2 * sigma**2)) * np.sum((bucket_data - mu)**2)

def optimize_buckets(start, end, k, df, iterations=1000):
    bucket_size = (end - start) // k
    buckets = [(start + i * bucket_size, start + (i + 1) * bucket_size - 1) for i in range(k)]
    buckets[-1] = (buckets[-1][0], end)  # Ensure last bucket captures the end value

    for _ in range(iterations):
        for idx, (lower, upper) in enumerate(buckets):
            original_data = df[(df['fico_score'] >= lower) & (df['fico_score'] <= upper)]['fico_score'].values
            original_ll = log_likelihood(original_data)

            # Check if we can adjust the lower boundary
            if idx > 0:
                lower_adjusted_data = df[(df['fico_score'] >= lower-1) & (df['fico_score'] <= upper)]['fico_score'].values
                lower_ll = log_likelihood(lower_adjusted_data)

            # Check if we can adjust the upper boundary
            if idx < len(buckets) - 1:
                upper_adjusted_data = df[(df['fico_score'] >= lower) & (df['fico_score'] <= upper+1)]['fico_score'].values
                upper_ll = log_likelihood(upper_adjusted_data)

            # Accept a one-point shift of a boundary when it lowers this bucket's log likelihood.
            # Neighbouring buckets are not adjusted to match, so the final buckets can overlap.
            if idx > 0 and lower_ll < original_ll:
                buckets[idx] = (lower-1, upper)
                original_ll = lower_ll  # update the reference likelihood

            if idx < len(buckets) - 1 and upper_ll < original_ll:
                buckets[idx] = (lower, upper+1)

    # Get the log likelihoods for the adjusted buckets
    results = []
    for lower, upper in buckets:
        bucket_data = df[(df['fico_score'] >= lower) & (df['fico_score'] <= upper)]['fico_score'].values
        results.append((lower, upper, log_likelihood(bucket_data)))

    return results

# Read the data
df = pd.read_csv('Loan_Data.csv')

buckets_0_600 = optimize_buckets(0, 600, 5, df)
buckets_600_850 = optimize_buckets(600, 850, 5, df)

print("Buckets for 0-600 (Format: (Start, End, Log Likelihood)):")
for bucket in buckets_0_600:
    print(bucket)

print("\nBuckets for 600-850 (Format: (Start, End, Log Likelihood)):")
for bucket in buckets_600_850:
    print(bucket)


Buckets for 0-600 (Format: (Start, End, Log Likelihood)):
(0, 119, -inf)
(120, 239, -inf)
(240, 359, -inf)
(360, 794, -54828.762269007704)
(479, 600, -12553.944341609436)

Buckets for 600-850 (Format: (Start, End, Log Likelihood)):
(600, 794, -37736.59835138829)
(479, 794, -54338.58344768431)
(479, 794, -54338.58344768431)
(479, 806, -54541.565314484746)
(796, 850, -178.35289866784876)
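
The buckets printed above overlap (for example, `(479, 794)` appears twice), because each bucket is adjusted independently of its neighbours. One alternative is to search directly for contiguous, non-overlapping buckets that maximise the total log-likelihood, using dynamic programming over the sorted distinct scores. The following is a minimal sketch of that idea, not the method used in this notebook: the function name `best_boundaries` and the variance floor `min_var` are assumptions introduced here. The floor is needed because a bucket containing a single distinct score has zero variance, which would otherwise make the normal log-likelihood diverge.

```python
import numpy as np

def best_boundaries(scores, k, min_var=1.0):
    """Split scores into k contiguous buckets maximising the summed normal
    log-likelihood. `min_var` is an assumed variance floor (hypothetical)."""
    uniq, counts = np.unique(np.asarray(scores, dtype=float), return_counts=True)
    m = len(uniq)
    if not 1 <= k <= m:
        raise ValueError("k must be between 1 and the number of distinct scores")

    # Prefix sums of count, sum and sum of squares let each candidate
    # bucket be scored in constant time.
    c_n = np.concatenate(([0.0], np.cumsum(counts)))
    c_s = np.concatenate(([0.0], np.cumsum(counts * uniq)))
    c_q = np.concatenate(([0.0], np.cumsum(counts * uniq ** 2)))

    def seg_ll(i, j):
        """Normal log-likelihood of all scores in uniq[i..j] (inclusive)."""
        n = c_n[j + 1] - c_n[i]
        s = c_s[j + 1] - c_s[i]
        q = c_q[j + 1] - c_q[i]
        ss = max(q - s * s / n, 0.0)   # sum of squared deviations from the mean
        var = max(ss / n, min_var)     # floored MLE variance
        return -0.5 * n * np.log(2 * np.pi * var) - ss / (2 * var)

    # best[b, j]: best total log-likelihood for uniq[0..j] split into b + 1 buckets.
    # cut[b, j]: index in uniq where the last of those buckets starts.
    best = np.full((k, m), -np.inf)
    cut = np.zeros((k, m), dtype=int)
    for j in range(m):
        best[0, j] = seg_ll(0, j)
    for b in range(1, k):
        for j in range(b, m):
            for i in range(b, j + 1):
                total = best[b - 1, i - 1] + seg_ll(i, j)
                if total > best[b, j]:
                    best[b, j] = total
                    cut[b, j] = i

    # Walk back from the last bucket to recover the (lower, upper) bounds.
    bounds = []
    j = m - 1
    for b in range(k - 1, -1, -1):
        i = cut[b, j] if b > 0 else 0
        bounds.append((float(uniq[i]), float(uniq[j])))
        j = i - 1
    return bounds[::-1], float(best[k - 1, m - 1])

# Toy example with two well-separated clusters of scores.
toy_scores = [500, 501, 502, 503, 700, 701, 702, 703]
toy_bounds, toy_total = best_boundaries(toy_scores, k=2)
print(toy_bounds)
```

On the loan data this could be called as `best_boundaries(df['fico_score'], 5)`. The search costs on the order of `k * m**2` bucket evaluations, where `m` is the number of distinct scores, which stays small for integer FICO scores. The result depends on the choice of `min_var`, so the boundaries should be checked against a few values of it.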
