
Charlie wants her model to work on future data sets, so she needs a general approach to generating the buckets. Given a fixed number of buckets, corresponding to the number of input labels for the model, she would like to find the boundaries that best summarize the data. You need to create a rating map that maps each borrower's FICO score to a rating, where a lower rating signifies a better credit score.

This process is known as quantization. There are many ways to solve the problem, each optimizing a different property of the resulting buckets, such as the mean squared error or the log-likelihood (see below for definitions).
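
As a rough illustration of the mean squared error criterion, the sketch below scores a given set of boundaries by replacing every value with the mean of its bucket and averaging the squared differences. The helper name `bucket_mse` and the toy scores are made up for this example; they are not part of the task data.

```python
import numpy as np

def bucket_mse(scores, edges):
    """MSE when each score is replaced by the mean of its bucket (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    # The interior edges split the range into len(edges) - 1 buckets
    idx = np.digitize(scores, edges[1:-1])
    approx = np.empty_like(scores)
    for i in np.unique(idx):
        mask = idx == i
        approx[mask] = scores[mask].mean()
    return np.mean((scores - approx) ** 2)

toy_scores = np.array([520, 540, 700, 720])    # illustrative values only
good = bucket_mse(toy_scores, [300, 600, 850])  # boundary between the two clusters
bad = bucket_mse(toy_scores, [300, 530, 850])   # boundary that cuts through a cluster
```

A boundary that separates the two clusters should give a much smaller error than one that cuts through a cluster, which is the property an MSE-based quantizer tries to exploit.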

In [None]:
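import numpy as np

# Assumes fico_scores is a 1-D NumPy array of FICO scores loaded elsewhere
# fico_scores = np.array([...])

# Choose the number of buckets
n_buckets = 10  # for example, create 10 buckets

# Initialize the bucket boundaries with equally spaced edges
min_score = 300  # lowest possible FICO score
max_score = 850  # highest possible FICO score
bucket_edges = np.linspace(min_score, max_score, n_buckets + 1)

# Iteratively refine the bucket boundaries
for iteration in range(100):  # upper limit on the number of iterations
    # Assign each score to a bucket 0..n_buckets-1 using the interior edges
    bucket_indices = np.digitize(fico_scores, bucket_edges[1:-1])

    # Representative value of each bucket: the mean of its scores,
    # or the bucket midpoint if the bucket is empty
    centers = (bucket_edges[:-1] + bucket_edges[1:]) / 2
    for i in range(n_buckets):
        in_bucket = fico_scores[bucket_indices == i]
        if len(in_bucket) > 0:
            centers[i] = in_bucket.mean()

    # MSE between each score and the representative value of its bucket
    mse = np.mean((fico_scores - centers[bucket_indices]) ** 2)
    print(f"Iteration {iteration}: MSE={mse}")

    # Move each interior boundary to the midpoint between neighbouring centers
    new_edges = np.concatenate(([min_score], (centers[:-1] + centers[1:]) / 2, [max_score]))
    if np.allclose(new_edges, bucket_edges):
        break  # boundaries have stopped moving
    bucket_edges = new_edges

# Print the final bucket boundaries
print("Final bucket edges:", bucket_edges)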
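
Once the boundaries are fixed, the rating map itself can be built from them. The following is a minimal sketch, assuming ratings run from 1 (best) to the number of buckets (worst); `fico_to_rating` and `example_edges` are hypothetical names and values chosen for illustration, not output of the optimization above.

```python
import numpy as np

def fico_to_rating(scores, edges):
    """Map scores to ratings 1..n_buckets, where 1 is the highest-score bucket."""
    edges = np.asarray(edges, dtype=float)
    n_buckets = len(edges) - 1
    # Bucket index 0 is the lowest-score bucket, n_buckets - 1 the highest
    idx = np.digitize(scores, edges[1:-1])
    # Flip the order so that a lower rating means a better credit score
    return n_buckets - idx

example_edges = [300, 500, 600, 700, 850]  # made-up boundaries
ratings = fico_to_rating(np.array([300, 550, 600, 699, 850]), example_edges)
```

In this sketch a score that falls exactly on a boundary is assigned to the higher bucket, so it receives the better rating; the opposite convention would work equally well as long as it is applied consistently.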
