# Finding and using anchor points

In this notebook, we show how to find anchor points based on your training set and how to use them to estimate the performance of new models in the test set.

## Preparing data

Loading packages

In [1]:
import numpy as np
import pickle
import pandas as pd
import json
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances
from irt import *
from utils import *
random_state = 42

The leaderboard dataset we will use is composed of six scenarios (sub-datasets):
1. TruthfulQA
1. GSM8K
1. Winogrande
1. ARC
1. HellaSwag
1. MMLU

MMLU is further divided into sub-scenarios (e.g., abstract algebra and anatomy). Let's check the scenarios and sub-scenarios:

In [2]:
scenarios

{'harness_truthfulqa_mc_0': ['harness_truthfulqa_mc_0'],
 'gsm8k': ['harness_gsm8k_5'],
 'winogrande': ['harness_winogrande_5'],
 'arc': ['harness_arc_challenge_25'],
 'hellaswag': ['harness_hellaswag_10'],
 'mmlu': ['harness_hendrycksTest_abstract_algebra_5',
  'harness_hendrycksTest_anatomy_5',
  'harness_hendrycksTest_astronomy_5',
  'harness_hendrycksTest_business_ethics_5',
  'harness_hendrycksTest_clinical_knowledge_5',
  'harness_hendrycksTest_college_biology_5',
  'harness_hendrycksTest_college_chemistry_5',
  'harness_hendrycksTest_college_computer_science_5',
  'harness_hendrycksTest_college_mathematics_5',
  'harness_hendrycksTest_college_medicine_5',
  'harness_hendrycksTest_college_physics_5',
  'harness_hendrycksTest_computer_security_5',
  'harness_hendrycksTest_conceptual_physics_5',
  'harness_hendrycksTest_econometrics_5',
  'harness_hendrycksTest_electrical_engineering_5',
  'harness_hendrycksTest_elementary_mathematics_5',
  'harness_hendrycksTest_formal_logic_5',
 

Loading leaderboard data:

In [3]:
with open('data/lb.pickle', 'rb') as handle:
    data = pickle.load(handle)

Below, we will process the data so all correctness scores (for all scenarios) are stored in $Y$. The dictionaries `scenarios_position` and `subscenarios_position` give the position of scenarios/subscenarios correctness scores in $Y$.

In [5]:
scenarios_position, subscenarios_position = prepare_data(scenarios, data)
Y = create_responses(scenarios, data)
Y.shape

(395, 28659)

For example, below you can see the scores for MMLU:

In [6]:
Y[:,scenarios_position['mmlu']], Y[:,scenarios_position['mmlu']].shape

(array([[0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        ...,
        [0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        [1., 0., 1., ..., 1., 1., 0.]]),
 (395, 14042))

For scenarios with multiple subscenarios, we usually want every subscenario to carry equal importance in the scenario's aggregated performance. This is equivalent to computing the aggregated performance as a weighted average over items. We will create `balance_weights`, a vector of weights that lets us compute those weighted averages. The weights differ from one only for MMLU, the only scenario with multiple subscenarios.

In [7]:
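# Y has 28659 columns: the items of every MMLU subscenario plus the items of the other scenarios
balance_weights = np.ones(Y.shape[1])
# N: total number of MMLU items
N = len(scenarios_position['mmlu'])
# n_sub: number of MMLU subscenarios (subjects)
n_sub = len(scenarios['mmlu'])
# sub: one MMLU subject
for sub in scenarios['mmlu']:
    # n_i: number of items in this subject
    n_i = len(subscenarios_position['mmlu'][sub])
    # subscenarios_position['mmlu'][sub] holds the column positions of this subject's items in Y.
    # n_sub*n_i is the subject's item count times the number of MMLU subjects (57), so the weight
    # N/(n_sub*n_i) is above 1 for subjects with few items and below 1 for subjects with many items.
    balance_weights[subscenarios_position['mmlu'][sub]] = N/(n_sub*n_i)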

We can see below that first averaging within subscenarios and then computing a simple average is equivalent to using a weighted average from the beginning:

In [8]:
# accs1: average each model's correctness within every subject, then take the unweighted mean over subjects; shape (395,)
accs1 = np.mean([Y[:,subscenarios_position['mmlu'][sub]].mean(axis=1) for sub in scenarios['mmlu']], axis=0)
# accs2: multiply every 0/1 score by its item weight, keep the 14042 MMLU columns, and average per model; shape (395,)
accs2 = (balance_weights*Y)[:,scenarios_position['mmlu']].mean(axis=1)
# The two results agree up to floating-point error
np.abs(accs1 - accs2).mean()

2.322333605307685e-14
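
To make the weighting concrete, here is a minimal sketch on made-up data. The two subjects, their sizes, and the names `Y_toy` and `positions` are illustrative assumptions and are not part of the leaderboard data. It applies the same formula, `N/(n_sub*n_i)`, and checks that the weighted mean matches the mean of per-subject means.

```python
import numpy as np

# Toy data: 3 models, two hypothetical subjects with 2 and 4 items
Y_toy = np.array([[1, 0, 1, 1, 0, 1],
                  [0, 0, 1, 0, 0, 1],
                  [1, 1, 1, 1, 1, 0]], dtype=float)
positions = {'small': [0, 1], 'large': [2, 3, 4, 5]}

N = Y_toy.shape[1]        # total number of items
n_sub = len(positions)    # number of subjects
weights = np.ones(N)
for idx in positions.values():
    weights[idx] = N / (n_sub * len(idx))   # same formula as balance_weights

# Mean of per-subject means vs. weighted mean over all items
macro = np.mean([Y_toy[:, idx].mean(axis=1) for idx in positions.values()], axis=0)
weighted = (weights * Y_toy).mean(axis=1)

print(weights)
print(np.abs(macro - weighted).max())
```

Items of the smaller subject get a weight above one and items of the larger subject a weight below one, which is what lets a plain mean over weighted items reproduce the per-subject average.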

## Getting and using anchor points

The variable `clustering` specifies how the clustering is run. If `clustering="correct."`, then correctness is used. On the other hand, if `clustering="irt"`, then the IRT embeddings of the examples are used.

In [12]:
clustering = 'correct.' # 'correct.' or 'irt'

Computing anchor points and their weights for each scenario:

In [13]:
anchor_points = {}
anchor_weights = {}

for scenario in scenarios.keys():
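    # With 'correct.' clustering, each row of X is one item of the scenario and each column holds
    # one training model's 0/1 correctness, so the number of columns equals the number of models.
    # With 'irt' clustering, each row of X is one item and the columns hold that item's IRT
    # parameters (discrimination A and difficulty B).
    if clustering=='correct.':
        X = Y_train[:,scenarios_position[scenario]].T
    elif clustering=='irt':
        # Load the item parameters of the pre-trained IRT model:
        # `A` (discrimination): how well an item separates high-ability from low-ability models; higher means better separation.
        # `B` (difficulty): how hard an item is; higher means harder.
        A, B, _ = load_irt_parameters('data/irt_model/')
        # Stack A and B for all items. After the transpose (.T), each row is one item and the columns are its discrimination and difficulty parameters.
        X = np.vstack((A.squeeze(), B.squeeze().reshape((1,-1)))).T
        # Keep only the items that belong to the current scenario (e.g., 'mmlu').
        X = X[scenarios_position[scenario]]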
    else:
        raise NotImplementedError 
        
    #Normalizing balance_weights, so their sum is one within each scenario
    norm_balance_weights = balance_weights[scenarios_position[scenario]]
    norm_balance_weights /= norm_balance_weights.sum()

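    # Fitting the KMeans model
    # * kmeans.labels_:
    #   * Content: a 1-D array with one entry per row of X (i.e., per item). Entry i is the id of the
    #     cluster that item i was assigned to (0 to number_item - 1).
    #   * Use: the most direct clustering result; it tells us which cluster each item belongs to.

    # * kmeans.cluster_centers_:
    #   * Content: a 2-D array of shape (number_item, number of features). Each row is the centroid of one cluster in feature space.
    #   * Use: the "average" features of each cluster. The code below looks for the real item closest to each centroid and uses it as an anchor point.

    # * kmeans.inertia_:
    #   * Content: a float, the sum of squared distances from every sample to the centroid of its cluster.
    #   * Use: a measure of clustering quality; smaller values usually mean tighter clusters.
    kmeans = KMeans(n_clusters=number_item, n_init="auto", random_state=random_state)
    kmeans.fit(X, sample_weight=norm_balance_weights)

    # Calculating anchor points
    # For each of the number_item clusters found by KMeans, find the real item closest to the centroid and store the indices of these items as the scenario's anchor points.
    # The anchor points are a condensed, representative sample of the scenario's full item pool.
    anchor_points[scenario] = pairwise_distances(kmeans.cluster_centers_, X, metric='euclidean').argmin(axis=1)

    # Calculating anchor weights
    # The weight of each cluster is the sum of the normalized weights of the items in that cluster.
    # kmeans.labels_ == c:
    #   * A boolean mask for cluster id c.
    #   * For example, when c=5 the mask is True for every item in cluster 5 and False elsewhere.
    # norm_balance_weights[...]:
    #   * norm_balance_weights holds the normalized weight of every item.
    #   * norm_balance_weights[kmeans.labels_==c] uses the mask to select the weights of the items in cluster `c`.
    anchor_weights[scenario] = np.array([np.sum(norm_balance_weights[kmeans.labels_==c]) for c in range(number_item)])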
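
The whole procedure can be tried end to end on synthetic data. The following is a minimal sketch, not the notebook's pipeline: the response matrix is simulated from a simple logistic model, the sizes are arbitrary, and the item weights are uniform (all assumptions made for illustration). It clusters items by their correctness patterns, picks the item nearest to each centroid, and uses the cluster weights to estimate each model's accuracy.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances

rng = np.random.RandomState(0)
n_models, n_items, n_anchors = 30, 200, 10   # hypothetical sizes

# Synthetic 0/1 responses: P(correct) = sigmoid(ability - difficulty)
ability = rng.normal(size=(n_models, 1))
difficulty = rng.normal(size=(1, n_items))
prob = 1 / (1 + np.exp(-(ability - difficulty)))
Y_toy = (rng.uniform(size=(n_models, n_items)) < prob).astype(float)

X = Y_toy.T                                   # one row per item
item_weights = np.full(n_items, 1 / n_items)  # uniform weights that sum to one

kmeans = KMeans(n_clusters=n_anchors, n_init=10, random_state=0)
kmeans.fit(X, sample_weight=item_weights)

# Anchor = real item nearest to each centroid; weight = total item weight in the cluster
anchors = pairwise_distances(kmeans.cluster_centers_, X).argmin(axis=1)
anchor_w = np.array([item_weights[kmeans.labels_ == c].sum() for c in range(n_anchors)])

# Weighted sum over anchors approximates the mean over all items
estimate = (Y_toy[:, anchors] * anchor_w).sum(axis=1)
truth = Y_toy.mean(axis=1)
print(f"mean abs. error: {np.abs(estimate - truth).mean():.3f}")
```

Because the anchor weights sum to one, the estimate is a convex combination of the anchor scores and always stays between 0 and 1.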

Saving

In [14]:
anchor = {'anchor_points':anchor_points,
          'anchor_weights':anchor_weights}

with open('data/anchor.pickle', 'wb') as handle:
    pickle.dump(anchor, handle, protocol=pickle.HIGHEST_PROTOCOL)

Checking results

In [15]:
anchor_points['mmlu']

array([ 1147,   666,  8256,   504, 13241,  1503, 13572,  1894,  9286,
         265,  1961,  7870,  9004,  8691,  3683,  8468,  3141,  8504,
        4458,   637,   557,  8168, 12351,  2594,  4963,  7707,  6886,
        6546,  7707,  1109,  2659, 11216, 13545,  7286,  9531,    68,
        8214,  3575,  1868,  7870,  7342,  3629, 13631,  2498,  5221,
        7024,   923,  2651,  5962,  1488,   668,  6412, 14026,  2870,
        5366,   418,  2316,  7225, 13296,  7707, 13779,  3219, 14017,
       12968,   959,  5961, 10509,  8168,    55,  8012,  1959, 10379,
        1025, 13978, 10813,  1349, 11209,  2153,   858,  6887,  3254,
        2313,  1980,   265,    25,  6167,  7342,  1124,  4784,  8894,
        4368,  6539,  4880,  4151, 11553,  7304, 11931,     0,  1018,
        1786], dtype=int64)
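
The array above contains repeated indices (for example, 7707 appears three times and 265 twice). One plausible cause is that the nearest item is searched among all items rather than only among the members of each cluster, so two centroids can share the same nearest item. The sketch below shows one way to detect repeats and, as an alternative, to restrict the search to cluster members. Both `find_repeated` and `nearest_member_per_cluster` are hypothetical helpers, not functions from this notebook.

```python
import numpy as np

def find_repeated(indices):
    """Return the values that occur more than once, with their counts."""
    values, counts = np.unique(indices, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts) if c > 1}

def nearest_member_per_cluster(X, labels, centers):
    """For each cluster, pick the closest item among that cluster's own members."""
    picks = []
    for c, center in enumerate(centers):
        members = np.where(labels == c)[0]
        if len(members) == 0:      # empty cluster: nothing to pick
            continue
        dist = np.linalg.norm(X[members] - center, axis=1)
        picks.append(members[dist.argmin()])
    return np.array(picks)

# Toy check with hand-made clusters
X = np.array([[0.0], [0.1], [1.0], [1.1]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.02], [1.08]])
picks = nearest_member_per_cluster(X, labels, centers)
repeats = find_repeated([7707, 265, 7707, 3, 265, 7707])
print(picks, repeats)
```

Since clusters partition the items, picks restricted to cluster members cannot collide. Whether that changes the estimation error on the real data would have to be measured.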

In [16]:
anchor_weights['mmlu']

array([0.0042764 , 0.02905955, 0.06135599, 0.00945872, 0.00536487,
       0.03758534, 0.00856022, 0.00549396, 0.00403803, 0.00997437,
       0.00730435, 0.00539472, 0.00272371, 0.00394002, 0.01031464,
       0.00679635, 0.00634383, 0.00358692, 0.00296083, 0.01802555,
       0.00594566, 0.04414731, 0.01086474, 0.00268234, 0.01827593,
       0.00933639, 0.00135921, 0.01064351, 0.00888152, 0.00501712,
       0.00101003, 0.0096813 , 0.00513194, 0.00606622, 0.00389487,
       0.01298388, 0.00380664, 0.009867  , 0.00183478, 0.01108704,
       0.00823578, 0.00545061, 0.00292699, 0.00850653, 0.0062363 ,
       0.00187063, 0.00476257, 0.00200589, 0.01161997, 0.00428859,
       0.01155459, 0.01342028, 0.00850961, 0.00618165, 0.00663006,
       0.06445207, 0.00599398, 0.01817672, 0.00697267, 0.01280029,
       0.01402079, 0.00628686, 0.02230109, 0.00489392, 0.00147227,
       0.01523384, 0.01902777, 0.01244332, 0.00585419, 0.01393512,
       0.0107454 , 0.00489883, 0.00580522, 0.00561106, 0.01013

Using anchor points to estimate performance on the test set and reporting the average prediction error:

In [17]:
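# Y has shape (number of models, number of items); every entry is 0 or 1
for scenario in scenarios.keys():
    # Y_test has shape (100, number of items) with 0/1 entries.
    # [:,scenarios_position[scenario]] first keeps the scenario's items for the 100 test models, e.g. (100, 14042) for MMLU.
    # [:,anchor_points[scenario]] then keeps only the anchor items among them, giving (100, 100).
    Y_anchor = Y_test[:,scenarios_position[scenario]][:,anchor_points[scenario]]
    # `Y_hat`: each model's estimated score, i.e., its correctness on the 100 anchor items weighted by the anchor weights and summed.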
    Y_hat = (Y_anchor*anchor_weights[scenario]).sum(axis=1)
    Y_true = (balance_weights*Y_test)[:,scenarios_position[scenario]].mean(axis=1)

    print(f"scenario: {scenario}, avg. error: {np.abs(Y_hat-Y_true).mean():.3f}")

scenario: harness_truthfulqa_mc_0, avg. error: 0.014
scenario: gsm8k, avg. error: 0.037
scenario: winogrande, avg. error: 0.021
scenario: arc, avg. error: 0.026
scenario: hellaswag, avg. error: 0.013
scenario: mmlu, avg. error: 0.016


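## Generating two similar subsets of the Chinese MMLU (300 items each)

- Only item-level performance signals from the English side are used (here, the mean accuracy of the models on each English item serves as a difficulty proxy).
- Excluded subjects: `high_school_us_history`, `security_studies`, `high_school_government_and_politics`, `jurisprudence`, `business_ethics`, `us_foreign_policy`, `global_facts`, `moral_scenarios`, `professional_law`, `moral_disputes`.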


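## Three methods for building 300-item MMLU subsets

This section implements three different methods for creating 300-item subsets of MMLU and evaluates how they perform:

1. **Method 1: difficulty quantile buckets (baseline)** - bucket the items by difficulty quantile and sample from each bucket
2. **Method 2: IRT clustering** - run K-Means on the IRT parameters (difficulty and discrimination)
3. **Method 3: response-matrix clustering** - run K-Means on the matrix of model responses

Each method produces two subsets with a target size of 300 items each, with no overlap between them and with feature distributions that should match closely.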


In [18]:
# Set up the required variables and parameters
np.random.seed(random_state)

# Subset size
SUBSET_SIZE = 300

# MMLU data
Y_mmlu = Y[:, scenarios_position['mmlu']]  # shape: (395 models, 14042 MMLU items)
item_acc = Y_mmlu.mean(axis=0)  # mean accuracy per item (used as a difficulty proxy)

# Read the Chinese item bank to map item indices to subjects
df_cn = pd.read_csv('mmlu_ZH-CN.csv')
assert df_cn.shape[0] == item_acc.shape[0], 'Row count of the Chinese item bank does not match the number of English MMLU items'

# Exclude the listed subjects
excluded_subjects = {
    'high_school_us_history', 'security_studies', 'high_school_government_and_politics',
    'jurisprudence', 'business_ethics', 'us_foreign_policy', 'global_facts', 'moral_scenarios',
    'professional_law', 'moral_disputes'
}
mask_keep = ~df_cn['Subject'].isin(excluded_subjects)
valid_indices = df_cn[mask_keep].index.values  # positions of the valid items within the MMLU block

print(f"Total MMLU items: {len(item_acc)}")
print(f"Items after excluding the listed subjects: {len(valid_indices)}")
print(f"Target subset size: {SUBSET_SIZE}")

# For convenience, build views restricted to the valid items
Y_mmlu_valid = Y_mmlu[:, valid_indices]  # response matrix with valid items only
item_acc_valid = item_acc[valid_indices]  # difficulty proxy for valid items only

print(f"Valid MMLU data shape: {Y_mmlu_valid.shape}")


Total MMLU items: 14042
Items after excluding the listed subjects: 10217
Target subset size: 300
Valid MMLU data shape: (395, 10217)


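### Method 1: difficulty quantile buckets (baseline)

Use each item's mean accuracy as its difficulty proxy, split the items into quantile buckets, and sample from each bucket in proportion to its size to build two non-overlapping subsets.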


In [20]:
def create_difficulty_bucket_subsets(item_difficulties, valid_indices, subset_size, num_buckets=5, random_state=42):
    """
    Create two non-overlapping subsets by sampling from difficulty quantile buckets.

    Parameters:
    - item_difficulties: per-item difficulty proxy (mean accuracy; lower means harder)
    - valid_indices: positions of the valid items in the original data
    - subset_size: size of each subset
    - num_buckets: number of quantile buckets
    - random_state: random seed

    Returns:
    - subset1_original, subset2_original: the two subsets as indices into the original data
    - bucket_labels: bucket id of every valid item
    """
    rng = np.random.RandomState(random_state)

    # Split the difficulties into quantile buckets
    bucket_labels = pd.qcut(item_difficulties, q=num_buckets, labels=False, duplicates='drop')

    # Fall back to rank-based buckets if qcut leaves any item unlabeled
    if pd.isna(bucket_labels).any():
        ranks = pd.Series(item_difficulties).rank(method='average') / len(item_difficulties)
        bucket_labels = np.minimum((ranks * num_buckets).astype(int), num_buckets - 1)

    # Number of items to draw from each bucket (proportional to bucket size)
    bucket_counts = pd.Series(bucket_labels).value_counts().sort_index()
    bucket_props = bucket_counts / bucket_counts.sum()
    target_counts_float = bucket_props * subset_size
    target_counts = np.floor(target_counts_float).astype(int)

    # Give the leftover slots to the buckets with the largest fractional remainders
    remainder = subset_size - target_counts.sum()
    if remainder > 0:
        remainder_probs = target_counts_float - target_counts
        remainder_buckets = remainder_probs.nlargest(remainder).index
        target_counts[remainder_buckets] += 1

    subset1_indices = []
    subset2_indices = []

    # Sample within each bucket
    for bucket_id in sorted(bucket_counts.index):
        bucket_mask = bucket_labels == bucket_id
        bucket_items = np.where(bucket_mask)[0]
        target_count = target_counts[bucket_id]

        if target_count > 0 and len(bucket_items) >= target_count * 2:
            # Draw enough items for both subsets
            selected = rng.choice(bucket_items, size=target_count * 2, replace=False)
            # Split them at random between the two subsets
            rng.shuffle(selected)
            subset1_indices.extend(selected[:target_count])
            subset2_indices.extend(selected[target_count:target_count * 2])
        elif target_count > 0:
            # Not enough items in the bucket: split it as evenly as possible
            rng.shuffle(bucket_items)
            mid = len(bucket_items) // 2
            subset1_indices.extend(bucket_items[:mid])
            subset2_indices.extend(bucket_items[mid:])

    # If a subset is still short, fill it at random from the unused items
    all_selected = set(subset1_indices + subset2_indices)
    remaining_items = [i for i in range(len(item_difficulties)) if i not in all_selected]

    while len(subset1_indices) < subset_size and remaining_items:
        idx = rng.choice(len(remaining_items))
        subset1_indices.append(remaining_items.pop(idx))

    while len(subset2_indices) < subset_size and remaining_items:
        idx = rng.choice(len(remaining_items))
        subset2_indices.append(remaining_items.pop(idx))

    # Map back to indices in the original data
    subset1_original = valid_indices[subset1_indices[:subset_size]]
    subset2_original = valid_indices[subset2_indices[:subset_size]]

    return subset1_original, subset2_original, bucket_labels

# Apply the difficulty quantile bucket method
subset1_bucket, subset2_bucket, difficulty_buckets = create_difficulty_bucket_subsets(
    item_acc_valid, valid_indices, SUBSET_SIZE, random_state=random_state
)

print("Method 1: difficulty quantile buckets - done")
print(f"subset1_bucket size: {len(subset1_bucket)}")
print(f"subset2_bucket size: {len(subset2_bucket)}")
print(f"Overlapping items: {len(set(subset1_bucket) & set(subset2_bucket))}")

# Check the difficulty distributions
subset1_difficulties = item_acc[subset1_bucket]
subset2_difficulties = item_acc[subset2_bucket]
print(f"subset1_bucket difficulty range: {subset1_difficulties.min():.3f} - {subset1_difficulties.max():.3f}")
print(f"subset2_bucket difficulty range: {subset2_difficulties.min():.3f} - {subset2_difficulties.max():.3f}")
print(f"subset1_bucket mean difficulty: {subset1_difficulties.mean():.3f}")
print(f"subset2_bucket mean difficulty: {subset2_difficulties.mean():.3f}")

# Check that method 1 excluded the listed subjects (valid_indices should already guarantee this)
bucket_subjects1 = df_cn.loc[subset1_bucket, 'Subject'].values
bucket_subjects2 = df_cn.loc[subset2_bucket, 'Subject'].values
excluded_in_bucket1 = [s for s in bucket_subjects1 if s in excluded_subjects]
excluded_in_bucket2 = [s for s in bucket_subjects2 if s in excluded_subjects]
print(f"Excluded subjects present in method 1: subset1_bucket: {excluded_in_bucket1}, subset2_bucket: {excluded_in_bucket2}")


Method 1: difficulty quantile buckets - done
subset1_bucket size: 300
subset2_bucket size: 300
Overlapping items: 0
subset1_bucket difficulty range: 0.005 - 1.000
subset2_bucket difficulty range: 0.010 - 0.997
subset1_bucket mean difficulty: 0.600
subset2_bucket mean difficulty: 0.594
Excluded subjects present in method 1: subset1_bucket: [], subset2_bucket: []
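
The quota step of the bucket method is easier to follow in isolation. Here is a minimal sketch on made-up accuracies: the data, `subset_size = 37`, and the variable names are illustrative assumptions. It shows equal-frequency bucketing with `pd.qcut` and the largest-remainder rule that hands leftover slots to the buckets with the biggest fractional parts.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
acc = rng.uniform(size=1000)          # hypothetical per-item mean accuracies
subset_size, num_buckets = 37, 5

# Equal-frequency buckets: label 0 holds the lowest accuracies (hardest items)
labels = pd.qcut(acc, q=num_buckets, labels=False, duplicates='drop')
counts = pd.Series(labels).value_counts().sort_index()

# Proportional quota per bucket, rounded down
quota_float = counts / counts.sum() * subset_size
quota = np.floor(quota_float).astype(int)

# Hand the leftover slots to the buckets with the largest fractional parts
leftover = int(subset_size - quota.sum())
if leftover > 0:
    winners = (quota_float - quota).nlargest(leftover).index
    quota.loc[winners] += 1

print(quota.to_dict())
```

Rounding down first and then distributing the remainder guarantees that the quotas add up to exactly `subset_size`, which plain rounding does not.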


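### Method 2: IRT clustering

Run K-Means on the items' IRT features (difficulty and discrimination) to find 300 cluster centroids, then pick the 2 items closest to each centroid and assign them at random to the two subsets.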


In [21]:
def create_irt_clustering_subsets(valid_indices, subset_size, random_state=42):
    """
    Create two non-overlapping subsets by clustering the IRT item parameters.

    Parameters:
    - valid_indices: positions of the valid items in the original data
    - subset_size: size of each subset
    - random_state: random seed

    Returns:
    - subset1_original, subset2_original: the two subsets as indices into the original data
    - cluster_labels, cluster_centers: K-Means labels of the valid items and the centroids
    """
    rng = np.random.RandomState(random_state)

    # Load the IRT parameters
    A, B, _ = load_irt_parameters('data/irt_model/')

    # Build the IRT feature matrix: one row per item; the columns are the discrimination
    # parameters followed by the difficulty
    irt_features = np.vstack((A.squeeze(), B.squeeze())).T

    # Keep the IRT features of the valid items only
    irt_features_valid = irt_features[valid_indices]

    print(f"IRT feature matrix shape: {irt_features_valid.shape}")
    # Columns 0 and 1 are reported as discrimination and difficulty, which matches a
    # two-column matrix; when the matrix has more columns, the difficulty is the last one
    print(f"Discrimination range: {irt_features_valid[:, 0].min():.3f} - {irt_features_valid[:, 0].max():.3f}")
    print(f"Difficulty range: {irt_features_valid[:, 1].min():.3f} - {irt_features_valid[:, 1].max():.3f}")

    # Run K-Means with one cluster per target item (300 clusters)
    kmeans = KMeans(n_clusters=subset_size, n_init=10, random_state=random_state)
    cluster_labels = kmeans.fit_predict(irt_features_valid)
    cluster_centers = kmeans.cluster_centers_

    print(f"Clustering done: {len(np.unique(cluster_labels))} clusters")

    # For each cluster, find the 2 items closest to the centroid (skipping excluded subjects)
    subset1_indices = []
    subset2_indices = []

    # Subject of every valid item
    subjects_valid = df_cn.loc[valid_indices, 'Subject'].values

    for cluster_id in range(subset_size):
        # All items that belong to the current cluster
        cluster_mask = cluster_labels == cluster_id
        cluster_items = np.where(cluster_mask)[0]

        if len(cluster_items) == 0:
            continue

        # Distance from every item in the cluster to the centroid
        cluster_features = irt_features_valid[cluster_items]
        center = cluster_centers[cluster_id]
        distances = np.linalg.norm(cluster_features - center, axis=1)

        # Walk the items by increasing distance and keep the 2 nearest that are not in an excluded subject
        sorted_indices = np.argsort(distances)
        valid_items = []

        for idx in sorted_indices:
            item_idx = cluster_items[idx]
            item_subject = subjects_valid[item_idx]

            # Skip items whose subject is on the exclusion list
            if item_subject not in excluded_subjects:
                valid_items.append(item_idx)

            # Stop once 2 valid items are found
            if len(valid_items) >= 2:
                break

        # Assign the selected items
        if len(valid_items) >= 2:
            # Split them at random between the two subsets
            rng.shuffle(valid_items)
            subset1_indices.append(valid_items[0])
            subset2_indices.append(valid_items[1])
        elif len(valid_items) == 1:
            # Only one valid item: assign it to a random subset
            if rng.random() < 0.5:
                subset1_indices.append(valid_items[0])
            else:
                subset2_indices.append(valid_items[0])
        # No valid item: skip this cluster

    # If one subset is short, move randomly chosen items over from the other subset
    while len(subset1_indices) < subset_size and len(subset2_indices) > 0:
        idx = rng.choice(len(subset2_indices))
        subset1_indices.append(subset2_indices.pop(idx))

    while len(subset2_indices) < subset_size and len(subset1_indices) > subset_size:
        idx = rng.choice(len(subset1_indices[subset_size:]))
        subset2_indices.append(subset1_indices.pop(subset_size + idx))

    # Map back to indices in the original data
    subset1_original = valid_indices[subset1_indices[:subset_size]]
    subset2_original = valid_indices[subset2_indices[:subset_size]]

    return subset1_original, subset2_original, cluster_labels, cluster_centers

# Apply the IRT clustering method
subset1_irt, subset2_irt, irt_clusters, irt_centers = create_irt_clustering_subsets(
    valid_indices, SUBSET_SIZE, random_state=random_state
)

print("Method 2: IRT clustering - done")
print(f"subset1_irt size: {len(subset1_irt)}")
print(f"subset2_irt size: {len(subset2_irt)}")
print(f"Overlapping items: {len(set(subset1_irt) & set(subset2_irt))}")

# Check the distribution of the IRT features
# Reload the IRT parameters for the check
A, B, _ = load_irt_parameters('data/irt_model/')
irt_features_all = np.vstack((A.squeeze(), B.squeeze())).T
subset1_irt_features = irt_features_all[subset1_irt]
subset2_irt_features = irt_features_all[subset2_irt]

# As above, columns 0 and 1 are the first two columns of the feature matrix
print(f"subset1_irt mean discrimination: {subset1_irt_features[:, 0].mean():.3f}")
print(f"subset2_irt mean discrimination: {subset2_irt_features[:, 0].mean():.3f}")
print(f"subset1_irt mean difficulty: {subset1_irt_features[:, 1].mean():.3f}")
print(f"subset2_irt mean difficulty: {subset2_irt_features[:, 1].mean():.3f}")

# Check that method 2 excluded the listed subjects
irt_subjects1 = df_cn.loc[subset1_irt, 'Subject'].values
irt_subjects2 = df_cn.loc[subset2_irt, 'Subject'].values
excluded_in_irt1 = [s for s in irt_subjects1 if s in excluded_subjects]
excluded_in_irt2 = [s for s in irt_subjects2 if s in excluded_subjects]
print(f"Excluded subjects present in method 2: subset1_irt: {excluded_in_irt1}, subset2_irt: {excluded_in_irt2}")


IRT feature matrix shape: (10217, 11)
Discrimination range: -5.049 - 7.334
Difficulty range: -2.978 - 2.676
Clustering done: 300 clusters
Method 2: IRT clustering - done
subset1_irt size: 300
subset2_irt size: 300
Overlapping items: 0
subset1_irt mean discrimination: 1.577
subset2_irt mean discrimination: 1.585
subset1_irt mean difficulty: -0.141
subset2_irt mean difficulty: -0.174
Excluded subjects present in method 2: subset1_irt: [], subset2_irt: []
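
The feature matrix printed above has 11 columns, which suggests a multidimensional IRT model rather than one discrimination and one difficulty value per item. The sketch below illustrates where each parameter ends up after stacking. It assumes that `A` has shape `(1, D, n_items)` and `B` has shape `(1, 1, n_items)`; these shapes are an assumption about what `load_irt_parameters` returns, not something the notebook confirms, and the arrays here are random stand-ins.

```python
import numpy as np

rng = np.random.RandomState(0)
D, n_items = 10, 6                       # hypothetical sizes
A = rng.normal(size=(1, D, n_items))     # assumed shape of the discrimination array
B = rng.normal(size=(1, 1, n_items))     # assumed shape of the difficulty array

# Same stacking as in the cell above: rows are items after the transpose
features = np.vstack((A.squeeze(), B.squeeze())).T   # shape (n_items, D + 1)

discrimination = features[:, :D]   # first D columns
difficulty = features[:, -1]       # last column

print(features.shape)
```

Under these assumptions, column 1 is the second discrimination dimension, so summaries of the difficulty would read the last column instead.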


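### Method 3: response-matrix clustering

Run K-Means on the 0/1 matrix of all training models' responses to all items to find 300 cluster centroids, then pick the 2 items closest to each centroid and assign them at random to the two subsets.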


In [22]:
def create_response_matrix_clustering_subsets(Y_train, valid_indices, subset_size, random_state=42):
    """
    Create two non-overlapping subsets by clustering the item response matrix.

    Parameters:
    - Y_train: response matrix of the training models (number of models, number of items)
    - valid_indices: positions of the valid items in the original data
    - subset_size: size of each subset
    - random_state: random seed

    Returns:
    - subset1_original, subset2_original: the two subsets as indices into the original data
    - cluster_labels, cluster_centers: K-Means labels of the valid items and the centroids
    """
    rng = np.random.RandomState(random_state)

    # MMLU columns of the training data
    Y_mmlu_train = Y_train[:, scenarios_position['mmlu']]  # shape: (295 training models, 14042 MMLU items)

    # Keep the valid items only
    Y_mmlu_train_valid = Y_mmlu_train[:, valid_indices]  # shape: (295, number of valid items)

    # Transpose so that each row is one item and each column is one model's response
    response_matrix = Y_mmlu_train_valid.T  # shape: (number of valid items, 295 models)

    print(f"Response matrix shape: {response_matrix.shape}")
    print(f"Mean accuracy: {response_matrix.mean():.3f}")

    # Run K-Means with one cluster per target item (300 clusters)
    kmeans = KMeans(n_clusters=subset_size, n_init=10, random_state=random_state)
    cluster_labels = kmeans.fit_predict(response_matrix)
    cluster_centers = kmeans.cluster_centers_

    print(f"Clustering done: {len(np.unique(cluster_labels))} clusters")

    # For each cluster, find the 2 items closest to the centroid (skipping excluded subjects)
    subset1_indices = []
    subset2_indices = []

    # Subject of every valid item
    subjects_valid = df_cn.loc[valid_indices, 'Subject'].values

    for cluster_id in range(subset_size):
        # All items that belong to the current cluster
        cluster_mask = cluster_labels == cluster_id
        cluster_items = np.where(cluster_mask)[0]

        if len(cluster_items) == 0:
            continue

        # Distance from every item in the cluster to the centroid
        cluster_responses = response_matrix[cluster_items]
        center = cluster_centers[cluster_id]
        distances = np.linalg.norm(cluster_responses - center, axis=1)

        # Walk the items by increasing distance and keep the 2 nearest that are not in an excluded subject
        sorted_indices = np.argsort(distances)
        valid_items = []

        for idx in sorted_indices:
            item_idx = cluster_items[idx]
            item_subject = subjects_valid[item_idx]

            # Skip items whose subject is on the exclusion list
            if item_subject not in excluded_subjects:
                valid_items.append(item_idx)

            # Stop once 2 valid items are found
            if len(valid_items) >= 2:
                break

        # Assign the selected items
        if len(valid_items) >= 2:
            # Split them at random between the two subsets
            rng.shuffle(valid_items)
            subset1_indices.append(valid_items[0])
            subset2_indices.append(valid_items[1])
        elif len(valid_items) == 1:
            # Only one valid item: assign it to a random subset
            if rng.random() < 0.5:
                subset1_indices.append(valid_items[0])
            else:
                subset2_indices.append(valid_items[0])
        # No valid item: skip this cluster

    # If one subset is short, move randomly chosen items over from the other subset
    # (this can leave subset 2 below subset_size when clusters contain a single item)
    while len(subset1_indices) < subset_size and len(subset2_indices) > 0:
        idx = rng.choice(len(subset2_indices))
        subset1_indices.append(subset2_indices.pop(idx))

    while len(subset2_indices) < subset_size and len(subset1_indices) > subset_size:
        idx = rng.choice(len(subset1_indices[subset_size:]))
        subset2_indices.append(subset1_indices.pop(subset_size + idx))

    # Map back to indices in the original data
    subset1_original = valid_indices[subset1_indices[:subset_size]]
    subset2_original = valid_indices[subset2_indices[:subset_size]]

    return subset1_original, subset2_original, cluster_labels, cluster_centers

# Apply the response-matrix clustering method
subset1_matrix, subset2_matrix, matrix_clusters, matrix_centers = create_response_matrix_clustering_subsets(
    Y_train, valid_indices, SUBSET_SIZE, random_state=random_state
)

print("Method 3: response-matrix clustering - done")
print(f"subset1_matrix size: {len(subset1_matrix)}")
print(f"subset2_matrix size: {len(subset2_matrix)}")
print(f"Overlapping items: {len(set(subset1_matrix) & set(subset2_matrix))}")

# Check the distribution of response patterns
Y_mmlu_train = Y_train[:, scenarios_position['mmlu']]
subset1_matrix_responses = Y_mmlu_train[:, subset1_matrix]
subset2_matrix_responses = Y_mmlu_train[:, subset2_matrix]

print(f"subset1_matrix mean accuracy: {subset1_matrix_responses.mean():.3f}")
print(f"subset2_matrix mean accuracy: {subset2_matrix_responses.mean():.3f}")
print(f"subset1_matrix difficulty std: {subset1_matrix_responses.mean(axis=0).std():.3f}")
print(f"subset2_matrix difficulty std: {subset2_matrix_responses.mean(axis=0).std():.3f}")


Response matrix shape: (10217, 295)
Mean accuracy: 0.575
Clustering done: 300 clusters
Method 3: response-matrix clustering - done
subset1_matrix size: 300
subset2_matrix size: 292
Overlapping items: 0
subset1_matrix mean accuracy: 0.517
subset2_matrix mean accuracy: 0.510
subset1_matrix difficulty std: 0.242
subset2_matrix difficulty std: 0.242
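
In the output above, the second subset ends up with 292 items instead of 300. That is consistent with the rebalancing loop, which fills the first subset by taking items away from the second. One possible alternative, sketched here, fills each shortfall from items that neither subset uses. `top_up` is a hypothetical helper, and drawing the extra items at random is an assumption made for simplicity that ignores the cluster structure.

```python
import numpy as np

def top_up(subset, other, n_items, size, rng):
    """Extend `subset` to `size` with random items used by neither subset."""
    used = set(subset) | set(other)
    pool = np.array([i for i in range(n_items) if i not in used], dtype=int)
    missing = min(size - len(subset), len(pool))
    if missing <= 0:
        return list(subset)
    extra = rng.choice(pool, size=missing, replace=False)
    return list(subset) + extra.tolist()

rng = np.random.RandomState(42)
s1 = [0, 1, 2]   # toy subsets over 20 items
s2 = [3, 4]
s1 = top_up(s1, s2, n_items=20, size=5, rng=rng)
s2 = top_up(s2, s1, n_items=20, size=5, rng=rng)
print(len(s1), len(s2), set(s1) & set(s2))
```

Topping up the second subset only after the first has been extended keeps the two disjoint, because the pool is rebuilt from the current contents of both.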


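## Overall evaluation and comparison

We now evaluate the 6 generated subsets:
1. Compute the models' accuracy on each subset
2. Compare it with the accuracy on the full MMLU dataset
3. Compute the accuracy difference between the two subsets of each method
4. Assess which method is the most stable and representative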


In [26]:
# Collect all subsets
all_subsets = {
    'subset1_bucket': subset1_bucket,
    'subset2_bucket': subset2_bucket,
    'subset1_irt': subset1_irt,
    'subset2_irt': subset2_irt,
    'subset1_matrix': subset1_matrix,
    'subset2_matrix': subset2_matrix
}

# Baseline accuracy on the full MMLU (after excluding the listed subjects).
# valid_indices and the subset indices are positions within the MMLU block (Y_mmlu), so
# indexing Y directly selects MMLU items only if the MMLU columns start at position 0 of Y;
# otherwise Y_mmlu[:, ...] is the matching form.
Y_mmlu_valid_all = Y[:, valid_indices]  # responses of all models on the valid items
full_mmlu_acc = Y_mmlu_valid_all.mean(axis=1)  # accuracy of each model on the full valid MMLU

print("=== Full MMLU baseline ===")
print(f"Full MMLU mean accuracy: {full_mmlu_acc.mean():.4f}")
print(f"Full MMLU accuracy std: {full_mmlu_acc.std():.4f}")
print(f"Full MMLU accuracy range: {full_mmlu_acc.min():.4f} - {full_mmlu_acc.max():.4f}")

# Accuracy on each subset
subset_accuracies = {}
print("\n=== Per-subset performance ===")

for subset_name, subset_indices in all_subsets.items():
    # Responses of all models on this subset
    subset_responses = Y[:, subset_indices]
    # Accuracy of each model on this subset
    subset_acc = subset_responses.mean(axis=1)
    subset_accuracies[subset_name] = subset_acc

    print(f"\n{subset_name}:")
    print(f"  Mean accuracy: {subset_acc.mean():.4f}")
    print(f"  Std: {subset_acc.std():.4f}")
    print(f"  Difference from full MMLU: {abs(subset_acc.mean() - full_mmlu_acc.mean()):.4f}")
    print(f"  Correlation: {np.corrcoef(subset_acc, full_mmlu_acc)[0,1]:.4f}")

# Consistency within each method
print("\n=== Within-method consistency ===")

methods = ['bucket', 'irt', 'matrix']
method_consistency = {}

for method in methods:
    subset1_acc = subset_accuracies[f'subset1_{method}']
    subset2_acc = subset_accuracies[f'subset2_{method}']

    # Differences between the two subsets
    diff_mean = abs(subset1_acc.mean() - subset2_acc.mean())
    diff_std = abs(subset1_acc.std() - subset2_acc.std())
    correlation = np.corrcoef(subset1_acc, subset2_acc)[0,1]

    # Per-model accuracy difference between the two subsets
    model_diffs = abs(subset1_acc - subset2_acc)
    avg_model_diff = model_diffs.mean()
    max_model_diff = model_diffs.max()

    method_consistency[method] = {
        'mean_diff': diff_mean,
        'std_diff': diff_std,
        'correlation': correlation,
        'avg_model_diff': avg_model_diff,
        'max_model_diff': max_model_diff
    }

    print(f"\n{method.upper()} consistency:")
    print(f"  Mean accuracy difference between subsets: {diff_mean:.5f}")
    print(f"  Std difference between subsets: {diff_std:.5f}")
    print(f"  Correlation between subsets: {correlation:.4f}")
    print(f"  Mean per-model difference: {avg_model_diff:.5f}")
    print(f"  Max per-model difference: {max_model_diff:.5f}")

# Representativeness of each method with respect to the full MMLU
print("\n=== Representativeness ===")

method_representativeness = {}

for method in methods:
    subset1_acc = subset_accuracies[f'subset1_{method}']
    subset2_acc = subset_accuracies[f'subset2_{method}']

    # Correlation of each subset with the full MMLU
    corr1 = np.corrcoef(subset1_acc, full_mmlu_acc)[0,1]
    corr2 = np.corrcoef(subset2_acc, full_mmlu_acc)[0,1]
    avg_corr = (corr1 + corr2) / 2

    # Mean accuracy difference of each subset from the full MMLU
    diff1 = abs(subset1_acc.mean() - full_mmlu_acc.mean())
    diff2 = abs(subset2_acc.mean() - full_mmlu_acc.mean())
    avg_diff = (diff1 + diff2) / 2

    # Difference in standard deviation
    std_diff1 = abs(subset1_acc.std() - full_mmlu_acc.std())
    std_diff2 = abs(subset2_acc.std() - full_mmlu_acc.std())
    avg_std_diff = (std_diff1 + std_diff2) / 2

    method_representativeness[method] = {
        'avg_correlation': avg_corr,
        'avg_mean_diff': avg_diff,
        'avg_std_diff': avg_std_diff
    }

    print(f"\n{method.upper()} representativeness:")
    print(f"  Average correlation: {avg_corr:.4f}")
    print(f"  Average accuracy difference: {avg_diff:.5f}")
    print(f"  Average std difference: {avg_std_diff:.5f}")

print("\n=== Saving data ===")

# Keep the results for later analysis
evaluation_results = {
    'full_mmlu_accuracy': {
        'mean': float(full_mmlu_acc.mean()),
        'std': float(full_mmlu_acc.std()),
        'min': float(full_mmlu_acc.min()),
        'max': float(full_mmlu_acc.max())
    },
    'subset_accuracies': {name: {'mean': float(acc.mean()), 'std': float(acc.std())}
                         for name, acc in subset_accuracies.items()},
    'method_consistency': method_consistency,
    'method_representativeness': method_representativeness
}

print("Evaluation results computed and stored in evaluation_results")


=== Full MMLU baseline ===
Full MMLU mean accuracy: 0.6987
Full MMLU accuracy std: 0.0756
Full MMLU accuracy range: 0.3654 - 0.8306

=== Per-subset performance ===

subset1_bucket:
  Mean accuracy: 0.6840
  Std: 0.0829
  Difference from full MMLU: 0.0147
  Correlation: 0.9823

subset2_bucket:
  Mean accuracy: 0.7044
  Std: 0.0809
  Difference from full MMLU: 0.0057
  Correlation: 0.9800

subset1_irt:
  Mean accuracy: 0.5859
  Std: 0.0977
  Difference from full MMLU: 0.1128
  Correlation: 0.9735

subset2_irt:
  Mean accuracy: 0.5871
  Std: 0.0931
  Difference from full MMLU: 0.1115
  Correlation: 0.9738

subset1_matrix:
  Mean accuracy: 0.6892
  Std: 0.0820
  Difference from full MMLU: 0.0095
  Correlation: 0.9842

subset2_matrix:
  Mean accuracy: 0.6481
  Std: 0.0820
  Difference from full MMLU: 0.0506
  Correlation: 0.9720

=== Within-method consistency ===

BUCKET consistency:
  Mean accuracy difference between subsets: 0.02037
  Std difference between subsets: 0.00198
  Correlation between subsets: 0.9555
  Mean per-model difference: 0.02654
  Max per-model difference: 0.09191

IRT consistency:
  Mean accuracy difference between subsets: 0.00128
  Std difference between subsets: 0.00463
  Correlation between subsets: 0.9727
  Mean per-model difference: 0.01833
  Max per-model difference: 0.08059

MATRIX consistency:
  Mean accuracy difference between subsets: 0.04108
  Std difference between subsets: 0.00002
  Correlation between subsets: 0.9534
  Mean per-model difference: 0.04256
  Max per-model difference: 0.10255

=== Representativeness ===

BUCKET representativeness:
  Average correlation: 0.9811
  Average

In [27]:
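# Composite scoring and ranking
print("=== Composite scores and method ranking ===")

# Scoring weights
weights = {
    'consistency': 0.5,        # weight of within-method consistency
    'representativeness': 0.5  # weight of representativeness
}

method_scores = {}

for method in methods:
    consistency = method_consistency[method]
    representativeness = method_representativeness[method]

    # Consistency score: the difference terms are "smaller is better",
    # so each is mapped through 1 / (1 + x) into (0, 1] before weighting
    consistency_score = (
        (1 / (1 + consistency['mean_diff'])) * 0.3 +
        (1 / (1 + consistency['avg_model_diff'])) * 0.3 +
        consistency['correlation'] * 0.4
    )

    # Representativeness score (higher is better)
    representativeness_score = (
        representativeness['avg_correlation'] * 0.5 +
        (1 / (1 + representativeness['avg_mean_diff'])) * 0.3 +
        (1 / (1 + representativeness['avg_std_diff'])) * 0.2
    )

    # Composite score
    total_score = (
        consistency_score * weights['consistency'] +
        representativeness_score * weights['representativeness']
    )

    method_scores[method] = {
        'consistency_score': consistency_score,
        'representativeness_score': representativeness_score,
        'total_score': total_score
    }

    print(f"\n{method.upper()} method scores:")
    print(f"  Consistency score: {consistency_score:.4f}")
    print(f"  Representativeness score: {representativeness_score:.4f}")
    print(f"  Total score: {total_score:.4f}")

# Rank the methods by total score
sorted_methods = sorted(method_scores.items(), key=lambda x: x[1]['total_score'], reverse=True)

print("\n=== Method ranking (best to worst) ===")
for i, (method, scores) in enumerate(sorted_methods, 1):
    print(f"{i}. {method.upper()} method - total score: {scores['total_score']:.4f}")

# Detailed look at the best method
best_method = sorted_methods[0][0]
print(f"\n=== Best method: {best_method.upper()}, detailed analysis ===")

best_consistency = method_consistency[best_method]
best_representativeness = method_representativeness[best_method]

print("Consistency metrics:")
print(f"  Mean-accuracy difference between the two subsets: {best_consistency['mean_diff']:.5f}")
print(f"  Correlation between the two subsets: {best_consistency['correlation']:.4f}")
print(f"  Mean per-model difference: {best_consistency['avg_model_diff']:.5f}")

print("\nRepresentativeness metrics:")
print(f"  Mean correlation with full MMLU: {best_representativeness['avg_correlation']:.4f}")
print(f"  Mean accuracy difference from full MMLU: {best_representativeness['avg_mean_diff']:.5f}")

# Report the size of the best method's subsets
print("\nBest-method subset indices:")
print(f"subset1_{best_method}: {len(all_subsets[f'subset1_{best_method}'])} items")
print(f"subset2_{best_method}: {len(all_subsets[f'subset2_{best_method}'])} items")

print("\n=== Conclusions and recommendation ===")
if best_method == 'bucket':
    print("The difficulty-quantile bucket method performs best. It is simple and easy to understand: it uses stratified sampling by item difficulty,")
    print("keeps the two subsets well balanced, and represents the original dataset well.")
elif best_method == 'irt':
    print("The IRT clustering method performs best. It uses the psychometric properties of the items (difficulty and discrimination),")
    print("captures item characteristics at a finer grain, and yields subsets with better measurement properties.")
elif best_method == 'matrix':
    print("The item response-matrix clustering method performs best. It clusters items on the models' actual response patterns,")
    print("identifies items with similar response patterns, and yields subsets that are more stable across different models.")

print(f"\nRecommendation: use the subsets generated by the {best_method.upper()} method to evaluate MMLU performance.")


=== Composite scores and method ranking ===

BUCKET method scores:
  Consistency score: 0.9685
  Representativeness score: 0.9863
  Total score: 0.9774

IRT method scores:
  Consistency score: 0.9833
  Representativeness score: 0.9527
  Total score: 0.9680

MATRIX method scores:
  Consistency score: 0.9573
  Representativeness score: 0.9790
  Total score: 0.9681

=== Method ranking (best to worst) ===
1. BUCKET method - total score: 0.9774
2. MATRIX method - total score: 0.9681
3. IRT method - total score: 0.9680

=== Best method: BUCKET, detailed analysis ===
Consistency metrics:
  Mean-accuracy difference between the two subsets: 0.02037
  Correlation between the two subsets: 0.9555
  Mean per-model difference: 0.02654

Representativeness metrics:
  Mean correlation with full MMLU: 0.9811
  Mean accuracy difference from full MMLU: 0.01019

Best-method subset indices:
subset1_bucket: 300 items
subset2_bucket: 300 items

=== Conclusions and recommendation ===
The difficulty-quantile bucket method performs best. It is simple and easy to understand: it uses stratified sampling by item difficulty,
keeps the two subsets well balanced, and represents the original dataset well.

Recommendation: use the subsets generated by the BUCKET method to evaluate MMLU performance.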
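
As a sanity check on the composite score, the BUCKET figures can be recomputed by hand from metrics printed in this notebook. This is only a minimal sketch: the helper name `composite_score` is hypothetical, the inputs are the rounded values from the outputs, and `avg_std_diff` is rederived from the rounded standard deviations (0.0829 and 0.0809 for the two bucket subsets, 0.0756 for the full MMLU), so the result should agree with the printed scores only up to rounding.

```python
def composite_score(consistency, representativeness, weights):
    """Hypothetical helper that mirrors the scoring formula of the cell above."""
    consistency_score = (
        0.3 / (1 + consistency['mean_diff'])
        + 0.3 / (1 + consistency['avg_model_diff'])
        + 0.4 * consistency['correlation']
    )
    representativeness_score = (
        0.5 * representativeness['avg_correlation']
        + 0.3 / (1 + representativeness['avg_mean_diff'])
        + 0.2 / (1 + representativeness['avg_std_diff'])
    )
    total = (weights['consistency'] * consistency_score
             + weights['representativeness'] * representativeness_score)
    return consistency_score, representativeness_score, total


# BUCKET metrics copied from the printed outputs (rounded values)
bucket_consistency = {'mean_diff': 0.02037, 'avg_model_diff': 0.02654, 'correlation': 0.9555}
bucket_representativeness = {
    'avg_correlation': 0.9811,
    'avg_mean_diff': 0.01019,
    # Rederived from rounded standard deviations, so only approximate
    'avg_std_diff': (abs(0.0829 - 0.0756) + abs(0.0809 - 0.0756)) / 2,
}
weights = {'consistency': 0.5, 'representativeness': 0.5}

c, r, total = composite_score(bucket_consistency, bucket_representativeness, weights)
print(f"consistency={c:.4f} representativeness={r:.4f} total={total:.4f}")
```

The three numbers should land close to the 0.9685 / 0.9863 / 0.9774 reported for BUCKET. Because 1 / (1 + x) stays near 1 for small x, the difference terms move the score very little: IRT's 0.11216 accuracy gap to the full MMLU costs only about 0.027 on the representativeness score relative to BUCKET's 0.01019. That may be why all three totals fall within 0.01 of each other, so the ranking between MATRIX and IRT in particular should be read with caution.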


In [28]:
# Build summary tables for side-by-side comparison
import pandas as pd

print("=== Final summary table ===")

# Method comparison table
comparison_data = []
for method in methods:
    consistency = method_consistency[method]
    representativeness = method_representativeness[method]
    scores = method_scores[method]

    comparison_data.append({
        'Method': method.upper(),
        'Subset mean diff': f"{consistency['mean_diff']:.5f}",
        'Subset corr': f"{consistency['correlation']:.4f}",
        'Avg model diff': f"{consistency['avg_model_diff']:.5f}",
        'Full-MMLU corr': f"{representativeness['avg_correlation']:.4f}",
        'Full-MMLU mean diff': f"{representativeness['avg_mean_diff']:.5f}",
        'Consistency score': f"{scores['consistency_score']:.4f}",
        'Representativeness score': f"{scores['representativeness_score']:.4f}",
        'Total score': f"{scores['total_score']:.4f}",
        'Rank': sorted_methods.index((method, scores)) + 1
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('Rank')

print(comparison_df.to_string(index=False))

# Per-subset accuracy comparison table
print("\n=== Subset accuracy comparison ===")
accuracy_data = []
for subset_name, subset_acc in subset_accuracies.items():
    accuracy_data.append({
        'Subset': subset_name,
        'Mean accuracy': f"{subset_acc.mean():.4f}",
        'Std': f"{subset_acc.std():.4f}",
        'Diff vs full MMLU': f"{abs(subset_acc.mean() - full_mmlu_acc.mean()):.4f}",
        'Corr with full MMLU': f"{np.corrcoef(subset_acc, full_mmlu_acc)[0,1]:.4f}"
    })

# Add the full-MMLU baseline row
accuracy_data.append({
    'Subset': 'Full_MMLU',
    'Mean accuracy': f"{full_mmlu_acc.mean():.4f}",
    'Std': f"{full_mmlu_acc.std():.4f}",
    'Diff vs full MMLU': "0.0000",
    'Corr with full MMLU': "1.0000"
})

accuracy_df = pd.DataFrame(accuracy_data)
print(accuracy_df.to_string(index=False))

# Collect the final results in a dictionary (kept in memory, not written to disk)
final_results = {
    'evaluation_date': pd.Timestamp.now().isoformat(),
    'best_method': best_method,
    'method_comparison': comparison_df.to_dict('records'),
    'accuracy_comparison': accuracy_df.to_dict('records'),
    'all_subsets': {name: indices.tolist() for name, indices in all_subsets.items()},
    'evaluation_results': evaluation_results
}

# Print a short summary of the selected subsets
print("\n=== Final selected best subsets ===")
print(f"Method: {best_method.upper()}")
print(f"subset1_{best_method}: {all_subsets[f'subset1_{best_method}'][:10].tolist()}... ({len(all_subsets[f'subset1_{best_method}'])} items in total)")
print(f"subset2_{best_method}: {all_subsets[f'subset2_{best_method}'][:10].tolist()}... ({len(all_subsets[f'subset2_{best_method}'])} items in total)")

print("\n=== Task complete ===")
print("Successfully implemented three MMLU subset-generation methods and completed the overall evaluation:")
print("1. ✅ Difficulty-quantile bucket method")
print("2. ✅ IRT clustering method")
print("3. ✅ Item response-matrix clustering method")
print("4. ✅ Overall evaluation and method comparison")
print(f"5. ✅ Recommended best method: {best_method.upper()}")

print("\nAll results are stored in the variable final_results for further analysis.")


=== Final summary table ===
Method Subset mean diff Subset corr Avg model diff Full-MMLU corr Full-MMLU mean diff Consistency score Representativeness score Total score Rank
BUCKET          0.02037      0.9555        0.02654         0.9811             0.01019            0.9685                   0.9863      0.9774    1
MATRIX          0.04108      0.9534        0.04256         0.9781             0.03004            0.9573                   0.9790      0.9681    2
   IRT          0.00128      0.9727        0.01833         0.9737             0.11216            0.9833                   0.9527      0.9680    3

=== Subset accuracy comparison ===
        Subset Mean accuracy    Std Diff vs full MMLU Corr with full MMLU
subset1_bucket        0.6840 0.0829            0.0147              0.9823
subset2_bucket        0.7044 0.0809            0.0057              0.9800
   subset1_irt        0.5859 0.0977            0.1128              0.9735
   subset2_irt        0.5871 0.0931            0.1115              0.9738
subset1_matrix        0.6892 0.0820            0.0095              0.9842
subset2_matrix        0.6481 0.0820            0.0506              0.9720
     Full_MMLU        0.6987 0.0756            0.0000              1.0000

=== Final selected best subsets ===
Method: BUCKET
subset1_bucket: [1015, 1848, 12603, 2303, 1351, 9873, 932, 1898, 1670, 12498]... (300 items in total)
subset2_bucket: [1074, 1771, 2182, 4117, 1449, 13092, 317, 4748, 1943, 3065]... (300 items in total)

=== Task complete ===
Successfully
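
The cell above keeps `final_results` only in memory. If it needs to outlive the kernel session, one option is to write it to JSON. The sketch below uses a small stand-in dictionary and a hypothetical file name (`final_results.json`). The `default=` hook reflects an assumption about what the real dictionary might contain (NumPy scalars or arrays inside `evaluation_results`), which this notebook does not confirm.

```python
import json
import os
import tempfile

import numpy as np


def to_builtin(obj):
    """Convert NumPy scalars and arrays to plain Python types for json.dump."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


# Stand-in for final_results (illustrative values only)
demo_results = {
    'best_method': 'bucket',
    'all_subsets': {'subset1_bucket': np.array([1015, 1848, 12603])},
    'evaluation_results': {'n_models': np.int64(395)},
}

# Hypothetical output location; a temporary directory keeps the sketch side-effect free
path = os.path.join(tempfile.mkdtemp(), 'final_results.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(demo_results, f, ensure_ascii=False, indent=2, default=to_builtin)

# Read the file back to confirm the round trip
with open(path, encoding='utf-8') as f:
    restored = json.load(f)
print(restored['all_subsets']['subset1_bucket'])
```

`json.dump` calls `default` only for objects it cannot serialize itself, so plain Python values pass through unchanged.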

In [30]:
# Export the 6 subsets as CSVs with the same structure as mmlu_ZH-CN.csv, sorted as required

import os

out_dir = 'mmlu_subsets_csv'
os.makedirs(out_dir, exist_ok=True)

# Output columns match mmlu_ZH-CN.csv
cols = ['ID', 'Question', 'A', 'B', 'C', 'D', 'Answer', 'Subject']

# Subset names and indices (these variables were created above)
subsets_to_export = {
    'subset1_bucket': subset1_bucket,
    'subset2_bucket': subset2_bucket,
    'subset1_irt': subset1_irt,
    'subset2_irt': subset2_irt,
    'subset1_matrix': subset1_matrix,
    'subset2_matrix': subset2_matrix,
}

def export_subset_sorted(df_source, indices, path, cols):
    df_sub = df_source.loc[indices, cols].copy()
    # Sort by Subject ascending, then by ID ascending within each Subject
    # A stable sort keeps tied rows in their original order
    df_sub = df_sub.sort_values(by=['Subject', 'ID'],
                                ascending=[True, True],
                                kind='mergesort')
    df_sub.to_csv(path, index=False, encoding='utf-8')

export_paths = {}
for name, idxs in subsets_to_export.items():
    filename = f"mmlu_ZH-CN_{name}.csv"
    path = os.path.join(out_dir, filename)
    export_subset_sorted(df_cn, idxs, path, cols)
    export_paths[name] = path

print("Export complete. File locations:")
for name, path in export_paths.items():
    print(f"- {name}: {path}")

Export complete. File locations:
- subset1_bucket: mmlu_subsets_csv\mmlu_ZH-CN_subset1_bucket.csv
- subset2_bucket: mmlu_subsets_csv\mmlu_ZH-CN_subset2_bucket.csv
- subset1_irt: mmlu_subsets_csv\mmlu_ZH-CN_subset1_irt.csv
- subset2_irt: mmlu_subsets_csv\mmlu_ZH-CN_subset2_irt.csv
- subset1_matrix: mmlu_subsets_csv\mmlu_ZH-CN_subset1_matrix.csv
- subset2_matrix: mmlu_subsets_csv\mmlu_ZH-CN_subset2_matrix.csv
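
The export relies on the per-column `ascending` flags of `sort_values` to fix the row order. The toy sketch below (made-up rows, same `Subject` and `ID` column names) shows how those flags control the order within each subject, in case a different ordering of `ID` is wanted.

```python
import pandas as pd

# Toy rows with the same Subject / ID columns as the exported CSVs (values are made up)
toy = pd.DataFrame({
    'ID': [3, 1, 2, 5, 4],
    'Subject': ['anatomy', 'anatomy', 'algebra', 'algebra', 'anatomy'],
})

# Subject ascending, ID ascending within each Subject (the flags used in export_subset_sorted)
asc = toy.sort_values(by=['Subject', 'ID'], ascending=[True, True], kind='mergesort')

# Subject ascending, ID descending within each Subject
desc = toy.sort_values(by=['Subject', 'ID'], ascending=[True, False], kind='mergesort')

print(asc['ID'].tolist())   # [2, 5, 1, 3, 4]
print(desc['ID'].tolist())  # [5, 2, 4, 3, 1]
```

According to the pandas documentation, `kind` is only applied when sorting on a single column or label, so with two sort keys it is accepted but should not change the result.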


In [None]:
# Multi-seed replication (10 random_state values): evaluate the stability and representativeness of the three methods
import io, contextlib, os, json
import numpy as np
import pandas as pd

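# Seeds to run (adjustable)
seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 42]
SUBSET_SIZE = 300

# Baseline: performance on the full MMLU (after excluding subjects)
# Y has shape (number of models, number of items)
# full_mmlu_acc: accuracy of each model on the filtered MMLU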
Y_mmlu_valid_all = Y[:, valid_indices]
full_mmlu_acc = Y_mmlu_valid_all.mean(axis=1)

def compute_method_total_score(consistency, representativeness, weights):
	# consistency: agreement between the two subsets of one method, for one seed (10 seeds give 10 values)
	#   mean_diff: |mean(a1) - mean(a2)|, where a1 and a2 are the per-model accuracies (shape (395,))
	#     on subset 1 and subset 2 for the same method and seed
	#   avg_model_diff: mean over the 395 models of |a1_i - a2_i|; per-model gaps cannot cancel out,
	#     so avg_model_diff >= mean_diff
	#   correlation: Pearson r = cov(a, b) / (sigma_a * sigma_b), with
	#     cov(a, b) = (1/n) * sum((a_i - mu_a) * (b_i - mu_b)) and
	#     sigma_a = sqrt((1/n) * sum((a_i - mu_a)^2)); sigma_a and sigma_b must be non-zero
	#
	# representativeness: how each subset relates to the filtered full set, averaged over the two subsets
	#   avg_correlation: Pearson correlation of each subset with the full set (all of shape (395,)),
	#     then the mean of the two coefficients
	#   avg_mean_diff: mean of the two absolute gaps between subset mean accuracy and full-set mean accuracy
	#     diff1 = abs(subset1_acc.mean() - full_mmlu_acc.mean())
	#     diff2 = abs(subset2_acc.mean() - full_mmlu_acc.mean())
	#     avg_mean_diff = (diff1 + diff2) / 2
	#   avg_std_diff: the same construction applied to the standard deviations
	#     std1 = abs(a1.std() - full_mmlu_acc.std())
	#     std2 = abs(a2.std() - full_mmlu_acc.std())
	#     avg_std_diff = (std1 + std2) / 2

	return (
		((1/(1+consistency['mean_diff']))*0.3 +
		 (1/(1+consistency['avg_model_diff']))*0.3 +
		 consistency['correlation']*0.4) * weights['consistency'] +
		(representativeness['avg_correlation']*0.5 +
		 (1/(1+representativeness['avg_mean_diff']))*0.3 +
		 (1/(1+representativeness['avg_std_diff']))*0.2) * weights['representativeness']
	)

weights = {'consistency': 0.5, 'representativeness': 0.5}

per_seed_rows = []
detail_per_seed = []

for seed in seeds:
	np.random.seed(seed)

	# Suppress the print output of the subset builders so the log is not flooded
	with contextlib.redirect_stdout(io.StringIO()):
		# Method 1: difficulty-quantile buckets
		subset1_bucket, subset2_bucket, _ = create_difficulty_bucket_subsets(
			item_acc_valid, valid_indices, SUBSET_SIZE, random_state=seed
		)
		# Method 2: IRT clustering
		subset1_irt, subset2_irt, _, _ = create_irt_clustering_subsets(
			valid_indices, SUBSET_SIZE, random_state=seed
		)
		# Method 3: response-matrix clustering
		subset1_matrix, subset2_matrix, _, _ = create_response_matrix_clustering_subsets(
			Y_train, valid_indices, SUBSET_SIZE, random_state=seed
		)

	all_subsets_seed = {
		'subset1_bucket': subset1_bucket, 'subset2_bucket': subset2_bucket,
		'subset1_irt': subset1_irt, 'subset2_irt': subset2_irt,
		'subset1_matrix': subset1_matrix, 'subset2_matrix': subset2_matrix
	}

	# Accuracy of every model (all 395) on each subset
	subset_acc = {name: (Y[:, idxs].mean(axis=1)) for name, idxs in all_subsets_seed.items()}

	# Within-method consistency (between the two subsets)
	method_consistency = {}
	for method in ['bucket', 'irt', 'matrix']:
		a1 = subset_acc[f'subset1_{method}']
		a2 = subset_acc[f'subset2_{method}']
		diffs = np.abs(a1 - a2)
		method_consistency[method] = {
			'mean_diff': float(abs(a1.mean() - a2.mean())),
			'std_diff': float(abs(a1.std() - a2.std())),
			'correlation': float(np.corrcoef(a1, a2)[0, 1]),
			'avg_model_diff': float(diffs.mean()),
			'max_model_diff': float(diffs.max())
		}

	# Representativeness (relative to the full MMLU)
	method_representativeness = {}
	for method in ['bucket', 'irt', 'matrix']:
		a1 = subset_acc[f'subset1_{method}']
		a2 = subset_acc[f'subset2_{method}']
		corr1 = float(np.corrcoef(a1, full_mmlu_acc)[0, 1])
		corr2 = float(np.corrcoef(a2, full_mmlu_acc)[0, 1])
		diff1 = float(abs(a1.mean() - full_mmlu_acc.mean()))
		diff2 = float(abs(a2.mean() - full_mmlu_acc.mean()))
		std1 = float(abs(a1.std() - full_mmlu_acc.std()))
		std2 = float(abs(a2.std() - full_mmlu_acc.std()))
		method_representativeness[method] = {
			'avg_correlation': (corr1 + corr2) / 2,
			'avg_mean_diff': (diff1 + diff2) / 2,
			'avg_std_diff': (std1 + std2) / 2
		}

	# Scoring and ranking
	method_scores = {}
	for method in ['bucket', 'irt', 'matrix']:
		method_scores[method] = float(compute_method_total_score(
			method_consistency[method],
			method_representativeness[method],
			weights
		))

	ranking = sorted(method_scores.items(), key=lambda x: x[1], reverse=True)
	best_method = ranking[0][0]

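	# One metrics row per seed and method (for aggregation and plotting).
	# The column names stay in Chinese because the plotting cell reads them back from the CSV:
	# 方法 = method, 总分 = total score, 与Full相关(均值) = mean correlation with the full set,
	# 与Full均值差(均值) = mean accuracy gap to the full set, 两子集相关 = correlation between the two subsets,
	# 两子集均值差 = mean-accuracy gap between the two subsets, 两子集模型差(均值) = mean per-model gap, 排名 = rank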
	for method in ['bucket', 'irt', 'matrix']:
		per_seed_rows.append({
			'seed': seed,
			'方法': method.upper(),
			'总分': method_scores[method],
			'与Full相关(均值)': method_representativeness[method]['avg_correlation'],
			'与Full均值差(均值)': method_representativeness[method]['avg_mean_diff'],
			'两子集相关': method_consistency[method]['correlation'],
			'两子集均值差': method_consistency[method]['mean_diff'],
			'两子集模型差(均值)': method_consistency[method]['avg_model_diff'],
			'排名': 1 + [m for m, _ in ranking].index(method)
		})

	# Keep the detailed structured results
	detail_per_seed.append({
		'seed': seed,
		'ranking': ranking,
		'best_method': best_method,
		'method_scores': method_scores,
		'method_consistency': method_consistency,
		'method_representativeness': method_representativeness
	})

# Aggregate statistics per method (胜出次数/10 = wins out of 10 seeds, 平均排名 = mean rank, 均值 = mean)
per_seed_df = pd.DataFrame(per_seed_rows)

agg_rows = []
for method in ['bucket', 'irt', 'matrix']:
	# dfm: data frame
	dfm = per_seed_df[per_seed_df['方法'] == method.upper()]
	wins = int((dfm['排名'] == 1).sum())
	agg_rows.append({
		'方法': method.upper(),
		'胜出次数/10': wins,
		'平均排名': float(dfm['排名'].mean()),
		'总分(均值)': float(dfm['总分'].mean()),
		'总分(Std)': float(dfm['总分'].std()),
		'与Full相关(均值)': float(dfm['与Full相关(均值)'].mean()),
		'与Full相关(Std)': float(dfm['与Full相关(均值)'].std()),
		'与Full均值差(均值)': float(dfm['与Full均值差(均值)'].mean()),
		'与Full均值差(Std)': float(dfm['与Full均值差(均值)'].std()),
		'两子集相关(均值)': float(dfm['两子集相关'].mean()),
		'两子集相关(Std)': float(dfm['两子集相关'].std()),
		'两子集均值差(均值)': float(dfm['两子集均值差'].mean()),
		'两子集均值差(Std)': float(dfm['两子集均值差'].std()),
		'两子集模型差(均值)': float(dfm['两子集模型差(均值)'].mean()),
		'两子集模型差(Std)': float(dfm['两子集模型差(均值)'].std())
	})
agg_df = pd.DataFrame(agg_rows).sort_values(['胜出次数/10', '总分(均值)'], ascending=[False, False])

# Print a compact summary
print('=== Aggregate results over 10 random_state values (per method) ===')
print(agg_df.to_string(index=False))

print('\n=== Best method for each random_state ===')
best_per_seed = pd.DataFrame([{'seed': d['seed'], 'best': d['best_method'].upper()} for d in detail_per_seed])
print(best_per_seed.value_counts('best').to_string())

# Save the results to subset_analysis_charts/
os.makedirs('subset_analysis_charts', exist_ok=True)
per_seed_df.to_csv('subset_analysis_charts/multi_seed_per_seed_metrics.csv', index=False, encoding='utf-8')
agg_df.to_csv('subset_analysis_charts/multi_seed_aggregate_metrics.csv', index=False, encoding='utf-8')

summary_payload = {
	'seeds': seeds,
	'aggregate': agg_df.to_dict('records'),
	'per_seed': per_seed_df.to_dict('records'),
	'detail_per_seed': detail_per_seed
}
with open('subset_analysis_charts/mmlu_ZH-CN_subset_summary.json', 'w', encoding='utf-8') as f:
	json.dump(summary_payload, f, ensure_ascii=False, indent=2)

print('\nFiles saved:')
print('- subset_analysis_charts/multi_seed_per_seed_metrics.csv')
print('- subset_analysis_charts/multi_seed_aggregate_metrics.csv')
print('- subset_analysis_charts/mmlu_ZH-CN_subset_summary.json')

KeyboardInterrupt: 
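
The three consistency metrics in the cell above respond differently to the same data. The toy sketch below uses four made-up per-model accuracies (not taken from the leaderboard) to show why `avg_model_diff` can never be smaller than `mean_diff`: per-model gaps of opposite sign cancel in the difference of means but not in the mean of absolute differences.

```python
import numpy as np

# Toy per-model accuracies on two subsets (made-up numbers, 4 models)
a1 = np.array([0.60, 0.70, 0.80, 0.90])
a2 = np.array([0.65, 0.65, 0.85, 0.85])

mean_diff = abs(a1.mean() - a2.mean())    # gap between the two averages
avg_model_diff = np.abs(a1 - a2).mean()   # average of the per-model gaps
correlation = np.corrcoef(a1, a2)[0, 1]   # Pearson r across models

print(f"mean_diff={mean_diff:.4f} avg_model_diff={avg_model_diff:.4f} correlation={correlation:.4f}")
```

Here both subsets have the same average accuracy, so `mean_diff` is essentially zero even though every model moves by 0.05 between the subsets. Reading `mean_diff` together with `avg_model_diff` and the correlation guards against that kind of cancellation.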

In [33]:
# Generate charts from the multi-seed results into subset_analysis_charts/
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# seaborn is optional (nicer styling); fall back to plain matplotlib if it is not installed
try:
	import seaborn as sns
	HAS_SNS = True
except Exception:
	HAS_SNS = False

# Font configuration (Microsoft YaHei first, for Windows)
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei', 'SimHei', 'Arial Unicode MS', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

in_per_seed = 'subset_analysis_charts/multi_seed_per_seed_metrics.csv'
in_agg = 'subset_analysis_charts/multi_seed_aggregate_metrics.csv'
in_summary = 'subset_analysis_charts/mmlu_ZH-CN_subset_summary.json'
out_dir = 'subset_analysis_charts'
os.makedirs(out_dir, exist_ok=True)

per_seed_df = pd.read_csv(in_per_seed)
agg_df = pd.read_csv(in_agg)
summary = None
if os.path.exists(in_summary):
	with open(in_summary, 'r', encoding='utf-8') as f:
		summary = json.load(f)

# Use one fixed method order everywhere
method_order = ['BUCKET', 'MATRIX', 'IRT']
per_seed_df['方法'] = pd.Categorical(per_seed_df['方法'], categories=method_order, ordered=True)
agg_df['方法'] = pd.Categorical(agg_df['方法'], categories=method_order, ordered=True)

# 1) Bar chart of wins per method (the 胜出次数/10 column)
fig, ax = plt.subplots(figsize=(6,4), dpi=150)
plot_df = agg_df.sort_values('方法')
x = np.arange(len(plot_df))
vals = plot_df['胜出次数/10'].values
if HAS_SNS:
	sns.barplot(x='方法', y='胜出次数/10', data=plot_df, ax=ax, palette='Set2')
else:
	ax.bar(x, vals, color=['#66c2a5','#8da0cb','#fc8d62'])
	ax.set_xticks(x)
	ax.set_xticklabels(plot_df['方法'])
ax.set_title('Wins per method (10 random_state values)')
ax.set_xlabel('Method')
ax.set_ylabel('Wins (out of 10)')
for i, v in enumerate(vals):
	ax.text(i, v + 0.1, str(int(v)), ha='center', va='bottom', fontsize=9)
plt.tight_layout()
fig.savefig(os.path.join(out_dir, 'wins_per_method.png'))
plt.close(fig)

# 2) Mean rank (error bars show the standard deviation)
rank_stats = per_seed_df.groupby('方法')['排名'].agg(['mean','std']).reindex(method_order)
fig, ax = plt.subplots(figsize=(6,4), dpi=150)
if HAS_SNS:
	ax.errorbar(rank_stats.index, rank_stats['mean'], yerr=rank_stats['std'], fmt='o-', capsize=4)
else:
	ax.errorbar(np.arange(len(rank_stats)), rank_stats['mean'], yerr=rank_stats['std'], fmt='o-', capsize=4)
	ax.set_xticks(np.arange(len(rank_stats)))
	ax.set_xticklabels(rank_stats.index)
ax.set_title('Mean rank (error bars: standard deviation)')
ax.set_xlabel('Method')
ax.set_ylabel('Rank (lower is better)')
for i, (m, row) in enumerate(rank_stats.iterrows()):
	ax.text(i, row['mean'], f"{row['mean']:.2f}", ha='center', va='bottom', fontsize=9)
plt.tight_layout()
fig.savefig(os.path.join(out_dir, 'avg_rank_with_std.png'))
plt.close(fig)

# 3) Distribution of total scores per method (box plot)
fig, ax = plt.subplots(figsize=(6,4), dpi=150)
if HAS_SNS:
	sns.boxplot(x='方法', y='总分', data=per_seed_df, ax=ax, palette='Set3')
	sns.stripplot(x='方法', y='总分', data=per_seed_df, ax=ax, color='k', alpha=0.5, size=4, jitter=True)
else:
	# Manual box plot
	data_list = [per_seed_df[per_seed_df['方法']==m]['总分'].values for m in method_order]
	ax.boxplot(data_list, labels=method_order, patch_artist=True,
	           boxprops=dict(facecolor='#d9d9d9'))
	ax.scatter(per_seed_df['方法'].cat.codes+1, per_seed_df['总分'], s=10, c='k', alpha=0.6)
ax.set_title('Total score distribution per method (10 random_state values)')
ax.set_xlabel('Method')
ax.set_ylabel('Total score (higher is better)')
plt.tight_layout()
fig.savefig(os.path.join(out_dir, 'total_score_box.png'))
plt.close(fig)

# 4) Representativeness: correlation with full MMLU vs mean-accuracy gap to full MMLU (scatter, colored by method)
fig, ax = plt.subplots(figsize=(6,4), dpi=150)
if HAS_SNS:
	sns.scatterplot(
		data=per_seed_df, x='与Full均值差(均值)', y='与Full相关(均值)',
		hue='方法', hue_order=method_order, style='方法', ax=ax, s=60, palette='Set2'
	)
else:
	colors = {'BUCKET':'#66c2a5', 'MATRIX':'#8da0cb', 'IRT':'#fc8d62'}
	for m in method_order:
		dfm = per_seed_df[per_seed_df['方法']==m]
		ax.scatter(dfm['与Full均值差(均值)'], dfm['与Full相关(均值)'], s=60, c=colors[m], label=m)
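ax.set_title('Representativeness: correlation with full MMLU vs mean-accuracy gap')
ax.set_xlabel('Mean-accuracy gap to full MMLU (lower is better)')
ax.set_ylabel('Correlation with full MMLU (higher is better)')
ax.legend(title='Method')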
plt.tight_layout()
fig.savefig(os.path.join(out_dir, 'representativeness_scatter.png'))
plt.close(fig)

# 5) Consistency: correlation between the two subsets vs mean per-model gap (scatter, colored by method)
fig, ax = plt.subplots(figsize=(6,4), dpi=150)
if HAS_SNS:
	sns.scatterplot(
		data=per_seed_df, x='两子集模型差(均值)', y='两子集相关',
		hue='方法', hue_order=method_order, style='方法', ax=ax, s=60, palette='Set1'
	)
else:
	colors = {'BUCKET':'#e41a1c', 'MATRIX':'#377eb8', 'IRT':'#4daf4a'}
	for m in method_order:
		dfm = per_seed_df[per_seed_df['方法']==m]
		ax.scatter(dfm['两子集模型差(均值)'], dfm['两子集相关'], s=60, c=colors[m], label=m)
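ax.set_title('Consistency: subset correlation vs mean per-model gap')
ax.set_xlabel('Mean per-model gap between the two subsets (lower is better)')
ax.set_ylabel('Correlation between the two subsets (higher is better)')
ax.legend(title='Method')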
plt.tight_layout()
fig.savefig(os.path.join(out_dir, 'consistency_scatter.png'))
plt.close(fig)

# 6) Heatmap of ranks per random_state (rows: seed, columns: method)
rank_pivot = per_seed_df.pivot_table(index='seed', columns='方法', values='排名')
fig, ax = plt.subplots(figsize=(6, max(3, 0.4*len(rank_pivot))), dpi=150)
if HAS_SNS:
	sns.heatmap(rank_pivot.loc[sorted(rank_pivot.index)], annot=True, fmt='.0f',
	            cmap='YlGnBu', cbar=True, ax=ax)
else:
	# Simplified version using imshow
	im = ax.imshow(rank_pivot.loc[sorted(rank_pivot.index)].values, cmap='YlGnBu', aspect='auto')
	ax.set_xticks(np.arange(len(rank_pivot.columns)))
	ax.set_xticklabels(rank_pivot.columns)
	ax.set_yticks(np.arange(len(rank_pivot.index)))
	ax.set_yticklabels(sorted(rank_pivot.index))
	for i in range(len(rank_pivot.index)):
		for j in range(len(rank_pivot.columns)):
			val = rank_pivot.loc[sorted(rank_pivot.index)[i], rank_pivot.columns[j]]
			ax.text(j, i, f'{int(val)}', ha='center', va='center', color='black', fontsize=8)
	fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
ax.set_title('Rank per random_state (lower is better)')
plt.tight_layout()
fig.savefig(os.path.join(out_dir, 'per_seed_rank_heatmap.png'))
plt.close(fig)

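# 7) Summary bar charts: mean of the key metrics per method (from the aggregate table)
# Key columns used (all means over seeds): total score, correlation with full MMLU, mean-accuracy gap to full MMLU,
# correlation between the two subsets, mean-accuracy gap between the two subsets, mean per-model gap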
key_cols = ['总分(均值)', '与Full相关(均值)', '与Full均值差(均值)', '两子集相关(均值)', '两子集均值差(均值)', '两子集模型差(均值)']
agg_plot = agg_df.set_index('方法').loc[method_order, key_cols]

# Two charts: higher-is-better metrics and lower-is-better metrics
pos_cols = ['总分(均值)', '与Full相关(均值)', '两子集相关(均值)']
neg_cols = ['与Full均值差(均值)', '两子集均值差(均值)', '两子集模型差(均值)']

fig, ax = plt.subplots(figsize=(7,4), dpi=150)
if HAS_SNS:
	agg_plot[pos_cols].plot(kind='bar', ax=ax, colormap='Set2')
else:
	agg_plot[pos_cols].plot(kind='bar', ax=ax)
ax.set_title('Key higher-is-better metrics')
ax.set_xlabel('Method')
ax.set_ylabel('Value')
plt.tight_layout()
fig.savefig(os.path.join(out_dir, 'aggregate_positive_metrics.png'))
plt.close(fig)

fig, ax = plt.subplots(figsize=(7,4), dpi=150)
if HAS_SNS:
	agg_plot[neg_cols].plot(kind='bar', ax=ax, colormap='Set3')
else:
	agg_plot[neg_cols].plot(kind='bar', ax=ax)
ax.set_title('Key lower-is-better metrics')
ax.set_xlabel('Method')
ax.set_ylabel('Value')
plt.tight_layout()
fig.savefig(os.path.join(out_dir, 'aggregate_negative_metrics.png'))
plt.close(fig)

print('Charts written to folder: subset_analysis_charts/')
print('\nFile list:')
for f in sorted(os.listdir(out_dir)):
	print('-', f)


Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='方法', y='胜出次数/10', data=plot_df, ax=ax, palette='Set2')
  rank_stats = per_seed_df.groupby('方法')['排名'].agg(['mean','std']).reindex(method_order)

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='方法', y='总分', data=per_seed_df, ax=ax, palette='Set3')
  rank_pivot = per_seed_df.pivot_table(index='seed', columns='方法', values='排名')


Charts written to folder: subset_analysis_charts/

File list:
- aggregate_negative_metrics.png
- aggregate_positive_metrics.png
- avg_rank_with_std.png
- consistency_scatter.png
- mmlu_ZH-CN_subset_summary.json
- multi_seed_aggregate_metrics.csv
- multi_seed_per_seed_metrics.csv
- per_seed_rank_heatmap.png
- representativeness_scatter.png
- total_score_box.png
- wins_per_method.png
