# Finding and using anchor points

In this notebook, we show how to find anchor points from your training set and how to use them to estimate the performance of new models on the test set.

## Preparing data

Loading packages:

In [1]:
import numpy as np
import pickle
import pandas as pd
import json
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances
from irt import *
from utils import *

random_state = 42

The leaderboard dataset we will use is composed of six scenarios (sub-datasets):
1. TruthfulQA
1. GSM8K
1. Winogrande
1. ARC
1. HellaSwag
1. MMLU

MMLU is further divided into sub-scenarios (e.g., abstract algebra, anatomy, etc). Let's check scenarios and sub-scenarios:

In [2]:
scenarios

{'harness_truthfulqa_mc_0': ['harness_truthfulqa_mc_0'],
 'gsm8k': ['harness_gsm8k_5'],
 'winogrande': ['harness_winogrande_5'],
 'arc': ['harness_arc_challenge_25'],
 'hellaswag': ['harness_hellaswag_10'],
 'mmlu': ['harness_hendrycksTest_abstract_algebra_5',
  'harness_hendrycksTest_anatomy_5',
  'harness_hendrycksTest_astronomy_5',
  'harness_hendrycksTest_business_ethics_5',
  'harness_hendrycksTest_clinical_knowledge_5',
  'harness_hendrycksTest_college_biology_5',
  'harness_hendrycksTest_college_chemistry_5',
  'harness_hendrycksTest_college_computer_science_5',
  'harness_hendrycksTest_college_mathematics_5',
  'harness_hendrycksTest_college_medicine_5',
  'harness_hendrycksTest_college_physics_5',
  'harness_hendrycksTest_computer_security_5',
  'harness_hendrycksTest_conceptual_physics_5',
  'harness_hendrycksTest_econometrics_5',
  'harness_hendrycksTest_electrical_engineering_5',
  'harness_hendrycksTest_elementary_mathematics_5',
  'harness_hendrycksTest_formal_logic_5',
 

Loading leaderboard data:

In [3]:
with open('data/lb.pickle', 'rb') as handle:
    data = pickle.load(handle)

This dataset contains results for 395 models. Let's look at the names of the first ten:

In [4]:
len(data['models']),data['models'][:10]

(395,
 ['open-llm-leaderboard/details_zhengr__MixTAO-7Bx2-MoE-DPO',
  'open-llm-leaderboard/details_alignment-handbook__zephyr-7b-sft-full',
  'open-llm-leaderboard/details_rombodawg__Leaderboard-killer-MoE_4x7b',
  'open-llm-leaderboard/details_FelixChao__ExtremeDolphin-MoE',
  'open-llm-leaderboard/details_LoSboccacc__orthogonal-2x7B-base',
  'open-llm-leaderboard/details_moreh__MoMo-70B-lora-1.8.6-DPO',
  'open-llm-leaderboard/details_deepseek-ai__deepseek-moe-16b-base',
  'open-llm-leaderboard/details_Swisslex__Mixtral-Orca-v0.1',
  'open-llm-leaderboard/details_wang7776__Mistral-7B-Instruct-v0.2-sparsity-20',
  'open-llm-leaderboard/details_nfaheem__Marcoroni-7b-DPO-Merge'])

Below, we will process the data so all correctness scores (for all scenarios) are stored in $Y$. The dictionaries `scenarios_position` and `subscenarios_position` give the position of scenarios/subscenarios correctness scores in $Y$.

In [5]:
scenarios_position, subscenarios_position = prepare_data(scenarios, data)
Y = create_responses(scenarios, data)
Y.shape

(395, 28659)

For example, below you can see the scores for MMLU:

In [6]:
Y[:,scenarios_position['mmlu']], Y[:,scenarios_position['mmlu']].shape

(array([[0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        ...,
        [0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        [1., 0., 1., ..., 1., 1., 0.]]),
 (395, 14042))

For scenarios that have multiple subscenarios, we usually want each subscenario to count equally when computing the aggregated performance in that scenario. This is equivalent to a weighted average over items in which each item of a subscenario with $n_i$ items receives weight $N/(n_{\text{sub}}\, n_i)$, where $N$ is the number of items in the scenario and $n_{\text{sub}}$ is its number of subscenarios. We will create `balance_weights`, a vector of weights that lets us compute those weighted averages. These weights differ from one only for MMLU, which is the only scenario with multiple subscenarios.

In [None]:
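# Y has 28659 columns: every MMLU subscenario plus the single subscenario of each other dataset
balance_weights = np.ones(Y.shape[1])
# N: total number of MMLU questions
N = len(scenarios_position['mmlu'])
# n_sub: number of MMLU subjects (subscenarios)
n_sub = len(scenarios['mmlu'])
# sub: one MMLU subject
for sub in scenarios['mmlu']:
    # n_i: number of questions in this subject
    n_i = len(subscenarios_position['mmlu'][sub])
    # subscenarios_position['mmlu'][sub] gives the positions of this subject's questions among the columns of Y
    # n_sub * n_i is the subject's question count times the number of MMLU subjects (57).
    # Weights can be above or below one; a weight above one means the subject has fewer questions than average, so it is up-weighted
    balance_weights[subscenarios_position['mmlu'][sub]] = N/(n_sub*n_i)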

We can see below that first averaging within subscenarios and then computing a simple average is equivalent to using a weighted average from the beginning:

In [None]:
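# accs1: compute each model's accuracy within each subject, then average the subject accuracies with equal weight, giving a vector of shape (395,)
accs1 = np.mean([Y[:,subscenarios_position['mmlu'][sub]].mean(axis=1) for sub in scenarios['mmlu']], axis=0)
# balance_weights*Y multiplies each model's 0/1 correctness by the item weight; scenarios_position['mmlu'] selects the 14042 MMLU columns; the mean over columns gives one accuracy per model, shape (395,)
accs2 = (balance_weights*Y)[:,scenarios_position['mmlu']].mean(axis=1)
# The two results agree up to floating-point error
np.abs(accs1 - accs2).mean()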

2.322333605307685e-14
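
To make the equivalence concrete, here is a minimal sketch on made-up data. The names `Y_toy` and `positions` are hypothetical and not part of the notebook; the weight formula is the same $N/(n_{\text{sub}}\, n_i)$ used for `balance_weights`.

```python
import numpy as np

# Toy setup (hypothetical): 3 models, two subscenarios with 2 and 4 items.
rng = np.random.default_rng(0)
Y_toy = rng.integers(0, 2, size=(3, 6)).astype(float)
positions = {"sub_a": [0, 1], "sub_b": [2, 3, 4, 5]}

N = Y_toy.shape[1]       # total number of items in the scenario
n_sub = len(positions)   # number of subscenarios

# Each item in a subscenario with n_i items gets weight N / (n_sub * n_i).
weights = np.ones(N)
for idx in positions.values():
    weights[idx] = N / (n_sub * len(idx))

# Macro average: mean within each subscenario, then mean across subscenarios.
macro = np.mean([Y_toy[:, idx].mean(axis=1) for idx in positions.values()], axis=0)
# Weighted average over all items at once.
weighted = (weights * Y_toy).mean(axis=1)

print(weights)
print(np.allclose(macro, weighted))
```

Items in the smaller subscenario get a weight above one and items in the larger one a weight below one, and the weights sum to `N`, so the plain `.mean(axis=1)` still divides by the right quantity.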

## Getting and using anchor points

Let's split the data into train and test sets (recent models are placed in the test set):

In [None]:
# The first dimension of Y indexes models
Y_test = Y[:100]
Y_train = Y[100:]

In [None]:
# Highest (balanced) MMLU accuracy among the training models
(balance_weights*Y_train)[:,scenarios_position['mmlu']].mean(axis=1).max()

0.7865657803785115

The variable `number_item` gives the number of anchor points we want to find in each scenario:

In [11]:
number_item = 100

The variable `clustering` specifies how the clustering is run. If `clustering="correct."`, the correctness patterns of the training models are used as item features. If `clustering="irt"`, the IRT embeddings of the examples are used instead.

Computing anchor points and their weights for each scenario:

In [None]:
anchor_points = {}
anchor_weights = {}

for scenario in scenarios.keys():
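    # With correctness clustering, each row of X is an item of the current scenario and each column holds one training model's 0/1 response (one column per model)
    # With IRT clustering, each row of X is an item and the columns hold its IRT discrimination (A) and difficulty (B) parameters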
    if clustering=='correct.':
        X = Y_train[:,scenarios_position[scenario]].T
    elif clustering=='irt':
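        # Load the core item parameters from a pre-trained IRT model:
        # `A` (discrimination): how well an item separates high-ability from low-ability models. A higher A means better discrimination.
        # `B` (difficulty): how hard an item is. A higher B means a harder item.
        A, B, _ = load_irt_parameters('data/irt_model/')
        # Stack the A and B parameters of all items into one matrix. After the transpose (.T), each row is one item, with its discrimination (A) followed by its difficulty (B).
        X = np.vstack((A.squeeze(), B.squeeze().reshape((1,-1)))).T
        # Then keep only the items that belong to the current scenario (e.g. 'mmlu').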
        X = X[scenarios_position[scenario]]
    else:
        raise NotImplementedError 
        
    #Normalizing balance_weights, so their sum is one within each scenario
    norm_balance_weights = balance_weights[scenarios_position[scenario]]
    norm_balance_weights /= norm_balance_weights.sum()

    # Fitting the KMeans model
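    # * kmeans.labels_:
    #   * Content: a 1-D array with one entry per row of X (i.e., per item). Entry i is the index of the cluster
    #     that item i was assigned to (0 to number_item - 1).
    #   * Purpose: the most direct clustering result; it tells us which cluster each item belongs to.

    # * kmeans.cluster_centers_:
    #   * Content: a 2-D array of shape (number_item, n_features). Each row holds the coordinates of one cluster center (centroid) in feature space.
    #   * Purpose: the "average" features of each cluster. The code below uses it to find the real item closest to each centroid, which becomes an anchor point.

    # * kmeans.inertia_:
    #   * Content: a float, the sum of squared distances from every sample to its assigned cluster center (weighted by the sample weights if provided).
    #   * Purpose: a measure of clustering quality; smaller values usually mean tighter clusters.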
    kmeans = KMeans(n_clusters=number_item, n_init="auto", random_state=random_state)
    kmeans.fit(X, sample_weight=norm_balance_weights)

    # Calculating anchor points
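    # For each of the number_item clusters found by KMeans, find the real item closest to the cluster center and store the indices of these items as the scenario's anchor points.
    # The indices are relative to scenarios_position[scenario]. The anchor points are a condensed, representative sample of the full item pool.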
    anchor_points[scenario] = pairwise_distances(kmeans.cluster_centers_, X, metric='euclidean').argmin(axis=1)

    # Calculating anchor weights
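    # For each of the number_item clusters, the anchor weight is the sum of the weights of all items in that cluster
    # kmeans.labels_ == c:
    #   * A boolean comparison that builds a mask array for the current cluster index c.
    #   * For example, when c=5 the mask is True at the positions of all items in cluster 5 and False everywhere else.
    # norm_balance_weights[...]:
    #   * norm_balance_weights is an array holding the normalized weight of every item.
    #   * norm_balance_weights[kmeans.labels_==c] uses the mask above to pick out only the weights of the items in cluster `c`.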
    anchor_weights[scenario] = np.array([np.sum(norm_balance_weights[kmeans.labels_==c]) for c in range(number_item)])
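
The same recipe can be tried end to end on synthetic data. The sketch below is only an illustration under stated assumptions: the data, the sizes, the uniform item weights, and all `toy_*` names are made up, and `n_init=10` is used instead of `"auto"` so that it also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances

rng = np.random.default_rng(0)

# Hypothetical data: 50 "training models" x 400 items with binary correctness.
n_models, n_items, n_anchors = 50, 400, 10
p_correct = rng.uniform(0.1, 0.9, size=n_items)
Y_toy = (rng.uniform(size=(n_models, n_items)) < p_correct).astype(float)

# Correctness clustering: one row per item, one column per model.
X_toy = Y_toy.T
# Uniform item weights that sum to one (stand-in for norm_balance_weights).
toy_item_weights = np.full(n_items, 1.0 / n_items)

toy_kmeans = KMeans(n_clusters=n_anchors, n_init=10, random_state=0)
toy_kmeans.fit(X_toy, sample_weight=toy_item_weights)

# Anchor point: the real item closest to each centroid.
toy_anchor_points = pairwise_distances(
    toy_kmeans.cluster_centers_, X_toy, metric="euclidean"
).argmin(axis=1)

# Anchor weight: total weight of the items assigned to each cluster.
toy_anchor_weights = np.array(
    [toy_item_weights[toy_kmeans.labels_ == c].sum() for c in range(n_anchors)]
)

print(toy_anchor_points)
print(toy_anchor_weights.sum())
```

Because every item belongs to exactly one cluster, the anchor weights sum to the total item weight (one here), which is what lets the weighted sum over anchors be read as an accuracy estimate.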

Saving:

In [14]:
anchor = {'anchor_points':anchor_points,
          'anchor_weights':anchor_weights}

with open('data/anchor.pickle', 'wb') as handle:
    pickle.dump(anchor, handle, protocol=pickle.HIGHEST_PROTOCOL)

Checking results:

In [15]:
anchor_points['mmlu']

array([ 3339,   216,   186,  4895,  4826,  8082,  9644, 13748, 11767,
        9610,  1215,  2854,   627,  5853, 11538,  5943,  4801, 11465,
        1880, 10257,  2354,  4882, 10627,  6597, 12963, 11964, 11970,
        7769,  2698,  5738, 13289,  9662, 12060,  8037,  3175,  2051,
        8692, 11950,    64,  2662,  9632,  5856,  4725,  5934,   614,
        2282,  2762,  4608,  9444,  7792,  9833,  7577,  8167,  8102,
        5561,  3233, 11120,  4194,  6624,   739,  5516,  1631, 10081,
        3889,   964,  5119,  2028,  8744,  4244,  4022,  9671, 12699,
       11920,  3439,  5151,  3237,  9447, 12142,  2268,   142,  1188,
        8422, 12505,  6321,  1557,  6551,  6380,  7861, 10460,  7677,
        6877,  1384,  5460,   571,  3356,  8524, 11024,  4790,  1059,
       11589], dtype=int64)

In [16]:
anchor_weights['mmlu']

array([0.00719509, 0.02925134, 0.00688233, 0.01077535, 0.01135904,
       0.00979226, 0.00669851, 0.01577534, 0.00868074, 0.01017981,
       0.00669675, 0.00840938, 0.00837327, 0.01174739, 0.01084033,
       0.01212135, 0.008107  , 0.00751334, 0.01068052, 0.01132448,
       0.00710094, 0.00855757, 0.00723255, 0.00838955, 0.00879337,
       0.00646765, 0.00925892, 0.01495362, 0.01062908, 0.01159708,
       0.00860773, 0.01295747, 0.00854535, 0.02359755, 0.00576909,
       0.00759589, 0.00455757, 0.01114833, 0.00522446, 0.01094169,
       0.01302504, 0.01764216, 0.00971956, 0.00580515, 0.00925852,
       0.00768513, 0.00876491, 0.01582696, 0.0079103 , 0.01071352,
       0.01975156, 0.00609008, 0.00780274, 0.01213502, 0.00831919,
       0.00980854, 0.00854498, 0.00879001, 0.02524748, 0.00838861,
       0.01593213, 0.01020836, 0.0090561 , 0.00836418, 0.00673287,
       0.00647885, 0.00727505, 0.00704453, 0.00873509, 0.0079446 ,
       0.00953117, 0.00639431, 0.00897941, 0.00801886, 0.01038

Using anchor points to estimate performance on the test set and reporting the average absolute prediction error:

In [None]:
# Y: (number of models, number of items), each entry is 0/1
for scenario in scenarios.keys():
    # Y_test has shape (100, number of items), with 0/1 entries
    # [:,scenarios_position[scenario]] first selects the columns of the current scenario, e.g. shape (100, 14042) for MMLU
    # [:,anchor_points[scenario]] then selects the anchor items within that scenario, giving shape (100, 100)
    Y_anchor = Y_test[:,scenarios_position[scenario]][:,anchor_points[scenario]]
    # Y_hat: each model's estimated score, i.e. its correctness on the 100 anchor items weighted by the anchor weights
    Y_hat = (Y_anchor*anchor_weights[scenario]).sum(axis=1)
    Y_true = (balance_weights*Y_test)[:,scenarios_position[scenario]].mean(axis=1)

    print(f"scenario: {scenario}, avg. error: {np.abs(Y_hat-Y_true).mean():.3f}")

scenario: harness_truthfulqa_mc_0, avg. error: 0.019
scenario: gsm8k, avg. error: 0.025
scenario: winogrande, avg. error: 0.026
scenario: arc, avg. error: 0.018
scenario: hellaswag, avg. error: 0.012
scenario: mmlu, avg. error: 0.022
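
The reported "avg. error" is a mean absolute error across test models. A tiny hand-checkable sketch of the estimator follows; the helper `estimate_scores` and the numbers are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def estimate_scores(Y_items, anchor_idx, anchor_w):
    """Weighted sum of correctness on the anchor items, one estimate per model."""
    return (Y_items[:, anchor_idx] * anchor_w).sum(axis=1)

# Hypothetical example: 2 models, 5 items, 2 anchor items.
Y_items = np.array([[1., 0., 1., 1., 0.],
                    [0., 0., 1., 0., 1.]])
anchor_idx = np.array([0, 2])
anchor_w = np.array([0.4, 0.6])   # cluster weights, summing to one

Y_hat_toy = estimate_scores(Y_items, anchor_idx, anchor_w)
Y_true_toy = Y_items.mean(axis=1)
mae = np.abs(Y_hat_toy - Y_true_toy).mean()

print(Y_hat_toy, Y_true_toy, mae)
```

With only two anchors the estimate is coarse; the point is the mechanics, not the accuracy.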


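## Generating two similar Chinese MMLU subsets (300 questions each)

Following the requirements in `mmlu_CN_Prompt.mdc`:
- Use only the English-side item performance signal (here, the average accuracy of the English-side models on each question, as a difficulty proxy); do not use any Chinese answer information.
- Excluded subjects: `high_school_us_history`, `security_studies`, `high_school_government_and_politics`, `jurisprudence`, `business_ethics`, `us_foreign_policy`, `global_facts`.
- Stratification: stratify jointly by `Subject` and difficulty quantile bucket (5 buckets); sample from each stratum in proportion to its share of the pool, producing two non-overlapping 300-question subsets whose distributions match as closely as possible.
- Outputs:
  - `../tutorials/mmlu_ZH-CN_subset_1.csv`
  - `../tutorials/mmlu_ZH-CN_subset_2.csv`
  - Summary: `../tutorials/mmlu_ZH-CN_subset_summary.json`

Usage: run all cells of this notebook from the top, then run the code cell below to generate the output files.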


In [18]:
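# Generate two Chinese subsets of 300 questions each from the English-side item performance signal, and write a summary
np.random.seed(random_state)

# 1) Average accuracy per English MMLU question (difficulty proxy; lower means harder)
Y_mmlu = Y[:, scenarios_position['mmlu']]
item_acc = Y_mmlu.mean(axis=0)

# 2) Load the Chinese question bank and align the English metric by ID
cn_path = 'mmlu_ZH-CN.csv'
df_cn = pd.read_csv(cn_path)
assert df_cn.shape[0] == item_acc.shape[0], 'Row count of the Chinese question bank does not match the number of English MMLU questions'
df_cn['acc'] = item_acc[df_cn['ID'].values]

# 3) Exclude the specified subjects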
excluded_subjects = set([
    'high_school_us_history',
    'security_studies',
    'high_school_government_and_politics',
    'jurisprudence',
    'business_ethics',
    'us_foreign_policy',
    'global_facts',
])
mask_keep = ~df_cn['Subject'].isin(excluded_subjects)
df_pool = df_cn[mask_keep].copy().reset_index(drop=True)

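# 4) Build difficulty quantile buckets (5 buckets). Higher acc means easier; used only to align distributions
num_buckets = 5
# If acc has many repeated values, qcut may produce fewer buckets; duplicates='drop' merges the duplicate bin edges
df_pool['acc_bucket'] = pd.qcut(df_pool['acc'], q=num_buckets, labels=False, duplicates='drop')

# If fewer buckets than requested were produced (or any value is unassigned), fall back to rank-based bucketing
if df_pool['acc_bucket'].isna().any() or df_pool['acc_bucket'].nunique() < num_buckets:
    # Approximate the quantile buckets using normalized ranks
    ranks = df_pool['acc'].rank(method='average') / (len(df_pool))
    df_pool['acc_bucket'] = np.minimum((ranks * num_buckets).astype(int), num_buckets - 1)

# 5) Stratified sampling jointly over Subject x acc_bucket, producing two non-overlapping subsets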
TOTAL = 300

def sample_proportional_stratified(df_source, total, seed):
    rng = np.random.RandomState(seed)
    group_cols = ['Subject', 'acc_bucket']
    counts = df_source.groupby(group_cols).size()
    props = counts / counts.sum()
    raw_targets = props * total
    base = np.floor(raw_targets).astype(int)
    remainder = (raw_targets - base).sort_values(ascending=False)
    need = total - base.sum()
    # Allocate the leftover slots to the strata with the largest fractional remainders
    add = pd.Series(0, index=base.index)
    if need > 0:
        add.loc[remainder.index[:need]] = 1
    targets = base + add
    # Sample within each stratum
    parts = []
    for grp, n in targets.items():
        if n <= 0:
            continue
        sub = df_source[(df_source['Subject'] == grp[0]) & (df_source['acc_bucket'] == grp[1])]
        k = min(n, len(sub))
        if k > 0:
            parts.append(sub.sample(n=k, random_state=rng.randint(0, 1_000_000)))
    out = pd.concat(parts, axis=0) if len(parts) else df_source.iloc[0:0]
    # If fewer than total rows were drawn, top up at random from the remaining rows
    short = total - len(out)
    if short > 0:
        remain = df_source.drop(out.index)
        if len(remain) >= short:
            out = pd.concat([out, remain.sample(n=short, random_state=rng.randint(0, 1_000_000))], axis=0)
        else:
            out = pd.concat([out, remain], axis=0)
    return out.iloc[:total].sample(frac=1.0, random_state=seed).reset_index(drop=True)

subset1 = sample_proportional_stratified(df_pool, TOTAL, random_state)
remain_pool = df_pool[~df_pool['ID'].isin(subset1['ID'])].copy()
subset2 = sample_proportional_stratified(remain_pool, TOTAL, random_state + 1)

# 6) Compute summary statistics

def dist_counts_props(series):
    counts = series.value_counts().sort_index()
    props = (counts / counts.sum()).round(4)
    return counts.to_dict(), props.to_dict()

sub1_subject_counts, sub1_subject_props = dist_counts_props(subset1['Subject'])
sub2_subject_counts, sub2_subject_props = dist_counts_props(subset2['Subject'])
sub1_bucket_counts, sub1_bucket_props = dist_counts_props(subset1['acc_bucket'])
sub2_bucket_counts, sub2_bucket_props = dist_counts_props(subset2['acc_bucket'])

overlap_ids = sorted(list(set(subset1['ID'].tolist()) & set(subset2['ID'].tolist())))

# A simple distribution-alignment measure (L1 difference between proportions):
def l1_diff(props1, props2):
    keys = set(props1.keys()) | set(props2.keys())
    return float(sum(abs(props1.get(k, 0.0) - props2.get(k, 0.0)) for k in keys))

alignment = {
    'subject_props_L1': round(l1_diff(sub1_subject_props, sub2_subject_props), 4),
    'difficulty_bucket_props_L1': round(l1_diff(sub1_bucket_props, sub2_bucket_props), 4),
}

summary = {
    'subset_1': {
        'size': int(len(subset1)),
        'subject_counts': sub1_subject_counts,
        'subject_props': sub1_subject_props,
        'difficulty_bucket_props': sub1_bucket_props,
    },
    'subset_2': {
        'size': int(len(subset2)),
        'subject_counts': sub2_subject_counts,
        'subject_props': sub2_subject_props,
        'difficulty_bucket_props': sub2_bucket_props,
    },
    'overlap': {
        'count': int(len(overlap_ids)),
        'ids': overlap_ids,
    },
    'alignment': alignment,
    'excluded_subjects': sorted(list(excluded_subjects)),
}

# 7) Write the CSV files and the summary JSON
out_dir = '../tutorials'
subset1[['ID', 'Question', 'A', 'B', 'C', 'D', 'Answer', 'Subject']].to_csv(f'{out_dir}/mmlu_ZH-CN_subset_1.csv', index=False)
subset2[['ID', 'Question', 'A', 'B', 'C', 'D', 'Answer', 'Subject']].to_csv(f'{out_dir}/mmlu_ZH-CN_subset_2.csv', index=False)
with open(f'{out_dir}/mmlu_ZH-CN_subset_summary.json', 'w', encoding='utf-8') as f:
    json.dump(summary, f, ensure_ascii=False, indent=2)

print('Done:')
print('- tutorials/mmlu_ZH-CN_subset_1.csv')
print('- tutorials/mmlu_ZH-CN_subset_2.csv')
print('- tutorials/mmlu_ZH-CN_subset_summary.json')
print('Subset sizes: ', len(subset1), len(subset2))
print('Overlapping questions: ', len(overlap_ids))
print('Alignment (L1):', alignment)


Done:
- tutorials/mmlu_ZH-CN_subset_1.csv
- tutorials/mmlu_ZH-CN_subset_2.csv
- tutorials/mmlu_ZH-CN_subset_summary.json
Subset sizes:  300 300
Overlapping questions:  0
Alignment (L1): {'subject_props_L1': 0.3135, 'difficulty_bucket_props_L1': 0.1933}
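
The quota step inside `sample_proportional_stratified` is a largest-remainder allocation: floor each stratum's proportional share, then hand the leftover slots to the strata with the biggest fractional parts. The sketch below isolates that step; the helper name `largest_remainder` and the stratum counts are hypothetical.

```python
import numpy as np
import pandas as pd

def largest_remainder(counts: pd.Series, total: int) -> pd.Series:
    """Split `total` slots across strata in proportion to `counts`."""
    raw = counts / counts.sum() * total            # ideal fractional quotas
    base = np.floor(raw).astype(int)               # guaranteed whole slots
    need = int(total - base.sum())                 # slots still unassigned
    # Strata ordered by fractional remainder, largest first.
    order = (raw - base).sort_values(ascending=False).index
    base.loc[order[:need]] += 1
    return base

# Hypothetical stratum sizes.
counts = pd.Series({"a": 50, "b": 30, "c": 20, "d": 7})
targets = largest_remainder(counts, 10)
print(targets.to_dict())
```

Small strata can end up with a quota of zero, as stratum `d` does here. With many `Subject` x `acc_bucket` strata and only 300 slots, this rounding may be one reason the subject-level L1 gap between the two subsets is not closer to zero.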
