<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/Chatbot_Arena_MLE_Elo_Rating_(Bradley_Terry_model)_Calculation_(March_29%2C_2024).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

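**Purpose**: learn how Elo ratings are computed and use them to analyze the data.
- `compute_online_elo` computes Elo ratings with the online linear update algorithm. It uses a small K-factor of 4 so that the ratings are more stable and less biased toward recent games (dynamic updating).
- `compute_mle_elo` computes Elo ratings by maximum likelihood estimation. The models can be assumed to be static, in which case the ratings can be fitted directly by maximum likelihood estimation (the Bradley-Terry model), which yields very stable ratings. A logistic regression implementation is provided (static fitting).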

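In this notebook, we present a data analysis of the battles that Chatbot Arena collected at https://arena.lmsys.org between April 24, 2023 and March 13, 2024.

It covers the different Elo computation methods used to rank models: online Elo and MLE Elo (also known as the Bradley-Terry model).

See https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard for the latest leaderboard.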


In [1]:
!pip install --upgrade --no-cache-dir gdown

Collecting gdown
  Downloading gdown-5.1.0-py3-none-any.whl (17 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.7.3
    Uninstalling gdown-4.7.3:
      Successfully uninstalled gdown-4.7.3
Successfully installed gdown-5.1.0


In [2]:
from collections import defaultdict
import json, math, gdown
import numpy as np
import pandas as pd
import plotly.express as px
from tqdm import tqdm
pd.options.display.float_format = '{:.2f}'.format

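# Obtaining and Cleaning the Battle Data
The raw battle results are hosted on Google Drive as a JSON file, and we download them with the `gdown` package. The data contains all battles and voting results collected for ranking the models.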

In [3]:
url = "https://drive.google.com/file/d/1_72443egRzwRTmJfIyOQcf1ug7sKbqbX/view?usp=sharing"
filename = gdown.download(url, fuzzy=True)

Downloading...
From (original): https://drive.google.com/uc?id=1_72443egRzwRTmJfIyOQcf1ug7sKbqbX
From (redirected): https://drive.google.com/uc?id=1_72443egRzwRTmJfIyOQcf1ug7sKbqbX&confirm=t&uuid=f503f676-d6c9-4fc8-a017-88a3bd96f177
To: /content/clean_battle_20240329.json
100%|██████████| 178M/178M [00:03<00:00, 44.7MB/s]


In [4]:
battles = pd.read_json(filename).sort_values(ascending=True, by=["tstamp"])
battles

Unnamed: 0,model_a,model_b,winner,judge,turn,anony,language,tstamp
0,vicuna-13b,koala-13b,model_a,arena_user_jzVGPu2kZS2hwmwoKAaGnD,1,False,English,1681813950.52
1,vicuna-13b,dolly-v2-12b,model_a,arena_user_jzVGPu2kZS2hwmwoKAaGnD,1,False,English,1681814024.73
2,vicuna-13b,chatglm-6b,model_a,arena_user_SR7AQVHLwDPRkEu97fnSgd,3,False,English,1681814074.80
3,vicuna-13b,koala-13b,model_a,arena_user_jzVGPu2kZS2hwmwoKAaGnD,1,False,English,1681814233.55
4,vicuna-13b,chatglm-6b,tie,arena_user_jzVGPu2kZS2hwmwoKAaGnD,1,False,Chinese,1681814462.85
...,...,...,...,...,...,...,...,...
718081,claude-3-haiku-20240307,gpt-4-0613,tie,arena_user_NJSCUS6KjVXtRUATM4pDsT,1,True,English,1711738468.85
718082,starling-lm-7b-beta,claude-3-sonnet-20240229,tie,arena_user_8VJnaWyLBUszsu4tfuZ7bk,1,True,French,1711738476.00
718083,starling-lm-7b-beta,mixtral-8x7b-instruct-v0.1,model_b,arena_user_ckR9QhB8PBRaYGjkJ8E4oJ,1,True,English,1711738487.16
718084,starling-lm-7b-alpha,claude-3-sonnet-20240229,model_b,arena_user_iohj752KynfCzUSWWy6zxT,1,True,Russian,1711738500.81


In [5]:
battles = battles[battles["anony"] == True]
print(len(battles))

511252


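# Exploratory Analysis

Before computing Elo ratings, we first run some basic exploratory analysis to highlight a few key properties and caveats of the data.

## Statistics

Users can declare a tie between a pair of models. To collect additional data, later in the tournament we also allowed users to declare a tie in which both models are bad. A large share of the outcomes are ties.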

In [6]:
fig = px.bar(battles["winner"].value_counts(),
             title="Counts of Battle Outcomes", text_auto=True, height=400)
fig.update_layout(xaxis_title="Battle Outcome", yaxis_title="Count",
                  showlegend=False)
fig

In [7]:
battles_no_ties = battles[~battles["winner"].str.contains("tie")]

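## Non-uniform Model Frequency

Model frequencies are non-uniform for the following reasons:
- Several different matching and sampling algorithms were used. We employed uniform sampling as well as weighted sampling, which assigns greater weight to better models.
- Some new models were added later.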

In [11]:
fig = px.bar(pd.concat([battles["model_a"], battles["model_b"]]).value_counts(),
             title="Battle Count for Each Model", text_auto=True)
fig.update_layout(xaxis_title="model", yaxis_title="Battle Count", height=700,
                  showlegend=False)
fig

We also count the number of battles for each combination of models.

In [14]:
def visualize_battle_count(battles, title, show_num_models=30):
    ptbl = pd.pivot_table(battles, index="model_a", columns="model_b", aggfunc="size",
                          fill_value=0)
    battle_counts = ptbl + ptbl.T
    ordering = battle_counts.sum().sort_values(ascending=False).index
    ordering = ordering[:show_num_models]
    fig = px.imshow(battle_counts.loc[ordering, ordering],
                    title=title, text_auto=True)
    fig.update_layout(xaxis_title="Model B",
                      yaxis_title="Model A",
                      xaxis_side="top", height=1200, width=1200,
                      title_y=0.07, title_x=0.5,
                      font=dict(size=10))
    fig.update_traces(hovertemplate=
                      "Model A: %{y}<br>Model B: %{x}<br>Count: %{z}<extra></extra>")
    return fig

fig = visualize_battle_count(battles, title="Battle Count of Each Combination of Models", show_num_models=30)
fig

### Battles Excluding Ties

In [15]:
visualize_battle_count(battles_no_ties, "Battle Count for Each Combination of Models (without Ties)")

### Counting Ties

In [16]:
visualize_battle_count(battles[battles['winner'].str.contains("tie")], "Tie Count for Each Combination of Models")

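## Inferred Language

We also infer the language of each conversation with the `polyglot` package. This is only an estimate, but it will help guide future analysis. The vast majority of conversations are in English.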

In [17]:
topk = 20
fig = px.bar(battles["language"].value_counts().head(topk),
             title=f"Battle Counts for the Top {topk} Languages",
             text_auto=True, height=500)
fig.update_layout(xaxis_title="Language", yaxis_title="Count", showlegend=False)
fig

## Number of Conversation Turns

We also note that most conversations have only one turn.

In [19]:
fig = px.histogram(battles["turn"],
             title=f"Number of Conversation Turns",
             text_auto=True, height=500)
fig.update_layout(xaxis_title="Turns", yaxis_title="Count", showlegend=False)
fig

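## Pairwise Win Fractions

Finally, we can compute pairwise win fractions. Because each model can compete as either Model A or Model B and can win in either position, we add up its wins in both configurations and divide by the total number of battles between the pair.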
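
To make the counting concrete, here is a minimal pure-Python sketch on a made-up set of four battles. The models `x` and `y`, the `toy_battles` list, and the `win_fraction` helper are hypothetical and only illustrate the idea; the notebook's own implementation uses pivot tables.

```python
from collections import Counter

# Hypothetical toy battles: (model_a, model_b, winner)
toy_battles = [
    ("x", "y", "model_a"),  # x wins as Model A
    ("y", "x", "model_b"),  # x wins as Model B
    ("y", "x", "model_a"),  # y wins as Model A
    ("x", "y", "model_a"),  # x wins as Model A
]

wins = Counter()   # (winner, loser) -> number of wins
games = Counter()  # unordered pair -> number of battles

for model_a, model_b, winner in toy_battles:
    games[frozenset((model_a, model_b))] += 1
    if winner == "model_a":
        wins[(model_a, model_b)] += 1
    elif winner == "model_b":
        wins[(model_b, model_a)] += 1

def win_fraction(row, col):
    """Fraction of battles between row and col that row won, in either seat."""
    return wins[(row, col)] / games[frozenset((row, col))]

print(win_fraction("x", "y"))  # 0.75
print(win_fraction("y", "x"))  # 0.25
```

Counting only the battles where `x` sat in seat A would give 2 wins out of 2, which overstates its record; combining both seats gives 3 wins out of 4.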

In [21]:
def compute_pairwise_win_fraction(battles, max_num_models=30):
    # Times each model wins as Model A
    a_win_ptbl = pd.pivot_table(
        battles[battles['winner'] == "model_a"],
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Table counting times each model wins as Model B
    b_win_ptbl = pd.pivot_table(
        battles[battles['winner'] == "model_b"],
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Table counting number of A-B pairs
    num_battles_ptbl = pd.pivot_table(battles,
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Computing the proportion of wins for each model as A and as B
    # against all other models
    row_beats_col_freq = (
        (a_win_ptbl + b_win_ptbl.T) /
        (num_battles_ptbl + num_battles_ptbl.T)
    )

    # Arrange ordering according to proportion of wins
    prop_wins = row_beats_col_freq.mean(axis=1).sort_values(ascending=False)
    prop_wins = prop_wins[:max_num_models]
    model_names = list(prop_wins.keys())
    row_beats_col = row_beats_col_freq.loc[model_names, model_names]
    return row_beats_col

def visualize_pairwise_win_fraction(battles, title, max_num_models=30):
    row_beats_col = compute_pairwise_win_fraction(battles, max_num_models)
    fig = px.imshow(row_beats_col, color_continuous_scale='RdBu',
                    text_auto=".2f", title=title)
    fig.update_layout(xaxis_title="Model B: Loser",
                  yaxis_title="Model A: Winner",
                  xaxis_side="top", height=1200, width=1200,
                  title_y=0.07, title_x=0.5)
    fig.update_traces(hovertemplate=
                  "Model A: %{y}<br>Model B: %{x}<br>Fraction of A Wins: %{z}<extra></extra>")

    return fig

In [23]:
fig = visualize_pairwise_win_fraction(battles_no_ties,
      title = "Fraction of Model A Wins for All Non-tied A vs. B Battles")
fig

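## Preliminary Ranking

Using only the average win rate against all other models, we can already compute an estimated leaderboard.
However, this approach may not scale as well as the Elo rating system used later, because it requires data from every combination of models.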

In [24]:
row_beats_col_freq = compute_pairwise_win_fraction(battles_no_ties)
fig = px.bar(row_beats_col_freq.mean(axis=1).sort_values(ascending=False),
             title="Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)",
             text_auto=".2f")
fig.update_layout(yaxis_title="Average Win Rate", xaxis_title="Model",
                  showlegend=False)
fig

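# Trustworthy Ratings

The [**Elo rating system**](https://en.wikipedia.org/wiki/Elo_rating_system) (the formulas on that page are worth a look) is a method for calculating the relative skill levels of players, and it is widely adopted in chess and other competitive games. The difference between two players' ratings serves as a predictor of the outcome of a match. The Elo rating system fits our case well because we have multiple models and run pairwise battles between them.
In this section, we present different methods for computing Elo ratings.

### Computing Ratings
We first compute Elo ratings with the online linear update algorithm.
We choose a small K-factor of 4 to make the ratings more stable and less biased toward recent games.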
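
For reference, the update rule can be written out as follows, using the defaults of this notebook (base 10, scale 400). The symbols $R_A$, $R_B$, $E_A$, $E_B$ and $S_A$ are notation introduced here for illustration; $S_A$ is 1 if model A wins, 0 if model B wins, and 0.5 for a tie.

$$
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad E_B = \frac{1}{1 + 10^{(R_A - R_B)/400}} = 1 - E_A
$$

$$
R_A \leftarrow R_A + K\,(S_A - E_A), \qquad R_B \leftarrow R_B + K\,\big((1 - S_A) - E_B\big)
$$

As a worked example, if both models start at 1000 and A wins with $K = 4$, then $E_A = 0.5$, so $R_A$ becomes $1000 + 4\,(1 - 0.5) = 1002$ and $R_B$ becomes $1000 + 4\,(0 - 0.5) = 998$. Because $E_A + E_B = 1$, the two updates are always equal and opposite.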

In [25]:
def compute_online_elo(battles, K=4, SCALE=400, BASE=10, INIT_RATING=1000):
    rating = defaultdict(lambda: INIT_RATING)

    for rd, model_a, model_b, winner in battles[['model_a', 'model_b', 'winner']].itertuples():
        ra = rating[model_a]
        rb = rating[model_b]
        ea = 1 / (1 + BASE ** ((rb - ra) / SCALE))
        eb = 1 / (1 + BASE ** ((ra - rb) / SCALE))
        if winner == "model_a":
            sa = 1
        elif winner == "model_b":
            sa = 0
        elif winner == "tie" or winner == "tie (bothbad)":
            sa = 0.5
        else:
            raise Exception(f"unexpected vote {winner}")
        rating[model_a] += K * (sa - ea)
        rating[model_b] += K * (1 - sa - eb)

    # Calibrate llama-13b to 800
    delta = (800-rating["llama-13b"])
    for model in battles["model_a"].unique():
        rating[model] += delta

    return rating

In [27]:
def preety_print_model_ratings(ratings):
    df = pd.DataFrame([
        [n, ratings[n]] for n in ratings.keys()
    ], columns=["Model", "Elo rating"]).sort_values("Elo rating", ascending=False).reset_index(drop=True)
    df["Elo rating"] = (df["Elo rating"] + 0.5).astype(int)
    df.index = df.index + 1
    return df

online_elo_ratings = compute_online_elo(battles)
preety_print_model_ratings(online_elo_ratings)

Unnamed: 0,Model,Elo rating
1,gpt-4-0125-preview,1123
2,claude-3-opus-20240229,1123
3,gpt-4-1106-preview,1115
4,claude-3-sonnet-20240229,1086
5,claude-3-haiku-20240307,1053
...,...,...
72,chatglm3-6b,821
73,oasst-pythia-12b,818
74,fastchat-t5-3b,802
75,llama-13b,800


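However, even with a small K-factor, we still find this online update algorithm to be unstable.

To demonstrate this, we recompute the Elo ratings with the game order reversed and observe significant differences, caused by the online update's bias toward recent games.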
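
The order dependence is easy to reproduce on a toy history. The sketch below is a simplified, hypothetical re-implementation of the online update (the `online_elo` function, the models `x` and `y`, and the 40-game history are all made up for illustration); it applies the same games in forward and reverse order.

```python
def online_elo(battles, k=4, scale=400, base=10, init=1000):
    """Sequential Elo update over (model_a, model_b, score_a) tuples."""
    rating = {}
    for a, b, score_a in battles:
        ra, rb = rating.get(a, init), rating.get(b, init)
        ea = 1 / (1 + base ** ((rb - ra) / scale))  # expected score of a
        rating[a] = ra + k * (score_a - ea)
        rating[b] = rb + k * ((1 - score_a) - (1 - ea))
    return rating

# Hypothetical history: x wins the first 20 games, then y wins the last 20.
history = [("x", "y", 1.0)] * 20 + [("x", "y", 0.0)] * 20

forward = online_elo(history)
backward = online_elo(history[::-1])

print(forward["x"] - forward["y"])    # negative: y's recent wins dominate
print(backward["x"] - backward["y"])  # positive: x's wins are now the recent ones
```

Both runs see exactly the same 20-20 record, yet the sign of the rating gap flips with the order. A fit that ignores order would rate the two models equally.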

In [28]:
def preety_print_two_ratings(ratings_1, ratings_2, column_names):
    df = pd.DataFrame([
        [n, ratings_1[n], ratings_2[n]] for n in ratings_1.keys()
    ], columns=["Model", column_names[0], column_names[1]]).sort_values(column_names[0], ascending=False).reset_index(drop=True)
    df[column_names[0]] = (df[column_names[0]] + 0.5).astype(int)
    df[column_names[1]] = (df[column_names[1]] + 0.5).astype(int)
    df.index = df.index + 1
    return df

# These are online Elo ratings computed on the reversed battle order (not MLE ratings)
online_elo_ratings_reverse = compute_online_elo(battles.iloc[::-1])
preety_print_two_ratings(online_elo_ratings,
                         online_elo_ratings_reverse,
                         column_names=["Elo rating", "Elo rating with reverse order"])

Unnamed: 0,Model,Elo rating,Elo rating with reverse order
1,gpt-4-0125-preview,1123,1159
2,claude-3-opus-20240229,1123,1121
3,gpt-4-1106-preview,1115,1173
4,claude-3-sonnet-20240229,1086,1063
5,claude-3-haiku-20240307,1053,1052
...,...,...,...
72,chatglm3-6b,821,888
73,oasst-pythia-12b,818,905
74,fastchat-t5-3b,802,867
75,llama-13b,800,800


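### [Maximum Likelihood Estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) of Elo Ratings (a.k.a. the [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model))

In the context of LLM evaluation, the models can be assumed to be static. In that case, we can fit the ratings directly by maximum likelihood estimation (the Bradley-Terry model), which yields very stable ratings. Here we provide an implementation based on logistic regression.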
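
The reduction to logistic regression can be made explicit. The symbols $R_A$, $R_B$, $\sigma$, $x$ and $\beta$ below are notation introduced here for illustration. With base 10 and scale 400, the Elo win probability is a logistic function of the rating difference:

$$
P(A \text{ beats } B) = \frac{1}{1 + 10^{(R_B - R_A)/400}} = \sigma\!\left(\ln 10 \cdot \frac{R_A - R_B}{400}\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

If each battle is encoded as a feature vector $x$ with $+\ln 10$ in model A's column, $-\ln 10$ in model B's column, and zeros elsewhere, then a logistic regression without intercept and with coefficients $\beta$ predicts

$$
P(A \text{ beats } B) = \sigma(\beta^\top x) = \sigma\big(\ln 10 \cdot (\beta_A - \beta_B)\big)
$$

Matching the two expressions gives $\beta_m = R_m / 400$ up to an additive constant, so ratings are recovered as $R_m = 400\,\beta_m + 1000$, with the constant then fixed by anchoring one model. One caveat: scikit-learn's `LogisticRegression` applies L2 regularization by default (`C=1.0`), so a fit with default settings is a lightly penalized estimate rather than an exact MLE.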

In [29]:
def compute_mle_elo(df, SCALE=400, BASE=10, INIT_RATING=1000):
    from sklearn.linear_model import LogisticRegression
    models = pd.concat([df["model_a"], df["model_b"]]).unique()
    models = pd.Series(np.arange(len(models)), index=models)

    # duplicate battles
    df = pd.concat([df, df], ignore_index=True)
    p = len(models.index)
    n = df.shape[0]

    X = np.zeros([n, p])
    X[np.arange(n), models[df["model_a"]]] = +math.log(BASE)
    X[np.arange(n), models[df["model_b"]]] = -math.log(BASE)

    # one A win => two A win
    Y = np.zeros(n)
    Y[df["winner"] == "model_a"] = 1.0

    # one tie => one A win + one B win
    # find tie + tie (both bad) index
    tie_idx = (df["winner"] == "tie") | (df["winner"] == "tie (bothbad)")
    tie_idx[len(tie_idx)//2:] = False
    Y[tie_idx] = 1.0

    lr = LogisticRegression(fit_intercept=False)
    lr.fit(X,Y)

    elo_scores = SCALE * lr.coef_[0] + INIT_RATING

    # set anchor as llama-2-70b-chat = 1082
    if "llama-2-70b-chat" in models.index:
        elo_scores += 1082 - elo_scores[models["llama-2-70b-chat"]]
    return pd.Series(elo_scores, index = models.index).sort_values(ascending=False)

In [30]:
elo_mle_ratings = compute_mle_elo(battles)
preety_print_model_ratings(elo_mle_ratings)

Unnamed: 0,Model,Elo rating
1,claude-3-opus-20240229,1254
2,gpt-4-1106-preview,1251
3,gpt-4-0125-preview,1248
4,bard-jan-24-gemini-pro,1204
5,claude-3-sonnet-20240229,1199
...,...,...
72,chatglm-6b,879
73,fastchat-t5-3b,870
74,stablelm-tuned-alpha-7b,842
75,dolly-v2-12b,819


### Bootstrap Confidence Intervals for MLE Elo Scores

We can further use the bootstrap to estimate confidence intervals.
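
As a minimal illustration of the percentile bootstrap, the sketch below resamples a made-up list of win/loss outcomes with replacement and reads off the 2.5%, 50% and 97.5% quantiles of the resampled win rate. The `bootstrap_interval` helper and the `outcomes` data are hypothetical; the notebook applies the same idea to whole battle tables and Elo fits.

```python
import random
import statistics

def bootstrap_interval(sample, stat, num_rounds=1000, seed=0):
    """Percentile bootstrap: resample with replacement, recompute stat each time."""
    rng = random.Random(seed)
    estimates = sorted(
        stat(rng.choices(sample, k=len(sample))) for _ in range(num_rounds)
    )
    lower = estimates[num_rounds * 25 // 1000]       # about the 2.5% quantile
    upper = estimates[num_rounds * 975 // 1000 - 1]  # about the 97.5% quantile
    return lower, statistics.median(estimates), upper

# Hypothetical outcomes for one model: 1 = win, 0 = loss (60 wins in 100 games).
outcomes = [1] * 60 + [0] * 40
lower, median, upper = bootstrap_interval(outcomes, statistics.mean)
print(lower, median, upper)
```

The interval width reflects how much the estimate would move if the battles had been drawn slightly differently, which is the quantity of interest when comparing closely ranked models.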

In [31]:
def get_bootstrap_result(battles, func_compute_elo, num_round):
    rows = []
    for i in tqdm(range(num_round), desc="bootstrap"):
        rows.append(func_compute_elo(battles.sample(frac=1.0, replace=True)))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]


In [32]:
BOOTSTRAP_ROUNDS = 100

np.random.seed(42)
bootstrap_elo_lu = get_bootstrap_result(battles, compute_mle_elo, BOOTSTRAP_ROUNDS)

bootstrap: 100%|██████████| 100/100 [19:09<00:00, 11.49s/it]


In [34]:
def visualize_bootstrap_scores(df, title):
    bars = pd.DataFrame(dict(
        lower = df.quantile(.025),
        rating = df.quantile(.5),
        upper = df.quantile(.975))).reset_index(names="model").sort_values("rating", ascending=False)
    bars['error_y'] = bars['upper'] - bars["rating"]
    bars['error_y_minus'] = bars['rating'] - bars["lower"]
    bars['rating_rounded'] = np.round(bars['rating'], 2)
    fig = px.scatter(bars, x="model", y="rating", error_y="error_y",
                     error_y_minus="error_y_minus", text="rating_rounded",
                     title=title)
    fig.update_layout(xaxis_title="Model", yaxis_title="Rating",
                      height=1000)
    return fig

fig = visualize_bootstrap_scores(bootstrap_elo_lu, "Bootstrap of MLE Elo Rating Estimates")
fig

Previously, we applied bootstrapping to online Elo to obtain more stable ratings.

In [35]:
np.random.seed(42)
bootstrap_online_elo = get_bootstrap_result(battles, compute_online_elo, BOOTSTRAP_ROUNDS)

bootstrap: 100%|██████████| 100/100 [03:42<00:00,  2.23s/it]


We can see that the bootstrap medians from the two methods are similar.

In [37]:
preety_print_two_ratings(bootstrap_elo_lu.quantile(.5),
                         bootstrap_online_elo.quantile(.5),
                         column_names=["Bootstrap Median of MLE Elo", "Bootstrap Median of Online Elo"])

Unnamed: 0,Model,Bootstrap Median of MLE Elo,Bootstrap Median of Online Elo
1,claude-3-opus-20240229,1253,1257
2,gpt-4-1106-preview,1251,1259
3,gpt-4-0125-preview,1248,1252
4,bard-jan-24-gemini-pro,1203,1210
5,claude-3-sonnet-20240229,1199,1204
...,...,...,...
72,chatglm-6b,879,884
73,fastchat-t5-3b,870,878
74,stablelm-tuned-alpha-7b,841,846
75,dolly-v2-12b,818,822


However, the confidence intervals of online Elo are noticeably wider than those of MLE Elo.

In [38]:
fig = visualize_bootstrap_scores(bootstrap_online_elo, "Bootstrap of Online Elo Rating Estimates")
fig

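### Predicting Win Rates
Elo ratings let us predict win probabilities. Comparing the predicted win rates with the actual ones gives a sense of the accuracy and quality of the Elo rating system.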
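
A few worked values show how rating gaps translate into win probabilities. The `win_probability` helper below is a hypothetical standalone version of the formula, using the same base 10 and scale 400 as the rest of the notebook.

```python
def win_probability(rating_a, rating_b, scale=400, base=10):
    """Predicted probability that A beats B given their Elo ratings."""
    return 1 / (1 + base ** ((rating_b - rating_a) / scale))

print(round(win_probability(1000, 1000), 3))  # 0.5
print(round(win_probability(1200, 800), 3))   # 0.909 (a 400-point gap means 10:1 odds)
print(round(win_probability(1100, 1000), 3))  # 0.64
```

Only the rating difference matters, so shifting every rating by the same constant (as the anchoring steps do) leaves all predicted win rates unchanged.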

In [39]:
def predict_win_rate(elo_ratings, SCALE=400, BASE=10, INIT_RATING=1000):
    names = sorted(list(elo_ratings.keys()))
    wins = defaultdict(lambda: defaultdict(lambda: 0))
    for a in names:
        for b in names:
            ea = 1 / (1 + BASE ** ((elo_ratings[b] - elo_ratings[a]) / SCALE))
            wins[a][b] = ea
            wins[b][a] = 1 - ea

    data = {
        a: [wins[a][b] if a != b else np.nan for b in names]
        for a in names
    }

    df = pd.DataFrame(data, index=names)
    df.index.name = "model_a"
    df.columns.name = "model_b"
    return df.T

In [41]:
win_rate = predict_win_rate(dict(bootstrap_elo_lu.quantile(0.5)))
ordered_models = win_rate.mean(axis=1).sort_values(ascending=False).index
ordered_models = ordered_models[:30]
fig = px.imshow(win_rate.loc[ordered_models, ordered_models],
                color_continuous_scale='RdBu', text_auto=".2f",
                title="Predicted Win Rate Using Elo Ratings for Model A in an A vs. B Battle")
fig.update_layout(xaxis_title="Model B",
                  yaxis_title="Model A",
                  xaxis_side="top", height=1200, width=1200,
                  title_y=0.07, title_x=0.5)
fig.update_traces(hovertemplate=
                  "Model A: %{y}<br>Model B: %{x}<br>Win Rate: %{z}<extra></extra>")
fig

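### Bootstrap Confidence Intervals Assuming Uniform Sampling

We also study how the ratings would change if we sampled the same number of battles for every model pair.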
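
The idea of even sampling can be sketched in pure Python: group battles by their ordered `(model_a, model_b)` pair and draw the same number from each group with replacement. The `sample_even` helper and the toy data are hypothetical and only mirror the grouping logic in spirit.

```python
import random
from collections import Counter, defaultdict

def sample_even(battles, n_per_pair, seed=0):
    """Draw n_per_pair battles with replacement from each (model_a, model_b) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for battle in battles:
        groups[(battle[0], battle[1])].append(battle)
    resampled = []
    for group in groups.values():
        resampled.extend(rng.choices(group, k=n_per_pair))
    return resampled

# Hypothetical imbalanced data: 90 battles of x vs y, only 3 of x vs z.
toy = [("x", "y", "model_a")] * 90 + [("x", "z", "model_b")] * 3
counts = Counter((a, b) for a, b, _ in sample_even(toy, n_per_pair=5))
print(counts)  # each ordered pair now appears exactly 5 times
```

Sampling with replacement is what lets a rare pairing (3 battles here) still contribute 5 rows, at the cost of repeating some of its battles.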

In [42]:
def sample_battle_even(battles, n_per_battle):
    groups = battles.groupby(["model_a", "model_b"], as_index=False)
    resampled = (groups
                 .apply(lambda grp: grp.sample(n_per_battle, replace=True))
                 .reset_index(drop=True))
    return resampled

In [43]:
num_samples = 50
battles_even = sample_battle_even(battles, num_samples)
pd.pivot_table(battles_even, index="model_a", columns="model_b", aggfunc="size", fill_value=0)

model_a \ model_b,RWKV-4-Raven-14B,alpaca-13b,bard-jan-24-gemini-pro,chatglm-6b,chatglm2-6b,chatglm3-6b,claude-1,claude-2.0,claude-2.1,claude-3-haiku-20240307,...,stripedhyena-nous-7b,tulu-2-dpo-70b,vicuna-13b,vicuna-33b,vicuna-7b,wizardlm-13b,wizardlm-70b,yi-34b-chat,zephyr-7b-alpha,zephyr-7b-beta
RWKV-4-Raven-14B,0,50,0,50,0,0,50,50,0,0,...,0,0,50,50,50,50,0,0,0,0
alpaca-13b,50,0,0,50,0,0,50,50,0,0,...,0,0,50,50,50,50,0,0,0,0
bard-jan-24-gemini-pro,0,0,0,0,0,50,50,50,50,0,...,50,50,50,50,0,0,50,50,0,50
chatglm-6b,50,50,0,0,0,0,50,50,0,0,...,0,0,50,50,50,50,0,0,0,0
chatglm2-6b,0,50,0,0,0,50,50,50,50,0,...,0,50,50,50,50,50,50,50,50,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
wizardlm-13b,50,50,0,50,50,50,50,50,50,0,...,0,50,50,50,50,0,50,50,50,50
wizardlm-70b,0,0,50,0,50,50,50,50,50,0,...,50,50,50,50,50,50,0,50,50,50
yi-34b-chat,0,0,50,0,50,50,50,50,50,0,...,50,50,50,50,50,50,50,0,0,50
zephyr-7b-alpha,0,0,0,0,50,0,50,50,0,0,...,0,0,50,50,50,50,50,0,0,50


In [44]:
# Sampling Battles Evenly
def get_bootstrap_even_sample(battles, n_per_battle, func_compute_elo, num_round=BOOTSTRAP_ROUNDS):
    rows = []
    for n in tqdm(range(num_round), desc="sampling battles evenly"):
        resampled = sample_battle_even(battles, n_per_battle)
        rows.append(func_compute_elo(resampled))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]

In [45]:
print("number of samples per battle pair:", num_samples)
bootstrap_even_lu = get_bootstrap_even_sample(battles, num_samples, compute_mle_elo, num_round=100)

number of samples per battle pair: 50


sampling battles evenly: 100%|██████████| 100/100 [07:28<00:00,  4.49s/it]


In [46]:
fig = visualize_bootstrap_scores(bootstrap_even_lu, f"Bootstrap of MLE Elo Estimates - Even sample")
fig

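# Language-specific Leaderboards
We present two language-specific leaderboards by splitting the battle data by language into two subsets: (1) English only and (2) non-English.

## English Only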

In [47]:
english_only_battles = battles[battles["language"] == "English"]
elo_ratings = compute_mle_elo(english_only_battles)
pd.DataFrame(elo_ratings)

Unnamed: 0,0
gpt-4-1106-preview,1232.36
gpt-4-0125-preview,1231.02
claude-3-opus-20240229,1219.20
bard-jan-24-gemini-pro,1178.45
claude-3-sonnet-20240229,1172.52
...,...
fastchat-t5-3b,869.54
chatglm-6b,855.11
stablelm-tuned-alpha-7b,829.35
dolly-v2-12b,799.24


## Non-English

In [48]:
non_english_battles = battles[battles["language"] != "English"]
elo_ratings = compute_mle_elo(non_english_battles)
pd.DataFrame(elo_ratings)

Unnamed: 0,0
claude-3-opus-20240229,1317.63
gpt-4-1106-preview,1296.71
gpt-4-0125-preview,1289.39
bard-jan-24-gemini-pro,1276.47
claude-3-sonnet-20240229,1251.20
...,...
oasst-pythia-12b,899.71
dolly-v2-12b,870.12
stablelm-tuned-alpha-7b,865.49
llama-13b,855.12


# Links


Resources for learning more about the Elo rating system:
- Elo rating system: https://en.wikipedia.org/wiki/Elo_rating_system

- Bradley-Terry model: https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model

- An introductory video: https://www.youtube.com/watch?v=AsYfbmp0To0

- An article from FiveThirtyEight: https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/