Fix Brier Score (#1847)

`gold_one_hot` needs to match the dimensions of the predictions so that the metric still works when `--limit` is used and the labels present in `gold` do not cover every class index.
EleutherAI · May 24, 2024 · 7d747ea
1 parent 5f3a662
Showing 1 changed file with 3 additions and 2 deletions.
diff --git a/lm_eval/api/metrics.py b/lm_eval/api/metrics.py
@@ -119,9 +119,10 @@ def ter(items):
 @register_aggregation("brier_score")
 def brier_score(items):  # This is a passthrough function
     gold, predictions = list(zip(*items))
+    bs, num_class = np.array(predictions).shape
+
     gold = list(gold)
-    gold_one_hot = np.eye(np.max(gold) + 1)[gold]
-    predictions = list(zip(*items))[1]
+    gold_one_hot = np.eye(num_class)[gold]
     return np.mean(np.sum((predictions - gold_one_hot) ** 2, axis=1))
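
The failure mode this change addresses can be reproduced outside the harness. The sketch below is illustrative only: `brier_score_old` and `brier_score_new` are hypothetical standalone copies of the function body before and after this commit (without the `@register_aggregation` decorator), and the `items` sample is made up to imitate a `--limit` run in which a 3-class task only yields gold labels 0 and 1.

```python
import numpy as np


def brier_score_old(items):
    # Pre-fix logic: the one-hot width is inferred from the largest gold label seen.
    gold, predictions = list(zip(*items))
    gold = list(gold)
    gold_one_hot = np.eye(np.max(gold) + 1)[gold]
    return np.mean(np.sum((predictions - gold_one_hot) ** 2, axis=1))


def brier_score_new(items):
    # Post-fix logic: the one-hot width is taken from the prediction matrix.
    gold, predictions = list(zip(*items))
    _, num_class = np.array(predictions).shape
    gold = list(gold)
    gold_one_hot = np.eye(num_class)[gold]
    return np.mean(np.sum((predictions - gold_one_hot) ** 2, axis=1))


# Made-up sample: three classes, but label 2 never appears in the gold column.
items = [
    (0, [0.7, 0.2, 0.1]),
    (1, [0.1, 0.8, 0.1]),
]

try:
    brier_score_old(items)
    old_failed = False
except ValueError:
    # (2, 3) predictions cannot be broadcast against a (2, 2) one-hot matrix.
    old_failed = True

new_score = brier_score_new(items)
print(old_failed, new_score)
```

With this sample the old version builds a 2-column one-hot matrix and the subtraction against 3-column predictions fails, while the new version sizes the matrix from the predictions and returns a score.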