# Part III: Test-time Scaling on Qwen Models

## Task 1: Run generation with different temperatures

This notebook implements test-time scaling experiments on three math problem-solving benchmarks:
- Math500 (500 problems, 16 rollouts per problem)
- AMC23 (40 problems, 64 rollouts per problem)
- AIME25 (30 problems, 64 rollouts per problem)

We evaluate two models at three sampling temperatures (0.6, 1.0, 1.2), keeping top-p fixed at 0.9 in every run.

**Models:**
- Base model: Qwen/Qwen2.5-Math-1.5B
- GRPO-tuned model: Qwen/Qwen2.5-Math-1.5B-Instruct
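
The sweep described above (2 models × 3 datasets × 3 temperatures = 18 runs) can be enumerated programmatically. The following is a minimal sketch, not part of the project code: the names `MODELS`, `DATASETS`, and `build_runs` are hypothetical, while the flags and the output naming pattern are copied from the cells in this notebook.

```python
from itertools import product

# Short tag used in output file names -> Hugging Face model id.
MODELS = {
    "base": "Qwen/Qwen2.5-Math-1.5B",
    "instruct": "Qwen/Qwen2.5-Math-1.5B-Instruct",
}
# --dataset flag -> (output file prefix, number of problems, rollouts per problem)
DATASETS = {
    "math": ("math500", 500, 16),
    "amc": ("amc23", 40, 64),
    "aime": ("aime25", 30, 64),
}
TEMPERATURES = (0.6, 1.0, 1.2)
TOP_P = 0.9


def build_runs():
    """Return one dict per (model, dataset, temperature) configuration."""
    runs = []
    for (tag, model), (flag, spec), temp in product(
        MODELS.items(), DATASETS.items(), TEMPERATURES
    ):
        prefix, n_problems, rollouts = spec
        output_file = f"outputs/{prefix}_{tag}_t{temp}.jsonl"
        command = (
            f'python inference.py --model "{model}" --dataset "{flag}" '
            f"--dp-size 2 --batch-size 16 --rollout-n {rollouts} "
            f"--temperature {temp} --top-p {TOP_P} --output_file {output_file}"
        )
        runs.append(
            {
                "command": command,
                "output_file": output_file,
                # One JSONL line per rollout, so problems x rollouts lines in total.
                "expected_lines": n_problems * rollouts,
            }
        )
    return runs


for run in build_runs():
    print(run["expected_lines"], run["output_file"])
```

The `expected_lines` field gives a quick sanity check against the line counts that `evaluate.py` reports (8000 for Math500, 2560 for AMC23, 1920 for AIME25).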

**Expected outputs:**
- Each configuration produces one JSONL file of model generations, with one line per rollout
- Each file is then scored with pass@k (k = 1, 2, 4, ... up to the rollout count) and with majority voting over the rollouts
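
For reference, the two metrics can be sketched as follows. This is an illustration under stated assumptions, not the code in `evaluate.py` (which is not shown here): it uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n rollouts of which c are correct, and it compares answers by exact string match, whereas the notebook relies on a math verifier. The function names are hypothetical.

```python
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k sampled rollouts is correct)."""
    if k > n:
        raise ValueError("k cannot exceed the number of rollouts n")
    if n - c < k:
        # Fewer than k incorrect rollouts: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def score_problem(answers, reference, ks):
    """Score one problem given its rollout answers and the reference answer."""
    n = len(answers)
    c = sum(answer == reference for answer in answers)
    scores = {f"pass@{k}": pass_at_k(n, c, k) for k in ks}
    scores["maj"] = 1.0 if majority_vote(answers) == reference else 0.0
    return scores


# Toy example: 4 rollouts, 2 of them correct.
print(score_problem(["4", "5", "4", "7"], reference="4", ks=(1, 2, 4)))
```

Averaging these per-problem scores over a dataset gives dataset-level numbers. Under this estimator, pass@k with k equal to the rollout count reduces to the fraction of problems with at least one correct rollout.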

In [1]:
# This notebook is intended to run on Kaggle; vLLM is installed offline from an attached dataset.
!pip install --no-index --find-links=/kaggle/input/it-is-vllm-0-8-5 -q vllm
!pip install -q pylatexenc "math-verify[antlr4_9_3]"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.12.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
dask-cuda 25.2.0 requires numba<0.61.0a0,>=0.59.1, but you have numba 0.61.2 which is incompatible.
cuml-cu12 25.2.1 requires numba<0.61.0a0,>=0.59.1, but you have numba 0.61.2 which is incompatible.
cudf-cu12 25.2.2 requires numba<0.61.0a0,>=0.59.1, but you have numba 0.61.2 which is incompatible.
distributed-ucxx-cu12 0.42.0 requires numba<0.61.0a0,>=0.59.1, but you have numba 0.61.2 which is incompatible.
google-adk 1.18.0 requires opentelemetry-api<=1.37.0,>=1.37.0, but you have opentelemetry-api 1.26.0 which is incompatible.
google-adk 1.18.0 requires opentelemetry-exporter-otlp-proto-http>=1.36.0, but you have opentelemetry-exporter-otlp-proto-http 1.26.0 which is incompatible.
google-adk 1.18.0 requi

In [2]:
!mkdir -p /kaggle/working/src
!cp -r /kaggle/input/mlproject/* /kaggle/working/src

%cd /kaggle/working/src
!ls

/kaggle/working/src
evaluate.py  inference.py  verifier


## I. Base Model: Qwen/Qwen2.5-Math-1.5B

### Math500 Dataset (rollout-n = 16)

#### Temperature = 0.6 (more deterministic)
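
To make the effect of the two sampling knobs concrete, the sketch below applies temperature scaling and top-p (nucleus) filtering to a toy logit vector. It is a pure-Python illustration with hypothetical function names and made-up logits; the inference engine's actual sampler may differ in details such as tie handling.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up values for illustration
for t in (0.6, 1.0, 1.2):
    probs = top_p_filter(softmax_with_temperature(toy_logits, t), top_p=0.9)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-logit token, which is why 0.6 is labeled "more deterministic" and 1.2 "more random".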

In [3]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B" \
  --dataset "math" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 16 \
  --temperature 0.6 \
  --top-p 0.9 \
  --output_file outputs/math500_base_t0.6.jsonl

2025-12-02 14:05:16.329676: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764684316.509205      61 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764684316.567206      61 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 14:05:34 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 14:05:43.682068: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 14:05:43.682125: E external/local_xla/xla/stream_exec

In [4]:
!python evaluate.py \
  --input_file outputs/math500_base_t0.6.jsonl \
  --output_file outputs/math500_base_t0.6_eval.jsonl

Counting lines in outputs/math500_base_t0.6.jsonl...
Scoring generations from outputs/math500_base_t0.6.jsonl...
Processing lines: 100%|████████████████████| 8000/8000 [00:31<00:00, 256.46it/s]
Processing complete. Scored 8000 lines across 500 unique problems.

Pass@k Metrics:
  pass@1   : 35.99%
  pass@2   : 51.28%
  pass@4   : 65.74%
  pass@8   : 77.06%
  pass@16  : 84.60%

Majority Vote Metric:
  maj@1    : 45.60%

Scored results saved to outputs/math500_base_t0.6_eval.jsonl


#### Temperature = 1.0 (unscaled logits)

In [5]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B" \
  --dataset "math" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 16 \
  --temperature 1.0 \
  --top-p 0.9 \
  --output_file outputs/math500_base_t1.0.jsonl

2025-12-02 15:08:05.122668: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764688085.147077     266 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764688085.155083     266 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 15:08:11 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 15:08:19.248150: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764688099.269465     284 cuda_dnn.cc:8310] Unable t

In [6]:
!python evaluate.py \
  --input_file outputs/math500_base_t1.0.jsonl \
  --output_file outputs/math500_base_t1.0_eval.jsonl

Counting lines in outputs/math500_base_t1.0.jsonl...
Scoring generations from outputs/math500_base_t1.0.jsonl...
Processing lines:  33%|██████▌             | 2619/8000 [00:15<00:28, 190.70it/s]Timeout during comparison
Processing lines:  52%|██████████▍         | 4184/8000 [00:23<00:14, 272.15it/s]Timeout during comparison
Processing lines: 100%|████████████████████| 8000/8000 [00:46<00:00, 170.98it/s]
Processing complete. Scored 8000 lines across 500 unique problems.

Pass@k Metrics:
  pass@1   : 28.85%
  pass@2   : 44.83%
  pass@4   : 61.93%
  pass@8   : 76.61%
  pass@16  : 86.00%

Majority Vote Metric:
  maj@1    : 41.40%

Scored results saved to outputs/math500_base_t1.0_eval.jsonl


#### Temperature = 1.2 (more random)

In [7]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B" \
  --dataset "math" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 16 \
  --temperature 1.2 \
  --top-p 0.9 \
  --output_file outputs/math500_base_t1.2.jsonl

2025-12-02 16:06:44.107478: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764691604.128499     437 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764691604.134815     437 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 16:06:50 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 16:06:58.159315: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 16:06:58.170439: E external/local_xla/xla/stream_exec

In [8]:
!python evaluate.py \
  --input_file outputs/math500_base_t1.2.jsonl \
  --output_file outputs/math500_base_t1.2_eval.jsonl

Counting lines in outputs/math500_base_t1.2.jsonl...
Scoring generations from outputs/math500_base_t1.2.jsonl...
Processing lines:   8%|█▋                   | 629/8000 [00:03<00:31, 233.63it/s]Timeout during comparison
Processing lines:  50%|██████████          | 4029/8000 [00:20<00:31, 124.60it/s]Timeout during comparison
Processing lines:  91%|██████████████████▏ | 7266/8000 [00:36<00:02, 305.53it/s]Timeout during comparison
Processing lines: 100%|████████████████████| 8000/8000 [00:42<00:00, 187.97it/s]
Processing complete. Scored 8000 lines across 500 unique problems.

Pass@k Metrics:
  pass@1   : 18.80%
  pass@2   : 31.79%
  pass@4   : 48.68%
  pass@8   : 66.21%
  pass@16  : 80.00%

Majority Vote Metric:
  maj@1    : 17.00%

Scored results saved to outputs/math500_base_t1.2_eval.jsonl


### AMC23 Dataset (rollout-n = 64)

#### Temperature = 0.6

In [9]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B" \
  --dataset "amc" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 0.6 \
  --top-p 0.9 \
  --output_file outputs/amc23_base_t0.6.jsonl

2025-12-02 17:02:49.303038: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764694969.326800     608 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764694969.334047     608 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 17:02:55 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 17:03:03.087007: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 17:03:03.087479: E external/local_xla/xla/stream_exec

In [10]:
!python evaluate.py \
  --input_file outputs/amc23_base_t0.6.jsonl \
  --output_file outputs/amc23_base_t0.6_eval.jsonl

Counting lines in outputs/amc23_base_t0.6.jsonl...
Scoring generations from outputs/amc23_base_t0.6.jsonl...
Processing lines: 100%|████████████████████| 2560/2560 [00:06<00:00, 413.79it/s]
Processing complete. Scored 2560 lines across 40 unique problems.

Pass@k Metrics:
  pass@1   : 34.45%
  pass@2   : 48.57%
  pass@4   : 61.68%
  pass@8   : 72.23%
  pass@16  : 80.21%
  pass@32  : 86.84%
  pass@64  : 92.50%

Majority Vote Metric:
  maj@1    : 50.00%

Scored results saved to outputs/amc23_base_t0.6_eval.jsonl


#### Temperature = 1.0

In [11]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B" \
  --dataset "amc" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 1.0 \
  --top-p 0.9 \
  --output_file outputs/amc23_base_t1.0.jsonl

2025-12-02 17:25:26.262862: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764696326.282736     794 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764696326.288797     794 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 17:25:32 [__init__.py:239] Automatically detected platform cuda.
DP Rank 1 (Local 1) mapped to GPU(s): 1
DP Rank 0 (Local 0) mapped to GPU(s): 0
2025-12-02 17:25:40.248287: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 17:25:40.254934: E external/local_xla/xla/stream_exec

In [12]:
!python evaluate.py \
  --input_file outputs/amc23_base_t1.0.jsonl \
  --output_file outputs/amc23_base_t1.0_eval.jsonl

Counting lines in outputs/amc23_base_t1.0.jsonl...
Scoring generations from outputs/amc23_base_t1.0.jsonl...
Processing lines: 100%|████████████████████| 2560/2560 [00:08<00:00, 301.24it/s]
Processing complete. Scored 2560 lines across 40 unique problems.

Pass@k Metrics:
  pass@1   : 25.70%
  pass@2   : 40.05%
  pass@4   : 55.36%
  pass@8   : 68.97%
  pass@16  : 79.66%
  pass@32  : 87.80%
  pass@64  : 95.00%

Majority Vote Metric:
  maj@1    : 42.50%

Scored results saved to outputs/amc23_base_t1.0_eval.jsonl


#### Temperature = 1.2

In [13]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B" \
  --dataset "amc" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 1.2 \
  --top-p 0.9 \
  --output_file outputs/amc23_base_t1.2.jsonl

2025-12-02 17:46:32.775057: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764697592.795447     965 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764697592.801643     965 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 17:46:38 [__init__.py:239] Automatically detected platform cuda.
DP Rank 1 (Local 1) mapped to GPU(s): 1
DP Rank 0 (Local 0) mapped to GPU(s): 0
2025-12-02 17:46:46.613480: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764697606.633604     984 cuda_dnn.cc:8310] Unable t

In [14]:
!python evaluate.py \
  --input_file outputs/amc23_base_t1.2.jsonl \
  --output_file outputs/amc23_base_t1.2_eval.jsonl

Counting lines in outputs/amc23_base_t1.2.jsonl...
Scoring generations from outputs/amc23_base_t1.2.jsonl...
Processing lines:   6%|█▏                   | 141/2560 [00:00<00:08, 290.06it/s]Timeout during comparison
Processing lines: 100%|████████████████████| 2560/2560 [00:10<00:00, 252.37it/s]
Processing complete. Scored 2560 lines across 40 unique problems.

Pass@k Metrics:
  pass@1   : 14.30%
  pass@2   : 24.84%
  pass@4   : 39.23%
  pass@8   : 55.01%
  pass@16  : 68.90%
  pass@32  : 79.57%
  pass@64  : 87.50%

Majority Vote Metric:
  maj@1    : 10.00%

Scored results saved to outputs/amc23_base_t1.2_eval.jsonl


### AIME25 Dataset (rollout-n = 64)

#### Temperature = 0.6

In [15]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B" \
  --dataset "aime" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 0.6 \
  --top-p 0.9 \
  --output_file outputs/aime25_base_t0.6.jsonl

2025-12-02 18:07:23.289011: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764698843.309752    1136 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764698843.316043    1136 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 18:07:29 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 18:07:37.213373: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 18:07:37.228718: E external/local_xla/xla/stream_exec

In [16]:
!python evaluate.py \
  --input_file outputs/aime25_base_t0.6.jsonl \
  --output_file outputs/aime25_base_t0.6_eval.jsonl

Counting lines in outputs/aime25_base_t0.6.jsonl...
Scoring generations from outputs/aime25_base_t0.6.jsonl...
Processing lines:  83%|████████████████▌   | 1592/1920 [00:05<00:00, 340.28it/s]Timeout during comparison
Processing lines: 100%|████████████████████| 1920/1920 [00:06<00:00, 277.58it/s]
Processing complete. Scored 1920 lines across 30 unique problems.

Pass@k Metrics:
  pass@1   : 4.22%
  pass@2   : 7.52%
  pass@4   : 12.32%
  pass@8   : 18.13%
  pass@16  : 23.96%
  pass@32  : 29.83%
  pass@64  : 36.67%

Majority Vote Metric:
  maj@1    : 3.33%

Scored results saved to outputs/aime25_base_t0.6_eval.jsonl


#### Temperature = 1.0

In [17]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B" \
  --dataset "aime" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 1.0 \
  --top-p 0.9 \
  --output_file outputs/aime25_base_t1.0.jsonl

2025-12-02 18:25:56.053027: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764699956.075916    1313 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764699956.082738    1313 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 18:26:02 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 18:26:10.071690: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764699970.093063    1332 cuda_dnn.cc:8310] Unable t

In [18]:
!python evaluate.py \
  --input_file outputs/aime25_base_t1.0.jsonl \
  --output_file outputs/aime25_base_t1.0_eval.jsonl

Counting lines in outputs/aime25_base_t1.0.jsonl...
Scoring generations from outputs/aime25_base_t1.0.jsonl...
Processing lines: 100%|████████████████████| 1920/1920 [00:07<00:00, 244.93it/s]
Processing complete. Scored 1920 lines across 30 unique problems.

Pass@k Metrics:
  pass@1   : 2.76%
  pass@2   : 5.09%
  pass@4   : 8.74%
  pass@8   : 13.53%
  pass@16  : 18.72%
  pass@32  : 24.58%
  pass@64  : 33.33%

Majority Vote Metric:
  maj@1    : 3.33%

Scored results saved to outputs/aime25_base_t1.0_eval.jsonl


#### Temperature = 1.2

In [19]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B" \
  --dataset "aime" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 1.2 \
  --top-p 0.9 \
  --output_file outputs/aime25_base_t1.2.jsonl

2025-12-02 18:44:14.259936: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764701054.280094    1484 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764701054.286162    1484 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 18:44:20 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 18:44:28.303420: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 18:44:28.303666: E external/local_xla/xla/stream_exec

In [20]:
!python evaluate.py \
  --input_file outputs/aime25_base_t1.2.jsonl \
  --output_file outputs/aime25_base_t1.2_eval.jsonl

Counting lines in outputs/aime25_base_t1.2.jsonl...
Scoring generations from outputs/aime25_base_t1.2.jsonl...
Processing lines: 100%|████████████████████| 1920/1920 [00:06<00:00, 293.56it/s]
Processing complete. Scored 1920 lines across 30 unique problems.

Pass@k Metrics:
  pass@1   : 1.25%
  pass@2   : 2.43%
  pass@4   : 4.62%
  pass@8   : 8.38%
  pass@16  : 14.10%
  pass@32  : 21.66%
  pass@64  : 30.00%

Majority Vote Metric:
  maj@1    : 0.00%

Scored results saved to outputs/aime25_base_t1.2_eval.jsonl
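
The base-model results above can be collected into one structure for a side-by-side comparison. The sketch below copies pass@1, pass@n (n = rollout count), and the majority-vote score from the evaluation outputs; the names `BASE_RESULTS` and `best_temperature` are hypothetical helpers, not part of the project code.

```python
# dataset -> temperature -> (pass@1, pass@n, majority vote), all in percent,
# copied from the evaluate.py outputs for the base model.
BASE_RESULTS = {
    "math500": {0.6: (35.99, 84.60, 45.60), 1.0: (28.85, 86.00, 41.40), 1.2: (18.80, 80.00, 17.00)},
    "amc23": {0.6: (34.45, 92.50, 50.00), 1.0: (25.70, 95.00, 42.50), 1.2: (14.30, 87.50, 10.00)},
    "aime25": {0.6: (4.22, 36.67, 3.33), 1.0: (2.76, 33.33, 3.33), 1.2: (1.25, 30.00, 0.00)},
}
METRICS = ("pass@1", "pass@n", "maj")


def best_temperature(dataset, metric):
    """Return the temperature with the highest value of the given metric."""
    index = METRICS.index(metric)
    by_temp = BASE_RESULTS[dataset]
    return max(by_temp, key=lambda t: by_temp[t][index])


for dataset in BASE_RESULTS:
    summary = {m: best_temperature(dataset, m) for m in ("pass@1", "pass@n")}
    print(dataset, summary)
```

For the base model, pass@1 is highest at temperature 0.6 on all three datasets, while pass@n peaks at 1.0 on Math500 and AMC23 and at 0.6 on AIME25.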


## II. GRPO-tuned Model: Qwen/Qwen2.5-Math-1.5B-Instruct

### Math500 Dataset (rollout-n = 16)

#### Temperature = 0.6

In [21]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --dataset "math" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 16 \
  --temperature 0.6 \
  --top-p 0.9 \
  --output_file outputs/math500_instruct_t0.6.jsonl

2025-12-02 19:02:10.865303: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764702130.886301    1655 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764702130.892476    1655 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 19:02:16 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 19:02:24.794666: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 19:02:24.795231: E external/local_xla/xla/stream_exec

In [22]:
!python evaluate.py \
  --input_file outputs/math500_instruct_t0.6.jsonl \
  --output_file outputs/math500_instruct_t0.6_eval.jsonl

Counting lines in outputs/math500_instruct_t0.6.jsonl...
Scoring generations from outputs/math500_instruct_t0.6.jsonl...
Processing lines: 100%|████████████████████| 8000/8000 [00:26<00:00, 302.71it/s]
Processing complete. Scored 8000 lines across 500 unique problems.

Pass@k Metrics:
  pass@1   : 75.00%
  pass@2   : 80.49%
  pass@4   : 84.83%
  pass@8   : 88.19%
  pass@16  : 90.20%

Majority Vote Metric:
  maj@1    : 79.40%

Scored results saved to outputs/math500_instruct_t0.6_eval.jsonl


#### Temperature = 1.0

In [23]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --dataset "math" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 16 \
  --temperature 1.0 \
  --top-p 0.9 \
  --output_file outputs/math500_instruct_t1.0.jsonl

2025-12-02 19:33:25.149640: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764704005.171275    1859 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764704005.177607    1859 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 19:33:31 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 19:33:39.779487: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 19:33:39.779861: E external/local_xla/xla/stream_exec

In [24]:
!python evaluate.py \
  --input_file outputs/math500_instruct_t1.0.jsonl \
  --output_file outputs/math500_instruct_t1.0_eval.jsonl

Counting lines in outputs/math500_instruct_t1.0.jsonl...
Scoring generations from outputs/math500_instruct_t1.0.jsonl...
Processing lines: 100%|████████████████████| 8000/8000 [00:27<00:00, 295.13it/s]
Processing complete. Scored 8000 lines across 500 unique problems.

Pass@k Metrics:
  pass@1   : 74.42%
  pass@2   : 80.90%
  pass@4   : 85.77%
  pass@8   : 89.32%
  pass@16  : 92.20%

Majority Vote Metric:
  maj@1    : 80.40%

Scored results saved to outputs/math500_instruct_t1.0_eval.jsonl


#### Temperature = 1.2

In [25]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --dataset "math" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 16 \
  --temperature 1.2 \
  --top-p 0.9 \
  --output_file outputs/math500_instruct_t1.2.jsonl

2025-12-02 20:03:51.976798: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764705831.997006    2030 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764705832.003280    2030 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 20:03:58 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 20:04:05.896110: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 20:04:05.896318: E external/local_xla/xla/stream_exec

In [26]:
!python evaluate.py \
  --input_file outputs/math500_instruct_t1.2.jsonl \
  --output_file outputs/math500_instruct_t1.2_eval.jsonl

Counting lines in outputs/math500_instruct_t1.2.jsonl...
Scoring generations from outputs/math500_instruct_t1.2.jsonl...
Processing lines: 100%|████████████████████| 8000/8000 [00:26<00:00, 304.37it/s]
Processing complete. Scored 8000 lines across 500 unique problems.

Pass@k Metrics:
  pass@1   : 70.20%
  pass@2   : 77.66%
  pass@4   : 83.00%
  pass@8   : 87.04%
  pass@16  : 90.20%

Majority Vote Metric:
  maj@1    : 76.40%

Scored results saved to outputs/math500_instruct_t1.2_eval.jsonl


### AMC23 Dataset (rollout-n = 64)

#### Temperature = 0.6

In [27]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --dataset "amc" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 0.6 \
  --top-p 0.9 \
  --output_file outputs/amc23_instruct_t0.6.jsonl

2025-12-02 20:39:21.475100: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764707961.495030    2201 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764707961.501076    2201 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 20:39:27 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 20:39:35.332825: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 20:39:35.336208: E external/local_xla/xla/stream_exec

In [28]:
!python evaluate.py \
  --input_file outputs/amc23_instruct_t0.6.jsonl \
  --output_file outputs/amc23_instruct_t0.6_eval.jsonl

Counting lines in outputs/amc23_instruct_t0.6.jsonl...
Scoring generations from outputs/amc23_instruct_t0.6.jsonl...
Processing lines: 100%|████████████████████| 2560/2560 [00:05<00:00, 438.46it/s]
Processing complete. Scored 2560 lines across 40 unique problems.

Pass@k Metrics:
  pass@1   : 53.36%
  pass@2   : 63.30%
  pass@4   : 70.77%
  pass@8   : 76.37%
  pass@16  : 81.60%
  pass@32  : 86.64%
  pass@64  : 92.50%

Majority Vote Metric:
  maj@1    : 67.50%

Scored results saved to outputs/amc23_instruct_t0.6_eval.jsonl


#### Temperature = 1.0

In [29]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --dataset "amc" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 1.0 \
  --top-p 0.9 \
  --output_file outputs/amc23_instruct_t1.0.jsonl

2025-12-02 20:53:08.444092: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764708788.465115    2372 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764708788.470933    2372 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 20:53:14 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 20:53:22.223281: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 20:53:22.223453: E external/local_xla/xla/stream_exec

In [30]:
!python evaluate.py \
  --input_file outputs/amc23_instruct_t1.0.jsonl \
  --output_file outputs/amc23_instruct_t1.0_eval.jsonl

Counting lines in outputs/amc23_instruct_t1.0.jsonl...
Scoring generations from outputs/amc23_instruct_t1.0.jsonl...
Processing lines: 100%|████████████████████| 2560/2560 [00:06<00:00, 422.69it/s]
Processing complete. Scored 2560 lines across 40 unique problems.

Pass@k Metrics:
  pass@1   : 52.27%
  pass@2   : 63.47%
  pass@4   : 72.55%
  pass@8   : 79.53%
  pass@16  : 85.08%
  pass@32  : 88.71%
  pass@64  : 90.00%

Majority Vote Metric:
  maj@1    : 65.00%

Scored results saved to outputs/amc23_instruct_t1.0_eval.jsonl


#### Temperature = 1.2

In [31]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --dataset "amc" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 1.2 \
  --top-p 0.9 \
  --output_file outputs/amc23_instruct_t1.2.jsonl

2025-12-02 21:06:13.064745: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764709573.084338    2543 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764709573.090203    2543 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 21:06:19 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 21:06:26.921790: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764709586.942238    2561 cuda_dnn.cc:8310] Unable t

In [32]:
!python evaluate.py \
  --input_file outputs/amc23_instruct_t1.2.jsonl \
  --output_file outputs/amc23_instruct_t1.2_eval.jsonl

Counting lines in outputs/amc23_instruct_t1.2.jsonl...
Scoring generations from outputs/amc23_instruct_t1.2.jsonl...
Processing lines: 100%|████████████████████| 2560/2560 [00:05<00:00, 436.13it/s]
Processing complete. Scored 2560 lines across 40 unique problems.

Pass@k Metrics:
  pass@1   : 46.91%
  pass@2   : 57.60%
  pass@4   : 66.72%
  pass@8   : 74.20%
  pass@16  : 80.07%
  pass@32  : 83.78%
  pass@64  : 85.00%

Majority Vote Metric:
  maj@1    : 57.50%

Scored results saved to outputs/amc23_instruct_t1.2_eval.jsonl


### AIME25 Dataset (rollout-n = 64)

#### Temperature = 0.6

In [33]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --dataset "aime" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 0.6 \
  --top-p 0.9 \
  --output_file outputs/aime25_instruct_t0.6.jsonl

2025-12-02 21:21:54.859428: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764710514.879695    2714 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764710514.885733    2714 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 21:22:01 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 21:22:08.885279: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 21:22:08.886162: E external/local_xla/xla/stream_exec

In [34]:
!python evaluate.py \
  --input_file outputs/aime25_instruct_t0.6.jsonl \
  --output_file outputs/aime25_instruct_t0.6_eval.jsonl

Counting lines in outputs/aime25_instruct_t0.6.jsonl...
Scoring generations from outputs/aime25_instruct_t0.6.jsonl...
Processing lines:  84%|████████████████▋   | 1606/1920 [00:04<00:00, 337.49it/s]Timeout during comparison
Processing lines:  85%|█████████████████▉   | 1640/1920 [00:06<00:03, 81.45it/s]Timeout during comparison
Processing lines: 100%|████████████████████| 1920/1920 [00:08<00:00, 238.43it/s]
Processing complete. Scored 1920 lines across 30 unique problems.

Pass@k Metrics:
  pass@1   : 9.22%
  pass@2   : 14.78%
  pass@4   : 20.98%
  pass@8   : 26.92%
  pass@16  : 32.57%
  pass@32  : 37.71%
  pass@64  : 43.33%

Majority Vote Metric:
  maj@1    : 13.33%

Scored results saved to outputs/aime25_instruct_t0.6_eval.jsonl


#### Temperature = 1.0

In [35]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --dataset "aime" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 1.0 \
  --top-p 0.9 \
  --output_file outputs/aime25_instruct_t1.0.jsonl

2025-12-02 21:34:55.253432: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764711295.274736    2885 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764711295.281204    2885 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 21:35:01 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 21:35:09.252963: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 21:35:09.256675: E external/local_xla/xla/stream_exec

In [36]:
!python evaluate.py \
  --input_file outputs/aime25_instruct_t1.0.jsonl \
  --output_file outputs/aime25_instruct_t1.0_eval.jsonl

Counting lines in outputs/aime25_instruct_t1.0.jsonl...
Scoring generations from outputs/aime25_instruct_t1.0.jsonl...
Processing lines:  84%|████████████████▊   | 1615/1920 [00:05<00:00, 337.49it/s]Timeout during comparison
Timeout during comparison
Timeout during comparison
Processing lines: 100%|████████████████████| 1920/1920 [00:09<00:00, 210.79it/s]
Processing complete. Scored 1920 lines across 30 unique problems.

Pass@k Metrics:
  pass@1   : 8.65%
  pass@2   : 14.55%
  pass@4   : 21.98%
  pass@8   : 29.66%
  pass@16  : 36.93%
  pass@32  : 44.55%
  pass@64  : 53.33%

Majority Vote Metric:
  maj@1    : 16.67%

Scored results saved to outputs/aime25_instruct_t1.0_eval.jsonl


#### Temperature = 1.2

In [37]:
!python inference.py \
  --model "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --dataset "aime" \
  --dp-size 2 \
  --batch-size 16 \
  --rollout-n 64 \
  --temperature 1.2 \
  --top-p 0.9 \
  --output_file outputs/aime25_instruct_t1.2.jsonl

2025-12-02 21:47:21.751223: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764712041.771802    3056 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764712041.777913    3056 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 12-02 21:47:27 [__init__.py:239] Automatically detected platform cuda.
DP Rank 0 (Local 0) mapped to GPU(s): 0
DP Rank 1 (Local 1) mapped to GPU(s): 1
2025-12-02 21:47:35.710604: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-02 21:47:35.728395: E external/local_xla/xla/stream_exec

In [38]:
!python evaluate.py \
  --input_file outputs/aime25_instruct_t1.2.jsonl \
  --output_file outputs/aime25_instruct_t1.2_eval.jsonl

Counting lines in outputs/aime25_instruct_t1.2.jsonl...
Scoring generations from outputs/aime25_instruct_t1.2.jsonl...
Processing lines:  84%|████████████████▊   | 1618/1920 [00:04<00:00, 388.87it/s]Timeout during comparison
Processing lines: 100%|████████████████████| 1920/1920 [00:06<00:00, 285.60it/s]
Processing complete. Scored 1920 lines across 30 unique problems.

Pass@k Metrics:
  pass@1   : 8.23%
  pass@2   : 13.45%
  pass@4   : 19.53%
  pass@8   : 25.40%
  pass@16  : 30.51%
  pass@32  : 34.77%
  pass@64  : 40.00%

Majority Vote Metric:
  maj@1    : 20.00%

Scored results saved to outputs/aime25_instruct_t1.2_eval.jsonl
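
In the three AIME25 reports above, the majority-vote score is higher than pass@1 at every temperature (13.33% vs 9.22%, 16.67% vs 8.65%, 20.00% vs 8.23%). A minimal sketch of majority voting, assuming one vote per problem over all of its rollouts and exact string matching on extracted answers. The helper names and sample data are hypothetical, and the real scorer may compare answers symbolically rather than as strings.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None  # no rollout produced a parseable answer
    return votes.most_common(1)[0][0]


def majority_accuracy(problems):
    """Fraction of problems whose majority-voted answer equals the reference."""
    hits = sum(majority_vote(p["answers"]) == p["reference"] for p in problems)
    return hits / len(problems)


# Hypothetical extracted answers for two problems (None = no answer extracted).
problems = [
    {"answers": ["70", "70", "65", "70"], "reference": "70"},
    {"answers": ["12", "15", "15", None], "reference": "12"},
]
print(f"majority-vote accuracy: {majority_accuracy(problems):.2%}")
```

Voting helps when the correct answer is the single most common one even though it appears in a minority of rollouts, because incorrect answers tend to be spread over many different values.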


## Task 2: Analyze and Compare Results

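List the evaluation output file expected for each model, dataset, and temperature combination, as the starting point for extracting metrics and building summary tables.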
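
The evaluation cells above print their metrics in a fixed text layout (`  pass@8   : 25.40%`). Since the layout of the `_eval.jsonl` files is not shown in this notebook, one option is to parse that printed report instead. A minimal sketch, where `parse_metrics` and `METRIC_LINE` are hypothetical names and the report text is copied from the AIME25 Instruct run at temperature 1.2:

```python
import re

# Matches lines such as "  pass@16  : 30.51%" or "  maj@1    : 20.00%".
METRIC_LINE = re.compile(r"^\s*((?:pass|maj)@\d+)\s*:\s*([\d.]+)%\s*$")


def parse_metrics(report: str) -> dict:
    """Map metric names (e.g. 'pass@8') to percentages found in a printed report."""
    metrics = {}
    for line in report.splitlines():
        match = METRIC_LINE.match(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


report = """
Pass@k Metrics:
  pass@1   : 8.23%
  pass@2   : 13.45%
  pass@4   : 19.53%
  pass@8   : 25.40%
  pass@16  : 30.51%
  pass@32  : 34.77%
  pass@64  : 40.00%

Majority Vote Metric:
  maj@1    : 20.00%
"""

# One row per configuration; the key follows the naming used in eval_configs.
rows = [{"config": "Instruct_AIME25_1.2", **parse_metrics(report)}]
print(rows[0])
```

A list of such rows can be passed to `pd.DataFrame(rows)` to obtain a summary table with one column per metric.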

In [39]:
import json
import pandas as pd
from pathlib import Path

# Create output directory
output_dir = Path("outputs")
output_dir.mkdir(exist_ok=True)

# Evaluation file names for all configurations (relative to output_dir)
eval_configs = {
    "Base_Math500_0.6": "math500_base_t0.6_eval.jsonl",
    "Base_Math500_1.0": "math500_base_t1.0_eval.jsonl",
    "Base_Math500_1.2": "math500_base_t1.2_eval.jsonl",
    "Base_AMC23_0.6": "amc23_base_t0.6_eval.jsonl",
    "Base_AMC23_1.0": "amc23_base_t1.0_eval.jsonl",
    "Base_AMC23_1.2": "amc23_base_t1.2_eval.jsonl",
    "Base_AIME25_0.6": "aime25_base_t0.6_eval.jsonl",
    "Base_AIME25_1.0": "aime25_base_t1.0_eval.jsonl",
    "Base_AIME25_1.2": "aime25_base_t1.2_eval.jsonl",
    "Instruct_Math500_0.6": "math500_instruct_t0.6_eval.jsonl",
    "Instruct_Math500_1.0": "math500_instruct_t1.0_eval.jsonl",
    "Instruct_Math500_1.2": "math500_instruct_t1.2_eval.jsonl",
    "Instruct_AMC23_0.6": "amc23_instruct_t0.6_eval.jsonl",
    "Instruct_AMC23_1.0": "amc23_instruct_t1.0_eval.jsonl",
    "Instruct_AMC23_1.2": "amc23_instruct_t1.2_eval.jsonl",
    "Instruct_AIME25_0.6": "aime25_instruct_t0.6_eval.jsonl",
    "Instruct_AIME25_1.0": "aime25_instruct_t1.0_eval.jsonl",
    "Instruct_AIME25_1.2": "aime25_instruct_t1.2_eval.jsonl",
}

def extract_metrics_from_eval_file(filepath):
    """
    Placeholder for metric extraction from an evaluation output file.

    Only records whether the file exists; the pass@k and maj@1 values
    are printed by evaluate.py and are not parsed here yet.
    Pass the full path, e.g. output_dir / filename.
    """
    # TODO: read pass@k and maj@1 once the eval file layout is confirmed
    return {'file_exists': Path(filepath).exists()}

# Display summary
print("=" * 80)
print("EVALUATION CONFIGURATIONS")
print("=" * 80)
for config_name, filepath in eval_configs.items():
    print(f"{config_name:30} -> {filepath}")
print("\nNote: Run the generation and evaluation cells above to generate these files.")
print("Once complete, extract metrics from the output JSONL files for analysis.")

EVALUATION CONFIGURATIONS
Base_Math500_0.6               -> math500_base_t0.6_eval.jsonl
Base_Math500_1.0               -> math500_base_t1.0_eval.jsonl
Base_Math500_1.2               -> math500_base_t1.2_eval.jsonl
Base_AMC23_0.6                 -> amc23_base_t0.6_eval.jsonl
Base_AMC23_1.0                 -> amc23_base_t1.0_eval.jsonl
Base_AMC23_1.2                 -> amc23_base_t1.2_eval.jsonl
Base_AIME25_0.6                -> aime25_base_t0.6_eval.jsonl
Base_AIME25_1.0                -> aime25_base_t1.0_eval.jsonl
Base_AIME25_1.2                -> aime25_base_t1.2_eval.jsonl
Instruct_Math500_0.6           -> math500_instruct_t0.6_eval.jsonl
Instruct_Math500_1.0           -> math500_instruct_t1.0_eval.jsonl
Instruct_Math500_1.2           -> math500_instruct_t1.2_eval.jsonl
Instruct_AMC23_0.6             -> amc23_instruct_t0.6_eval.jsonl
Instruct_AMC23_1.0             -> amc23_instruct_t1.0_eval.jsonl
Instruct_AMC23_1.2             -> amc23_instruct_t1.2_eval.jsonl
Instruct_AIME25_0.6 

## Task 3: Analyze and Visualize Results

Next, run the analysis script. It scores the raw generation files in `outputs/`, writes the summary files (`all_results.json`, `summary.csv`), and saves plots to the `outputs/plots/` directory.

In [2]:
!python src/analysis.py --input_dir outputs --output_dir outputs

--- Stage 1: Evaluating generation files ---
Found 18 raw generation files to process.
Scoring aime25_base_t0.6.jsonl:  81%|████▊ | 1552/1920 [00:03<00:00, 506.12it/s]Timeout during comparison
Scoring aime25_base_t0.6.jsonl: 100%|██████| 1920/1920 [00:04<00:00, 403.90it/s]

--- Metrics for aime25_base_t0.6.jsonl ---
Pass@k Metrics:
  pass@1   : 4.22%
  pass@2   : 7.52%
  pass@4   : 12.32%
  pass@8   : 18.13%
  pass@16  : 23.96%
  pass@32  : 29.83%
  pass@64  : 36.67%

Majority Vote Metric:
  maj@1    : 3.33%
Scored results saved to outputs/aime25_base_t0.6_eval.jsonl

Scoring aime25_base_t1.0.jsonl: 100%|██████| 1920/1920 [00:04<00:00, 388.15it/s]

--- Metrics for aime25_base_t1.0.jsonl ---
Pass@k Metrics:
  pass@1   : 2.76%
  pass@2   : 5.09%
  pass@4   : 8.74%
  pass@8   : 13.53%
  pass@16  : 18.72%
  pass@32  : 24.58%
  pass@64  : 33.33%

Majority Vote Metric:
  maj@1    : 3.33%
Scored results saved to outputs/aime25_base_t1.0_eval.jsonl

Scoring aime25_base_t1.2.jsonl: 100%|██████|