Install the required libraries if you don't already have them:
```bash
pip install polars numpy matplotlib
```

First we read our data into a format that is easy to work with.

In [36]:
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import re
from typing import Tuple, Optional


scenarios = ['easy', 'medium', 'hard']
datasets = ['raw', 'cod', 'cot']
data = {}
data['easy'] = {}
data['medium'] = {}
data['hard'] = {}

data['easy']['raw'] = pl.read_ndjson('../data/raw/gsm8k_easy.jsonl')
data['easy']['cod'] = pl.read_ndjson('../data/training/cod_easy.jsonl')
data['easy']['cot'] = pl.read_ndjson('../data/training/cot_easy.jsonl')

data['medium']['raw'] = pl.read_ndjson('../data/raw/qwedsacf_competition_math_medium.jsonl')
data['medium']['cod'] = pl.read_ndjson('../data/training/cod_medium.jsonl')
data['medium']['cot'] = pl.read_ndjson('../data/training/cot_medium.jsonl')

data['hard']['raw'] = pl.read_ndjson('../data/raw/qwedsacf_competition_math_hard.jsonl')
data['hard']['cod'] = pl.read_ndjson('../data/training/cod_hard.jsonl')
data['hard']['cot'] = pl.read_ndjson('../data/training/cot_hard.jsonl')

Then we take a look at what the data looks like.

In [23]:
data['hard']['raw'].head()

problem,level,type,solution
str,str,str,str
"""What is the degree of the poly…","""Level 3""","""Algebra""","""This polynomial is not written…"
"""Evaluate $\left\lceil3\left(6-…","""Level 3""","""Algebra""","""Firstly, $3\left(6-\frac12\rig…"
"""Sam is hired for a 20-day peri…","""Level 3""","""Algebra""","""Call $x$ the number of days Sa…"
"""Find the center of the circle …","""Level 4""","""Algebra""","""Completing the square, we get …"
"""The points $(9, -5)$ and $(-3,…","""Level 3""","""Algebra""","""The center of the circle is lo…"


In [24]:
data['easy']['raw'].head()

question,answer
str,str
"""Natalia sold clips to 48 of he…","""Natalia sold 48/2 = <<48/2=24>…"
"""Weng earns $12 an hour for bab…","""Weng earns 12/60 = $<<12/60=0.…"
"""Betty is saving money for a ne…","""In the beginning, Betty has on…"
"""Julie is reading a 120-page bo…","""Maila read 12 x 2 = <<12*2=24>…"
"""James writes a 3-page letter t…","""He writes each friend 3*2=<<3*…"


First, we confirm that every dataset contains 1000 unique samples. The cells above show that the first column is always the question, so we can count the unique values in that column.

Second, we need to confirm that for our math raw datasets we have the correct filters applied. I know that I've applied filters in the original code but there's nothing like a healthy bit of paranoia.

In [49]:
### Check for the 1000 unique samples
for scenario in scenarios:
    for dataset in datasets:
        n_unique = data[scenario][dataset][:, 0].n_unique()
        if n_unique != 1000:
            print(f"{scenario} {dataset} has {n_unique} samples")

print(f"Medium MATH has levels: {list(data['medium']['raw']['level'].unique())} and types: {list(data['medium']['raw']['type'].unique())}")

print(f"Hard MATH has levels: {list(data['hard']['raw']['level'].unique())} and types: {list(data['hard']['raw']['type'].unique())}")

Medium MATH has levels: ['Level 1', 'Level 2'] and types: ['Algebra', 'Intermediate Algebra', 'Precalculus']
Hard MATH has levels: ['Level 3', 'Level 4'] and types: ['Algebra', 'Intermediate Algebra']


We're good: nothing was printed by the uniqueness check, and the levels and types match the intended filters. Before analyzing the outputs, let's look at their structure and the columns we have.

In [31]:
columns = []
for scenario in ['easy', 'medium', 'hard']:
    for dataset in [ 'cod', 'cot']:
        columns.extend(data[scenario][dataset].columns)
set(columns)

{'id', 'input', 'instruction', 'output'}

In [35]:
data['easy']['cot'].head()

instruction,input,output,id
str,str,str,str
"""Joe played catch with Derek an…","""""","""Joe caught the ball 23 times -…","""idx_500"""
"""John is very unfit and decides…","""""","""Identify the stages of progres…","""idx_501"""
"""At a garage sale, Tish bought …","""""","""The problem states that the co…","""idx_502"""
"""Yasna has two books. One book …","""""","""First, find the total number o…","""idx_503"""
"""From March to August, Sam made…","""""","""First, determine Sam's hourly …","""idx_504"""


Now that we've confirmed the validity of our inputs, it's time to validate our outputs. We check three things:

1. The LLM gave us an answer marked by the `####` separator
2. The LLM gave us the steps marked by the `->` separator
3. The LLM gave us a correct final answer

The first two are relatively easy to validate with plain string matching. For the third, as noted in the blog, we need to account for variability in how answers are written when matching against the ground truths. We therefore create the following helper functions.

In [None]:
def extract_boxed_content(text: str) -> Optional[str]:
    if not text: return None
    idx = text.rfind("\\boxed{")
    if idx == -1: return None
    start_idx = idx + 7
    balance = 1
    for i in range(start_idx, len(text)):
        char = text[i]
        if char == "{": balance += 1
        elif char == "}":
            balance -= 1
            if balance == 0: return text[start_idx:i]
    return None

def clean_competition_math_answer(text: str) -> str:
    if not text: return ""
    text = text.replace("$", "")
    text = text.replace(",", "").strip()
    return text

def extract_answer(text: str, scenario: str, is_ground_truth: bool = False) -> str:
    """
    Extracts answer based on scenario (easy=gsm8k, medium/hard=math).
    """
    if not text: return ""
    text = str(text)

    # Map each scenario to its source dataset (easy = GSM8K, medium/hard = MATH)
    is_math = scenario in ['medium', 'hard']
    is_gsm8k = scenario == 'easy'

    if not is_ground_truth:
        parts = text.split("####")
        if len(parts) > 1: return parts[-1].strip()
        return ""

    if is_gsm8k:
        if "####" in text: return text.split("####")[-1].strip()
        return text.strip()

    if is_math:
        boxed = extract_boxed_content(text)
        if boxed: return clean_competition_math_answer(boxed)
        if "####" in text: return text.split("####")[-1].strip()
        return text.strip()

    # Fallback
    if "####" in text: return text.split("####")[-1].strip()
    boxed = extract_boxed_content(text)
    if boxed: return boxed
    return text.strip()

def normalize_string(text: str) -> str:
    if not text: return ""
    text = str(text).strip()
    text = text.replace(",", "")
    if text.endswith("."): text = text[:-1]
    return text

def parse_number(text: str) -> Tuple[Optional[float], bool]:
    clean_text = text.replace(",", "")
    pattern = r'(-?\d+\.?\d*|-?\.\d+)(%)?'
    match = re.search(pattern, clean_text)
    if match:
        try:
            return float(match.group(1)), bool(match.group(2))
        except ValueError:
            pass
    return None, False

def check_equality(ans1: str, ans2: str) -> bool:
    s1, s2 = normalize_string(ans1), normalize_string(ans2)
    if s1 == s2: return True
    
    v1, p1 = parse_number(ans1)
    v2, p2 = parse_number(ans2)
    if v1 is None or v2 is None: return False
    
    def is_close(a, b): return abs(a - b) < 1e-6
    
    if p1 == p2: return is_close(v1, v2)
    if p1 and not p2: return is_close(v1, v2) or is_close(v1/100.0, v2)
    if p2 and not p1: return is_close(v2, v1) or is_close(v2/100.0, v1)
    return False
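
The brace counter in `extract_boxed_content` is there because MATH answers often contain nested LaTeX braces. As a small illustration (the solution string below is made up, not taken from the dataset), a naive non-greedy regex stops at the first closing brace and truncates the answer:

```python
import re

# Hypothetical ground-truth solution with a nested LaTeX answer.
solution = r"Simplifying, we get $\boxed{\frac{1}{2}}$."

# Naive approach: a non-greedy match ends at the FIRST closing brace,
# so the nested \frac{...}{...} is cut off.
naive = re.search(r"\\boxed\{(.*?)\}", solution).group(1)
print(naive)
```

The naive match yields only `\frac{1`, whereas tracking the brace balance as the helper does returns the full `\frac{1}{2}`.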

In [34]:
results = []
for scenario in scenarios:
    for ds_name in ['cot', 'cod']:
        current_df = data[scenario][ds_name]
        counts = current_df.select(
            n_answers = pl.col('output').str.contains('####', literal=True).sum(),
            n_steps   = pl.col('output').str.contains('->', literal=True).sum()
        ).row(0) # Returns a tuple like (10, 5)
        
        results.append({
            "Scenario": scenario,
            "Dataset": ds_name,
            "Count (####)": counts[0],
            "Count (->)": counts[1]
        })

summary_df = pl.DataFrame(results)
print(summary_df)

shape: (6, 4)
┌──────────┬─────────┬──────────────┬────────────┐
│ Scenario ┆ Dataset ┆ Count (####) ┆ Count (->) │
│ ---      ┆ ---     ┆ ---          ┆ ---        │
│ str      ┆ str     ┆ i64          ┆ i64        │
╞══════════╪═════════╪══════════════╪════════════╡
│ easy     ┆ cot     ┆ 1000         ┆ 997        │
│ easy     ┆ cod     ┆ 1000         ┆ 1000       │
│ medium   ┆ cot     ┆ 1000         ┆ 900        │
│ medium   ┆ cod     ┆ 1000         ┆ 1000       │
│ hard     ┆ cot     ┆ 1000         ┆ 828        │
│ hard     ┆ cod     ┆ 1000         ┆ 996        │
└──────────┴─────────┴──────────────┴────────────┘
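
To make the gaps easier to read, the sketch below turns the `->` counts from the table into the number of outputs that lack a step separator. The counts are copied by hand from the table, and the total of 1000 rows per dataset is an assumption based on the uniqueness check earlier.

```python
# Counts of outputs containing "->", copied from the summary table above.
TOTAL = 1000  # assumed number of rows per dataset
step_counts = {
    ("easy", "cot"): 997,
    ("easy", "cod"): 1000,
    ("medium", "cot"): 900,
    ("medium", "cod"): 1000,
    ("hard", "cot"): 828,
    ("hard", "cod"): 996,
}

# Outputs without any "->" separator, per (scenario, dataset).
missing = {key: TOTAL - n for key, n in step_counts.items()}

for (scenario, dataset), n in missing.items():
    print(f"{scenario:<6} {dataset}: {n} outputs without '->' ({n / TOTAL:.1%})")
```

Under that assumption the CoT outputs account for almost all of the gaps (3, 100, and 172 samples), while CoD is only missing the separator in 4 hard samples.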


Now for the first 3, we can heuristically identify steps as a period followed by a newline, so each `.\n` can be replaced with `->`.
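
A minimal sketch of that heuristic in plain Python is shown below. The helper name `add_step_separators` and the sample string are hypothetical, and the choice to leave everything after `####` untouched is an assumption rather than something the notebook specifies.

```python
def add_step_separators(output: str) -> str:
    """Hypothetical helper: rewrite '.\\n' step boundaries as ' -> '."""
    # Split off the final answer so the text after '####' is left untouched.
    reasoning, marker, answer = output.partition("####")
    # Strip first, so the trailing '.\n' before the answer marker does not
    # become a dangling separator.
    steps = reasoning.strip().replace(".\n", " -> ")
    return f"{steps}\n\n{marker}{answer}" if marker else steps


# Made-up output in the period-plus-newline style described above.
sample = "First, add 2 and 3 to get 5.\nThen double it to get 10.\n\n#### 10"
print(add_step_separators(sample))
```

The same literal substitution could be applied column-wise in polars with `str.replace_all(..., literal=True)`, but the answer-marker handling would then need to be done separately.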

However, before doing that we need to check that the outputs that already have steps follow a consistent pattern, so we'll grab the first few samples and inspect them.

In [42]:
steps_df = data['hard']['cod'].filter(
    pl.col('output').str.contains('->', literal=True)
)
for i, (question, answer) in enumerate(zip(steps_df['instruction'], steps_df['output'])):
    if i>=5:
        break
    print(f"--- Sample {i+1} ---")
    print(f"Q: {question}")
    print(f"A: {answer}\n")

--- Sample 1 ---
Q: Define a function $h(x),$ for positive integer values of $x,$ by \[h(x) = \left\{\begin{aligned} \log_2 x & \quad \text{ if } \log_2 x \text{ is an integer} \\ 1 + h(x + 1) & \quad \text{ otherwise}. \end{aligned} \right.\]Compute $h(100).$
A: Find next power of 2 -> $128$ is nearest upper power -> Steps needed: $128 - 100 = 28$ -> $h(128) = \log_2 128 = 7$ -> result is $28 + 7 = 35$

#### 35

--- Sample 2 ---
Q: Determine if the graph of the equation below is a parabola, circle, ellipse, hyperbola, point, line, two lines, or empty.

$x^2 + 2y^2 - 6x - 8y + 21 = 0$
A: Analyze equation -> Complete square x -> Complete square y -> Simplify constants -> Check solvability -> Empty set

####
empty

--- Sample 3 ---
Q: The parabolas $y = (x + 1)^2$ and $x + 4 = (y - 3)^2$ intersect at four points $(x_1,y_1),$ $(x_2,y_2),$ $(x_3,y_3),$ and $(x_4,y_4).$  Find
\[x_1 + x_2 + x_3 + x_4 + y_1 + y_2 + y_3 + y_4.\]
A: Substitute y into x equation -> Find polynomial for x -> Sum o