Install the libraries if you don't already have them:
```bash
pip install polars numpy matplotlib
```

First, we read our data into a format that is easy to work with.

In [None]:
!pip install polars numpy matplotlib



In [1]:
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import re
from typing import Tuple, Optional
import copy


scenarios = ['easy', 'medium', 'hard']
datasets = ['raw', 'cod', 'cot']
data = {}
data['easy'] = {}
data['medium'] = {}
data['hard'] = {}

data['easy']['raw'] = pl.read_ndjson('../data/raw/gsm8k_easy.jsonl')
data['easy']['cod'] = pl.read_ndjson('../data/training/cod_easy.jsonl')
data['easy']['cot'] = pl.read_ndjson('../data/training/cot_easy.jsonl')

data['medium']['raw'] = pl.read_ndjson('../data/raw/qwedsacf_competition_math_medium.jsonl')
data['medium']['cod'] = pl.read_ndjson('../data/training/cod_medium.jsonl')
data['medium']['cot'] = pl.read_ndjson('../data/training/cot_medium.jsonl')

data['hard']['raw'] = pl.read_ndjson('../data/raw/qwedsacf_competition_math_hard.jsonl')
data['hard']['cod'] = pl.read_ndjson('../data/training/cod_hard.jsonl')
data['hard']['cot'] = pl.read_ndjson('../data/training/cot_hard.jsonl')

Then we take a look at what the data looks like.

In [2]:
data['hard']['raw'].head()

problem,level,type,solution
str,str,str,str
"""What is the degree of the poly…","""Level 3""","""Algebra""","""This polynomial is not written…"
"""Evaluate $\left\lceil3\left(6-…","""Level 3""","""Algebra""","""Firstly, $3\left(6-\frac12\rig…"
"""Sam is hired for a 20-day peri…","""Level 3""","""Algebra""","""Call $x$ the number of days Sa…"
"""Find the center of the circle …","""Level 4""","""Algebra""","""Completing the square, we get …"
"""The points $(9, -5)$ and $(-3,…","""Level 3""","""Algebra""","""The center of the circle is lo…"


In [3]:
data['easy']['raw'].head()

question,answer
str,str
"""Natalia sold clips to 48 of he…","""Natalia sold 48/2 = <<48/2=24>…"
"""Weng earns $12 an hour for bab…","""Weng earns 12/60 = $<<12/60=0.…"
"""Betty is saving money for a ne…","""In the beginning, Betty has on…"
"""Julie is reading a 120-page bo…","""Maila read 12 x 2 = <<12*2=24>…"
"""James writes a 3-page letter t…","""He writes each friend 3*2=<<3*…"


First, we confirm that every dataset has 1000 unique samples. In the cells above we can see that the first column is always the question, so we can use it to count the unique samples.

Second, we need to confirm that for our math raw datasets we have the correct filters applied. I know that I've applied filters in the original code but there's nothing like a healthy bit of paranoia.

In [4]:
### Check for the 1000 unique samples
for scenario in scenarios:
    for dataset in datasets:
        # The first column always holds the question text
        n_unique = data[scenario][dataset][:, 0].n_unique()
        if n_unique != 1000:
            print(f"{scenario} {dataset} has {n_unique} samples")

print(f"Medium MATH has levels: {list(data['medium']['raw']['level'].unique())} and types: {list(data['medium']['raw']['type'].unique())}")

print(f"Hard MATH has levels: {list(data['hard']['raw']['level'].unique())} and types: {list(data['hard']['raw']['type'].unique())}")

Medium MATH has levels: ['Level 2', 'Level 1'] and types: ['Intermediate Algebra', 'Precalculus', 'Algebra']
Hard MATH has levels: ['Level 4', 'Level 3'] and types: ['Intermediate Algebra', 'Algebra']


We're good. Before analyzing the generated output, let's look at its structure and the columns we have.

In [5]:
columns = []
for scenario in ['easy', 'medium', 'hard']:
    for dataset in [ 'cod', 'cot']:
        columns.extend(data[scenario][dataset].columns)
set(columns)

{'id', 'input', 'instruction', 'output'}

In [6]:
data['easy']['cot'].head()

instruction,input,output,id
str,str,str,str
"""Mike can type 65 words per min…","""""","""First, determine the adjusted …","""idx_600"""
"""A cheetah can run at a top spe…","""""","""First, determine the relative …","""idx_601"""
"""A publishing house decides to …","""""","""First, identify the total amou…","""idx_602"""
"""Alyssa and Abigail need to col…","""""","""First, we identify the total n…","""idx_603"""
"""Santana has 7 brothers. 3 of t…","""""","""First, identify the months bel…","""idx_604"""


Now that we've confirmed the validity of our inputs, it's time to validate our outputs. Mainly the following 3 points:

1. The LLM gave us an answer marked by #### separator
2. The LLM gave us the steps marked by -> separator
3. The LLM gave us a correct final answer
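The first two checks in the list above come down to plain substring tests. Here is a minimal sketch of them; has_answer_marker and count_steps are hypothetical helper names (not part of this notebook), and the sketch assumes the reasoning is everything before the last #### marker.

```python
def has_answer_marker(output: str) -> bool:
    """Check 1: the output contains the #### answer separator."""
    return "####" in output


def count_steps(output: str) -> int:
    """Check 2: number of non-empty reasoning steps separated by '->'.

    Only the text before the last #### is treated as reasoning.
    """
    reasoning = output.rpartition("####")[0] if "####" in output else output
    return sum(1 for step in reasoning.split("->") if step.strip())


sample = "Find the total -> 3 * 4 = 12 -> add 5 to get 17 #### 17"
print(has_answer_marker(sample), count_steps(sample))
```

A step count of 1 means the output has no arrow at all, which is the case we will need to repair.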

The first 2 are relatively easy to validate; plain string matching should be fine. For the final one, however, as noted in the blog, we need to account for variability in how answers are written when matching against the ground truth. We therefore create the following helper functions.

In [7]:
def extract_boxed_content(text: str) -> Optional[str]:
    if not text: return None
    idx = text.rfind("\\boxed{")
    if idx == -1: return None
    start_idx = idx + 7
    balance = 1
    for i in range(start_idx, len(text)):
        char = text[i]
        if char == "{": balance += 1
        elif char == "}":
            balance -= 1
            if balance == 0: return text[start_idx:i]
    return None

def clean_competition_math_answer(text: str) -> str:
    if not text: return ""
    text = text.replace("$", "")
    text = text.replace(",", "").strip()
    return text

def extract_answer(text: str, scenario: str, is_ground_truth: bool = False) -> str:
    """
    Extracts answer based on scenario (easy=gsm8k, medium/hard=math).
    """
    if not text: return ""
    text = str(text)

    # Map the scenario to its source dataset (easy = GSM8K, medium/hard = competition MATH)
    is_math = scenario in ['medium', 'hard']
    is_gsm8k = scenario == 'easy'

    if not is_ground_truth:
        parts = text.split("####")
        if len(parts) > 1: return parts[-1].strip()
        return ""

    if is_gsm8k:
        if "####" in text: return text.split("####")[-1].strip()
        return text.strip()

    if is_math:
        boxed = extract_boxed_content(text)
        if boxed: return clean_competition_math_answer(boxed)
        if "####" in text: return text.split("####")[-1].strip()
        return text.strip()

    # Fallback
    if "####" in text: return text.split("####")[-1].strip()
    boxed = extract_boxed_content(text)
    if boxed: return boxed
    return text.strip()

def normalize_string(text: str) -> str:
    if not text: return ""
    text = str(text).strip()
    text = text.replace(",", "")
    if text.endswith("."): text = text[:-1]
    return text

def parse_number(text: str) -> Tuple[Optional[float], bool]:
    clean_text = text.replace(",", "")
    pattern = r'(-?\d+\.?\d*|-?\.\d+)(%)?'
    match = re.search(pattern, clean_text)
    if match:
        try:
            return float(match.group(1)), bool(match.group(2))
        except ValueError:
            pass
    return None, False

def check_equality(ans1: str, ans2: str) -> bool:
    s1, s2 = normalize_string(ans1), normalize_string(ans2)
    if s1 == s2: return True
    
    v1, p1 = parse_number(ans1)
    v2, p2 = parse_number(ans2)
    if v1 is None or v2 is None: return False
    
    def is_close(a, b): return abs(a - b) < 1e-6
    
    if p1 == p2: return is_close(v1, v2)
    if p1 and not p2: return is_close(v1, v2) or is_close(v1/100.0, v2)
    if p2 and not p1: return is_close(v2, v1) or is_close(v2/100.0, v1)
    return False

In [8]:
def generate_analysis_report(data_dict, scenarios, extract_fn, check_fn):
    """
    Generates summary metrics and error reports for CoT vs CoD datasets.

    Args:
        data_dict (dict): Nested dict like data[scenario]['raw'|'cot'|'cod']
        scenarios (list): List of scenario names (str) to process.
        extract_fn (callable): Function (text, scenario, is_gt) -> str
        check_fn (callable): Function (prediction, reference) -> bool

    Returns:
        tuple: (summary_df, errors_full_df, errors_extracted_df)
    """
    summary_results = []
    full_error_list = []
    extracted_error_list = []

    for scenario in scenarios:
        # Prepare Raw Data
        raw_df = data_dict[scenario]['raw']
        
        # Standardize Raw Columns
        if "problem" in raw_df.columns:
            raw_df = raw_df.rename({"problem": "question", "solution": "gt_answer"})
        elif "answer" in raw_df.columns:
            raw_df = raw_df.rename({"answer": "gt_answer"})
        
        # Clean keys for robust joining
        raw_df = raw_df.with_columns(pl.col("question").str.strip_chars())
        
        # Calculate Summary Metrics (Per Dataset)
        for ds_name in ['cot', 'cod']:
            if ds_name not in data_dict[scenario]:
                continue
                
            gen_df = data_dict[scenario][ds_name]
            
            # Prepare Gen DF (Rename + Clean Key)
            gen_prep = gen_df.rename({"instruction": "question", "output": "gen_output"})
            gen_prep = gen_prep.with_columns(pl.col("question").str.strip_chars())
            
            # Capture counts BEFORE join to detect data loss
            raw_count = raw_df.height
            
            # Join Raw and Generated
            combined = raw_df.join(gen_prep, on="question", how="inner")
            
            # Detect Join Failures
            joined_count = combined.height
            join_failures = raw_count - joined_count
            
            # Extract Answers & Check Equality
            # We use map_elements because custom python logic is required
            combined = combined.with_columns([
                pl.col("gt_answer").map_elements(
                    lambda x: extract_fn(x, scenario, True), return_dtype=pl.String
                ).alias("gt_ext"),
                pl.col("gen_output").map_elements(
                    lambda x: extract_fn(x, scenario, False), return_dtype=pl.String
                ).alias("gen_ext")
            ]).with_columns(
                pl.struct(["gen_ext", "gt_ext"]).map_elements(
                    lambda x: check_fn(x["gen_ext"], x["gt_ext"]), return_dtype=pl.Boolean
                ).alias("is_correct")
            )

            # Collect Stats
            n_answers = combined.filter(pl.col('gen_output').str.contains('####', literal=True)).height
            n_steps = combined.filter(pl.col('gen_output').str.contains('->', literal=True)).height
            n_correct = combined.filter(pl.col("is_correct")).height
            n_no_steps_incorrect = combined.filter(
                (~pl.col('gen_output').str.contains('->', literal=True)) & (~pl.col("is_correct"))
            ).height

            summary_results.append({
                "Scenario": scenario,
                "Dataset": ds_name,
                "Total Raw": raw_count,
                "Joined Count": joined_count,
                "Join Failures": join_failures,
                "Count (####)": n_answers,
                "Count (->)": n_steps,
                "Count (Correct)": n_correct,
                "No Steps & Incorrect": n_no_steps_incorrect
            })

        # Build Consolidated Error Reports (Side-by-Side)
        if 'cot' in data_dict[scenario] and 'cod' in data_dict[scenario]:
            # Prepare individual DFs
            cot = data_dict[scenario]['cot'].rename({"instruction": "question", "output": "cot_full"})
            cod = data_dict[scenario]['cod'].rename({"instruction": "question", "output": "cod_full"})
            
            # Clean keys
            cot = cot.with_columns(pl.col("question").str.strip_chars())
            cod = cod.with_columns(pl.col("question").str.strip_chars())
            
            # Join All Three (Raw + CoT + CoD)
            full_combined = raw_df.join(cot, on="question", how="inner").join(cod, on="question", how="inner")
            
            # Extract & Check
            full_combined = full_combined.with_columns([
                pl.col("gt_answer").map_elements(
                    lambda x: extract_fn(x, scenario, True), return_dtype=pl.String
                ).alias("gt_ext"),
                pl.col("cot_full").map_elements(
                    lambda x: extract_fn(x, scenario, False), return_dtype=pl.String
                ).alias("cot_ext"),
                pl.col("cod_full").map_elements(
                    lambda x: extract_fn(x, scenario, False), return_dtype=pl.String
                ).alias("cod_ext"),
            ])
            
            full_combined = full_combined.with_columns([
                pl.struct(["cot_ext", "gt_ext"]).map_elements(
                    lambda x: check_fn(x["cot_ext"], x["gt_ext"]), return_dtype=pl.Boolean
                ).alias("cot_correct"),
                pl.struct(["cod_ext", "gt_ext"]).map_elements(
                    lambda x: check_fn(x["cod_ext"], x["gt_ext"]), return_dtype=pl.Boolean
                ).alias("cod_correct"),
            ])
            
            # Filter: Keep row if EITHER model is incorrect
            errors_only = full_combined.filter((~pl.col("cot_correct")) | (~pl.col("cod_correct")))
            
            if errors_only.height > 0:
                # Report 1: Full Content
                df_full = errors_only.select([
                    pl.lit(scenario).alias("scenario"),
                    pl.col("question"),
                    pl.col("gt_answer").alias("gt_full"),
                    pl.col("cot_full"),
                    pl.col("cod_full")
                ])
                full_error_list.append(df_full)
                
                # Report 2: Extracted Content
                df_ext = errors_only.select([
                    pl.lit(scenario).alias("scenario"),
                    pl.col("question"),
                    pl.col("gt_ext"),
                    pl.col("cot_ext"),
                    pl.col("cod_ext")
                ])
                extracted_error_list.append(df_ext)

    # Final Outputs
    summary_df = pl.DataFrame(summary_results)
    errors_full_df = pl.concat(full_error_list) if full_error_list else pl.DataFrame()
    errors_extracted_df = pl.concat(extracted_error_list) if extracted_error_list else pl.DataFrame()

    return summary_df, errors_full_df, errors_extracted_df

summary_df, errors_full_df, errors_extracted_df = generate_analysis_report(data, scenarios, extract_answer, check_equality)
summary_df

Scenario,Dataset,Total Raw,Joined Count,Join Failures,Count (####),Count (->),Count (Correct),No Steps & Incorrect
str,str,i64,i64,i64,i64,i64,i64,i64
"""easy""","""cot""",1000,1000,0,1000,987,964,0
"""easy""","""cod""",1000,1000,0,1000,997,955,1
"""medium""","""cot""",1000,1000,0,1000,802,918,28
"""medium""","""cod""",1000,1000,0,997,964,923,6
"""hard""","""cot""",1000,1000,0,1000,798,896,20
"""hard""","""cod""",1000,1000,0,998,950,898,7


This is a result that brings a sad tear to my eyes. The lazy solution is to extend the limit to 1500 and rerun the generation code, which should leave us with a full 1000 correct samples; however, in this analysis I'll assume this is what we have and this is what we need to fix. The second lazy solution is to resubmit with a higher temperature and pray to our machine overlords that it works this time, maybe with a modified prompt, but this would impact the experiment. So we have to get to cleaning.

However, we can't be sure these counts reflect the true accuracy; for a better guarantee, it's better to use an LLM as a judge for the evaluation.

The good news is that almost all of them have answer separators and the accuracy is acceptable. The bad news is that the format for chain of thought is messed up. Gemini has failed me. Our next steps are as follows:

1. Investigate chain of thought no steps format and see if we can fix them
2. Investigate incorrect answers and see if our parsers missed some formats

In [9]:
# Function to print output as we will be doing this a lot
def print_output(df, num_samples=5):
    for i, (question, answer) in enumerate(zip(df['instruction'], df['output'])):
        if i>=num_samples:
            break
        print(f"--- Sample {i+1} ---")
        print(f"Q: {question}")
        print(f"A: {answer}\n")

# Print hard output to see if there are patterns we can observe
missing_steps_df = data['hard']['cot'].filter(
    ~pl.col('output').str.contains('->', literal=True)
)

print_output(missing_steps_df)

--- Sample 1 ---
Q: What is the product of the coordinates of the midpoint of a line segment with endpoints at $(2,3)$ and $(-6,5)$?
A: To find the product of the coordinates of the midpoint, we first need to determine the coordinates of the midpoint itself.

1.  **Identify the endpoints:** The line segment has endpoints $(x_1, y_1) = (2, 3)$ and $(x_2, y_2) = (-6, 5)$.

2.  **Apply the midpoint formula:** The midpoint $M$ of a segment with endpoints $(x_1, y_1)$ and $(x_2, y_2)$ is the average of the coordinates:
    $$M = \left(\frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2}\right)$$

3.  **Calculate the x-coordinate of the midpoint:**
    Substitute $x_1 = 2$ and $x_2 = -6$:
    $$x_m = \frac{2 + (-6)}{2} = \frac{-4}{2} = -2$$

4.  **Calculate the y-coordinate of the midpoint:**
    Substitute $y_1 = 3$ and $y_2 = 5$:
    $$y_m = \frac{3 + 5}{2} = \frac{8}{2} = 4$$

    So, the midpoint is $(-2, 4)$.

5.  **Calculate the product of the coordinates:**
    Multiply the x-coordinate by the y-

The common pattern here is two consecutive newlines, a \n\n. We can assume that this is the separator between the steps, but before we make this adjustment let's look at the medium data as well to see if there are other patterns.

In [10]:
missing_steps_df = data['medium']['cot'].filter(
    ~pl.col('output').str.contains('->', literal=True)
)

print_output(missing_steps_df)

--- Sample 1 ---
Q: If $9s+5t=108$ and $s$ is two less than $t$, what is $t$?
A: First, identify the two equations provided in the problem statement. The first equation is $9s + 5t = 108$. The second piece of information states that "$s$ is two less than $t$," which translates to the equation $s = t - 2$.

Next, substitute the expression for $s$ into the first equation. Replacing $s$ with $(t - 2)$ in $9s + 5t = 108$ gives:
$9(t - 2) + 5t = 108$

Now, distribute the 9 across the terms inside the parentheses:
$9t - 18 + 5t = 108$

Combine the like terms involving $t$ ($9t$ and $5t$):
$14t - 18 = 108$

To isolate the term with $t$, add 18 to both sides of the equation:
$14t = 108 + 18$
$14t = 126$

Finally, divide both sides by 14 to solve for $t$:
$t = 126 / 14$
$t = 9$

#### 9

--- Sample 2 ---
Q: If $a @ b$ is defined as $a @ b$ = $3a - 3b$, what is the value of $3 @ 5$?
A: To solve for the value of $3 @ 5$, we must first understand the definition of the custom operation provided.

Th

Again, most of them use \n\n to start a new step, but there are outliers, such as the one that wrote out two methods, or the one that splits the solution into parts and uses a numbered list for one part.


For now we will simply replace \n\n. Our logic is as follows:

If arrows are not present, replace each \n\n with ' -> ', unless it's the last one before the #### separator.

After the adjustment we have another look and evaluate the datasets

In [11]:
def clean_cot_conditional(text: str) -> str:
    if text is None:
        return ""

    # If the text already contains the arrow separator, we assume it is correct 
    # and return it as is.
    if "->" in text:
        return text
    
    # Otherwise, proceed with the cleaning logic
    
    # 1. Split into reasoning and answer
    reasoning, separator, answer = text.partition("####")
    
    if not separator:
        # No answer separator found, return as is, we will delete this later
        return text
    
    # 2. Process reasoning part
    # Replace \n\n with ' -> ' but preserve the last one if it acts as 
    # the spacer before the #### separator
    if reasoning.endswith("\n\n"):
        cleaned_reasoning = reasoning[:-2].replace("\n\n", " -> ") + "\n\n"
    else:
        cleaned_reasoning = reasoning.replace("\n\n", " -> ")

    # 3. Reassemble
    return cleaned_reasoning + separator + answer

# Keep original data
data_cleaned = copy.deepcopy(data)

target_scenarios = ['medium', 'hard']

for scenario in target_scenarios:
    print(f"Cleaning {scenario} - cot (Conditional)...")
    
    # Apply the conditional cleaning function
    data_cleaned[scenario]['cot'] = data_cleaned[scenario]['cot'].with_columns(
        pl.col('output')
        .map_elements(clean_cot_conditional, return_dtype=pl.String)
        .alias('output')
    )

# Print new summary report
summary_df_cleaned, errors_full_df_cleaned, errors_extracted_df_cleaned = generate_analysis_report(data_cleaned, scenarios, extract_answer, check_equality)
summary_df_cleaned

Cleaning medium - cot (Conditional)...
Cleaning hard - cot (Conditional)...


Scenario,Dataset,Total Raw,Joined Count,Join Failures,Count (####),Count (->),Count (Correct),No Steps & Incorrect
str,str,i64,i64,i64,i64,i64,i64,i64
"""easy""","""cot""",1000,1000,0,1000,987,964,0
"""easy""","""cod""",1000,1000,0,1000,997,955,1
"""medium""","""cot""",1000,1000,0,1000,999,918,0
"""medium""","""cod""",1000,1000,0,997,964,923,6
"""hard""","""cot""",1000,1000,0,1000,1000,896,0
"""hard""","""cod""",1000,1000,0,998,950,898,7


Again, let's print to verify it looks good

In [14]:
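# Select the split explicitly instead of relying on loop variables left over
# from earlier cells (the output below corresponds to the hard CoT split)
scenario, dataset = 'hard', 'cot'

# Get missing arrow in original dataset
missing_arrow_mask = ~data[scenario][dataset]['output'].str.contains('->', literal=True)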

# Apply  mask to cleaned data to get only the modified rows
modified_steps_df = data_cleaned[scenario][dataset]\
    .with_row_index(name="original_index")\
    .filter(missing_arrow_mask)

print_output(modified_steps_df)

--- Sample 1 ---
Q: What is the product of the coordinates of the midpoint of a line segment with endpoints at $(2,3)$ and $(-6,5)$?
A: To find the product of the coordinates of the midpoint, we first need to determine the coordinates of the midpoint itself. -> 1.  **Identify the endpoints:** The line segment has endpoints $(x_1, y_1) = (2, 3)$ and $(x_2, y_2) = (-6, 5)$. -> 2.  **Apply the midpoint formula:** The midpoint $M$ of a segment with endpoints $(x_1, y_1)$ and $(x_2, y_2)$ is the average of the coordinates:
    $$M = \left(\frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2}\right)$$ -> 3.  **Calculate the x-coordinate of the midpoint:**
    Substitute $x_1 = 2$ and $x_2 = -6$:
    $$x_m = \frac{2 + (-6)}{2} = \frac{-4}{2} = -2$$ -> 4.  **Calculate the y-coordinate of the midpoint:**
    Substitute $y_1 = 3$ and $y_2 = 5$:
    $$y_m = \frac{3 + 5}{2} = \frac{8}{2} = 4$$ ->     So, the midpoint is $(-2, 4)$. -> 5.  **Calculate the product of the coordinates:**
    Multiply the x-coordina
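The cleaned summary table still shows 999 arrow-formatted outputs for medium CoT, so one output was not converted. The cause is not visible in this notebook. A small diagnostic like the one below could label why an output still lacks arrows. It is only a sketch: diagnose_missing_arrow is a hypothetical helper, and its categories are guesses at likely causes rather than findings.

```python
import re


def diagnose_missing_arrow(output: str) -> str:
    """Return a short label explaining why an output has no '->' steps."""
    if "->" in output:
        return "has arrows"
    reasoning, separator, _ = output.partition("####")
    if not separator:
        return "no #### separator"
    # Ignore the trailing spacer before ####, mirroring the cleaning rule
    if "\n\n" in reasoning.rstrip("\n"):
        return "blank lines present"
    # A 'blank' line that holds spaces or tabs is not matched by '\n\n'
    if re.search(r"\n[ \t]+\n", reasoning):
        return "blank line contains whitespace"
    if "\n" in reasoning.strip():
        return "single newlines only"
    return "single block of text"


print(diagnose_missing_arrow("Step one\nStep two\n\n#### 4"))
```

Mapping this over the output column of the cleaned medium CoT frame would show which case the remaining row falls into.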

Let's look at a sample where the arrow exists

In [15]:
steps_df = data['medium']['cot'].filter(
    pl.col('output').str.contains('->', literal=True)
)
print_output(steps_df)

--- Sample 1 ---
Q: Point $M(4,4)$ is the midpoint of $\overline{AB}$. If point $A$ has coordinates $(8,4)$, what is the sum of the coordinates of point $B$?
A: We are given the midpoint $M(4,4)$ and one endpoint $A(8,4)$, and we need to find the coordinates of the other endpoint $B(x,y)$ to calculate their sum -> Using the midpoint formula, the x-coordinate of the midpoint is the average of the endpoints' x-coordinates: $\frac{8+x}{2} = 4$, which implies $8+x=8$, so $x=0$ -> Similarly, for the y-coordinate: $\frac{4+y}{2} = 4$, which implies $4+y=8$, so $y=4$ -> The coordinates of point $B$ are $(0,4)$, and the sum of these coordinates is $0+4=4$. #### 4

--- Sample 2 ---
Q: What is $\sqrt[4]{81} \cdot \sqrt[3]{27} \cdot \sqrt{9}$ expressed as a positive integer?
A: First, evaluate each radical term individually to simplify the expression -> The fourth root of 81 is 3, because $3^4 = 81$ -> The cube root of 27 is 3, because $3^3 = 27$ -> The square root of 9 is 3, because $3^2 = 9$ ->

They look similar enough. Now that we've fixed the issue with the missing arrows, let's tackle the incorrect answers. Our report already generated errors_full_df and errors_extracted_df, which we can use to analyze the mistakes and see what went wrong.

In [16]:
rows_list = errors_full_df_cleaned.filter(
    pl.col('scenario') == 'medium'
).tail(5).to_dicts()

# Print them out nicely
for i, row in enumerate(rows_list):
    print(f"--- Row {i} ---")
    print(row)

--- Row 0 ---
{'scenario': 'medium', 'question': 'Find all roots of the polynomial $x^3-5x^2+3x+9$. Enter your answer as a list of numbers separated by commas. If a root occurs more than once, enter it as many times as its multiplicity.', 'gt_full': 'By the Rational Root Theorem, any root of the polynomial must divide $9$. Therefore the roots are among the numbers $\\pm 1,3$. Since these are only four values, we can try all of them to find that $x=3$ and $x=-1$ are roots and $x=-3$ and $x=1$ are not.\n\nSince the given polynomial is cubic, it must have three roots. This means that one of $3$ or $-1$ is a root twice (i.e. has multiplicity $2$). The Factor Theorem tells us that since $-1$ and $3$ are roots of the polynomial, $x+1$ and $x-3$ must be factors of the polynomial. To find which root occurs twice, we can divide $x^3-5x^2+3x+9$ by $x+1$ to get  $x^3-5x^2+3x+9 = (x+1)(x^2-6x+9)$.\n\nWe can factorise $x^2-6x+9$ as $(x-3)^2$ which means that the root $x=3$ has multiplicity 2.  Thus

In [17]:
rows_list = errors_extracted_df_cleaned.filter(
    pl.col('scenario') == 'medium'
).tail(5).to_dicts()

# Print them out nicely
for i, row in enumerate(rows_list):
    print(f"--- Row {i} ---")
    print(row)

--- Row 0 ---
{'scenario': 'medium', 'question': 'Find all roots of the polynomial $x^3-5x^2+3x+9$. Enter your answer as a list of numbers separated by commas. If a root occurs more than once, enter it as many times as its multiplicity.', 'gt_ext': '-133', 'cot_ext': '-1, 3, 3', 'cod_ext': '-1, 3, 3'}
--- Row 1 ---
{'scenario': 'medium', 'question': 'Which type of conic section is described by the equation \\[|y+5| = \\sqrt{(x-2)^2 + y^2}?\\]Enter "C" for circle, "P" for parabola, "E" for ellipse, "H" for hyperbola, and "N" for none of the above.', 'gt_ext': '\\text{(P)}', 'cot_ext': 'P', 'cod_ext': 'P'}
--- Row 2 ---
{'scenario': 'medium', 'question': 'The hyperbola given by the equation \\[\\frac{y^2}{9}-\\frac{x^2}{4} = 1\\]has asymptotes $y = \\pm mx,$ where $m$ is positive. Find $m.$', 'gt_ext': '\\frac32', 'cot_ext': '\\frac{3}{2}', 'cod_ext': '\\frac{3}{2}'}
--- Row 3 ---
{'scenario': 'medium', 'question': 'When a polynomial is divided by $2x^2 - 7x + 18,$ what are the possible 

Looking at the answers in our datasets, we notice two main discrepancies:

1. LaTeX fractions such as \\frac{1}{2} are not parsed correctly, so we decided to resolve fractions.
2. Answers wrapped in LaTeX text commands, such as \\text{(P)}, are also not parsed correctly, so we decided to resolve LaTeX.

We therefore update our extract functions to resolve these issues.
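A third pattern is visible in Row 0 of the extracted errors: gt_ext is '-133' while both model answers are '-1, 3, 3'. clean_competition_math_answer removes every comma, so a list-valued ground truth (presumably boxed as -1,3,3) collapses into a single number. Below is a minimal sketch of a list-aware comparison. The helper names are hypothetical, and comparing as a multiset (order ignored) is an assumption about what should count as equal.

```python
from collections import Counter
from typing import List


def split_list_answer(answer: str) -> List[str]:
    """Split a comma-separated answer into trimmed, non-empty items."""
    cleaned = answer.replace("$", "")
    return [item.strip() for item in cleaned.split(",") if item.strip()]


def same_items(prediction: str, reference: str) -> bool:
    """Compare two list-style answers as multisets, ignoring order and spacing."""
    return Counter(split_list_answer(prediction)) == Counter(split_list_answer(reference))


print(same_items("-1, 3, 3", "-1,3,3"))
print(same_items("-1, 3", "-1,3,3"))
```

This would have to run before commas are stripped, and only for questions that ask for a list, since a thousands separator such as 1,000 would otherwise be split into two items.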

In [18]:
def resolve_fractions(text: str) -> str:
    """
    Resolves both LaTeX fractions (\\frac{1}{2}) AND plain text fractions (1/2) to decimals.
    Examples: 
      "\\frac{1}{2}" -> "0.5"
      "1/2"          -> "0.5"
      "1,000/4"      -> "250.0"
    """
    if not text: return text
    text = str(text)
    
    # --- Helper: Perform Division ---
    def calculate_div(n_str, d_str):
        try:
            n = float(n_str.replace(',', '').strip())
            d = float(d_str.replace(',', '').strip())
            if d == 0: return None
            return str(n / d)
        except ValueError:
            return None

    # --- Pass 1: LaTeX Fractions (\frac{a}{b}) ---
    def repl_latex(m):
        val = calculate_div(m.group(2), m.group(3))
        return val if val is not None else m.group(0)

    # Pattern matches \frac{...}{...} or \dfrac{...}{...}
    text = re.sub(r'\\(d?)frac\{([^{}]+)\}\{([^{}]+)\}', repl_latex, text)

    # --- Pass 2: Plain Text Fractions (a/b) ---
    def repl_plain(m):
        val = calculate_div(m.group(1), m.group(2))
        return val if val is not None else m.group(0)

    # Pattern matches: number / number
    # Handles negatives (-1/2) and commas (1,000/2)
    # Note: only integer numerators and denominators are matched (no decimals).
    plain_pattern = r'(-?\d+(?:,\d+)*)\s*/\s*(-?\d+(?:,\d+)*)'
    text = re.sub(plain_pattern, repl_plain, text)

    return text

def extract_boxed_content(text: str) -> Optional[str]:
    if not text: return None
    idx = text.rfind("\\boxed{")
    if idx == -1: return None
    start_idx = idx + 7
    balance = 1
    for i in range(start_idx, len(text)):
        char = text[i]
        if char == "{": balance += 1
        elif char == "}":
            balance -= 1
            if balance == 0: return text[start_idx:i]
    return None

def clean_competition_math_answer(text: str) -> str:
    if not text: return ""
    text = text.replace("$", "")
    text = text.replace(",", "").strip()
    return text

def extract_answer(text: str, scenario: str, is_ground_truth: bool = False) -> str:
    """
    Extracts answer based on scenario (easy=gsm8k, medium/hard=math).
    """
    if not text: return ""
    text = str(text)

    # --- ADDED: Fraction Resolution ---
    # We apply this globally first so \boxed{\frac{1}{2}} becomes \boxed{0.5}
    text = resolve_fractions(text)
    # ----------------------------------

    # Map the scenario to its source dataset (easy = GSM8K, medium/hard = competition MATH)
    is_math = scenario in ['medium', 'hard']
    is_gsm8k = scenario == 'easy'

    if not is_ground_truth:
        parts = text.split("####")
        if len(parts) > 1: return parts[-1].strip()
        return ""

    if is_gsm8k:
        if "####" in text: return text.split("####")[-1].strip()
        return text.strip()

    if is_math:
        boxed = extract_boxed_content(text)
        if boxed: return clean_competition_math_answer(boxed)
        if "####" in text: return text.split("####")[-1].strip()
        return text.strip()

    # Fallback
    if "####" in text: return text.split("####")[-1].strip()
    boxed = extract_boxed_content(text)
    if boxed: return boxed
    return text.strip()

def normalize_string(text: str) -> str:
    if not text: return ""
    text = str(text).strip()
    text = text.replace(",", "")
    if text.endswith("."): text = text[:-1]
    return text

def parse_number(text: str) -> Tuple[Optional[float], bool]:
    clean_text = text.replace(",", "")
    pattern = r'(-?\d+\.?\d*|-?\.\d+)(%)?'
    match = re.search(pattern, clean_text)
    if match:
        try:
            return float(match.group(1)), bool(match.group(2))
        except ValueError:
            pass
    return None, False

def check_equality(ans1: str, ans2: str) -> bool:
    s1, s2 = normalize_string(ans1), normalize_string(ans2)
    if s1 == s2: return True
    
    v1, p1 = parse_number(ans1)
    v2, p2 = parse_number(ans2)
    if v1 is None or v2 is None: return False
    
    def is_close(a, b): return abs(a - b) < 1e-6
    
    if p1 == p2: return is_close(v1, v2)
    if p1 and not p2: return is_close(v1, v2) or is_close(v1/100.0, v2)
    if p2 and not p1: return is_close(v2, v1) or is_close(v2/100.0, v1)
    return False
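Two cases from the error report are still not covered by the functions above. The \frac pattern in resolve_fractions only matches braced arguments, so the brace-less form \frac32 is left untouched, and nothing unwraps \text{(P)}. The sketch below shows one way to normalize both before fractions are resolved. unwrap_text and brace_short_frac are hypothetical helpers, and the single-digit rule for brace-less fractions is an assumption based on how LaTeX reads \frac32.

```python
import re


def unwrap_text(answer: str) -> str:
    r"""Unwrap \text{...}, dropping parentheses inside it: '\text{(P)}' -> 'P'."""
    return re.sub(r'\\text\{\(?([^{}()]*)\)?\}', r'\1', answer).strip()


def brace_short_frac(answer: str) -> str:
    r"""Add braces to single-digit fractions: '\frac32' -> '\frac{3}{2}'."""
    return re.sub(r'\\(d?frac)(\d)(\d)', r'\\\1{\2}{\3}', answer)


print(unwrap_text(r'\text{(P)}'))
print(brace_short_frac(r'\frac32'))
```

Calling brace_short_frac before resolve_fractions would let the existing braced pattern pick up these fractions without changing it.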

In [19]:
# Print new summary report
summary_df_cleaned, errors_full_df_cleaned, errors_extracted_df_cleaned = generate_analysis_report(data_cleaned, scenarios, extract_answer, check_equality)
summary_df_cleaned

Scenario,Dataset,Total Raw,Joined Count,Join Failures,Count (####),Count (->),Count (Correct),No Steps & Incorrect
str,str,i64,i64,i64,i64,i64,i64,i64
"""easy""","""cot""",1000,1000,0,1000,987,964,0
"""easy""","""cod""",1000,1000,0,1000,997,955,1
"""medium""","""cot""",1000,1000,0,1000,999,933,0
"""medium""","""cod""",1000,1000,0,997,964,932,6
"""hard""","""cot""",1000,1000,0,1000,1000,935,0
"""hard""","""cod""",1000,1000,0,998,950,932,8


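Still a sad tear to my eye. The better resolution algorithm raised the correct counts for medium and hard (hard CoT, for example, went from 896 to 935), but we are still short of 1000. On top of that, the "No Steps & Incorrect" count for hard CoD rose from 7 to 8, so at least one answer that used to match no longer does. This suggests that our original resolution algorithm was giving some false positives. There is more that can be done here but I'll stop at this for now.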
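One plausible source of such false positives (an assumption from reading the code, not something measured here) is that parse_number uses re.search, so only the first number found in each string is compared. The sketch below contrasts that behaviour with a stricter, hypothetical whole_number helper that accepts a string only when it is a single number and nothing else. Percent handling is left out for brevity.

```python
import re
from typing import Optional

NUMBER = r'-?\d+\.?\d*|-?\.\d+'


def first_number(text: str) -> Optional[float]:
    """Mirror of the notebook's approach: take the first number found anywhere."""
    match = re.search(NUMBER, text.replace(",", ""))
    return float(match.group(0)) if match else None


def whole_number(text: str) -> Optional[float]:
    """Stricter variant: the entire string must be a single number."""
    match = re.fullmatch(NUMBER, text.replace(",", "").strip())
    return float(match.group(0)) if match else None


a, b = "3x + 2", "3y - 7"
print(first_number(a) == first_number(b))  # both reduce to their leading 3
print(whole_number(a), whole_number(b))    # neither is a bare number
print(whole_number("1,000"))
```

With the first-number rule, two different expressions that happen to start with the same digit compare as equal; the stricter rule would push such cases back to exact string comparison.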

If you want, you can pass this to an LLM as a judge for evaluation and get the actual accuracy. For now we'll resolve with this. Do the analysis and write the report.

In [20]:
def match_datasets(data_cleaned):
    """
    Balances 'cot' and 'cod' datasets for each scenario:
    1. Identifies the smaller dataset.
    2. Slices the larger dataset to match the size.
    3. PRIORITIZES keeping samples that DO exist in the smaller dataset (Intersection).
    """
    print("\n⚖️  Dataset Matching Report (Prioritizing Overlap)")
    print("=" * 70)
    print(f"{'Scenario':<10} | {'Small':<5} | {'Large':<5} | {'Overlap Kept':<12} | {'Unique Kept':<12} | {'Final Size'}")
    print("-" * 70)

    for scenario, splits in data_cleaned.items():
        if 'cot' not in splits or 'cod' not in splits:
            continue
            
        df_cot = splits['cot']
        df_cod = splits['cod']
        
        n_cot = df_cot.height
        n_cod = df_cod.height
        
        if n_cot == n_cod:
            print(f"{scenario:<10} | {'--':<5} | {'--':<5} | {'--':<12} | {'--':<12} | {n_cot}")
            continue

        # 1. Identify Smaller vs Larger
        if n_cot < n_cod:
            small_key, big_key = 'cot', 'cod'
            df_small, df_big = df_cot, df_cod
            target_n = n_cot
        else:
            small_key, big_key = 'cod', 'cot'
            df_small, df_big = df_cod, df_cot
            target_n = n_cod

        # 2. Identify Overlap
        # Fix: Convert to list to avoid Polars deprecation warning/ambiguity
        small_instr_list = df_small["instruction"].to_list()
        
        df_big = df_big.with_columns(
            is_overlap = pl.col("instruction").is_in(small_instr_list)
        )

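        # 3. Sort by Overlap (Descending: True first) -> Intersection First
        # maintain_order=True keeps ties in their original order, so the slice is reproducible
        df_big_sorted = df_big.sort("is_overlap", descending=True, maintain_order=True)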
        
        # 4. Slice to Target Size
        df_big_sliced = df_big_sorted.head(target_n)
        
        # Stats
        n_overlap_kept = df_big_sliced.filter(pl.col("is_overlap")).height
        n_unique_kept = df_big_sliced.filter(~pl.col("is_overlap")).height
        
        # 5. Update Dictionary
        data_cleaned[scenario][big_key] = df_big_sliced.drop("is_overlap")
        
        print(f"{scenario:<10} | {small_key:<5} | {big_key:<5} | {n_overlap_kept:<12} | {n_unique_kept:<12} | {target_n}")

    print("=" * 70)
    return data_cleaned

def clean_and_struct_data(data_dict, scenarios, extract_fn, check_fn):
    """
    Cleans the generated datasets: a sample is kept only if its output
    contains '->' AND its extracted answer matches the ground truth.
    
    Returns: 
        data_cleaned (dict): Same structure as input but filtered.
        final_counts (dict): {scenario: {'cod': int, 'cot': int}} for analysis.
    """
    data_cleaned = {s: {} for s in scenarios}
    final_counts = {s: {} for s in scenarios}
    
    print(f"{'Scenario':<10} | {'Type':<5} | {'Raw':<6} | {'Cleaned':<8} | {'Retention'}")
    print("-" * 55)

    for scenario in scenarios:
        # 1. Prepare Raw (Ground Truth)
        raw_df = data_dict[scenario]['raw']
        
        # Standardize Raw Columns
        if "problem" in raw_df.columns:
            raw_df = raw_df.rename({"problem": "question", "solution": "gt_answer"})
        elif "answer" in raw_df.columns:
            raw_df = raw_df.rename({"answer": "gt_answer"})
            
        raw_df = raw_df.with_columns(pl.col("question").str.strip_chars())
        
        # Copy Raw to Cleaned (Unchanged)
        data_cleaned[scenario]['raw'] = raw_df

        # 2. Process CoT and CoD
        for ds_type in ['cot', 'cod']:
            if ds_type not in data_dict[scenario]:
                continue
                
            gen_df = data_dict[scenario][ds_type]
            
            # Standardize Gen Columns
            gen_df = gen_df.rename({"instruction": "question", "output": "gen_output"})
            gen_df = gen_df.with_columns(pl.col("question").str.strip_chars())
            
            # Join with Raw to get GT
            combined = raw_df.join(gen_df, on="question", how="inner")
            
            # Extract and Check Correctness
            combined = combined.with_columns([
                pl.col("gt_answer").map_elements(
                    lambda x: extract_fn(x, scenario, True), return_dtype=pl.String
                ).alias("gt_ext"),
                pl.col("gen_output").map_elements(
                    lambda x: extract_fn(x, scenario, False), return_dtype=pl.String
                ).alias("gen_ext")
            ]).with_columns(
                pl.struct(["gen_ext", "gt_ext"]).map_elements(
                    lambda x: check_fn(x["gen_ext"], x["gt_ext"]), return_dtype=pl.Boolean
                ).alias("is_correct")
            )
            
            # --- FILTERING LOGIC ---
            filtered = combined.filter(
                pl.col("gen_output").str.contains("->", literal=True) & 
                pl.col("is_correct")
            )
            
            # Restore original column names for training compatibility
            final_df = filtered.select([
                pl.col("question").alias("instruction"),
                pl.col("gen_output").alias("output")
            ])
            
            data_cleaned[scenario][ds_type] = final_df
            final_counts[scenario][ds_type] = final_df.height
            
            print(f"{scenario:<10} | {ds_type:<5} | {gen_df.height:<6} | {final_df.height:<8} | {final_df.height/gen_df.height:.1%}")

    return data_cleaned, final_counts
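
The overlap-first slicing in `match_datasets` can be illustrated without Polars. The sketch below is a plain-Python analogue on toy lists (the function name `slice_prioritizing_overlap` and the sample strings are made up for illustration). Items of the larger set that also appear in the smaller set are moved to the front, then the list is cut to the smaller set's size. Python's `sorted` is stable, so ties keep their original order. That is a property of this sketch, not something a Polars sort guarantees by default.

```python
from typing import List

def slice_prioritizing_overlap(small: List[str], big: List[str]) -> List[str]:
    """Cut `big` down to len(small), keeping items shared with `small` first."""
    small_set = set(small)
    # False sorts before True, so "not in small_set" puts shared items first.
    ordered = sorted(big, key=lambda q: q not in small_set)
    return ordered[:len(small)]

# Toy data: two of the five "big" questions also exist in "small"
small = ["q1", "q3", "q5"]
big = ["q0", "q1", "q2", "q3", "q4"]

kept = slice_prioritizing_overlap(small, big)
n_overlap = sum(q in set(small) for q in kept)
n_unique = len(kept) - n_overlap
print(kept, n_overlap, n_unique)
```

Here the two shared questions are kept first and one non-shared question fills the remaining slot, which mirrors the "Overlap Kept" and "Unique Kept" columns of the matching report.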

In [21]:
def analyze_confidence_impact(data_input, baseline_n=1000):
    """
    Analyzes how the 95% margin of error (MoE) changes with dataset size,
    relative to a baseline of `baseline_n` samples.
    Accepts either a dict of counts OR a dict of DataFrames.
    """
    z = 1.96 # 95% Confidence
    p = 0.5  # Max variance assumption
    base_moe = z * np.sqrt((p * (1-p)) / baseline_n)
    
    print("\n📊 Confidence Impact Analysis (Dataset Size Reduction)")
    print("=" * 65)
    print(f"Baseline Reference (N={baseline_n}): MoE = ±{base_moe*100:.2f}%")
    print("-" * 65)
    print(f"{'Scenario':<10} | {'Type':<5} | {'Size (N)':<8} | {'MoE (±%)':<10} | {'Noise Factor'}")
    print("-" * 65)
    
    for scenario, types in data_input.items():
        for dtype, val in types.items():
            # Handle DataFrame vs Integer input
            if isinstance(val, pl.DataFrame):
                n = val.height
            else:
                n = val
            
            # Skip empty or invalid
            if n == 0:
                print(f"{scenario:<10} | {dtype:<5} | {n:<8} | N/A        | Infinite")
                continue
                
            moe = z * np.sqrt((p * (1-p)) / n)
            factor = moe / base_moe
            print(f"{scenario:<10} | {dtype:<5} | {n:<8} | {moe*100:.2f}%     | {factor:.2f}x")
            
    print("-" * 65)
    print("Interpretation: 'Noise Factor' > 1.0 means less confidence than baseline.")
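
The figures printed by this function follow from the normal-approximation margin of error for a proportion, MoE = z * sqrt(p * (1 - p) / n), with z = 1.96 and the worst case p = 0.5. As a minimal sketch (the helper names `margin_of_error` and `noise_factor` are illustrative and not part of the notebook), the noise factor reduces to sqrt(baseline_n / n), so shrinking a dataset costs confidence only through the square root of the size ratio.

```python
import math

def margin_of_error(n: int, z: float = 1.96, p: float = 0.5) -> float:
    """Half-width of the confidence interval for a proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

def noise_factor(n: int, baseline_n: int = 1000) -> float:
    """MoE at size n relative to the MoE at the baseline size."""
    return margin_of_error(n) / margin_of_error(baseline_n)

# Sizes taken from the matching report: the baseline and the three final sizes
for n in (1000, 951, 902, 890):
    print(f"N={n:<5} MoE=±{margin_of_error(n) * 100:.2f}%  factor={noise_factor(n):.2f}x")
```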

In [22]:
# 1. Clean Data
data_cleaned_base, cleaning_stats = clean_and_struct_data(
    data_cleaned, 
    scenarios, 
    extract_answer, 
    check_equality
)

# 2. Match datasets (now warning-free)
data_final = match_datasets(data_cleaned_base)

# 3. Run analysis (now handles the DataFrames inside data_final automatically)
analyze_confidence_impact(data_final, baseline_n=1000)

Scenario   | Type  | Raw    | Cleaned  | Retention
-------------------------------------------------------
easy       | cot   | 1000   | 951      | 95.1%
easy       | cod   | 1000   | 953      | 95.3%
medium     | cot   | 1000   | 932      | 93.2%
medium     | cod   | 1000   | 902      | 90.2%
hard       | cot   | 1000   | 935      | 93.5%
hard       | cod   | 1000   | 890      | 89.0%

⚖️  Dataset Matching Report (Prioritizing Overlap)
Scenario   | Small | Large | Overlap Kept | Unique Kept  | Final Size
----------------------------------------------------------------------
easy       | cot   | cod   | 940          | 11           | 951
medium     | cod   | cot   | 899          | 3            | 902
hard       | cod   | cot   | 888          | 2            | 890

📊 Confidence Impact Analysis (Dataset Size Reduction)
Baseline Reference (N=1000): MoE = ±3.10%
-----------------------------------------------------------------
Scenario   | Type  | Size (N) | MoE (±%)   | Noise Factor
--------