Install the required libraries if you don't already have them:
```bash
pip install polars numpy matplotlib
```

First, we read our data into a nested dictionary keyed by scenario and dataset, so each table is easy to look up.

In [21]:
!pip install polars numpy matplotlib


In [22]:
import polars as pl
import numpy as np
import matplotlib.pyplot as plt


scenarios = ['easy', 'medium', 'hard']
datasets = ['raw', 'cod', 'cot']
data = {}
data['easy'] = {}
data['medium'] = {}
data['hard'] = {}

data['easy']['raw'] = pl.read_ndjson('../data/raw/gsm8k_easy.jsonl')
data['easy']['cod'] = pl.read_ndjson('../data/training/cod_easy.jsonl')
data['easy']['cot'] = pl.read_ndjson('../data/training/cot_easy.jsonl')

data['medium']['raw'] = pl.read_ndjson('../data/raw/qwedsacf_competition_math_medium.jsonl')
data['medium']['cod'] = pl.read_ndjson('../data/training/cod_medium.jsonl')
data['medium']['cot'] = pl.read_ndjson('../data/training/cot_medium.jsonl')

data['hard']['raw'] = pl.read_ndjson('../data/raw/qwedsacf_competition_math_hard.jsonl')
data['hard']['cod'] = pl.read_ndjson('../data/training/cod_hard.jsonl')
data['hard']['cot'] = pl.read_ndjson('../data/training/cot_hard.jsonl')
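
The nine reads above follow a regular naming pattern, so the paths could also be generated rather than typed out. The following is a minimal sketch of that idea, not the notebook's own code: `DATA_DIR`, `RAW_STEMS`, and `build_paths` are hypothetical names introduced here, and the file names are copied from the cell above.

```python
from pathlib import Path

# Hypothetical constants: the directory layout and raw file stems are
# copied from the loading cell above.
DATA_DIR = Path("../data")
RAW_STEMS = {
    "easy": "gsm8k_easy",
    "medium": "qwedsacf_competition_math_medium",
    "hard": "qwedsacf_competition_math_hard",
}


def build_paths(data_dir=DATA_DIR):
    """Return a nested dict: scenario -> dataset name -> JSONL path."""
    paths = {}
    for scenario, stem in RAW_STEMS.items():
        # Raw files have dataset-specific names, so they come from the mapping.
        paths[scenario] = {"raw": data_dir / "raw" / f"{stem}.jsonl"}
        # Training files share one pattern: <dataset>_<scenario>.jsonl
        for dataset in ("cod", "cot"):
            paths[scenario][dataset] = (
                data_dir / "training" / f"{dataset}_{scenario}.jsonl"
            )
    return paths


paths = build_paths()
print(paths["hard"]["cod"].as_posix())
```

Each path could then be passed to `pl.read_ndjson` to rebuild the same `data` dictionary, which keeps the scenario and dataset names in one place.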

Then we take a look at the data.

In [23]:
data['hard']['raw'].head()

problem,level,type,solution
str,str,str,str
"""What is the degree of the poly…","""Level 3""","""Algebra""","""This polynomial is not written…"
"""Evaluate $\left\lceil3\left(6-…","""Level 3""","""Algebra""","""Firstly, $3\left(6-\frac12\rig…"
"""Sam is hired for a 20-day peri…","""Level 3""","""Algebra""","""Call $x$ the number of days Sa…"
"""Find the center of the circle …","""Level 4""","""Algebra""","""Completing the square, we get …"
"""The points $(9, -5)$ and $(-3,…","""Level 3""","""Algebra""","""The center of the circle is lo…"


In [24]:
data['easy']['raw'].head()

question,answer
str,str
"""Natalia sold clips to 48 of he…","""Natalia sold 48/2 = <<48/2=24>…"
"""Weng earns $12 an hour for bab…","""Weng earns 12/60 = $<<12/60=0.…"
"""Betty is saving money for a ne…","""In the beginning, Betty has on…"
"""Julie is reading a 120-page bo…","""Maila read 12 x 2 = <<12*2=24>…"
"""James writes a 3-page letter t…","""He writes each friend 3*2=<<3*…"


First, we confirm that every dataset contains 1000 unique samples. The cells above show that the first column always holds the question, so we can count the unique values in that column.

In [26]:
for scenario in scenarios:
    for dataset in datasets:
        # The first column always holds the question text.
        n_unique = data[scenario][dataset][:, 0].n_unique()
        if n_unique != 1000:
            print(f"{scenario} {dataset} has {n_unique} unique samples")

Looks good: nothing was printed, so every dataset has 1000 unique samples.
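
The check above compares only the unique count against 1000, so a file with extra duplicated rows would still pass silently. As a cross-check, here is a minimal standard-library sketch that reports the row count, the unique count, and the duplicated values together. The helper name `summarize_questions` and the toy records are assumptions made for illustration; they are not part of the notebook's data.

```python
import io
import json
from collections import Counter


def summarize_questions(lines, key):
    """Return (row_count, unique_count, duplicated_values) for a JSONL stream."""
    values = [json.loads(line)[key] for line in lines if line.strip()]
    counts = Counter(values)
    duplicated = sorted(value for value, n in counts.items() if n > 1)
    return len(values), len(counts), duplicated


# Toy stand-in for a JSONL file (hypothetical records, not the real data).
toy_file = io.StringIO(
    '{"question": "2+2?", "answer": "4"}\n'
    '{"question": "3+3?", "answer": "6"}\n'
    '{"question": "2+2?", "answer": "4"}\n'
)

rows, unique, duplicated = summarize_questions(toy_file, key="question")
print(rows, unique, duplicated)
```

On the real files, `lines` would be an open file handle, and `key` would be the name of the first column (`question` for the easy raw set, `problem` for the medium and hard raw sets, per the previews above).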

Now that we've confirmed there are no duplicates, we need to check that every output follows the expected format: the -> separator between steps and the #### marker before the final answer. We also need to measure the accuracy of the outputs.

Lucky for us, the answer column in our training datasets is always called output, as the next cell confirms.

In [31]:
columns = []
for scenario in ['easy', 'medium', 'hard']:
    for dataset in ['cod', 'cot']:
        columns.extend(data[scenario][dataset].columns)
set(columns)

{'id', 'input', 'instruction', 'output'}

In [34]:
results = []
for scenario in scenarios:
    for ds_name in ['cot', 'cod']:
        current_df = data[scenario][ds_name]
        counts = current_df.select(
            n_answers = pl.col('output').str.contains('####', literal=True).sum(),
            n_steps   = pl.col('output').str.contains('->', literal=True).sum()
        ).row(0)  # Tuple: (rows containing '####', rows containing '->')
        
        results.append({
            "Scenario": scenario,
            "Dataset": ds_name,
            "Count (####)": counts[0],
            "Count (->)": counts[1]
        })

summary_df = pl.DataFrame(results)
print(summary_df)

shape: (6, 4)
┌──────────┬─────────┬──────────────┬────────────┐
│ Scenario ┆ Dataset ┆ Count (####) ┆ Count (->) │
│ ---      ┆ ---     ┆ ---          ┆ ---        │
│ str      ┆ str     ┆ i64          ┆ i64        │
╞══════════╪═════════╪══════════════╪════════════╡
│ easy     ┆ cot     ┆ 1000         ┆ 997        │
│ easy     ┆ cod     ┆ 1000         ┆ 1000       │
│ medium   ┆ cot     ┆ 1000         ┆ 900        │
│ medium   ┆ cod     ┆ 1000         ┆ 1000       │
│ hard     ┆ cot     ┆ 1000         ┆ 828        │
│ hard     ┆ cod     ┆ 1000         ┆ 996        │
└──────────┴─────────┴──────────────┴────────────┘
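
Note that `str.contains(...).sum()` counts the rows that contain a marker at least once, not the total number of markers. To look closer at individual outputs and move towards the accuracy measurement mentioned earlier, one option is a small per-output inspector. The sketch below uses only the standard library and assumes that the final answer is whatever follows `####` on the last line of the output. `inspect_output`, the sample strings, and the `reference` value are all hypothetical and are not taken from the datasets.

```python
import re

# Assumption: the final answer is the text after '####' on the last line.
ANSWER_RE = re.compile(r"####\s*(.+?)\s*$")


def inspect_output(text):
    """Summarize the format markers found in one model output."""
    match = ANSWER_RE.search(text.strip())
    return {
        "has_answer": match is not None,
        "n_arrows": text.count("->"),  # total arrows, not just presence
        "final_answer": match.group(1) if match else None,
    }


# Hypothetical outputs: arrow-separated steps, plain prose, and a missing marker.
samples = [
    "48/2=24 -> 48+24=72\n#### 72",
    "First, half of 48 is 24. Then 48 + 24 = 72.\n#### 72",
    "48/2=24 -> 72",
]
reports = [inspect_output(sample) for sample in samples]

# Exact string match against a hypothetical reference answer.
reference = "72"
accuracy = sum(r["final_answer"] == reference for r in reports) / len(reports)

for report in reports:
    print(report)
print(f"accuracy: {accuracy:.2f}")
```

Exact string matching is the simplest comparison and will undercount correct answers that differ only in formatting (for example `72.0` or `$72`), so a real evaluation would likely need some normalization first.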
