# End-to-End Dataset Preparation for LLM Fine-Tuning

This notebook walks through preparing a mixed-domain dataset for fine-tuning a Large Language Model (LLM). Fine-tuning adapts a pre-trained model to specific tasks or domains, and the quality and structure of the fine-tuning dataset largely determine how well that adaptation works.

In this notebook, we will perform a complete, end-to-end data preparation pipeline. We will source data from multiple distinct domains to create a robust training set and a challenging validation set. Specifically, we will:

1.  **Source Datasets**: We will use three different datasets from the Hugging Face Hub:
    *   `zwhe99/DeepMath-103K`: A large dataset of mathematical problems, designed to enhance a model's logical and mathematical reasoning capabilities.
    *   `RUC-NLPIR/FlashRAG_datasets` (the `nq` configuration, Natural Questions): A widely used open-domain question-answering dataset derived from real Google search queries. This adds general-knowledge, fact-based questions to our mix.
    *   `Maxwell-Jia/AIME_2024`: A set of problems from the American Invitational Mathematics Examination (AIME). These are highly challenging, competition-level math problems, making them an excellent choice for a validation set to test the model's advanced reasoning skills on unseen, difficult tasks.

2.  **Process and Standardize**: We will process each dataset, cleaning the data and transforming it into a single, unified format. This standardization is crucial for the model to learn from different data sources seamlessly.

3.  **Combine and Shuffle**: We will combine the processed math and question-answering datasets into a single training file. We will then shuffle it to ensure the model sees a random mix of data types during training, which prevents it from learning any spurious order-based patterns.

4.  **Prepare a Validation Set**: We will process the AIME dataset separately to serve as our validation set. This allows us to measure the model's performance on a task that is related but significantly more difficult than the training data.

5.  **Save for Use**: Finally, we will save our curated training and validation sets in the efficient Parquet file format, ready to be used in an LLM fine-tuning pipeline.

Let's begin!

## Section 1: Setup and Environment Preparation

Before we can begin processing data, we need to set up our environment. This involves installing the necessary Python libraries, importing them into our script, and creating the directory structure where we will save our final datasets.

### 1.1 Installing Dependencies

We will use several key libraries. The following command ensures they are installed. If you are running this in a new environment, uncomment and run the cell below.

*   `pandas` & `numpy`: Essential for data manipulation and numerical operations.
*   `datasets`: The Hugging Face library for easily loading and manipulating large datasets.
*   `ipywidgets` & `jupyter`: Required for rendering progress bars and interactive elements within the Jupyter environment.
*   `tqdm`: A library for creating smart, simple progress bars for our loops.

In [1]:
# pip install pandas numpy datasets ipywidgets jupyter tqdm

### 1.2 Importing Libraries

Now, let's import all the modules we'll need for this notebook. We also silence all Python warnings to keep the output clean. Note that this hides deprecation notices as well, so consider removing that line while debugging.

In [2]:
# Standard library imports for interacting with the operating system and handling JSON data.
import os
import json

# Core data science libraries for data manipulation and numerical computation.
import pandas as pd
import numpy as np

# Hugging Face library for dataset loading and processing.
from datasets import load_dataset, concatenate_datasets, Dataset

# Utility for displaying progress bars, making long-running operations more informative.
from tqdm import tqdm

# Suppress all warnings to keep the output tidy (this also hides deprecation notices).
import warnings
warnings.filterwarnings("ignore")
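
Blanket suppression is convenient in a demo, but it can hide warnings that matter. As a minimal sketch of a narrower alternative, the standard-library `warnings.catch_warnings` context manager can silence one category for one block of code. The `noisy_step` function below is a hypothetical stand-in for a library call, not part of this pipeline.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a library call that emits two warnings."""
    warnings.warn("this API will change", FutureWarning)
    warnings.warn("input looks truncated", UserWarning)
    return "done"


# Filters set inside the `with` block are discarded when the block exits.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # surface everything by default
    warnings.simplefilter("ignore", category=FutureWarning)  # except this category
    result = noisy_step()

# Only the UserWarning was recorded; the FutureWarning was filtered out.
categories = [w.category.__name__ for w in caught]
print(result, categories)
```

With this pattern, warnings outside the chosen category still reach the output.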

### 1.3 Creating Output Directories

A well-organized project structure is crucial. We will create dedicated directories for our training and validation data. This practice helps keep our project clean and makes it easy to locate our final artifacts.

In [3]:
# Define the path for the training data output directory.
train_output_dir = "./data/train"
# Define the path for the validation data output directory.
val_output_dir = "./data/val"

# Use os.makedirs to create the directories. 
# The `exist_ok=True` argument prevents an error if the directories already exist.
os.makedirs(train_output_dir, exist_ok=True)
os.makedirs(val_output_dir, exist_ok=True)

print(f"Directories created: {train_output_dir}, {val_output_dir}")

Directories created: ./data/train, ./data/val


## Section 2: Preparing the Training Dataset

Our goal in this section is to create a single, high-quality training dataset by combining data from two different sources: mathematical problems and open-domain questions. This diversity will help the model become more versatile.

### 2.1 The DeepMath-103K Dataset (Mathematical Reasoning)

First, we'll process the `DeepMath-103K` dataset. This dataset contains over 100,000 math problems, complete with solutions. It's an excellent resource for teaching an LLM structured, step-by-step reasoning.

#### Loading the DeepMath Dataset

In [4]:
print("\n=== Loading DeepMath-103K ===")

# Use the `load_dataset` function from the `datasets` library.
# We specify the dataset name on the Hugging Face Hub: "zwhe99/DeepMath-103K".
# We also specify that we only want the "train" split of this dataset.
math_dataset = load_dataset(
    "zwhe99/DeepMath-103K",
    split="train"
)


=== Loading DeepMath-103K ===


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/2.05G [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

#### Exploring the DeepMath Dataset

Before processing, it's always a good idea to inspect the dataset. We'll check its columns, the total number of samples, and look at one example record to understand its structure.

In [5]:
# The `.column_names` attribute gives us a list of all columns in the dataset.
print("Columns:", math_dataset.column_names)
# The `len()` function tells us the total number of records (rows) in the dataset.
print("Total samples:", len(math_dataset))

Columns: ['question', 'final_answer', 'difficulty', 'topic', 'r1_solution_1', 'r1_solution_2', 'r1_solution_3']
Total samples: 103022


In [6]:
# Accessing an item by index, like a list, gives us a single record.
sample = math_dataset[0]

# The solution fields ('r1_solution_*') can be very long. 
# For a clean printout, we'll truncate them.
truncated_sample = sample.copy()
for key in ['r1_solution_1', 'r1_solution_2', 'r1_solution_3']:
    truncated_sample[key] = sample[key][:400]

# Use `json.dumps` with indentation for a pretty, readable print of the sample record.
print(json.dumps(truncated_sample, indent=2))

{
  "question": "Evaluate the limit: \\[ \\lim_{x \\to \\infty} \\sqrt{x} \\left( \\sqrt[3]{x+1} - \\sqrt[3]{x-1} \\right) \\]",
  "final_answer": "0",
  "difficulty": 4.5,
  "topic": "Mathematics -> Precalculus -> Limits",
  "r1_solution_1": "Okay, so I have this limit to evaluate: the limit as x approaches infinity of the square root of x times the difference between the cube root of (x plus 1) and the cube root of (x minus 1). Hmm, let me write that down again to make sure I have it right.\n\n\\[\n\\lim_{x \\to \\infty} \\sqrt{x} \\left( \\sqrt[3]{x+1} - \\sqrt[3]{x-1} \\right)\n\\]\n\nAlright, so it's the product of sqrt(x) and the difference of tw",
  "r1_solution_2": "Okay, so I need to evaluate the limit as x approaches infinity of sqrt(x) times (the cube root of (x+1) minus the cube root of (x-1)). Let me write that down again to make sure I got it right:\n\n\\[\n\\lim_{x \\to \\infty} \\sqrt{x} \\left( \\sqrt[3]{x+1} - \\sqrt[3]{x-1} \\right)\n\\]\n\nHmm. So the expression is 

#### Standardizing the DeepMath Dataset

Now we'll iterate through each record and convert it to our desired standard format. This format is generic and will allow us to combine it with other datasets later. 

Our target schema will be:
*   `id`: A unique identifier for each sample.
*   `question`: The problem or query text.
*   `chain`: A placeholder for chain-of-thought or reasoning steps (we'll leave it empty for now).
*   `result`: The final answer.
*   `source`: A string indicating the original dataset.
*   `extra_info`: A dictionary to hold any other useful metadata from the original record.
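
As a rough sketch, the schema above can be captured in one small builder so that every source produces rows with identical keys. The helper name `make_record` and the `REQUIRED_KEYS` constant are assumptions made for illustration; the cells in this notebook build the dictionaries inline instead.

```python
from typing import Any, Dict, Optional

# Key order of the unified schema described above.
REQUIRED_KEYS = ("id", "question", "chain", "result", "source", "extra_info")


def make_record(idx: int, question: str, answer: Any, source: str,
                extra: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build one row in the unified schema (hypothetical helper)."""
    result = str(answer)  # answers are always stored as strings
    extra_info = {"ground_truth": result, "idx": idx}
    if extra:
        extra_info.update(extra)  # optional per-source metadata
    return {
        "id": idx,
        "question": question,
        "chain": "",  # reasoning steps are left empty
        "result": result,
        "source": source,
        "extra_info": extra_info,
    }


record = make_record(0, "What is 2 + 2?", 4, "toy")
print(record)
```

Keeping the construction in one place makes it harder for two sources to drift apart in key names or value types.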

In [7]:
print("\n=== Processing MathHard ===")

# Initialize an empty list to store our processed records.
math_rows = []

# We iterate through the dataset using tqdm to get a nice progress bar.
# `enumerate` gives us both the index (`idx`) and the item for each record.
for idx, item in enumerate(tqdm(math_dataset, desc="Processing MathHard")):
    # Some datasets might use different keys for the same concept (e.g., 'question' vs 'Problem').
    # This logic handles such inconsistencies gracefully.
    if "question" in item:
        question = item["question"]
    elif "Problem" in item:
        question = item["Problem"]
    else:
        # If neither key is found, raise an error to stop execution, as this is unexpected.
        raise KeyError("Missing question field")

    # Similarly, handle potential inconsistencies for the answer field.
    if "final_answer" in item:
        answer = item["final_answer"]
    elif "Answer" in item:
        answer = item["Answer"]
    else:
        raise KeyError("Missing answer field")

    # Append a new dictionary to our list, structured according to our standard format.
    math_rows.append({
        "id": idx,  # Use the loop index as a temporary ID.
        "question": question,
        "chain": "",  # Placeholder for reasoning steps.
        "result": str(answer), # Ensure the answer is always a string.
        "source": "mathhard", # Source tag; "mathhard" is this notebook's label for DeepMath-103K.
        "extra_info": { # Keep the ground-truth answer and the original row index.
            "ground_truth": str(answer),
            "idx": idx
        }
    })


=== Processing MathHard ===


Processing MathHard: 100%|██████████| 103022/103022 [00:03<00:00, 33261.05it/s]
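
The key-fallback logic in the loop above repeats the same `if`/`elif`/`else` pattern for each field. One possible refactoring, shown here only as a sketch, is a small lookup helper. The name `first_present` and the toy records are hypothetical.

```python
def first_present(item, keys):
    """Return the value of the first key in `keys` that exists in `item`."""
    for key in keys:
        if key in item:
            return item[key]
    raise KeyError(f"None of {list(keys)} found in record")


# Toy records using the two naming styles seen in this notebook's datasets.
deepmath_style = {"question": "Evaluate 1 + 1.", "final_answer": "2"}
aime_style = {"Problem": "Find m + n.", "Answer": 33}

q1 = first_present(deepmath_style, ("question", "Problem"))
q2 = first_present(aime_style, ("question", "Problem"))
a2 = first_present(aime_style, ("final_answer", "Answer"))
print(q1, "|", q2, "|", a2)
```

Like the original loop, the helper fails loudly when no candidate key is present rather than inserting an empty value.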


In [8]:
# Verify that the number of processed rows matches the original dataset size.
print("Processed math samples:", len(math_rows))
print("\nProcessed sample:")
# Print the first processed sample to confirm it matches our target format.
print(json.dumps(math_rows[0], indent=2))

Processed math samples: 103022

Processed sample:
{
  "id": 0,
  "question": "Evaluate the limit: \\[ \\lim_{x \\to \\infty} \\sqrt{x} \\left( \\sqrt[3]{x+1} - \\sqrt[3]{x-1} \\right) \\]",
  "chain": "",
  "result": "0",
  "source": "mathhard",
  "extra_info": {
    "ground_truth": "0",
    "idx": 0
  }
}


#### Converting to a `Dataset` Object

Our processed data is currently a Python list of dictionaries. For better performance and compatibility with the Hugging Face ecosystem (like the `Trainer` API), we'll convert it into a `datasets.Dataset` object.

In [9]:
# First, convert the list of dictionaries into a pandas DataFrame.
# Then, use `Dataset.from_pandas` to create the Hugging Face Dataset object.
# `preserve_index=False` tells the function not to add the DataFrame's index as a new column.
ds_math = Dataset.from_pandas(
    pd.DataFrame(math_rows),
    preserve_index=False
)
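
One thing to keep in mind with this conversion: `pd.DataFrame` fills missing keys with NaN instead of raising, so a row with a misspelled key would pass silently. The sketch below shows one way to check key consistency before converting. The helper `rows_to_columns` is hypothetical; the dict of lists it returns is the column-oriented shape that `Dataset.from_dict` accepts.

```python
def rows_to_columns(rows):
    """Pivot a list of same-keyed dicts into a dict of equal-length lists."""
    if not rows:
        return {}
    keys = list(rows[0])
    for i, row in enumerate(rows):
        # Reject rows whose keys (or key order) differ from the first row.
        if list(row) != keys:
            raise ValueError(f"Row {i} has keys {list(row)}, expected {keys}")
    return {key: [row[key] for row in rows] for key in keys}


toy_rows = [
    {"id": 0, "question": "What is 2 + 2?", "result": "4"},
    {"id": 1, "question": "What is 3 + 3?", "result": "6"},
]
columns = rows_to_columns(toy_rows)
print(columns)
```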

### 2.2 The Natural Questions (NQ) Dataset (Open-Domain QA)

Now we'll repeat the process for the Natural Questions dataset. This dataset consists of real user questions posed to Google Search and their corresponding answers found on Wikipedia. Adding this data helps the model with general knowledge and fact-based queries.

#### Loading the NQ Dataset

In [10]:
print("\n=== Loading FlashRAG NQ ===")

# `load_dataset` accepts a repository ID plus an optional configuration name.
# The first argument is the repository on the Hub, "RUC-NLPIR/FlashRAG_datasets".
# The second selects the "nq" configuration (subset) within that repository.
nq_dataset = load_dataset(
    "RUC-NLPIR/FlashRAG_datasets",
    "nq",
    split="train"
)


=== Loading FlashRAG NQ ===


Downloading builder script:   0%|          | 0.00/6.10k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/10.1k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/9.96M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

#### Exploring the NQ Dataset

In [11]:
# Inspect the NQ dataset's structure.
print("Columns:", nq_dataset.column_names)
print("Total samples:", len(nq_dataset))

Columns: ['id', 'question', 'golden_answers']
Total samples: 79168


In [12]:
# Look at the first sample to understand the data format.
print("\nRaw NQ sample:")
print(json.dumps(nq_dataset[0], indent=2))


Raw NQ sample:
{
  "id": "train_0",
  "question": "total number of death row inmates in the us",
  "golden_answers": [
    "2,718"
  ]
}


#### Standardizing the NQ Dataset

The processing for NQ is slightly more complex. We need to perform some cleaning:

1.  **Format Question**: Ensure every question ends with a question mark for consistency.
2.  **Handle Answer Types**: The `golden_answers` field is a list of strings in the sample above, but depending on how the data is loaded it may arrive in other forms (NumPy arrays, bare strings, scalars). Our code handles each case defensively, extracts the answer(s), and converts them to strings.
3.  **Join Multiple Answers**: Some questions might have multiple valid answers. We will join them together into a single string, separated by a semicolon.
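
The three rules above can be sketched as small pure functions, which makes them easy to test in isolation. This is a simplified, standard-library-only variant with hypothetical helper names. Unlike the notebook cell, it does not handle NumPy arrays, and it strips whitespace from every answer.

```python
import math


def ensure_question_mark(text):
    """Strip whitespace and append '?' to non-empty text that lacks one."""
    text = text.strip()
    if text and not text.endswith("?"):
        text += "?"
    return text


def clean_answers(raw):
    """Flatten `raw` into a list of non-empty answer strings."""
    if raw is None:
        return []
    candidates = list(raw) if isinstance(raw, (list, tuple)) else [raw]
    cleaned = []
    for value in candidates:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue  # skip NaN placeholders
        text = str(value).strip()
        if text:
            cleaned.append(text)
    return cleaned


def join_answers(raw, sep="; "):
    """Join all valid answers into one string."""
    return sep.join(clean_answers(raw))


print(ensure_question_mark(" who wrote hamlet "))
print(join_answers(["a", None, float("nan"), " b "]))
```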

In [13]:
print("\n=== Processing NQ ===")

# Initialize an empty list to store processed NQ records.
nq_rows = []

# Iterate through the NQ dataset with a progress bar.
for idx, item in enumerate(tqdm(nq_dataset, desc="Processing NQ")):
    # Get the question, remove leading/trailing whitespace.
    question = item.get("question", "").strip()
    # Ensure the question ends with a '?' for consistency.
    if question and not question.endswith("?"):
        question += "?"

    # Get the answers, defaulting to an empty list if not present.
    golden_answers = item.get("golden_answers", [])
    cleaned_answers = [] # This list will hold valid, string-formatted answers.

    # The following block robustly handles various data types for the answers.
    if isinstance(golden_answers, np.ndarray):
        for x in golden_answers.flatten(): # Flatten in case of multi-dimensional array.
            if x is not None and pd.notna(x):
                cleaned_answers.append(str(x))
    elif isinstance(golden_answers, (list, tuple)):
        for x in golden_answers:
            if x is not None and pd.notna(x):
                cleaned_answers.append(str(x))
    elif isinstance(golden_answers, str):
        if golden_answers.strip():
            cleaned_answers.append(golden_answers.strip())
    elif isinstance(golden_answers, (int, float, np.generic)):
        if not pd.isna(golden_answers):
            cleaned_answers.append(str(golden_answers))
    else: # Catch-all for any other types.
        s = str(golden_answers).strip()
        if s and s != "nan": # Avoid adding 'nan' as an answer.
            cleaned_answers.append(s)

    # Join all cleaned answers into a single string, separated by "; ".
    final_result = "; ".join(cleaned_answers)

    # Append the record in our standard format.
    nq_rows.append({
        "id": idx,  # Temporary ID.
        "question": question,
        "chain": "",
        "result": final_result,
        "source": "nq", # Tag the source as Natural Questions.
        "extra_info": {
            "ground_truth": final_result,
            "idx": idx
        }
    })


=== Processing NQ ===


Processing NQ: 100%|██████████| 79168/79168 [00:01<00:00, 48264.44it/s]


In [14]:
# Verify the number of processed samples and check the first record.
print("Processed NQ samples:", len(nq_rows))
print("\nProcessed NQ sample:")
print(json.dumps(nq_rows[0], indent=2))

Processed NQ samples: 79168

Processed NQ sample:
{
  "id": 0,
  "question": "total number of death row inmates in the us?",
  "chain": "",
  "result": "2,718",
  "source": "nq",
  "extra_info": {
    "ground_truth": "2,718",
    "idx": 0
  }
}
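
Because multiple answers are stored as one `"; "`-joined string, any later scoring code has to split them again. A minimal sketch of such a check follows, with hypothetical helper names. It assumes no individual answer contains the separator itself, which the joined format cannot guarantee.

```python
def normalize(text):
    """Lowercase and collapse whitespace for a lenient string comparison."""
    return " ".join(text.lower().split())


def matches_any(prediction, joined_answers, sep="; "):
    """Return True if `prediction` equals any of the joined gold answers."""
    gold = {normalize(a) for a in joined_answers.split(sep) if a.strip()}
    return normalize(prediction) in gold


gold = "a viola player; a cellist; two violin players"
print(matches_any("A Cellist", gold))
print(matches_any("a pianist", gold))
```

If answers containing semicolons turn out to matter, storing the answers as a list column would avoid the ambiguity.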


#### Converting to a `Dataset` Object

In [15]:
# Convert the processed NQ data into a Hugging Face Dataset object.
ds_nq = Dataset.from_pandas(
    pd.DataFrame(nq_rows),
    preserve_index=False
)

### 2.3 Combining and Finalizing the Training Set

With both datasets processed and standardized, the final step is to merge them into a single training set. We will then shuffle this combined dataset and assign new, unique IDs to each record.

#### Concatenating Datasets

In [16]:
print("\n=== Combining datasets ===")
# `concatenate_datasets` takes a list of Dataset objects and merges them row-wise.
combined = concatenate_datasets([ds_nq, ds_math])
print("Combined size:", len(combined))


=== Combining datasets ===
Combined size: 182190


#### Shuffling and Re-indexing

**Shuffling** is a critical step. Without it, the dataset stores all 79,168 NQ samples followed by all 103,022 math samples, so a training loop that reads the data in order would see one domain at a time, which can bias the learning process. After shuffling, batches are drawn from a random mix of both sources, leading to more robust learning.

**Re-indexing** is necessary because each source numbered its records from 0, so the combined `id` column contains duplicates, and shuffling scrambles their order. We apply a mapping function to assign a new, clean, sequential ID from 0 to N-1.
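
The effect of these two steps can be seen on a toy example that uses plain Python lists rather than `Dataset` objects. This is only an illustration of the idea. The names `toy_nq` and `toy_math` are made up, and `random.Random(42)` will not reproduce the order that `Dataset.shuffle(seed=42)` produces.

```python
import random

# Two toy sources, each with its own ids starting at 0.
toy_nq = [{"id": i, "source": "nq"} for i in range(3)]
toy_math = [{"id": i, "source": "math"} for i in range(4)]

combined_rows = toy_nq + toy_math
ids_before = [row["id"] for row in combined_rows]  # contains duplicates

# Seeded shuffle on a copy, so the same seed gives the same order every run.
shuffled = combined_rows[:]
random.Random(42).shuffle(shuffled)

# Assign fresh sequential ids without mutating the original rows.
reindexed = [{**row, "id": new_id} for new_id, row in enumerate(shuffled)]
ids_after = [row["id"] for row in reindexed]

print("before:", ids_before)
print("after: ", ids_after)
print("sources:", [row["source"] for row in reindexed])
```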

In [17]:
print("\n=== Shuffling ===")
# The `.shuffle()` method randomizes the order of the rows in the dataset.
# Providing a `seed` ensures that the shuffle is reproducible. Anyone running this code
# with the same seed will get the exact same shuffled order.
combined = combined.shuffle(seed=42)

print("\n=== Re-indexing IDs ===")
# The `.map()` method applies a function to each element of the dataset.
# Here, we use a lambda function that ignores the sample (`_`) and uses the index (`idx`).
# `with_indices=True` provides the index of each row to our function.
# This effectively replaces the old 'id' column with a new one from 0 to len-1.
combined = combined.map(
    lambda _, idx: {"id": idx},
    with_indices=True
)


=== Shuffling ===

=== Re-indexing IDs ===


Shuffle:   0%|          | 0/182190 [00:00<?, ? examples/s]

Map:   0%|          | 0/182190 [00:00<?, ? examples/s]

In [18]:
# Let's inspect the very first sample of our final combined and shuffled dataset.
# The new `id` is 0, while `extra_info["idx"]` keeps the record's original position
# in its source dataset (36700 here), which confirms that the shuffle took effect.
print("\nFinal combined sample:")
print(json.dumps(combined[0], indent=2))


Final combined sample:
{
  "id": 0,
  "question": "he classical string quartet is a musical composition for?",
  "chain": "",
  "result": "a viola player; a cellist; two violin players",
  "source": "nq",
  "extra_info": {
    "ground_truth": "a viola player; a cellist; two violin players",
    "idx": 36700
  }
}


#### Saving the Training Dataset

Finally, we save our completed training dataset to a file. We use the Parquet format, which is a highly efficient, column-oriented data format ideal for large datasets. It's widely supported and generally faster to read than formats like CSV or JSON.

In [19]:
# Construct the full output file path using the directory we defined earlier.
output_path = os.path.join(train_output_dir, "combined_train.parquet")
# Use the `.to_parquet()` method to save the dataset.
combined.to_parquet(output_path)

print(f"\n✅ Train dataset saved to {output_path}")

Creating parquet from Arrow format:   0%|          | 0/19 [00:00<?, ?ba/s]


✅ Train dataset saved to ./data/train\combined_train.parquet


## Section 3: Preparing the Validation Dataset

A good validation set should be representative of the tasks we care about, but it should not overlap with the training data. For this purpose, we will use a small, high-quality, and challenging dataset: problems from the AIME 2024 competition. This will allow us to rigorously test our model's advanced reasoning abilities.
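
The no-overlap requirement can be checked rather than assumed. The sketch below flags validation questions whose normalized text also appears in the training questions. The helper names are hypothetical, and the normalization is crude: it only catches near-verbatim duplicates, not paraphrases. If an instruction string is appended to validation questions, the comparison should use the original problem text.

```python
import re


def normalize_question(text):
    """Lowercase, replace non-alphanumeric runs with spaces, and trim."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_overlap(train_questions, val_questions):
    """Return validation questions that also occur in the training questions."""
    train_keys = {normalize_question(q) for q in train_questions}
    return [q for q in val_questions if normalize_question(q) in train_keys]


train_qs = ["Find m+n.", "What is 2 + 2?"]
val_qs = ["what is 2+2", "Find the remainder."]
overlap = find_overlap(train_qs, val_qs)
print(overlap)
```

An empty result is weak evidence of no leakage; a non-empty one is worth investigating before training.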

### 3.1 The AIME 2024 Dataset (Advanced Math Competition)

#### Loading and Exploring the AIME Dataset

In [20]:
print("\n=== Loading AIME 2024 ===")

# Load the AIME dataset from the Hugging Face Hub.
aime_dataset = load_dataset(
    "Maxwell-Jia/AIME_2024",
    split="train"
)


=== Loading AIME 2024 ===


Downloading builder script:   0%|          | 0.00/4.44k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/1.03k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/37.2k [00:00<?, ?B/s]

Generating train split: 30 examples [00:00, 192.65 examples/s]

In [21]:
# Check the columns and size of this small but challenging dataset.
print("Columns:", aime_dataset.column_names)
print("Total samples:", len(aime_dataset))

Columns: ['ID', 'Problem', 'Solution', 'Answer']
Total samples: 30


In [22]:
# Examine a sample record to see the structure.
print("\nRaw sample:")
print(json.dumps(aime_dataset[0], indent=2))


Raw sample:
{
  "ID": "2024-II-4",
  "Problem": "Let $x,y$ and $z$ be positive real numbers that satisfy the following system of equations: \n\\[\\log_2\\left({x \\over yz}\\right) = {1 \\over 2}\\]\n\\[\\log_2\\left({y \\over xz}\\right) = {1 \\over 3}\\]\n\\[\\log_2\\left({z \\over xy}\\right) = {1 \\over 4}\\]\nThen the value of $\\left|\\log_2(x^4y^3z^2)\\right|$ is $\\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$.",
  "Solution": "Denote $\\log_2(x) = a$, $\\log_2(y) = b$, and $\\log_2(z) = c$.\n\nThen, we have:\n$a-b-c = \\frac{1}{2}$,\n$-a+b-c = \\frac{1}{3}$,\n$-a-b+c = \\frac{1}{4}$.\n\nNow, we can solve to get $a = \\frac{-7}{24}, b = \\frac{-9}{24}, c = \\frac{-5}{12}$.\nPlugging these values in, we obtain $|4a + 3b + 2c|  = \\frac{25}{8} \\implies \\boxed{033}$.",
  "Answer": 33
}


#### Processing the AIME Dataset for Evaluation

We will process this dataset into our standard format, but with one important modification. For evaluation, it's very helpful if the model outputs its final answer in a predictable, easy-to-parse format. 

To achieve this, we will add a specific instruction to the end of each question prompt, telling the model to enclose its final answer in `<answer>` and `</answer>` tags. This is a common prompt engineering technique for improving the reliability of automated evaluation.
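
To show why the tags help, here is a minimal sketch of how an evaluation script might parse a completion and compare it with the ground truth. The function names are hypothetical and this is not part of the pipeline. Answers are compared as integers where possible, so a response of `033` still matches a stored answer of `33`.

```python
import re

# Non-greedy match; DOTALL lets the answer span multiple lines.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion):
    """Return the text inside the last <answer>...</answer> pair, or None."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return None
    return matches[-1].strip()


def is_correct(completion, ground_truth):
    """Compare the extracted answer with the ground-truth string."""
    predicted = extract_answer(completion)
    if predicted is None:
        return False
    try:
        return int(predicted) == int(ground_truth)
    except ValueError:
        # Fall back to exact string comparison for non-integer answers.
        return predicted == ground_truth.strip()


print(is_correct("So m + n = 33. <answer>033</answer>", "33"))
print(is_correct("The answer is 33.", "33"))
```

Taking the last tagged span is a design choice: it tolerates models that restate a draft answer before the final one.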

In [23]:
print("\n=== Processing AIME 2024 ===")

# This is the instruction string we will append to each question.
OUTPUT_FORMAT = (
    "When ready, output the final answer enclosed in <answer> and </answer> tags. "
    "Do not generate any content after the </answer> tag."
)

# Initialize an empty list for the processed validation data.
aime_rows = []

# Iterate through the AIME dataset.
for idx, item in enumerate(tqdm(aime_dataset, desc="Processing AIME 2024")):
    # Extract the problem and answer, cleaning up whitespace.
    problem = item.get("Problem", "").strip()
    answer = item.get("Answer", "")

    # Construct the full question by combining the original problem with our instruction string.
    full_question = (
        f"{problem}\n\n{OUTPUT_FORMAT}"
        if problem else OUTPUT_FORMAT
    )

    # Append the record in our standard format.
    aime_rows.append({
        "id": idx,
        "question": full_question,
        "chain": "", # Left empty as this is for evaluation input.
        "result": str(answer).strip(),
        "source": "aime2024",
        "extra_info": {
            "ground_truth": str(answer).strip(),
            "idx": idx,
            "original_problem": problem # Store the original problem text for reference.
        }
    })


=== Processing AIME 2024 ===


Processing AIME 2024: 100%|██████████| 30/30 [00:00<00:00, 31924.96it/s]


In [24]:
# Verify the processed count and check the first sample to see the appended instruction.
print("Processed AIME samples:", len(aime_rows))
print("\nProcessed sample:")
print(json.dumps(aime_rows[0], indent=2))

# Convert the list of dictionaries into a Dataset object.
ds_aime = Dataset.from_pandas(
    pd.DataFrame(aime_rows),
    preserve_index=False
)

Processed AIME samples: 30

Processed sample:
{
  "id": 0,
  "question": "Let $x,y$ and $z$ be positive real numbers that satisfy the following system of equations: \n\\[\\log_2\\left({x \\over yz}\\right) = {1 \\over 2}\\]\n\\[\\log_2\\left({y \\over xz}\\right) = {1 \\over 3}\\]\n\\[\\log_2\\left({z \\over xy}\\right) = {1 \\over 4}\\]\nThen the value of $\\left|\\log_2(x^4y^3z^2)\\right|$ is $\\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$.\n\nWhen ready, output the final answer enclosed in <answer> and </answer> tags. Do not generate any content after the </answer> tag.",
  "chain": "",
  "result": "33",
  "source": "aime2024",
  "extra_info": {
    "ground_truth": "33",
    "idx": 0,
    "original_problem": "Let $x,y$ and $z$ be positive real numbers that satisfy the following system of equations: \n\\[\\log_2\\left({x \\over yz}\\right) = {1 \\over 2}\\]\n\\[\\log_2\\left({y \\over xz}\\right) = {1 \\over 3}\\]\n\\[\\log_2\\left({z \\over xy}\\

#### Finalizing the Validation Set

Just as we did with the training data, we will shuffle and re-index the validation set. While shuffling is less critical for a validation set (since it's not used for gradient updates), it's good practice for consistency and to avoid any potential ordering bias during evaluation.

In [25]:
print("\n=== Shuffling validation data ===")
# Shuffle the validation set with the same seed for reproducibility.
ds_aime = ds_aime.shuffle(seed=42)

print("Re-indexing IDs...")
# Re-assign sequential IDs to the shuffled validation data.
ds_aime = ds_aime.map(
    lambda _, idx: {"id": idx},
    with_indices=True
)


=== Shuffling validation data ===
Re-indexing IDs...


Shuffle:   0%|          | 0/30 [00:00<?, ? examples/s]

Map:   0%|          | 0/30 [00:00<?, ? examples/s]

#### Saving the Validation Dataset

In [26]:
# Construct the full path for the validation file.
val_path = os.path.join(val_output_dir, "aime24.parquet")
# Save the final validation dataset to a Parquet file.
ds_aime.to_parquet(val_path)

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

In [27]:
print(f"\n✅ Validation dataset saved to {val_path}")
# Print a sample prompt from the final validation set to confirm our formatting.
print("\nValidation prompt sample:\n")
print(ds_aime[0]["question"])


✅ Validation dataset saved to ./data/val\aime24.parquet

Validation prompt sample:

Let $\omega \neq 1$ be a 13th root of unity. Find the remainder when 
\[ \prod_{k=0}^{12}(2 - 2\omega^k + \omega^{2k}) \] is divided by 1000.

When ready, output the final answer enclosed in <answer> and </answer> tags. Do not generate any content after the </answer> tag.


## Section 4: Final Checks and Conclusion

We have successfully created our training and validation datasets. As a final step, let's perform some sanity checks: the on-disk size of the output files and the sample counts of the datasets we wrote.

### 4.1 Checking File Sizes

We'll check the on-disk size of the generated Parquet files. This is useful for understanding storage requirements and for a high-level confirmation that the files were written correctly.

In [28]:
# Get the file path for the training parquet file.
train_path = os.path.join(train_output_dir, "combined_train.parquet")
# Get the file path for the validation parquet file.
val_path = os.path.join(val_output_dir, "aime24.parquet")

# Use os.path.getsize to get the file size in bytes.
train_size = os.path.getsize(train_path)
val_size = os.path.getsize(val_path)

# Convert bytes to megabytes (MB) for easier reading and print the results.
print(f"\nTrain parquet size: {train_size / (1024 * 1024):.2f} MB")
print(f"Validation parquet size: {val_size / (1024 * 1024):.2f} MB")


Train parquet size: 18.92 MB
Validation parquet size: 0.02 MB


### 4.2 Checking Sample Counts

Let's verify the total number of records in each of our final datasets. These counts come from the in-memory `Dataset` objects that were written to disk, and they should match the counts we saw during processing.

In [29]:
# The length of our in-memory Dataset objects gives the total number of samples.
train_count = len(combined)
print(f"\nTotal train samples: {train_count}")
val_count = len(ds_aime)
print(f"Total validation samples: {val_count}")


Total train samples: 182190
Total validation samples: 30


### 4.3 Conclusion

In this notebook, we have successfully executed a full data preparation pipeline. We started with raw datasets from different domains, processed and standardized them, combined them into a robust training set, and created a challenging validation set.

Our final artifacts are:
*   `./data/train/combined_train.parquet`: A shuffled training set with 182,190 samples, mixing mathematical reasoning with open-domain question answering.
*   `./data/val/aime24.parquet`: A high-difficulty validation set with 30 competition-level math problems, formatted for easy evaluation.

These datasets are now ready to be used as inputs for fine-tuning a Large Language Model to enhance its reasoning and problem-solving capabilities.