

# Welcome to the Math Question Answer Verification Competition! 🚀

The goal is to fine-tune a Llama-3-8B model to predict if a given solution to a math problem is correct or not. Your model should output True if the solution is correct, and False otherwise.

This notebook is a starter guide designed to get you up and running quickly. We'll walk through a simplified training process using a small subset of the data (10,000 training examples and 1,000 validation examples) and lightweight parameters. The main goal is to understand the complete workflow, from loading data to generating a submission file, not to achieve a top score.

Good luck, and have fun! 🎉

## **Step 1: Install Necessary Libraries**

First, we need to install the required Python libraries. We'll be using the `unsloth` library, which provides memory-efficient training methods for large language models, making it possible to fine-tune an 8B model on a single free-tier GPU. The cell below removes any preinstalled `transformers`, `unsloth`, and `unsloth_zoo`, then installs a pinned `transformers` release (below 4.43.0) together with `unsloth` and `unsloth_zoo`.
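Since the install cell in this step reinstalls packages that Colab may already have loaded, it can be worth confirming which versions are actually present once it has run (a runtime restart may be needed before a reinstalled package is picked up). A minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the package list is simply the set of libraries this notebook imports:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the versions of the libraries used in this notebook
report = installed_versions(["transformers", "unsloth", "unsloth_zoo", "trl"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If the reported `transformers` version does not match the pin, restarting the runtime and re-running this check is a reasonable first step.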


In [None]:
!pip uninstall -y transformers unsloth unsloth_zoo
!pip install "transformers<4.43.0"
!pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
!pip install unsloth_zoo

Found existing installation: transformers 4.57.1
Uninstalling transformers-4.57.1:
  Successfully uninstalled transformers-4.57.1
Found existing installation: unsloth 2025.10.12
Uninstalling unsloth-2025.10.12:
  Successfully uninstalled unsloth-2025.10.12
Collecting transformers<4.43.0
  Using cached transformers-4.42.4-py3-none-any.whl.metadata (43 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers<4.43.0)
  Using cached tokenizers-0.19.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Using cached transformers-4.42.4-py3-none-any.whl (9.3 MB)
Using cached tokenizers-0.19.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.22.1
    Uninstalling tokenizers-0.22.1:
      Successfully uninstalled tokenizers-0.22.1
Successfully installed tokenizers-0.19.1 transformers-4.42.4
Collecting uns

## **Step 2: Load the Model and Tokenizer**

Next, we'll load the Llama 8B base model, which is the only model permitted for this competition (the cell below loads the `unsloth/Meta-Llama-3.1-8B` checkpoint). We'll use Unsloth's `FastLanguageModel` to handle this efficiently.

A key technique we'll use is 4-bit quantization (`load_in_4bit = True`). Each weight is stored in 4 bits instead of 16, which cuts the memory needed for the weights to roughly a quarter. This significantly reduces the amount of GPU memory required, allowing us to fine-tune this large model even on a free platform like Google Colab.
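To make the saving concrete, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It is only a sketch: the parameter count is an assumed round figure for an 8B Llama model, and the estimate ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so the real footprint is somewhat larger.

```python
def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumption: approximate parameter count of an 8B Llama base model
N_PARAMS = 8_030_261_248

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label} weights: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

The 4-bit figure is a quarter of the 16-bit one, which is what brings the model within reach of a single free-tier GPU.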



In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 1024  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from Hugging Face
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.12: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is massive (1,000,000 training examples). For this starter notebook, we'll create a much smaller, more manageable version to speed things up: **10,000 samples for training** and **1,000 for validation**.
3.  **Prompting**: We will format each data sample into a clear instructional prompt. This helps the model understand its role as a mathematician verifying a solution.
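The prompting step can be sketched with plain string formatting. The template is the one used in this notebook; the `EOS_TOKEN` value is an assumed stand-in for `tokenizer.eos_token`, the helper names are hypothetical, and the question and solution are made up for illustration.

```python
# Template copied from this notebook's training cell
TRAINING_PROMPT = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
{}"""

# Assumption: stand-in for tokenizer.eos_token; the real value comes from the tokenizer
EOS_TOKEN = "<|end_of_text|>"


def build_training_text(question: str, solution: str, is_correct: bool) -> str:
    """Fill the template with the label, then close the sequence with EOS."""
    label = "True" if is_correct else "False"
    return TRAINING_PROMPT.format(question, solution, label) + EOS_TOKEN


def build_inference_text(question: str, solution: str) -> str:
    """Same template, with the label slot left empty for the model to fill."""
    return TRAINING_PROMPT.format(question, solution, "")


sample = build_training_text("What is 2 + 3?", "2 + 3 = 6, so the answer is 6.", False)
print(sample)
```

The training text is the inference prompt plus the label and the EOS token, so the model learns to emit exactly one word after `Output:` and then stop.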



In [None]:
from datasets import load_dataset

# Load the full training dataset
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Shuffle the dataset for randomness and create our smaller splits
shuffled_dataset = full_dataset.shuffle(seed=42)
train_dataset = shuffled_dataset.select(range(10000))      # Use the first 10,000 for training
validation_dataset = shuffled_dataset.select(range(10000, 11000)) # Use the next 1,000 for validation

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00002.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

data/train-00001-of-00002.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/3.65M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [None]:
# The instructional prompt template for training
training_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
{}"""

# We must add an End Of Sequence (EOS) token to tell the model when a completion is finished.
EOS_TOKEN = tokenizer.eos_token

import re

def clean_text(txt: str):
    if not isinstance(txt, str):
        txt = str(txt)
    txt = re.sub(r"\s+", " ", txt)
    return txt.strip()

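def smart_truncate(text, tokenizer, max_tokens):
    """Shorten `text` to a head slice and a tail slice when it exceeds `max_tokens` tokens."""
    tokens = tokenizer(text, truncation=False)["input_ids"]

    if len(tokens) <= max_tokens:
        return text

    # Keep the first 30% and the last 20% of the token budget; the middle is dropped
    keep_start = int(max_tokens * 0.3)
    keep_end = int(max_tokens * 0.2)

    # tokens[-0:] would return the whole list, so handle keep_end == 0 explicitly
    tail = tokens[-keep_end:] if keep_end > 0 else []
    truncated_tokens = tokens[:keep_start] + tail
    truncated_text = tokenizer.decode(truncated_tokens, skip_special_tokens=True)
    return truncated_text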


def formatting_prompts_func(examples):
    questions = examples["question"]
    solutions = examples["solution"]
    outputs = examples["is_correct"]
    texts = []

    for question, solution, output in zip(questions, solutions, outputs):
        output_str = "True" if str(output).lower() in ["true", "1"] else "False"

        question = clean_text(question)
        solution = clean_text(solution)

        text = training_prompt.format(question, solution, output_str) + EOS_TOKEN

        tokens = tokenizer(text, truncation=False)["input_ids"]

        if len(tokens) > max_seq_length:

            meta = training_prompt.format("", "", output_str) + EOS_TOKEN
            meta_token_len = len(tokenizer(meta)["input_ids"])
            available = max_seq_length - meta_token_len

            question = smart_truncate(question, tokenizer, int(available * 0.35))
            solution = smart_truncate(solution, tokenizer, int(available * 0.65))

            text = training_prompt.format(question, solution, output_str) + EOS_TOKEN

        texts.append(text)

    return {"text": texts}


# Apply the formatting function to our training and validation datasets
formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
formatted_val_dataset = validation_dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

## **Step 4: Configure LoRA and Set Up the Trainer**

### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge. This is much faster and requires significantly less memory. We'll use a **rank** of `r = 32`, which adds about 84 million trainable parameters, roughly 1% of the model.
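The number of trainable parameters follows directly from the rank and the shapes of the adapted layers. The sketch below assumes the standard Llama 8B layer dimensions, which are not stated in this notebook, and uses a hypothetical helper `lora_param_count`. With `r = 32` and the seven target modules passed to `get_peft_model` in this step, it reproduces the 83,886,080 trainable parameters that Unsloth reports when training starts.

```python
# Assumption: standard Llama 8B layer shapes (not stated in this notebook)
HIDDEN = 4096         # model width
KV_DIM = 1024         # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336  # MLP width
NUM_LAYERS = 32

# (in_features, out_features) of each adapted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Each adapted layer adds two matrices: A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


for rank in (8, 32):
    print(f"r = {rank}: {lora_param_count(rank):,} trainable parameters")
```

Because the count is linear in the rank, going from `r = 8` to `r = 32` quadruples the adapter size.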


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Larger r -> more trainable parameters and capacity, but more VRAM and compute; smaller r -> the reverse
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", # attention projections (query/key/value/output)
                      "gate_proj", "up_proj", "down_proj"], # MLP projections in Llama; covering both attention and MLP usually outperforms attention-only
    lora_alpha = 48, # Scaling factor, here 1.5 * r (a common heuristic is alpha = 2 * r; 1st version used r = 8, alpha = 16)
    lora_dropout = 0.15, # 1st version: 0.01; any non-zero value disables Unsloth's fast LoRA patching (see the warning below)
    bias = "none",  # do not train bias terms
    use_gradient_checkpointing = "unsloth", # saves memory for longer sequences and batches, at the cost of slower wall-clock time
    random_state = 42,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.15.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.10.12 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.



### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.

We allow up to **five epochs** over our 10,000-sample training set and rely on early stopping on the validation loss (a patience of 3 evaluations) to end training sooner if the model stops improving.
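The batch-size settings determine how many optimizer steps the run takes. As a rough sketch, using the values from the `TrainingArguments` in this step and the 10,000-example training split (`training_schedule` is a hypothetical helper, and the rounding is exact only when the dataset size divides evenly, as it does here):

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, num_gpus, epochs):
    """Return (effective batch size, optimizer steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


effective_batch, steps_per_epoch, total_steps = training_schedule(
    num_examples=10_000, per_device_batch=4, grad_accum=4, num_gpus=1, epochs=5
)
print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch:      {steps_per_epoch}")
print(f"Total steps:          {total_steps}")
```

The total agrees with the 3,125 steps shown in the training banner, and with `eval_steps = 400` it means the validation loss is measured seven times over the run.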

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, EarlyStoppingCallback

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,  # The trainer reads the formatted prompt strings and fine-tunes the model to reproduce them, including the final True/False label
    eval_dataset = formatted_val_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = False,   # Packing would concatenate several short prompts into one sequence; disabled here
    args = TrainingArguments(
        # wall-clock control
        # max_steps = 500,   # If set to a positive value, this overrides num_train_epochs: e.g. 500 optimizer steps x 16 examples per step = 8,000 examples, or 80% of one epoch on the 10,000-example training set
        num_train_epochs = 5,  # upper bound; early stopping may end training sooner
        max_steps = -1,

        # batch shape
        per_device_train_batch_size = 4,    # examples per forward/backward pass on each GPU
        per_device_eval_batch_size = 4,
        gradient_accumulation_steps = 4, # Effective batch size = 4 x 4 = 16 examples per optimizer step
        # warmup_steps = 10,   # if set to a positive value, this takes precedence over warmup_ratio

        # Optimizer & schedule
        learning_rate = 3e-5,               # original 2e-4, 1st version: 1e-4, if raise r, go lower
        warmup_ratio = 0.08,
        optim = "paged_adamw_8bit",
        weight_decay = 0.02, # stronger regularization
        lr_scheduler_type = "cosine",
        max_grad_norm = 0.3,

        # precision/speed
        fp16 = False,
        bf16 = True,
        gradient_checkpointing = True,
        gradient_checkpointing_kwargs = {"use_reentrant": False},
        group_by_length = False,
        dataloader_num_workers = 2,
        dataloader_pin_memory = True,

        # logging/checkpoints
        eval_strategy = "steps",
        save_strategy = "steps",
        eval_steps = 400,
        save_steps = 400,
        load_best_model_at_end = True,
        metric_for_best_model = "loss",
        greater_is_better = False,


        logging_steps = 25,
        save_total_limit = 2,
        save_safetensors = True,

        # misc
        seed = 42,
        output_dir = "outputs",
        report_to = "none",
    ),
    callbacks=[EarlyStoppingCallback(
        early_stopping_patience=3,
        early_stopping_threshold=1e-4
    )]
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/10000 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/1000 [00:00<?, ? examples/s]

## **Step 5: Start Training!**

Now, we'll call the `train()` function on our `trainer` object. This will kick off the fine-tuning process. Based on our settings, this runs for up to five epochs over our 10,000 training examples, evaluating on the validation set every 400 steps.

Grab a coffee: the run logged below took about 2.6 hours on an A100. ☕


In [None]:
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 5 | Total steps = 3,125
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)




Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


Step,Training Loss,Validation Loss
400,0.6906,0.697057
800,0.6392,0.674833
1200,0.6445,0.656743
1600,0.5887,0.64791
2000,0.5254,0.647152
2400,0.5075,0.641147
2800,0.4711,0.645191


TrainOutput(global_step=3125, training_loss=0.5997995327758789, metrics={'train_runtime': 9382.8864, 'train_samples_per_second': 5.329, 'train_steps_per_second': 0.333, 'total_flos': 9.590645534229135e+17, 'train_loss': 0.5997995327758789, 'epoch': 5.0})
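The log shows that training ran for all 3,125 steps, so early stopping never triggered. The sketch below replays the logged validation losses through a simplified version of the patience rule (patience 3, threshold 1e-4) to show why. It is an approximation of the idea, not the `EarlyStoppingCallback` implementation, which relies on the trainer's own best-metric tracking; `early_stop_step` is a hypothetical helper.

```python
def early_stop_step(evals, patience=3, threshold=1e-4):
    """Replay (step, validation_loss) pairs through a patience rule.

    Returns (stop_step or None, best_step, best_loss).
    """
    best_step, best_loss = None, float("inf")
    bad_evals = 0
    for step, loss in evals:
        if best_loss - loss > threshold:
            # Improvement larger than the threshold: record it and reset the counter
            best_step, best_loss = step, loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step, best_step, best_loss
    return None, best_step, best_loss


# Validation losses copied from the training log above
logged = [(400, 0.697057), (800, 0.674833), (1200, 0.656743), (1600, 0.64791),
          (2000, 0.647152), (2400, 0.641147), (2800, 0.645191)]

stop_step, best_step, best_loss = early_stop_step(logged)
print(f"Stopped early at: {stop_step}")
print(f"Best checkpoint:  step {best_step} (validation loss {best_loss})")
```

Only the last evaluation fails to improve, so the counter reaches 1 of 3. With `load_best_model_at_end = True`, the checkpoint with the lowest validation loss is the one restored after training, while the training loss keeps falling, which hints at the start of overfitting.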


## **Step 6: Inference and Evaluation**

Now that our model is trained, we need to test it on our validation set. We'll use a slightly different prompt for inference: one where we leave the `Output:` section blank for the model to complete.

We'll run the model over the validation set in batches, read `True` or `False` from each completion, and report the accuracy.
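When scoring generations, the label has to be read from the text the model adds after `Output:`, not from the whole decoded sequence, which begins with the prompt itself. A minimal sketch with plain strings shows the difference; the helper names and sample text are made up for illustration, and with real model output it is more robust to slice off the prompt's token IDs before decoding.

```python
def parse_label(continuation: str) -> bool:
    """Read a True/False label from generated text; anything unclear counts as False."""
    return continuation.strip().lower().startswith("true")


def continuation_of(full_text: str, prompt: str) -> str:
    """Strip the prompt prefix from a decoded sequence, if it is present."""
    return full_text[len(prompt):] if full_text.startswith(prompt) else full_text


prompt = "Question:\nWhat is 2 + 2?\nSolution:\n2 + 2 = 4\nOutput:\n"
full_text = prompt + "True"

# Parsing the full sequence looks at the start of the prompt, never at the answer
print(parse_label(full_text))
# Parsing only the continuation sees the model's answer
print(parse_label(continuation_of(full_text, prompt)))
```

A parser that always sees the prompt first would predict `False` for every example, so the accuracy it reports would only reflect the share of `False` labels in the data.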

In [None]:
from unsloth import FastLanguageModel
import torch

FastLanguageModel.for_inference(model)

inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""

def batch_generate(model, tokenizer, dataset, batch_size=64, show_samples=10):
    """Return the accuracy of greedy True/False predictions over `dataset`.

    `show_samples` is accepted for compatibility but is currently unused.
    """
    # Decoder-only models need left padding for batched generation, so that
    # every prompt ends exactly where generation begins.
    tokenizer.padding_side = "left"

    hits = 0
    data = list(dataset)                      # convert to a list of dicts
    total = len(data)

    for i in range(0, total, batch_size):
        batch = data[i:i+batch_size]          # batch is a list of dicts

        prompts = [
            inference_prompt.format(ex["question"], ex["solution"])
            for ex in batch
        ]

        inputs = tokenizer(
            prompts, return_tensors="pt", padding=True, truncation=True, max_length=1024
        ).to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=4,
                do_sample=False,              # greedy decoding
                use_cache=True
            )

        # Decode only the newly generated tokens. The full sequence starts with
        # the prompt, so it would never start with "true" or "false".
        new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
        responses = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

        for ex, resp in zip(batch, responses):
            lower_clean = resp.strip().lower()

            # Anything that is not clearly "true" is counted as False
            pred = lower_clean.startswith("true")

            if pred == ex["is_correct"]:
                hits += 1

    return hits / total

acc = batch_generate(model, tokenizer, validation_dataset, batch_size=32, show_samples=10)
print(f"\nValidation accuracy: {acc*100:.2f}%")


Validation accuracy: 59.50%
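An accuracy figure is easier to interpret next to the score of a trivial classifier that always predicts the most common label. A small sketch of that baseline follows; `majority_baseline` is a hypothetical helper and the labels are toy values, not taken from the competition data. In practice the input would be the `is_correct` column of the validation split.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Toy labels for illustration only
toy_labels = [False, False, False, True, True]

label, baseline_acc = majority_baseline(toy_labels)
print(f"Always predicting {label} scores {baseline_acc:.0%}")
```

A fine-tuned model is only adding value to the extent that it beats this number on the same split.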


## **Step 7: Generate Submission File**

This is the final step! We will now run our fine-tuned model on the official `test` dataset.

We will loop through each example in the test set, generate a prediction, and format the results into a CSV file with two columns: `ID` and `is_correct`, as required by the competition.
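The expected file layout is small enough to show directly. This sketch writes three made-up predictions with the standard `csv` module instead of pandas, purely to illustrate the two-column format; `write_submission` is a hypothetical helper.

```python
import csv
import io


def write_submission(predictions, stream):
    """Write predictions as CSV rows of (ID, is_correct), with IDs starting at 0."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["ID", "is_correct"])
    for idx, pred in enumerate(predictions):
        writer.writerow([idx, bool(pred)])


# Toy predictions for illustration only
buffer = io.StringIO()
write_submission([True, False, True], buffer)
print(buffer.getvalue())
```

The IDs are simply the row positions in the test set, so the predictions must stay in the same order as the examples.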


In [None]:
import pandas as pd
from tqdm import tqdm

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

def truncate_prompt(question, solution, max_length=1024):
    prompt = inference_prompt.format(question, str(solution))

    tokens = tokenizer(prompt, truncation=False, return_tensors="pt")
    input_ids = tokens['input_ids'][0]

    if len(input_ids) <= max_length:
        return prompt

    else:
        template_text = inference_prompt.format("", "")
        template_tokens = tokenizer(template_text, truncation=False)['input_ids']


        available_tokens = max_length - len(template_tokens) - 18


        question_tokens = tokenizer(str(question), truncation=False)['input_ids']
        solution_tokens = tokenizer(str(solution), truncation=False)['input_ids']

        max_question_tokens = min(len(question_tokens), int(available_tokens * 0.3))
        max_solution_tokens = available_tokens - max_question_tokens

        if len(question_tokens) > max_question_tokens:
            truncated_question = tokenizer.decode(
                question_tokens[:max_question_tokens],
                skip_special_tokens=True
            )
        else:
            truncated_question = str(question)

        if len(solution_tokens) > max_solution_tokens:
            truncated_solution = tokenizer.decode(
                solution_tokens[:max_solution_tokens],
                skip_special_tokens=True
            )
        else:
            truncated_solution = str(solution)

        return inference_prompt.format(truncated_question, truncated_solution)

# Loop through the test dataset and generate a prediction for each example
for example in tqdm(test_dataset):
    question = example["question"]
    solution = example["solution"]

    # Format the prompt
    prompt = truncate_prompt(question, solution, max_seq_length)
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction (greedy decoding, so results are reproducible)
    outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")

100%|██████████| 10000/10000 [46:14<00:00,  3.60it/s]


Submission file 'submission.csv' created successfully!
You can now download this file and submit it to the Kaggle competition.





# Save the Model to Drive and Run Inference
This section saves the model checkpoint to Google Drive, reloads the model from a checkpoint, and generates the final submission CSV file.

## Mount Google Drive

Mount Google Drive so the model checkpoint can be saved there.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Save model checkpoint

Define the save path, then save the trained model and tokenizer to Google Drive.



In [None]:
import os


# Define the path to save the model checkpoint in Google Drive
save_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint"

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model checkpoint and tokenizer saved to: {save_path}")
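Before relying on a saved checkpoint, it can help to confirm that the expected files were written. The sketch below checks a directory against a list of file names; the list is an assumption about what a LoRA adapter and tokenizer save typically produce and should be adjusted to whatever actually appears in `save_path`. The helper `missing_checkpoint_files` is hypothetical, and the demo uses a temporary directory so it runs anywhere.

```python
import tempfile
from pathlib import Path

# Assumption: file names typically written for a LoRA adapter and a fast tokenizer
EXPECTED_FILES = [
    "adapter_config.json",
    "adapter_model.safetensors",
    "tokenizer_config.json",
    "tokenizer.json",
]


def missing_checkpoint_files(checkpoint_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demo: create all but the last expected file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_FILES[:-1]:
        (Path(tmp) / name).touch()
    demo_missing = missing_checkpoint_files(tmp)

print(f"Missing files: {demo_missing}")
```

In the notebook, calling `missing_checkpoint_files(save_path)` after saving should return an empty list if the assumed file names match.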

In [None]:
import shutil
from google.colab import files


zip_path = shutil.make_archive(save_path, "zip", save_path)

files.download(zip_path)

## Load model from checkpoint

Load the model and tokenizer from the saved checkpoint and prepare the model for inference.



In [None]:
from unsloth import FastLanguageModel
import torch

# Path to the checkpoint directory. Note that this is a local path, not the
# Google Drive path used when saving; copy the checkpoint here or change the path.
save_path = "/content/model"
max_seq_length = 1024  # Choose any sequence length
dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from the saved path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Prepare the loaded model for faster inference
FastLanguageModel.for_inference(model)

print(f"Model and tokenizer loaded from: {save_path}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.12: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

Unsloth 2025.10.12 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Model and tokenizer loaded from: /content/model


## Generate submission file

Iterate through the test dataset, generate a prediction for each example with the loaded model, and save the results to a CSV file via a pandas DataFrame.



In [None]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# Create the prompt template for inference (no answer included)
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""

# A simple function to parse 'True' or 'False' from the model's raw output
def parse_output(response_text):
    # Find the text after "Output:"
    output_part = response_text.split("Output:\n")[-1]
    # Check if "True" is in that part, case-insensitively
    if 'true' in output_part.lower():
        return True
    return False

def truncate_prompt(question, solution, max_length=1024):
    prompt = inference_prompt.format(question, str(solution))

    tokens = tokenizer(prompt, truncation=False, return_tensors="pt")
    input_ids = tokens['input_ids'][0]

    if len(input_ids) <= max_length:
        return prompt

    else:
        template_text = inference_prompt.format("", "")
        template_tokens = tokenizer(template_text, truncation=False)['input_ids']


        available_tokens = max_length - len(template_tokens) - 18


        question_tokens = tokenizer(str(question), truncation=False)['input_ids']
        solution_tokens = tokenizer(str(solution), truncation=False)['input_ids']

        max_question_tokens = min(len(question_tokens), int(available_tokens * 0.3))
        max_solution_tokens = available_tokens - max_question_tokens

        if len(question_tokens) > max_question_tokens:
            truncated_question = tokenizer.decode(
                question_tokens[:max_question_tokens],
                skip_special_tokens=True
            )
        else:
            truncated_question = str(question)

        if len(solution_tokens) > max_solution_tokens:
            truncated_solution = tokenizer.decode(
                solution_tokens[:max_solution_tokens],
                skip_special_tokens=True
            )
        else:
            truncated_solution = str(solution)

        return inference_prompt.format(truncated_question, truncated_solution)

# Loop through the test dataset and generate a prediction for each example
for example in tqdm(test_dataset):
    question = example["question"]
    solution = example["solution"]

    # Format the prompt
    prompt = truncate_prompt(question, solution, max_seq_length)
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction (greedy decoding, so results are reproducible)
    outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False, use_cache=True)
    response_text = tokenizer.batch_decode(outputs)[0]

    # Parse the prediction and add it to our list
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully!")
print("You can now download this file and submit it to the Kaggle competition.")

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00002.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

data/train-00001-of-00002.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/3.65M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

100%|██████████| 10000/10000 [1:25:53<00:00,  1.94it/s]


Submission file 'submission.csv' created successfully!
You can now download this file and submit it to the Kaggle competition.



