# Notebook Description: T5-base for Sequence-to-Sequence Math Problem Classification

This notebook frames math problem classification as a sequence-to-sequence task using the t5-base model. Instead of predicting a class ID, the model is trained to generate the textual name of the math category (e.g., "Algebra", "Number Theory"). The data comes from [this Kaggle competition](https://www.kaggle.com/competitions/classification-of-math-problems-by-kasut-academy/overview).

### What this notebook does:

- Task Formulation: Treats classification as a text-to-text generation task. The model takes the math problem as input (with a prefix) and generates the corresponding category name as output.
- Model: Uses the t5-base model and tokenizer.
- Loads training (train.csv) and test (test.csv) data.
- Maps the numeric labels (0-7) to their full string names (e.g., "Algebra", "Geometry and Trigonometry") using a predefined dictionary.
- Splits the training data into a 90% training set and a 10% validation set, stratified by the original numeric label.
- Creates a Hugging Face DatasetDict for train, validation, and test sets.
- Adds a prefix "Classify this math problem: " to the input questions.
- Tokenizes the prefixed questions as model inputs and the string label names as target labels.
- Uses DataCollatorForSeq2Seq for handling dynamic padding during training.
- Fine-tunes the t5-base model using Seq2SeqTrainer for 10 epochs with mixed precision (fp16=True).
- Uses a custom compute_metrics function that decodes the generated token IDs, maps the predicted label names back to numeric IDs (a name that matches no known label maps to -1 and therefore counts as wrong), and calculates accuracy.
- Saves the best-performing checkpoint, selected by validation accuracy, during training.
- Evaluates the best model on the validation set.
- Reloads the fine-tuned model and tokenizer.
- Uses a text2text-generation pipeline to generate category name predictions for the official competition test set.
- Maps the generated text labels back to numeric IDs, assigning a default ID (0) if an unknown label name is generated.
- Creates and saves the final submission.csv file.
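
The text-to-text framing in the list above comes down to two small mappings: numeric label to target string before training, and generated string back to numeric ID afterwards. A minimal sketch, assuming the same label names and prefix that the notebook defines later; the helper names `to_text_pair` and `parse_generated_label` and the sample question are illustrative and do not appear in the notebook:

```python
# Label names copied from the notebook's id2label dictionary.
id2label = {
    0: "Algebra",
    1: "Geometry and Trigonometry",
    2: "Calculus and Analysis",
    3: "Probability and Statistics",
    4: "Number Theory",
    5: "Combinatorics and Discrete Math",
    6: "Linear Algebra",
    7: "Abstract Algebra and Topology",
}
label2id = {name: idx for idx, name in id2label.items()}
PREFIX = "Classify this math problem: "


def to_text_pair(question: str, label: int) -> tuple[str, str]:
    """Build the (input text, target text) pair a seq2seq model is trained on."""
    return PREFIX + question, id2label[label]


def parse_generated_label(text: str, default: int = 0) -> int:
    """Map generated text back to a numeric ID; unknown names fall back to `default`."""
    return label2id.get(text.strip(), default)


# Hypothetical question, used only to show the shape of the data.
source, target = to_text_pair("Solve 2x + 3 = 7 for x.", 0)
print(source)
print(target)
print(parse_generated_label(" Number Theory "))
print(parse_generated_label("Arithmetic"))  # not a known label, so the default applies
```

Because the fallback silently assigns a valid class, it can help to count how often it fires before trusting the resulting labels.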

### Next steps for this exploration:

- Experiment with other sequence-to-sequence models (e.g., BART, larger T5 variants like t5-large).
- Tune hyperparameters like learning rate, batch size, number of epochs, and potentially the input prefix design.
- Analyze the instances where the model generates unknown or incorrect label names.
- Apply robust text cleaning to the input questions before tokenization to see if it improves performance.
- Directly compare the results and efficiency of this sequence-to-sequence approach against the sequence classification methods used in the other notebooks.
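
One possible way to start on the unknown-label analysis is to snap near-miss generations to the closest known category instead of defaulting to 0. The sketch below uses `difflib.get_close_matches` from the Python standard library; this is a swapped-in idea, not something the notebook does, and the `nearest_label` helper and the 0.6 cutoff are assumptions chosen for illustration:

```python
import difflib

# Label names copied from the notebook's id2label dictionary.
LABEL_NAMES = [
    "Algebra",
    "Geometry and Trigonometry",
    "Calculus and Analysis",
    "Probability and Statistics",
    "Number Theory",
    "Combinatorics and Discrete Math",
    "Linear Algebra",
    "Abstract Algebra and Topology",
]


def nearest_label(generated, cutoff=0.6):
    """Return the closest known label name, or None when nothing is similar enough."""
    matches = difflib.get_close_matches(generated.strip(), LABEL_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(nearest_label("Geometry and Trigonometr"))  # truncated generation
print(nearest_label("zzz"))                       # nothing similar
```

The matching is case-sensitive and purely character-based, so it only rescues truncated or slightly misspelled names; a generation that names a different topic altogether still needs manual inspection.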

**Public Test Score:** 0.8239

# Install Packages

In [1]:
!pip install evaluate -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.12.0 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cublas-cu12==12.4.5.8; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cublas-cu12 12.8.4.1 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cudnn-cu12==9.1.0.70; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cudnn-cu12 9.3.0.75 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cufft-cu12==11.2.1.3; platform_system == "Linux" and platform_machine == "x86

# Import Libraries

In [2]:
import os
import torch
import evaluate
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

2025-04-28 04:13:59.410704: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745813639.628108      19 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1745813639.690213      19 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Setup Config

In [3]:
MODEL_CHECKPOINT = "t5-base"
MAX_INPUT_LENGTH = 512
MAX_TARGET_LENGTH = 32
BATCH_SIZE_PER_DEVICE = 8
NUM_TRAIN_EPOCHS = 10
LEARNING_RATE = 5e-5
WEIGHT_DECAY = 0.01
OUTPUT_DIR = "./t5-math-classifier"
TRAIN_FILE = "https://raw.githubusercontent.com/PrudhvirajuChekuri/Final-Project-Group8/refs/heads/master/code/data/train.csv"
TEST_FILE = "https://raw.githubusercontent.com/PrudhvirajuChekuri/Final-Project-Group8/refs/heads/master/code/data/test.csv"
SUBMISSION_FILE = "submission.csv"

id2label = {
    0: "Algebra",
    1: "Geometry and Trigonometry",
    2: "Calculus and Analysis",
    3: "Probability and Statistics",
    4: "Number Theory",
    5: "Combinatorics and Discrete Math",
    6: "Linear Algebra",
    7: "Abstract Algebra and Topology"
}
label2id = {v: k for k, v in id2label.items()}
NUM_LABELS = len(id2label)

os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Using model: {MODEL_CHECKPOINT}")
print(f"Number of labels: {NUM_LABELS}")
print(f"Labels: {id2label}")

Using model: t5-base
Number of labels: 8
Labels: {0: 'Algebra', 1: 'Geometry and Trigonometry', 2: 'Calculus and Analysis', 3: 'Probability and Statistics', 4: 'Number Theory', 5: 'Combinatorics and Discrete Math', 6: 'Linear Algebra', 7: 'Abstract Algebra and Topology'}


# Load Data

In [4]:
train_df = pd.read_csv(TRAIN_FILE)
test_df = pd.read_csv(TEST_FILE)
display(train_df.head())
print(f"Train data shape: {train_df.shape}")
display(test_df.head())
print(f"Test data shape: {test_df.shape}")

| | Question | label |
|---|---|---|
| 0 | A solitaire game is played as follows. Six di... | 3 |
| 1 | 2. The school table tennis championship was he... | 5 |
| 2 | Given that $x, y,$ and $z$ are real numbers th... | 0 |
| 3 | $25 \cdot 22$ Given three distinct points $P\l... | 1 |
| 4 | I am thinking of a five-digit number composed ... | 5 |


Train data shape: (10189, 2)


| | id | Question |
|---|---|---|
| 0 | 0 | `b'Solve 0 = -i - 91*i - 1598*i - 64220 for i.\n'` |
| 1 | 1 | Galperin G.A.\n\nA natural number $N$ is 999..... |
| 2 | 2 | Example 7 Calculate $\frac{1}{2 \sqrt{1}+\sqrt... |
| 3 | 3 | If $A$, $B$, and $C$ represent three distinct ... |
| 4 | 4 | 2. Calculate $1+12+123+1234+12345+123456+12345... |


Test data shape: (3044, 2)
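
The first test row above is stored as the text of a Python bytes literal (`b'...\n'`) rather than as plain text, so the model would see the `b'` wrapper and the escaped newline as part of the question. A minimal cleaning sketch, assuming such rows always follow the literal syntax exactly; the helper name `strip_bytes_repr` is hypothetical and the notebook itself applies no such cleaning:

```python
import ast


def strip_bytes_repr(text: str) -> str:
    """Undo a stray bytes-literal wrapper (text like b'...') around a question."""
    looks_like_bytes = (
        len(text) >= 3
        and text[0] == "b"
        and text[1] in "'\""
        and text[-1] == text[1]
    )
    if looks_like_bytes:
        try:
            value = ast.literal_eval(text)  # safely parses literals only
        except (ValueError, SyntaxError):
            return text  # not a valid literal after all; leave untouched
        if isinstance(value, bytes):
            return value.decode("utf-8", errors="replace").strip()
    # Ordinary questions only get surrounding whitespace trimmed.
    return text.strip()


raw = "b'Solve 0 = -i - 91*i - 1598*i - 64220 for i.\\n'"
print(strip_bytes_repr(raw))
print(strip_bytes_repr("2. Calculate $1+12+123$"))
```

If this were adopted, it would be applied to the `Question` column of both the train and test frames (for example with `Series.map`) so that training and inference see the same text format.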


# Split and prepare data

In [5]:
train_df['label_name'] = train_df['label'].map(id2label)
print("\nTrain data with label names:")
print(train_df.head())

train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=42, stratify=train_df['label'])
print(f"\nTrain split shape: {train_df.shape}")
print(f"Validation split shape: {val_df.shape}")

train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

raw_datasets = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

print("\nDatasetDict created:")
print(raw_datasets)


Train data with label names:
                                            Question  label  \
0  A solitaire game is played as follows.  Six di...      3   
1  2. The school table tennis championship was he...      5   
2  Given that $x, y,$ and $z$ are real numbers th...      0   
3  $25 \cdot 22$ Given three distinct points $P\l...      1   
4  I am thinking of a five-digit number composed ...      5   

                        label_name  
0       Probability and Statistics  
1  Combinatorics and Discrete Math  
2                          Algebra  
3        Geometry and Trigonometry  
4  Combinatorics and Discrete Math  

Train split shape: (9170, 3)
Validation split shape: (1019, 3)

DatasetDict created:
DatasetDict({
    train: Dataset({
        features: ['Question', 'label', 'label_name', '__index_level_0__'],
        num_rows: 9170
    })
    validation: Dataset({
        features: ['Question', 'label', 'label_name', '__index_level_0__'],
        num_rows: 1019
    })
    test: 
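
A stratified split should leave each class with nearly the same share of rows in both parts. The following sketch shows one way to check that with the standard library only; `class_shares` and `max_share_gap` are hypothetical helpers and the label lists are toy data, not the competition data:

```python
from collections import Counter


def class_shares(labels):
    """Return {label: fraction of rows} for a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def max_share_gap(reference_labels, split_labels):
    """Largest absolute difference in class share between two label lists."""
    reference = class_shares(reference_labels)
    split = class_shares(split_labels)
    return max(abs(share - split.get(label, 0.0)) for label, share in reference.items())


# Toy labels: a 60/30/10 class mix, a split that preserves it, and one that does not.
full = [0] * 60 + [1] * 30 + [2] * 10
stratified = [0] * 6 + [1] * 3 + [2] * 1
skewed = [0] * 10

print(max_share_gap(full, stratified))
print(max_share_gap(full, skewed))
```

With the notebook's frames, the same check could be run as `max_share_gap(train_df["label"].tolist(), val_df["label"].tolist())`; a gap close to zero indicates the stratification worked as intended.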

# Preprocess and Tokenize data

In [6]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

prefix = "Classify this math problem: "

def preprocess_function(examples):
    """Preprocesses the data for T5: adds prefix, tokenizes inputs and labels."""
    inputs = [prefix + doc for doc in examples["Question"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True, padding=False) # Padding handled by DataCollator

    if "label_name" in examples:
        labels = tokenizer(text_target=examples["label_name"], max_length=MAX_TARGET_LENGTH, truncation=True, padding=False)
        model_inputs["labels"] = labels["input_ids"]

    return model_inputs


train_val_cols_to_remove = raw_datasets["train"].column_names
test_cols_to_remove = raw_datasets["test"].column_names

tokenized_train = raw_datasets['train'].map(
    preprocess_function,
    batched=True,
    remove_columns=train_val_cols_to_remove
)
tokenized_val = raw_datasets['validation'].map(
    preprocess_function,
    batched=True,
    remove_columns=train_val_cols_to_remove
)

tokenized_test = raw_datasets['test'].map(
    preprocess_function,
    batched=True,
    remove_columns=test_cols_to_remove
)


tokenized_datasets = DatasetDict({
    'train': tokenized_train,
    'validation': tokenized_val,
    'test': tokenized_test
})


print("\nTokenized datasets:")
print(tokenized_datasets)
print("\nExample tokenized input (train):")
print(tokenized_datasets['train'][0]['input_ids'])
print("\nDecoded example tokenized input (train):")
print(tokenizer.decode(tokenized_datasets['train'][0]['input_ids']))
print("\nExample tokenized label (train):")
print(tokenized_datasets['train'][0]['labels'])
print("\nDecoded example tokenized label (train):")
print(tokenizer.decode(tokenized_datasets['train'][0]['labels']))
print("\nExample tokenized input (test):")
print(tokenized_datasets['test'][0]['input_ids'])
print("\nDecoded example tokenized input (test):")
print(tokenizer.decode(tokenized_datasets['test'][0]['input_ids']))
print("\nColumns in tokenized test set:")
print(tokenized_datasets['test'].column_names)

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Map:   0%|          | 0/9170 [00:00<?, ? examples/s]

Map:   0%|          | 0/1019 [00:00<?, ? examples/s]

Map:   0%|          | 0/3044 [00:00<?, ? examples/s]


Tokenized datasets:
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 9170
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1019
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 3044
    })
})

Example tokenized input (train):
[4501, 4921, 48, 7270, 682, 10, 3, 31305, 24, 1514, 3229, 2, 346, 122, 77, 2, 10116, 63, 2, 1741, 2, 75, 1741, 2, 117, 2, 75, 1741, 2, 75, 1741, 2, 75, 2, 3, 184, 184, 3, 184, 345, 834, 115, 3, 2, 3, 184, 3, 2, 715, 7, 3, 184, 3, 184, 276, 834, 115, 3, 2, 3, 2, 75, 747, 2, 21432, 2, 3, 184, 3, 184, 1593, 3, 184, 276, 834, 115, 6, 3, 2, 3, 2, 989, 2, 10116, 63, 2, 1514, 3229, 8352, 1514, 345, 3229, 6, 1514, 2247, 3229, 6, 11, 1514, 115, 3229, 4221, 386, 6746, 3, 9206, 7, 209, 7141, 5, 156, 1514, 2247, 2423, 2, 9880, 2, 345, 2, 357, 2, 3229, 6, 11, 1514, 345, 3229, 19, 192, 705, 145, 1514, 115
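
The ID `2` recurs throughout the token list above. In the standard T5 SentencePiece vocabulary, ID 2 is normally the unknown token (`<unk>`), which would mean the tokenizer has no piece for characters such as `\`, `{`, and `}` that LaTeX markup depends on. That reading is an assumption worth confirming with `tokenizer.unk_token_id`. A small sketch for measuring how much of an input is lost this way; the `unk_fraction` helper is hypothetical:

```python
def unk_fraction(input_ids, unk_id=2):
    """Fraction of token IDs equal to unk_id (assumed here to be T5's <unk>)."""
    if not input_ids:
        return 0.0
    return sum(1 for tok in input_ids if tok == unk_id) / len(input_ids)


# First 19 IDs of the tokenized training example printed above.
sample = [4501, 4921, 48, 7270, 682, 10, 3, 31305, 24, 1514,
          3229, 2, 346, 122, 77, 2, 10116, 63, 2]
print(f"{unk_fraction(sample):.3f}")  # 3 of 19 IDs -> 0.158
```

Averaging this fraction over the tokenized training set would indicate how much of the LaTeX markup the model never gets to see, which is relevant to the text-cleaning idea listed under next steps.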

# Load and Fine-Tune T5-base model

In [7]:
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    padding=True
)
print("\nData collator initialized.")

accuracy_metric = evaluate.load("accuracy")
print(f"Using label2id mapping in metrics: {label2id}")

def postprocess_text(preds, labels):
    """ Helper function to clean up generated text """
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    """Computes accuracy score from model predictions."""
    preds, labels = eval_preds

    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    pred_ids = []
    label_ids = []
    unknown_preds_count = 0
    for pred_name, label_name in zip(decoded_preds, decoded_labels):
        pred_id = label2id.get(pred_name, -1)
        if pred_id == -1:
            unknown_preds_count += 1

        label_id = label2id.get(label_name, -2)
        if label_id == -2:
            print(f"Error: Could not map true label '{label_name}' to ID!")

        pred_ids.append(pred_id)
        label_ids.append(label_id)

    if unknown_preds_count > 0:
        print(f"Warning: Encountered {unknown_preds_count} predictions during evaluation that did not match known label names.")

    acc_result = accuracy_metric.compute(predictions=pred_ids, references=label_ids)

    return {"accuracy": acc_result["accuracy"]}

training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE_PER_DEVICE,
    per_device_eval_batch_size=BATCH_SIZE_PER_DEVICE * 2,
    weight_decay=WEIGHT_DECAY,
    num_train_epochs=NUM_TRAIN_EPOCHS,
    predict_with_generate=True,
    fp16=True,
    logging_dir=f"{OUTPUT_DIR}/logs",
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    report_to="none",
    save_total_limit=3
)

print("\nTraining arguments configured.")

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("\nSeq2SeqTrainer initialized.")

print("\nStarting training...")
train_result = trainer.train()
print("Training finished.")

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]


Data collator initialized.


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Using label2id mapping in metrics: {'Algebra': 0, 'Geometry and Trigonometry': 1, 'Calculus and Analysis': 2, 'Probability and Statistics': 3, 'Number Theory': 4, 'Combinatorics and Discrete Math': 5, 'Linear Algebra': 6, 'Abstract Algebra and Topology': 7}

Training arguments configured.


  trainer = Seq2SeqTrainer(



Seq2SeqTrainer initialized.

Starting training...


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.1315 | 0.108545 | 0.779195 |
| 2 | 0.1116 | 0.09607 | 0.809617 |
| 3 | 0.0884 | 0.095894 | 0.822375 |
| 4 | 0.0773 | 0.090043 | 0.831207 |
| 5 | 0.0584 | 0.103286 | 0.825319 |
| 6 | 0.0495 | 0.113914 | 0.827282 |
| 7 | 0.0503 | 0.126213 | 0.829244 |
| 8 | 0.0299 | 0.134487 | 0.825319 |
| 9 | 0.0282 | 0.147056 | 0.821394 |
| 10 | 0.0228 | 0.149453 | 0.823356 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


Training finished.
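
In the training log above, validation accuracy peaks at epoch 4 while validation loss climbs afterwards, so the later epochs add training time without improving the selected checkpoint. The sketch below replays the logged accuracies to show where a patience-based early stop would have ended the run. It is a pure-Python simulation, not how the notebook was trained; `transformers` does provide an `EarlyStoppingCallback` for the real thing, and the patience value of 3 is an assumption:

```python
# Validation accuracy per epoch, copied from the training log above.
val_accuracy = [0.779195, 0.809617, 0.822375, 0.831207, 0.825319,
                0.827282, 0.829244, 0.825319, 0.821394, 0.823356]


def best_epoch(scores):
    """1-based epoch with the highest score (the earliest wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i]) + 1


def early_stop_epoch(scores, patience):
    """1-based epoch at which training stops after `patience` epochs without a new best."""
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(scores)  # patience never exhausted; training runs to the end


print(best_epoch(val_accuracy))           # 4
print(early_stop_epoch(val_accuracy, 3))  # 7
```

Under these assumptions the run would have stopped after epoch 7 and still returned the epoch 4 checkpoint, since `load_best_model_at_end=True` is already set.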


# Save Model

In [8]:
trainer.save_model()
trainer.log_metrics("train", train_result.metrics)
trainer.save_metrics("train", train_result.metrics)
trainer.save_state()
print(f"Model saved to {OUTPUT_DIR}")

print("\nEvaluating the best model on the validation set...")
eval_metrics = trainer.evaluate()
trainer.log_metrics("eval", eval_metrics)
trainer.save_metrics("eval", eval_metrics)
print(f"Validation Metrics: {eval_metrics}")

***** train metrics *****
  epoch                    =       10.0
  total_flos               = 27196588GF
  train_loss               =     0.0787
  train_runtime            = 1:21:22.84
  train_samples_per_second =      18.78
  train_steps_per_second   =      1.176
Model saved to /kaggle/working/t5-math-classifier

Evaluating the best model on the validation set...




***** eval metrics *****
  epoch                   =       10.0
  eval_accuracy           =     0.8312
  eval_loss               =       0.09
  eval_runtime            = 0:00:48.17
  eval_samples_per_second =     21.154
  eval_steps_per_second   =      0.664
Validation Metrics: {'eval_loss': 0.0900426059961319, 'eval_accuracy': 0.831207065750736, 'eval_runtime': 48.1705, 'eval_samples_per_second': 21.154, 'eval_steps_per_second': 0.664, 'epoch': 10.0}
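
A single accuracy figure does not show which categories the model confuses. As a possible follow-up, the sketch below breaks accuracy down by true label using only the standard library. The `per_class_accuracy` helper is hypothetical, and the toy IDs are invented for illustration; they are not validation results:

```python
from collections import Counter


def per_class_accuracy(pred_ids, label_ids):
    """Return {label_id: accuracy} computed over the rows that carry that true label."""
    totals = Counter(label_ids)
    correct = Counter(label for pred, label in zip(pred_ids, label_ids) if pred == label)
    return {label: correct[label] / total for label, total in sorted(totals.items())}


# Toy data: -1 marks a generated string that matched no known label name,
# mirroring the convention used in compute_metrics.
labels = [0, 0, 4, 4, 4, 1]
preds = [0, 4, 4, 4, -1, 1]
print(per_class_accuracy(preds, labels))
```

Feeding it the `pred_ids` and `label_ids` lists built inside `compute_metrics` would give the per-category breakdown for the validation set.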


# Delete model

In [9]:
import torch
import gc

print("\nCleaning up training objects...")

del model
del trainer

gc.collect()
torch.cuda.empty_cache()
print("Training objects deleted and CUDA cache cleared.")


Cleaning up training objects...
Training objects deleted and CUDA cache cleared.


# Make predictions on Test set

In [10]:
print(f"\nLoading fine-tuned model and tokenizer from {OUTPUT_DIR}...")
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)

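# Use GPU 0 when available; -1 tells the pipeline to run on CPU.
device = 0 if torch.cuda.is_available() else -1
model = AutoModelForSeq2SeqLM.from_pretrained(OUTPUT_DIR).to(f"cuda:{device}" if device >= 0 else "cpu")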
model.eval()

print("Model and tokenizer reloaded successfully.")

classifier_pipeline = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device
)

print("\nPredicting on the test set using pipeline...")

test_questions = test_df['Question'].tolist()
prefixed_test_questions = [prefix + q for q in test_questions]

pipeline_batch_size = BATCH_SIZE_PER_DEVICE * 8
raw_predictions = []
for i in tqdm(range(0, len(prefixed_test_questions), pipeline_batch_size)):
    batch = prefixed_test_questions[i:i + pipeline_batch_size]
    raw_predictions.extend(classifier_pipeline(batch, max_length=MAX_TARGET_LENGTH, clean_up_tokenization_spaces=True))

predicted_label_names = [pred['generated_text'].strip() for pred in raw_predictions]

print(f"\nNumber of predictions: {len(predicted_label_names)}")
print(predicted_label_names[:10])


Loading fine-tuned model and tokenizer from /kaggle/working/t5-math-classifier...


Device set to use cuda:0


Model and tokenizer reloaded successfully.

Predicting on the test set using pipeline...


  0%|          | 0/48 [00:00<?, ?it/s]

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



Number of predictions: 3044
['Algebra', 'Number Theory', 'Algebra', 'Number Theory', 'Number Theory', 'Geometry and Trigonometry', 'Number Theory', 'Geometry and Trigonometry', 'Algebra', 'Algebra']
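
The progress bar total of 48 follows directly from the slicing loop: 3044 questions in batches of 64 give 47 full batches plus one partial batch. A small standalone sketch of the same batching pattern; the `iter_batches` helper is hypothetical, and integers stand in for the question strings:

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


n_questions = 3044   # test set size reported earlier in the notebook
batch_size = 8 * 8   # BATCH_SIZE_PER_DEVICE * 8
batches = list(iter_batches(list(range(n_questions)), batch_size))

# Number of batches, the closed-form equivalent, and the size of the last batch.
print(len(batches), math.ceil(n_questions / batch_size), len(batches[-1]))  # 48 48 36
```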


# Submit to Competition

In [11]:
cleaned_preds = predicted_label_names[:]

predicted_labels = []
unknown_count = 0
for pred_name in cleaned_preds:
    if pred_name in label2id:
        predicted_labels.append(label2id[pred_name])
    else:
        predicted_labels.append(0)
        unknown_count += 1
        print(f"Warning: Generated unknown label name '{pred_name}'. Assigned default 0.")

if unknown_count > 0:
    print(f"Total unknown labels generated: {unknown_count}")

submission_df = pd.DataFrame({
    'id': test_df['id'],
    'label': predicted_labels
})

print("\nSubmission DataFrame head:")
print(submission_df.head())

submission_df.to_csv(SUBMISSION_FILE, index=False)
print(f"\nSubmission file saved to {SUBMISSION_FILE}")


Submission DataFrame head:
   id  label
0   0      0
1   1      4
2   2      0
3   3      4
4   4      4

Submission file saved to /kaggle/working/submission.csv
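
Before uploading, it can be worth checking that the file has the expected number of rows, unique IDs, and labels within the eight known classes. The sketch below does this with the standard `csv` module. The `validate_submission` helper is hypothetical, it assumes the file has exactly the `id` and `label` columns written above, and the inline CSV strings are toy examples:

```python
import csv
import io


def validate_submission(csv_text, expected_rows, num_labels=8):
    """Return a list of problems found in a submission CSV given as text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    ids = [row["id"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    valid_labels = {str(i) for i in range(num_labels)}
    bad = [row for row in rows if row["label"] not in valid_labels]
    if bad:
        problems.append(f"{len(bad)} labels outside 0..{num_labels - 1}")
    return problems


print(validate_submission("id,label\n0,0\n1,4\n", expected_rows=2))
print(validate_submission("id,label\n0,0\n0,9\n", expected_rows=2))
```

For the real file, the text could be read with `open(SUBMISSION_FILE).read()` and checked against `expected_rows=len(test_df)`.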
