# Notebook Description

In this notebook I fine-tuned a lightweight BERT variant (distilbert-base-uncased) on the math problem classification data from
[this](https://www.kaggle.com/competitions/classification-of-math-problems-by-kasut-academy/overview) Kaggle competition.

What this notebook contains:
- A custom text-cleaning function (clean_math_text_final) that uses regex to preprocess the question text: it removes URLs, hashtags, emojis, multiple-choice answer options, and LaTeX markup, then normalizes whitespace and lowercases the result.
- Conversion of the pandas DataFrame into a Hugging Face Dataset object.
- Stratified splitting of the data into training, validation, and testing sets (80%/10%/10% split).
- Tokenization of the cleaned questions using the distilbert-base-uncased tokenizer, with padding and truncation.
- Loading the distilbert-base-uncased model for sequence classification with 8 labels.
- Setting up TrainingArguments and the Trainer from the Hugging Face library.
- Fine-tuning the model for 3 epochs, saving the best model based on validation F1-micro score.
- Evaluating the final model on the test set and reporting metrics (Accuracy, F1 scores, Loss).
- Saving the trained model, training state, and metrics.

Next steps for this exploration:

- Experiment with different pre-trained transformer models (e.g., BERT, RoBERTa, specialized models like MathBERT if available).
- Perform more extensive hyperparameter tuning (e.g., learning rate, batch size, number of epochs, weight decay) potentially using tools like Optuna or Ray Tune.
- Investigate more sophisticated text preprocessing or feature engineering techniques tailored to mathematical text.

# Install Packages

In [1]:
!pip install evaluate -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.12.0 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cublas-cu12==12.4.5.8; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cublas-cu12 12.8.4.1 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cudnn-cu12==9.1.0.70; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cudnn-cu12 9.3.0.75 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cufft-cu12==11.2.1.3; platform_system == "Linux" and platform_machine == "x86

# Import Libraries

In [None]:
import evaluate
import pandas as pd
import numpy as np
import torch
import re
from datasets import Dataset, DatasetDict, ClassLabel
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)


# Set Device

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Config

In [None]:
MODEL_NAME = "distilbert-base-uncased"
MAX_LENGTH = 512
NUM_LABELS = 8
TEST_SIZE = 0.1
VALID_SIZE = 0.1
LEARNING_RATE = 2e-5
BATCH_SIZE = 64
EPOCHS = 3
OUTPUT_DIR = "./math_classifier_results"
LOGGING_DIR = "./math_classifier_logs"

# Load Data

In [None]:
df = pd.read_csv("/kaggle/input/classification-of-math-problems-by-kasut-academy/train.csv")

print(f"Loaded DataFrame shape: {df.shape}")
print("Label distribution:\n", df['label'].value_counts())
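
The label counts this cell prints (see the output at the end of the notebook) are quite skewed. As a rough sketch, the snippet below takes those counts, copied by hand into a hypothetical `label_counts` dictionary that the notebook itself does not define, and computes each class's share along with the accuracy of a trivial majority-class baseline.

```python
# Label counts copied by hand from the notebook's printed value_counts() output.
label_counts = {0: 2618, 1: 2439, 5: 1827, 4: 1712, 2: 1039, 3: 368, 6: 100, 7: 86}

total = sum(label_counts.values())

# Share of each class in the full training file.
class_share = {label: count / total for label, count in label_counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent label.
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total

for label in sorted(class_share):
    print(f"label {label}: {label_counts[label]:5d} examples ({class_share[label]:.1%})")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Labels 6 and 7 together make up less than 2% of the rows, which is one reason the later split is stratified by label.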

# Data Preprocessing

In [None]:
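def clean_math_text_final(text):
    """Normalize a raw math question into lowercase plain text."""
    text = str(text)
    # Drop a leading enumeration such as "12. "
    text = re.sub(r'^\s*\d+\.\s*', '', text)
    # Remove URLs and hashtags
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    text = re.sub(r'#\w+', ' ', text)
    # Remove emojis and pictographs
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F" u"\U0001F300-\U0001F5FF" u"\U0001F680-\U0001F6FF"
                           u"\U0001F1E0-\U0001F1FF" u"\U00002702-\U000027B0" u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)
    # Cut multiple-choice options: everything from the first "\text{(A)}"
    # or from the first new line that starts with "(A)" onward
    text = re.sub(r'\\text\{\([A-Z]\)\}.*$', '', text, flags=re.MULTILINE | re.DOTALL)
    text = re.sub(r'\n\s*\([A-Z]\).*$', '', text, flags=re.MULTILINE | re.DOTALL)
    # Strip LaTeX: math delimiters, commands such as \frac, and braces (contents kept)
    text = text.replace('$', '')
    text = re.sub(r'\\[a-zA-Z]+', ' ', text)
    text = re.sub(r'\{([^}]*)\}', r' \1 ', text)
    # Keep only letters, digits, whitespace, and . ? !
    text = re.sub(r"[^a-zA-Z0-9\s\.\?\!]", " ", text)
    # Separate digits from adjacent letters ("3x" -> "3 x", "x2" -> "x 2")
    text = re.sub(r'(\d+)([a-zA-Z])', r'\1 \2', text)
    text = re.sub(r'([a-zA-Z])(\d+)', r'\1 \2', text)
    # Collapse whitespace, trim, and lowercase
    text = re.sub(r'\s+', ' ', text).strip().lower()
    return text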

print("\n--- Applying Text Cleaning ---")
df['cleaned_question'] = df['Question'].apply(clean_math_text_final)
print("Cleaning applied.")
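
To see what the LaTeX-related steps do to a formula, the sketch below copies just those steps into a hypothetical helper, `strip_latex_steps` (not part of the notebook), and runs it on a made-up question rather than a row from the dataset.

```python
import re

def strip_latex_steps(text):
    """Reduced copy of the LaTeX-related steps of clean_math_text_final (illustration only)."""
    text = text.replace('$', '')                        # drop math delimiters
    text = re.sub(r'\\[a-zA-Z]+', ' ', text)            # drop commands such as \frac
    text = re.sub(r'\{([^}]*)\}', r' \1 ', text)        # unwrap braces, keep contents
    text = re.sub(r"[^a-zA-Z0-9\s\.\?\!]", " ", text)   # drop remaining symbols
    text = re.sub(r'(\d+)([a-zA-Z])', r'\1 \2', text)   # "3x" -> "3 x"
    text = re.sub(r'([a-zA-Z])(\d+)', r'\1 \2', text)   # "x2" -> "x 2"
    return re.sub(r'\s+', ' ', text).strip().lower()

sample = r"Solve $\frac{3x}{2} = 9$ for x."   # made-up example, not from the dataset
cleaned = strip_latex_steps(sample)
print(cleaned)   # solve 3 x 2 9 for x.
```

The fraction and the equals sign disappear, so the model sees the operands but not the operators. That loss of structure is what the preprocessing item under next steps would need to address.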

# Prepare Data for Training

In [None]:
print("\n--- Creating Dataset & Casting Label Type ---")
dataset = Dataset.from_pandas(df)

class_label_feature = ClassLabel(num_classes=NUM_LABELS)
dataset = dataset.cast_column('label', class_label_feature)

print("Dataset features after casting 'label' column:")
print(dataset.features)

print("\n--- Splitting Dataset ---")
train_test_split = dataset.train_test_split(test_size=TEST_SIZE, stratify_by_column='label')
train_valid_split = train_test_split['train'].train_test_split(test_size=VALID_SIZE / (1 - TEST_SIZE), stratify_by_column='label')

final_datasets = DatasetDict({
    'train': train_valid_split['train'],
    'validation': train_valid_split['test'],
    'test': train_test_split['test']
})
print("Dataset splits created:")
print(final_datasets)
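
The second split uses `VALID_SIZE / (1 - TEST_SIZE)` because it only sees the 90% of rows left after the test set is removed. The arithmetic can be sketched as follows, assuming the held-out part of each split is rounded up (the scikit-learn convention; treat that rounding rule as an assumption about `train_test_split`). `N_ROWS` is a hypothetical constant taken from the printed DataFrame shape.

```python
import math

N_ROWS = 10189      # rows in train.csv, from the printed DataFrame shape
TEST_SIZE = 0.1
VALID_SIZE = 0.1

# Assumption: the held-out part of each split is rounded up.
n_test = math.ceil(N_ROWS * TEST_SIZE)
n_remaining = N_ROWS - n_test

# The second split sees only the remaining 90%, so VALID_SIZE is rescaled
# to 0.1 / 0.9 (about 0.111) to end up with 10% of the original rows.
valid_fraction = VALID_SIZE / (1 - TEST_SIZE)
n_valid = math.ceil(n_remaining * valid_fraction)
n_train = n_remaining - n_valid

print(n_train, n_valid, n_test)   # 8151 1019 1019
```

These sizes agree with the `num_rows` values the notebook prints for the three splits.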

# Tokenization

In [None]:
print("\n--- Tokenization ---")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    # Pad every question to MAX_LENGTH tokens and truncate anything longer
    return tokenizer(examples["cleaned_question"],
                     padding="max_length",
                     truncation=True,
                     max_length=MAX_LENGTH)

tokenized_datasets = final_datasets.map(tokenize_function, batched=True)


tokenized_datasets = tokenized_datasets.remove_columns(["Question", "cleaned_question"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
print("Tokenization complete.")
print("Example train sample:", tokenized_datasets["train"][0])
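
What `padding="max_length"` and `truncation=True` do to a sequence of token ids can be illustrated with a toy function. `pad_or_truncate` is a hypothetical stand-in written for this sketch, not the tokenizer's real implementation, and it simplifies truncation as noted in the comment.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for padding="max_length" with truncation=True."""
    # Simplification: a real tokenizer keeps the closing [SEP] token when truncating.
    kept = list(token_ids[:max_length])
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }

short = pad_or_truncate([101, 2424, 1996, 102], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)
print(short)
print(long)
```

With `MAX_LENGTH = 512`, short questions such as the printed train sample consist mostly of padding zeros. That is correct, but it costs compute on every batch.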

# Define Evaluation Metrics

In [None]:
print("\n--- Setting up Metrics ---")
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1_weighted = f1_metric.compute(predictions=predictions, references=labels, average="weighted")["f1"]
    f1_micro = f1_metric.compute(predictions=predictions, references=labels, average="micro")["f1"]
    return {"accuracy": accuracy, "f1_weighted": f1_weighted, "f1_micro": f1_micro}
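
In a single-label multiclass problem, micro-averaged F1 equals accuracy, which is why the Accuracy and F1 Micro columns in the training log are identical. The sketch below checks this by hand on a small made-up example. The helper functions are written for illustration and are not the `evaluate` library's implementation.

```python
def accuracy(predictions, references):
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def micro_f1(predictions, references):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(predictions) | set(references)
    tp = fp = fn = 0
    for label in labels:
        for p, r in zip(predictions, references):
            if p == label and r == label:
                tp += 1          # correct prediction of this class
            elif p == label:
                fp += 1          # predicted this class, but it was another
            elif r == label:
                fn += 1          # missed an example of this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels and predictions, two of eight predictions are wrong.
references = [0, 1, 2, 2, 1, 0, 3, 3]
predictions = [0, 2, 2, 2, 1, 0, 3, 1]

print(accuracy(predictions, references), micro_f1(predictions, references))
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision and recall both reduce to the fraction of correct predictions. Selecting the best checkpoint by `f1_micro` is therefore the same as selecting it by accuracy.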

# Load and Train the DistilBERT Model with Trainer

In [None]:
print("\n--- Loading Model ---")
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS
)

model.to(device)
print(f"Model '{MODEL_NAME}' loaded for {NUM_LABELS}-class classification.")


print("\n--- Defining Training Arguments ---")
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir=LOGGING_DIR,
    logging_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_micro",
    greater_is_better=True,
    push_to_hub=False,
    report_to="none",
)


print("\n--- Initializing Trainer ---")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


print("\n--- Starting Fine-Tuning ---")
train_result = trainer.train()


trainer.log_metrics("train", train_result.metrics)
trainer.save_metrics("train", train_result.metrics)
trainer.save_state()


trainer.save_model(OUTPUT_DIR + "/best_model")
print("Training finished. Best model saved.")


print("\n--- Evaluating on Test Set ---")
test_results = trainer.evaluate(eval_dataset=tokenized_datasets["test"])

print("Test Set Evaluation Results:")
print(test_results)
trainer.log_metrics("eval", test_results)
trainer.save_metrics("eval", test_results)

Using device: cuda
Loaded DataFrame shape: (10189, 2)
Label distribution:
 label
0    2618
1    2439
5    1827
4    1712
2    1039
3     368
6     100
7      86
Name: count, dtype: int64

--- Applying Text Cleaning ---
Cleaning applied.

--- Creating Dataset & Casting Label Type ---


Dataset features after casting 'label' column:
{'Question': Value(dtype='string', id=None), 'label': ClassLabel(names=['0', '1', '2', '3', '4', '5', '6', '7'], id=None), 'cleaned_question': Value(dtype='string', id=None)}

--- Splitting Dataset ---
Dataset splits created:
DatasetDict({
    train: Dataset({
        features: ['Question', 'label', 'cleaned_question'],
        num_rows: 8151
    })
    validation: Dataset({
        features: ['Question', 'label', 'cleaned_question'],
        num_rows: 1019
    })
    test: Dataset({
        features: ['Question', 'label', 'cleaned_question'],
        num_rows: 1019
    })
})

--- Tokenization ---


Tokenization complete.
Example train sample: {'labels': tensor(5), 'input_ids': tensor([  101,  2169, 19449,  1997,  1037,  3180, 26489, 19281,  7446,  2260,
         2175,  2078,  2003,  2000,  2022,  6910,  2593,  2417,  2030,  2630,
         1998,  2947,  2045,  2024,  1016,  2260,  2825, 22276,  2015,  1012,
         2424,  1996,  2193,  1997,  2122, 22276,  2015,  2007,  1996,  3200,
         2008,  2053,  2176, 18984,  6910,  1996,  2168,  3609,  2024,  1996,
         2176, 18984,  1997,  1037, 28667, 23395,  1012,   102,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0


--- Loading Model ---


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model 'distilbert-base-uncased' loaded for 8-class classification.

--- Defining Training Arguments ---

--- Initializing Trainer ---

--- Starting Fine-Tuning ---


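| Epoch | Training Loss | Validation Loss | Accuracy | F1 Weighted | F1 Micro |
|-------|---------------|-----------------|----------|-------------|----------|
| 1 | 1.509 | 1.075145 | 0.648675 | 0.610691 | 0.648675 |
| 2 | 0.9556 | 0.850776 | 0.726202 | 0.704554 | 0.726202 |
| 3 | 0.8065 | 0.793741 | 0.744848 | 0.729393 | 0.744848 |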




***** train metrics *****
  epoch                    =        3.0
  total_flos               =  3017086GF
  train_loss               =     1.0904
  train_runtime            = 0:10:30.69
  train_samples_per_second =     38.771
  train_steps_per_second   =      0.304
Training finished. Best model saved.

--- Evaluating on Test Set ---




Test Set Evaluation Results:
{'eval_loss': 0.7611544728279114, 'eval_accuracy': 0.7566241413150148, 'eval_f1_weighted': 0.7439978650826935, 'eval_f1_micro': 0.7566241413150148, 'eval_runtime': 9.3559, 'eval_samples_per_second': 108.916, 'eval_steps_per_second': 0.855, 'epoch': 3.0}
***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.7566
  eval_f1_micro           =     0.7566
  eval_f1_weighted        =      0.744
  eval_loss               =     0.7612
  eval_runtime            = 0:00:09.35
  eval_samples_per_second =    108.916
  eval_steps_per_second   =      0.855
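
The throughput figures in the train metrics can be cross-checked against the batch size. The sketch below computes the expected number of optimizer steps for one and for two devices and compares them with the step count implied by the log. `total_optimizer_steps` is a hypothetical helper, and the device count is not stated anywhere in the notebook, so the two-device reading is an inference from the numbers only.

```python
import math

def total_optimizer_steps(n_examples, per_device_batch, n_devices, epochs):
    """Optimizer steps for a run without gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / (per_device_batch * n_devices))
    return steps_per_epoch * epochs

# Values from the notebook: 8151 training rows, BATCH_SIZE = 64, EPOCHS = 3.
for n_devices in (1, 2):   # the device count is NOT stated in the notebook
    print(n_devices, total_optimizer_steps(8151, 64, n_devices, 3))

# Steps implied by the logged throughput: steps per second times runtime in seconds.
runtime_seconds = 10 * 60 + 30.69          # train_runtime = 0:10:30.69
implied_steps = 0.304 * runtime_seconds    # train_steps_per_second = 0.304
print(round(implied_steps))
```

The implied count of about 192 steps matches the two-device case rather than the 384 steps a single device would need. If that reading is right, `per_device_train_batch_size=64` gave an effective batch size of 128 in this run.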
