# LM-based Multiclass Classification of Cultural Items
**Homework 1 - Multilingual Natural Language Processing**

*By Joshua Edwin & Clemens Kubach*

**Methodology**
In this project, we investigated which properties of a pretrained language model (LM) are advantageous for classifying cultural items.
We hypothesized that an LM pretrained on text from a variety of languages would carry more cultural knowledge, because language and culture are closely intertwined.
To test this hypothesis, we fine-tuned two pretrained LMs on our downstream task: "BERT" and "Multilingual BERT".
Our experiments suggest that the hypothesis may hold: "Multilingual BERT" performed better than "BERT" in our runs.
Note that this comparison is not representative; a rigorous evaluation would go beyond the scope of the project.

We use our fine-tuned "Multilingual BERT" model as the transformer-based model requested by the homework.

## Installs
Check that all required dependencies are installed as defined in `pyproject.toml`. Follow the `README.md` for more detailed instructions.

## Machine and Environment Setup

In [None]:
import pandas as pd
from datasets import DatasetDict, Dataset
import logging
import os


try:
    from google.colab import userdata  # type: ignore

    IN_COLAB = True
except ImportError:
    IN_COLAB = False

logger = logging.getLogger(__name__)

from pathlib import Path

REPO_ROOT = Path(str(os.path.abspath(''))).parent.parent
DATA_DIR = REPO_ROOT / "data"
LOG_DIR = REPO_ROOT / "logs"
WANDB_DIR = REPO_ROOT / "wandb"

os.environ["WANDB_PROJECT"] = "mnlp-h1-lm"
os.environ["WANDB_DIR"] = str(WANDB_DIR)

In [None]:
import wandb
wandb.login()

### Hugging Face Login and Loading Data
To access the dataset, three options are tried in the following fallback order:


1.   From HF via HF_TOKEN secret/envvar if set.
2.   From HF via inserting the HF token manually in the login dialog.
3.   From local `./train.csv` and `./valid.csv` files.

Afterwards, the Hugging Face dataset instance and the training and validation dataframes can be accessed via `hf_dataset`, `df_train` and `df_val`.
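
The fallback order described above can be sketched independently of Hugging Face. The snippet below is a minimal illustration, not the notebook's actual loader: `first_successful` and the three stand-in functions are hypothetical names, and the stand-ins only simulate the failures and the success.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_successful(
    loaders: Sequence[Callable[[], T]],
    recoverable: tuple[type[BaseException], ...] = (Exception,),
) -> T:
    """Return the result of the first loader that does not raise."""
    errors: list[BaseException] = []
    for loader in loaders:
        try:
            return loader()
        except recoverable as exc:
            # Remember the failure and move on to the next option.
            errors.append(exc)
    raise RuntimeError(f"All {len(loaders)} loaders failed: {errors!r}")


# Hypothetical stand-ins for the three options listed above.
def from_hub_with_token():
    raise PermissionError("no HF_TOKEN available")


def from_hub_with_manual_login():
    raise ConnectionError("login dialog was cancelled")


def from_local_csv():
    return {"train": "train.csv", "validation": "valid.csv"}


source = first_successful(
    [from_hub_with_token, from_hub_with_manual_login, from_local_csv]
)
print(source)
```

Collecting the individual errors means the final exception can report why every option failed, not just the last one.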

In [None]:
from os import environ

from datasets import load_dataset
from huggingface_hub import login
from datasets.exceptions import DatasetNotFoundError
from huggingface_hub.errors import HfHubHTTPError


def extract_dev_subsets_from_hf_dataset(
    ds: DatasetDict,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    _df_train = pd.DataFrame(ds["train"])  # Silver-labeled training set
    _df_validation = pd.DataFrame(ds["validation"])  # Gold-labeled dev set
    return _df_train, _df_validation


def read_hf_token() -> str | None:
    if IN_COLAB:
        try:
            return userdata.get("HF_TOKEN")
        except KeyError:
            return None
    else:
        return environ.get("HF_TOKEN", None)


def do_blocking_hf_login():
    # run the login in a separate cell because login is non-blocking
    try:
        token = read_hf_token()
        login(token=token)
        if token is None:
            # block until logged-in
            input("Press enter to finish login!")
    except (HfHubHTTPError, DatasetNotFoundError):
        print(
            "Login via HF_TOKEN secret/envvar and via manual login widget failed "
            "or not authorized."
        )


do_blocking_hf_login()

In [None]:
def load_train_val_data() -> tuple[DatasetDict, pd.DataFrame, pd.DataFrame]:
    try:
        _ds = load_dataset("sapienzanlp/nlp2025_hw1_cultural_dataset")
        _df_train, _df_val = extract_dev_subsets_from_hf_dataset(_ds)
        return _ds, _df_train, _df_val
    except (HfHubHTTPError, DatasetNotFoundError) as e:
        logger.error(
            f"Something went wrong during HF dataset access: {e}. "
            "Falling back to local files:"
        )
    try:
        train_df = pd.read_csv("train.csv")
        valid_df = pd.read_csv("valid.csv")

        train_dataset = Dataset.from_pandas(train_df)
        valid_dataset = Dataset.from_pandas(valid_df)

        # Create a DatasetDict
        dataset_dict = DatasetDict(
            {"train": train_dataset, "validation": valid_dataset}
        )
        return dataset_dict, train_df, valid_df
    except FileNotFoundError:
        raise FileNotFoundError(
            "Tried to access the dataset from Hugging Face "
            "(via HF_TOKEN secret/envvar and manual auth) and from the local disk "
            "(via train.csv and valid.csv in the cwd) without success."
        )


hf_dataset, df_train, df_val = load_train_val_data()

# Show samples
print("\nTrain Set:")
display(df_train.head())

print("\nValidation Set:")
display(df_val.head())

In [None]:
df_train["label"].unique()

In [None]:
import matplotlib.pyplot as plt

def plot_label_distribution(df: pd.DataFrame, set_name: str):
    df["label"].value_counts().plot(kind="bar")
    plt.title(f"Label Distribution in {set_name}")
    plt.xlabel("Label")
    plt.ylabel("Count")
    plt.xticks(rotation=45)
    plt.show()

plot_label_distribution(df_train, set_name="Training Set")
plot_label_distribution(df_val, set_name="Validation Set")

In [None]:
print("HF dataset instance keys:", list(hf_dataset.keys()))
print("Train columns:", list(df_train.columns))
print("Val columns:", list(df_val.columns))

## LM-based Approach

We fine-tune several pretrained language models that are based on the transformer architecture, using the Hugging Face Transformers library.

We will focus on the following models:
- BERT Base Cased
- BERT Base Multilingual Cased

Further experiments with other/larger models:
- XLM RoBERTa Large
- XLM RoBERTa Base

We adjusted the hyperparameters and training settings to mitigate overfitting and improve the performance of the model.

The models will be trained on the training set and evaluated on the validation set. The results will be logged to Weights & Biases (wandb) for further analysis.
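
One common way to limit overfitting when evaluating after every epoch is patience-based early stopping. The notebook itself does not use it; the sketch below only illustrates the idea in plain Python, with a hypothetical helper `best_epoch_with_patience` and made-up loss values. The Transformers library offers an `EarlyStoppingCallback` with an `early_stopping_patience` argument for the same purpose.

```python
def best_epoch_with_patience(
    eval_losses: list[float], patience: int = 3
) -> tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 0-indexed, for per-epoch eval losses."""
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Stop once the loss has not improved for `patience` epochs.
                return best_epoch, epoch
    return best_epoch, len(eval_losses) - 1


# Made-up validation losses, not results from this project.
losses = [0.95, 0.80, 0.72, 0.74, 0.73, 0.78, 0.90]
best, stopped = best_epoch_with_patience(losses, patience=3)
print(best, stopped)  # 2 5
```

In this toy run, training would stop after epoch 5 and the checkpoint from epoch 2 would be kept.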

### Pretrained Model Selection
Select the pretrained LM that should be fine-tuned on our downstream task.

In [None]:
MODEL_NAME = "google-bert/bert-base-cased"
# MODEL_NAME = "bert-base-multilingual-cased"

# Further experiments with other/larger models:
# MODEL_NAME = "FacebookAI/xlm-roberta-base"
# MODEL_NAME = "facebook/xlm-roberta-xl" # too large for gpu
# MODEL_NAME = "FacebookAI/xlm-roberta-large" # works with Adafactor and gradient checkpointing

Define our label mapping for internal numerical representation.

In [None]:
LABEL2ID = {
    "cultural agnostic": 0,
    "cultural representative": 1,
    "cultural exclusive": 2,
}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}
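
A quick sanity check for such a mapping is to verify that the ids are contiguous and that encoding followed by decoding returns the original labels. The snippet is a minimal sketch; `raw_labels` is a made-up example list.

```python
LABEL2ID = {
    "cultural agnostic": 0,
    "cultural representative": 1,
    "cultural exclusive": 2,
}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}

# Ids should be unique and contiguous from 0 so they can index the output logits.
assert sorted(ID2LABEL) == list(range(len(LABEL2ID)))

# Encoding and decoding must be inverse operations.
raw_labels = ["cultural exclusive", "cultural agnostic", "cultural exclusive"]
encoded = [LABEL2ID[label] for label in raw_labels]
decoded = [ID2LABEL[idx] for idx in encoded]
print(encoded)  # [2, 0, 2]
assert decoded == raw_labels
```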

In [None]:
from transformers import PreTrainedTokenizer, AutoTokenizer, DataCollatorWithPadding


class PreProcessor:

    def __init__(
            self,
            tokenizer_name: str,
            agg_in_fields: tuple[str, ...] = ("name", "description"),
    ):
        self.tokenizer_name = tokenizer_name
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

        self.agg_in_fields = agg_in_fields

    def __call__(self, samples):
        return self.preprocess_function(samples, self.tokenizer, self.agg_in_fields)

    @staticmethod
    def preprocess_function(
            samples,
            tokenizer: PreTrainedTokenizer,
            agg_in_fields: tuple[str, ...] = ("name", "description"),
    ):
        """Tokenize the input fields and map string labels to integer class ids."""
        # The tokenizer takes a text and an optional text pair, so at most two
        # columns can be passed positionally.
        if not 1 <= len(agg_in_fields) <= 2:
            raise ValueError("agg_in_fields must name one or two columns.")
        to_tokenize = [samples[col] for col in agg_in_fields]
        input_samples = tokenizer(*to_tokenize, truncation=True, padding=True)
        input_samples["labels"] = [LABEL2ID[label] for label in samples["label"]]
        return input_samples

preprocessor = PreProcessor(MODEL_NAME)
data_collator = DataCollatorWithPadding(tokenizer=preprocessor.tokenizer)
preprocessed_hf_dataset = hf_dataset.map(preprocessor, batched=True)
required_columns = preprocessor.tokenizer.model_input_names + ['labels']
preprocessed_hf_dataset.set_format(type='torch', columns=required_columns)

preprocessed_hf_dataset
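
The preprocessing above passes the selected columns positionally to the tokenizer, which takes a text and an optional text pair. To feed more than two columns, one option is to merge them into a single string per row first. The helper below is a hypothetical sketch of that idea (`join_fields` and the separator are assumptions, and the batch is made-up data), not part of the notebook's pipeline.

```python
def join_fields(
    samples: dict[str, list],
    fields: tuple[str, ...],
    sep: str = " | ",
) -> list[str]:
    """Merge several columns of a batched sample dict into one string per row."""
    columns = [samples[name] for name in fields]
    # zip(*columns) walks the batch row by row; empty or missing values are skipped.
    return [
        sep.join(str(value) for value in row if value not in (None, ""))
        for row in zip(*columns)
    ]


# Made-up batch in the column layout produced by `Dataset.map(..., batched=True)`.
batch = {
    "name": ["pizza", "chair"],
    "category": ["food", "furniture"],
    "description": ["Italian dish", None],
}
texts = join_fields(batch, ("name", "category", "description"))
print(texts)  # ['pizza | food | Italian dish', 'chair | furniture']
```

The resulting list could then be passed to the tokenizer as a single text argument.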


In [None]:
import evaluate
import numpy as np


class Evaluator:

    def __init__(self):
        self.accuracy = evaluate.load("accuracy")
        self.f1 = evaluate.load("f1")
        self.precision = evaluate.load("precision")
        self.recall = evaluate.load("recall")

    def __call__(self, eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        accuracy_results = self.accuracy.compute(predictions=predictions, references=labels)
        f1_results = self.f1.compute(predictions=predictions, references=labels, average="weighted")
        precision_results = self.precision.compute(predictions=predictions, references=labels, average="weighted")
        recall_results = self.recall.compute(predictions=predictions, references=labels, average="weighted")
        # combine results to one dict
        results = {**accuracy_results, **f1_results, **precision_results, **recall_results}
        return results


evaluator = Evaluator()
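
The `average="weighted"` setting used above computes the metric per class and averages the results weighted by each class's number of true samples. The pure-Python sketch below spells this out for F1; `weighted_f1` is a hypothetical helper and the labels are made up for illustration.

```python
from collections import Counter


def weighted_f1(references: list[int], predictions: list[int]) -> float:
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(references)  # number of true samples per class
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for r, p in zip(references, predictions) if r == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        # F1 = 2 * TP / (#true + #predicted); n_true >= 1 here, so no division by zero.
        f1 = 2 * tp / (n_true + n_pred)
        total += n_true * f1
    return total / len(references)


# Made-up labels, not results from this project.
refs = [0, 0, 0, 0, 1, 1, 2, 2]
preds = [0, 0, 0, 1, 1, 1, 2, 0]
score = weighted_f1(refs, preds)
print(round(score, 4))  # 0.7417
```

Because frequent classes dominate the weighted average, a macro average can be a useful complement when the label distribution is imbalanced.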

In [None]:
from collections import Counter
from transformers.integrations import WandbCallback



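def decode_predictions(model, predictions):
    """Map gold label ids and argmax predictions to their class names."""
    id2label = model.config.id2label
    labels = [id2label[i] for i in predictions.label_ids]
    # argmax over the logits yields the predicted class id per sample
    predicted_ids = predictions.predictions.argmax(axis=-1)
    prediction_text = [id2label[i] for i in predicted_ids]
    return {"labels": labels, "predictions": prediction_text}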


class CustomWandbCallback(WandbCallback):

    def __init__(self, trainer, tokenizer, sample_dataset: Dataset):
        super().__init__()
        self.trainer = trainer
        self.tokenizer = tokenizer
        self.sample_dataset = sample_dataset

    def on_evaluate(self, args, state, control, **kwargs):
        super().on_evaluate(args, state, control, **kwargs)
        predictions = self.trainer.predict(self.sample_dataset)

        predicted_class_ids = predictions.predictions.argmax(axis=-1)
        del predictions
        # Optionally decode to class names
        id2label = self.trainer.model.config.id2label
        predicted_labels = [id2label[i] for i in predicted_class_ids]

        # self._wandb.log({"eval/predicted_class_distribution": wandb.Histogram(predicted_class_ids)}, step=state.global_step)
        label_counts = Counter(predicted_labels)
        label_df = pd.DataFrame.from_dict(label_counts, orient='index', columns=['count']).reset_index()
        label_df.columns = ['label', 'count']

        self._wandb.log({
            "eval/predicted_class_distribution": wandb.plot.bar(
                wandb.Table(dataframe=label_df),
                "label", "count",
                title="Predicted Class Distribution"
            )
        }, step=state.global_step)
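
The callback above counts only labels that occur among the predictions, so a class that is never predicted is missing from the chart instead of appearing with a count of zero. The sketch below shows one way to keep all classes visible; `class_distribution` is a hypothetical helper and the predicted ids are made up.

```python
from collections import Counter


def class_distribution(
    predicted_ids: list[int], id2label: dict[int, str]
) -> dict[str, int]:
    """Count predictions per class name, including classes with zero predictions."""
    counts = Counter(predicted_ids)
    # A Counter returns 0 for missing keys, so unpredicted classes are kept.
    return {label: counts[idx] for idx, label in sorted(id2label.items())}


id2label = {
    0: "cultural agnostic",
    1: "cultural representative",
    2: "cultural exclusive",
}
dist = class_distribution([0, 0, 1, 0, 1], id2label)
print(dist)
# {'cultural agnostic': 3, 'cultural representative': 2, 'cultural exclusive': 0}
```

A zero bar makes it easier to spot a model that has collapsed onto a subset of the classes.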

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, label2id=LABEL2ID, id2label=ID2LABEL, classifier_dropout=0.4
)

In [None]:
from datetime import datetime

from transformers import Adafactor
from transformers.optimization import AdafactorSchedule

# Close any run left over from a previous execution of this cell.
wandb.finish()

timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
unique_run_name = f"{MODEL_NAME}-{timestamp}"

training_args = TrainingArguments(
    output_dir=str(LOG_DIR / MODEL_NAME),
    learning_rate=5e-6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=10,
    num_train_epochs=30,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="wandb",
    run_name=unique_run_name,
    # gradient_accumulation_steps=16,
    # gradient_checkpointing=True,
    # fp16=True,
)

# optimizer = Adafactor(
#     model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None
# )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=preprocessed_hf_dataset["train"],
    eval_dataset=preprocessed_hf_dataset["validation"],
    processing_class=preprocessor.tokenizer,
    data_collator=data_collator,
    compute_metrics=evaluator,
    # optimizers=(optimizer, AdafactorSchedule(optimizer)),
)
trainer.add_callback(
    CustomWandbCallback(trainer, preprocessor.tokenizer, preprocessed_hf_dataset["validation"])
)

trainer.train()
wandb.finish()
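
To get a feeling for the length of such a run, the number of optimizer steps follows from the batch size and epoch count in the training arguments. The calculation below is a rough sketch that assumes a single device and no gradient accumulation; the training set size `n_train` is a hypothetical placeholder, not the size of the actual dataset.

```python
import math


def total_optimizer_steps(n_train: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a single device without gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs


n_train = 6000  # hypothetical number of training samples
total = total_optimizer_steps(n_train, batch_size=32, epochs=30)
print(total)  # 5640
print(total // 10)  # 564 logging events with logging_steps=10
```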