# Fine-tuning DistilBERT Base Multilingual Cased for Identifying Authors of Latin Works

This notebook contains the code for fine-tuning the [DistilBERT Base Multilingual Cased model](https://huggingface.co/distilbert/distilbert-base-multilingual-cased) to identify authors of works in Latin. The purpose is to facilitate the processing of metadata records for items in the Digital Latin Library Catalog.

## Import the Required Packages

This notebook uses:

- `codecarbon` to track energy usage
- `datasets` from the HuggingFace API to manage the data
- `json` to format results
- `numpy` for formatting and managing data
- `os` to interact with the operating system
- `pandas` to manage data
- `random` to set a random seed for reproducibility
- `scikit-learn` for evaluation metrics
- `transformers` from the HuggingFace API to manage the model
- `torch` for general machine learning functionality

Note that a HuggingFace access token is required. For information, see <https://huggingface.co/docs/hub/security-tokens>.

In [None]:
from codecarbon import EmissionsTracker
from datasets import Dataset, ClassLabel, Features, Value
import json
import numpy as np
import os
import pandas as pd
import random
from sklearn.metrics import f1_score, accuracy_score
from transformers.data.data_collator import DataCollatorWithPadding
from transformers.models.distilbert import DistilBertForSequenceClassification, DistilBertTokenizerFast
from transformers.trainer_callback import EarlyStoppingCallback
from transformers.trainer import Trainer
from transformers.training_args import TrainingArguments
from transformers.trainer_utils import EvalPrediction, set_seed
import torch

## Make Sure the Directory Structure is in Place

To keep the information organized, I will make a directory for the model (`authors`) and training logs (`logs`). I will also designate the path for the training data.

In [6]:
# Establish file directories, creating them if they do not exist
output_dir = '../authors'  # Directory for the saved model
log_dir = '../logs'        # Directory for the training logs
os.makedirs(output_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

# Path to the training data
local_file_path = '../data/author_data.csv'

## Set a Random Seed for Reproducibility

Since so much of machine learning depends upon randomness, setting a "random seed" makes the randomness consistent across runs, so that the results should be reproducible by anyone running the code in this notebook. The `seed` variable will be used again when creating the dataset splits and when setting the parameters for the `Trainer`.

In [7]:
# Set the random seed for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
set_seed(seed)
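
As a small illustration of the idea, the following sketch uses only Python's standard `random` module rather than the full set of seeds above. It shows that seeding with the same value reproduces the same sequence of draws. The helper name `draw` is hypothetical.

```python
import random

def draw(seed: int, n: int = 5) -> list[int]:
    """Seed a private generator and return n pseudo-random integers."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.randint(0, 999) for _ in range(n)]

first = draw(42)
second = draw(42)
other = draw(43)

print(first == second)  # same seed, same sequence
print(first == other)   # a different seed will almost always differ
```

The same principle applies to NumPy, PyTorch, and the `Trainer`, each of which keeps its own generator and therefore needs its own seed.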

## Load the Data and Prepare it for Use with the Model

The data I gathered from the TLG and Perseus needs to be turned into tensors, the format expected by the model. The following operations load the original data, determine a maximum input length, convert the data into the HuggingFace dataset format, split the dataset into training, validation, and testing sets, and tokenize each split.

In [8]:
# Load the dataset
data = pd.read_csv(local_file_path, encoding='utf-8', quotechar='"')

### Determine the Correct max_length Value

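Setting a `max_length` ensures that inputs to the model will be consistent in size, which also makes training faster and more efficient. Rather than taking the longest input, the code below uses the 95th percentile of the tokenized lengths, so roughly the longest 5% of inputs will be truncated.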

In [9]:
# Set TOKENIZERS_PARALLELISM to false to avoid warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Initialize the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-multilingual-cased')

# Tokenize the data and use the 95th percentile of token counts as max_length
tokens = data['variant'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
max_length = int(tokens.map(len).quantile(0.95))
print("Determined max_length for padding/truncation:", max_length)

Determined max_length for padding/truncation: 19
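
To make the percentile choice concrete, here is a minimal sketch with hypothetical token lengths (not the catalog data). It reimplements a linearly interpolated quantile, which I believe matches the default behavior of pandas' `quantile`, and reports which inputs would be truncated at that cutoff. The function name `quantile` and the sample lengths are assumptions for illustration.

```python
import math

def quantile(values: list[int], q: float) -> float:
    """Linearly interpolated quantile (assumed to mirror pandas' default)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical token counts for 20 name variants
token_lengths = list(range(1, 21))

cutoff = int(quantile(token_lengths, 0.95))
truncated = [n for n in token_lengths if n > cutoff]

print("cutoff:", cutoff)
print("share truncated:", len(truncated) / len(token_lengths))
```

In this toy case only the single longest input exceeds the cutoff, which is the trade-off the percentile approach accepts in exchange for shorter padded sequences.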


### Convert the data into the Dataset format

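The class `Dataset` in the HuggingFace API has a method for converting pandas dataframes into the expected format. The following block makes use of that. It also casts the `dll_author_id` column to a `ClassLabel` feature, so that each author ID is stored as an integer class index, and renames that column to `labels`, the name the `Trainer` expects.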

In [10]:
# Create the dataset from pandas
hf_dataset = Dataset.from_pandas(data)

# Get unique label names
label_names = sorted(data['dll_author_id'].unique())
num_labels = len(label_names)

# Define features with ClassLabel
features = Features({
    'variant': Value('string'),
    'dll_author_id': ClassLabel(names=label_names)
})

# Cast the dataset to use these features
hf_dataset = hf_dataset.cast(features)

# Rename 'dll_author_id' to 'labels' for transformers compatibility
hf_dataset = hf_dataset.rename_column('dll_author_id', 'labels')
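
The effect of the `ClassLabel` cast can be pictured with plain Python. This sketch does not call the `datasets` library; it only imitates the name-to-integer mapping on a few hypothetical rows, and the names `str2int` and `int2str` are borrowed here purely as dictionary names.

```python
# Hypothetical rows: (variant, dll_author_id)
rows = [
    ("Caesar, Julius", "A4644"),
    ("Livy", "A4979"),
    ("Cicero, Marcus Tullius", "A5129"),
    ("Caesar, C. Iulius", "A4644"),
]

# Sorted unique IDs, as in the cell above
label_names = sorted({author_id for _, author_id in rows})

# Name-to-integer and integer-to-name lookups
str2int = {name: index for index, name in enumerate(label_names)}
int2str = {index: name for name, index in str2int.items()}

# Each row's label becomes an integer class index
encoded = [(variant, str2int[author_id]) for variant, author_id in rows]

print(label_names)
print(encoded)
```

Because the names are sorted before they are numbered, the same set of author IDs always produces the same indices, whatever order the rows arrive in.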

### Split the Dataset into Training, Validation, and Testing Sets

Splitting the dataset into training, validation, and testing sets ensures that the model is trained on a representative sample of the dataset, validated against another (smaller) representative sample, and tested with data that has not been used in the training or validation process.

In [11]:
# Split the dataset into training, validation, and testing sets
train_test_split = hf_dataset.train_test_split(test_size=0.2, seed=seed)  # 80% train, 20% test
# Further split the training set into training and validation sets
train_val_split = train_test_split['train'].train_test_split(test_size=0.125, seed=seed)  # 0.125 of 0.8 = 0.1 of original

train_dataset = train_val_split['train']
val_dataset = train_val_split['test']
test_dataset = train_test_split['test']
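
The arithmetic behind the two-stage split can be checked by hand. The sketch below assumes 34,876 rows in total (the sum of the three row counts printed further down in this notebook) and rounds the held-out portion up, a rounding rule I have assumed because it reproduces those counts. The helper `split_counts` is hypothetical.

```python
import math

def split_counts(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the held-out part up (an assumption)."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

total = 34876  # sum of the three row counts printed by the notebook

remaining, n_test = split_counts(total, 0.2)     # first split: 80% / 20%
n_train, n_val = split_counts(remaining, 0.125)  # second split of the 80%

print(n_train, n_val, n_test)
print(round(n_train / total, 3), round(n_val / total, 3), round(n_test / total, 3))
```

The result is approximately a 70/10/20 split into training, validation, and testing sets.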

### Tokenize the Dataset

This ensures that the dataset is in the format expected by the `Trainer`, defined below.

In [12]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['variant'], padding="max_length", truncation=True, max_length=max_length)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

### Set the Format of the Dataset for PyTorch

The HuggingFace API is really an abstraction of many PyTorch calls, so the dataset must be in the form expected by PyTorch.

In [13]:
# Set the format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Print the dataset to ensure correct setup
print("Training Dataset:", train_dataset)
print("Validation Dataset:", val_dataset)
print("Testing Dataset:", test_dataset)

Training Dataset: Dataset({
    features: ['variant', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 24412
})
Validation Dataset: Dataset({
    features: ['variant', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 3488
})
Testing Dataset: Dataset({
    features: ['variant', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 6976
})


### Get the Number of Labels

This counts the number of labels (one for each unique `dll_author_id`) and initializes the model's classification head with the correct number of output units. This is why I don't have to set anything manually for a binary or multiclass classification problem. That is handled automatically based on the value of `num_labels`.

In [14]:
num_labels = data['dll_author_id'].nunique()
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-multilingual-cased', num_labels=num_labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Set the Metrics

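I will use `f1` and `accuracy` as basic metrics for the model's training. The F1 score is computed with `average='weighted'`, so each class's score counts in proportion to the number of true examples of that class. More information at <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html> and <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html>.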

In [15]:
def compute_metrics(pred: EvalPrediction):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'f1': f1,
        'accuracy': acc
    }
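
To show what `average='weighted'` means, here is a hand computation in plain Python on a toy example (hypothetical labels, not the notebook's predictions). It is a sketch of the definition as I understand it, not scikit-learn's implementation, and the function name `weighted_f1` is my own.

```python
from collections import Counter

def weighted_f1(labels: list[int], preds: list[int]) -> float:
    """Per-class F1 averaged with weights equal to each class's true count."""
    support = Counter(labels)
    total = 0.0
    for cls, count in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        total += count * f1
    return total / len(labels)

labels = [0, 0, 0, 0, 1, 1]
preds = [0, 0, 0, 0, 0, 1]

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print("weighted F1:", round(weighted_f1(labels, preds), 4))
print("accuracy:", round(accuracy, 4))
```

Here the weighted F1 comes out slightly below the accuracy because the minority class is recalled poorly, which is the kind of difference that makes it worth tracking both metrics.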

## Create the TrainingArguments

The HuggingFace [`Trainer` class](https://huggingface.co/docs/transformers/main_classes/trainer) has many parameters. I will set a number of them in a `TrainingArguments` object, which is passed to the `Trainer` below.

In [16]:
training_args = TrainingArguments(
    output_dir=output_dir,          # Directory for saving model and logs
    eval_strategy="epoch",          # Evaluate the model at the end of every epoch
    eval_steps=500,                 # Only used when eval_strategy="steps"
    logging_dir=log_dir,            # Directory for storing logs
    logging_strategy="steps",       # Log metrics every 'logging_steps' steps
    logging_steps=100,              # Log metrics every 100 steps
    save_strategy="epoch",          # Save a checkpoint at the end of every epoch
    save_steps=500,                 # Only used when save_strategy="steps"
    save_total_limit=1,             # Limit the total amount of checkpoints
    seed=seed,                      # Set a random seed for reproducibility
    per_device_train_batch_size=16, # Batch size for training
    per_device_eval_batch_size=64,  # Batch size for evaluation
    num_train_epochs=20,            # Number of training epochs
    load_best_model_at_end=True,    # Load the best model at the end of training
    metric_for_best_model='f1',     # Use F1 score to find the best model
    greater_is_better=True,         # Higher F1 score is better
    report_to=["tensorboard"]       # Report metrics to TensorBoard
)

## Use the DataCollator

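This next block creates a helper that pads each batch to the length of its longest sequence during training and evaluation, which can increase efficiency and speed. Since the tokenization step above already pads every input to `max_length`, the collator probably has little padding left to do in this notebook, but it still assembles the batches into tensors.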

In [17]:
# Initialize the data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
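
The difference between fixed and dynamic padding can be sketched in plain Python. This is an illustration of the idea only, not the collator's actual code; the token IDs, the padding ID of `0`, and the helper `pad_to` are all hypothetical.

```python
def pad_to(sequences: list[list[int]], length: int, pad_id: int = 0):
    """Right-pad (or truncate) each sequence to `length`; return ids and masks."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        kept = seq[:length]
        padding = length - len(kept)
        input_ids.append(kept + [pad_id] * padding)
        attention_mask.append([1] * len(kept) + [0] * padding)
    return input_ids, attention_mask

# Hypothetical token IDs for one batch of three short inputs
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]]

# Fixed padding: every sequence is padded to a global max_length
fixed_ids, _ = pad_to(batch, length=19)

# Dynamic padding: pad only to the longest sequence in this batch
longest = max(len(seq) for seq in batch)
dynamic_ids, dynamic_mask = pad_to(batch, length=longest)

print(len(fixed_ids[0]), len(dynamic_ids[0]))
print(dynamic_mask)
```

With dynamic padding the batch is only as wide as its longest member, so batches of short inputs carry fewer padding tokens through the model.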

## Initialize the Trainer

Assemble all the parameters defined above and bundle them into the Trainer.

In [18]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    data_collator=data_collator
)
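
The `EarlyStoppingCallback` stops training once the monitored metric has failed to improve for `early_stopping_patience` consecutive evaluations. The following sketch shows that patience rule in plain Python, as I understand it; it is not the library's code, the scores are hypothetical, and the function name `stopping_epoch` is my own.

```python
def stopping_epoch(scores: list[float], patience: int) -> int:
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            # A new best score resets the counter
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Patience was never exhausted: training runs to the last epoch
    return len(scores)

# Hypothetical validation F1 scores, one per epoch
f1_by_epoch = [0.60, 0.80, 0.85, 0.84, 0.85, 0.83, 0.90]

print(stopping_epoch(f1_by_epoch, patience=3))
```

In this toy run the best score arrives at epoch 3, the next three evaluations fail to beat it, and training stops at epoch 6 without ever reaching the higher score at epoch 7. That is the risk a small patience value carries.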

## Take Advantage of Available GPU or Fallback to CPU

The following code checks whether an NVIDIA GPU is available. If not, it checks for an Apple Silicon GPU (which I have). If neither one is available, it defaults to using the CPU.

In [19]:
# General device selection for Colab (CUDA), Mac (MPS), or CPU fallback
if torch.cuda.is_available():
    # Set the device to CUDA (NVIDIA GPU)
    device = torch.device("cuda")
    print("Using device: CUDA (GPU)")
elif torch.backends.mps.is_available():
    # Set the device to MPS (Metal Performance Shaders)
    device = torch.device("mps")
    print("Using device: MPS (Apple Silicon GPU)")
else:
    # Fallback to CPU if no GPU is available
    device = torch.device("cpu")
    print("Using device: CPU")

# Move the model to the selected device
model = model.to(device)

Using device: MPS (Apple Silicon GPU)


## Set Up the EmissionsTracker from CodeCarbon

Initialize a tracker and start tracking emissions. Data is printed to the screen and logged in `../logs/author_emissions_log.csv`.

In [20]:
# Set up CodeCarbon's EmissionsTracker
tracker = EmissionsTracker(
    output_dir="../logs",
    output_file="author_emissions_log.csv"
)
tracker.start()

[codecarbon INFO @ 16:00:20] [setup] RAM Tracking...
[codecarbon INFO @ 16:00:20] [setup] GPU Tracking...
[codecarbon INFO @ 16:00:20] No GPU found.
[codecarbon INFO @ 16:00:20] [setup] CPU Tracking...
[codecarbon INFO @ 16:00:20] CPU Model on constant consumption mode: Apple M4 Pro
[codecarbon INFO @ 16:00:20] >>> Tracker's metadata:
[codecarbon INFO @ 16:00:20]   Platform system: macOS-15.5-arm64-arm-64bit
[codecarbon INFO @ 16:00:20]   Python version: 3.10.9
[codecarbon INFO @ 16:00:20]   CodeCarbon version: 2.2.2
[codecarbon INFO @ 16:00:20]   Available RAM : 24.000 GB
[codecarbon INFO @ 16:00:20]   CPU count: 12
[codecarbon INFO @ 16:00:20]   CPU model: Apple M4 Pro
[codecarbon INFO @ 16:00:20]   GPU count: None
[codecarbon INFO @ 16:00:20]   GPU model: None


## Train the Model

In [21]:
trainer.train()

| Epoch | Training Loss | Validation Loss | F1 | Accuracy |
|---|---|---|---|---|
| 1 | 6.4637 | 6.158919 | 0.060537 | 0.09203 |
| 2 | 2.5416 | 2.419547 | 0.641034 | 0.679759 |
| 3 | 0.7033 | 0.716746 | 0.879845 | 0.887615 |
| 4 | 0.2155 | 0.427254 | 0.914399 | 0.918291 |
| 5 | 0.096 | 0.344012 | 0.932674 | 0.93406 |
| 6 | 0.0399 | 0.323901 | 0.938889 | 0.940367 |
| 7 | 0.0255 | 0.325527 | 0.940329 | 0.942087 |
| 8 | 0.0229 | 0.329344 | 0.944532 | 0.945528 |
| 9 | 0.0118 | 0.325684 | 0.944972 | 0.946674 |
| 10 | 0.0062 | 0.315842 | 0.948272 | 0.949541 |


[codecarbon INFO @ 16:00:35] Energy consumed for RAM : 0.000038 kWh. RAM Power : 9.000000000000002 W
[codecarbon INFO @ 16:00:35] Energy consumed for all CPUs : 0.000177 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 16:00:35] 0.000215 kWh of electricity used since the beginning.
[codecarbon INFO @ 16:00:50] Energy consumed for RAM : 0.000075 kWh. RAM Power : 9.000000000000002 W
[codecarbon INFO @ 16:00:50] Energy consumed for all CPUs : 0.000354 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 16:00:50] 0.000429 kWh of electricity used since the beginning.
[codecarbon INFO @ 16:01:05] Energy consumed for RAM : 0.000113 kWh. RAM Power : 9.000000000000002 W
[codecarbon INFO @ 16:01:05] Energy consumed for all CPUs : 0.000531 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 16:01:05] 0.000644 kWh of electricity used since the beginning.
[codecarbon INFO @ 16:01:20] Energy consumed for RAM : 0.000150 kWh. RAM Power : 9.000000000000002 W
[codecarbon INFO @ 16:01:20] Energy consumed for a

TrainOutput(global_step=25942, training_loss=0.7744246792102354, metrics={'train_runtime': 2196.4578, 'train_samples_per_second': 222.285, 'train_steps_per_second': 13.895, 'total_flos': 2154126104491128.0, 'train_loss': 0.7744246792102354, 'epoch': 17.0})
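
The `global_step` value in this output can be reconciled with the training settings by simple arithmetic. The sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; these are assumptions on my part rather than settings shown explicitly in the notebook.

```python
import math

train_rows = 24412    # training rows reported by the notebook
batch_size = 16       # per_device_train_batch_size
epochs_run = 17       # 'epoch' in the TrainOutput
logging_steps = 100   # logging interval from TrainingArguments

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs_run
logs_per_epoch = steps_per_epoch // logging_steps

print(steps_per_epoch, total_steps, logs_per_epoch)
```

The total matches the reported `global_step`, which suggests that training ended after 17 of the 20 permitted epochs, presumably because of the early stopping callback.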

In [22]:
# Stop the emissions tracker after training is complete
emissions = tracker.stop()
print(f"Estimated CO2 emissions for training: {emissions} kg")

[codecarbon INFO @ 16:36:57] Energy consumed for RAM : 0.005493 kWh. RAM Power : 9.000000000000002 W
[codecarbon INFO @ 16:36:57] Energy consumed for all CPUs : 0.025939 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 16:36:57] 0.031432 kWh of electricity used since the beginning.


Estimated CO2 emissions for training: 0.014880591645000947 kg
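
As a rough check on these figures, the reported energy can be approximated from the constant power values and the timestamps in the log. The carbon intensity computed at the end is only the ratio of the two reported numbers; it is my own derivation, not a value stated by CodeCarbon.

```python
ram_power_w = 9.0    # "RAM Power" in the log
cpu_power_w = 42.5   # "Total CPU Power" in the log

# Elapsed time between the setup lines (16:00:20) and the final report (16:36:57)
elapsed_s = (16 * 3600 + 36 * 60 + 57) - (16 * 3600 + 0 * 60 + 20)

# Watts times seconds gives joules; 3.6 million joules make one kWh
energy_kwh = (ram_power_w + cpu_power_w) * elapsed_s / 3_600_000

reported_energy_kwh = 0.031432
reported_emissions_kg = 0.014880591645000947
implied_intensity = reported_emissions_kg / reported_energy_kwh  # kg CO2 per kWh

print(round(energy_kwh, 6))
print(round(implied_intensity, 4))
```

The estimate agrees with the reported energy to within a few millionths of a kWh, which is what one would expect when the tracker runs in constant consumption mode.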


## Save the Labels in the Model's Configuration

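This is important for reproducibility: storing `label2id` and `id2label` in the model's configuration means the saved model carries its own mapping between class indices and author IDs, so the mapping stays explicit and correct wherever the model is loaded.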

In [23]:
label_feature = train_dataset.features['labels']
model.config.label2id = {name: i for i, name in enumerate(label_feature.names)}
model.config.id2label = {i: name for i, name in enumerate(label_feature.names)}
# Check the mappings
print(model.config.label2id)
print(model.config.id2label)

{'A1868': 0, 'A1870': 1, 'A2181': 2, 'A2491': 3, 'A2492': 4, 'A2493': 5, 'A2494': 6, 'A2495': 7, 'A2508': 8, 'A2755': 9, 'A2868': 10, 'A2870': 11, 'A2871': 12, 'A2872': 13, 'A2873': 14, 'A2874': 15, 'A2875': 16, 'A2876': 17, 'A2877': 18, 'A2878': 19, 'A2879': 20, 'A2880': 21, 'A2881': 22, 'A2882': 23, 'A2883': 24, 'A2884': 25, 'A2885': 26, 'A2886': 27, 'A2887': 28, 'A2888': 29, 'A2889': 30, 'A2890': 31, 'A2891': 32, 'A2892': 33, 'A2893': 34, 'A2894': 35, 'A2895': 36, 'A2896': 37, 'A2897': 38, 'A2898': 39, 'A2901': 40, 'A2902': 41, 'A2903': 42, 'A2904': 43, 'A2905': 44, 'A2906': 45, 'A2907': 46, 'A2908': 47, 'A2909': 48, 'A2910': 49, 'A2911': 50, 'A2912': 51, 'A2913': 52, 'A2914': 53, 'A2915': 54, 'A2916': 55, 'A2917': 56, 'A2918': 57, 'A2919': 58, 'A2920': 59, 'A2921': 60, 'A2922': 61, 'A2923': 62, 'A2924': 63, 'A2925': 64, 'A2926': 65, 'A2927': 66, 'A2928': 67, 'A2929': 68, 'A2930': 69, 'A2931': 70, 'A2932': 71, 'A2933': 72, 'A2934': 73, 'A2935': 74, 'A2936': 75, 'A2937': 76, 'A2938':

## Save the Model with its Tokenizer

Saving the model with its tokenizer is an important step for reproducibility, since it makes it easy to reload both the model and tokenizer later for inference.

In [24]:
# Save the model and tokenizer
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

('../authors/tokenizer_config.json',
 '../authors/special_tokens_map.json',
 '../authors/vocab.txt',
 '../authors/added_tokens.json',
 '../authors/tokenizer.json')

In [25]:
# Evaluate the model on the test dataset
test_results = trainer.evaluate(test_dataset)

print("Test Results:", test_results)

Test Results: {'eval_loss': 0.34222376346588135, 'eval_f1': 0.9497108240694611, 'eval_accuracy': 0.9508314220183486, 'eval_runtime': 4.5249, 'eval_samples_per_second': 1541.689, 'eval_steps_per_second': 24.089, 'epoch': 17.0}


In [26]:
# Load the saved model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained('../authors')
tokenizer = DistilBertTokenizerFast.from_pretrained('../authors')
# Example texts for inference
# Caesar = A4644, Cicero = A5129, Livy = A4979
texts = ["Caesar, Julius","Cicero, Marcus Tullius","Livy"]

# Tokenize the texts
inputs = tokenizer(texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt")

# Make predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Convert predictions back to labels
predicted_labels = [label_feature.int2str(label_id) for label_id in predictions.tolist()]

for text, label in zip(texts, predicted_labels):
    print(f"Text: {text} - Predicted Label: {label}")


Text: Caesar, Julius - Predicted Label: A4644
Text: Cicero, Marcus Tullius - Predicted Label: A5129
Text: Livy - Predicted Label: A4979


In [27]:
# Get the label mapping as a dictionary
label_mapping = {
    label_feature.int2str(i): i
    for i in range(label_feature.num_classes)
}

# Save the mapping to a JSON file
with open('../authors/label_mapping.json', 'w') as f:
    json.dump(label_mapping, f)
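
When the mapping is needed again, it can be read back and inverted so that predicted indices become author IDs. The following is a minimal sketch that uses a temporary directory and a hypothetical three-entry mapping in place of the real `label_mapping.json`.

```python
import json
import os
import tempfile

# Hypothetical mapping in the same shape as label_mapping.json (label -> index)
label_mapping = {"A4644": 0, "A4979": 1, "A5129": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_mapping.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(label_mapping, f)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Invert the mapping so a predicted index can be turned back into an author ID
index_to_label = {index: label for label, index in loaded.items()}

predicted_indices = [2, 0]  # hypothetical argmax outputs
print([index_to_label[i] for i in predicted_indices])
```

Saving the mapping as label-to-index keeps the JSON keys as strings, which avoids the conversion of integer keys to strings that JSON would otherwise force.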