# Fine-tuning distilbert/distilbert-base-multilingual-cased for Greek-Latin Author Labeling

Critical editions of Greek texts are often published with Latinized forms of the authors' names and the works' titles. That presents a problem for a project that seeks to catalog only editions of Latin texts. In this context, Greek texts with Latinized author names and titles are false positives, so it is best to winnow them out of the data before reconciling authors to authority records.

I pulled lists of Greek author names from the Thesaurus Linguae Graecae and the Perseus Project (see `python/cleaning_exploration/greek-data-prep.ipynb` for more on the process). 

I will use those records to train a model to identify Greek and Latin authors.

## Import the Required Packages

This notebook uses:

- `codecarbon` to track energy usage
- `datasets` from the HuggingFace API to manage the data
- `numpy` for formatting and managing data
- `os` to interact with the operating system
- `pandas` to manage data
- `random` to set a random seed for reproducibility
- `scikit-learn` for evaluation metrics
- `transformers` from the HuggingFace API to manage the model
- `torch` for general machine learning functionality

Note that a HuggingFace access token is required. For information, see <https://huggingface.co/docs/hub/security-tokens>.

In [63]:
# Import the necessary modules
from codecarbon import EmissionsTracker
from datasets import Dataset, ClassLabel, Features, Value
import numpy as np
import os
import pandas as pd
import random
from sklearn.metrics import f1_score, accuracy_score
from transformers import (
    DataCollatorWithPadding,
    DistilBertTokenizerFast, 
    DistilBertForSequenceClassification,
    EarlyStoppingCallback,
    EvalPrediction, 
    set_seed,
    Trainer, 
    TrainingArguments
)
import torch

## Make Sure the Directory Structure is in Place

To keep the information organized, I will make a directory for the model (`greek`) and training logs (`logs`). I will also designate the path for the training data.

In [64]:
# Establish file directories
output_dir = '../greek'  # Directory for saving the model
log_dir = '../logs'      # Directory for logs

# Create the directories if they do not exist
os.makedirs(output_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

local_file_path = '../data/deduped_greek_and_latin.csv' # Path to the dataset file

## Set a Random Seed for Reproducibility

Since so much of machine learning depends upon randomness, setting a "random seed" makes the randomness consistent across runs, so that the results should be reproducible by anyone running the code in this notebook. The `seed` variable will be used when creating the dataset splits and when setting the parameters for the `Trainer`.

In [65]:
# Set the random seed for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
set_seed(seed)
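As a minimal illustration of what seeding buys, the sketch below uses only Python's standard `random` module rather than the full set of libraries seeded above. The helper name `draw_numbers` is made up for this example. Re-seeding with the same value should reproduce the same sequence of draws.

```python
import random

def draw_numbers(seed, count=5):
    """Seed Python's random module and return `count` pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(count)]

first_run = draw_numbers(42)
second_run = draw_numbers(42)
different_seed = draw_numbers(7)

print(first_run == second_run)      # same seed, same sequence
print(first_run == different_seed)  # different seed, different sequence
```

The same idea applies to NumPy, PyTorch, and `transformers`, each of which keeps its own random state and therefore needs its own seeding call.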

## Load the Data and Prepare it for Use with the Model

The data I gathered from the TLG and Perseus needs to be turned into tensors, the format expected by the model. The following operations load the original data, convert it into the HuggingFace dataset format, split the dataset into training, validation, and testing sets, and tokenize each split.


In [66]:
# Load the original CSV file into a pandas DataFrame
data = pd.read_csv(local_file_path, encoding='utf-8', quotechar='"')
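# Replace NaN values with an empty string first; converting to str beforehand
# would turn NaN into the literal text 'nan' and leave nothing to fill
data['Name'] = data['Name'].fillna('')
# Convert the 'Name' column to string type so the tokenizer always receives text
data['Name'] = data['Name'].astype(str)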
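The order of the two cleaning steps matters. With object-dtype columns, `astype(str)` can turn a missing value into the literal text `nan`, after which `fillna` has nothing left to fill. A minimal sketch with made-up rows (the `toy` frame is hypothetical, not the project data) shows the fill-first order:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the real data comes from the CSV loaded above
toy = pd.DataFrame({"Name": ["Homer", np.nan, "Cicero, Marcus Tullius"]})

# Fill missing values first, then make sure every entry is a string
cleaned = toy["Name"].fillna("").astype(str)

print(cleaned.tolist())
print(int(toy["Name"].isna().sum()), "missing value(s) replaced")
```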

### Determine the Correct max_length Value

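Setting a `max_length` ensures that inputs to the model are consistent in size, and keeping it small makes training faster and more efficient. I use the 95th percentile of the tokenized name lengths, so roughly the longest 5% of names will be truncated.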

In [67]:
# Set TOKENIZERS_PARALLELISM to false to avoid warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Initialize the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-multilingual-cased')

# Tokenize the data to find the max length
tokens = data['Name'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
# Determine the maximum length for padding/truncation. It will be used later when tokenizing the dataset.
max_length = int(pd.Series(tokens).map(len).quantile(0.95))
print("Determined max_length for padding/truncation:", max_length)

Determined max_length for padding/truncation: 20
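To see what a 95th-percentile cutoff implies, here is a small sketch using invented token counts instead of the real tokenizer output. The array `token_lengths` is purely hypothetical. It applies the same percentile rule and counts how many inputs would exceed the cutoff.

```python
import numpy as np

# Hypothetical token counts for 100 names: one name each of length 1, 2, ..., 100
token_lengths = np.arange(1, 101)

# Take the 95th percentile and round down to an integer
max_length = int(np.quantile(token_lengths, 0.95))

# Names longer than max_length would be truncated by the tokenizer
num_truncated = int((token_lengths > max_length).sum())

print("max_length:", max_length)
print("share truncated:", num_truncated / len(token_lengths))
```

The trade-off is deliberate: a handful of unusually long names lose their tails, while every other input avoids being padded out to the length of the longest outlier.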


### Convert the data into the Dataset format

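The class `Dataset` in the HuggingFace API has a method for converting pandas dataframes into the expected format. The following block makes use of that. It also updates the dataset to use a `ClassLabel` feature for the `Label` column and renames that column to `labels`, the name `transformers` expects.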

In [68]:
# Create the dataset from pandas
hf_dataset = Dataset.from_pandas(data)

# Get unique label names
label_names = sorted(data['Label'].unique())
num_labels = len(label_names)

# Define features with ClassLabel
features = Features({
    'Name': Value('string'),
    'Label': ClassLabel(names=label_names)
})

# Cast the dataset to use these features
hf_dataset = hf_dataset.cast(features)

# Rename 'Label' to 'labels' for transformers compatibility
hf_dataset = hf_dataset.rename_column('Label', 'labels')

Casting the dataset:   0%|          | 0/40458 [00:00<?, ? examples/s]

### Split the Dataset into Training, Validation, and Testing Sets

In [69]:
# Split the dataset into training, validation, and testing sets
train_test_split = hf_dataset.train_test_split(test_size=0.2, seed=seed)
train_val_split = train_test_split['train'].train_test_split(test_size=0.125, seed=seed)  # 0.125 of 0.8 = 0.1 of original

train_dataset = train_val_split['train']
val_dataset = train_val_split['test']
test_dataset = train_test_split['test']
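The two-stage split can be checked with simple arithmetic. The sketch below assumes that the held-out portion of each split is rounded up to a whole number of rows. That assumption is mine, but it reproduces the row counts reported in this notebook's output.

```python
import math

total_rows = 40458  # row count reported when the dataset is cast

# Assumption: the held-out portion is rounded up to a whole number of rows
test_rows = math.ceil(total_rows * 0.2)
remaining = total_rows - test_rows
val_rows = math.ceil(remaining * 0.125)
train_rows = remaining - val_rows

for name, rows in [("train", train_rows), ("validation", val_rows), ("test", test_rows)]:
    print(f"{name}: {rows} rows ({rows / total_rows:.1%})")
```

The result is approximately a 70/10/20 split of the original data.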

### Tokenize the Dataset

Tokenizing converts each name into `input_ids` and an `attention_mask`, padded or truncated to `max_length`, which is the input format the model and the `Trainer` (defined below) expect.

In [70]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['Name'], padding="max_length", truncation=True, max_length=max_length)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/28320 [00:00<?, ? examples/s]

Map:   0%|          | 0/4046 [00:00<?, ? examples/s]

Map:   0%|          | 0/8092 [00:00<?, ? examples/s]

### Set the Format of the Dataset for PyTorch

The HuggingFace `Trainer` runs on top of PyTorch, so the columns passed to the model must be returned as PyTorch tensors.

In [71]:
# Set the format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Print the dataset to ensure correct setup
print("Training Dataset:", train_dataset)
print("Validation Dataset:", val_dataset)
print("Testing Dataset:", test_dataset)

Training Dataset: Dataset({
    features: ['Name', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 28320
})
Validation Dataset: Dataset({
    features: ['Name', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 4046
})
Testing Dataset: Dataset({
    features: ['Name', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 8092
})


### Get the Number of Labels

This counts the number of labels (two in this case: "Greek" and "Latin") and initializes the model's classification head with the correct number of output units. This is why I don't have to set anything manually for a binary or multiclass classification problem. That is handled automatically based on the value of `num_labels`.

In [72]:
num_labels = data['Label'].nunique()
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-multilingual-cased', num_labels=num_labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Set the Metrics

I will use `f1` and `accuracy` as basic metrics for the model's training. More information at <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html> and <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html>.

In [73]:
def compute_metrics(pred: EvalPrediction):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'f1': f1,
        'accuracy': acc
    }
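Accuracy and weighted F1 can diverge when the classes are unbalanced. The following plain-Python sketch is not scikit-learn's implementation, and the helpers `f1_for_class` and `weighted_f1` are hypothetical names. It is meant to mirror what `average='weighted'` computes: a per-class F1 score, averaged with weights proportional to how often each class occurs in the true labels.

```python
def f1_for_class(labels, preds, cls):
    """F1 score for one class, treating `cls` as the positive label."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives, so the F1 score is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(labels, preds):
    """Average per-class F1 scores, weighted by each class's share of the true labels."""
    classes = sorted(set(labels))
    return sum(f1_for_class(labels, preds, c) * labels.count(c) for c in classes) / len(labels)

# Made-up example: four items of class 0, two of class 1, and a model that always predicts 0
labels = [0, 0, 0, 0, 1, 1]
preds = [0, 0, 0, 0, 0, 0]

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print("accuracy:", round(accuracy, 4))
print("weighted F1:", round(weighted_f1(labels, preds), 4))
```

In this toy case the weighted F1 is lower than the accuracy, because the class that is never predicted contributes an F1 of zero.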

## Create the TrainingArguments

The HuggingFace [`Trainer` class](https://huggingface.co/docs/transformers/main_classes/trainer) is configured through a `TrainingArguments` object, which has many parameters. I will set a number of them here.

In [74]:
training_args = TrainingArguments(
    output_dir=output_dir,              # Directory for saving model checkpoints
    eval_strategy="epoch",              # Evaluate at the end of every epoch
    eval_steps=500,                     # Ignored because eval_strategy is "epoch"
    logging_dir=log_dir,                # Directory for storing logs
    logging_strategy="steps",           # Log metrics at a fixed step interval
    logging_steps=100,                  # Log metrics every 100 steps
    save_strategy="epoch",              # Save a checkpoint at the end of every epoch
    save_steps=500,                     # Ignored because save_strategy is "epoch"
    save_total_limit=1,                 # Limit the total number of checkpoints to keep
    seed=seed,                          # Set the random seed for reproducibility
    per_device_train_batch_size=16,     # Batch size for training
    per_device_eval_batch_size=64,      # Batch size for evaluation
    num_train_epochs=20,                # Maximum number of training epochs
    load_best_model_at_end=True,        # Load the best model at the end of training
    metric_for_best_model='f1',         # Use F1 score to find the best model
    greater_is_better=True,             # Higher F1 score is better
    report_to=["tensorboard"]           # Report metrics to TensorBoard
)
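The batch size and the size of the training split determine how many optimizer steps make up an epoch. A quick sanity check follows, assuming a single device and no gradient accumulation. Both are my assumptions, but they are consistent with the `global_step` reported in the training output later in this notebook.

```python
import math

train_rows = 28320      # size of the training split reported in this notebook
batch_size = 16         # per_device_train_batch_size, assuming a single device
epochs_completed = 11   # 'epoch' value in the TrainOutput shown later

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs_completed
logs_per_epoch = steps_per_epoch // 100  # logging_steps=100

print(steps_per_epoch, total_steps, logs_per_epoch)
```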

## Use the DataCollator

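This next block creates a helper that pads each batch to the length of its longest item during training and evaluation. Because the tokenization step above already pads every name to `max_length`, the collator has no extra padding to add here; its dynamic padding only saves computation if the tokenizer is called without `padding="max_length"`.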

In [75]:
# Initialize the data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
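The difference between fixed-length and dynamic padding can be shown with a toy function. This is a plain-Python illustration of the idea, not the code inside `DataCollatorWithPadding`. The function `pad_batch` and the token ids are invented for the example.

```python
def pad_batch(batch, pad_id=0, fixed_length=None):
    """Pad token-id lists to `fixed_length`, or to the longest sequence in the batch."""
    target = fixed_length if fixed_length is not None else max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Made-up token ids for three short names
batch = [[101, 7, 102], [101, 8, 9, 102], [101, 5, 6, 4, 102]]

dynamic_ids, dynamic_mask = pad_batch(batch)                 # pads to the longest item in the batch
fixed_ids, fixed_mask = pad_batch(batch, fixed_length=20)    # pads every item to 20 tokens

print(len(dynamic_ids[0]), len(fixed_ids[0]))
print(dynamic_mask[0])
```

With dynamic padding, a batch of short names produces short tensors, so the model spends less computation on padding tokens.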

## Initialize the Trainer

Assemble all the parameters defined above and bundle them into the Trainer.

In [76]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    data_collator=data_collator,
)


### Take Advantage of an Available GPU or Fall Back to CPU

The following code checks whether an NVIDIA GPU is available. If not, it checks for an Apple Silicon GPU (which I have). If neither one is available, it defaults to using the CPU.

In [77]:
# General device selection for Colab (CUDA), Mac (MPS), or CPU fallback
if torch.cuda.is_available():
    # Set the device to CUDA (NVIDIA GPU)
    device = torch.device("cuda")
    print("Using device: CUDA (GPU)")
elif torch.backends.mps.is_available():
    # Set the device to MPS (Metal Performance Shaders)
    device = torch.device("mps")
    print("Using device: MPS (Apple Silicon GPU)")
else:
    # Fallback to CPU if no GPU is available
    device = torch.device("cpu")
    print("Using device: CPU")

# Move the model to the selected device
model = model.to(device)

Using device: MPS (Apple Silicon GPU)


## Set Up the EmissionsTracker from CodeCarbon

Initialize a tracker and start tracking emissions. Data is printed to the screen and logged in `../logs/greek_latin_emissions_log.csv`.

In [78]:
# Setup CodeCarbon's EmissionsTracker
tracker = EmissionsTracker(
    output_dir="../logs",
    output_file="greek_latin_emissions_log.csv"
)
tracker.start()

[codecarbon INFO @ 16:59:46] [setup] RAM Tracking...
[codecarbon INFO @ 16:59:46] [setup] GPU Tracking...
[codecarbon INFO @ 16:59:46] No GPU found.
[codecarbon INFO @ 16:59:46] [setup] CPU Tracking...
[codecarbon INFO @ 16:59:46] CPU Model on constant consumption mode: Apple M4 Pro
[codecarbon INFO @ 16:59:46] >>> Tracker's metadata:
[codecarbon INFO @ 16:59:46]   Platform system: macOS-15.5-arm64-arm-64bit
[codecarbon INFO @ 16:59:46]   Python version: 3.10.9
[codecarbon INFO @ 16:59:46]   CodeCarbon version: 2.2.2
[codecarbon INFO @ 16:59:46]   Available RAM : 24.000 GB
[codecarbon INFO @ 16:59:46]   CPU count: 12
[codecarbon INFO @ 16:59:46]   CPU model: Apple M4 Pro
[codecarbon INFO @ 16:59:46]   GPU count: None
[codecarbon INFO @ 16:59:46]   GPU model: None


## Train the Model

In [79]:
trainer.train()

| Epoch | Training Loss | Validation Loss | F1 | Accuracy |
|---|---|---|---|---|
| 1 | 0.1739 | 0.152165 | 0.955511 | 0.955264 |
| 2 | 0.106 | 0.122472 | 0.962911 | 0.962926 |
| 3 | 0.0527 | 0.182943 | 0.964394 | 0.964409 |
| 4 | 0.0465 | 0.188794 | 0.956759 | 0.956253 |
| 5 | 0.0249 | 0.197286 | 0.964409 | 0.964409 |
| 6 | 0.0447 | 0.205875 | 0.968384 | 0.968364 |
| 7 | 0.0331 | 0.213764 | 0.965808 | 0.965645 |
| 8 | 0.03 | 0.195977 | 0.968966 | 0.968858 |
| 9 | 0.0178 | 0.265488 | 0.967429 | 0.967375 |
| 10 | 0.0145 | 0.267624 | 0.965198 | 0.965151 |


TrainOutput(global_step=19470, training_loss=0.054792497251862034, metrics={'train_runtime': 1641.0267, 'train_samples_per_second': 345.15, 'train_steps_per_second': 21.572, 'total_flos': 1611962657395200.0, 'train_loss': 0.054792497251862034, 'epoch': 11.0})
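Training stopped well short of the 20 epochs allowed because of the early-stopping callback. The sketch below is a simplified model of patience counting, not the callback's actual code, and `early_stopping_state` is a hypothetical helper. It replays the validation F1 values from the table above and reports the best epoch and how many evaluations have passed since then.

```python
def early_stopping_state(scores, greater_is_better=True):
    """Return (best_epoch, epochs_since_improvement) after the listed evaluations."""
    best_score, best_epoch, stale = None, 0, 0
    for epoch, score in enumerate(scores, start=1):
        improved = best_score is None or (
            score > best_score if greater_is_better else score < best_score
        )
        if improved:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
    return best_epoch, stale

# Validation F1 for epochs 1-10, copied from the table above
f1_by_epoch = [0.955511, 0.962911, 0.964394, 0.956759, 0.964409,
               0.968384, 0.965808, 0.968966, 0.967429, 0.965198]

best_epoch, stale = early_stopping_state(f1_by_epoch)
print(best_epoch, stale)
```

After epoch 10, two evaluations have passed without beating epoch 8. With a patience of 3, one more non-improving epoch would end training, which is consistent with the `'epoch': 11.0` in the `TrainOutput`.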

In [80]:
# Stop the emissions tracker after training is complete
emissions = tracker.stop()

print(f"Estimated CO2 emissions for training: {emissions} kg")

[codecarbon INFO @ 17:27:08] Energy consumed for RAM : 0.004103 kWh. RAM Power : 9.000000000000002 W
[codecarbon INFO @ 17:27:08] Energy consumed for all CPUs : 0.019374 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 17:27:08] 0.023477 kWh of electricity used since the beginning.


Estimated CO2 emissions for training: 0.011114388197519184 kg
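For context, the estimate can be related to the energy figure in the log. The sketch below assumes that the final "electricity used" line is the energy total behind the estimate, and that the estimate is energy multiplied by a regional carbon-intensity factor. Under those assumptions it backs out the implied factor.

```python
energy_kwh = 0.023477                # final "electricity used" line logged by the tracker
emissions_kg = 0.011114388197519184  # value returned by tracker.stop()

# Implied carbon intensity of the electricity, in kg CO2-equivalent per kWh
intensity = emissions_kg / energy_kwh

print(f"{intensity:.3f} kg CO2eq per kWh")
print(f"{emissions_kg * 1000:.1f} g CO2eq in total")
```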


## Save the Labels in the Model's Configuration

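This is important for reproducibility because it makes the mappings explicit: with them stored in the model's configuration, the saved model can translate class indices back to "Greek" and "Latin" without access to the original dataset.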

In [81]:
label_feature = train_dataset.features['labels']
model.config.label2id = {name: i for i, name in enumerate(label_feature.names)}
model.config.id2label = {i: name for i, name in enumerate(label_feature.names)}
# Check the mappings
print(model.config.label2id)
print(model.config.id2label)

{'Greek': 0, 'Latin': 1}
{0: 'Greek', 1: 'Latin'}
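The mapping comes out as Greek = 0 and Latin = 1 because the label names were sorted before being assigned integer ids. A minimal stand-alone sketch of that idea, using an invented list of label strings, looks like this:

```python
# Label strings as they might appear in the 'Label' column (order is arbitrary here)
raw_labels = ["Latin", "Greek", "Latin", "Greek", "Greek"]

# Sorting the unique names fixes the order, so each name always gets the same integer id
label_names = sorted(set(raw_labels))
label2id = {name: i for i, name in enumerate(label_names)}
id2label = {i: name for name, i in label2id.items()}

print(label2id)
print(id2label)
```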


## Save the Model with its Tokenizer

Saving the model with its tokenizer is an important step for reproducibility, since it makes it easy to reload both the model and tokenizer later for inference.

In [82]:
# Save the model and tokenizer
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

('../greek/tokenizer_config.json',
 '../greek/special_tokens_map.json',
 '../greek/vocab.txt',
 '../greek/added_tokens.json',
 '../greek/tokenizer.json')

## Evaluate the Model on the Test Data

When I split the dataset above, I held out 20% of it for testing the model on data it had not yet "seen". I'll evaluate the trained model on that test set, then reload the saved model and tokenizer to label a few sample names.

In [83]:
# Evaluate the model on the test dataset
test_results = trainer.evaluate(test_dataset)

print("Test Results:", test_results)

Test Results: {'eval_loss': 0.19626270234584808, 'eval_f1': 0.9677150540915004, 'eval_accuracy': 0.967622343054869, 'eval_runtime': 4.401, 'eval_samples_per_second': 1838.662, 'eval_steps_per_second': 28.857, 'epoch': 11.0}


In [None]:
# Load the saved model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained('../greek')
tokenizer = DistilBertTokenizerFast.from_pretrained('../greek')
# Example texts for inference
texts = ["Juvenal","Unknown","Caesar, Julius","Cicero, Marcus Tullius","Homer","Hesiod","Aristotle"]

# Tokenize the texts
inputs = tokenizer(texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt")

# Make predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Convert predictions back to labels using the mapping saved in the model's configuration
predicted_labels = [model.config.id2label[label_id] for label_id in predictions.tolist()]

for text, label in zip(texts, predicted_labels):
    print(f"Text: {text} - Predicted Label: {label}")

Text: Juvenal - Predicted Label: Latin
Text: Unknown - Predicted Label: Latin
Text: Caesar, Julius - Predicted Label: Latin
Text: Cicero, Marcus Tullius - Predicted Label: Latin
Text: Homer - Predicted Label: Greek
Text: Hesiod - Predicted Label: Greek
Text: Aristotle - Predicted Label: Greek
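The predictions above report only the winning label. If a confidence score is also wanted, the model's logits can be passed through a softmax. The sketch below uses invented logits rather than output from the trained model, and a hand-written `softmax` helper, to show the calculation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "Greek", 1: "Latin"}  # mapping printed earlier in this notebook
hypothetical_logits = [-1.2, 2.3]    # invented values, not output from the trained model

probs = softmax(hypothetical_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 3))
```

A low top probability could be used to flag names, such as "Unknown", for manual review instead of automatic labeling.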

