# Fine-Tune Large Language Models for Hittite Glossing

In this notebook, we fine-tune a pre-trained German T5 language model on Hittite, a low-resource ancient language, to test how well it adapts to the challenges of Hittite morphology. The task used here maps each Hittite word (the `word` column of the dataset) to its German translation (the `trans` column). The notebook outlines the fine-tuning process and evaluates the result qualitatively and with the ROUGE metric.

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

Now install the required packages for the LLM and datasets.


In [1]:
%pip install --upgrade pip
%pip install \
    torch==1.13.1 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    transformers==4.27.2 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 \
    sentencepiece \
    openai \
    pandas \
    numpy \
    matplotlib \
    tqdm


Collecting pip
  Downloading pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 19.2 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-24.3.1
Collecting torch==1.13.1
  Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl.metadata (24 kB)
Collecting datasets==2.11.0
  Downloading datasets-2.11.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate==0.4.0
  Downloading evaluate-0.4.0-py3-none-any.whl.metadata (9.4 kB)
Collecting transformers==4.27.2
  Downloading transformers-4.27.2-py3-none-any.whl.metadata (106 kB)
Collecting rouge_score==0.1.2
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting loralib==0.1.

Import the necessary components.




In [2]:
# Core Libraries
import torch  # PyTorch for model training
from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
from transformers import Trainer, TrainingArguments
from datasets import Dataset, load_metric

# Utility Libraries
import pandas as pd
import numpy as np
import os
import random
import time
import evaluate


<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

In [3]:
data = pd.read_csv("hittite_ds.csv", index_col = 0)
data.columns = ['txtid','lnr','cth','word', 'text', 'gloss','trans']
data.head()

|   | txtid | lnr | cth | word | text | gloss | trans |
|---|---|---|---|---|---|---|---|
| 0 | IBoT 1.30+ | Vs. 1 | 821 | LUGALuš | ⸢LUGAL⸣-uš | FNL(u).NOM.SG.C | König |
| 1 | IBoT 1.30+ | Vs. 1 | 821 | kuapi | ku-wa-pí | CNJ | sobald als |
| 2 | IBoT 1.30+ | Vs. 1 | 821 | DINGIRaš | DINGIR{MEŠ}-aš | D/L.PL | Gottheit |
| 3 | IBoT 1.30+ | Vs. 1 | 821 | aruaizi | a-ru-wa-a-ez-zi | 3SG.PRS | sich verneigen |
| 4 | IBoT 1.30+ | Vs. 1 | 821 | GUDU₁₂ | {LÚ}GUDU₁₂ | NOM.SG(UNM) | Gesalbter |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 170496 entries, 0 to 170495
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   txtid   170496 non-null  object
 1   lnr     170496 non-null  object
 2   cth     170496 non-null  int64 
 3   word    170496 non-null  object
 4   text    170496 non-null  object
 5   gloss   170469 non-null  object
 6   trans   170496 non-null  object
dtypes: int64(1), object(6)
memory usage: 10.4+ MB


In [5]:
hf_dataset = Dataset.from_pandas(data[['word','trans']])
hf_dataset = hf_dataset.remove_columns("__index_level_0__") if "__index_level_0__" in hf_dataset.column_names else hf_dataset

hf_dataset

Dataset({
    features: ['word', 'trans'],
    num_rows: 170496
})

In [6]:
# Split into train, validation, and test sets
splits = hf_dataset.train_test_split(test_size=0.2, seed=43)  # 80% train, 20% test
train_dataset = splits["train"]
test_dataset = splits["test"]

# Further split test set into validation and test
val_test_splits = test_dataset.train_test_split(test_size=0.5, seed=43)  # 50/50 split
val_dataset = val_test_splits["train"]
test_dataset = val_test_splits["test"]
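The two-stage split above yields roughly 80% training, 10% validation, and 10% test data. The sketch below reproduces the expected row counts with plain arithmetic. It assumes the held-out share is rounded up and the remainder goes to training, which is consistent with the row counts reported later in this notebook; `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math

def split_sizes(n_rows, test_fraction=0.2, holdout_fraction=0.5):
    """Return (train, validation, test) row counts for a two-stage split."""
    # Assumption: the held-out share is rounded up; the rest is training data.
    held_out = math.ceil(n_rows * test_fraction)
    train = n_rows - held_out
    # Second stage: split the held-out rows into validation and test.
    test = math.ceil(held_out * holdout_fraction)
    validation = held_out - test
    return train, validation, test

sizes = split_sizes(170496)
print(sizes)  # (136396, 17050, 17050)
```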

Load the pre-trained [GermanT5 model](https://huggingface.co/GermanT5/german-t5-oscar-ep1-prompted-germanquad) and its tokenizer directly from HuggingFace.

In [43]:
model_name='GermanT5/german-t5-oscar-ep1-prompted-germanquad'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [None]:
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# original_model = original_model.to(device)


In [None]:
#original_model.config

In [None]:
# test_input = "Provide the gloss for the word: LUGALuš"
# tokens = tokenizer(test_input, return_tensors="pt")
# print(tokens)

You can count the model's parameters and find out how many of them are trainable. The following function does that; you do not need to go into its details at this stage.

In [20]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}, all model parameters: {all_model_params}, percentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print_number_of_trainable_model_parameters(original_model)

'trainable model parameters: 247539456, all model parameters: 247539456, percentage of trainable model parameters: 100.00%'

<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the pre-trained model with zero-shot inference, that is, prompt it without any fine-tuning or in-prompt examples.

In [23]:
random_indices = random.sample(range(len(test_dataset)), 5)
for idx in random_indices:
    hittite_word = test_dataset[idx]['word']
    expected_gloss = test_dataset[idx]['trans']
    prompt = f"Geben Sie die Übersetzung für das folgende hethitische Wort \n\n{hittite_word} an:\n\n die Übersetzung:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = tokenizer.decode(
        original_model.generate(inputs["input_ids"], max_new_tokens=50)[0],
        skip_special_tokens=True,
    )
    print(f"Input: {hittite_word}")
    print(f"Expected translation: {expected_gloss}")
    print(f"Generated translation: {output}\n")


Input: anaḫi
Expected translation: Kostprobe (einer Opfergabe)
Generated translation: ''anaḫi''

Input: ḫazi
Expected translation: Ḫazzi
Generated translation: ''subsidia''

Input: ME-ŠE-DI
Expected translation: Leibwächter
Generated translation: ''ME-ŠE-DI'' („Ägyptisch-Arabisch“)

Input: kuliuišna
Expected translation: Kuliwišna
Generated translation: kategorische

Input: UŠ-KÉ-EN
Expected translation: sich niederwerfen
Generated translation: ''UŠ-KÉ-EN''



<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dataset

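Training Prompt (Hittite Word or Phrase):
Wrap the input word or phrase in the instruction "Provide the translation for the following Hittite word:". The code below uses the German wording of this instruction, matching the German base model; the example shows the English equivalent.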

Example:


```
Provide the translation for the following Hittite word:

LUGALuš

Translation:
```



Training Response (Translation):
The response should be the expected translation for the word.

Example:



```
König
```




In [27]:
def preprocess_function(example):
    # Ensure input and target are strings
    return {
        "input_text": f"Geben Sie die Übersetzung für das folgende hethitische Wort {str(example['word'])} an \n\n die Übersetzung:",
        "target_text": str(example["trans"])
    }

In [28]:
train_dataset_preprocessed = train_dataset.map(preprocess_function)
val_dataset_preprocessed = val_dataset.map(preprocess_function)
test_dataset_preprocessed = test_dataset.map(preprocess_function)
train_dataset_preprocessed[378]

Map:   0%|          | 0/136396 [00:00<?, ? examples/s]

Map:   0%|          | 0/17050 [00:00<?, ? examples/s]

Map:   0%|          | 0/17050 [00:00<?, ? examples/s]

{'word': 'AZU',
 'trans': 'Opferschauer',
 'input_text': 'Geben Sie die Übersetzung für das folgende hethitische Wort AZU an \n\n die Übersetzung:',
 'target_text': 'Opferschauer'}

In [29]:
# Tokenize datasets
def tokenize_function(example):
    model_inputs = tokenizer(
        example["input_text"], max_length=512, padding="max_length", truncation=True
    )
    labels = tokenizer(
        example["target_text"], max_length=128, padding="max_length", truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
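One detail worth considering in the tokenization step: the labels are padded to 128 tokens with the tokenizer's padding id, so the loss is also computed on padding positions. Hugging Face seq2seq models skip label positions set to `-100`, so a common variant replaces padding ids with that value. The sketch below shows the idea on plain Python lists; `mask_padding` is a hypothetical helper, and the token ids and pad id `0` are made up for illustration. This is not what the cell above does, and applying it would change the loss values reported later.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in Hugging Face seq2seq models

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]

# Toy batch: made-up token ids, hypothetical pad id of 0.
padded_labels = [[412, 87, 1, 0, 0], [9, 1, 0, 0, 0]]
masked_labels = mask_padding(padded_labels, pad_token_id=0)
print(masked_labels)  # [[412, 87, 1, -100, -100], [9, 1, -100, -100, -100]]
```

Inside `tokenize_function`, this could be applied as `model_inputs["labels"] = mask_padding(labels["input_ids"], tokenizer.pad_token_id)`.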

In [30]:
train_dataset_tokenized = train_dataset_preprocessed.map(tokenize_function, batched=True)
val_dataset_tokenized = val_dataset_preprocessed.map(tokenize_function, batched=True)
test_dataset_tokenized = test_dataset_preprocessed.map(tokenize_function, batched=True)

# Set dataset format for PyTorch
train_dataset_tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
val_dataset_tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset_tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

Map:   0%|          | 0/136396 [00:00<?, ? examples/s]

Map:   0%|          | 0/17050 [00:00<?, ? examples/s]

Map:   0%|          | 0/17050 [00:00<?, ? examples/s]

Check the shapes of all three parts of the dataset:

In [31]:
print("Shapes of the datasets:")
print(f"Training: {train_dataset_tokenized.shape}")
print(f"Validation: {val_dataset_tokenized.shape}")
print(f"Test: {test_dataset_tokenized.shape}")

test_dataset_tokenized

Shapes of the datasets:
Training: (136396, 7)
Validation: (17050, 7)
Test: (17050, 7)


Dataset({
    features: ['word', 'trans', 'input_text', 'target_text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 17050
})

The output dataset is ready for fine-tuning.

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now use the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass it the tokenized training and validation datasets together with the original model. The remaining training parameters were found experimentally and are not discussed in detail here.

In [32]:
output_dir = f'./glossing-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=4,
    weight_decay=0.01,
    logging_steps=1,
    #max_steps=1,
    #per_device_train_batch_size=8,
    evaluation_strategy="epoch",
    report_to="none"
)


trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=train_dataset_tokenized,
    eval_dataset=val_dataset_tokenized
)

Start training process...



In [33]:
trainer.train()



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.0021 | 0.011685 |
| 2 | 0.0294 | 0.007539 |
| 3 | 0.0341 | 0.006213 |
| 4 | 0.0028 | 0.005919 |


TrainOutput(global_step=68200, training_loss=0.027722070564675638, metrics={'train_runtime': 25944.7089, 'train_samples_per_second': 21.029, 'train_steps_per_second': 2.629, 'total_flos': 3.73560475524268e+17, 'train_loss': 0.027722070564675638, 'epoch': 4.0})
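The reported `global_step` can be cross-checked by hand. The sketch below assumes a single device, no gradient accumulation, a final partial batch that is kept, and the `Trainer` default batch size of 8 (the `per_device_train_batch_size` line is commented out above); `total_training_steps` is a hypothetical helper.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when every batch triggers one update."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch counts
    return steps_per_epoch * num_epochs

steps = total_training_steps(num_examples=136396, batch_size=8, num_epochs=4)
print(steps)  # 68200
```

Under these assumptions the result matches the `global_step=68200` shown in the training output.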



Save the fine-tuned model and its tokenizer, then reload them with `AutoModelForSeq2SeqLM` and `AutoTokenizer` as the instruct model:

In [34]:
# Save the fine-tuned model
trainer.save_model(output_dir)

# Save the tokenizer
tokenizer.save_pretrained(output_dir)

print(f"Model and tokenizer saved to {output_dir}")


Model and tokenizer saved to ./glossing-training-1733253000


In [35]:
instructed_model = AutoModelForSeq2SeqLM.from_pretrained(output_dir)
instructed_tokenizer = AutoTokenizer.from_pretrained(output_dir)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below, the fine-tuned model produces reasonable translations for some of the words, whereas the original model did not understand what was being asked of it.

In [36]:
random_indices = random.sample(range(len(test_dataset)), 5)

In [37]:
for idx in random_indices:
    hittite_word = test_dataset[idx]['word']
    expected_gloss = test_dataset[idx]['trans']
    prompt = f"Provide the translation for the following Hittite word:\n\n{hittite_word}\n\nTranslation:"
    inputs = instructed_tokenizer(prompt, return_tensors="pt")
    output = instructed_tokenizer.decode(
        instructed_model.generate(inputs["input_ids"], max_new_tokens=50)[0],
        skip_special_tokens=True,
    )
    print(f"Input: {hittite_word}")
    print(f"Expected translation: {expected_gloss}")
    print(f"Generated translation: {output}\n")

Input: ekuzi
Expected translation: trinken
Generated translation: (Gefäß)

Input: mezulla
Expected translation: Mez(z)ul(l)a
Generated translation: Mez(z)ul(l)a

Input: QA-TAM-MApat
Expected translation: ebenso
Generated translation: ebenso

Input: pai
Expected translation: geben
Generated translation: (u.B.)

Input: DINGIRnana
Expected translation: Gottheit
Generated translation: (Priesterin)
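Note that the cell above prompts the fine-tuned model in English, while `preprocess_function` in section 2.1 builds a German prompt. A model fine-tuned on one wording may respond less reliably to another, so it may be worth building inference prompts from the same template used in training. A minimal sketch, where `TRAINING_TEMPLATE` and `build_prompt` are hypothetical names and the template text is copied from `preprocess_function`:

```python
# Template copied from preprocess_function in section 2.1.
TRAINING_TEMPLATE = (
    "Geben Sie die Übersetzung für das folgende hethitische Wort "
    "{word} an \n\n die Übersetzung:"
)

def build_prompt(word):
    """Build the prompt with the exact wording used during fine-tuning."""
    return TRAINING_TEMPLATE.format(word=str(word))

print(repr(build_prompt("AZU")))
```

Using `build_prompt(hittite_word)` in both the preprocessing and the evaluation cells would keep the two in sync.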



<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) was designed to score summaries by measuring their n-gram overlap with a reference that is usually written by a human. Here it is applied to translations: each generated translation is compared with the reference translation from the dataset. While not perfect, it indicates how much fine-tuning has improved the model's output.
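To make the metric concrete, the sketch below computes a simplified n-gram overlap F1 on whitespace tokens. It is only an illustration of the idea behind ROUGE-N: the `rouge` metric loaded below applies its own tokenization and optional stemming, so its numbers can differ. `ngram_f1` is a hypothetical helper.

```python
from collections import Counter

def ngram_f1(prediction, reference, n=1):
    """F1 of n-gram overlap between two whitespace-tokenized strings."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(ngram_f1("Brotlaib", "Brotlaib", n=1))                 # 1.0
print(ngram_f1("Brotlaib", "Brotlaib", n=2))                 # 0.0 (a single word has no bigrams)
print(ngram_f1("sich verneigen", "sich niederwerfen", n=1))  # 0.5
```

The second call shows why bigram scores stay at zero for single-word translations, even when the prediction is correct.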

In [38]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [48]:
# Move models to the appropriate device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

hittite_words = test_dataset[10450:10550]['word']  # Hittite words/phrases
human_baseline_tr = test_dataset[10450:10550]['trans']  # Expected glosses (human-provided)

original_model = original_model.to(device)
instructed_model = instructed_model.to(device)

# Initialize lists to store model outputs
original_model_tr = []
instructed_model_tr = []

# Iterate through the selected examples
for hittite_word in hittite_words:
    # Create the input prompt
    prompt = f"""
    Geben Sie die Übersetzung für das folgende hethitische Wort

    {hittite_word} an

    die Übersetzung:
    """
    # Tokenize the input prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)  # Move input_ids to the same device as the model
    input_ids_instructed = instructed_tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    # Generate gloss using the pre-trained (original) model
    original_model_outputs = original_model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(max_new_tokens=50)
    )
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_tr.append(original_model_text_output)

    # Generate gloss using the fine-tuned (instructed) model
    instructed_model_outputs = instructed_model.generate(
        input_ids=input_ids_instructed,
        generation_config=GenerationConfig(max_new_tokens=50)
    )
    instructed_model_text_output = instructed_tokenizer.decode(instructed_model_outputs[0], skip_special_tokens=True)
    instructed_model_tr.append(instructed_model_text_output)

# Combine results into a DataFrame for analysis
zipped_glosses = list(zip(human_baseline_tr, original_model_tr, instructed_model_tr))
df = pd.DataFrame(zipped_glosses, columns=['human_baseline_tr', 'original_model_tr', 'instruct_model_tr'])

# Display the DataFrame
df


|   | human_baseline_tr | original_model_tr | instruct_model_tr |
|---|---|---|---|
| 0 | Großer | Geben Sie die erste von 100 Sanskrit-Übersetzu... | Großer |
| 1 | Brotlaib | ninda.gura | Brotlaib |
| 2 | Totentempel | šišta | Ḫišta |
| 3 | Berg | nihilistisch | Papi |
| 4 | Linsensuppe | TU Chemnitz | Saubohnensuppe |
| ... | ... | ... | ... |
| 95 | ziehen | Geben Sie die erste von 100 Sanskrit-Übersetzu... | ziehen |
| 96 | dann | namma | dann |
| 97 | Wettergott | ŠKURna | Wettergott |
| 98 | Siegel | KIŠIB | Siegel |


In [47]:
# Compute ROUGE scores for the original (pre-trained) model
original_model_results = rouge.compute(
    predictions=original_model_tr,
    references=human_baseline_tr,
    use_aggregator=True,
    use_stemmer=True,
)

# Compute ROUGE scores for the instructed (fine-tuned) model
instructed_model_results = rouge.compute(
    predictions=instructed_model_tr,
    references=human_baseline_tr,
    use_aggregator=True,
    use_stemmer=True,
)

# Display results
print("ORIGINAL MODEL:")
print(original_model_results)

print("\nINSTRUCTED MODEL:")
print(instructed_model_results)

ORIGINAL MODEL:
{'rouge1': 0.025555555555555557, 'rouge2': 0.02, 'rougeL': 0.025555555555555557, 'rougeLsum': 0.026666666666666665}

INSTRUCTED MODEL:
{'rouge1': 0.895, 'rouge2': 0.27, 'rougeL': 0.895, 'rougeLsum': 0.895}


On the 100 test examples evaluated here, the **fine-tuned (instructed) model** scores far higher than the original pre-trained model on the task of Hittite translation:

**Original Model**: Low ROUGE-1 (2.56%) and ROUGE-2 (2%) scores, showing minimal overlap with reference translations, which reflects its lack of task-specific adaptation.

**Instructed Model**: High ROUGE-1 (89.5%) and ROUGE-L (89.5%) scores, indicating strong alignment with the reference translations. The lower ROUGE-2 (27%) should be read with care: many reference translations are single words, which contain no bigrams and so cannot contribute to ROUGE-2 even when the prediction is correct.
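Because most reference translations are single words, an exact-match rate can complement the ROUGE scores. The sketch below is one possible implementation, not part of the original evaluation; `exact_match_rate` is a hypothetical helper, and the sample values are the first five rows of the results table above.

```python
def exact_match_rate(predictions, references):
    """Share of predictions identical to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# First five rows of the results table (reference vs. fine-tuned model output).
references = ["Großer", "Brotlaib", "Totentempel", "Berg", "Linsensuppe"]
predictions = ["Großer", "Brotlaib", "Ḫišta", "Papi", "Saubohnensuppe"]
print(exact_match_rate(predictions, references))  # 0.4
```

On the full sample this would be called as `exact_match_rate(instructed_model_tr, human_baseline_tr)`.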

***Conclusion: Fine-tuning improves the model's performance substantially, adapting it effectively for Hittite translation tasks.***