# Fine-Tune a Generative AI Model for a Machine Translation Task

In this notebook, I will fine-tune an existing model from Hugging Face for enhanced machine translation. I will use the [MarianMT](https://huggingface.co/docs/transformers/en/model_doc/marian) model, a pre-trained translation model that can translate text out of the box. To improve the inferences, I will explore a full fine-tuning approach and evaluate the results with the BLEU metric. Then I will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and weigh the benefits of PEFT against its potentially slightly lower performance metrics.

# Table of Contents

- [ 1 - Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Machine Translation Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
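  - [ 2.4 - Evaluate the Model Quantitatively (with BLEU Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with BLEU Metric)](#3.4)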

<a name='1'></a>
## 1 - Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Required Dependencies

Install the required packages for the LLM and datasets.


In [1]:
%pip install -U datasets==2.17.0

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

Note: you may need to restart the kernel to use updated packages.
ERROR: Ignored the following yanked versions: 0.3.0a0
ERROR: Could not find a version that satisfies the requirement torchdata==0.5.1 (from versions: 0.3.0a1, 0.3.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.8.0, 0.9.0, 0.10.0, 0.10.1)
ERROR: No matching distribution found for torchdata==0.5.1


Import the necessary components. Some of them are discussed later in the notebook.

In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np



<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

This notebook uses the [En-Az](https://huggingface.co/datasets/Zarifa/English-To-Azerbaijani) Hugging Face dataset. It contains 5,000+ sentence pairs, each with a manually labeled translation.

In [3]:
huggingface_dataset_name = "Zarifa/English-To-Azerbaijani"

dataset = load_dataset(huggingface_dataset_name)

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 5161
    })
})

Load the pre-trained [MarianMT](https://huggingface.co/Helsinki-NLP/opus-mt-az-en) model and its tokenizer directly from Hugging Face. The model is loaded here in its default precision; passing `torch_dtype=torch.bfloat16` to `from_pretrained` would instead load the weights in bfloat16 to reduce memory use.

In [4]:
model_name = "Helsinki-NLP/opus-mt-az-en"
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)



It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function does that; at this stage, you do not need to go into its details.

In [5]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 56061952
all model parameters: 56586240
percentage of trainable model parameters: 99.07%
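
As a quick sanity check, the percentage above can be reproduced by hand. The sketch below only redoes the arithmetic with the two printed counts. Reading the frozen remainder as two fixed 512 × 512 positional-embedding tables is an assumption based on the usual MarianMT configuration, not something this notebook verifies.

```python
# Counts copied from the output above.
trainable_params = 56_061_952
all_params = 56_586_240

frozen_params = all_params - trainable_params
trainable_pct = 100 * trainable_params / all_params

print(f"frozen parameters: {frozen_params}")
print(f"trainable share: {trainable_pct:.2f}%")

# Assumption (not verified in this notebook): the frozen weights are the
# encoder and decoder positional-embedding tables, each 512 positions x 512 dims.
assumed_positional_tables = 2 * 512 * 512
print(f"matches assumed positional tables: {frozen_params == assumed_positional_tables}")
```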


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. The model struggles to translate the text compared to the human baseline, but it does recover part of the meaning ("was elected for four years"), which suggests the model can be fine-tuned for the task at hand.

In [6]:
index = 200

sentence = dataset['train'][index]['translation']['aze']
translate = dataset['train'][index]['translation']['en']

prompt = f"""
Translate the following sentence.

{sentence}

Translation:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN Translation:\n{translate}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Translate the following sentence.

Başçı dörd illiyinə seçildi.

Translation:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN Translation:
The president was elected for four years.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
Technology was elected for four years:


<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Machine Translation Dataset

You need to convert the sentence-translation (prompt-response) pairs into explicit instructions for the model. Prepend the instruction `Translate the following conversation.` to the source sentence and append `Translate: ` after it, as follows:

Training prompt (source sentence):
```
Translate the following conversation.

    Başçı dörd illiyinə seçildi.
    
Translate: 
```

Training response (translation):
```
The president was elected for four years.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).
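
To make the batch layout concrete, here is a minimal sketch of the prompt/response construction on its own, without the tokenizer. The helper name `build_pairs` and the one-example batch are hypothetical; the dictionary-of-lists shape mirrors what `Dataset.map(..., batched=True)` passes to the mapped function.

```python
START_PROMPT = "Translate the following conversation.\n\n"
END_PROMPT = "\n\nTranslate: "

def build_pairs(batch):
    """Return (prompts, responses) for a batch shaped like
    {'translation': [{'aze': ..., 'en': ...}, ...]}."""
    prompts = [START_PROMPT + pair["aze"] + END_PROMPT for pair in batch["translation"]]
    responses = [pair["en"] for pair in batch["translation"]]
    return prompts, responses

# Hypothetical one-example batch built from the sentence shown above.
batch = {
    "translation": [
        {"aze": "Başçı dörd illiyinə seçildi.",
         "en": "The president was elected for four years."},
    ]
}

prompts, responses = build_pairs(batch)
print(repr(prompts[0]))
print(repr(responses[0]))
```

The tokenizer is then applied to `prompts` to obtain `input_ids` and to `responses` to obtain `labels`.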

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 5161
    })
})

In [8]:
def tokenize_function(example):
    start_prompt = 'Translate the following conversation.\n\n'
    end_prompt = '\n\nTranslate: '
    prompt = [start_prompt + sentence + end_prompt for sentence in [ex['aze'] for ex in example['translation']]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer([ex['en'] for ex in example['translation']], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# The dataset contains a single split (train).
# With batched=True, tokenize_function receives the examples in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [9]:
tokenized_datasets = tokenized_datasets.remove_columns(['id'])

To save some time, I will subsample the dataset.

*Note*: the dataset has no validation split, so I take a different subsample of the train split as validation data.

In [10]:
tokenized_datasets_training = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)
tokenized_datasets_validation = tokenized_datasets.filter(lambda example, index: index % 1001 == 0, with_indices=True)

Check the shapes of both subsets:

In [11]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets_training['train'].shape}")
print(f"Validation: {tokenized_datasets_validation['train'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (52, 3)
Validation: (6, 3)
DatasetDict({
    train: Dataset({
        features: ['translation', 'input_ids', 'labels'],
        num_rows: 5161
    })
})
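
The subset sizes above follow directly from the index filters. The sketch below reproduces them with plain integers (no dataset needed) and also checks whether the two subsets share any rows; the variable names are illustrative only.

```python
num_rows = 5161  # size of the train split reported above

# Same conditions as the two filter lambdas.
train_idx = [i for i in range(num_rows) if i % 100 == 0]
val_idx = [i for i in range(num_rows) if i % 1001 == 0]

# Rows that end up in both the training and the validation subset.
overlap = sorted(set(train_idx) & set(val_idx))

print(len(train_idx), len(val_idx), overlap)
```

Row 0 satisfies both conditions, so it appears in both subsets. If a strictly disjoint validation set is wanted, one option (an assumption, not what this notebook does) is to filter validation rows with `index % 1001 == 1` instead.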


The output dataset is ready for fine-tuning.

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

In [51]:
output_dir = f'./machine-translation-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=115,   # Ignored here: a positive max_steps takes precedence
    weight_decay=0.01,
    logging_steps=10,
    max_steps=10,           # Stop after 10 optimizer steps
    save_strategy="steps",  # Save checkpoints during training
    save_steps=5,           # Save every 5 steps
    save_total_limit=2      # Keep only the 2 most recent checkpoints
)


trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets_training['train'],
    eval_dataset=tokenized_datasets_validation['train'] # Subsample of the train split; the dataset has no validation or test split
)

Start training process...

In [52]:
trainer.train()

100%|██████████| 10/10 [00:05<00:00,  2.00it/s]

{'loss': 1.5615, 'learning_rate': 0.0, 'epoch': 1.43}


100%|██████████| 10/10 [00:05<00:00,  1.67it/s]

{'train_runtime': 5.9995, 'train_samples_per_second': 13.334, 'train_steps_per_second': 1.667, 'train_loss': 1.561526107788086, 'epoch': 1.43}





TrainOutput(global_step=10, training_loss=1.561526107788086, metrics={'train_runtime': 5.9995, 'train_samples_per_second': 13.334, 'train_steps_per_second': 1.667, 'train_loss': 1.561526107788086, 'epoch': 1.43})
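
The `'epoch': 1.43` in the log above can be reproduced from the subset size and the step limit. This is a small arithmetic sketch, assuming a single device, no gradient accumulation, and the `TrainingArguments` default batch size of 8 (no batch size is set in this notebook).

```python
import math

num_train_examples = 52      # training subset size reported earlier
per_device_batch_size = 8    # assumption: TrainingArguments default, single device
max_steps = 10               # from the training arguments above

# One epoch is one pass over the subset; the last batch may be partial.
steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
epochs_completed = max_steps / steps_per_epoch

print(steps_per_epoch, round(epochs_completed, 2))
```

Under these assumptions training covers fewer than two passes over 52 examples, regardless of the `num_train_epochs` value.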

Training a fully fine-tuned version of the model would take a few hours on a GPU. To save time, the rest of this notebook uses a checkpoint saved locally by a short training run like the one above (`checkpoint-5`, that is, after only 5 training steps). This fine-tuned model is also referred to as the **instruct model** in this notebook.

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [12]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./machine-translation-training-1736454923/checkpoint-5", torch_dtype=torch.bfloat16)



<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach that asks "Is my model behaving the way it is supposed to?" is usually a good starting point. The example below (the same one this notebook started with) compares the original model with the fine-tuned one. In this run the two outputs are identical, which is not surprising given that the checkpoint was saved after only a few training steps; a longer training run would be needed before a visible difference could be expected.

In [14]:
index = 200

sentence = dataset['train'][index]['translation']['aze']
human_baseline_translate = dataset['train'][index]['translation']['en']

prompt = f"""
Translate the following sentence.

{sentence}

Translation:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN TRANSLATION:\n{human_baseline_translate}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN TRANSLATION:
The president was elected for four years.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Thorn's sterling, the chief of us, was elected for four years:
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Thorn's sterling, the chief of us, was elected for four years:


<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with BLEU Metric)

The [BLEU metric](https://en.wikipedia.org/wiki/BLEU) quantifies the quality of machine translations by measuring n-gram overlap between the model output and one or more reference ("baseline") translations, which are usually created by a human.
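
To show what the score is made of, here is a minimal sketch of corpus-level BLEU: clipped n-gram precisions for n = 1 to 4, their geometric mean, and a brevity penalty. It is a simplified illustration, assuming whitespace tokenization, one reference per prediction, and no smoothing; the `evaluate` implementation tokenizes differently, so its numbers need not match. The function name `corpus_bleu` is my own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(predictions, references, max_order=4):
    """Return (bleu, precisions) for parallel lists of strings."""
    matches = [0] * max_order
    totals = [0] * max_order
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.split(), ref.split()
        pred_len += len(p)
        ref_len += len(r)
        for n in range(1, max_order + 1):
            p_counts, r_counts = ngrams(p, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((p_counts & r_counts).values())
            totals[n - 1] += sum(p_counts.values())
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:
        # A single zero precision makes the geometric mean, and so BLEU, zero.
        return 0.0, precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * geo_mean, precisions

reference = "The president was elected for four years."
perfect_score, _ = corpus_bleu([reference], [reference])
partial_score, _ = corpus_bleu(["Technology was elected for four years:"], [reference])
zero_score, zero_precisions = corpus_bleu(
    ["Sufin Schroeder: Where is Succeeding?"], ["Where is my newspaper?"]
)
print(perfect_score, partial_score, zero_score, zero_precisions)
```

The last example has matching unigrams and one matching bigram but no matching trigram, so its BLEU is zero even though the precisions list is not all zeros.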

In [27]:
bleu = evaluate.load('bleu')



Generate the outputs for a sample of the dataset (only the first 10 sentence pairs of the train split, to save time), and save the results.

In [37]:
sentences = []
human_baseline_translates = []
part_of_dataset = dataset['train'][0:10]['translation']

for sentence in part_of_dataset:
    sentences.append(sentence['aze'])
    human_baseline_translates.append(sentence['en'])

In [39]:
original_model_translates = []
instruct_model_translates = []
peft_model_translates = []

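for idx, sentence in enumerate(sentences):
    # NOTE: this prompt keeps summarization wording ("Summarize ... Summary:"),
    # unlike the translation prompt used in sections 1.3 and 2.3.
    # The outputs recorded below were generated with this prompt.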
    prompt = f"""
Summarize the following conversation.

{sentence}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_translates.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_translates.append(instruct_model_text_output)
    
zipped_translates = list(zip(human_baseline_translates, original_model_translates, instruct_model_translates))
 
df = pd.DataFrame(zipped_translates, columns = ['human_baseline_translates', 'original_model_translates', 'instruct_model_translates'])
df

| | human_baseline_translates | original_model_translates | instruct_model_translates |
|---|---|---|---|
| 0 | Good morning, ladies and gentlemen! | Sufinith Wallen Wallenz, the tomorrow smoldering: | Sufiniths of Sufts, the tomorrow laity, is goo... |
| 1 | I give you my word. | Sufinzler, I give you a word: | Surezler, I give you a word, a saying: |
| 2 | Good morning. | Sufinzler: Succeedingly, we're all in the same... | Sufinzler: Succeeding, smelting! |
| 3 | Which is new? | Sufiniths: there's no, no water, no, no, no. | Sufiniths, there's no, no water, no, no, no. |
| 4 | Yeah. | Sufin Schrogener: Succeeding. | Succeeding at Succeeding: |
| 5 | Currently Burj Khalifa is the tallest skyscrap... | Smokesion - boggling chills: low-power in the ... | Smoothing, the Big Braille convict, is the hig... |
| 6 | Goodbye! | Sufin Schrootzler: No.m. | Sufinz, a superstitious constituent: Smooth. |
| 7 | How are you getting on? | Sufinithing: how are you? | Sufinz, sterezing at the bottom of the smelts,... |
| 8 | Where do you come from? | Sufinith stereters: all of you at that point? | Sufini stereters: all of you at that time? |
| 9 | Where is my newspaper? | Sufin Schroeder: Where is Succeeding? | Sufini is a sterling ion: where is my Succeory? |


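Evaluate the models by computing BLEU. In this run both models score a BLEU of 0.0, because neither produces a single matching 4-gram, and the instruct model's lower-order n-gram precisions are slightly below those of the original model.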

In [58]:
# Compute BLEU for the original model
original_model_results = bleu.compute(
    predictions=original_model_translates,
    references=human_baseline_translates[0:len(original_model_translates)]
)

# Compute BLEU for the instruct model
instruct_model_results = bleu.compute(
    predictions=instruct_model_translates,
    references=human_baseline_translates[0:len(instruct_model_translates)]
)

# Print the results
print('ORIGINAL MODEL BLEU:')
print(original_model_results)
print('INSTRUCT MODEL BLEU:')
print(instruct_model_results)

ORIGINAL MODEL BLEU:
{'bleu': 0.0, 'precisions': [0.20833333333333334, 0.08139534883720931, 0.02631578947368421, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 1.8461538461538463, 'translation_length': 96, 'reference_length': 52}
INSTRUCT MODEL BLEU:
{'bleu': 0.0, 'precisions': [0.17757009345794392, 0.07216494845360824, 0.022988505747126436, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 2.0576923076923075, 'translation_length': 107, 'reference_length': 52}


<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" done above. PEFT is a family of fine-tuning methods that is much more efficient than full fine-tuning, often with comparable evaluation results.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
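
The idea of a frozen weight plus a small trainable low-rank update can be sketched in a few lines of plain Python. This is a toy illustration, not the `peft` implementation: the sizes, the helper names `matvec` and `lora_forward`, and the initial values are all hypothetical. It follows the common LoRA formulation $y = Wx + \frac{\alpha}{r} B A x$, with `B` initialized to zeros so the adapter starts as a no-op.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes (hypothetical): an 8-dimensional layer with a rank-2 adapter.
d, r, alpha = 8, 2, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen d x d weight
A = [[0.1 * (i + j + 1) for j in range(d)] for i in range(r)]        # r x d, non-zero init
B = [[0.0] * r for _ in range(d)]                                    # d x r, zero init
x = [float(i + 1) for i in range(d)]

output = lora_forward(W, A, B, x, alpha, r)
adapter_params = r * d + d * r
print(output)
print(adapter_params, "adapter parameters vs", d * d, "frozen parameters")
```

While `B` is all zeros the output equals the base layer's output, and the adapter holds `2 * r * d` values instead of `d * d`, which is where the size savings come from when `r` is much smaller than `d`.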

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [54]:
from peft import LoraConfig, get_peft_model, TaskType

# Configure LoRA for the Helsinki model
# Dynamically generate the target modules
encoder_layers = [f"model.encoder.layers.{i}.self_attn.{proj}" for i in range(6) for proj in ["k_proj", "v_proj", "q_proj"]]
decoder_self_attn_layers = [f"model.decoder.layers.{i}.self_attn.{proj}" for i in range(6) for proj in ["k_proj", "v_proj", "q_proj"]]
decoder_cross_attn_layers = [f"model.decoder.layers.{i}.encoder_attn.{proj}" for i in range(6) for proj in ["k_proj", "v_proj", "q_proj"]]

# Combine all target modules
target_modules = encoder_layers + decoder_self_attn_layers + decoder_cross_attn_layers

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # Rank
    lora_alpha=16,  # Scaling factor
    target_modules=target_modules,  # Explicitly specified target modules
    lora_dropout=0.1,  # Dropout for regularization
    bias="none",  # No bias reparameterization
    task_type=TaskType.SEQ_2_SEQ_LM  # Sequence-to-sequence task
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [55]:
peft_model = get_peft_model(original_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 884736
all model parameters: 57470976
percentage of trainable model parameters: 1.54%
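
The trainable count above can be checked against the LoRA configuration. The sketch below is plain arithmetic; the hidden size `d_model = 512` is an assumption about this MarianMT checkpoint (it is not printed anywhere in this notebook), while the other values come from the configuration and outputs shown earlier.

```python
d_model = 512          # assumption: hidden size of this MarianMT checkpoint
rank = 16              # r in the LoraConfig above
num_layers = 6         # range(6) in the target-module lists above
projections = 3        # k_proj, v_proj, q_proj
attention_blocks = 3   # encoder self-attn, decoder self-attn, decoder cross-attn

adapted_modules = num_layers * projections * attention_blocks
params_per_module = rank * d_model + d_model * rank   # A is r x d, B is d x r
lora_params = adapted_modules * params_per_module

base_params = 56_586_240   # "all model parameters" printed before adding LoRA
total_params = base_params + lora_params

print(adapted_modules, lora_params, total_params)
print(f"{100 * lora_params / total_params:.2f}%")
```

With that assumed hidden size, the result agrees with the printed trainable count, total count, and percentage.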


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [56]:
output_dir = f'./peft-sentence-translate-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1    
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

Now everything is ready to train the PEFT adapter and save the model.

In [None]:
peft_trainer.train()

peft_model_path="./peft-machine-translation-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

Prepare this model by adding an adapter to the original MarianMT model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [None]:
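from peft import PeftModel, PeftConfig

# Reload a fresh base model so the adapter is attached to unmodified weights
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Load the adapter saved above (peft_model_path) on top of the base model
peft_model = PeftModel.from_pretrained(peft_model_base,
                                       peft_model_path,
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)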

The number of trainable parameters will be `0` due to `is_trainable=False` setting:

In [None]:
print(print_number_of_trainable_model_parameters(peft_model))

<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3), with the original model, fully fine-tuned and PEFT model.

In [None]:
index = 200

sentence = dataset['train'][index]['translation']['aze']
human_baseline_translate = dataset['train'][index]['translation']['en']

prompt = f"""
Translate the following sentence.

{sentence}

Translation:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN TRANSLATION:\n{human_baseline_translate}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL:\n{peft_model_text_output}')

<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with BLEU Metric)
Perform inferences for a sample of the dataset (only the first 10 sentence pairs of the train split, to save time).

In [None]:
sentences = []
human_baseline_translates = []
part_of_dataset = dataset['train'][0:10]['translation']

for sentence in part_of_dataset:
    sentences.append(sentence['aze'])
    human_baseline_translates.append(sentence['en'])

original_model_translates = []
instruct_model_translates = []
peft_model_translates = []

for idx, sentence in enumerate(sentences):
    # Use the same translation prompt as in the qualitative evaluation sections
    prompt = f"""
Translate the following sentence.

{sentence}

Translation:
"""
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_translates[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_translates.append(original_model_text_output)
    instruct_model_translates.append(instruct_model_text_output)
    peft_model_translates.append(peft_model_text_output)

zipped_translates = list(zip(human_baseline_translates, original_model_translates, instruct_model_translates, peft_model_translates))
 
df = pd.DataFrame(zipped_translates, columns = [
    'human_baseline_translates', 
    'original_model_translates', 
    'instruct_model_translates', 
    'peft_model_translates']
    )
df

Compute BLEU score for this subset of the data. 

In [None]:
# Compute BLEU for the original model
original_model_results = bleu.compute(
    predictions=original_model_translates,
    references=human_baseline_translates[0:len(original_model_translates)]
)

# Compute BLEU for the instruct model
instruct_model_results = bleu.compute(
    predictions=instruct_model_translates,
    references=human_baseline_translates[0:len(instruct_model_translates)]
)
# Compute BLEU for the PEFT model
peft_model_results = bleu.compute(
    predictions=peft_model_translates,
    references=human_baseline_translates[0:len(peft_model_translates)]
)

# Print the results
print('ORIGINAL MODEL BLEU:')
print(original_model_results)
print('INSTRUCT MODEL BLEU:')
print(instruct_model_results)
print('PEFT MODEL BLEU:')
print(peft_model_results)