<a href="https://colab.research.google.com/github/August-murr/Data_science_Demonstration/blob/main/Fine-Tuning%20Machine%20Translation%20Model/fine_tuning_machine_translation_model_with_peft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
In this notebook, we fine-tune a machine translation model on the [Parallel Movie Subtitles](https://www.kaggle.com/datasets/augustmurr/movie-parallel-dataset) dataset, using the GPU resources provided by Google Colab and the Hugging Face libraries. Specifically, we use Hugging Face's PEFT (Parameter-Efficient Fine-Tuning) library with a LoRA adapter. This approach trains only a small fraction of the model's parameters, which keeps it computationally cheap. Because the base weights stay frozen, it also helps mitigate the risk of catastrophic forgetting while adapting the model to movie subtitles.

## Installing libraries

In [None]:
!pip install transformers peft accelerate sentencepiece datasets evaluate

**transformers**: Library for accessing pre-trained NLP models and tools.

**peft**: Parameter-efficient fine-tuning library; provides the LoRA adapter used in this notebook.

**accelerate**: Handles device placement, mixed precision, and distributed training for the Hugging Face trainer.

**sentencepiece**: Subword tokenization library used by some Hugging Face tokenizers, including mBART's.

**datasets**: Library for loading and manipulating datasets.

**evaluate**: Library for evaluating the performance of NLP models.

## Downloading the dataset from Kaggle

The following cell uses your Kaggle API token to get data. Make sure you have your token ready or create a new one before running the cell.

In [None]:
# Importing necessary library for uploading Kaggle API token
from google.colab import files

# Uploading the Kaggle API token file
files.upload()
# Removing existing Kaggle directory if it exists (-f avoids an error when it is missing)
!rm -rf ~/.kaggle
# Creating a new Kaggle directory
!mkdir -p ~/.kaggle
# Moving the uploaded Kaggle API token to the Kaggle directory
!mv ./kaggle.json ~/.kaggle/
# Setting appropriate permissions for the Kaggle API token file
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Downloading the dataset from Kaggle using the Kaggle API
!kaggle datasets download -d augustmurr/movie-parallel-dataset

Downloading movie-parallel-dataset.zip to /content
 98% 135M/138M [00:04<00:00, 41.0MB/s]
100% 138M/138M [00:04<00:00, 34.3MB/s]


In [None]:
# Unzipping the downloaded dataset
# Change the path if yours is different
!unzip "/content/movie-parallel-dataset.zip"

## Data Loading and Cleaning

The dataset is organized as CSV files with two columns (the first one representing English). To prepare it for training, we need to do some cleaning. There are three functions explained below.

1. The first function, `find_files_with_name`, locates the paths of specific files within folders. Since there are two types of CSV files (time-based and line-by-line data), we load them separately. For this notebook, we'll focus on using time-based subtitles, which include parallel subtitles for each minute of a movie.

2. The `concatenate_and_clean_csv_files` function strips text that is not part of the dialogue: parenthesized and bracketed annotations such as `(...)` and `[...]`, music markers (`♪`), and speaker labels of the form `NAME:`. These usually describe sounds or identify speakers and do not contribute to the translation. The function then combines the cleaned files into a single dataset. Removing this noise matters because it can noticeably degrade the model's performance and lead to hallucinations, where the model produces text that has no counterpart in the source sentence.

3. The `split_dataframe` function divides the data into training, validation, and test sets.
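
To make the cleaning step concrete, here is a minimal sketch that applies the same regular expression used in `concatenate_and_clean_csv_files` to a few made-up subtitle lines. The sample strings are illustrative assumptions, not rows from the dataset.

```python
import re

# Same pattern as in concatenate_and_clean_csv_files
pattern = r'\(.*?\)|\[.*?\]|♪♪.*?♪♪|\w+:|♪'

# Hypothetical subtitle lines, invented for illustration
samples = [
    "[door slams] Where are you going?",
    "JOHN: I told you (sighs) not to call.",
    "♪ La la la ♪",
]

# Remove every match of the pattern, then trim surrounding whitespace
cleaned = [re.sub(pattern, '', s).strip() for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two details are worth noticing when running this: the pattern removes single `♪` markers but keeps the lyrics between them, and removing an annotation from the middle of a line can leave a double space behind.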

In [None]:
import os
import pandas as pd
import re
from sklearn.model_selection import train_test_split

In [None]:
def find_files_with_name(file_path, file_name):
    matched_files = []
    # Walk through all subfolders
    for root, dirs, files in os.walk(file_path):
        if file_name in files:
            # Construct the full file path and add it to the list
            matched_files.append(os.path.join(root, file_name))
    return matched_files
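
As an aside, the same recursive search can be sketched with `pathlib.Path.rglob`. This is an alternative, not the approach used in this notebook; the function name `find_files_with_name_pathlib` and the temporary folder layout below are assumptions made up for the demonstration.

```python
import tempfile
from pathlib import Path

def find_files_with_name_pathlib(root_dir, file_name):
    """Return the paths of every file named `file_name` under `root_dir`."""
    return sorted(str(p) for p in Path(root_dir).rglob(file_name) if p.is_file())

# Build a small throwaway directory tree (hypothetical layout) and search it
with tempfile.TemporaryDirectory() as tmp:
    for movie in ("movie_a", "movie_b"):
        folder = Path(tmp) / "english to thai" / movie
        folder.mkdir(parents=True)
        (folder / "parallel_subtitle_time_based.csv").write_text("en,th\n")
        (folder / "other.csv").write_text("en,th\n")

    found = find_files_with_name_pathlib(tmp, "parallel_subtitle_time_based.csv")
    print(len(found), "matching files found")
```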

In [None]:
def concatenate_and_clean_csv_files(file_paths):
    # Initialize an empty DataFrame to hold all the data
    large_df = pd.DataFrame()

    # Loop through all file paths and concatenate them into large_df
    for file_path in file_paths:
        df = pd.read_csv(file_path, skiprows=1, names=['language_1', 'language_2']).dropna()

        # Clean the data using regex
        pattern = r'\(.*?\)|\[.*?\]|♪♪.*?♪♪|\w+:|♪'
        df.loc[:, 'language_1'] = df['language_1'].apply(lambda x: re.sub(pattern, '', x).strip())
        df.loc[:, 'language_2'] = df['language_2'].apply(lambda x: re.sub(pattern, '', x).strip())

        # Append the cleaned dataframe to large_df
        large_df = pd.concat([large_df, df], ignore_index=True)

    # Rename the columns to 'input' and 'target'
    large_df.rename(columns={'language_1': 'input', 'language_2': 'target'}, inplace=True)

    return large_df

In [None]:
def split_dataframe(df, train_size=0.7, val_size=0.15, test_size=0.15):
    # Compare with a tolerance: exact equality is fragile for floating-point sums
    assert abs(train_size + val_size + test_size - 1) < 1e-9, "The split sizes must sum up to 1"

    # Calculate the proportion of validation set relative to the combined validation and test set size
    val_relative_size = val_size / (val_size + test_size)

    # Split the data into training and remaining data
    train_df, remaining_df = train_test_split(df, train_size=train_size, random_state=42, shuffle=False)

    # Split the remaining data into validation and test sets
    val_df, test_df = train_test_split(remaining_df, train_size=val_relative_size, random_state=42, shuffle=False)

    return train_df, val_df, test_df

There are parallel datasets available for four language pairs: English to Thai, English to French, English to Arabic, and English to Indonesian. You can modify the path below to select the language pair you want to fine-tune the model for. In this case, we chose English to Thai. You can also choose between line-by-line data and time-based data.

In [None]:
paths = find_files_with_name("/content/english to thai","parallel_subtitle_time_based.csv")
data = concatenate_and_clean_csv_files(paths)

Here, you can select the proportion of data you want for training, validation, and testing. We have chosen the default configuration, which is `train_size=0.7`, `val_size=0.15`, and `test_size=0.15`.

In [None]:
train,val,test = split_dataframe(data)
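
Because `split_dataframe` splits in two stages, the second split uses a fraction relative to the rows left over after the first one. The arithmetic can be sketched as follows. The helper `split_sizes` and the row count of 1000 are hypothetical, and scikit-learn's own rounding may differ by a row in edge cases.

```python
def split_sizes(n_rows, train_size=0.7, val_size=0.15, test_size=0.15):
    """Approximate row counts produced by a two-stage train/val/test split."""
    # Stage 1: carve off the training rows
    n_train = round(n_rows * train_size)
    n_remaining = n_rows - n_train

    # Stage 2: split the remainder; the fraction is relative to what is left
    val_relative_size = val_size / (val_size + test_size)
    n_val = round(n_remaining * val_relative_size)
    n_test = n_remaining - n_val
    return n_train, n_val, n_test

# Hypothetical dataset of 1000 rows with the default proportions
print(split_sizes(1000))
```

With the defaults, validation and test each take half of the remaining 30%, which is why `val_relative_size` evaluates to 0.5 rather than 0.15.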

## Loading the model and tokenizer
To make inference faster and lighter on GPU memory, we'll convert the model's weights from 32-bit floats to 16-bit floats when a GPU is available.

In [None]:
import torch
import torch.nn.functional as F
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast, Seq2SeqTrainer
from evaluate import load
from datasets import Dataset

In [None]:
# Load the pre-trained mBART-50 model and its tokenizer
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
model.eval()  # Set the model to eval mode
if torch.cuda.is_available():
    model = model.to('cuda').half()  # Move model to CUDA for FP16 operations

## Translation Function

This function takes a list of sentences and a target language ID, then tokenizes and translates each sentence with the model. Make sure to pick the correct language IDs from the [model card](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt).

In [None]:
# Function to translate a list of sentences to a specified language
def translate_sentences(model, tokenizer, sentences, target_lang_id, src_language_id="en_XX"):
    translated_sentences = []
    for sentence in sentences:
        tokenizer.src_lang = src_language_id
        encoded = tokenizer(sentence, return_tensors="pt", max_length=1024, truncation=True)

        # Move tensors to the same device as the model
        encoded = encoded.to(model.device)

        with torch.no_grad():
            generated_tokens = model.generate(
                **encoded,
                forced_bos_token_id=tokenizer.lang_code_to_id[target_lang_id]
            )
        translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        translated_sentences.append(translation[0])
    return translated_sentences
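
`translate_sentences` calls the model once per sentence. If throughput becomes a concern, one option is to group the sentences into batches first. The sketch below shows only the batching logic: `chunked`, `translate_in_batches`, and the stand-in `fake_translate_batch` are hypothetical names, and a real batch function would need to tokenize with `padding=True` and call `model.generate` on the whole batch.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def translate_in_batches(sentences, translate_batch, batch_size=16):
    """Apply `translate_batch` to each chunk and keep the original order."""
    translations = []
    for batch in chunked(sentences, batch_size):
        translations.extend(translate_batch(batch))
    return translations

# Stand-in for a real model call (assumption): upper-cases each sentence
def fake_translate_batch(batch):
    return [sentence.upper() for sentence in batch]

result = translate_in_batches(["a", "b", "c", "d", "e"], fake_translate_batch, batch_size=2)
print(result)
```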

## Evaluation Functions

For evaluation, we'll use two methods: the BLEU score and a sentence similarity model. BLEU is quicker and simpler, but it only rewards word overlap with the reference, so it penalizes valid alternative translations that carry the same meaning. A sentence similarity model compares meaning instead, so it complements BLEU when judging overall performance.

Here's the [link](https://huggingface.co/setu4993/LEALLA-small) to the sentence similarity model's Hugging Face page.

In [None]:
# Function to evaluate BLEU score given two lists of strings
def evaluate_bleu(predictions, references):
    bleu_metric = load('bleu')
    formatted_references = [[sentence] for sentence in references]
    bleu_result = bleu_metric.compute(predictions=predictions, references=formatted_references)
    return bleu_result

In [None]:
from transformers import BertModel, BertTokenizerFast
import numpy as np

In [None]:
# Initialize tokenizer and model
similarity_tokenizer = BertTokenizerFast.from_pretrained("setu4993/LEALLA-small")
similarity_model = BertModel.from_pretrained("setu4993/LEALLA-small")
similarity_model.eval()

In [None]:
# Function to calculate the average similarity between pairs of sentences in two lists
def calculate_average_pairwise_similarity(list_sentences1, list_sentences2):
    assert len(list_sentences1) == len(list_sentences2), "The lists must be of the same length"

    similarity_scores = []

    for sentence1, sentence2 in zip(list_sentences1, list_sentences2):
        # Encode sentences
        inputs = similarity_tokenizer([sentence1, sentence2], return_tensors="pt", padding=True, truncation=True)

        # Generate embeddings
        with torch.no_grad():
            outputs = similarity_model(**inputs)

        # Get the pooler_output for sentence embeddings
        embeddings = outputs.pooler_output

        # Compute L2 normalized embeddings
        normalized_embeddings = F.normalize(embeddings, p=2, dim=1)

        # Compute similarity score for the pair
        similarity_score = torch.matmul(
            normalized_embeddings[0].unsqueeze(0),
            normalized_embeddings[1].unsqueeze(0).transpose(0, 1)
        )

        similarity_scores.append(similarity_score.item())

    # Calculate the average similarity score
    average_similarity = np.mean(similarity_scores)

    return average_similarity
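
The score computed for each pair above is a cosine similarity: once both embeddings are L2-normalized, their dot product equals the cosine of the angle between them. Here is a minimal pure-Python sketch of that identity, using two made-up 2-dimensional vectors in place of real sentence embeddings.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine_similarity(a, b):
    """Dot product of the L2-normalized vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))

# Toy "embeddings" (assumption): real ones come from the similarity model
score = cosine_similarity([3.0, 4.0], [4.0, 3.0])
print(round(score, 4))
```

A score near 1 means the two vectors point in almost the same direction, and identical vectors score exactly 1 up to floating-point error.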

## Evaluating the Original Model with BLEU and Sentence Similarity

Next, we'll assess the model using BLEU and a similarity function. This helps us make comparisons with the fine-tuned model later on.

In [None]:
english_sentences = list(train['input'][:500])

In [None]:
thai_sentences = list(train['target'][:500])

In [None]:
thai_translations = translate_sentences(model,tokenizer,english_sentences,"th_TH")

In [None]:
calculate_average_pairwise_similarity(thai_translations,thai_sentences)

In [None]:
evaluate_bleu(thai_translations,thai_sentences)

## Fine-Tuning with PEFT

To evaluate the model, we switched the weights to 16-bit floats. For fine-tuning, we'll use 16-bit precision as well (`fp16=True` in the training arguments). However, the Hugging Face trainer expects the model to be loaded in 32-bit floats because it handles the conversion internally. So, we'll reload the model accordingly.

If you encounter memory issues, you can use the following code to free up GPU memory:
```python
# Clear up the GPU memory
torch.cuda.empty_cache()
```

In [None]:
from transformers import AutoModelForSeq2SeqLM

In [None]:
model_checkpoint = "facebook/mbart-large-50-many-to-many-mmt"
model_for_peft = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

## LoRA Configuration

This is where we set up the LoRA adapter's parameters, such as the rank `r` (which controls how many weights are trained), the scaling factor `lora_alpha`, the dropout rate, and which layers to adapt. To find the layer names of our model, we ran the following code, which prints every named module in the transformer:

```python
# Print all named modules in the model to identify layer names
for name, module in model.named_modules():
    print(name, module.__class__.__name__)
```

Understanding these layer names helps us configure the LoRA adapter for fine-tuning.

In [None]:
from peft import get_peft_model
from peft import LoraConfig, TaskType

In [None]:
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=64,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"]
)

In [None]:
peft_model = get_peft_model(model_for_peft, peft_config)

Next, we'll use the function below to count the total number of parameters in the model and how many of them are trainable.

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

In [None]:
# The function returns a string, so print it to render the line breaks
print(print_number_of_trainable_model_parameters(peft_model))
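
As a rough cross-check of the trainable-parameter count, the number of LoRA weights can be estimated by hand: each adapted linear layer gains two low-rank matrices with `r * (d_in + d_out)` entries in total. The sketch below assumes a hidden size of 1024 and 12 encoder plus 12 decoder layers for mBART-large-50. These architecture values are assumptions and should be verified against `model.config`.

```python
def lora_params_per_module(d_in, d_out, r):
    """LoRA adds matrix A (r x d_in) and matrix B (d_out x r) to one linear layer."""
    return r * d_in + d_out * r

# Assumed architecture values (verify against model.config)
d_model = 1024
encoder_layers = 12
decoder_layers = 12

# Values taken from the LoraConfig in this notebook
r = 64
targets_per_attention = 2  # q_proj and v_proj

# One self-attention block per encoder layer;
# self-attention plus cross-attention per decoder layer
attention_blocks = encoder_layers * 1 + decoder_layers * 2
adapted_modules = attention_blocks * targets_per_attention

total_lora_params = adapted_modules * lora_params_per_module(d_model, d_model, r)
print(f"adapted modules: {adapted_modules}")
print(f"estimated trainable LoRA parameters: {total_lora_params:,}")
```

If the estimate differs from the number reported by `print_number_of_trainable_model_parameters`, the assumed architecture values are the first thing to check.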

## Tokenizing the Datasets

In [None]:
# Imports for tokenizing the datasets
from transformers import MBart50TokenizerFast
from datasets import Dataset

In [None]:
tokenizer = MBart50TokenizerFast.from_pretrained(model_checkpoint,src_lang="en_XX", tgt_lang="th_TH")

In [None]:
train.reset_index(drop=True, inplace=True)
train_dataset = Dataset.from_pandas(train)

In [None]:
val.reset_index(drop=True, inplace=True)
val_dataset = Dataset.from_pandas(val)

In [None]:
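def tokenize_function(examples):
    # Tokenize source and target text. Padding is left to DataCollatorForSeq2Seq,
    # which pads each batch dynamically and fills label padding with -100 so that
    # padded positions are ignored by the loss.
    model_inputs = tokenizer(
        text=examples["input"],
        text_target=examples["target"],
        max_length=256,
        truncation=True,
    )
    return model_inputs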

In [None]:
train_dataset = train_dataset.map(tokenize_function,batched=True)
train_dataset = train_dataset.remove_columns(['input', 'target'])

In [None]:
val_dataset = val_dataset.map(tokenize_function,batched=True)
val_dataset = val_dataset.remove_columns(['input', 'target'])

## Training Arguments

In this section, we set up training parameters such as learning rate, batch size, 16-bit float training, the number of epochs, and specify a directory to save the training checkpoints.

In [None]:
from transformers import DataCollatorForSeq2Seq

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=peft_model)
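
When the label sequences in a batch have different lengths, `DataCollatorForSeq2Seq` pads them with `-100` by default (`label_pad_token_id`), a value the cross-entropy loss ignores. Here is a minimal pure-Python sketch of that padding step. The helper `pad_labels` and the token IDs are made up for illustration and are not the collator's actual implementation.

```python
LABEL_PAD_TOKEN_ID = -100  # default label padding value, ignored by the loss

def pad_labels(batch_labels, pad_value=LABEL_PAD_TOKEN_ID):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in batch_labels]

# Hypothetical token IDs for two target sentences of different lengths
batch = [
    [250004, 17, 42, 2],
    [250004, 99, 2],
]
padded = pad_labels(batch)
print(padded)
```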

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/peft checkpoints",  # change to your preferred directory (assumes Google Drive is mounted)
    learning_rate=1e-4,
    per_device_train_batch_size=4,  # Adjust as needed
    per_device_eval_batch_size=4,
    num_train_epochs=50,  # Tune as per requirement
    weight_decay=0.01,
    evaluation_strategy="steps",
    save_strategy="steps",
    logging_steps=500,  # Adjust as needed
    fp16=True,
)
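
To get a feel for how long these settings run, the number of optimizer steps can be estimated from the dataset size. The sketch below assumes a single GPU, no gradient accumulation, and that evaluation runs every `logging_steps` steps because `eval_steps` is not set. The row count of 10,000 is hypothetical, so substitute `len(train_dataset)`.

```python
import math

def training_schedule(n_train_rows, batch_size, num_epochs, eval_every):
    """Estimate steps per epoch, total steps, and number of evaluation runs."""
    steps_per_epoch = math.ceil(n_train_rows / batch_size)
    total_steps = steps_per_epoch * num_epochs
    n_evaluations = total_steps // eval_every
    return steps_per_epoch, total_steps, n_evaluations

steps_per_epoch, total_steps, n_evaluations = training_schedule(
    n_train_rows=10_000,  # hypothetical; use len(train_dataset)
    batch_size=4,         # per_device_train_batch_size
    num_epochs=50,        # num_train_epochs
    eval_every=500,       # logging_steps
)
print(steps_per_epoch, total_steps, n_evaluations)
```

An estimate like this helps decide whether `num_train_epochs` or `logging_steps` should be adjusted before starting a long Colab session.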

### Training

In [None]:
trainer = Seq2SeqTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
trainer.train()
# Save the model to the specified directory
model_path = "/content/drive/MyDrive/peft models/en_to_thai model"
peft_model.save_pretrained(model_path)
# Save the tokenizer to the same directory as the model
tokenizer_path = model_path
tokenizer.save_pretrained(tokenizer_path)

## Evaluating the Fine-Tuned Model

Now, we'll load the model and the tokenizer from the saved path and evaluate them in the same manner as we did with the original model.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

In [None]:
torch.cuda.empty_cache()

In [None]:
# Load the base model, attach the saved LoRA adapter, and merge it for inference
from peft import PeftModel
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
fine_tuned_model = PeftModel.from_pretrained(base_model, model_path).merge_and_unload()
fine_tuned_tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

In [None]:
# Move model to the correct device (e.g., GPU or CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
fine_tuned_model.eval()  # Set the model to eval mode
fine_tuned_model.to(device)
if device == "cuda":
    fine_tuned_model.half()  # FP16 only on GPU, matching the baseline evaluation

In [None]:
english_sentences = list(train['input'][:500])

In [None]:
thai_sentences = list(train['target'][:500])

In [None]:
thai_translations = translate_sentences(fine_tuned_model,fine_tuned_tokenizer,english_sentences,"th_TH")

In [None]:
calculate_average_pairwise_similarity(thai_translations,thai_sentences)

In [None]:
evaluate_bleu(thai_translations,thai_sentences)

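We evaluated both models on the same sample, the first 500 rows of the training split, and here are the results:

- Similarity Score: 0.45 ➔ 0.55
- BLEU Score: 0.02 ➔ 0.09

The similarity score reflects overall meaning, and its increase suggests that the model adapted to translating movie dialogue. The BLEU score, which measures n-gram overlap with the reference translation, showed a larger relative improvement, which suggests that the model got better at reproducing the wording of longer pieces of text. Because the sample comes from the training split, these numbers may overstate performance on unseen subtitles; the held-out `test` split allows a stricter comparison.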