
"""
# Fine-Tune LLaMA 3.1 for Estonian Text Simplification
Author: Eduard Barbu  
Contact: eduard.barbu@ut.ee  

This script is designed to fine-tune the LLaMA 3.1 model on a dataset of Estonian text simplifications.  
The dataset contains pairs of original sentences and their simplified versions. The training process  
uses LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.  

## Features:
- Automatically mounts Google Drive to access the dataset, configuration, and template files.
- Reads the training dataset from a JSON file rather than TSV.
- Preprocesses the dataset into a conversational format using a customizable template.
- Uses Unsloth's FastLanguageModel for efficient fine-tuning.
- Ensures compatibility with multiple GPU types, including T4 and A100.
- Saves the fine-tuned model and tokenizer for future use.  

## Instructions:
1. **Upload your dataset**:  
   Ensure the dataset is in JSON format and located in Google Drive. The file should have the following structure:  
   [
       {
           "src": "source_identifier",
           "original": "original_sentence",
           "simpl_lex": "optional_lexical_simplification",
           "simpl_final": "final_simplified_sentence"
       }
   ]

2. **Provide necessary files**:  
   - `llama_finetuning_config.json`: A configuration file containing training hyperparameters.  
   - `template-finetuning-llama.txt`: A text file with placeholders for formatting training data.  

3. **Dependencies**:  
   This script installs the Python packages it needs, including:  
   - `torch` and `torchvision` (CUDA-enabled versions for GPU).  
   - `unsloth`, `trl`, `peft`, `accelerate`, `bitsandbytes`, and `xformers`.  
   - A pinned `numpy` version (1.23.5).  

4. **Run the script**:  
   - Verify the dataset and template file paths in the script.  
   - Execute the script to preprocess the data, fine-tune the model, and save the outputs.  

## Outputs:
- The fine-tuned model and tokenizer are saved in the `models/fine_tuned_model` directory on Google Drive.  

## Notes:
- Ensure you have a GPU runtime enabled in Colab for optimal performance.  
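- The script selects precision from the GPU's CUDA compute capability: `float16` on capability 7.0 or higher (this covers both T4 and A100), `float32` otherwise.  
- `fp16` in the training arguments follows that choice; the script as written does not switch to `bfloat16`, even on GPUs that support it, such as the A100.  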
"""


In [2]:
!pip install torch==2.0.0+cu118 torchvision==0.15.0+cu118 -f https://download.pytorch.org/whl/torch_stable.html


Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [2]:
import torch
import torchvision
print(torch.__version__)
print(torchvision.__version__)



2.3.1+cu121
0.15.0+cu118


In [1]:
# %%capture
# Install required packages in Google Colab

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes
!pip install "xformers<=0.0.27" --no-cache-dir
!pip install numpy==1.23.5


Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-ovq6piv1/unsloth_6292e91a97a147a3a5472f4d3e8d6d97
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-ovq6piv1/unsloth_6292e91a97a147a3a5472f4d3e8d6d97
  Resolved https://github.com/unslothai/unsloth.git to commit bdf0cd6033595be4e7ed23d0d002bb176d343152
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Define the path to the directory containing the datasets and configuration file
directory_path = '/content/drive/MyDrive/DataToShare/Estonian-Text-Simplification/Data/Estonian-Training-And-Test-Sets'

# Verify by listing contents
print(os.listdir(directory_path))


Mounted at /content/drive
['models', 'Usage-Speed-Memory', 'estonian_simplification_template_read-llama.txt', 'llama(3.0) vs llama (3.1)', 'Test-Sets', 'Training-Sets', 'Paper Evaluation', 'template-finetuning-llama.txt', 'template-simplification-llama.txt', 'Estonian-Predictions-Dataset.tsv', 'simplification_training_set.json', 'llama_finetuning_config.json']
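
The contents of `llama_finetuning_config.json` are not shown in this notebook. As a rough sketch, the file could look like the JSON below: the keys are the ones `fine_tune_model` reads through `config.get(...)`, the batch size, accumulation steps, and step count match the values printed in the training log, and the learning rate is simply the script's fallback default. The exact contents of the real file are an assumption.

```python
import json

# Hypothetical contents of llama_finetuning_config.json (illustrative only).
example_config_text = """
{
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "max_steps": 900,
    "learning_rate": 2e-4
}
"""

config = json.loads(example_config_text)

# Keys present in the file override the defaults passed to config.get();
# keys missing from the file fall back to those defaults.
batch_size = config.get("per_device_train_batch_size", 2)  # taken from the file
warmup_steps = config.get("warmup_steps", 5)               # falls back to 5
print(batch_size, warmup_steps)
```

Because every hyperparameter is read with a default, an empty JSON object (`{}`) would also be accepted and would run with the script's built-in values.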


In [4]:
import os
import pandas as pd
import torch
from datasets import Dataset
from transformers import TrainingArguments
from unsloth import FastLanguageModel
from trl import SFTTrainer
import time
import json


def read_json_file(file_path):
    """Reads a JSON file (a list of records) and returns a Pandas DataFrame."""
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)

    return pd.DataFrame(data)



Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
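
Before training on tens of thousands of records, it can help to check that each one carries the two fields the training examples are built from (`original` and `simpl_final`). The helper below is a minimal sketch, not part of the original script; `find_invalid_records` and the sample records are invented for illustration.

```python
import json

# The two fields that the training text is built from.
REQUIRED_KEYS = ("original", "simpl_final")


def find_invalid_records(records):
    """Return the indices of records that lack a required field,
    or whose required field is not a non-empty string."""
    bad = []
    for i, rec in enumerate(records):
        ok = isinstance(rec, dict) and all(
            isinstance(rec.get(key), str) and rec[key].strip()
            for key in REQUIRED_KEYS
        )
        if not ok:
            bad.append(i)
    return bad


# Invented sample: the second record has no "simpl_final" field.
sample = json.loads("""
[
  {"src": "TURK", "original": "A", "simpl_lex": "", "simpl_final": "B"},
  {"src": "TURK", "original": "C", "simpl_lex": ""}
]
""")
print(find_invalid_records(sample))
```

The same function could be applied to the list returned by `json.load` before it is turned into a DataFrame.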


In [5]:
def read_template(template_file):
    """Reads the fine-tuning template from a file."""
    with open(template_file, 'r', encoding='utf-8') as file:
        template = file.read()
    return template


In [6]:
def format_training_data(examples, template):
    """Formats the training data using a given template."""
    inputs = examples['original']
    outputs = examples['simpl_final']
    formatted_data = []

    for input_sentence, output_sentence in zip(inputs, outputs):
        formatted_example = template.format(
            input_sentence=input_sentence,
            output_sentence=output_sentence
        )
        formatted_data.append(formatted_example)

    return {"text": formatted_data}
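
The contents of `template-finetuning-llama.txt` are not shown in this notebook. The sketch below uses an invented template to illustrate the mechanism: `str.format` fills the `{input_sentence}` and `{output_sentence}` placeholders once per sentence pair. The template wording and the sample sentences are assumptions.

```python
# Hypothetical template; the real template file is not reproduced here.
template = (
    "### Original:\n{input_sentence}\n\n"
    "### Simplified:\n{output_sentence}"
)

# A batch in the column-oriented shape that Dataset.map(batched=True) passes in.
batch = {
    "original": ["Sentence one.", "Sentence two."],
    "simpl_final": ["Simple one.", "Simple two."],
}

# One formatted training string per (original, simplified) pair.
texts = [
    template.format(input_sentence=src, output_sentence=tgt)
    for src, tgt in zip(batch["original"], batch["simpl_final"])
]
print(texts[0])
```

One practical consequence of using `str.format`: any literal `{` or `}` in the template file itself has to be doubled (`{{`, `}}`), whereas braces inside the sentences are safe because they are passed as arguments.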

In [7]:
def fine_tune_model(train_dataset, config):
    """Fine-tunes the Llama 3.1 model on the training dataset."""
    max_seq_length = 2048
    dtype = torch.float16 if torch.cuda.get_device_capability(0)[0] >= 7 else torch.float32
    load_in_4bit = True

    # Load the Llama 3.1 model and tokenizer
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )

    tokenizer.padding_side = 'left'
    special_tokens_dict = {"eos_token": tokenizer.eos_token, "pad_token": tokenizer.eos_token}
    tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
        use_rslora=False,
        loftq_config=None,
    )

    training_args = TrainingArguments(
        report_to="none",
        per_device_train_batch_size=config.get("per_device_train_batch_size", 2),
        gradient_accumulation_steps=config.get("gradient_accumulation_steps", 4),
        warmup_steps=config.get("warmup_steps", 5),
        max_steps=config.get("max_steps", 1000),
        learning_rate=config.get("learning_rate", 2e-4),
        fp16=(dtype == torch.float16),
        logging_steps=config.get("logging_steps", 10),
        optim=config.get("optim", "adamw_8bit"),
        weight_decay=config.get("weight_decay", 0.01),
        lr_scheduler_type=config.get("lr_scheduler_type", "linear"),
        seed=config.get("seed", 3407),
        output_dir=config.get("output_dir", "outputs"),
    )

    tokenizer.padding_side = 'right'

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        dataset_num_proc=2,
        packing=False,
        args=training_args,
    )

    start_time = time.time()
    trainer.train()
    end_time = time.time()
    training_duration = end_time - start_time

    print(f"Training took {training_duration:.2f} seconds")

    return model, tokenizer
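
The `dtype` line in `fine_tune_model` distinguishes only `float16` from `float32`. One possible way to also select `bfloat16` on GPUs that support it is sketched below. `choose_precision` is a hypothetical helper, not part of the original script, and it is written as a pure function so it can be checked without a GPU.

```python
def choose_precision(capability_major: int, bf16_supported: bool) -> str:
    """Return the name of a training precision for a GPU.

    capability_major: major CUDA compute capability (7 for T4, 8 for A100).
    bf16_supported:   whether the device reports bfloat16 support.
    """
    if bf16_supported and capability_major >= 8:
        return "bfloat16"  # Ampere and newer, e.g. A100
    if capability_major >= 7:
        return "float16"   # Volta and Turing, e.g. T4
    return "float32"       # older GPUs: no reduced precision


# In Colab, the two inputs could come from
# torch.cuda.get_device_capability(0)[0] and torch.cuda.is_bf16_supported().
print(choose_precision(7, False))  # T4
print(choose_precision(8, True))   # A100
```

If this were wired in, `TrainingArguments` would need `bf16=True` in the bfloat16 case and `fp16=True` only in the float16 case, since the two flags are mutually exclusive.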



In [8]:
def save_model_and_tokenizer(model, tokenizer):
    """Saves the fine-tuned model and tokenizer to the 'models' directory."""
    model_save_path = os.path.join(directory_path, "models", "fine_tuned_model")
    model.save_pretrained(model_save_path)
    tokenizer.save_pretrained(model_save_path)
    print(f"Model and tokenizer saved to {model_save_path}")

In [9]:
def load_and_display_dataset_info(train_file):
    """Loads the training set from a JSON file and prints a short summary."""
    train_df = read_json_file(train_file)

    # Reset index
    train_df.reset_index(drop=True, inplace=True)

    # Display basic information
    print(f"Training Set: {len(train_df)} examples")

    # Display the first few examples
    print("\nFirst 3 examples from the Training Set:")
    print(train_df.head(3))

    return train_df


In [10]:
def main(train_file, config_file, template_file):
    """Runs the pipeline: load the data, format it, fine-tune, and save."""
    # Load and display information about the training set
    train_df = load_and_display_dataset_info(train_file)

    # Read the fine-tuning template
    template = read_template(template_file)

    # Prepare the training dataset (train_df is already loaded and re-indexed)
    train_dataset = Dataset.from_pandas(train_df)
    train_dataset = train_dataset.map(
        lambda examples: format_training_data(examples, template),
        batched=True
    )

    with open(config_file, encoding='utf-8') as f:
        config = json.load(f)

    model, tokenizer = fine_tune_model(train_dataset, config)
    save_model_and_tokenizer(model, tokenizer)



In [None]:
if __name__ == "__main__":
    # The training set is the JSON file in the Drive directory
    train_file = os.path.join(directory_path, 'simplification_training_set.json')
    config_file = os.path.join(directory_path, 'llama_finetuning_config.json')
    template_file = os.path.join(directory_path, 'template-finetuning-llama.txt')
    main(train_file, config_file, template_file)


Training Set: 50394 examples

First 3 examples from the Training Set:
    src                                           original simpl_lex  \
0  TURK  Sõjalise konflikti ühel pool võitlevad Sudaani...             
1  TURK  Jiddah on peauks islamiusu pühimasse linna Mek...             
2  TURK  Arvatakse, et Suur Tume Laik osutab augule Nep...             

                                         simpl_final  
0  Sõjalise konflikti ühel pool on Sudaani väed j...  
1  Jiddah on peavärav islamiusu pühimasse linna M...  
2  Arvatakse, et Suur Tume Laik on auk Neptuuni m...  


Map:   0%|          | 0/50394 [00:00<?, ? examples/s]

==((====))==  Unsloth 2025.1.7: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.3.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 2.3.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

Unsloth 2025.1.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
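
The seven adapted projections in each of the 32 patched layers account for the 41,943,040 trainable parameters that Unsloth reports for this run. The sketch below reproduces that figure from the LoRA rank (`r=16`) and the layer shapes. The shapes are taken from the published Llama 3.1 8B configuration and are an assumption in the sense that the notebook itself does not print them.

```python
# Llama 3.1 8B shapes (from the published model configuration).
HIDDEN = 4096         # hidden_size
INTERMEDIATE = 14336  # intermediate_size of the MLP
KV_DIM = 1024         # 8 key/value heads x 128 head_dim (grouped-query attention)
LAYERS = 32           # num_hidden_layers
RANK = 16             # r passed to get_peft_model

# (in_features, out_features) of each projection listed in target_modules.
modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds two matrices per module, A (r x in) and B (out x r),
# so each module contributes r * (in + out) parameters (bias="none").
per_layer = sum(RANK * (n_in + n_out) for n_in, n_out in modules.values())
total = per_layer * LAYERS
print(f"{total:,}")  # 41,943,040
```

Doubling the rank would double this count, which is the main lever on adapter size in this setup.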


Map (num_proc=2):   0%|          | 0/50394 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 50,394 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 900
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
10,2.2074
20,2.1637
30,2.0754
40,1.9263
50,1.669
60,1.3974
70,1.231
80,1.1554
90,1.0874
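
The figures in the training banner can be cross-checked with a little arithmetic. The sketch below, using only numbers printed in the log, shows that 900 steps at an effective batch size of 16 cover roughly 29% of the 50,394 examples, so the run is well short of one full pass over the data. The banner's "Num Epochs = 1" is most likely the step budget rounded up to whole epochs, though that reading is an assumption.

```python
import math

# Values printed in the training banner.
num_examples = 50_394
per_device_batch = 4
grad_accum = 4
num_gpus = 1
total_steps = 900

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus

# Examples consumed over the whole run, and the share of the dataset they cover.
examples_seen = effective_batch * total_steps
epoch_fraction = examples_seen / num_examples

# Approximate number of optimizer steps a full epoch would take.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(effective_batch, examples_seen, round(epoch_fraction, 3), steps_per_epoch)
```

Covering the full dataset once at this batch size would take about 3,150 steps, roughly 3.5 times the configured budget.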
