# Fine-Tuning Gemma 2 Model for English-to-Bangla Translation Using Unsloth

This notebook demonstrates the complete process of adapting **Gemma 2**, Google's open model, for English-Bengali translation. The goal is to enable accurate translations between these two languages, leveraging a high-quality dataset and efficient fine-tuning techniques.

The process involves the following key steps::

- **Dataset Preparation**
- **Model Fine-Tuning**
- **Inference and Prediction**
- **Evaluation with BLEU Score**

The dataset used for this project is sourced from Kaggle. It contains pairs of English sentences and their Bengali translations, which are converted into a suitable format for training.

Dataset link: [English to Bengali for Machine Translation](https://www.kaggle.com/datasets/sayedshaun/english-to-bengali-for-machine-translation)

This notebook demonstrates the step-by-step process of fine-tuning a pre-trained language model for English-to-Bangla and Bangla-to-English translations. Below are the sections and their corresponding code snippets.


## Step 1: Install Dependencies

To fine-tune the Gemma 2 model, we first install the required libraries:
- **Unsloth**: For efficient LoRA-based fine-tuning.
- **TRL**: To handle training with SFTTrainer.
- **SacreBLEU**: To evaluate machine translation quality.
- **Datasets**: For processing and managing data.
- **PyTorch with CUDA 12.1**: For accelerated GPU-based training.


In [1]:
!pip install sacrebleu


# Install pip3-autoremove if not already installed
!pip install pip3-autoremove

# Uninstall old Torch and related packages
!pip-autoremove torch torchvision torchaudio -y

# Install Torch, TorchVision, and TorchAudio with CUDA 12.1
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121 --quiet

# Install additional required libraries
!pip install unsloth datasets trl --quiet


Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting portalocker (from sacrebleu)
  Downloading portalocker-3.1.1-py3-none-any.whl.metadata (8.6 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.1/104.1 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading portalocker-3.1.1-py3-none-any.whl (19 kB)
Installing collected packages: portalocker, sacrebleu
Successfully installed portalocker-3.1.1 sacrebleu-2.5.1
Collecting pip3-autoremove
  Downloading pip3_autoremove-1.2.2-py2.py3-none-any.whl.metadata (2.2 kB)
Downloading pip3_autoremove-1.2.2-py2.py3-none-any.whl (6.7 kB)
Installing collected packages: pip3-autoremove
Successfully installed pip3-autoremove-1.2.2
pyarrow 18.1.0 is installed but pyarrow<15.0.0a0,>=14.0.1 is re

## Step 2: Dataset Preparation

The dataset contains pairs of English and Bengali sentences, sourced from Kaggle. It is preprocessed and converted into a JSON-like format, structured as:

- **Instruction**: Specifies the translation task (e.g., "Translate to English").
- **Input**: The text to be translated.
- **Output**: The expected translation.

This structured format ensures compatibility with the fine-tuning model.


In [2]:
import pandas as pd

from pprint import pprint

import csv

# Function to convert CSV data into the desired JSON format

def convert_csv_to_json_format(csv_file):

    list_ds = []  # Initialize an empty list to store the formatted data
    # Open and read the CSV file from the specified path

    with open(csv_file, newline='', encoding='utf-8') as file:

        csv_reader = csv.DictReader(file)

        # Loop through each row in the CSV

        for row in csv_reader:

            english_sentence = row['english_caption']

            bangla_sentence = row['bengali_caption']

            # Append the formatted dictionaries to the list

            list_ds.append({

                "instruction": "Translate this to English",

                "input": bangla_sentence,

                "output": english_sentence

            })

            list_ds.append({

                "instruction": "Translate this to Bangla",

                "input": english_sentence,

                "output":  bangla_sentence

            })



    return list_ds  # Return the populated list



# Define the path to your CSV file

csv_file = '/kaggle/input/english-to-bengali-for-machine-translation/english to bengali.csv'  



# Call the function to convert the CSV into the desired format

list_ds = convert_csv_to_json_format(csv_file)

# Now print the result

pprint(list_ds[:2])  #print 2 lines


[{'input': 'একটি গোলাপী জামা পরা বাচ্চা মেয়ে একটি বাড়ির প্রবেশ পথের সিঁড়ি বেয়ে '
           'উঠছে।',
  'instruction': 'Translate this to English',
  'output': 'a child in a pink dress is climbing up a set of stairs in an '
            'entry way .'},
 {'input': 'a child in a pink dress is climbing up a set of stairs in an entry '
           'way .',
  'instruction': 'Translate this to Bangla',
  'output': 'একটি গোলাপী জামা পরা বাচ্চা মেয়ে একটি বাড়ির প্রবেশ পথের সিঁড়ি '
            'বেয়ে উঠছে।'}]


## Step 3: Load and Configure the Model

We utilize **Gemma 2**, a pre-trained large language model by Google. The model is optimized for fine-tuning using:
- **LoRA-based Fine-Tuning**: Enables efficient parameter tuning with low resource requirements.
- **Gradient Checkpointing**: Reduces memory usage during training.

The configuration includes:
- LoRA rank (`r`): 16
- Target modules: Key projection layers for efficient adaptation.
- Dropout: Disabled for deterministic fine-tuning.


In [3]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None  # Auto-detect based on hardware
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-2-9b-it", 
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=False,
    loftq_config=None,
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.12: Fast Gemma2 patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/6.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

Unsloth 2024.12.12 patched 42 layers with 42 QKV layers, 42 O layers and 42 MLP layers.


## **Step 4: Format Dataset for Training**

The dataset is tokenized and formatted into prompts compatible with the **Gemma 2** model. Fine-tuning requires the data to be in a structured format that clearly communicates the task, input, and expected output. This structure ensures the model understands the specific task during training and can generalize it effectively during inference.

Each prompt in the dataset is organized as follows:

- **Task Instruction**: Specifies what the model should do. In this case, the task is to translate text between English and Bengali. This instruction guides the model by providing context for the expected behavior.
  
- **Input Text**: The sentence or phrase to be translated. This is taken directly from the dataset and represents the source language text.
  
- **Expected Translation (Output)**: The correct translation corresponding to the input text, serving as the target label for the model during training.


In [4]:
import datasets

# Updated prompt name to reflect the translation task
translation_prompt = """Below is a task instruction paired with input text. Your job is to provide an accurate translation.

### Task:
{}

### Input Text:
{}

### Translation:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Ensure EOS token is defined

# Optimized function name and implementation
def format_translation_prompts(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    # Using list comprehension for readability and efficiency
    texts = [
        translation_prompt.format(instruction, input_text, output) + EOS_TOKEN
        for instruction, input_text, output in zip(instructions, inputs, outputs)
    ]
    return {"text": texts}

# Convert your DataFrame to a Hugging Face Dataset
df = pd.DataFrame(list_ds)
dataset = datasets.Dataset.from_pandas(df)

# Apply the formatting function to add the 'text' field
dataset = dataset.map(format_translation_prompts, batched=True)

# Print the final dataset to verify
print(dataset)


Map:   0%|          | 0/78130 [00:00<?, ? examples/s]

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 78130
})


## Step 5: Train the Model

Using **SFTTrainer** from the `trl` library, we fine-tune the Gemma 2 model on the English-Bengali dataset. Key training parameters include:
- **Batch Size**: 2
- **Gradient Accumulation Steps**: 4
- **Learning Rate**: 1e-4
- **Maximum Steps**: 500
- **Optimizer**: AdamW with 8-bit precision for efficiency.

The training is performed on a GPU to leverage CUDA acceleration.


In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=25,
        max_steps=500,
        learning_rate=1e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=42,
        output_dir="outputs",
        report_to="none",
    ),
)
trainer_stats = trainer.train()


Map (num_proc=2):   0%|          | 0/78130 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 78,130 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 500
 "-____-"     Number of trainable parameters = 54,018,048


Step,Training Loss
10,2.9299
20,1.6064
30,1.0712
40,0.9665
50,0.9235
60,0.8629
70,0.8659
80,0.8272
90,0.8075
100,0.772


## Step 6: Inference

Once the model is trained, we perform inference by providing input sentences and generating translations. An example:

- **Input Text**: হাসনাত বলেন, ‘আমরা উদ্বেগের সঙ্গে লক্ষ্য করছি- সরকার এখনো ঘোষণাপত্রের ব্যাপারে দৃশ্যমান কোনো উদ্যোগ নেয়নি।
- **Generated Output**: Hasanat says, we are watching with concern - the government has not yet made any visible moves on the matter of the announcement.

The model is optimized for fluency and accuracy in translations.


In [6]:

FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
inputs = tokenizer(
[
    translation_prompt.format(
        "Translate to English", # instruction
        "হাসনাত বলেন, ‘আমরা উদ্বেগের সঙ্গে লক্ষ্য করছি- সরকার এখনো ঘোষণাপত্রের ব্যাপারে দৃশ্যমান কোনো উদ্যোগ নেয়নি।", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

["<bos>Below is a task instruction paired with input text. Your job is to provide an accurate translation.\n\n### Task:\nTranslate to English\n\n### Input Text:\nহাসনাত বলেন, ‘আমরা উদ্বেগের সঙ্গে লক্ষ্য করছি- সরকার এখনো ঘোষণাপত্রের ব্যাপারে দৃশ্যমান কোনো উদ্যোগ নেয়নি।\n\n### Translation:\nhasanat says , ' we are watching with concern - the government has not yet made any visible moves on the matter of the proclamation . '<eos>"]

## Step 7: Evaluate with BLEU Score

We evaluate the translation quality using the **BLEU metric**, a standard measure for machine translation. The evaluation compares model-generated translations against reference sentences from the dataset.

Steps include:
1. Generating translations for a test dataset.
2. Computing the BLEU score using `sacrebleu`.

The BLEU score reflects how closely the translations match the references.


In [7]:
from sacrebleu import corpus_bleu

# Prepare references and hypotheses
def evaluate_bleu(test_dataset, model, tokenizer, translation_prompt, device="cuda"):
    references, hypotheses = [], []

    for example in test_dataset:
        # Format the input text using the alpaca_prompt
        input_text = translation_prompt.format(
            example['instruction'], example['input'], ""
        )
        reference = example['output']  # The ground truth response

        try:
            # Tokenize the input, ensuring consistent formatting and length
            inputs = tokenizer(
                [input_text],  # Wrap input in a list for batch processing
                return_tensors="pt",
                truncation=True,
                max_length=512,  # Set a maximum input length
                padding="max_length"  # Ensure consistent input size
            ).to(device)

            # Generate the hypothesis using the model
            outputs = model.generate(
                **inputs,
                max_new_tokens=64,  # Limit the output length
                use_cache=True  # Speed up generation
            )

            # Decode the generated output
            hypothesis = tokenizer.batch_decode(
                outputs, skip_special_tokens=True
            )[0]  # Decode the first (and only) hypothesis

            # Append the reference and hypothesis to their respective lists
            references.append([reference])  # BLEU expects a list of references for each hypothesis
            hypotheses.append(hypothesis)

        except Exception as e:
            # Handle errors gracefully to avoid crashing
            print(f"Error generating text for input: {input_text}\n{e}")
            continue

    # Compute BLEU score
    try:
        bleu = corpus_bleu(hypotheses, references)
        print(f"BLEU Score: {bleu.score:.2f}")
    except Exception as e:
        print(f"Error computing BLEU score: {e}")
        bleu = None

    return bleu


# Ensure you have a small test dataset to validate
test_dataset = dataset.select(range(100))  # Use first 100 samples for testing

# Call the BLEU evaluation function
bleu_score = evaluate_bleu(
    test_dataset=test_dataset,
    model=model,
    tokenizer=tokenizer,
    translation_prompt=translation_prompt,  # Ensure this matches your prompt format
    device="cuda"
)


BLEU Score: 29.93


## **Step 8: Save the Fine-Tuned Model**

After fine-tuning, the model and tokenizer are saved locally to enable reuse without re-training. We use the `save_pretrained` method from Hugging Face to store the fine-tuned artifacts.


In [8]:
model_name = "Gemma2_BanglaEnglish"  

# Save the fine-tuned model and tokenizer locally
model.save_pretrained(model_name)
tokenizer.save_pretrained(model_name)

print(f"Fine-tuned model and tokenizer saved as '{model_name}'")


Fine-tuned model and tokenizer saved as 'Gemma2_BanglaEnglish'


## **Conclusion**

This notebook showcased the successful fine-tuning of **Gemma 2** for English-Bengali  translation. The process included:

- Preparing and formatting a high-quality dataset.
- Fine-tuning the model using **LoRA** for efficient parameter updates.
- Evaluating the model with **BLEU scores**.
- Saving the fine-tuned model and tokenizer for reuse.

### **Key advantages of this approach**:
- **Efficiency**: LoRA enables fine-tuning with reduced computational requirements.
- **Speed**: The **Unsloth** library provides faster inference, making the model suitable for real-world applications.
