# Multilingual DPO fine-tuning of a Gemma model
###### In this notebook we will:
- **1- Load a multilingual dataset**
- **2- Use the loaded data to generate synthetic preference data**
- **3- Load the Gemma model and perform parameter-efficient DPO fine-tuning with peft and trl**
- **4- Merge the fine-tuned adapter into the full model, save it, and publish it to Kaggle Models**

Used languages:                                                                                               
"French", "Standard Arabic", "Italian", "Spanish", "Portuguese", "Japanese", "German", "Iranian Persian", "Russian"

##### Device:                                                                                 
1x Nvidia A100 40gb

##### Base Model:                                                                          
gemma2-2b-it

##### My Gemma2 cookbook:
I am collecting all my notebooks on working with Gemma models in this repo, check it out: https://github.com/Mhdaw/Gemma2

## Step 0: Installing the required frameworks and libraries

Running the following code installs the required frameworks and libraries.


In [None]:
!pip install -q datasets trl peft bitsandbytes accelerate kagglehub keras_hub google-generativeai

In [None]:
import google.generativeai as genai
import torch
import os
import shutil
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig, get_peft_model, PeftModel, PeftConfig
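import time
import random  # used below to pick the Gemini models
from tqdm.auto import tqdm  # progress bar for the generation loop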
import kagglehub

## Step 1: Setting up the environment variables for Kaggle (for model uploading) and Hugging Face
You can use any of the methods mentioned below to set up your secrets.

In [None]:
from google.colab import userdata

import os
# Set the environment variables for Kaggle and Hugging Face.
# from kaggle_secrets import UserSecretsClient  # if you use Kaggle
# from google.colab import userdata  # if you use Google Colab
# import getpass  # if you use a Jupyter notebook
os.environ["KAGGLE_USERNAME"] = "your-username"  # or UserSecretsClient().get_secret("KAGGLE_USERNAME") or userdata.get("KAGGLE_USERNAME") or getpass.getpass("Enter your KAGGLE_USERNAME: ")
os.environ["KAGGLE_KEY"] = "kaggle-api-key"  # or UserSecretsClient().get_secret("KAGGLE_KEY") or userdata.get("KAGGLE_KEY") or getpass.getpass("Enter your KAGGLE_KEY: ")
os.environ["HF_TOKEN"] = "huggingface-api-key"  # or UserSecretsClient().get_secret("HF_TOKEN") or userdata.get("HF_TOKEN") or getpass.getpass("Enter your HF_TOKEN: ")
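
The three alternatives listed in the comments above can be folded into one helper. The sketch below is only an illustration: `get_secret` is a hypothetical name that is not part of the original notebook, and it assumes that a failed lookup in one environment should fall through to the next one.

```python
import os
import getpass


def get_secret(name: str) -> str:
    """Return a secret, trying the environment first, then notebook secret stores."""
    # 1. An already exported environment variable wins.
    value = os.environ.get(name)
    if value:
        return value
    # 2. Kaggle notebooks: secrets attached to the notebook.
    try:
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except Exception:
        pass  # not on Kaggle, or the secret is not defined there
    # 3. Google Colab: secrets stored in the Secrets panel.
    try:
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        pass  # not on Colab, or the secret is not defined there
    # 4. Plain Jupyter: ask interactively without echoing the value.
    return getpass.getpass(f"Enter your {name}: ")
```

It could then be used as `os.environ["HF_TOKEN"] = get_secret("HF_TOKEN")`, which avoids pasting keys into the notebook itself.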

## Step 2: Dataset generation
In the following section we load the Aya dataset, a multilingual instruction dataset, and use Gemini models to generate the preference dataset.
Since both Gemma and Gemini are developed by Google, Gemini is a natural choice. To use the Gemini API you need an API key, which you can create in [Google AI Studio](https://aistudio.google.com/).

**Disclaimer**

This project utilizes synthetic data generated through [Google's Gemini API] and is intended solely for educational and research purposes. The content herein does not replicate or extract proprietary components of the aforementioned services.

All rights to the underlying AI models and data used belong to their respective owners. No ownership or endorsement by these entities is implied.

If any entity or individual believes that their rights are infringed upon by this project, please contact [mahdi.sed1384digh@gmail.com] immediately. Upon notification, appropriate actions, including content removal, will be taken promptly.

### Sub tasks:
- 1: Load the Aya dataset.
- 2: Set up a function that generates the preference data.
- 3: Process and convert it into a DPO-format dataset.
- 4: Save it as `csv` for loading with the `datasets` library.
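
To make the target of sub task 3 concrete, here is a minimal sketch of a single DPO-format record: a `prompt`, the `chosen` response, and the `rejected` response. The helper name `to_dpo_record` and the sample strings are made up for illustration, and skipping rows without a usable verdict is an assumption of this sketch.

```python
from typing import Optional


def to_dpo_record(prompt: str, response_a: str, response_b: str,
                  verdict: Optional[str]) -> Optional[dict]:
    """Turn one judged comparison into a prompt/chosen/rejected record."""
    if verdict == "model_a":
        chosen, rejected = response_a, response_b
    elif verdict == "model_b":
        chosen, rejected = response_b, response_a
    else:
        # No usable verdict: skip the row rather than guess a winner.
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


record = to_dpo_record("Bonjour, ça va ?", "Très bien, merci !", "ok", "model_a")
print(record)
```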

#### Sub task1:

You can explore it on Hugging Face: https://huggingface.co/datasets/CohereForAI/aya_dataset

Dataset Summary (from the original dataset page):
The Aya Dataset is a multilingual instruction fine-tuning dataset curated by an open-science community via Aya Annotation Platform from Cohere For AI. The dataset contains a total of 204k human-annotated prompt-completion pairs along with the demographics data of the annotators. This dataset can be used to train, finetune, and evaluate multilingual LLMs.

Curated by: Contributors of Aya Open Science Initiative.

Language(s): 65 languages (71 including dialects & scripts).

License: Apache 2.0

In [None]:
# Load the annotations dataset
aya_dataset = load_dataset("CohereForAI/aya_dataset")

We then process the Aya dataset, keeping a sample from 9 specific languages to save time, API quota, and computation.

In [None]:
data_frame = aya_dataset["train"].to_pandas()
data_frame = data_frame.drop(columns=["language_code", "annotation_type", "user_id"])
languages_to_keep = ["French", "Standard Arabic", "Italian", "Spanish", "Portuguese", "Japanese", "German", "Iranian Persian", "Russian",]
filterd_data_frame = data_frame[data_frame["language"].isin(languages_to_keep)]

In [None]:
def sample_languages(df, n_samples):
  """Samples n_samples from each language in the dataframe.

  Args:
    df: The input dataframe.
    n_samples: The number of samples to take from each language.

  Returns:
    A new dataframe with the sampled data.
  """
  sampled_df = df.groupby('language', group_keys=False).apply(
      lambda g: g.sample(n=min(n_samples, len(g)))
  )
  return sampled_df
    
sampled_df = sample_languages(filterd_data_frame, 100)
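
The sampling above draws a different subset on every run. If reproducibility matters, a fixed seed helps. In pandas that would mean passing `random_state` to `sample`. The pure-Python sketch below shows the same per-language idea with a seeded generator. `sample_per_language` and the toy rows are hypothetical and only illustrate the technique.

```python
import random
from collections import defaultdict


def sample_per_language(rows, n_samples, seed=0):
    """Return up to n_samples rows per language, reproducibly for a given seed."""
    rng = random.Random(seed)
    by_language = defaultdict(list)
    for row in rows:
        by_language[row["language"]].append(row)
    sampled = []
    # Sort the languages so the draw order does not depend on input order.
    for language in sorted(by_language):
        group = by_language[language]
        sampled.extend(rng.sample(group, min(n_samples, len(group))))
    return sampled


rows = [{"language": "French", "inputs": f"q{i}"} for i in range(5)]
rows += [{"language": "German", "inputs": f"f{i}"} for i in range(2)]
sample = sample_per_language(rows, n_samples=3, seed=42)
print(len(sample))
```

Languages with fewer rows than `n_samples` simply contribute everything they have, which matches the `min(n_samples, len(g))` logic used above.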

In [None]:
model_system_prompt = "You are a helpful assistant, answer well in the same language as input prompt"
evaluator_system_prompt = """You are an excellent judge that evaluates two models' responses to a prompt given by a user. You will also be given a preferred response, which you can use to decide which model is better.
You will get the data in the following format: model_a:{model_a_response}, model_b:{model_b_response}, preferred_response:{preferred_response},
You should respond only with model_a or model_b; any other response will be rejected.
"""

In [None]:
def process_evaluation_response(evaluation_response):
  if "model_a" in evaluation_response:
    return "model_a"
  elif "model_b" in evaluation_response:
    return "model_b"
  else:
    return None
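
The substring check above returns `model_a` whenever that string appears, even if the judge's reply also mentions `model_b`. A stricter variant, sketched here under the assumption that ambiguous replies should be discarded, accepts a verdict only when exactly one label is present. `parse_verdict` is a hypothetical name, not part of the original notebook.

```python
import re
from typing import Optional


def parse_verdict(text: str) -> Optional[str]:
    """Return 'model_a' or 'model_b' only when exactly one of them is named."""
    found = set(re.findall(r"\bmodel_[ab]\b", text.lower()))
    if len(found) == 1:
        return found.pop()
    # Neither label, or both labels: treat the reply as unusable.
    return None


print(parse_verdict("model_b"))
print(parse_verdict("model_a is better than model_b"))
```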

In [None]:
def generat_dpo_dataset(train_df):
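    # Read the Gemini key from Colab secrets (`userdata` is imported in Step 1);
    # on Kaggle use UserSecretsClient().get_secret("GEMINI_API_KEY") instead.
    genai.configure(api_key=userdata.get("GEMINI_API_KEY"))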
    dpo_dataset = []
    failed_indices = []

    def generate_with_retry(model, content, max_retries=3, base_delay=2):
        for attempt in range(max_retries):
            try:
                response = model.generate_content(content)
                return response
            except Exception as e:
                if attempt < max_retries - 1:
                    # Exponential backoff: 2s, 4s, 8s
                    delay = base_delay * (2 ** attempt)
                    time.sleep(delay)
                    continue
                raise e
        return None

    for index, row in tqdm(train_df.iterrows(), total=len(train_df), desc="processing"):
        try:
            prompt = row["inputs"]
            preferred_response = row["targets"]
            language = row["language"]
            model_a_name = random.choice(["gemini-1.5-flash", "gemini-1.5-flash-8b"])
            model_b_name = random.choice(["gemini-2.0-flash-exp", "gemini-1.5-flash-8b"])
            evaluation_model_name = random.choice(["gemini-2.0-flash-exp", "gemini-1.5-flash"])

            # Model A generation
            model_a = genai.GenerativeModel(
                model_name=model_a_name,
                system_instruction=model_system_prompt)

            model_a_response = generate_with_retry(model_a, prompt)
            if not model_a_response:
                raise Exception("Failed to get response from model A after retries")
            time.sleep(1)

            # Model B generation
            model_b = genai.GenerativeModel(
                model_name=model_b_name,
                system_instruction=model_system_prompt)

            model_b_response = generate_with_retry(model_b, prompt)
            if not model_b_response:
                raise Exception("Failed to get response from model B after retries")
            time.sleep(1)

            # Evaluation
            evaluation_prompt = f"""model_a:{model_a_response.text}, model_b:{model_b_response.text}, preferred_response:{preferred_response},"""

            evaluation_model = genai.GenerativeModel(
                model_name=evaluation_model_name,
                system_instruction=evaluator_system_prompt)

            evaluation_response = generate_with_retry(evaluation_model, evaluation_prompt)
            if not evaluation_response:
                raise Exception("Failed to get response from evaluation model after retries")
            time.sleep(1)

            dpo_dataset.append({
                "prompt": prompt,
                "language": language,
                "model_a_name": model_a_name,
                "model_a_response": model_a_response.text,
                "model_b_name": model_b_name,
                "model_b_response": model_b_response.text,
                "preferred_response": preferred_response,
                "evaluation_model_name": evaluation_model_name,
                "evaluation_response": process_evaluation_response(evaluation_response.text)
            })

        except Exception as e:
            print(f"Error processing index {index}: {str(e)}")
            failed_indices.append(index)
            continue

    # Print summary of failed operations
    if failed_indices:
        print(f"\nFailed to process {len(failed_indices)} rows")

    return dpo_dataset#, failed_indices

In [None]:
dpo_dataset = generat_dpo_dataset(sampled_df)
dpo_dataset_df = pd.DataFrame(dpo_dataset)

In [None]:
def get_winner_model(row):
  if row["evaluation_response"] == "model_a":
    return row["model_a_response"]
  elif row["evaluation_response"] == "model_b":
    return row["model_b_response"]

In [None]:
def process_df(df):
  """Take the DPO dataframe and return a dataframe with the columns: prompt, rejected and chosen."""
  # Keep only rows where the judge returned a usable verdict;
  # otherwise "chosen" would be empty for those rows.
  df = df[df["evaluation_response"].isin(["model_a", "model_b"])]
  processed_df = pd.DataFrame(columns=["prompt", "rejected", "chosen"])
  processed_df["prompt"] = df["prompt"]

  # The rejected response is the one written by the model that lost the comparison.
  processed_df["rejected"] = df.apply(lambda row: row["model_a_response"] if row["evaluation_response"] == "model_b" else row["model_b_response"], axis=1)
  processed_df["chosen"] = df.apply(get_winner_model, axis=1)
  return processed_df
processed_df = process_df(dpo_dataset_df)
processed_df.to_csv("processed_dpo_dataset.csv", index=False)

In [None]:
dataset = load_dataset("csv", data_files="/content/processed_dpo_dataset.csv")

## Step 3: Loading the model and the tokenizer

We use `AutoModelForCausalLM` and `AutoTokenizer` from Hugging Face Transformers to load and use our model.

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it") # or you can load it directly from Kaggle.
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="cuda",
)
model.config.use_cache = False
print("Model and tokenizer loaded...")

In [None]:
finetune_name = "gemma2-2b-mulitlingual-DPO"

Here we set up a LoRA config to perform parameter-efficient fine-tuning.
With this method, instead of updating all parameters, we only update a small fraction of them.

In [None]:
lora_config = LoraConfig(
    r=6,
    lora_alpha=12,
    target_modules=["q_proj", "v_proj"],  # Specify target modules to apply LoRA
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

In [None]:
model = get_peft_model(model, lora_config)

As you can see, in this experiment we only update around 1 million parameters instead of the full 2.6 billion.

In [None]:
def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

trainable_params = count_trainable_parameters(model)
print(f"Total trainable parameters: {trainable_params}")
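
The count can also be estimated by hand, which is a useful sanity check. The sketch below assumes the commonly published Gemma 2 2B shapes (hidden size 2304, 26 layers, 8 query heads, 4 key/value heads, head dimension 256). These values are assumptions here, so compare them with `model.config` before relying on the result.

```python
# Assumed Gemma 2 2B shapes (verify against model.config).
HIDDEN_SIZE = 2304
NUM_LAYERS = 26
NUM_HEADS = 8
NUM_KV_HEADS = 4
HEAD_DIM = 256
LORA_RANK = 6  # r from the LoraConfig above


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank) per target layer."""
    return rank * in_features + out_features * rank


q_proj = lora_params(HIDDEN_SIZE, NUM_HEADS * HEAD_DIM, LORA_RANK)
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, LORA_RANK)
total = NUM_LAYERS * (q_proj + v_proj)
print(f"LoRA parameters per layer: {q_proj + v_proj}")
print(f"LoRA parameters in total:  {total}")
```

Under these assumptions the adapters add roughly 1.2 million trainable parameters, which is consistent with the "around 1 million" figure.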

#### Starting the DPO fine-tuning
We use trl for this. First we define our `training_args` as a `DPOConfig`, which holds all of our hyperparameters. We use a small batch size for this experiment.

In [None]:
# Training arguments
training_args = DPOConfig(
    # Training batch size per GPU
    per_device_train_batch_size=2,
    torch_empty_cache_steps = 20,
    # Gradient checkpointing saves memory: activations are not stored but recomputed during the backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training steps
    max_steps=300,
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=1,
    # Directory to save model outputs
    output_dir=finetune_name,
    # Number of steps for learning rate warmup
    warmup_steps=150,
    tf32=True,
    # Disable wandb/tensorboard logging
    report_to="none",
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    hub_model_id=finetune_name,
    # DPO-specific parameter that controls how far the policy may drift from the reference model
    # Higher values keep the model closer to the reference; lower values (like 0.1) allow more deviation
    beta=0.1,
    # Maximum length of the input prompt in tokens
    max_prompt_length=512,
    # Maximum combined length of prompt + response in tokens
    max_length=768,
)
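
With `warmup_steps=150` and `max_steps=300`, half of the run is spent warming up and the other half decaying. The sketch below reproduces the usual linear-warmup plus cosine-decay formula to show what that looks like. `lr_at_step` is a hypothetical helper, and the values the trainer actually logs may differ slightly depending on when the scheduler is stepped.

```python
import math


def lr_at_step(step, base_lr=5e-5, warmup_steps=150, max_steps=300):
    """Linear warmup to base_lr, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 75, 150, 225, 300):
    print(step, lr_at_step(step))
```

The peak learning rate of 5e-5 is reached only at step 150, so a shorter warmup may be worth trying if training looks slow to start.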

#### Setting up the trainer

In [None]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing preferred/rejected response pairs
    train_dataset=dataset["train"],
    # Tokenizer for processing inputs
    processing_class=tokenizer,
)

### Training the model
I deleted the training logs to keep this notebook readable, since they show a table with hundreds of rows.

In [None]:
trainer.train()
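
For intuition about what `trainer.train()` optimizes, here is a minimal sketch of the sigmoid DPO loss for one preference pair. It assumes the inputs are summed log-probabilities of each response under the policy and under the frozen reference model. `dpo_loss` is a hypothetical helper and not trl's implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair, given sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss falls as the policy raises the chosen response relative to the rejected one, measured against the reference model, and `beta` scales how strongly that margin is rewarded.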

## Step 4: Saving the model
Now that our model is trained, it's time to save it.
When we call `save_model()` on a peft model, only the fine-tuned adapter weights are saved, not the full model. That is fine, but to make the model more useful we merge the adapter into the base model and upload the complete version.

In [None]:
trainer.save_model(f"./{finetune_name}")

In [None]:
# 1. Load the base model
model_name_or_path = "google/gemma-2-2b-it"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

# 2. Load the PEFT adapter
adapter_model_name_or_path = "/content/gemma2-2b-mulitlingual-DPO"
model = PeftModel.from_pretrained(model, adapter_model_name_or_path)

# 3. Merge the adapter into the base model
merged_model = model.merge_and_unload()
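
Conceptually, merging a LoRA adapter folds the low-rank update into each targeted weight matrix: W' = W + (alpha / r) * B @ A. The toy sketch below checks that identity with plain Python lists. The shapes and numbers are made up, and `merge_lora` is a hypothetical helper rather than the peft implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy shapes: out_features=2, in_features=3, rank=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]          # rank x in_features
B = [[0.5], [-1.0]]            # out_features x rank
merged = merge_lora(W, A, B, alpha=2, rank=1)
x = [[1.0], [1.0], [1.0]]      # input as a column vector
print(matmul(merged, x))
```

After merging, a single matrix multiply gives the same output as the base layer plus the scaled adapter path, so the merged model no longer needs peft at inference time.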

In [None]:
os.makedirs("/content/gemma2-2b-DPO-merged", exist_ok=True)
merged_model.save_pretrained("/content/gemma2-2b-DPO-merged")
print("Merged model saved.")

**Note:** When we save the merged model, the tokenizer-related files are not saved, so we have to copy them over from the adapter directory.

In [None]:
# List of files to copy
files_to_copy = ['/content/gemma2-2b-mulitlingual-DPO/training_args.bin', '/content/gemma2-2b-mulitlingual-DPO/tokenizer_config.json',
                '/content/gemma2-2b-mulitlingual-DPO/tokenizer.model', '/content/gemma2-2b-mulitlingual-DPO/tokenizer.json',
                '/content/gemma2-2b-mulitlingual-DPO/special_tokens_map.json']
destination_directory = '/content/gemma2-2b-DPO-merged'

# Ensure the destination directory exists
if not os.path.exists(destination_directory):
    os.makedirs(destination_directory)

# Copy each file
for file in files_to_copy:
    shutil.copy(file, destination_directory)
    print(f"File '{file}' copied to '{destination_directory}'.")
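
Not every file in the list is guaranteed to exist in the adapter directory, and `shutil.copy` raises an error for a missing source. A more forgiving variant, sketched below with the hypothetical helper `copy_existing`, copies what is there and reports the rest. The temporary directory stands in for the real paths.

```python
import os
import shutil
import tempfile


def copy_existing(files, destination):
    """Copy the files that exist; return (copied, missing) path lists."""
    os.makedirs(destination, exist_ok=True)
    copied, missing = [], []
    for path in files:
        if os.path.isfile(path):
            shutil.copy(path, destination)
            copied.append(path)
        else:
            missing.append(path)
    return copied, missing


# Demo in a throwaway directory: one file exists, one does not.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "tokenizer.json")
    with open(source, "w") as fh:
        fh.write("{}")
    absent = os.path.join(tmp, "tokenizer.model")
    destination = os.path.join(tmp, "merged")
    copied, missing = copy_existing([source, absent], destination)
    landed = os.path.isfile(os.path.join(destination, "tokenizer.json"))
    print("copied:", len(copied), "missing:", len(missing))
```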

## Final step: Pushing the model to Kaggle so everyone can use it

In [None]:
if "KAGGLE_USERNAME" not in os.environ or "KAGGLE_KEY" not in os.environ:
    kagglehub.login()

model_version = 1
kaggle_username = kagglehub.whoami()["username"]
fine_tuned_model_name = "gemma2_2b_mulitlingual_DPO"
handle = f'{kaggle_username}/gemma/transformers/{fine_tuned_model_name}'
print(f"Handle: {handle}\n")
local_model_dir = "/content/gemma2-2b-DPO-merged"
kagglehub.model_upload(handle, local_model_dir)
print("Done!")

# Inference
Here we show how to load the fine-tuned model from Kaggle and use it.

**Note:**
This section is for running inference inside Kaggle notebooks.

**Inference step by step:**
- 1: Load the model from Kaggle (you can do this by adding the model from the sidebar under Kaggle inputs).
- 2: Use the model for multilingual text generation.
- ***After DPO fine-tuning, the model should respond in a more natural and helpful way, even though we used a small dataset and trained only a small number of parameters.***

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/gemma2/transformers/gemma2_2b_mulitlingual_dpo/1")
tokenizer.pad_token = tokenizer.eos_token

# Load the Fine-tuned model
model = AutoModelForCausalLM.from_pretrained(
    "/kaggle/input/gemma2/transformers/gemma2_2b_mulitlingual_dpo/1",
    device_map="cuda",
)
# The KV cache stays enabled (the default) so that generation is fast.
print("Model and tokenizer loaded and ready to go...")

**Example usage:**

In [None]:
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="/kaggle/input/gemma2/transformers/gemma2_2b_mulitlingual_dpo/1",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",  # use "mps" on a Mac, or drop `device` and pass device_map="auto" instead
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]

outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)

Or

In [None]:
input_text = "Ciao, Come stai la mia amica?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
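
The second variant sends raw text to an instruction-tuned model, while the pipeline example applies the chat template automatically. The sketch below shows, as an assumption about Gemma's turn markers, roughly what the formatted prompt looks like. `format_gemma_prompt` is a hypothetical helper. With the real tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is the reliable way to build this string.

```python
def format_gemma_prompt(messages):
    """Render chat messages in Gemma-style turns, ending with an open model turn."""
    parts = []
    for message in messages:
        # Gemma names the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    # Leave the model turn open so generation continues from here.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = format_gemma_prompt([{"role": "user", "content": "Ciao, come stai?"}])
print(prompt)
```

The beginning-of-sequence token is left out on purpose, on the assumption that the tokenizer adds it when the string is tokenized.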

# Conclusion
This notebook showcased the complete workflow for multilingual DPO fine-tuning of the Gemma model in 9 languages. We covered:

- Dataset generation and preparation
- Fine-tuning with LoRA
- DPO fine-tuning
- Using the fine-tuned model for inference