# Fine-tuning Gemma-2B (4-bit quantized) for en-fr translation

This project provides a minimal setup for fine-tuning a 4-bit quantized version of Gemma-2B. This notebook uses an English-French corpus.

## Prerequisites

- **Gemma access**: You need to be granted access to the Gemma models, which requires [accepting Google's terms of use](https://huggingface.co/google/gemma-2b).
- **Data**: The [opus-books (en-fr) dataset](https://huggingface.co/datasets/opus_books/tree/main/en-fr) is available on the Hugging Face Hub.
- **GPU**: The model fit in memory during training on an RTX 3080 (10 GB VRAM); see the batch size settings below.
- **Required packages**: Install them from the provided `requirements.txt`.

## Try the pretrained Gemma-2B on en-fr translation

The following imports are required for inference

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

Declare the quantization config (4 bits)

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
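
To give a feel for why 4-bit loading matters on a 10 GB card, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. The parameter count (about 2.5 billion) is an assumed round figure, and the helper `weight_memory_gib` is hypothetical. The result is only a lower bound: it ignores quantization constants, layers that are not quantized, activations, and optimizer state.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory used by the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: a round parameter count for Gemma-2B, for illustration only
N_PARAMS = 2.5e9

for label, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions, the 4-bit weights take a quarter of the space of their bf16 counterparts, which leaves room for activations and the LoRA adapter during training.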

Instantiate the tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("./model")

Instantiate the model with the 4-bit quantization config

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "./model", quantization_config=quantization_config, low_cpu_mem_usage=True
)

Write a function that runs a translation using a dedicated prompt

In [None]:
def get_translation(input_en_text: str, model, tokenizer) -> str:
    """
    Return the decoded generation (prompt included) for a chat-style request to translate input_en_text into French.

    params:
        input_en_text (str): The English text you want to translate.
        model: the pretrained or fine-tuned version of the model
        tokenizer: the associated tokenizer
    """

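    # Gemma's turn markers use underscores (<start_of_turn>, <end_of_turn>), as in the fine-tuning prompts
    prompt_template = """
    <start_of_turn>user What is the French translation of : "{input_en_text}" ?<end_of_turn>
    """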

    prompt = prompt_template.format(input_en_text=input_en_text)

    # Encode the input prompt using the Tokenizer
    encoded = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

    # Send the input to the GPU
    model_inputs = encoded.to('cuda')

    # Run the inference - Feel free to adapt the temperature and other params according to your preferences
    generated_ids = model.generate(**model_inputs, do_sample=True, pad_token_id=tokenizer.eos_token_id, max_new_tokens=100, temperature=0.1)

    # Decode the tokenized output
    decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=False)
    
    return decoded
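
Because the output is decoded with `skip_special_tokens=False`, the returned string contains the prompt and the turn markers as well as the answer. The sketch below shows one way to keep only the model's reply. The helper `extract_reply` is hypothetical, and it assumes the generated text contains a `<start_of_turn>model` marker followed by the answer, as in the fine-tuning template; if the marker is absent, the text is returned unchanged apart from whitespace trimming.

```python
MODEL_TURN = "<start_of_turn>model"
END_TURN = "<end_of_turn>"


def extract_reply(decoded: str) -> str:
    """Return the text of the last model turn, without markers or quotes."""
    # Split on the last model-turn marker; `marker` is empty if it is absent
    _, marker, tail = decoded.rpartition(MODEL_TURN)
    if not marker:
        return decoded.strip()
    # Keep what comes before the next end-of-turn marker
    reply = tail.split(END_TURN, 1)[0]
    # Drop a stray end-of-sequence token, surrounding whitespace and quotes
    reply = reply.replace("<eos>", "")
    return reply.strip().strip('"').strip()


# Hand-written example string, not an actual model output
sample = (
    '<bos><start_of_turn>user What is the French translation of : '
    '"Thank you." ?<end_of_turn>\n<start_of_turn>model "Merci." <end_of_turn><eos>'
)
print(extract_reply(sample))
```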

Test it!

In [None]:
translation = get_translation(input_en_text="Thank you.", model=model, tokenizer=tokenizer)

As you can see, the result is not that good...

In [None]:
translation

## Finetuning

### Data preparation

Make the following imports to handle the dataset with HuggingFace Datasets

In [None]:
from datasets import Dataset, load_dataset

Load the dataset

In [None]:
dataset = load_dataset("./data/", split="train")

Here we use a pandas DataFrame for convenience, but we could have kept the original Dataset object

In [None]:
df = dataset.to_pandas()
df.head(5)

Split the `translation` column into `en` and `fr` columns and drop `id`

In [None]:
df["en"]=df["translation"].apply(lambda x: x['en'])
df["fr"]=df["translation"].apply(lambda x: x['fr'])
df=df.drop(['translation', 'id'], axis=1)

Create a column which contains prompts

In [None]:
def generate_prompt(data_line):

    en=data_line["en"]
    fr=data_line["fr"]
    prompt_template = f"""
    <start_of_turn>user What is the French translation of : "{en}" ? <end_of_turn>\n<start_of_turn>model "{fr}" <end_of_turn>"""

    return prompt_template

In [None]:
df["prompt"] = df.apply(generate_prompt, axis=1)

Go back to Dataset

In [None]:
dataset = Dataset.from_pandas(df)

Add a column made of tokenized prompts

In [None]:
dataset = dataset.map(lambda x: tokenizer(x["prompt"]), batched=True)

Train and test split (80% - 20%)

In [None]:
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]

### Apply LoRA

In [None]:
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
model

Search for Linear layers

In [None]:
import bitsandbytes as bnb

def find_all_linear_names(model):
  # Collect the names of all 4-bit linear layers so that LoRA can target them
  cls = bnb.nn.Linear4bit
  lora_module_names = set()

  for name, module in model.named_modules():
    if isinstance(module, cls):
      # Keep only the last component of the dotted module path
      lora_module_names.add(name.split('.')[-1])

  # Exclude the output head once the scan is done (needed for 16-bit)
  lora_module_names.discard('lm_head')

  return list(lora_module_names)
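
To see what the name handling in this function does without loading a model, the sketch below applies the same "last path component" rule to a few hand-written module paths. The paths and the helper `last_components` are hypothetical and only mimic the dotted names yielded by `named_modules()`.

```python
# Hypothetical dotted module paths, in the style of named_modules() output
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.1.self_attn.q_proj",
    "lm_head",
]


def last_components(names):
    """Deduplicate the last path component of each name, excluding lm_head."""
    suffixes = {name.split(".")[-1] for name in names}
    suffixes.discard("lm_head")
    return sorted(suffixes)


print(last_components(example_names))  # ['gate_proj', 'k_proj', 'q_proj']
```

Repeated layers collapse to a single entry, so the resulting list stays short even for a deep model.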

In [None]:
modules = find_all_linear_names(model)
print(modules)

Apply LoRA config

In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
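
As a rough guide to what `r` and `lora_alpha` mean, LoRA adds two small matrices to each targeted linear layer, one of shape `r x in_features` and one of shape `out_features x r`, and scales their product by `lora_alpha / r` (the default scaling, without rank-stabilized LoRA). The sketch below works through that arithmetic. The layer size of 2048 x 2048 is an illustrative assumption, not a dimension read from the model, and both helpers are hypothetical.

```python
def lora_extra_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by LoRA to one linear layer: A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update under the default scaling rule."""
    return lora_alpha / r


# Illustrative layer size (assumption), with r and lora_alpha from the config
print(lora_extra_params(2048, 2048, r=64))  # 262144
print(lora_scaling(lora_alpha=32, r=64))    # 0.5
```

With `r=64` and `lora_alpha=32`, the low-rank update is scaled by 0.5; raising `lora_alpha` or lowering `r` would strengthen it.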

Get trainable params

In [None]:
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")

### Launch

Import MLflow for tracking

In [None]:
import mlflow

Set the MLflow environment variables

In [None]:
import os

os.environ["MLFLOW_EXPERIMENT_NAME"]="gemma-2b-finetuning"
os.environ["MLFLOW_FLATTEN_PARAMS"]="1"

Instantiate an SFTTrainer

In [None]:
import transformers

from trl import SFTTrainer

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field="prompt",
    peft_config=lora_config,
    max_seq_length=512,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.03,  # fraction of max_steps used for warmup (warmup_steps expects an integer)
        max_steps=10000,
        learning_rate=2e-4,
        logging_steps=100,
        output_dir="./gemma-2b-finetuned",
        optim="paged_adamw_8bit",
        save_steps=500,
        save_strategy="steps",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
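
The training arguments interact in ways that are easy to misread, so here is a small arithmetic sketch of what they imply. It assumes a single GPU and reads the value 0.03 as a warmup ratio, that is, a fraction of `max_steps`; the variable names simply mirror the arguments.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 10_000
save_steps = 500
warmup_ratio = 0.03
n_gpus = 1  # assumption: single-GPU training

# Sequences contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * n_gpus

# Sequences processed over the whole run
sequences_seen = effective_batch_size * max_steps

# Approximate number of warmup steps and of checkpoints written
warmup_steps = round(max_steps * warmup_ratio)
n_checkpoints = max_steps // save_steps

print(effective_batch_size, sequences_seen, warmup_steps, n_checkpoints)
```

Under these assumptions each update averages over 8 sequences, so lowering `per_device_train_batch_size` to fit in memory can be offset by raising `gradient_accumulation_steps`.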

Launch the training :)

In [None]:
model.config.use_cache = False  # silence the warnings
trainer.train()

End the MLflow run

In [None]:
mlflow.end_run()

Save the adapter

In [None]:
new_model = "./gemma-2b-finetuned/checkpoint-XXXX/"
trainer.model.save_pretrained(new_model)
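
The checkpoint number in the path depends on how far training went. With `save_strategy="steps"`, checkpoints are written to sub-folders named `checkpoint-<step>` inside `output_dir`. The sketch below is one way to locate the most recent one using only the standard library; `latest_checkpoint` is a hypothetical helper, not part of the original notebook.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"checkpoint-(\d+)$")
    candidates = []
    for path in root.iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            # Compare steps numerically: "checkpoint-1000" is newer than "checkpoint-500"
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


print(latest_checkpoint("./gemma-2b-finetuned"))
```

The numeric comparison matters because a plain string sort would rank `checkpoint-500` after `checkpoint-1000`.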

Merge the initial model and the learnt adapter

In [None]:
from peft import PeftModel

In [None]:
merged_model = PeftModel.from_pretrained(model, new_model)
merged_model = merged_model.merge_and_unload()

In [None]:
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## Testing

Instantiate your models (pretrained and finetuned)

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "./model", quantization_config=quantization_config, low_cpu_mem_usage=True
)

In [None]:
finetuned_model = AutoModelForCausalLM.from_pretrained(
    "./merged_model", quantization_config=quantization_config, low_cpu_mem_usage=True
)

In [None]:
model_inference = get_translation(input_en_text="Mathematics are very difficult this year", model=model, tokenizer=tokenizer)
finetuned_inference = get_translation(input_en_text="Mathematics are very difficult this year", model=finetuned_model, tokenizer=tokenizer)

Compare the two outputs: the fine-tuned model seems to know how to speak French!

In [None]:
model_inference

In [None]:
finetuned_inference