<a href="https://colab.research.google.com/github/Eddiebee/AI-Craft/blob/main/Fine_Tunning_Gemma_Models_on_Hugging_Face_using_PEFT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Parameter Efficient Fine-tuning of Google's Deepmind Gemma Models on 🤗

---

Gemma; the open weights language model from Google Deepmind is available for the broader open-source community via Hugging Face. Both the 2 billion and 7 billion parameter size variance are available with the pretrained and instruction-tuned flavors respectively.

The Gemma family of models are well suited for prototyping and experimentation using the free GPU resource Google Colab readily makes available.

In this notebook, I am to apply Parameter Efficient FineTuning (PEFT) on the Gemma family of models using the Hugging Face Transformers and PEFT libraries on GPUs using a dataset available in the open-source community.

---
### WHY PEFT?
PEFT or Parameter-efficient Fine tuning is a technique to optimize open-source models on different domains and at a lost cost. With PEFT, one can readily rely on openly available compute platforms like Google Colab for learning and experimentation.
Also, the default (full weights) training for language models, even for modest sizes, tends to be memory and compute-intensive.

---
### Low-Rank Adaptation for Large Language Models
Low-Rank Adaption (LoRA) is one of the parameter-efficient fine-tuning techniques for large language models (LLMs).
By freezing the original model and only training the adapter layers that are decomposed into low-rank matrices, it addresses just a fraction of the total number of model parameters to be fine-tuned.
The 🤗 PEFT Library makes available an easy abstraction that allows users to select the model layers where adapter weights should be applied.

In this notebook, I'll be leveraging QLoRA from Dettmers et al., in order to quantize the base model in 4-bit precision for a more memory efficient fine-tuning protocol.
To be able to ensure this quantization of the base model, we'll first need to install the `bitsandbytes` library and then pass in the `BitAndBytesConfig` object to `from_pretrained` method when loading the base model.

---

In [None]:
!pip install peft

In [7]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)


### Install required packages

In [None]:
!pip install bitsandbytes

#### Learning to Quote

I begin by downloading the model and the tokenizer. I also ensure to include the `BitsAndBytesConfig` object for weights only optimization.

In [None]:
!pip install git+https://github.com/huggingface/accelerate

In [1]:
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')

model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id,
                                          token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map={"":0},
                                             token=HF_TOKEN)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

I now test the model before starting the finetuning, using a famous quote.

In [61]:
text = "Question: write a python program to add 2 numbers	"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)


outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Question: write a python program to add 2 numbers	
import sys
a = int(sys.argv[1])
b = int(sys


The model does a good generation but with some extra tokens.

But this is not the actual format we want the model to give out its output, let's see if we can teach the model to be able to give its output in this format;

```
Quote: Attitude is a little thing that makes a big difference.


Author: Winston Churchill
```

We'll be using the Asuender Motivational dialogue datasets [asuender/motivational-quotes](https://huggingface.co/datasets/asuender/motivational-quotes).

In [None]:
!pip install datasets

In [43]:
from datasets import load_dataset

dataset = load_dataset("HanHan055/englishPython")
dataset = dataset.map(lambda samples: tokenizer(samples["question"]), batched=True)
dataset

Map:   0%|          | 0/4957 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'question', 'solution', 'input_ids', 'attention_mask'],
        num_rows: 4957
    })
})

Now let's finetune this model using the LoRA config we defined above

In [None]:
!pip install trl

In [44]:
dataset["train"]["question"][0]

' write a python program to add two numbers \n'

In [56]:
def formatting_func(example):
    output_texts = []
    for i in range(len(example)):
        text = f"Question: {example['question'][i]}\Solution: {example['solution'][i]}"
        output_texts.append(text)
    return output_texts

import transformers
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=2,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)
trainer.train()

Step,Training Loss
1,1.7421
2,1.7123


TrainOutput(global_step=2, training_loss=1.7271946668624878, metrics={'train_runtime': 3.8323, 'train_samples_per_second': 2.088, 'train_steps_per_second': 0.522, 'total_flos': 10372722769920.0, 'train_loss': 1.7271946668624878, 'epoch': 0.32})

Finally, we are ready to test the model once more with the same prompt we have used earlier:

In [59]:
text = "Question: write a python program to add 2 numbers"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Question: write a python program to add 2 numbers and display the result.

Answer:

def add(a, b): return a + b


In [30]:
text = "Quote: the only true wisdom "
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: the only true wisdom <em>is</em> to know nothing.

The above quote is from the great philosopher Socrates.

I


In [38]:
text = "Quote: Live the life of "
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Live the life of <strong><em>your dreams</em></strong>.

<strong><em>Your dreams</em></strong> are the things you want to
