In [1]:
import os # standard library for operating-system access; used here only to read the environment variable that holds the HF API key
from transformers import AutoTokenizer, AutoModelForCausalLM # transformers provides the tokenizer and the LLM (in this case gemma-2b)
import torch # PyTorch, which enables CUDA (GPU processing); install instructions at https://pytorch.org/ (make sure to check your GPU version first
# from the command line)
# WARNING: if you don't have a GPU available on your PC or laptop, run this code in Colab instead (the code does not work without a GPU)
from huggingface_hub import HfApi # client for the Hugging Face Hub API (the Hub hosts the dataset used for training)

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # we check GPU availability
if torch.cuda.is_available(): # if True
    print("We can use GPU") # we can use GPU
else:
    print("We don't use GPU") # we cannot use GPU

We can use GPU


# Step 1 - Connect to Hugging Face API

In [3]:
api = HfApi(token=os.environ.get('HF_TOKEN')) # get an API token by creating an HF account at https://huggingface.co --> Settings --> Access Tokens
# signed-up users have 100 requests per hour
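
`os.environ.get('HF_TOKEN')` returns `None` when the variable is not set, and the resulting failure may only show up later, when the gated model is requested. A minimal sketch of failing early instead; the helper name `require_env` is hypothetical and not part of any library:

```python
import os

def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Usage in the notebook (assumes HF_TOKEN was exported before starting Jupyter):
# api = HfApi(token=require_env("HF_TOKEN"))
```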

# Step 2 - Load Model

In [4]:
model_name = "google/gemma-2b"  # replace with a model you have access to (gemma-2b is gated, so you need to ask for permission first)

tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ.get('HF_TOKEN')) # load the tokenizer using your HF API key (it is resolved from the model name, so you do not need to worry about whether it is the correct one)
model = AutoModelForCausalLM.from_pretrained(model_name, token=os.environ.get('HF_TOKEN')) # load the model using your HF API key
# the first download takes a while (the model is about 5 GB); once downloaded, it loads faster in future runs
#print(model) # print the model to look for the 'proj' layers: these are the attention and MLP projection layers that we can adapt using LoRA (Low-Rank Adaptation)

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

# First Question

In [5]:
input_text = "What should I do on a trip to Europe?" # prompt

input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))

<bos>What should I do on a trip to Europe?

The answer to this question is not as simple as it seems. There are many different things to see and do in Europe, and it can be difficult to know where to start.

If you’re planning a trip to Europe, here are some tips to help you get started:

1. Decide what you want to see and do.

There are so many amazing places to see and do in Europe, it can be hard to know where to start. Start by deciding what you want to see and do. Do you want to see the Eiffel Tower in Paris? Or


# Conclusion
As you can see, the model's answer is only loosely relevant: it rambles, repeats itself, and is cut off at the token limit. The point of fine-tuning is to adjust those attention layers to obtain better results.

# Second Question

In [6]:
input_text = "Explain the process of photosynthesis in a way that a child could understand"


input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))

<bos>Explain the process of photosynthesis in a way that a child could understand.

A 100-W lightbulb is plugged into a standard $120-\mathrm{V}$ (rms) outlet. Find $(a) I_{\text {mas }}$ and $(b) I_{\max }$ when a device like this is operating at the maximum current allowed by its own internal circuitry. (Such a device is often called a light dimmer.)

A 100-turn, 2.0-cm-diameter coil is at rest with its axis vertical. A uniform magnetic field $60^{\circ}$ away


# Conclusion

In this example the answer is off-topic and confusing: the model continues the prompt with unrelated physics exercises instead of explaining photosynthesis. Fine-tuning is meant to fix this kind of issue.

# Step 3 - Check Dataset (dolly-15k)

In [9]:
from datasets import load_dataset # to load datasets from the Hugging Face Hub

dataset_name = "databricks/databricks-dolly-15k" # open-source dataset that has several kinds of inputs for an LLM
dataset = load_dataset(dataset_name, split="train[0:1000]") # just take the first 1000 records

print(f"Instruction is: {dataset[0]['instruction']}") # take a look at the data
print(f"Response is: {dataset[0]['response']}") # response
print(f"context is: {dataset[0]['context']}") # context (not all rows have it)
print(f"category is: {dataset[0]['category']}")  # category of the question
dataset

Instruction is: When did Virgin Australia start operating?
Response is: Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
context is: Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
category is: closed_qa


Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 1000
})

# Step 4 - Finetune using Parameter Efficient Fine-tuning (PEFT)
- Without PEFT, full fine-tuning would likely exhaust GPU memory and training might be impossible (remember that Gemma-2b has about 2 billion parameters)
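
To see why memory is the bottleneck, a rough back-of-the-envelope estimate helps. The sketch below only counts the bytes needed to store the weights at different precisions; it assumes a round 2 billion parameters and ignores gradients, optimizer state and activations, which add considerably more during full fine-tuning:

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

NUM_PARAMS = 2_000_000_000  # assumption: round figure for Gemma-2b

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Going from 32-bit to 4-bit storage cuts the weight footprint by a factor of eight, which is the motivation for the quantized loading used in the later steps.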

# Do all the imports

In [10]:
import torch # to modify the model
from transformers import (
    AutoModelForCausalLM, # to have access to the model
    AutoTokenizer, # to have access to the model tokenizer
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Step 5 - Load Dataset

In [13]:
from datasets import load_dataset

dataset_name = "databricks/databricks-dolly-15k"
dataset = load_dataset(dataset_name, split="train") #load whole dataset

## Check record categories

In [14]:
from collections import defaultdict

categories_count = defaultdict(int)
for data in dataset: # the index is not needed, so iterate directly
    categories_count[data['category']] += 1
print(categories_count)

defaultdict(<class 'int'>, {'closed_qa': 1773, 'classification': 2136, 'open_qa': 3742, 'information_extraction': 1506, 'brainstorming': 1766, 'general_qa': 2191, 'summarization': 1188, 'creative_writing': 709})
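
The raw counts are easier to compare as shares of the whole dataset. A small sketch that uses the counts printed above:

```python
# Counts copied from the output above
categories_count = {
    'closed_qa': 1773, 'classification': 2136, 'open_qa': 3742,
    'information_extraction': 1506, 'brainstorming': 1766,
    'general_qa': 2191, 'summarization': 1188, 'creative_writing': 709,
}

total = sum(categories_count.values())
shares = {name: count / total for name, count in categories_count.items()}

# Print categories from most to least frequent
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24}{share:6.1%}")
```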


# Step 6 - Filter data

In [15]:
# keep only the records that have no context
filtered_dataset = []
for data in dataset:
    if not data["context"]:
        text = f"Instruction:\n{data['instruction']}\n\nResponse:\n{data['response']}"
        filtered_dataset.append({"text": text})

print(filtered_dataset[0:2]) #check the filtered data

[{'text': 'Instruction:\nWhich is a species of fish? Tope or Rope\n\nResponse:\nTope'}, {'text': 'Instruction:\nWhy can camels survive for long without water?\n\nResponse:\nCamels use the fat in their humps to keep them filled with energy and hydration for long periods of time.'}]


# Step 7 - Create new split file

In [14]:
# save the filtered dataset as a JSON Lines (.jsonl) file, one record per line
import jsonlines as jl
with jl.open('dolly-mini-train-learning.jsonl', 'w') as writer:
    writer.write_all(filtered_dataset)
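
Step 8 loads the filtered data from a dataset repository on the Hub, so the file written here has to be uploaded first. The notebook does not show how that was done. One possible way, sketched below, uses the `create_repo` and `upload_file` methods of the `HfApi` client from Step 1. The helper name `upload_jsonl` and the repository id in the usage comment are placeholders, not the author's actual setup:

```python
import os

def upload_jsonl(api, repo_id, local_path, private=True):
    """Create a dataset repo if needed and upload one JSONL file to it.

    `api` is expected to be a huggingface_hub.HfApi instance.
    """
    # exist_ok=True avoids an error when the repo already exists
    api.create_repo(repo_id=repo_id, repo_type="dataset",
                    private=private, exist_ok=True)
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),  # keep the same file name
        repo_id=repo_id,
        repo_type="dataset",
    )

# Usage (hypothetical repository id):
# upload_jsonl(api, "your-username/your-dataset-name", "dolly-mini-train-learning.jsonl")
```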

# Step 8 - Load Filtered data from HF API

In [16]:
from datasets import load_dataset # access to HF datasets
# create your own dataset repository with your username and repository name; mine is private, so you won't be able to access it
dataset_name = "joselopez1999/data_bricks_train_learning"
dataset = load_dataset(dataset_name, split="train[0:1500]") # adjust the amount of data you want to load (in my case, 1500)
dataset

Dataset({
    features: ['text'],
    num_rows: 1500
})

# Step 9 - Define all the parameters
- LoRA parameters
- bitsandbytes parameters
- training arguments / parameters
- Supervised fine-tuning (SFT) parameters

In [17]:
# define some variables - model names
model_name = "google/gemma-2b" # base LLM to fine-tune
new_model = "gemma-ft" # name of the new model

# LoRA parameters
# LoRA attention dimension
lora_r = 8 # rank of the LoRA matrices; lower values consume less memory, so try higher values only if you have a large GPU
# Alpha parameter for LoRA scaling
lora_alpha = 16 # scales the update of the LoRA matrices (it determines how much weight the learned update has relative to the frozen base weights)
# Dropout probability for LoRA layers
lora_dropout = 0.1 # probability of randomly dropping activations in the LoRA layers (regularization that helps against overfitting)

# bitsandbytes parameters
# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False # disable double quantization

# TrainingArguments parameters
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results" # to visualize performance later
# Number of training epochs
num_train_epochs = 1 # a single pass over the training data
# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False # this depends on hardware
bf16 = False # bf16 needs a GPU with CUDA compute capability 8 or higher (checked in Step 10); mine reports 6, so I leave it False
# Batch size per GPU for training
per_device_train_batch_size = 4 # a large batch size can exhaust GPU memory; try lower values (this depends on hardware)
# Batch size per GPU for evaluation
per_device_eval_batch_size = 4 # a large batch size can exhaust GPU memory; try lower values (this depends on hardware)
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1
# Enable gradient checkpointing
gradient_checkpointing = True
# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3 # this is to avoid exploding gradients
# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4 # initial learning rate of the optimizer
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001 # weight decay applied by the optimizer
# Optimizer to use
optim = "paged_adamw_32bit"
# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"
# Number of training steps (overrides num_train_epochs)
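max_steps = -1 # total number of update steps; -1 disables it, so num_train_epochs decides how long training runs
# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03 # note: the plain "constant" schedule applies no warmup; "constant_with_warmup" does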
# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True
# Save checkpoint every X updates steps
save_steps = 25 # save a checkpoint every 25 update steps
# Log every X updates steps
logging_steps = 25 # log the training loss every 25 update steps


# SFT parameters
# Maximum sequence length to use
max_seq_length = 64 # long sequences can exhaust GPU memory, try low values first
# Pack multiple short examples in the same input sequence to increase efficiency
packing = True # set to False to disable packing
# Load the entire model on the GPU 0
# device_map = {"": 0} # use this if you know your GPU index, or to select which GPU to use if you have more than one
device_map="auto" # lets the library place the model automatically across the devices you have available
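
The batch-size settings determine roughly how many update steps one epoch takes, which in turn tells you how many checkpoints and log lines to expect with `save_steps` and `logging_steps`. A sketch of that arithmetic, assuming a single GPU. Note that with `packing = True` the number of training sequences is the number of packed sequences, which generally differs from the number of dataset rows:

```python
import math

def steps_per_epoch(num_sequences, per_device_batch_size,
                    gradient_accumulation_steps=1, num_devices=1):
    """Approximate number of update steps for one pass over the data."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    return math.ceil(num_sequences / effective_batch)

# Example with a hypothetical count of 1000 packed sequences
steps = steps_per_epoch(1000, per_device_batch_size=4)
print(steps)        # update steps in one epoch
print(steps // 25)  # number of log lines with logging_steps = 25
```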

# Step 10 - QLoRA Configuration

In [18]:
# Load QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit, # Activates 4-bit precision loading
    bnb_4bit_quant_type=bnb_4bit_quant_type, # nf4
    bnb_4bit_compute_dtype=compute_dtype, # float16
    bnb_4bit_use_double_quant=use_nested_quant, # False
)
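
To build some intuition for what 4-bit loading trades away, here is a toy sketch of uniform absmax quantization to signed 4-bit codes. This is a deliberate simplification and not the NF4 scheme that bitsandbytes implements (NF4 uses non-uniform levels and per-block scales); the function names are hypothetical:

```python
def quantize_absmax_4bit(values):
    """Map floats to signed integer codes in [-7, 7] using one shared scale."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # fall back to 1.0 for all-zero input
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.91, -0.07, 0.33]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)

# The rounding error per value is at most half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(max_error <= scale / 2 + 1e-12)
```

Each weight is stored as a small integer plus a shared scale, so memory drops sharply while every value picks up a bounded rounding error.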

In [19]:
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("Setting BF16 to True")
        bf16 = True
    else:
        bf16 = False

# Step 11 - Load Model to Fine-Tune

In [20]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=os.environ.get('HF_TOKEN'),
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          token=os.environ.get('HF_TOKEN'),
                                          trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

# Step 12 - LoRA Configuration

In [21]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"] # attention and MLP projection modules to fine-tune
)
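
With these settings, each targeted weight matrix of shape (out, in) gets two small trainable matrices of shapes (r, in) and (out, r), and the low-rank update is scaled by `lora_alpha / lora_r`. The sketch below counts the trainable parameters that result. The layer shapes in the example are illustrative assumptions, not values read from the Gemma configuration:

```python
def lora_params_per_matrix(in_features, out_features, r):
    """Trainable parameters LoRA adds to one linear layer: A is (r, in), B is (out, r)."""
    return r * in_features + out_features * r

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return lora_alpha / r

# Hypothetical (in_features, out_features) shapes for the targeted modules of one layer
example_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
}

per_layer = sum(lora_params_per_matrix(i, o, r=8) for i, o in example_shapes.values())
print(per_layer)            # trainable LoRA parameters per transformer layer
print(lora_scaling(16, 8))  # scaling factor for lora_alpha = 16, lora_r = 8
```

Because the count grows linearly with `r`, doubling `lora_r` doubles the number of trainable parameters.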

# Step 13 - Training Arguments

In [22]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)

# Step 14 - Supervised Fine-Tuning

In [23]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    # formatting_func=format_prompts_fn,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Generating train split: 0 examples [00:00, ? examples/s]

# Step 15 - Training Process

In [24]:
# Train model
trainer.train()
trainer.model.save_pretrained(new_model)



| Step | Training Loss |
|------|---------------|
| 25 | 4.5353 |
| 50 | 3.1887 |
| 75 | 2.8145 |
| 100 | 2.6059 |
| 125 | 2.6722 |
| 150 | 2.5787 |
| 175 | 2.5661 |
| 200 | 2.5788 |
| 225 | 2.6113 |
| 250 | 2.6272 |
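
Assuming the logged value is the mean per-token cross-entropy in nats, its exponential can be read as a perplexity, which is sometimes easier to interpret. A small sketch over the values logged above:

```python
import math

# (step, training loss) pairs copied from the log above
training_log = [
    (25, 4.5353), (50, 3.1887), (75, 2.8145), (100, 2.6059), (125, 2.6722),
    (150, 2.5787), (175, 2.5661), (200, 2.5788), (225, 2.6113), (250, 2.6272),
]

def perplexity(loss):
    """Perplexity corresponding to a mean cross-entropy loss in nats."""
    return math.exp(loss)

for step, loss in training_log:
    print(f"step {step:>3}: loss {loss:.4f}  perplexity {perplexity(loss):.1f}")

# Step with the lowest logged training loss
best_step, best_loss = min(training_log, key=lambda pair: pair[1])
print(best_step, best_loss)
```

The loss drops quickly during the first 100 steps and then stays close to 2.6, so most of the improvement on this data happens early in the epoch.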


# Step 16 - Visualize results of Fine-Tuning

In [25]:
%load_ext tensorboard
%tensorboard --logdir results/runs

# Step 17 - Prompt the newly fine-tuned model
* Load the LoRA weights and merge them into the base model weights
* Run inference with a prompt similar to the one we used to test the pre-trained model

In [26]:
input_text = "What are some of the best places to visit in Europe?"

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu.


# First Question after Fine-Tuning

In [28]:
input_ids = tokenizer(input_text, return_tensors="pt").to(device)
outputs = model.generate(**input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))

<bos>What are some of the best places to visit in Europe?

Response:
Europe is a continent that is home to many countries and cities. Some of the best places to visit in Europe include Paris, London, Rome, Barcelona, Amsterdam, Venice, Prague, Berlin, Copenhagen, Vienna, Lisbon, and many more. Each city has its own unique culture, history, and attractions that make it a must-visit destination.<eos>
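
The fine-tuned model was trained on text of the form `Instruction:\n...\n\nResponse:\n...` (Step 6), which is why the answer above starts with `Response:`. Prompting with the same template, and stripping everything up to the response marker afterwards, may give cleaner results. A sketch with hypothetical helper names:

```python
RESPONSE_MARKER = "Response:\n"

def build_prompt(instruction):
    """Format an instruction with the template used for the training data in Step 6."""
    return f"Instruction:\n{instruction}\n\n{RESPONSE_MARKER}"

def extract_response(generated_text):
    """Return the text after the first response marker, without <bos>/<eos> tokens."""
    before, marker, answer = generated_text.partition(RESPONSE_MARKER)
    if not marker:
        # No marker found: fall back to the whole text
        answer = before
    for token in ("<bos>", "<eos>"):
        answer = answer.replace(token, "")
    return answer.strip()

# Usage in the notebook:
# input_ids = tokenizer(build_prompt(input_text), return_tensors="pt").to(device)
# outputs = model.generate(**input_ids, max_length=128)
# print(extract_response(tokenizer.decode(outputs[0])))
```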
