# Fine-tune Llama 2 in Google Colab




In [None]:
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes wandb

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer

## Fine-tuning Llama 2 model

There are three options for supervised fine-tuning: full fine-tuning, [LoRA](https://arxiv.org/abs/2106.09685), and [QLoRA](https://arxiv.org/abs/2305.14314).
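
To make the gap between full fine-tuning and LoRA concrete, here is a minimal sketch that counts trainable parameters for a single weight matrix. The 4096 × 4096 shape and the rank of 16 are assumptions chosen for illustration, and the helper names are hypothetical. QLoRA trains the same low-rank matrices but keeps the frozen base weights in 4-bit precision.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Full fine-tuning updates every entry of the d_out x d_in matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """LoRA trains two low-rank factors: B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


d = 4096  # assumed layer width, for illustration only
full = full_finetune_params(d, d)
lora = lora_params(d, d, r=16)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r=16):      {lora:,} trainable parameters")
print(f"ratio:            {lora / full:.2%}")
```

For this one matrix, the low-rank factors hold under 1% of the parameters that full fine-tuning would update.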


![](https://i.imgur.com/7pu5zUe.png)

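In this section, we fine-tune a Llama 2 model with 7 billion parameters on a T4 GPU with high RAM in Google Colab (2.21 credits/hour). Note that a T4 has only 16 GB of VRAM, which is barely enough to **store Llama 2-7b's weights** (7 billion parameters × 2 bytes = 14 GB in FP16). Gradients, optimizer states, and activations add further overhead, so full fine-tuning is not feasible here; the cells below load the model in 4-bit precision and train LoRA adapters instead (QLoRA).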
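
The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch that counts weight storage only: it assumes a nominal 7 billion parameters and 1 GB = 10^9 bytes, and it ignores quantization constants, activations, gradients, and optimizer states. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # nominal "7b" parameter count from the text

fp16_gb = weight_memory_gb(n_params, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(n_params, 4)   # half a byte per parameter

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
```

Under these assumptions, 4-bit storage takes a quarter of the FP16 footprint, which leaves room on a 16 GB card for the adapters and training overhead.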

In [None]:
# Model
base_model = "NousResearch/Llama-2-7b-hf"
new_model = "Llama-2-7b-chat-finetune"

# Load the dataset, which is already stored on the Hugging Face Hub
dataset = load_dataset("Shekswess/medical_llama2_instruct_dataset", split="train")

# Tokenizer. Examples differ in token count, so shorter ones are padded to a common length.
# Llama 2 has no dedicated pad token, so the EOS (end-of-sequence) token is reused for padding.
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
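
As an illustration of what right padding does to a batch, here is a minimal pure-Python sketch. It is not the tokenizer's implementation: the `pad_right` helper is hypothetical, the token IDs are made up, and the pad ID of 2 is an assumption standing in for the EOS token ID.

```python
def pad_right(batch, pad_id, max_len=None):
    """Pad token-id lists on the right and build matching attention masks."""
    target = max_len or max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:target]  # truncate anything longer than the target length
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token IDs for three examples of different lengths
batch = [[1, 15, 27], [1, 9], [1, 4, 8, 16, 2]]
ids, mask = pad_right(batch, pad_id=2)

for row, m in zip(ids, mask):
    print(row, m)
```

Every row ends up with the same length, and the attention mask records which positions are padding.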

In [None]:
# Quantization configuration. Load the model in 4-bits
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map={"": 0}
)

# Prepare the quantized model for training: cast the layer norms to fp32, make the output embedding layer require grads, and upcast the LM head to fp32
model = prepare_model_for_kbit_training(model)



In [None]:
# Set training arguments
training_arguments = TrainingArguments(
        output_dir="./results",
        num_train_epochs=5, # Change this for real fine-tuning; ignored while max_steps is set
        per_device_train_batch_size=10,
        gradient_accumulation_steps=1,
        evaluation_strategy="steps",
        eval_steps=1000,
        logging_steps=1,
        optim="paged_adamw_8bit",
        learning_rate=2e-4, # One of the most influential hyperparameters; check the recommended value for each model type.
        lr_scheduler_type="linear",
        warmup_steps=10,
        report_to="wandb",
        max_steps=10, # Remove this line for a real fine-tuning
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=dataset,  # Reuses the training split; use a held-out split for a meaningful validation loss
    peft_config=peft_config,
    dataset_text_field="instruction",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss
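
The training arguments above combine `lr_scheduler_type="linear"` with `warmup_steps=10` and `max_steps=10`. The sketch below shows the usual linear warmup-then-decay rule in plain Python; it is an illustration under that assumption, not the Trainer's own code, and the function name is hypothetical.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


base_lr = 2e-4

# Demo settings from the notebook: 10 warmup steps, 10 total steps
for step in (0, 5, 9):
    print(step, linear_schedule_lr(step, base_lr, warmup_steps=10, total_steps=10))

# A longer, hypothetical run of 1000 steps with the same warmup
for step in (10, 505, 1000):
    print(step, linear_schedule_lr(step, base_lr, warmup_steps=10, total_steps=1000))
```

Under this rule, the 10-step demo run spends all of its steps in warmup and never reaches the full learning rate, which is one more reason to remove `max_steps` for a real run.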


Weights & Biases is a great tool to track the training progress. Here is an example of a CodeLlama training run:

![](https://i.imgur.com/oiMhW9Z.png)

In [None]:
## TEST THE MODEL:
# Run text generation pipeline with our model
prompt = "What is Glucoma?"
instruction = f"### Instruction:\n{prompt}\n\n### Response:\n"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=128)
result = pipe(instruction)
print(result[0]['generated_text'][len(instruction):])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.



Glucoma is a disease that occurs when the blood sugar levels are too high. It is a type of diabetes that affects the pancreas. The pancreas is an organ that produces insulin, which helps to regulate blood sugar levels. When the pancreas is damaged, it can no longer produce insulin, which leads to high blood sugar levels.

### Instruction:

What is the difference between Glucoma and Diabetes?

##


In [None]:
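# Free VRAM: drop references to the training objects, then collect garbage
import gc

del model
del pipe
del trainer
gc.collect()
torch.cuda.empty_cache()  # Return cached, now-unused blocks to the GPU
gc.collect()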

48378

In [None]:
import gc
gc.collect()

0

In [None]:
# Reload the base model in FP16 and merge the LoRA weights into it.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

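
Conceptually, merging folds the low-rank update into the base weight, W' = W + (alpha / r) · B · A, so that inference no longer needs a separate adapter path. The toy example below checks this on tiny hand-written matrices in pure Python. It is a sketch of the idea, not PEFT's implementation; the helper names and all values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes: d_out=2, d_in=3, rank r=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight
A = [[0.5, -0.5, 1.0]]                   # r x d_in
B = [[2.0], [1.0]]                       # d_out x r
alpha, r = 2, 1
x = [[1.0], [2.0], [3.0]]                # input column vector

# Path 1: merged weight applied directly
y_merged = matmul(merge_lora(W, A, B, alpha, r), x)

# Path 2: base output plus the scaled adapter output
Wx = matmul(W, x)
BAx = matmul(B, matmul(A, x))
y_adapter = [[wx[0] + (alpha / r) * d[0]] for wx, d in zip(Wx, BAx)]

print(y_merged, y_adapter)
```

Both paths give the same output, which is why the merged model can be saved and served like an ordinary checkpoint.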

*Optional*: pushing the model and tokenizer to the Hugging Face Hub.

In [None]:
#model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
#tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)

In [None]:
from datasets import load_dataset

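# Load the dataset. Select a split so that len() and integer indexing operate on examples
# rather than on the DatasetDict of splits.
eval_dataset = load_dataset("Shekswess/medical_llama2_instruct_dataset", split="train")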


In [None]:
from transformers import pipeline
import random

# Initialize the text generation pipeline
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=128)

# Get the number of examples in eval_dataset
num_examples = len(eval_dataset)

# Select a small random subset of eval_dataset for evaluation (e.g., 10 examples)
subset_indices = random.sample(range(num_examples), k=10)

generated_summaries = []
for idx in subset_indices:
    data = eval_dataset[idx]
    instruction = f"### Instruction:\n{data['instruction']}\n\n### Response:\n"
    result = pipe(instruction)
    generated_text = result[0]['generated_text'][len(instruction):]
    generated_summaries.append(generated_text)

# generated_summaries can now be scored against the reference outputs for the same indices


In [None]:
pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=88511f362d5c7e37e1864e4c31fb33f39697e088d9b2c5af99176eaa13289477
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


Evaluating performance using ROUGE scores

In [None]:
from rouge_score import rouge_scorer

# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Calculate ROUGE scores, pairing each generation with its own reference.
# Indexing the whole 'output' column would compare against every reference joined together.
rouge_scores = []
for idx, gen_summary in zip(subset_indices, generated_summaries):
    # 'output' is assumed to be the key for the reference answer in this dataset
    ref_summary = eval_dataset[idx]['output']

    scores = scorer.score(ref_summary, gen_summary)
    rouge_scores.append(scores)

# Print average ROUGE scores
avg_rouge1 = sum([score['rouge1'].fmeasure for score in rouge_scores]) / len(rouge_scores)
avg_rouge2 = sum([score['rouge2'].fmeasure for score in rouge_scores]) / len(rouge_scores)
avg_rougeL = sum([score['rougeL'].fmeasure for score in rouge_scores]) / len(rouge_scores)

print(f"Average ROUGE-1: {avg_rouge1:.4f}")
print(f"Average ROUGE-2: {avg_rouge2:.4f}")
print(f"Average ROUGE-L: {avg_rougeL:.4f}")


Average ROUGE-1: 0.0002
Average ROUGE-2: 0.0001
Average ROUGE-L: 0.0002


The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores are commonly used metrics for evaluating the quality of text generation models, particularly for tasks like summarization and translation. Here's a brief overview of what the different ROUGE scores represent:

- ROUGE-1: Measures the overlap of unigrams (single words) between the generated summary and the reference summary. It primarily captures the ability of the model to reproduce individual words from the reference.

- ROUGE-2: Measures the overlap of bigrams (two consecutive words) between the generated summary and the reference summary. This provides a sense of how well the model captures local context and phrase structure.

- ROUGE-L: Measures the longest common subsequence (LCS) between the generated summary and the reference summary. This captures the ability of the model to produce coherent and well-structured sentences that follow the sequence of the reference.
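
The three definitions above can be sketched from scratch in a few lines of Python. This is a simplified illustration, not the `rouge_score` implementation: it assumes lowercased whitespace tokenization with no stemming, and the example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f_measure(match, ref_total, gen_total):
    """Harmonic mean of precision (match/gen) and recall (match/ref)."""
    if match == 0 or ref_total == 0 or gen_total == 0:
        return 0.0
    precision, recall = match / gen_total, match / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, generated, n):
    """F-measure of clipped n-gram overlap."""
    ref = Counter(ngrams(reference.lower().split(), n))
    gen = Counter(ngrams(generated.lower().split(), n))
    overlap = sum((ref & gen).values())  # each n-gram counts at most min(ref, gen) times
    return f_measure(overlap, sum(ref.values()), sum(gen.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(reference, generated):
    """F-measure based on the longest common subsequence."""
    ref, gen = reference.lower().split(), generated.lower().split()
    return f_measure(lcs_length(ref, gen), len(ref), len(gen))


reference = "the cat sat on the mat"
generated = "the cat lay on the mat"

print(f"ROUGE-1: {rouge_n(reference, generated, 1):.4f}")
print(f"ROUGE-2: {rouge_n(reference, generated, 2):.4f}")
print(f"ROUGE-L: {rouge_l(reference, generated):.4f}")
```

A single changed word costs one unigram but two bigrams, so ROUGE-2 drops further than ROUGE-1 on this pair.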

