# Fine-Tuning Llama 2

In this notebook, you will see how to use your previously created dataset to fine-tune Llama 2.

Don't forget to run this notebook on a T4 GPU.

In [2]:
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes wandb

In [3]:
!pip install -q evaluate




In [None]:
import wandb
my_secret = "Token_Here"

wandb.login(key=my_secret)

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mmaymonah-althunayan[0m ([33mmaymonah-althunayan-tuwaiq-academy[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [5]:
run = wandb.init(
    project = "GovernAI",
)

In [None]:
from google.colab import userdata

# Defined in the secrets tab in Google Colab
hf_token = "Token_Here"

In [7]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer

## Fine-tuning Llama 2 model

To drastically reduce VRAM usage, we must **fine-tune the model in 4-bit precision**: the base weights are loaded in 4-bit and only small LoRA adapters are trained on top of them, which is the QLoRA approach.
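As a rough illustration of why precision matters, the sketch below estimates the memory needed for the model weights alone at different precisions. The parameter count of about 6.74 billion is an approximation assumed here for the 7B model, and the estimate ignores activations, optimizer state, quantization constants, and any layers that stay in higher precision.

```python
GIB = 1024 ** 3

# Assumption: approximate parameter count of the 7B model (not taken from this notebook).
N_PARAMS = 6.74e9


def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights only, in GiB."""
    return n_params * bits_per_param / 8 / GIB


for name, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{name:>6}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights alone come to roughly 12.6 GiB, which leaves little room on a T4, while the 4-bit weights need about 3.1 GiB.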

In [8]:
# Model
base_model = "meta-llama/Llama-2-7b-chat-hf"
new_model = "llama-2-7b-GovernAI"

# Dataset
dataset = load_dataset("json", data_files="/content/llm_qa_dataset_100.jsonl", split="train")

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"


Learn more about padding [in the following article](https://medium.com/towards-data-science/padding-large-language-models-examples-with-llama-2-199fb10df8ff) written by Benjamin Marie.
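To make the effect of `padding_side = "right"` concrete, here is a small pure-Python sketch of right padding with an attention mask. The helper `pad_right` and the token ids are made up for illustration, and `pad_id=0` is an assumption standing in for the id of the unknown token used as the pad token above.

```python
def pad_right(sequences, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids for two sequences of different length
batch = [[1, 306, 4966], [1, 15043]]
ids, mask = pad_right(batch, pad_id=0)
print(ids)
print(mask)
```

The tokenizer does the equivalent work when it is called with `padding=True`; the sketch only shows where the pad tokens end up.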

In [9]:
# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# LoRA configuration (Low-Rank Adaptation)
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto"  # This automatically maps the model across available GPUs
)

# Prepare model for training
model = prepare_model_for_kbit_training(model)


![](https://i.imgur.com/bBf6ARw.png)

See Hugging Face's [Llama implementation](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L229C4-L229C4) for more information about target modules.
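The choice of target modules determines how many parameters LoRA actually trains. The sketch below counts them for `r=8`, assuming a hidden size of 4096, an MLP size of 11008, and 32 decoder layers for the 7B model, and assuming that PEFT falls back to `q_proj` and `v_proj` for Llama when `target_modules` is left unset. These shapes and the default are assumptions, not values read from this notebook.

```python
HIDDEN = 4096   # assumed hidden size of the 7B model
MLP = 11008     # assumed intermediate (MLP) size
LAYERS = 32     # assumed number of decoder layers
R, ALPHA = 8, 32

# (input dim, output dim) of each linear layer in one decoder block
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(target_modules, r=R):
    """Each adapted layer adds A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * LAYERS


default_count = lora_param_count(["q_proj", "v_proj"])
all_linear_count = lora_param_count(list(SHAPES))
scaling = ALPHA / R  # factor applied to the LoRA update

print(f"q_proj + v_proj : {default_count:,}")
print(f"all seven layers: {all_linear_count:,}")
print(f"scaling factor  : {scaling}")
```

Under these assumptions, uncommenting the full `target_modules` list would train several times more parameters than the default, at the cost of more memory per step.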

In [10]:
def preprocess_function(examples):
    return {
        "text": examples["question"] + " " + examples["answer"]  # Merge question and answer
    }

dataset = dataset.map(preprocess_function)
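The preprocessing above joins the question and answer with a single space, while the inference prompt used later in this notebook follows a `### Instruction:` / `### Response:` layout. A minimal sketch of one way to keep the two consistent follows; `PROMPT_TEMPLATE`, `build_prompt`, and `build_training_text` are hypothetical names that are not part of the original notebook, and the sample record is made up.

```python
# Hypothetical helpers, not part of the original notebook.
PROMPT_TEMPLATE = "### Instruction:\n{question}\n\n### Response:\n"


def build_prompt(question: str) -> str:
    """Prompt shown to the model at inference time."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def build_training_text(example: dict) -> dict:
    """Training text = the same prompt followed by the reference answer."""
    return {"text": build_prompt(example["question"]) + example["answer"].strip()}


# Made-up record with the same fields as the dataset
sample = {
    "question": "When should data be classified?",
    "answer": "Upon creation or receipt.",
}
print(build_training_text(sample)["text"])
```

A function like `build_training_text` could be passed to `dataset.map(...)` in place of `preprocess_function`, so that the model sees the same layout during training and generation.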



In [11]:
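from evaluate import load

def compute_metrics(eval_pred):
    # Note: eval_pred is ignored. MAUVE is computed on the fixed lists below,
    # where `predictions` holds the test questions rather than model outputs.
    # This function is also not passed to the trainer in this notebook.
    mauve = load('mauve')
    predictions = [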
        "What are the key data classification principles in our organization?",
        "How should we handle integrated datasets with mixed classification levels?",
        "What is the 'Open by Default' principle and how is it applied differently across sectors?",
        "When should data classification be timebound?",
        "How does the 'Segregation of Duties' principle affect data handling responsibilities?",
        "What factors determine the classification level according to the 'Necessity and Proportionality' principle?"
    ]
    references = [
        "The key data classification principles include: Open by Default, Necessity and Proportionality, Timely Classification, Highest Level of Protection, and Segregation of Duties.",
        "According to Principle 4 (Highest Level of Protection), if information includes an integrated dataset with different classification levels, the highest classification level shall be approved.",
        "The Open by Default principle states that data shall primarily be accessible in the development sector unless its sensitivity requires higher protection, and top secret in political and security sectors unless its sensitivity requires lower protection.",
        "According to Principle 3 (Timely Classification), data shall be classified upon creation or receipt from other entities, and said classification should be timebound.",
        "The Segregation of Duties principle requires that worker responsibilities related to data classification, access, disclosure, use, modification, or destruction shall be segregated to prevent overlap of powers and avoid dispersal of responsibilities.",
        "According to the Necessity and Proportionality principle, data shall be classified based on its nature, sensitivity, and impact, balancing its value against its confidentiality level."
    ]
    mauve_results = mauve.compute(predictions=predictions, references=references)
    print(mauve_results.mauve)
    return mauve_results.mauve

In [12]:
# Set training arguments
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=1,
    eval_strategy="steps",
    eval_steps=1000,
    logging_steps=1,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=10,
    report_to="wandb",
    # max_steps=2,
)

# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=dataset,
    peft_config=peft_config,
    args=training_arguments,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


Step,Training Loss,Validation Loss




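Because `report_to="wandb"` is set with `logging_steps=1`, the training loss is logged to Weights & Biases at every step, which makes it easy to track training progress.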
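It can help to work out how long the run actually is. The sketch below redoes the step arithmetic from the training arguments above, assuming a single GPU and the 100-example dataset; `linear_lr` is a hypothetical helper that mirrors a linear schedule with warmup and is not the trainer's own code.

```python
import math

# Values taken from the cells above; a single GPU is assumed
num_examples = 100
batch_size = 10
grad_accum = 1
epochs = 5
eval_steps = 1000
warmup_steps = 10
base_lr = 2e-4

steps_per_epoch = math.ceil(num_examples / batch_size) // grad_accum
total_steps = steps_per_epoch * epochs
num_evaluations = total_steps // eval_steps


def linear_lr(step: int) -> float:
    """Linear warmup to base_lr, then linear decay to zero (hypothetical helper)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(f"optimizer steps: {total_steps}")
print(f"evaluations triggered by eval_steps={eval_steps}: {num_evaluations}")
print(f"learning rate at step 30: {linear_lr(30):.1e}")
```

With these numbers the run is shorter than `eval_steps`, so step-based evaluation would never fire; a smaller `eval_steps` would be needed to see a validation loss during training.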

In [13]:
# Run text generation pipeline with our model
prompt = "What is a large language model?"
instruction = f"### Instruction:\n{prompt}\n\n### Response:\n"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=128)
result = pipe(instruction)
print(result[0]['generated_text'][len(instruction):])

Device set to use cuda:0
Caching is incompatible with gradient checkpointing in LlamaDecoderLayer. Setting `past_key_value=None`.
  return fn(*args, **kwargs)

A  Why I The How  Who The We P M “ This What Who What What The Who What Who


In [17]:
# Empty VRAM
del model
del pipe
del trainer
# Collect garbage (the GPU cache is cleared in a later cell)
import gc
gc.collect()
gc.collect()

0

Merging the base model with the trained adapter.
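Conceptually, merging folds the low-rank update into the base weight, so the adapter is no longer needed at inference time. The toy sketch below illustrates the idea on tiny matrices in pure Python; `merge_lora`, `matmul`, and `matvec` are hypothetical helpers and the numbers are made up, so this is not how PEFT implements `merge_and_unload()` internally.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy shapes: W is 2x3, A is r x 3, B is 2 x r, with r = 1
W = [[1, 0, 2], [0, 1, -1]]
A = [[1, 2, 3]]
B = [[1], [-1]]
alpha, r = 2, 1
x = [1, 1, 1]

# Adapter kept separate: W x + (alpha / r) * B (A x)
separate = [
    base + (alpha / r) * upd
    for base, upd in zip(matvec(W, x), matvec(B, matvec(A, x)))
]
# Adapter merged into the weight
merged = matvec(merge_lora(W, A, B, alpha, r), x)

print(separate, merged)
```

Both paths give the same output, which is why the merged model can be saved and served as an ordinary checkpoint.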

In [18]:
torch.cuda.empty_cache()

In [21]:
device = "cuda:0"

# Reload model in FP16 and merge it with LoRA weights
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    # device_map={"": 0},
    device_map="auto"
)
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"




OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 4.12 MiB is free. Process 466876 has 14.73 GiB memory in use. Of the allocated memory 14.14 GiB is allocated by PyTorch, and 475.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Finally, push the model and tokenizer to the Hugging Face Hub.

In [None]:
# Push the model to the Hub
model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
# Push the tokenizer to the Hub
tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)

## Evaluation

In [None]:
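import evaluate
from nltk.tokenize import word_tokenize  # requires NLTK tokenizer data, e.g. nltk.download("punkt")
from sklearn.metrics import f1_score

# Note: generate_response(question) is not defined in this notebook.
# It must be provided before this cell is run.
def evaluate_rag_system():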
    # Test set
    test_questions = [
        "What are the key data classification principles in our organization?",
        "How should we handle integrated datasets with mixed classification levels?",
        "What is the 'Open by Default' principle and how is it applied differently across sectors?",
        "When should data classification be timebound?",
        "How does the 'Segregation of Duties' principle affect data handling responsibilities?",
        "What factors determine the classification level according to the 'Necessity and Proportionality' principle?"
    ]

    test_refs = [
        "The key data classification principles include: Open by Default, Necessity and Proportionality, Timely Classification, Highest Level of Protection, and Segregation of Duties.",
        "According to Principle 4 (Highest Level of Protection), if information includes an integrated dataset with different classification levels, the highest classification level shall be approved.",
        "The Open by Default principle states that data shall primarily be accessible in the development sector unless its sensitivity requires higher protection, and top secret in political and security sectors unless its sensitivity requires lower protection.",
        "According to Principle 3 (Timely Classification), data shall be classified upon creation or receipt from other entities, and said classification should be timebound.",
        "The Segregation of Duties principle requires that worker responsibilities related to data classification, access, disclosure, use, modification, or destruction shall be segregated to prevent overlap of powers and avoid dispersal of responsibilities.",
        "According to the Necessity and Proportionality principle, data shall be classified based on its nature, sensitivity, and impact, balancing its value against its confidentiality level."
    ]


    # Generate predictions
    pred_responses = [generate_response(question) for question in test_questions]

    # BLEU Score
    bleu = evaluate.load("bleu")
    bleu_results = bleu.compute(predictions=pred_responses, references=[[ref] for ref in test_refs])
    print(f"BLEU Score: {bleu_results['bleu']:.2f}")

    # ROUGE Score
    rouge = evaluate.load("rouge")
    rouge_results = rouge.compute(predictions=pred_responses, references=test_refs)
    for key, value in rouge_results.items():
        print(f"{key}: {value:.2f}")

    # F1 Score
    pred_tokens = [word_tokenize(response) for response in pred_responses]
    ref_tokens = [word_tokenize(ref) for ref in test_refs]
    pred_flat = [token for sublist in pred_tokens for token in sublist]
    ref_flat = [token for sublist in ref_tokens for token in sublist]

    # Align the lengths of the token lists
    min_length = min(len(pred_flat), len(ref_flat))
    pred_flat = pred_flat[:min_length]
    ref_flat = ref_flat[:min_length]

    # Calculate F1 score
    f1 = f1_score(ref_flat, pred_flat, average='weighted')
    print(f"F1 Score: {f1:.2f}")
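The F1 computation above compares tokens position by position after truncating both lists, which treats each token as a class label. A different and commonly used option for generated answers is a token-overlap F1 per question, in the style of extractive QA benchmarks. The sketch below is that swapped-in technique, not the method used in this notebook; `token_f1` is a hypothetical helper that uses simple whitespace tokenization.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one prediction and one reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1(
    "data shall be classified",
    "data shall be classified upon creation",
)
print(f"token F1: {score:.2f}")
```

Averaging `token_f1` over the prediction and reference pairs gives a single score that does not depend on the two texts having the same length or word order.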
