## Purpose

NOTE: Run this notebook on a Google Colaboratory A100 GPU or a similar spec, as training even a quantised model takes significant memory.

This notebook fine-tunes the model that performed best in the initial retrieval-augmented generation (RAG) vs large language model (LLM) experiments.

Mistral 7B Instruct is fine-tuned on a synthetic dataset of QA pairs generated with DeepSeek in other notebooks.

The dataset has 2 sources:

- QA pairs generated from individual markdown chunks containing text from the Citizens Information site
- QA pairs generated from leader questions to politicians in Dáil Éireann (sourced from the Oireachtas API and reformulated into clearer QA pairs using DeepSeek)

This was a brief exploration of the fine-tuning workflow. I learned a lot about quantisation, parameter-efficient fine-tuning (PEFT), low-rank adaptation (LoRA), and the classes that Hugging Face provides to ease fine-tuning of existing models.

A more robust analysis would extend the implementation below.

## Install additional packages

In [None]:
# !pip install -q transformers accelerate bitsandbytes peft
# !pip install -U datasets
# !pip install -U trl

!pip install transformers trl accelerate torch bitsandbytes peft datasets -q
!pip install -U datasets


## Imports

In [None]:
import os
import json
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from huggingface_hub import login, HfApi
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from typing import Optional, List, Tuple, Any, Union, Pattern, Dict, Callable
from trl import SFTTrainer


In [None]:
# Login to Hugging face - needed for upload of model and dataset at end of notebook
login()


In [None]:
def check_file_exists(file_path: str) -> bool:
  """
  This function checks if provided file path exists

  Args:
    file_path (str): file path which should be checked for file existence
  
  Returns:
    bool: whether the file exists or not
  """

  # os.path.isfile returns True only if a regular file exists at the provided file path
  return os.path.isfile(file_path)

In [None]:
def read_json_file(input_file_path: str, default: Optional[Any] = None) -> Any:
  """
  Reads content from the JSON file at the provided input file path

  Args:
    input_file_path (str): the file path where the target file resides
    default (Any): the value returned if the file is missing or unreadable (an empty list when not provided)
  
  Returns:
    Any: the content read from the file, or the default value
  """

  # Build a fresh list on each call rather than sharing one mutable default across calls
  if default is None:
    default = []

  # Call utility function to check if a file actually exists at the provided file path parameter
  # If it does not exist, return the default data structure
  if not check_file_exists(input_file_path):
    return default

  # Try to open the file and read the contents
  # If successful, return the contents. If not, report the error and return the default data structure
  try:
    with open(input_file_path, "r", encoding="utf-8") as input_file:
      content = json.load(input_file)

    print("Successfully loaded content from {} file".format(input_file_path))
    return content

  except (OSError, json.JSONDecodeError) as e:
    print("Error reading from file path: {} ({})".format(input_file_path, e))
    return default

In [None]:
def write_jsonl_file(output_file_path: str, data: Any) -> None:
  """
  Function that writes content to jsonl file at provided output file path. Used as Hugging Face format prefers jsonl file for datasets

  Args:
    output_file_path (str): where the jsonl file should be written in file system
    data (Any): the data to be written to the file
  """

  # Open the file (create if it doesn't exist); the ".jsonl" extension is appended to the provided path
  with open("{}.jsonl".format(output_file_path), "w", encoding="utf-8") as output_file:
    # Loop over items in data variable, for each use json dump method to write data and append new line
    for item in data:
      new_item = {"instruction": item.get("question", ""), "answer": item.get("answer", "")}

      json.dump(new_item, output_file)
      output_file.write("\n")
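As a quick sanity check on the JSONL format, the sketch below writes a few records in the same `instruction`/`answer` shape and reads them back line by line. The sample records and the temporary file are placeholders for illustration and are not part of the real dataset.

```python
import json
import os
import tempfile

# Placeholder records, shaped like the items written by write_jsonl_file
sample_records = [
    {"instruction": "Example question 1", "answer": "Example answer 1"},
    {"instruction": "Example question 2", "answer": "Example answer 2"},
]

# Write one JSON object per line (the JSONL convention)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.jsonl")
with open(sample_path, "w", encoding="utf-8") as f:
    for record in sample_records:
        json.dump(record, f)
        f.write("\n")

# Read the file back: each non-empty line parses as an independent JSON object
with open(sample_path, "r", encoding="utf-8") as f:
    loaded_records = [json.loads(line) for line in f if line.strip()]

print(len(loaded_records), "records read back")
```

Because every line is a complete JSON object, the file can be streamed or appended to without re-parsing the whole dataset.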

In [None]:
# Read merged synthetic dataset json file (Note: needs to be added at correct path)
dataset_content = read_json_file("/content/merged_final_qa_dataset.json")

Successfully loaded content from /content/merged_final_qa_dataset.json file


In [None]:
# Define target output path for jsonl file
DATASET_JSON_L_FILE_PATH = "/content/fine-tuning-data"

In [None]:
print(dataset_content[0])

{'chunk_text': 'Private health services\n\nEither individual health professionals or healthcare companies provide private healthcare services. Typically, you pay the full cost of private healthcare services, but you can buy private health insurance to help cover the cost.\n\nArrangements vary from one company to another, but most private healthcare companies have agreements with hospitals to pay the hospital directly. In general, for outpatient costs you pay the health professional and then claim back from the health insurance company. You should check with your own company as to their procedures.\n\nThe following companies offer voluntary private health insurance in Ireland:\n\nIrish Life Health\n\nLaya Healthcare\n\nVhi Healthcare\n\nHSF Health Plan (provides cash benefit plans but not in-patient health insurance)\n\nGeneral Practitioners (GPs)\n\nGeneral Practitioners (GPs) are family doctors.\n\nIn Ireland, GPs provide broad services to patients on all health issues and can refer y

In [None]:
# Generate jsonl file from synthetic dataset json file
write_jsonl_file(DATASET_JSON_L_FILE_PATH, dataset_content)

In [None]:
# Load Hugging Face dataset, using jsonl file created above to load in data
dataset = load_dataset("json", data_files="/content/fine-tuning-data.jsonl", cache_dir="./hf-cache")


In [None]:
type(dataset)

datasets.dataset_dict.DatasetDict

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'answer'],
        num_rows: 2228
    })
})


In [None]:
# Extracting out the data
full_dataset = dataset["train"]

In [None]:
print(len(full_dataset))

2228


In [None]:
# Classic train test split for data
split_dataset = full_dataset.train_test_split(test_size=0.1, seed=42)

In [None]:
# Extract training data
train_dataset = split_dataset["train"]

In [None]:
print(len(train_dataset))

2005


In [None]:
# Extract validation data
test_dataset= split_dataset["test"]

In [None]:
print(len(test_dataset))

223
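The split sizes printed above can be reproduced with a little arithmetic. This is a sketch that assumes the test share is rounded up and the remainder goes to training; the exact rounding rule inside `train_test_split` is an assumption here, though it agrees with the 2005 and 223 rows shown.

```python
import math

n_rows = 2228      # rows in the full dataset, as printed above
test_size = 0.1    # fraction passed to train_test_split

# Assumed rounding convention: test rows are rounded up, the rest go to train
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```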


In [None]:
# Mistral 7B Instruct was the best-performing model in the experiments, so it was selected for fine-tuning to enable a comparison between
# Mistral 7B with RAG and fine-tuned Mistral 7B
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

In [None]:
# Initialise the bitsandbytes config for model quantisation (required because training puts more pressure on available GPU memory)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
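To give a feel for why 4-bit loading matters, here is a back-of-the-envelope estimate of weight memory. The parameter count is an assumed approximation for Mistral 7B, and the figures ignore quantisation constants, activations, gradients, and optimiser state, so real usage will be higher.

```python
# Rough weight-memory estimate; the parameter count is an assumed approximation
n_params = 7.25e9

bytes_fp16 = n_params * 2      # 16-bit weights: 2 bytes per parameter
bytes_4bit = n_params * 0.5    # 4-bit weights: half a byte per parameter

gib = 1024 ** 3
print("fp16 weights:  {:.1f} GiB".format(bytes_fp16 / gib))
print("4-bit weights: {:.1f} GiB".format(bytes_4bit / gib))
```

The 4x reduction applies to the frozen base weights only; the LoRA adapters and the training-time buffers still sit on top of this.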

In [None]:
# Initialise tokeniser
tokeniser = AutoTokenizer.from_pretrained(model_id,
                                          # padding_side="right",
                                          # add_eos_token=True
                                          )


In [None]:
# Ensure that tokeniser has padding token
tokeniser.pad_token = tokeniser.eos_token
# tokeniser.pad_token = tokeniser.unk_token
# tokeniser.padding_side = "right"

In [None]:
# Load quantised version of Mistral 7B instruct
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map="auto",
                                             torch_dtype=torch.float16,
                                             trust_remote_code=True,
                                             use_cache=False)


In [None]:
# Model settings
model.config.use_cache = False
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

In [None]:
def format_example(example):
  """
  Formats a single QA pair into the instruction/response prompt template used for fine-tuning the model
  """

  return """### Instruction:
{}

### Response:
{}""".format(example["instruction"].strip(), example["answer"].strip())
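For illustration, this is what the template produces for a single record. The function is repeated so the snippet runs on its own, and the sample record is hypothetical, loosely based on the chunk text printed earlier.

```python
def format_example(example):
  # Same template as the notebook function, repeated so this snippet runs on its own
  return """### Instruction:
{}

### Response:
{}""".format(example["instruction"].strip(), example["answer"].strip())

# Hypothetical record, for illustration only
sample = {"instruction": "  What is a GP?  ", "answer": "A family doctor.\n"}
formatted = format_example(sample)
print(formatted)
```

Note that `strip()` removes the stray whitespace around both fields, so the template spacing stays consistent across records.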

In [None]:
# Map over each item in training dataset, apply format_example function to each to get model inputs into correct format
train_dataset = train_dataset.map(lambda x: {"text": format_example(x)})


In [None]:
# Map over each item in test dataset, apply format example function to each to get model inputs into correct format
test_dataset = test_dataset.map(lambda x: {"text": format_example(x)})


In [None]:
def tokenise_example(example):
  """
  Use tokeniser to tokenise input QA pair that will be used for fine-tuning
  """

  tokenised = tokeniser(example["text"], padding="max_length", truncation=True, add_special_tokens=True, max_length=2048)
  tokenised["labels"] = tokenised["input_ids"].copy()
  return tokenised
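One thing worth checking with this setup: the padding token is set to the EOS token and every example is padded to `max_length`, so copying `input_ids` straight into `labels` means padded positions may count towards the loss. A common remedy is to set those label positions to `-100`. The helper below is a hypothetical sketch of that idea on plain lists; whether the trainer's data collator already does this depends on the TRL version, which is not verified here.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Hypothetical token ids: three real tokens followed by two padding tokens
example_ids = [1, 733, 2, 2, 2]
example_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(example_ids, example_mask)
print(labels)
```

Using the attention mask rather than the token id keeps the genuine end-of-sequence token in the labels while dropping only the padding.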

In [None]:
# Use above to map over train dataset, create tokenised version of each input
tokenised_train_dataset = train_dataset.map(tokenise_example, batched=True, remove_columns=train_dataset.column_names)


In [None]:
# Use above to map over test dataset, create tokenised version of each input
tokenised_test_dataset = test_dataset.map(tokenise_example, batched=True, remove_columns=test_dataset.column_names)


In [None]:
# Create the LoRA configuration for PEFT (note: different values of r, lora_alpha and lora_dropout were explored)
peft_config = LoraConfig(
    r=16,
    # r=8,
    lora_alpha=32,
    # lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    # lora_dropout=0.2,
    bias="none",
    task_type="CAUSAL_LM",
)
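To see how small the trainable part is, the sketch below counts the LoRA parameters this configuration would add. The model dimensions are assumptions taken from the published Mistral 7B configuration (hidden size 4096, 32 layers, grouped-query attention with a 1024-wide key/value projection) and are not read from the loaded model.

```python
# Assumed Mistral 7B dimensions, taken from the published model config
hidden_size = 4096
num_layers = 32
kv_dim = 1024   # 8 key/value heads x head size 128 (grouped-query attention)
r = 16          # LoRA rank from the config above

# (input features, output features) of each targeted projection
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
}

# Each adapted projection adds A (r x in) and B (out x r): r * (in + out) parameters
params_per_layer = sum(r * (d_in + d_out) for d_in, d_out in target_shapes.values())
total_lora_params = params_per_layer * num_layers

print(total_lora_params)
```

Under these assumptions the adapters come to roughly 13.6 million parameters, a small fraction of the base model.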

In [None]:
# Prepare the model for training
model = prepare_model_for_kbit_training(model)

In [None]:
# Initialise peft version of model, using PEFT configuration shown above
model = get_peft_model(model, peft_config)

In [None]:
# Define the training arguments used for fine-tuning
# Note: the commented-out values record other settings explored during experimentation
# Note: the model is checkpointed and evaluated every 50 steps, with only 1 training epoch set for the run
# One key observation during experimentation was how much GPU memory training demands, even with the quantised model on a Colab A100

training_arguments = TrainingArguments(
    output_dir="/content/results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    # per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    # save_steps=20,
    save_steps=50,
    # logging_steps=20,
    logging_steps=50,
    # learning_rate=2e-4,
    learning_rate=1e-4,
    # learning_rate=5e-5,
    weight_decay=0.001,
    # fp16=False,
    fp16=True,
    # bf16=False,
    max_grad_norm=0.3,
    # max_grad_norm=0.25,
    max_steps=-1,
    eval_strategy="steps",
    # eval_steps=20,
    eval_steps=50,
    # The "constant" scheduler applies no warmup, so warmup_ratio has no effect with it
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="wandb",
    remove_unused_columns=False,
)
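The batch settings above translate into a fairly short run. The arithmetic below is a sketch that assumes a single GPU and the 2005 training rows from the split; the exact rounding of the final partial batch may differ between Trainer versions.

```python
import math

n_train = 2005                 # training rows after the split
per_device_batch_size = 4
grad_accum_steps = 2
eval_steps = 50

# Assumes a single GPU, so the per-device batch size is the whole micro-batch
effective_batch_size = per_device_batch_size * grad_accum_steps
micro_batches = math.ceil(n_train / per_device_batch_size)
optimizer_steps = math.ceil(micro_batches / grad_accum_steps)
n_evaluations = optimizer_steps // eval_steps

print(effective_batch_size, optimizer_steps, n_evaluations)
```

With one epoch this gives an effective batch size of 8 and about 251 optimiser steps, so evaluating every 50 steps yields only a handful of validation points.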

In [None]:
# Initialise Supervised Fine-Tuning trainer from Hugging Face, pass model, PEFT configuration, tokeniser, training arguments & train and test datasets
# to be used in fine-tuning to this
trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  processing_class=tokeniser,
  args=training_arguments,
  train_dataset=tokenised_train_dataset,
  eval_dataset=tokenised_test_dataset,
)


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [None]:
# Call the train method on the trainer to kick off fine-tuning
# Note: this requires a Weights & Biases (wandb.ai) account, which is free to set up
trainer.train()





Step,Training Loss,Validation Loss
50,1.4903,2.3e-05




KeyboardInterrupt: 

In [None]:
# Call save methods available on classes to store fine-tuned versions of the model and tokeniser
trainer.save_model("/content/results/final_model")
tokeniser.save_pretrained("/content/results/final_model")

('/content/results/final_model/tokenizer_config.json',
 '/content/results/final_model/special_tokens_map.json',
 '/content/results/final_model/chat_template.jinja',
 '/content/results/final_model/tokenizer.model',
 '/content/results/final_model/added_tokens.json',
 '/content/results/final_model/tokenizer.json')

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Load the fine-tuned model & tokeniser to validate that generation works before pushing to the Hugging Face Hub
model_dir = "/content/results/final_model"

# Initialise fine-tuned tokeniser and fine-tuned model from saved directory
tokeniser = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)


In [None]:
# Initialise the Hugging Face pipeline with the fine-tuned model and tokeniser
generator = pipeline("text-generation", model=model, tokenizer=tokeniser, device=0)  # device=0 if using GPU

# This was tested to see if model was actually generating output after the last input prompt token
# One issue that was discovered was extremely repetitive output (perhaps due to overfitting with limited dataset for fine-tuning)
prompt = """Finish this sentence:
"""

output = generator(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
    max_new_tokens=1024,
    repetition_penalty=1.1,
)

print(output[0]['generated_text'])

Device set to use cuda:0


Finish this sentence: 



In [None]:
output = generator(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
    max_new_tokens=2048,
    repetition_penalty=1.1,
)

print(output[0]['generated_text'])

<s>[INST]Who is entitled to Irish citizenship?[/INST]</s>
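Since the model was fine-tuned on the `### Instruction:` / `### Response:` template, it may be worth prompting it in the same shape at inference time. The helpers below are hypothetical and only handle strings; the generated text is a stand-in, not real model output, and whether matching the template improves the answers is an untested assumption.

```python
RESPONSE_MARKER = "### Response:"

def build_prompt(question):
    """Wrap a question in the fine-tuning template, leaving the response empty."""
    return "### Instruction:\n{}\n\n{}\n".format(question.strip(), RESPONSE_MARKER)

def extract_response(generated_text):
    """Return only the text after the response marker."""
    _, _, response = generated_text.partition(RESPONSE_MARKER)
    return response.strip()

prompt_text = build_prompt("Who is entitled to Irish citizenship?")

# Stand-in for pipeline output, which by default returns the prompt plus the continuation
fake_generated_text = prompt_text + "A hypothetical model answer."
print(extract_response(fake_generated_text))
```

In the notebook, `prompt_text` would be passed to `generator(...)` and `extract_response` applied to `output[0]['generated_text']`.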



In [None]:
# Declare model name to be published on Hugging Face
model_repo_name = "johndennehy101/Mistral-7B-Instruct-v0.3-finetune-irish-citizen-info-v1"

In [None]:
# Push both the model and tokeniser used in fine-tuning to Hugging Face
model.push_to_hub(model_repo_name)
tokeniser.push_to_hub(model_repo_name)


CommitInfo(commit_url='https://huggingface.co/johndennehy101/Mistral-7B-Instruct-v0.3-finetune-irish-citizen-info-v1/commit/7466ebf5f2cfd439fa3d00f6c63e6e5e632470ce', commit_message='Upload tokenizer', commit_description='', oid='7466ebf5f2cfd439fa3d00f6c63e6e5e632470ce', pr_url=None, repo_url=RepoUrl('https://huggingface.co/johndennehy101/Mistral-7B-Instruct-v0.3-finetune-irish-citizen-info-v1', endpoint='https://huggingface.co', repo_type='model', repo_id='johndennehy101/Mistral-7B-Instruct-v0.3-finetune-irish-citizen-info-v1'), pr_revision=None, pr_num=None)

In [None]:
# Also use the Hugging Face API to push the dataset used for training to the Hugging Face Hub
api = HfApi()
dataset_repo_id = "johndennehy101/irish-citizen-information-fine-tuning-data"

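# exist_ok=True lets this cell be re-run without failing if the repo was already created
api.create_repo(repo_id=dataset_repo_id, repo_type="dataset", exist_ok=True)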

api.upload_file(path_or_fileobj="/content/fine-tuning-data.jsonl",
                path_in_repo="fine-tuning-data.jsonl",
                repo_id=dataset_repo_id,
                repo_type="dataset")

CommitInfo(commit_url='https://huggingface.co/datasets/johndennehy101/irish-citizen-information-fine-tuning-data/commit/988b371b7728a5cce26bb8178775beb848dabdea', commit_message='Upload fine-tuning-data.jsonl with huggingface_hub', commit_description='', oid='988b371b7728a5cce26bb8178775beb848dabdea', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/johndennehy101/irish-citizen-information-fine-tuning-data', endpoint='https://huggingface.co', repo_type='dataset', repo_id='johndennehy101/irish-citizen-information-fine-tuning-data'), pr_revision=None, pr_num=None)