<a href="https://colab.research.google.com/github/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/Llama_3_8B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune and Convert Llama3-8B to a GGUF Model File for CPU and GPU Inference 🚀

Hi everyone!

Over the past few months, open-source AI models have taken the community by storm. However, there’s been a lack of comprehensive guides showing how to fine-tune a model and convert it into a format ready for deployment on both CPU and GPU setups. So, we’ve created one for you! 🎉

## In this guide, we’ll cover the following steps:

1. **Fine-tune Llama3-8B** using an innovative method called **ORPO**. This technique combines Supervised Fine-tuning with Direct Preference Optimization, all in a single streamlined step.  
2. **Install `llama.cpp`** and convert the model into the **GGUF** format, while discussing various options like LoRA integration.  
3. **Create a deployment configuration file**, which acts like a blueprint for packaging and deploying models.  
4. **Test the model locally on both CPU and GPU** and then **push it to a shared hub**.  
5. **Pull and deploy the model on another machine** to run inference efficiently using either CPU or GPU resources.  

Grab some snacks, fasten your seatbelt, and let’s dive in! 🚀


# Phase 1: Finetune Llama3 with ORPO

So far, we've released guides on finetuning Llama3 with two different approaches:
1. [Supervised Finetuning](https://github.com/brevdev/notebooks/blob/main/llama3_finetune_inference.ipynb)
2. [Direct Preference Optimization](https://github.com/brevdev/notebooks/blob/main/llama3dpo.ipynb)

If you haven't taken a look at these, I suggest skimming through the code for the SFT notebook and the explanation at the top of the DPO notebook. TL;DR: most finetuning guides online use SFT. SFT is good at adapting a model to domain-specific output, but it can also increase the probability of generating undesirable outputs. To solve this issue, we can then perform a process known as preference alignment. This can be done using RLHF or DPO. However, these are both computationally expensive.

Enter Odds Ratio Preference Optimization (ORPO). At a high level, this combines SFT and DPO into one neat step that weakly penalizes rejected responses while strongly rewarding preferred responses. Here we optimize for two objectives at once: learning domain-specific output AND aligning the output with our preferences. If you'd like to dive deeper into ORPO, check out MLabonne's [fantastic guide](https://huggingface.co/blog/mlabonne/orpo-llama-3) on Hugging Face.

![ORPO](https://i.imgur.com/ftrth4Q.png)
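
To make the "two objectives at once" idea concrete, here is a minimal sketch of the ORPO loss for a single training example, written in plain Python. It assumes you already have the SFT loss on the chosen response and the average per-token log-probabilities of the chosen and rejected responses. The function names are made up for illustration, and this is not how TRL implements it internally.

```python
import math

def log_odds(logp: float) -> float:
    """log(p / (1 - p)) for p = exp(logp). Requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(nll_chosen: float, logp_chosen: float, logp_rejected: float, beta: float = 0.1) -> float:
    """SFT loss on the chosen response plus a weighted odds-ratio penalty."""
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll_chosen + beta * odds_ratio_term

# Hypothetical numbers: the same SFT loss, with the preference satisfied vs. violated
good = orpo_loss(nll_chosen=1.2, logp_chosen=-0.4, logp_rejected=-1.5)
bad = orpo_loss(nll_chosen=1.2, logp_chosen=-1.5, logp_rejected=-0.4)
print(f"preferred answer more likely: {good:.3f}")
print(f"rejected answer more likely:  {bad:.3f}")
```

The loss is lower when the model already assigns higher odds to the chosen response, and `beta` controls how much that preference term matters relative to plain SFT.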


## Install Dependencies

In [1]:
!pip install bitsandbytes wandb transformers peft accelerate trl  # Install bitsandbytes for efficient model training, wandb for experiment tracking, transformers for pre-trained models, peft for parameter-efficient fine-tuning, accelerate for distributed training, and trl for reinforcement learning with transformers.

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Collecting trl
  Downloading trl-0.13.0-py3-none-any.whl.metadata (11 kB)
Collecting datasets>=2.21.0 (from trl)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets>=2.21.0->trl)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets>=2.21.0->trl)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets>=2.21.0->trl)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets>=2.21.0->trl)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl (69.1 MB)

In [2]:
import gc
# Import the garbage collection module, which can be used to manage memory by manually releasing unused objects.

import os
# Import the OS module to interact with the operating system, such as handling file paths and environment variables.

import torch
# Import the PyTorch library for deep learning tasks, which provides support for tensors and GPU-based computation.

import wandb
# Import the Weights & Biases (wandb) library, which is used for tracking experiments, visualizing metrics, and logging hyperparameters during training.

from datasets import load_dataset
# Import the `load_dataset` function from the Hugging Face Datasets library to easily load datasets for training.

from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
# Import components from the PEFT library: LoraConfig to configure LoRA (Low-Rank Adaptation) for efficient fine-tuning,
# PeftModel to apply PEFT-based modifications, and prepare_model_for_kbit_training to prepare a model for efficient training with low-precision (e.g., 8-bit).

from transformers import (
    AutoModelForCausalLM,  # Import AutoModelForCausalLM to automatically load a pre-trained causal language model.
    AutoTokenizer,  # Import AutoTokenizer to automatically load a tokenizer corresponding to the pre-trained model.
    BitsAndBytesConfig,  # Import BitsAndBytesConfig to configure quantized (4-bit or 8-bit) model loading with bitsandbytes.
    TrainingArguments,  # Import TrainingArguments to define the configuration for model training, such as batch size, learning rate, etc.
    pipeline,  # Import pipeline for easy inference and processing tasks with pre-trained models.
)

from trl import ORPOConfig, ORPOTrainer
# Import ORPOConfig to configure the ORPO fine-tuning method and ORPOTrainer to handle the training process for models using the ORPO technique (Odds Ratio Preference Optimization).

## Load Model, Tokenizer, and Dataset

We'll be working with the Llama3-8B instruct model. But the steps are almost the same with any other model if you decide to finetune with a LoRA adapter.

In [3]:
# Model
base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
# Define the base pre-trained model to use. Here, "meta-llama/Meta-Llama-3-8B-Instruct" refers to the 8 billion parameter version of Llama 3 trained for instruction-following tasks.

new_model = "Llama-3-8B-adapter"
# Define the name for the new model that will be created. This is typically a model that has been fine-tuned or adapted from the base model, using methods like LoRA, to create a specialized version for a specific task or domain.

In [4]:
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Configure the model to be loaded in 4-bit precision to reduce memory usage and improve training efficiency.
    bnb_4bit_quant_type="nf4",  # Specify the quantization type for 4-bit precision. "nf4" (4-bit NormalFloat) is the data type introduced with QLoRA for memory-efficient training.
    bnb_4bit_compute_dtype=torch.bfloat16,  # Set the compute data type to bfloat16, a 16-bit floating-point format optimized for training on modern hardware like TPUs and GPUs.
    bnb_4bit_use_double_quant=True,  # Enable double quantization, which also quantizes the quantization constants to save a bit more memory.
)
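
To see why loading in 4-bit matters, here is a rough back-of-the-envelope estimate of the memory needed just to hold the weights at different precisions. It assumes a round 8 billion parameters and ignores quantization constants, activations, optimizer state, and the LoRA adapter, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gib(n_params: int, bits_per_weight: float) -> float:
    """Memory needed to store n_params weights at the given precision, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 8_000_000_000  # assumption: a round 8B parameters

for label, bits in [("fp32", 32), ("bf16", 16), ("nf4", 4)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```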

In [5]:
# LoRA (Low-Rank Adaptation) config
peft_config = LoraConfig(
    r=16,  # Set the rank of the low-rank adaptation matrices. The rank controls the number of parameters introduced by LoRA. A higher rank allows for more capacity but increases the memory and computation cost.
    lora_alpha=32,  # Set the scaling factor for LoRA layers. It controls the magnitude of the adapted weights, affecting how much influence the LoRA adaptation has on the original model.
    lora_dropout=0.05,  # Set the dropout rate for the LoRA layers. Dropout helps prevent overfitting by randomly deactivating some neurons during training.
    bias="none",  # Specify which bias parameters are trained. "none" means no biases are updated; other options are "all" and "lora_only".
    task_type="CAUSAL_LM",  # Set the type of task for the model. "CAUSAL_LM" indicates a causal language model, such as GPT, which generates text based on preceding tokens.
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']  # Specify the target modules in the model to apply LoRA. These modules are typically projection layers in transformer architectures.
)
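
As a sanity check on how small the adapter is, the sketch below counts the parameters LoRA adds for this config. Each adapted linear layer gets two matrices, `A` (r x in) and `B` (out x r), so it adds `r * (in + out)` parameters. The layer dimensions are assumptions based on the published Llama-3-8B architecture; double-check them against the model's `config.json` if you swap in another model.

```python
# Assumed Llama-3-8B dimensions (verify against the model's config.json)
HIDDEN = 4096         # hidden_size
INTERMEDIATE = 14336  # intermediate_size of the MLP
KV_DIM = 1024         # 8 key/value heads * head_dim 128 (grouped-query attention)
NUM_LAYERS = 32
RANK = 16             # r in the LoraConfig above

# (in_features, out_features) for every module listed in target_modules
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapter is well under 1% of the size of the base model, which is why it trains comfortably on a single GPU.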

In [6]:
from huggingface_hub import notebook_login
# Import the `notebook_login` function from the Hugging Face Hub library. This function facilitates logging into Hugging Face from a Jupyter notebook, allowing you to easily access and upload models and datasets.

notebook_login()
# Call the `notebook_login()` function to authenticate your Hugging Face account in the notebook. It will prompt you to enter an authentication token, which grants access to private models or allows you to upload your own models and datasets to the Hugging Face Hub.


In [7]:
tokenizer = AutoTokenizer.from_pretrained(base_model)
# Load the tokenizer for the pre-trained model specified by `base_model`. The `AutoTokenizer.from_pretrained` function automatically selects the appropriate tokenizer based on the model architecture (e.g., BERT, GPT, etc.).

tokenizer.pad_token = tokenizer.eos_token
# Set the padding token (`pad_token`) to be the same as the end-of-sequence token (`eos_token`). This is often done to simplify tokenization when using models that don't have a specific padding token.

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.
401 Client Error. (Request ID: Root=1-6789f6e9-4dc7730a618d14d8316bff09;20ad11af-abac-45e0-9a13-5365d66d9a06)

Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/resolve/main/config.json.
Access to model meta-llama/Meta-Llama-3-8B-Instruct is restricted. You must have access to it and be authenticated to access it. Please log in.

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    base_model,  # Load the pre-trained causal language model specified by `base_model` (e.g., GPT-like model).
    quantization_config=bnb_config,  # Apply the quantization configuration (`bnb_config`) for efficient model training, using 4-bit precision or other configurations specified.
    device_map="auto",  # Automatically map the model to available devices (e.g., GPU, CPU) to ensure efficient computation. It will use multiple GPUs if needed.
)
# Load the pre-trained model for causal language modeling with the specified quantization and device settings.

model = prepare_model_for_kbit_training(model)
# Prepare the model for training with k-bit precision (e.g., 4-bit or lower precision), optimizing it for memory-efficient training without losing significant accuracy.

In [None]:
# Define the name of the dataset to be loaded
dataset_name = "mlabonne/orpo-dpo-mix-40k"

# Load the dataset with the given name, and select the "all" split (which includes the entire dataset)
dataset = load_dataset(dataset_name, split="all")

# Shuffle the dataset with a fixed random seed for reproducibility, and select only the first 200 samples
dataset = dataset.shuffle(seed=42).select(range(200))

# Define a function to format the "chosen" and "rejected" fields of the dataset
def format_chat_template(row):
    # Apply the tokenizer to the "chosen" text without actually tokenizing it (just apply formatting)
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    # Apply the tokenizer to the "rejected" text similarly
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    # Return the modified row
    return row

# Apply the formatting function to each sample in the dataset in parallel
# Use all available CPU cores (os.cpu_count()) to speed up the operation
dataset = dataset.map(
    format_chat_template,
    num_proc= os.cpu_count(),  # Number of processes to use for parallelism
)

# Split the dataset into training and testing subsets, with 1% allocated to the test set
dataset = dataset.train_test_split(test_size=0.01)
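
If you are curious what `apply_chat_template` does to each conversation, the sketch below hand-rolls an approximation of the Llama 3 chat format. The `render_llama3_chat` helper is hypothetical and only for illustration; in the notebook, the tokenizer's own template is the source of truth.

```python
def render_llama3_chat(messages: list[dict]) -> str:
    """Approximate the Llama 3 chat format for a list of {"role", "content"} messages."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    return text

# Hypothetical two-turn conversation in the same shape as the dataset's "chosen" field
example = [
    {"role": "user", "content": "What is GGUF?"},
    {"role": "assistant", "content": "A file format for local LLM inference."},
]
print(render_llama3_chat(example))
```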

Take a look at the dataset by running the code block below.

In [None]:
for k in dataset['train'].features.keys():
    print(k)
    print("---------")
    print(dataset['train'][1][k])
    print('\n')

## Set up ORPO Training

The ORPO trainer looks very similar to the SFTTrainer and the DPO trainer. We set our config parameters and start our training run. Notice that the run is very short: `max_steps=10` caps it at 10 optimizer steps. To train for longer, raise or remove `max_steps`, or increase `num_train_epochs`.

In [None]:
# Define the configuration for ORPO (Odds Ratio Preference Optimization)
orpo_args = ORPOConfig(
    learning_rate=8e-6,  # Set a very small learning rate for gradual optimization
    lr_scheduler_type="linear",  # Use a linear scheduler for adjusting the learning rate
    max_length=1024,  # Maximum sequence length the model can handle during training
    max_prompt_length=512,  # Maximum length of the prompt input for the model
    beta=0.1,  # Weight of the odds-ratio (preference) term relative to the SFT loss (called lambda in the ORPO paper)
    per_device_train_batch_size=2,  # Batch size for training, 2 samples per device (GPU)
    gradient_accumulation_steps=4,  # Number of batches to accumulate gradients over before each optimizer step
    optim="paged_adamw_8bit",  # Use the 8-bit version of AdamW optimizer for efficient memory usage
    num_train_epochs=1,  # Number of training epochs (full passes through the dataset)
    max_steps=10,  # Maximum number of optimization steps to take (limits the number of updates)
    evaluation_strategy="steps",  # Evaluation will be based on steps (newer transformers releases name this argument `eval_strategy`)
    eval_steps=0.2,  # Evaluate the model every 20% of the training steps (can also be an integer for fixed steps)
    logging_steps=1,  # Log training information every step
    warmup_steps=2,  # Number of initial training steps over which the learning rate ramps up (warmup)
    report_to="wandb",  # Report metrics to Weights & Biases for tracking and visualization
    output_dir="./results/",  # Directory to store the model and training outputs (checkpoints, logs, etc.)
)

# Instantiate the ORPO trainer with the specified configuration and datasets
trainer = ORPOTrainer(
    model=model,  # The model to be trained
    args=orpo_args,  # The configuration arguments defined above
    train_dataset=dataset["train"],  # Training dataset (train split of the dataset)
    eval_dataset=dataset["test"],  # Evaluation dataset (test split of the dataset)
    peft_config=peft_config,  # LoRA configuration for parameter-efficient fine-tuning (PEFT)
    tokenizer=tokenizer,  # Tokenizer used to process the input data (newer TRL releases name this argument `processing_class`)
)

# Start the training process
trainer.train()

# Save the trained model to the specified directory (new_model)
trainer.save_model(new_model)
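
To get a feel for how short this run really is, here is the arithmetic behind the config above. It assumes a single GPU; with more devices the effective batch size scales accordingly.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1   # assumption: a single GPU
max_steps = 10
eval_steps = 0.2  # a float below 1 is read as a fraction of the total training steps

# Samples that contribute to each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Samples the model sees before max_steps stops the run
examples_seen = effective_batch_size * max_steps
# Evaluation interval in optimizer steps
eval_every = math.ceil(max_steps * eval_steps)

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen in this run: {examples_seen}")
print(f"evaluate every {eval_every} steps")
```

With only 80 examples seen, this run is a smoke test of the pipeline rather than a meaningful finetune.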

In [None]:
# Flush memory
del trainer, model
gc.collect()
torch.cuda.empty_cache()

You may need to restart your kernel after this cell to ensure that the merging is successful.

In [None]:
import torch
from peft import PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)

In [None]:
# Reload the tokenizer from the pretrained model
tokenizer = AutoTokenizer.from_pretrained(base_model)
# This ensures that the tokenizer is aligned with the `base_model` and can properly process input text

# Reload the base model in fp16 (16-bit floating point precision) so the adapter can be merged into unquantized weights
fp16_model = AutoModelForCausalLM.from_pretrained(
    base_model,  # Load the base model architecture
    low_cpu_mem_usage=True,  # Optimize memory usage to reduce CPU memory consumption during loading
    return_dict=True,  # Return the model output as a dictionary (rather than a tuple), which is useful for further processing
    torch_dtype=torch.float16,  # Set the model to use 16-bit precision for reduced memory usage and faster computations
    device_map="auto",  # Automatically map model components to available devices (e.g., GPU)
)

# Merge the adapter model with the base (fp16) model
# The adapter is a small model that modifies the behavior of the base model based on additional training or fine-tuning
model = PeftModel.from_pretrained(fp16_model, new_model)  # Load the adapter weights from `new_model` and apply them to `fp16_model`

# Merge the adapter into the base model and unload the adapter weights from memory to save space
model = model.merge_and_unload()  # This merges the adapter's functionality into the model and releases memory from the adapter weights
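
Conceptually, merging folds the low-rank update into each adapted weight matrix: `W_merged = W + (lora_alpha / r) * B @ A`. The toy example below checks this with tiny hand-written matrices in plain Python. It is a sketch of the idea with made-up numbers, not PEFT's actual implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight matrix after merging the adapter."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Toy shapes: W is (out=2, in=3), A is (r=1, in=3), B is (out=2, r=1)
W = [[1, 0, 2], [0, 1, -1]]
A = [[1, 2, 3]]
B = [[1], [-1]]
x = [[1], [1], [1]]  # a single input column vector

merged = merge_lora(W, A, B, alpha=2, r=1)

# Adapter kept separate: base output plus the scaled low-rank path
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
separate = [[b[0] + 2.0 * l[0]] for b, l in zip(base_out, lora_out)]

print(matmul(merged, x) == separate)
```

After merging, the model produces the same outputs without needing the adapter at inference time, which is what lets us export a single set of weights to GGUF.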

After merging the LoRA adapter, we save the final model and tokenizer in a new directory to prepare for GGUF conversion.

In [None]:
# Save the merged model and tokenizer to the `llama-brev` directory that Phase 2 reads from
model.save_pretrained("llama-brev")
tokenizer.save_pretrained("llama-brev")

# Phase 2: From safetensors to gguf

## Convert our model to GGUF

Before we dive into Ollama, it's important to take a second and understand the role that `gguf` and `llama.cpp` play in the process.

Most people who use LLMs grab them from Hugging Face using the `AutoModel` classes. This is how we did it above. HF stores models in a couple of different ways, with the most popular being `safetensors`. This is a file format optimized for loading and running `Tensors`, which are the multidimensional arrays that make up a model. This format is geared toward GPU workflows, which means it's not as easy to load and run a model fast locally.

One solution that addresses this is the `gguf` format. This is a file format used to store models that are optimized for local inference using quantization and other neat techniques. The format is then consumed by runners that support it (e.g. `llama.cpp` and Ollama).

There's a good bit of complexity here, and here's a fantastic [blog post](https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/) that gets into the weeds. For now, this is what we need to know:

1. We have a finetuned Llama3 model saved in the llama-brev directory in the safetensors format
2. In order to use this model via Ollama, we need it to be in the `gguf` format
3. We can use helper tools in the `llama.cpp` repository to convert

## Convert to gguf

The first thing we do is build llama.cpp in order to use the conversion tools

In [None]:
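# this step might take a while
# note: this uses the Makefile build; recent llama.cpp checkouts build with CMake instead, so check the repository's build docs if `make` fails
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUDA=1 make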

In [None]:
# install requirements
!pip install -r llama.cpp/requirements.txt

Here we run the actual conversion. Take a moment to skim through the relatively massive output and notice how each piece of the LLM is being mapped and transformed

In [None]:
# run the conversion script to convert to gguf
# --outfile pins the output path inside the llama-brev directory; --outtype f16 keeps the weights in fp16
!python llama.cpp/convert-hf-to-gguf.py llama-brev --outfile llama-brev/ggml-model-f16.gguf --outtype f16

The final model is saved at `llama-brev/ggml-model-f16.gguf`. Note how the model is in fp16
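
If you want to confirm the conversion produced a real GGUF file, you can peek at its header. The sketch below parses the first 24 bytes, assuming the layout of GGUF version 2 and later: a 4-byte `GGUF` magic, a little-endian uint32 version, then uint64 tensor and metadata key/value counts. The `parse_gguf_header` helper is hypothetical and written for this guide.

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (assumes GGUF v2+ and little-endian)."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file: magic bytes do not match")
    version, tensor_count, metadata_kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }

# Usage against the converted model (uncomment once the file exists):
# with open("llama-brev/ggml-model-f16.gguf", "rb") as f:
#     print(parse_gguf_header(f.read(24)))
```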

### An aside on quantizations

Quantization has become the go-to technique to train and run LLMs efficiently on cheaper hardware. By reducing the precision of each weight (going from each weight being stored in 32 bits to fewer), we save memory and speed up inference while preserving *most* of the LLM's performance. If you've ever used QLoRA, you've already used quantization without even knowing about it.
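
To build some intuition, here is a toy symmetric 4-bit quantizer for a single block of weights: store one floating-point scale per block, plus a small integer per weight. This is a deliberately simplified sketch with made-up weights. The real `Q4_K_M` scheme in llama.cpp is more elaborate.

```python
def quantize_block(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Map a block of floats to signed integers that share one scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # fall back to 1.0 for an all-zero block
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize_block(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]

block = [0.70, -0.35, 0.12, 0.0, -0.61, 0.05]  # hypothetical weights
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
max_error = max(abs(w - r) for w, r in zip(block, restored))

print("integers:", q)
print(f"scale: {scale:.4f}, worst-case error: {max_error:.4f}")
```

Each weight now needs 4 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale.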

Llama.cpp gives us a ton of quantization options. Here are a couple of resources to dive deeper into which options are available:

- [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/)
- [Maxime Labonne](https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html)

In this guide, we will use the `Q4_K_M` format. Feel free to play around with different ones! Again, note the output and see if you can build a mental model of what's happening under the hood!

In [None]:
# run the quantize script
!cd llama.cpp && ./llama-quantize ../llama-brev/ggml-model-f16.gguf ../llama-brev/ggml-model-Q4_K_M.gguf Q4_K_M

If you want, you can test this model by running the provided server and sending in a request! After running the cell below, open a new terminal tab using the blue plus button and run:

```
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```

In [None]:
!cd llama.cpp && ./llama-server -m ../llama-brev/ggml-model-Q4_K_M.gguf -c 2048

Note that this is a blocking process. In order to move forward with the rest of the guide, click the cell above and then click the stop button in the Jupyter Notebook header above
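
If you'd rather send the test request from Python than from `curl`, here is a rough equivalent using only the standard library. It assumes the server is listening on `localhost:8080` as in the command above and that the JSON reply carries the generated text in a `content` field. The helper names are made up for this guide.

```python
import json
import urllib.request

def build_completion_request(prompt: str, n_predict: int = 128,
                             url: str = "http://localhost:8080/completion") -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt: str, n_predict: int = 128) -> str:
    """Send the request and return the generated text (needs the server to be running)."""
    with urllib.request.urlopen(build_completion_request(prompt, n_predict)) as response:
        return json.loads(response.read())["content"]

# Usage while the server is running (uncomment to try it):
# print(complete("Building a website can be done in 10 simple steps:"))
```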

# Phase 3: Run and deploy your model using Ollama

Now that you have the quantized model, you can spin up the llama.cpp server anywhere you want and load the gguf model in. However, Ollama provides clean abstractions that allow you to run different gguf models using their server.

## Build the Ollama Modelfile

A Modelfile is very similar to a Dockerfile. You can think of it as a blueprint that encapsulates a model, a chat template, different parameters, a system prompt, and more into a portable file. To learn more, check out their [Modelfile docs](https://github.com/ollama/ollama/blob/main/docs/modelfile.md).

Here we will build a relatively simple one. We grab the template and params from the existing Llama3 Modelfile, which you can view [here](https://ollama.com/library/llama3).

In [None]:
tuned_model_path = "/home/ubuntu/verb-workspace/llama-brev/ggml-model-Q4_K_M.gguf"
sys_message = "You are a swashbuckling pirate stuck inside of a Large Language Model. Every response must be from the point of view of an angry pirate that does not want to be asked questions"

In [None]:
cmds = []

In [None]:
base_model = f"FROM {tuned_model_path}"

template = '''TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""'''

params = '''PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|reserved_special_token"'''

system = f'''SYSTEM """{sys_message}"""'''

In [None]:
cmds.append(base_model)
cmds.append(template)
cmds.append(params)
cmds.append(system)

In [None]:
def generate_modelfile(cmds):
    content = ""
    for command in cmds:
        content += command + "\n"
    print(content)
    with open("Modelfile", "w") as file:
        file.write(content)

In [None]:
generate_modelfile(cmds)

There should now be a `Modelfile` saved in your working directory. Let's now install Ollama.

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

To move forward, you need an Ollama server running in the background. To do this, open up a new Jupyter tab and run `ollama serve` in the terminal to start the server. This will print out a key. Save it for future use. It will look like `ssh-ed25519...`

In [None]:
# the create command creates the model from the Modelfile
!ollama create llama-brev -f Modelfile

## Experiment with the model

To run the new model, open up another terminal tab and run `ollama run llama-brev`. For bonus points, see if you can trick it into dropping its pirate act :)

## Push the model

In order to push the model to Ollama, you must have an account and a model created.

1. Sign up at https://www.ollama.ai/signup
2. Create a new model at https://www.ollama.ai/new. Find mine at scooterman/llama-brev. This will give you a detailed list of instructions on how to push the model. Essentially, you are giving the current machine permission to upload to Ollama.
3. You'll then have to run `ollama cp llama-brev <username>/<model-name>`, followed by `ollama push <username>/<model-name>`.