# **About the Author**

This Jupyter notebook was created by **Mohamed Ashour**, a specialist in construction data analytics and AI implementation within the construction industry. Mohamed's work focuses on leveraging advanced technologies to transform traditional construction practices. This notebook is part of his [Unsloth LLM Finetuning project](https://github.com/MoAshour93/Unsloth_LLM_Finetuning/tree/main/GRPO_Finetuning), which demonstrates practical applications of large language model optimization techniques.

For more of Mohamed's work, visit his [GitHub profile](https://github.com/MoAshour93), where you'll find various repositories showcasing both small-scale projects and comprehensive solutions. You can also explore his professional website [APC Mastery Path](http://www.apcmasterypath.co.uk) or connect with him on [LinkedIn](https://www.linkedin.com/in/mohamed-ashour-0727/). For inquiries, reach out via email at mo_ashour1@outlook.com.

*Note: This notebook utilizes the Unsloth library. For licensing information, please refer to the [main Unsloth GitHub repository](https://github.com/unslothai/unsloth).*

## **Notebook Overview**

This notebook is crafted to make the best use of the latest ***Unsloth*** integration of ***Group Relative Policy Optimization (GRPO)***, the core ***Reinforcement Learning Algorithm*** that drives DeepSeek's exceptional reasoning capabilities.

The aim is to take smaller versions of large language models and fine-tune them on custom datasets while adding reasoning capabilities for wider use. The fine-tuned model can then be used in various pieces of software such as Ollama and Open WebUI.

You can check the main [**Unsloth**](https://github.com/unslothai/unsloth) GitHub repository for further details.

<u>*The main steps and sub-steps for this approach, which will be discussed in greater detail in this notebook, are as follows:*</u>
1. Step 1: Setting-up the Coding Environment and Installing Requirements
    1. Part 1: Installing an Integrated Development Environment
    2. Part 2: Installing a Programming Language
    3. Part 3: Installing Nvidia Toolkit
    4. Part 4: Nvidia Toolkit and PyTorch Compatibility
    5. Part 5: Solving any potential dependency conflicts
    6. Part 6: Installing Ollama
    7. Part 7: Installing Open WebUI
2. Step 2: Importing Necessary Libraries and the LLM of choice
3. Step 3: Data Preparation and Reward Functions for RLHF Fine-tuning
4. Step 4: Training the Model with RLHF using GRPOTrainer
5. Step 5: Testing and Saving the Fine-tuned Model
    1. Part 1: Testing the freshly trained model
    2. Part 2: Saving the lightweight LoRA adapters
    3. Part 3: Verifying adapter loading works correctly
    4. Part 4: Merging adapters with the base model for simplified deployment
    5. Part 5: Converting to optimized formats for efficient inference on various hardware
6. Optional Step 6: Merging the LoRA Adapters with the original Model & Converting to GGUF
    1. Part 1: Merging LoRA adapters with the main model and saving them in Floating Point 16 format
    2. Part 2: Converting the model to GGUF format
7. Step 6: Deployment on Ollama & Open WebUI
    1. Part 1: Creating a Model File
    2. Part 2: Deployment of Created Model on Ollama & Open WebUI


## **Step 1: Setting-up the Coding Environment and Installing Requirements**

### **Part 1: Installing an Integrated Development Environment**

You need to have an Integrated Development Environment installed in order to run the code. You can use any IDE of your choice, but I recommend using Visual Studio Code.

You can download [Visual Studio Code](https://code.visualstudio.com/). It is available for Windows, macOS, and Linux.

### **Part 2: Installing a Programming Language**

#### **Installing Python 3.11:**
##### <u>*Add deadsnakes PPA for newer Python versions*</u>
        sudo add-apt-repository ppa:deadsnakes/ppa
        sudo apt update

##### <u>*Install Python 3.11*</u>
        sudo apt install python3.11 python3.11-venv python3.11-dev

*P.S.: The `python3.11-dev` package provides the Python development headers, which this code relies on, so make sure it is installed.*

##### <u>*Make the installed version the current python version on your system:*</u>

        sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
        sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1

A good practice is to install everything in a virtual environment.

1. You will need first to install *virtualenv*:
        
        pip install virtualenv

2. Then you can create your virtual environment of choice and then activate it:
        
        virtualenv unsloth_env_mar25
        source unsloth_env_mar25/bin/activate

3. You can then install the required dependencies from Unsloth as shown below:

        pip install unsloth vllm  # vLLM is only available on Linux, and Unsloth is built on CUDA libraries
        pip install --upgrade pillow

4. Make sure to install cmake, which is needed to convert the trained model to GGUF at the end of this notebook:

        sudo apt install cmake
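
Before installing the heavier dependencies, it can help to confirm from inside Python that the interpreter is the expected version and that the virtual environment is actually active. The snippet below is a minimal sketch using only the standard library; the helper names `in_virtualenv` and `python_version_ok` are hypothetical and not part of Unsloth.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def python_version_ok(required=(3, 11)) -> bool:
    """Check that the running major/minor version is the one installed above."""
    return tuple(sys.version_info[:2]) == tuple(required)


print("Python version:", sys.version.split()[0])
print("Inside a virtual environment:", in_virtualenv())
print("Running Python 3.11:", python_version_ok())
```

If the second line prints `False`, the environment was most likely not activated in the shell that launched the notebook.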

### **Part 3: Installation of Nvidia Toolkit:**

The [Nvidia CUDA Toolkit](https://developer.nvidia.com/cuda-12-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local) is one of the pillars of this notebook, since the fine-tuning relies on Nvidia's CUDA technology.

The version used in this code is 12.4. The commands for downloading and installing it are as follows:

        wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
        sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
        wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
        sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
        sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
        sudo apt-get update
        sudo apt-get -y install cuda-toolkit-12-4

### **Part 4: Nvidia CUDA ToolKit and PyTorch Compatibility**

Make sure that the PyTorch build installed by Unsloth matches the CUDA Toolkit version you installed above.

You can check this using the two code cells at the end of this step (the `torch.version.cuda` check and the `nvcc --version` check).

If the versions are different then you can follow these steps:

1. First, let's check where both CUDA versions are installed:

        ls -la /usr/local/cuda*

2. Check your current PATH to see which CUDA bin directory comes first:

        echo $PATH

3. Find all nvcc installations on your system:

        which -a nvcc

4. Let's modify your shell configuration more thoroughly. Open your ~/.bashrc or ~/.zshrc file:

       nano ~/.bashrc  # or ~/.zshrc

5.  Add these lines at the end of the file, making sure the paths match your actual CUDA 12.4 installation:

        # Remove any existing CUDA paths from PATH and LD_LIBRARY_PATH
        export PATH=$(echo $PATH | tr ':' '\n' | grep -v "cuda" | tr '\n' ':' | sed 's/:$//')
        export LD_LIBRARY_PATH=$(echo $LD_LIBRARY_PATH | tr ':' '\n' | grep -v "cuda" | tr '\n' ':' | sed 's/:$//')

        # Set CUDA 12.4 as default
        export CUDA_HOME=/usr/local/cuda-12.4
        export PATH=$CUDA_HOME/bin:$PATH
        export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

6. Save and close the file (Ctrl+O, Enter, Ctrl+X in nano).

7. Verify the version again:

        nvcc --version

8. If nvcc still shows another version (for example 12.6), there might be a system-wide configuration or a symbolic link issue. Let's check:

        ls -la /usr/bin/nvcc
        ls -la /usr/local/cuda

9. If /usr/local/cuda is a symbolic link to /usr/local/cuda-12.6, you can update it to point to 12.4 (requires sudo):

        sudo rm /usr/local/cuda
        sudo ln -s /usr/local/cuda-12.4 /usr/local/cuda
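
The `tr | grep -v | tr | sed` pipeline in step 5 is compact but hard to read. The sketch below expresses roughly the same logic in Python so the intent is easier to follow: drop every entry that mentions `cuda`, then put the CUDA 12.4 directory first. The function names are hypothetical, and unlike the shell version this sketch also drops empty entries.

```python
def strip_cuda_entries(path_value: str) -> str:
    """Drop every ':'-separated entry that contains the substring 'cuda'."""
    kept = [entry for entry in path_value.split(":") if entry and "cuda" not in entry]
    return ":".join(kept)


def with_cuda_home(path_value: str, cuda_home: str, subdir: str) -> str:
    """Remove old CUDA entries and prepend <cuda_home>/<subdir>."""
    cleaned = strip_cuda_entries(path_value)
    new_entry = f"{cuda_home}/{subdir}"
    return f"{new_entry}:{cleaned}" if cleaned else new_entry


# Illustrative PATH value only; the real one comes from `echo $PATH`.
example_path = "/usr/local/cuda-12.6/bin:/usr/bin:/bin"
print(with_cuda_home(example_path, "/usr/local/cuda-12.4", "bin"))
```

The same helper applies to `LD_LIBRARY_PATH` with `subdir="lib64"`. This only illustrates the string manipulation; the actual change still has to be made in `~/.bashrc` or `~/.zshrc` as described above.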

### **Part 5: Solving Other Dependency Problems**

#### **PyTorch Triton fails to compile**

The solution is as follows:

1. First, install the required development packages:

        sudo apt-get update
        sudo apt-get install build-essential gcc-multilib g++-multilib

2. Make sure libcuda.so is properly installed and accessible:

        sudo apt-get install nvidia-cuda-dev

3. You can also enable more detailed logging as suggested in the error message:

        export TORCH_LOGS="+dynamo"
        export TORCHDYNAMO_VERBOSE=1
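
To see whether the packages above are actually visible to Python, a quick check like the following can be run first. It is a minimal sketch that assumes the Triton compile failure is caused by a missing C/C++ compiler or an undiscoverable `libcuda`; the helper name `triton_build_prereqs` is hypothetical.

```python
import ctypes.util
import shutil


def triton_build_prereqs() -> dict:
    """Locate the compiler and CUDA driver library that the steps above install."""
    return {
        "gcc": shutil.which("gcc"),                   # C compiler on PATH, or None
        "g++": shutil.which("g++"),                   # C++ compiler on PATH, or None
        "libcuda": ctypes.util.find_library("cuda"),  # e.g. 'libcuda.so.1', or None
    }


for name, location in triton_build_prereqs().items():
    print(f"{name:8s} -> {location or 'NOT FOUND'}")
```

Any `NOT FOUND` entry points at the package that still needs to be installed or added to the library search path.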

### **Part 6: Installing Ollama**

Ollama is an open-source platform that simplifies running, managing, and creating large language models (LLMs) locally on your machine. 

It provides an easy way to download and run models like Llama, Mistral, and other open-source LLMs with minimal setup. 

To install Ollama on Linux, open a terminal and run the following command:

        curl -fsSL https://ollama.com/install.sh | sh

This single command downloads and runs the installation script, which sets up Ollama as a service on your system. 

After installation, you can start using Ollama immediately with commands like ***ollama run llama3*** to download and run models. 

Ollama handles all the model downloading, caching, and configuration automatically, making it one of the simplest ways to experiment with LLMs locally without dealing with complex dependencies or environment setup.

### **Part 7: Installing Open WebUI**

Open WebUI is a user-friendly web interface designed to work with Ollama and other LLM backends, providing a ChatGPT-like experience for interacting with locally hosted models. 

It offers features like chat history, model switching, and parameter adjustments through an intuitive interface. To install Open WebUI on Linux, you can use Python's pip package manager with this simple command:

        pip install open-webui

After installation, you can start the web interface by running:

        open-webui serve

Once launched, Open WebUI will be accessible through your web browser at http://localhost:8080. 

This interface connects to your local Ollama instance automatically (if running), allowing you to chat with your downloaded models through a clean, modern interface without needing to use command-line tools. 

Open WebUI enhances the experience of working with local LLMs by providing conversation management, sharing capabilities, and easy configuration options in a familiar chat interface.
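
Once both services are installed, a small script can confirm that they are up before you try to chat with a model. This is a minimal sketch using only the standard library: the Open WebUI address is the one given above, while `http://localhost:11434` is Ollama's default address and is an assumption about your setup. The helper name `is_reachable` is hypothetical.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or malformed URL.
        return False


services = {
    "Ollama": "http://localhost:11434",      # assumed default Ollama address
    "Open WebUI": "http://localhost:8080",   # address mentioned above
}

for name, url in services.items():
    status = "running" if is_reachable(url) else "not reachable"
    print(f"{name:10s} {url:26s} {status}")
```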

In [1]:
#This is a simple script to check if CUDA is available and to get the name of your GPU.

import torch
print(torch.version.cuda)  # Should match your CUDA version
print(torch.cuda.is_available())  # Should return True
print(torch.cuda.get_device_name(0))  # Should show your GPU name

12.4
True
NVIDIA GeForce RTX 3090


In [2]:
#The purpose of this script is to check if CUDA is installed and to check the version of CUDA installed.

!which nvcc
!nvcc --version

/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
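
The two checks above can also be compared programmatically instead of by eye. The following is a minimal sketch: it parses the release number out of `nvcc --version` output (here a sample line copied from the output above) and compares it with `torch.version.cuda` when PyTorch is importable. The helper name `parse_nvcc_release` is hypothetical.

```python
import re


def parse_nvcc_release(nvcc_output: str):
    """Extract the 'major.minor' release from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Sample line taken from the nvcc output shown above.
sample_output = "Cuda compilation tools, release 12.4, V12.4.99"
nvcc_release = parse_nvcc_release(sample_output)

try:
    import torch
    torch_cuda = torch.version.cuda  # None for CPU-only builds
except Exception:
    torch_cuda = None  # PyTorch is not installed or failed to load

if torch_cuda is None:
    print(f"nvcc reports CUDA {nvcc_release}; no CUDA-enabled PyTorch found to compare against")
elif torch_cuda == nvcc_release:
    print(f"OK: PyTorch and nvcc both use CUDA {nvcc_release}")
else:
    print(f"Mismatch: PyTorch was built for CUDA {torch_cuda}, nvcc reports {nvcc_release}")
```

In practice you would pass the real output of `nvcc --version` to `parse_nvcc_release` rather than the sample string.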


## **Step 2: Importing necessary libraries & the LLM**

The following two code cells import the main libraries that we are going to use from *Unsloth*, including *PatchFastRL*, and load the model.

Use `PatchFastRL` before all functions to patch GRPO and other RL algorithms!

The model used in this project is Meta's *Llama 3.1 8B Instruct*, loaded from Meta's main Hugging Face repository.

Once Llama 3.1 8B Instruct has been downloaded to your computer, you will find it at the following path:
    */home/{your_username}/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/{some_random_characters_of_numbers_&_letters}/*
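
The cache path follows a simple naming rule, so it can be computed rather than typed by hand. The sketch below assumes the default cache location under your home directory (it can be moved with environment variables such as `HF_HOME`); the helper names `repo_cache_dir` and `list_snapshots` are hypothetical.

```python
from pathlib import Path
from typing import Optional


def repo_cache_dir(repo_id: str, hub_dir: Optional[Path] = None) -> Path:
    """Map a repo id such as 'org/name' to its 'models--org--name' cache folder."""
    hub_dir = hub_dir or Path.home() / ".cache" / "huggingface" / "hub"
    return hub_dir / ("models--" + repo_id.replace("/", "--"))


def list_snapshots(repo_id: str) -> list:
    """Return the snapshot folder names for a cached model (empty if not downloaded)."""
    snapshots = repo_cache_dir(repo_id) / "snapshots"
    return sorted(p.name for p in snapshots.iterdir()) if snapshots.is_dir() else []


repo = "meta-llama/Llama-3.1-8B-Instruct"
print("Cache folder:", repo_cache_dir(repo))
print("Snapshots   :", list_snapshots(repo) or "none found (model not downloaded yet)")
```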

In [3]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [4]:
# Import necessary libraries
from unsloth import is_bfloat16_supported  # Check if bfloat16 precision is supported by hardware
import torch  # PyTorch deep learning framework

# Configuration parameters
max_seq_length = 2048  # Maximum sequence length for model input
                       # Can be increased to handle longer text/reasoning chains
                       # Higher values require more GPU memory
lora_rank = 64  # LoRA adaptation rank parameter
                # Controls the expressiveness of fine-tuning
                # Higher rank = more capacity to learn but slower training and inference

# Load the pre-trained model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",  # Base model to load (Llama 3.1 8B)
    max_seq_length = max_seq_length,  # Apply the configured sequence length
    load_in_4bit = True,  # Use 4-bit quantization to reduce memory usage
                          # Set to False if using LoRA with 16-bit precision
    fast_inference = True,  # Enable vLLM for optimized inference speed
    max_lora_rank = lora_rank,  # Set maximum LoRA rank to match our configuration
    gpu_memory_utilization = 0.6,  # Use 60% of available GPU memory
                                  # Can be reduced if experiencing out-of-memory errors
)

# Apply Parameter-Efficient Fine-Tuning (PEFT) with LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,  # LoRA rank parameter (same as defined above)
                    # Higher values allow more adaptation but increase memory usage
                    # Common values: 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention mechanism modules
        "gate_proj", "up_proj", "down_proj",     # MLP/FFN modules
    ],  # These are the layers that will be fine-tuned with LoRA
        # Can remove QKVO (attention) layers if out of memory
    lora_alpha = lora_rank,  # Scaling factor for LoRA updates
                             # Setting it equal to rank is common practice
    use_gradient_checkpointing = "unsloth",  # Memory optimization technique
                                            # Enables training with longer sequences
                                            # Uses Unsloth's implementation
    random_state = 3407,  # Random seed for reproducibility of results
)

INFO 03-04 23:49:28 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.3: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.586 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit with actual GPU utilization = 56.7%
Unsloth: Your GPU has CUDA compute capability 8.6 with VRAM = 23.59 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 7.04 GB. Also swap space = 6 GB.
INFO 03-04 23:49:33 config.py:549] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward'

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 03-04 23:49:37 model_runner.py:1115] Loading model weights took 5.5898 GB
INFO 03-04 23:49:37 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-04 23:49:38 worker.py:267] Memory profiling takes 1.17 seconds
INFO 03-04 23:49:38 worker.py:267] the current vLLM instance can use total_gpu_memory (23.59GiB) x gpu_memory_utilization (0.57) = 13.37GiB
INFO 03-04 23:49:38 worker.py:267] model weights take 5.59GiB; non_torch_memory takes 0.04GiB; PyTorch activation peak memory takes 0.90GiB; the rest of the memory reserved for KV Cache is 6.85GiB.
INFO 03-04 23:49:39 executor_base.py:111] # cuda blocks: 3505, # CPU blocks: 3072
INFO 03-04 23:49:39 executor_base.py:116] Maximum concurrency for 2048 tokens per request: 27.38x
INFO 03-04 23:49:40 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error oc

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:12<00:00,  2.09it/s]

INFO 03-04 23:49:53 model_runner.py:1562] Graph capturing finished in 13 secs, took 0.65 GiB
INFO 03-04 23:49:53 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 16.24 seconds



Unsloth 2025.3.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
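
To get a feel for what `lora_rank = 64` means in practice, the number of trainable LoRA parameters can be worked out by hand. Each adapted linear layer of shape `in x out` gains two low-rank matrices with `r * (in + out)` parameters in total. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers); these values are not stated in this notebook.

```python
# Assumed Llama 3.1 8B dimensions (from the published model configuration).
HIDDEN = 4096          # hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32
LORA_RANK = 64         # same value as lora_rank in the cell above

# (input features, output features) for each LoRA target module.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in the two LoRA matrices A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total : {total:,}")
```

Under these assumptions the total comes to 167,772,160, which agrees with the "Trainable parameters" figure that the trainer prints in Step 4.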


## **Step 3: Data Preparation and Reward Functions for RLHF Fine-tuning**
This section handles the dataset preparation and reward function setup for Reinforcement Learning from Human Feedback (RLHF) fine-tuning of our language model.

#### **Dataset and Response Format Configuration**
The code block below sets up the data pipeline and defines how model responses should be structured. It defines the system prompt and the expected XML format for the model's responses. This format is essential both for getting structured responses from the model and for rewarding or penalising it.

#### **RLHF Reward Functions**
A number of functions are created to evaluate model outputs and assign numerical rewards that guide the learning process.

#### **Key Implementation Details**

* Structured Responses: Uses XML-like tags to separate reasoning from answers, enabling evaluation of both process and results.
* Custom Dataset Loading: Transforms a CSV file into a properly formatted dataset for chat-based fine-tuning.
* Multiple Reward Functions: Implements various evaluation metrics:
    * Correctness: Rewards exact answer matching (2.0 points)
    * Format Adherence: Rewards proper XML structure (0.5 points)
    * Partial Credit: Gives fractional rewards for partially correct formatting
    * Special Case Handling: Includes separate handlers for different answer formats

This multi-faceted reward approach guides the model to develop both accurate responses and consistent formatting, critical for applications requiring structured outputs with explicit reasoning.
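
To make the "Custom Dataset Loading" step concrete, the sketch below shows the record structure that each CSV row is turned into: a chat-style `prompt` (system message plus the question) and the reference `answer`. It uses only the standard library and an in-memory CSV with made-up rows; the `Question` and `Answer` column names are the ones used in this notebook, while `SAMPLE_CSV` and `rows_to_records` are hypothetical names for illustration.

```python
import csv
import io

# Abbreviated copy of the notebook's system prompt, for illustration.
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n"
)

# Made-up rows standing in for the real dataset file.
SAMPLE_CSV = """Question,Answer
"Hypothetical question 1","Hypothetical answer 1"
"Hypothetical question 2","Hypothetical answer 2"
"""


def rows_to_records(csv_text: str) -> list:
    """Convert CSV rows into the prompt/answer records used for fine-tuning."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row["Question"]},
            ],
            "answer": row["Answer"],
        })
    return records


records = rows_to_records(SAMPLE_CSV)
print(len(records), "records")
print(records[0]["prompt"][1])
```

The notebook itself does the same thing with `pandas` and `Dataset.from_list`; this version only illustrates the shape of the data.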

In [5]:
import re  # Regular expressions library for pattern matching
import pandas as pd  # Data manipulation library
from datasets import Dataset  # HuggingFace Datasets library for ML dataset handling

# System prompt and format templates
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""  # Defines the expected response structure for the model

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""  # Template string for formatting chain-of-thought responses with placeholders

# Answer extraction functions
def extract_xml_answer(text: str) -> str:
    """
    Extracts the content between <answer> tags from model output.
    Args:
        text: The full response text from the model
    Returns:
        The extracted answer text without the XML tags
    """
    answer = text.split("<answer>")[-1]  # Get everything after the last <answer> tag
    answer = answer.split("</answer>")[0]  # Get everything before the first </answer> tag
    return answer.strip()  # Remove leading/trailing whitespace

def extract_hash_answer(text: str) -> str | None:
    """
    Extracts the answer following '####' from model output.
    Used for models trained with different output formats.
    Args:
        text: The full response text from the model
    Returns:
        The extracted answer or None if format not found
    """
    if "####" not in text:
        return None
    return text.split("####")[1].strip()  # Get everything after the #### delimiter

# Load your custom dataset from CSV
def get_rics_apc_questions(csv_file_path='RICS APC Consolidated Submissions dataset.csv', split="train") -> Dataset:
    """
    Loads and formats a dataset from CSV file for fine-tuning.
    Args:
        csv_file_path: Path to the CSV file containing questions and answers
        split: Dataset split name (not used in this implementation)
    Returns:
        HuggingFace Dataset object formatted for chat fine-tuning
    """
    # Load the CSV file into a DataFrame
    df = pd.read_csv(csv_file_path)
    
    # Create dataset in the format needed for fine-tuning
    data_list = []
    for _, row in df.iterrows():
        data_list.append({
            'prompt': [  # Format as chat format with system and user messages
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': row['Question']}
            ],
            'answer': row['Answer']  # Using the full answer from your dataset
        })
    
    # Convert to a Hugging Face Dataset object
    dataset = Dataset.from_list(data_list)
    return dataset

# Load your dataset
dataset = get_rics_apc_questions()  # Initialize the dataset using the default file path

# Reward functions for RLHF (Reinforcement Learning from Human Feedback)
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """
    Reward function that checks if extracted answers match expected answers.
    Args:
        prompts: List of model prompts
        completions: List of model completions
        answer: List of expected answers
    Returns:
        List of reward scores (2.0 for correct, 0.0 for incorrect)
    """
    responses = [completion[0]['content'] for completion in completions]  # Extract content from completions
    q = prompts[0][-1]['content']  # Get the question from the last message in the prompt
    extracted_responses = [extract_xml_answer(r) for r in responses]  # Extract answers from XML format
    # Print debugging information
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]  # Award 2.0 points for exact matches

def int_reward_func(completions, **kwargs) -> list[float]:
    """
    Reward function that checks if extracted answers are integers.
    Args:
        completions: List of model completions
    Returns:
        List of reward scores (0.5 for integer answers, 0.0 otherwise)
    """
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]  # Award 0.5 points for numeric answers

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Reward function that checks if the completion follows exact XML format.
    Args:
        completions: List of model completions
    Returns:
        List of reward scores (0.5 for strict format match, 0.0 otherwise)
    """
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"  # Strict regex pattern
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r, re.DOTALL) for r in responses]  # re.DOTALL makes . match newlines
    return [0.5 if match else 0.0 for match in matches]  # Award 0.5 points for strict format match

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Reward function with more flexible pattern matching for XML format.
    Args:
        completions: List of model completions
    Returns:
        List of reward scores (0.5 for soft format match, 0.0 otherwise)
    """
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"  # More lenient regex pattern
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.search(pattern, r, re.DOTALL) for r in responses]  # re.search finds anywhere in string
    return [0.5 if match else 0.0 for match in matches]  # Award 0.5 points for soft format match

def count_xml(text) -> float:
    """
    Helper function that awards partial points for XML tag correctness.
    Args:
        text: Response text to evaluate
    Returns:
        Score based on XML tag correctness (up to 0.5) minus penalties
    """
    count = 0.0
    if text.count("<reasoning>\n") == 1:  # Check for opening reasoning tag
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:  # Check for closing reasoning tag
        count += 0.125
    if text.count("\n<answer>\n") == 1:  # Check for opening answer tag
        count += 0.125
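    # Penalize text after the closing answer tag. This line always runs: if "\n</answer>\n" is
    # absent, split() returns the whole text, so the full response length is penalized.
    count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:  # Check for closing answer tag
        count += 0.125
    # Second penalty for trailing text (also unconditional); the -1 allows for one trailing newline
    count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001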
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    """
    Reward function that uses count_xml to score responses.
    Args:
        completions: List of model completions
    Returns:
        List of scores based on XML tag correctness
    """
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]  # Score each completion using count_xml
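
To see how the format rewards behave, it helps to score a couple of hand-written completions. The sketch below copies `count_xml` and the two regex patterns from the cell above so that it runs on its own; the two sample completions are made up for illustration.

```python
import re

STRICT_PATTERN = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
SOFT_PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"


def count_xml(text: str) -> float:
    """Copy of the notebook's helper, kept identical in behaviour."""
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
    count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
    count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


# Made-up completions: one perfectly formatted, one with no tags at all.
well_formed = "<reasoning>\nCompare the options.\n</reasoning>\n<answer>\n42\n</answer>\n"
untagged = "x" * 1000  # stands in for roughly 200 tokens of free text

for label, text in [("well formed", well_formed), ("untagged", untagged)]:
    strict = 0.5 if re.match(STRICT_PATTERN, text, re.DOTALL) else 0.0
    soft = 0.5 if re.search(SOFT_PATTERN, text, re.DOTALL) else 0.0
    print(f"{label:12s} xmlcount={count_xml(text):+.3f} strict={strict} soft={soft}")
```

The well-formed completion earns the full 0.5 from `count_xml`, while the untagged one scores close to -2.0: when the closing tag is missing, both penalty terms are applied to the entire response. This matches the `xmlcount_reward_func` values of about -2 in the first training steps logged in Step 4.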

## **Step 4: Training the Model with RLHF using GRPOTrainer**
This section sets up and executes the Reinforcement Learning from Human Feedback (RLHF) training process using the Group Relative Policy Optimization (GRPO) approach.

#### **About GRPO and RLHF Training**
Group Relative Policy Optimization (GRPO) is a reinforcement learning technique that samples several candidate responses per prompt and reinforces those that score above the group average; here it is used to maximize multiple reward signals simultaneously. This implementation:
1. <u>Combines Multiple Objectives</u>: The model learns to balance format adherence, reasoning structure, and answer correctness through the prioritized reward functions.
2. <u>Uses TorchDynamo Compilation</u>: The verbose logging helps track the compilation and optimization of training operations for debugging and performance tuning.
3. <u>Applies Efficient Learning</u>: By using the previously configured LoRA adapters, the training process updates only a small subset of parameters, making the fine-tuning process much more efficient.
4. <u>Progressive Reward System</u>: The reward functions are arranged in order of increasing importance - from basic formatting to exact answer matching - guiding the model to develop both proper structure and accurate content.

This RLHF approach trains the model to not only provide correct answers but to follow a specific reasoning-then-answer format that enhances explainability and trustworthiness of the model's outputs.
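
The "group relative" part of GRPO can be illustrated in a few lines. For each prompt, the rewards of the sampled completions are normalised against the mean and standard deviation of their own group, so a completion is pushed up only if it beats its siblings. The sketch below is a simplified illustration, not the trainer's actual code: the reward values are made up, and the use of the population standard deviation and `eps = 1e-4` are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps: float = 1e-4):
    """Normalise each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up total rewards for the 6 completions sampled for one prompt.
rewards = [0.5, 0.0, 2.5, 0.5, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:4.1f}  advantage={a:+.3f}")
```

Completions above the group mean get a positive advantage and those below get a negative one. If all completions receive the same reward, every advantage is zero and that prompt contributes no learning signal, which is why rewards that differentiate between candidates matter.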

#### **Key Configuration Details**
This configuration balances performance and efficiency for RLHF training:

1. **Precision Optimization:**
* Automatically selects bfloat16 or fp16 based on hardware support
* Uses 8-bit optimizer to reduce memory footprint

2. **Performance Settings:**
* vLLM backend enables faster inference during training
* Small batch size with option to increase gradient accumulation steps
* Generates 6 candidate responses per prompt for reward comparison

3. **Training Stability:**
* Cosine learning rate schedule with 10% warmup period
* Gradient clipping at 0.1 to prevent divergence
* Weight decay for regularization

4. **Resource Management:**
* Configurable token length limits for inputs and outputs
* Option to reduce generations if facing memory constraints
* Fixed number of training steps rather than epochs for predictable runtime

This configuration is designed to make RLHF training feasible on consumer-grade hardware while maintaining training effectiveness for structured response generation.
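
To visualise the "cosine schedule with 10% warmup", the learning rate at any step can be computed directly. The sketch below follows the usual linear-warmup-then-cosine-decay formula with the values used in this notebook (`learning_rate = 5e-6`, `warmup_ratio = 0.1`, `max_steps = 1000`); it is an approximation for illustration, and the function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(step: int, max_steps: int = 1000, base_lr: float = 5e-6,
               warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {lr_at_step(step):.2e}")
```

The rate climbs linearly for the first 100 steps, peaks at 5e-6, and then decays smoothly to zero by step 1000.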


In [6]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True,                     # Use vLLM backend for accelerated inference
    learning_rate = 5e-6,                # Small learning rate appropriate for fine-tuning
    adam_beta1 = 0.9,                    # Adam optimizer momentum parameter
    adam_beta2 = 0.99,                   # Adam optimizer second moment parameter
    weight_decay = 0.1,                  # L2 regularization to prevent overfitting
    warmup_ratio = 0.1,                  # Gradual learning rate warmup over 10% of training
    lr_scheduler_type = "cosine",        # Cosine learning rate schedule for smooth decay
    optim = "paged_adamw_8bit",          # Memory-efficient 8-bit AdamW optimizer
    logging_steps = 1,                   # Log metrics after every step
    bf16 = is_bfloat16_supported(),      # Use bfloat16 precision if hardware supports it
    fp16 = not is_bfloat16_supported(),  # Fallback to fp16 if bfloat16 not available
    per_device_train_batch_size = 1,     # Small batch size due to memory constraints
    gradient_accumulation_steps = 1,     # Can increase to 4 for more stable gradients
    num_generations = 6,                 # Number of responses to generate per prompt
    max_prompt_length = 256,             # Maximum token length for input prompts
    max_completion_length = 200,         # Maximum token length for generated responses
    # num_train_epochs = 1,              # Commented out in favor of max_steps
    max_steps = 1000,                    # Train for 1000 optimization steps
    save_steps = 1000,                   # Save model checkpoint every 1000 steps
    max_grad_norm = 0.1,                 # Gradient clipping to prevent exploding gradients
    report_to = "none",                  # Disable external reporting (could use W&B)
    output_dir = "outputs",              # Directory to save model checkpoints
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6


In [7]:
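# Enable verbose logging for PyTorch and TorchDynamo.
# `!export ...` only sets the variable in a temporary subshell, so set it in the kernel's own environment instead.
# These variables are typically read when torch is first imported, so for full effect run these lines before the first `import torch`.
import os
os.environ["TORCH_LOGS"] = "+dynamo"      # Enable PyTorch dynamo logging
os.environ["TORCHDYNAMO_VERBOSE"] = "1"   # Set TorchDynamo verbosity level for detailed compilation info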

# Initialize the GRPO Trainer with model, tokenizer, and reward functions
trainer = GRPOTrainer(
    model = model,                  # The LoRA-adapted language model prepared in earlier steps
    processing_class = tokenizer,   # Tokenizer for processing text inputs/outputs
    reward_funcs = [                # Multiple reward functions in priority order
        xmlcount_reward_func,       # Rewards proper XML tag structure and placement
        soft_format_reward_func,    # Rewards general XML formatting (less strict)
        strict_format_reward_func,  # Rewards exact XML formatting (more strict)
        int_reward_func,            # Rewards numeric answers when appropriate
        correctness_reward_func,    # Rewards exact answer matching (highest value reward)
    ],
    args = training_args,           # Training configuration (batch size, learning rate, etc.)
    train_dataset = dataset,        # The prepared Q&A dataset
)

# Start the RLHF training process
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 484 | Num Epochs = 3 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 167,772,160/4,796,452,864 (3.50% trained)


-------------------- Question:
Give me a good quality example of Competency: Planning and Development Level 1 
Answer:
Through my studies and work experience, I have gained a solid foundation in the principles and practices of planning and development. I understand the role of the UK planning system in regulating land use and development, and the key policy frameworks at national and local levels that guide decision-making. I am familiar with the main stages of the development process, from site identification and feasibility assessment to planning application, construction, and disposal. I appreciate the importance of effective stakeholder engagement and community consultation in securing planning consents and social value. I am aware of the various planning mechanisms and tools available, such as Section 106 agreements, Community Infrastructure Levy, and viability assessments, and their impact on development economics. I also recognize the increasing focus on sustainable and inclusiv

| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / int_reward_func | rewards / correctness_reward_func |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.0 | -2.000834 | 0.179406 | 200.0 | 0.0 | -2.000834 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | -0.0 | -2.179667 | 0.084963 | 200.0 | 0.0 | -2.179667 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | -2.089167 | 0.175954 | 195.166672 | 0.000631 | -2.089167 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | -2.002333 | 0.272161 | 191.666672 | 0.0007 | -2.002333 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | -2.150333 | 0.088287 | 200.0 | 0.000614 | -2.150333 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | -2.097833 | 0.154832 | 200.0 | 0.000648 | -2.097833 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | -1.915 | 0.147751 | 200.0 | 0.000639 | -1.915 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | -2.2715 | 0.180705 | 200.0 | 0.000619 | -2.2715 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | -1.971833 | 0.220851 | 200.0 | 0.000598 | -1.971833 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 0.0 | -1.887 | 0.185074 | 193.5 | 0.000566 | -1.887 | 0.0 | 0.0 | 0.0 | 0.0 |
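
The first ten logged steps above can be checked with a few lines of Python. The sketch below is not part of the original training code: it copies the logged values by hand, shortens the column names for readability, and confirms that the total reward is the sum of the per-function rewards and that only one reward function contributes at this stage.

```python
import csv
import io

# First ten logged steps, copied from the training log above.
# Column names are shortened here (e.g. "xmlcount" for "rewards / xmlcount_reward_func").
LOG = """\
step,reward,xmlcount,soft_format,strict_format,int,correctness
1,-2.000834,-2.000834,0.0,0.0,0.0,0.0
2,-2.179667,-2.179667,0.0,0.0,0.0,0.0
3,-2.089167,-2.089167,0.0,0.0,0.0,0.0
4,-2.002333,-2.002333,0.0,0.0,0.0,0.0
5,-2.150333,-2.150333,0.0,0.0,0.0,0.0
6,-2.097833,-2.097833,0.0,0.0,0.0,0.0
7,-1.915,-1.915,0.0,0.0,0.0,0.0
8,-2.2715,-2.2715,0.0,0.0,0.0,0.0
9,-1.971833,-1.971833,0.0,0.0,0.0,0.0
10,-1.887,-1.887,0.0,0.0,0.0,0.0
"""

COMPONENTS = ["xmlcount", "soft_format", "strict_format", "int", "correctness"]
rows = list(csv.DictReader(io.StringIO(LOG)))

# The logged total reward should equal the sum of the per-function rewards.
for row in rows:
    component_sum = sum(float(row[name]) for name in COMPONENTS)
    assert abs(component_sum - float(row["reward"])) < 1e-6

mean_reward = sum(float(row["reward"]) for row in rows) / len(rows)
nonzero = sorted({name for row in rows for name in COMPONENTS if float(row[name]) != 0.0})

print(f"Mean reward over {len(rows)} steps: {mean_reward:.4f}")
print("Reward functions with a non-zero contribution:", nonzero)
```

In these early steps the whole (negative) reward comes from `xmlcount_reward_func`, while the format, integer, and correctness rewards are all zero.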


-------------------- Question:
Give me a medium quality example of Competency: Value management and engineering Level 2 
Answer:
I helped run value engineering workshops on an education project. I looked at different options to save money and improve value, considering both initial and running costs. For the façade, I compared different systems to find the best option. I kept records of all decisions made and helped find significant cost savings. 
Response:
**Developing a Comprehensive Value Profile of a New IT System**

In an IT organization, a new system called "SmartProject" is being considered for implementation to streamline project management processes and enhance collaboration among teams. However, the organization is experiencing budget constraints and wants to ensure that the new system will provide sufficient value to justify its costs.

**Competency in Practice: Value Management and Engineering**

To address this challenge, we need to apply the principles of value management

TrainOutput(global_step=1000, training_loss=0.099515612887574, metrics={'train_runtime': 5560.5218, 'train_samples_per_second': 1.079, 'train_steps_per_second': 0.18, 'total_flos': 0.0, 'train_loss': 0.099515612887574})

## **Step 5: Testing and Saving the Fine-tuned Model**

After completing the GRPO training process, it is essential to test the model's performance, save the trained adapters, and prepare the model for real-world deployment. This section walks through the complete workflow, from initial testing to creating deployment-ready model artifacts in various formats.

The process follows five steps:
1. Testing the freshly trained model
2. Saving the lightweight LoRA adapters
3. Verifying that adapter loading works correctly
4. Merging adapters with the base model for simplified deployment
5. Converting to optimized formats for efficient inference on various hardware

Each step builds on the previous one to ensure our fine-tuned model can be effectively utilized in production environments.

#### **Why These Steps Matter**
This workflow demonstrates a complete cycle from training to deployment:

* Inference Testing: Verifies that the model has learned the desired behavior
* LoRA Persistence: Saves only the adapter weights (~1-3% of full model size)
* Dynamic Loading: Shows how to apply LoRA adapters at inference time
* Model Merging: Creates a single model file with adapters integrated
* GGUF Conversion: Prepares the model for efficient deployment in applications like llama.cpp

The final GGUF format enables running the fine-tuned model on consumer hardware with minimal resources while maintaining most of the model's capabilities.
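
To put the adapter size in perspective, the sketch below redoes the arithmetic from the parameter counts printed in the Unsloth training banner earlier in this notebook. The bytes-per-parameter values are assumptions for illustration only; the actual size of the saved adapter file depends on the dtype used when saving.

```python
# Parameter counts reported in the Unsloth training banner.
trainable_params = 167_772_160
reported_total_params = 4_796_452_864

trainable_pct = 100 * trainable_params / reported_total_params
print(f"Trainable share of reported parameters: {trainable_pct:.2f}%")

# Assumption: adapters stored at 2 bytes (16-bit) or 4 bytes (32-bit) per parameter.
MIB = 1024 ** 2
adapter_mib_16bit = trainable_params * 2 / MIB
adapter_mib_32bit = trainable_params * 4 / MIB
print(f"Estimated adapter size: {adapter_mib_16bit:.0f} MiB (16-bit) "
      f"or {adapter_mib_32bit:.0f} MiB (32-bit)")
```

Either way, the adapters come to a few hundred MiB, far smaller than a full 16-bit copy of an 8B-parameter model.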

#### **Step 5.1: Testing the Fine-tuned Model**

In [9]:
# Format a test prompt using the chat template
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Give me a good quality example of procurement and tendering level 2 in relation to the RICS APC requirements."},
], tokenize = False, add_generation_prompt = True)

# Configure generation parameters
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.25,              # Low temperature for more deterministic outputs
    top_p = 0.95,                    # Nucleus sampling parameter
    max_tokens = 1024,               # Maximum response length
)

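# Generate a response. Note: with lora_request = None no adapter is passed to vLLM,
# so this call acts as a baseline; the saved adapters are applied in Step 5.3.
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,             # No LoRA adapter is requested for this generation
)[0].outputs[0].text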

# Display the output
output

Processed prompts: 100%|██████████| 1/1 [00:13<00:00, 13.64s/it, est. speed input: 4.32 toks/s, output: 73.81 toks/s]


"**Procurement and Tendering Example for RICS APC**\n\n**Scenario:**\n\nA Quantity Surveyor is working for a construction company that has been appointed by a client to manage the procurement and tendering process for a new office building project. The project involves the construction of a 5-story office building with a total value of £2 million.\n\n**Procurement Strategy:**\n\nThe Quantity Surveyor has been tasked with developing a procurement strategy that ensures the project is delivered on time, within budget, and to the required quality standards. The procurement strategy involves the following steps:\n\n1. **Market Research:** The Quantity Surveyor conducts market research to identify potential contractors and suppliers who have the necessary skills, experience, and resources to deliver the project.\n2. **Pre-qualification Questionnaire (PQQ):** The Quantity Surveyor develops a PQQ to assess the suitability of potential contractors and suppliers. The PQQ includes questions on:\n

#### **Step 5.2: Saving LoRA Adapters**

In [10]:
# Save the trained LoRA adapters to disk
model.save_lora("RICS_APC_Unsloth_grpo__llama31_8b_saved_lora")

#### **Step 5.3: Loading LoRA Adapters and Testing**

In [13]:
# Format a test prompt with system prompt included
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Give me a good quality example of procurement and tendering level 2 in relation to the RICS APC requirements."},
], tokenize = False, add_generation_prompt = True)

# Configure generation parameters
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.2,               # Even lower temperature for more focused output
    top_p = 0.95,                    # Nucleus sampling parameter
    max_tokens = 1024,               # Maximum response length
)

# Generate a response by dynamically loading the saved LoRA adapters
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("RICS_APC_Unsloth_grpo__llama31_8b_saved_lora"),
)[0].outputs[0].text

# Display the output
output

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.87it/s, est. speed input: 151.39 toks/s, output: 37.38 toks/s]


'<reasoning>\n0\n</reasoning>\n<answer>\n1\n</answer>\n1'
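
The response above follows the expected XML layout, but its content is degenerate. A small parser makes that easy to see. This is a minimal sketch using a hypothetical helper, `extract_tag`, which is not one of the reward functions defined earlier in the notebook.

```python
import re

def extract_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# The response produced in Step 5.3, copied verbatim.
output = '<reasoning>\n0\n</reasoning>\n<answer>\n1\n</answer>\n1'

reasoning = extract_tag(output, "reasoning")
answer = extract_tag(output, "answer")
# Anything after the closing </answer> tag is text outside the expected format.
trailing = output.split("</answer>", 1)[-1].strip()

print("reasoning:", repr(reasoning))
print("answer:   ", repr(answer))
print("trailing: ", repr(trailing))
```

The tags are present, yet the reasoning and answer are single digits and there is stray text after `</answer>`. This suggests the model has picked up the output format but not useful content for this prompt.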

#### **Step 5.4: Merging LoRA with Base Model (16-bit)**

In [16]:
# Merge the LoRA adapters with the base model and save in 16-bit precision
model.save_pretrained_merged(
    "llama3_1_8b__Unsloth_GRPO_model", 
    tokenizer, 
    save_method = "merged_16bit",
)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 24.63 out of 62.65 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 19%|█▉        | 6/32 [00:00<00:01, 15.13it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:13<00:00,  2.43it/s]


Unsloth: Saving tokenizer... Done.
Done.


#### **Step 5.5: Converting to GGUF Format (8-bit)**

In [18]:
# Save the model in GGUF format. No quantization method is passed here;
# the log below shows the conversion targeting q8_0 (8-bit)
model.save_pretrained_gguf(
    "llama3_1_8b__Unsloth_GRPO_model", 
    tokenizer,
)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 17.95 out of 62.65 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:06<00:00,  4.66it/s]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...


RuntimeError: Unsloth: The file ('llama.cpp/llama-quantize' or 'llama.cpp/llama-quantize.exe' if you are on Windows WSL) or 'llama.cpp/quantize' does not exist.
But we expect this file to exist! Maybe the llama.cpp developers changed the name or check extension of the llama-quantize file.

## Optional Steps (If Steps 5.4 & 5.5 do not work as expected)
## **Step 6: Merging the LoRA Adapters with the original Model & Converting to GGUF**

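At the time of writing (March 2025), Unsloth's built-in GGUF export can fail because it cannot find the `llama-quantize` binary it expects inside the `llama.cpp` folder, as the error above shows. This appears to stem from recent changes in llama.cpp that conflict with the Unsloth dependencies, so the steps below may become redundant in future releases of llama.cpp and Unsloth.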

***The two main steps are:***

1. Merging the LoRA adapters with the main model and saving the result in 16-bit floating-point format.
2. Converting the merged model to GGUF format.

#### **Step 6.1: Merging the LoRA Adapters with the original Model**

The code in this step merges the LoRA adapter weights with the base model, a key stage in the model deployment workflow.

The process starts by loading both the base Llama 3.1 model and the fine-tuned LoRA adapters separately, then uses PEFT's merge functionality to combine them into a single, standalone model. 

This merged model has all the fine-tuning improvements "baked in" and can be used without the LoRA-specific code, making deployment simpler and more efficient.
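
Conceptually, merging folds the low-rank update into the base weight matrix: for a layer with base weight `W` and LoRA matrices `B` and `A`, the merged weight is `W + (alpha / r) * B @ A`. The toy example below illustrates this with tiny hand-written matrices in plain Python. The values, shapes, and helper functions are made up for illustration and are not PEFT's internal implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def scale(a, factor):
    """Multiply every element of a matrix by a scalar."""
    return [[x * factor for x in row] for row in a]

W = [[1.0, 2.0], [3.0, 4.0]]   # base weight (2x2), toy values
B = [[0.5], [1.0]]             # LoRA "up" matrix (2x1), rank r = 1
A = [[2.0, -1.0]]              # LoRA "down" matrix (1x2)
alpha, r = 2.0, 1              # LoRA scaling hyperparameters (toy values)
x = [[1.0], [1.0]]             # input column vector

scaling = alpha / r

# Adapter attached but not merged: base path plus a separate low-rank path.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged: a single weight matrix with the update folded in.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print("with adapter:", y_adapter)
print("merged:      ", y_merged)
```

Both paths produce the same output, which is why the merged model can be served without any LoRA-specific code.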

In [10]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer  # Core HuggingFace libraries for loading models
from peft import PeftModel  # Parameter-Efficient Fine-Tuning (PEFT) wrapper used to attach the LoRA adapters

# Load the base model
base_model_path = "/home/mohamedashour/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/"
# ^ Path to the cached Llama 3.1 8B Instruct model files on the local machine

base_model = AutoModelForCausalLM.from_pretrained(
   base_model_path,                # Path to the model directory
   device_map="auto",              # Automatically distribute model across available GPUs/devices
   torch_dtype="auto"              # Automatically select the appropriate precision (bfloat16/float16/float32)
)

tokenizer = AutoTokenizer.from_pretrained(base_model_path)  # Load the corresponding tokenizer

# Load the LoRA model (adapters)
lora_model_path = "/home/mohamedashour/Documents/Projects/RICS_APC_LLM_Finetuning/RICS_APC_Unsloth_grpo__llama31_8b_saved_lora"
# ^ Path to the saved LoRA adapters from the GRPO training process

model = PeftModel.from_pretrained(
   base_model,                     # Apply LoRA adapters to the loaded base model
   lora_model_path                 # Path to the LoRA adapter weights
)
# This creates a model with the LoRA adapters attached but not yet merged

# Merge weights - combines LoRA adapter weights with the base model weights
# This eliminates the need for adapter-specific code during inference
# After merging, the model behaves like a standard model with the adaptations baked in
model = model.merge_and_unload()    # Returns a new model with merged weights and frees adapter memory

# Save the merged model to a new directory
output_dir = "/home/mohamedashour/Documents/Projects/RICS_APC_LLM_Finetuning/RICS_APC_Unsloth_GRPO_llama3_1_8b_merged"
model.save_pretrained(output_dir)   # Save the model weights and configuration
tokenizer.save_pretrained(output_dir)  # Save the tokenizer configuration and vocabulary

# Confirm successful saving
print(f"Merged model saved to {output_dir}")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Merged model saved to /home/mohamedashour/Documents/Projects/RICS_APC_LLM_Finetuning/RICS_APC_Unsloth_GRPO_llama3_1_8b_merged


#### **Step 6.2: Converting the Merged Model to GGUF using Llama.cpp**

The command in this step uses llama.cpp's conversion script to transform the merged Hugging Face model into GGUF (GPT-Generated Unified Format), which is optimized for efficient inference on CPUs and consumer GPUs.

The q8_0 quantization stores roughly 8.5 bits per weight, so the file is approximately 4x smaller than full precision (float32) and roughly 2x smaller than the 16-bit merged model, making it practical to run on consumer hardware while maintaining good performance.
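
The size reduction can be estimated from the q8_0 block layout, in which each block of 32 weights is stored as 32 signed 8-bit values plus one 16-bit scale. The sketch below works through that arithmetic. The parameter count is an approximate figure for an 8B-class Llama model and is an assumption here; the real file will differ slightly because some tensors (such as the norm weights in the conversion log) stay in F32.

```python
BLOCK_SIZE = 32      # weights per q8_0 block
WEIGHT_BYTES = 1     # each weight is stored as a signed 8-bit integer
SCALE_BYTES = 2      # one 16-bit floating-point scale per block

block_bytes = BLOCK_SIZE * WEIGHT_BYTES + SCALE_BYTES
bits_per_weight = block_bytes * 8 / BLOCK_SIZE

ratio_vs_fp32 = 32 / bits_per_weight
ratio_vs_16bit = 16 / bits_per_weight

# Assumption: roughly 8.03 billion parameters for an 8B-class Llama model.
APPROX_PARAMS = 8.03e9
approx_size_gb = APPROX_PARAMS * bits_per_weight / 8 / 1e9

print(f"Bits per weight:      {bits_per_weight}")
print(f"Reduction vs float32: {ratio_vs_fp32:.2f}x")
print(f"Reduction vs 16-bit:  {ratio_vs_16bit:.2f}x")
print(f"Approximate q8_0 file size: {approx_size_gb:.1f} GB")
```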

The GGUF format is widely supported by applications like llama.cpp, text-generation-webui, and Ollama, making it an excellent choice for deploying your fine-tuned model across different platforms. 

This conversion is the final step in preparing your RICS APC specialized model for practical use in applications where resource efficiency is important.

In [7]:
!python /home/mohamedashour/Documents/Projects/RICS_APC_LLM_Finetuning/llama.cpp/convert_hf_to_gguf.py --help

usage: convert_hf_to_gguf.py [-h] [--vocab-only] [--outfile OUTFILE]
                             [--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}]
                             [--bigendian] [--use-temp-file] [--no-lazy]
                             [--model-name MODEL_NAME] [--verbose]
                             [--split-max-tensors SPLIT_MAX_TENSORS]
                             [--split-max-size SPLIT_MAX_SIZE] [--dry-run]
                             [--no-tensor-first-split] [--metadata METADATA]
                             [--print-supported-models]
                             [model]

Convert a huggingface model to a GGML compatible file

positional arguments:
  model                 directory containing model file

options:
  -h, --help            show this help message and exit
  --vocab-only          extract only the vocab
  --outfile OUTFILE     path to write to; default: based on input. {ftype}
                        will be replaced by the outtype.
  --outtype {f32,f16

In [11]:
# Convert the merged Hugging Face model to GGUF format for deployment.
# The notes are kept above the command because a comment line placed between
# backslash-continued lines would cut the shell command short.
#   <model dir>  Path to the source model directory (merged model created in the previous step)
#   --outfile    Output file path and name for the converted GGUF model
#   --outtype    q8_0 = 8-bit block quantization (the "_0" variants store one scale per block, no offset);
#                a good balance between model size and quality. Per the --help output above, this script
#                also offers f16/bf16 (larger but higher fidelity); q4_0 is not among its --outtype choices.
#   --no-lazy    Disable lazy evaluation so tensors are computed before writing;
#                more memory-intensive, but helps catch potential issues early
#   --verbose    Enable detailed logging during conversion to monitor progress and debug issues
!python /home/mohamedashour/Documents/Projects/RICS_APC_LLM_Finetuning/llama.cpp/convert_hf_to_gguf.py \
   /home/mohamedashour/Documents/Projects/RICS_APC_LLM_Finetuning/RICS_APC_Unsloth_GRPO_llama3_1_8b_merged/ \
   --outfile /home/mohamedashour/Documents/Projects/RICS_APC_LLM_Finetuning/RICS_APC_Unsloth_GRPO_llama3_1_8b_merged/RICS_APC_Unsloth_GRPO_llama3_1_8b_merged.gguf \
   --outtype q8_0 \
   --no-lazy \
   --verbose

INFO:hf-to-gguf:Loading model: RICS_APC_Unsloth_GRPO_llama3_1_8b_merged
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> Q8_0, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> Q8_0, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> Q8_0, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         to
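
Once the conversion finishes, a quick sanity check on the output file can catch a truncated or mis-written result before deployment. GGUF files begin with the four ASCII bytes `GGUF` followed by a 32-bit version number (little-endian here, as the log above notes). The sketch below reads that header. The helper name is hypothetical, and the demo runs on a small stand-in file rather than the real model, so the version number it prints is only a placeholder.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"

def read_gguf_version(path):
    """Return the GGUF version number, or raise ValueError if the magic bytes are missing."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])  # little-endian unsigned 32-bit integer
    return version

# Demo on a stand-in file containing only a fake 8-byte header (placeholder version 3).
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "stand_in.gguf"
    stand_in.write_bytes(GGUF_MAGIC + struct.pack("<I", 3))
    demo_version = read_gguf_version(stand_in)

print("GGUF version read from stand-in file:", demo_version)
```

To check the real model, pass the `--outfile` path used in the conversion command to `read_gguf_version`.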

## **Step 6: Deployment on Ollama & Open WebUI**

In this step we put the fine-tuned model, now converted to GGUF format, to use for chat completion.

***Ollama*** is an open-source tool, written mainly in Go (89%), that runs open LLMs on your local machine (or a server). It acts as a bridge between an open LLM and your machine: it not only runs the model but also provides an API layer on top, so that other applications or services can use it.

***Open WebUI*** is a Web-based interface that allows you to interact with AI models, such as large language models (LLMs). It simplifies working with AI by providing a graphical user interface (GUI) instead of relying on command-line tools.

#### **Part 1: Creating a Model File to use in Ollama**
To use the fine-tuned GGUF model in ***Ollama***, we first have to create a ***Model File*** (Ollama's documentation calls this a `Modelfile`).

For the ***Llama 3.1*** model from ***Meta***, the template and parameters published at https://ollama.com/library/llama3.1:8b are a useful starting point.

I created the model file without any extension: I started from a ***txt*** file and then deleted the ***txt*** extension.
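
As an alternative to renaming a text file by hand, the file can be written from Python with no extension at all. The following is a minimal sketch: the helper name, the GGUF path, and the short system prompt are placeholders, and only the `FROM`, `PARAMETER`, and `SYSTEM` instructions are emitted. The chat template and stop parameters from the full model file shown below would still need to be added.

```python
import tempfile
from pathlib import Path

def write_modelfile(directory, gguf_path, system_prompt, temperature=0.2, num_ctx=4096):
    """Write a minimal Ollama model file named 'Model_File' (no extension) and return its path."""
    lines = [
        f"FROM {gguf_path}",
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",
        f'SYSTEM """{system_prompt}"""',
    ]
    target = Path(directory) / "Model_File"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target

# Demo in a temporary directory with placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    modelfile_path = write_modelfile(tmp, "./my_model.gguf", "You are a helpful assistant.")
    modelfile_text = modelfile_path.read_text(encoding="utf-8")

print("File name:", modelfile_path.name, "| extension:", repr(modelfile_path.suffix))
print(modelfile_text)
```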

<u>*Here is the model file that I created (you can also download it from the GitHub repository, where it is named Model_File):*</u>

        from /home/mohamedashour/Documents/Projects/RICS_APC_LLM_Finetuning/RICS_APC_Unsloth_GRPO_llama3_1_8b_merged.gguf

        parameter temperature 0.2
        parameter num_ctx 4096

        parameter stop <|start_header_id|>
        parameter stop <|end_header_id|>
        parameter stop <|eot_id|>

        template """ 
        {{- if or .System .Tools }}<|start_header_id|>system<|end_header_id|>
        {{- if .System }}

        {{ .System }}
        {{- end }}
        {{- if .Tools }}

        Cutting Knowledge Date: December 2023

        When you receive a tool call response, use the output to format an answer to the original user question.

        You are a helpful assistant with tool calling capabilities.
        {{- end }}<|eot_id|>
        {{- end }}
        {{- range $i, $_ := .Messages }}
        {{- $last := eq (len (slice $.Messages $i)) 1 }}
        {{- if eq .Role "user" }}<|start_header_id|>user<|end_header_id|>
        {{- if and $.Tools $last }}

        Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

        Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.

        {{ range $.Tools }}
        {{- . }}
        {{ end }}
        Question: {{ .Content }}<|eot_id|>
        {{- else }}

        {{ .Content }}<|eot_id|>
        {{- end }}{{ if $last }}<|start_header_id|>assistant<|end_header_id|>

        {{ end }}
        {{- else if eq .Role "assistant" }}<|start_header_id|>assistant<|end_header_id|>
        {{- if .ToolCalls }}
        {{ range .ToolCalls }}
        {"name": "{{ .Function.Name }}", "parameters": {{ .Function.Arguments }}}{{ end }}
        {{- else }}

        {{ .Content }}
        {{- end }}{{ if not $last }}<|eot_id|>{{ end }}
        {{- else if eq .Role "tool" }}<|start_header_id|>ipython<|end_header_id|>

        {{ .Content }}<|eot_id|>{{ if $last }}<|start_header_id|>assistant<|end_header_id|>

        {{ end }}
        {{- end }}
        {{- end }}
        """

        system """ You are an AI language model specialized in providing detailed, accurate, and professional responses to questions related to the RICS Assessment of Professional Competence (APC). Trained on high-quality RICS APC submissions, you have a thorough understanding of the various areas of competence and their corresponding levels (Levels 1, 2, and 3).

        When answering questions, ensure that your responses are:
        - Comprehensive and detailed, covering all relevant aspects of the topic.
        - Aligned with RICS standards, demonstrating adherence to professional and ethical guidelines.
        - Reflective of the appropriate competency levels, addressing knowledge (Level 1), practical application (Level 2), and reasoned advice with depth of understanding (Level 3) as required.
        - Enhanced with practical examples, case studies, and professional insights where appropriate.
        - Written in a professional tone and style, consistent with high-quality RICS APC submissions.

        Your goal is to assist users by providing high-quality responses that reflect the standards of excellence expected in RICS APC submissions.
        """

#### **Part 2: Deployment of Created Model on Ollama & Open WebUI**
Once the model file is created, you can deploy the model to ***Ollama*** by opening a terminal window and running the following command:

        ollama create {Your_Desired_Model_Name} -f {Name_of_Model_File}
For example:

        ollama create RICS_APC_Unsloth_GRPO_Llama3_1_8b -f Model_File

The model will then be registered with ***Ollama***, and you can run it in the terminal and chat with it using *ollama run {Your_Model_Name}*.
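
Because Ollama also exposes an HTTP API, the deployed model can be queried from code as well as from the terminal. The sketch below builds a request for Ollama's `/api/chat` endpoint using only the standard library. It assumes Ollama is running locally on its default port (11434) and that the model was created under the example name above. The helper names are hypothetical, and only the request construction is exercised here; `ask` is defined but not called.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def build_chat_request(model, user_message):
    """Build (but do not send) a non-streaming chat request for the Ollama API."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,  # ask for one JSON response instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(model, user_message, timeout=120):
    """Send the request to a running Ollama server and return the reply text."""
    request = build_chat_request(model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["message"]["content"]

chat_request = build_chat_request(
    "RICS_APC_Unsloth_GRPO_Llama3_1_8b",
    "Give me a good quality example of procurement and tendering level 2.",
)
print(chat_request.get_method(), chat_request.full_url)
```

With the server running, `ask("RICS_APC_Unsloth_GRPO_Llama3_1_8b", "...")` would return the model's reply as a string.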

For the model deployment on ***Open WebUI***, you can open a terminal window and write:
        
        open-webui serve 
        
You can access ***Open WebUI*** by opening your browser and going to http://localhost:8080

Open WebUI integrates with ***Ollama***'s API, so any model that you have downloaded via ***Ollama*** or imported into ***Ollama*** will be available in ***Open WebUI***.

***Open WebUI*** has an interface that resembles ***ChatGPT***; you can start a chat and choose your preferred model.