# Finetuning on Mac - Updated February 2025

### Disclaimer:
These steps are created after having a very thorough read of the following:
* The MLX community - https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md
* Andy Peating articles:
    * Part 1 - Setting up your environment- https://apeatling.com/articles/part-1-setting-up-your-environment/
    * Part 2 - Building your training data for fine-tuning  https://apeatling.com/articles/part-2-building-your-training-data-for-fine-tuning/
    * Part 3 - Fine-tuning your llm using the mlx framework https://apeatling.com/articles/part-3-fine-tuning-your-llm-using-the-mlx-framework/
    * Part 4- Testing and interacting with your fine-tuned LLM  https://apeatling.com/articles/part-4-testing-and-interacting-with-your-fine-tuned-llm/
* Llama 3 Model cards and prompting format available through : https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/
* Fine-tuning LLMs on Mac OS using MLX and run with Ollama - https://medium.com/rahasak/fine-tuning-llms-on-macos-using-mlx-and-run-with-ollama-182a20f1fd2c

Please note that different models have different model cards and prompting templates and you should visit the developer of every LLM to tweak the code to account for the relevant chat template.

# Part I - Setting up the coding environment 

1. Open terminal in your preferred location on your Mac. You can do that by opening finder and then if you have the path bar enabled, you can go to your preferred location, right click on the path bar and then choose open in terminal.

2. install home-brew from “https://brew.sh/“ where you have to paste the following code in terminal 

             /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

3. Make sure that your home-brew is on your PATH by typing on the following commands in your opened terminal window:
            
            echo >> /Users/{Your_Username}/.zprofile
            
            echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/{Your_Username}/.zprofile
            
            eval "$(/opt/homebrew/bin/brew shellenv)"

4. Install git using the code “brew install git” in your terminal window. You can have more details from the git website: https://git-scm.com/downloads/mac

5. Clone the mlx repository through the following code in your terminal window:
                 
            git clone https://github.com/ml-explore/mlx-examples.git

6. change the directory to go inside the lora folder which is located inside the mlx-examples folder through the following command in your terminal window:
     
            cd mlx-examples/lora

7. Make sure to have python installed on your machine. You can go to the website “https://www.python.org/downloads/macos/“ and download your preferred version. I would recommend a version that is 3.11 and later.

8. Make sure to have pip installed on your machine. If in doubt you can install it using the following command in your terminal

             Python -m ensurepip —upgrade

             OR

             Python3 -m ensurepip —upgrade

9. Following from step 5, having changed the directory to be inside the lora folder, we can now Install the requirements  by typing the following command in your terminal 

            Pip install -r requirements.txt
            
            OR
            
            pip3 install -r requirements.txt

10. The MLX LoRA fine-tuning is quite efficient for accounting for the required data, fine-tuning the model and fusing the original model with the trained adapters.
    
    However when it comes to converting the safetensors of the fine-tuned model to GGUF for further usage on Ollama or Open WebUI or other 3rd party apps, it is mandatory to use the llama.cpp

    The mlx_lm.convert doesnot provide multiple options for quantization, the default is 16bit. This is valid up to February 2025. This can change in future releases.

    You can clone the llama.cpp respository which can be through the following link. The code is as follows:

            git clone https://github.com/ggerganov/llama.cpp.git

    Then later on in this code, we would need to change the working directory to be inside the cloned llama.cpp on our machine to be able to use the convert_hf_to_gguf.py file for model conversion from safetensors to GGUF with the required quantization.

# Part II - Create/Import  and Edit data for fine-tuning:

10. Create your training data for fine-tuning:

    1. This is going to be done outside of the terminal environment. For the purpose of this example, I have a csv file containing the questions and answers of RICS APC submissions in the form of questions and answers. The file is composed of 3 columns “id”, “question” & “answer. 
    
    2. Every model has its chat template, for the current example, we are going to use the mistral  architecture. I used a chat template that was proposed by the MLX team on their github repository which works fine for the mistral architecture
        
                {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello."}, {"role": "assistant", "content": "How can I assistant you today."}]}

        In future fine-tuning I will use the official chat template of mistral which you can find below:

                <s>[INST] Human message [/INST] Assistant response </s>[INST] Human message [/INST] Assistant response
    
    3. For fine-tuning purposes using the MLX library, we have to convert the training data into 3 json files: “train.jsonl”, ”test.jsonl” & “valid.jsonl”
    
    4. I built a python code which takes the csv uploaded by the end user from any desired location and then it applies the llama 3 chat template on it and then saves the training, testing and validation files in jsonl format. 
    
        The training data is 80% of the original data size, testing is at 10% and validation is at 10%. All the resultant files are saved in a folder called data
        
        In case of gated models, you will have to install huggingface hub in terminal using          
            
                Pip install huggingface_hub               
            
                OR              
            
                Pip3 install huggingface_hub 
                
        This should be followed by logging into your account using your token by typing the following code in your terminal window
    
                Huggingface-cli login —token {Your_token} 

# Part III - Training your model, testing and validation:

### Step 3.1: Defining important Variables

In [17]:
#Defining the necessary directories for the models, the adapters and the data

# This code block is quite important as I am defining the different variables that I am go to use through my code.
# You can change them to suit your needs.
# The variables include the following:
# 1. Data directory - The location where the train.jsonl, the test.jsonl and the valid.jsonl files are saved
# 2. Downloaded Huggingface Model directory - In my case it is the Mistal Instruct v0.3 with 16 bits accuracy
# 3. Huggingface repository - In my case it is the Mistal Instruct v0.3 with 16 bits accuracy
# 4. My desired Huggingface repository name for saving the fine-tuned model
# 5. A write-token created from the Huggingface website to able to interact with the site to upload and download models.
# 6. A directory for the place where I want to save the converted model from the huggingface format to MLX format
# 7. The Desired name for the fine-tuned MLX model
# 8. The output directory for the fine-tuned MLX model
# 9. Llama.cpp directory where I cloned the github repository to be able to use it in terminal under this jupyter notebook.


#Enter your data below
data= "''"
downloaded_hf_model= "''"
hf_model = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" 
hf_token=''
hf_upload_repo="''"
mlx_path="''"
adapters="''"
output_directory="''"
Desired_model_name= "DSQwen_FT"
System_prompt= f"""
You are an AI language model specialized in providing detailed, accurate, and professional responses to questions related to cats.

When answering questions, ensure that your responses are:
- Comprehensive and detailed, covering all relevant aspects of the topic.
- Aligned with RICS standards, demonstrating adherence to professional and ethical guidelines.
- Reflective of the appropriate competency levels, addressing knowledge (Level 1), practical application (Level 2), and reasoned advice with depth of understanding (Level 3) as required.
- Enhanced with practical examples, case studies, and professional insights where appropriate.
- Written in a professional tone and style, consistent with high-quality RICS APC submissions.

Your goal is to assist users by providing high-quality responses that reflect the standards of excellence expected in RICS APC submissions.
""" ###Change as needed.
llama_cpp_path=" '/Users/dat/Desktop/llama.cpp' " 

!Huggingface-cli login --token {hf_token}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/dat/.cache/huggingface/token
Login successful


### Step 3.2: Downloading the desired model from the Huggingface

In [18]:
#This step is quite important to enable downloading models from the huggingface website to your local storage for further use.

!Huggingface-cli login --token {hf_token}

!huggingface-cli download --repo-type model --local-dir {downloaded_hf_model} {hf_model}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/dat/.cache/huggingface/token
Login successful
Fetching 9 files:   0%|                                   | 0/9 [00:00<?, ?it/s]Downloading 'generation_config.json' to '/Users/dat/Desktop/Crimes-against-humanity/model/.huggingface/download/generation_config.json.052ab54633116a634da950ab483233c4ace0aa82.incomplete'
Downloading 'tokenizer.json' to '/Users/dat/Desktop/Crimes-against-humanity/model/.huggingface/download/tokenizer.json.a34650995da6939a945c330eadb0687147ac3ef8.incomplete'
Downloading 'model.safetensors' to '/Users/dat/Desktop/Crimes-against-humanity/model/.huggingface/download/model.safetensors.58858233513d76b8703e72eed6ce16807b523328188e13329257fb9594462945.incomplete'
Downloading '

### Step 3.3: Huggingface Model Conversion to MLX format

In [19]:
#In this block of code, I am trying to understand the arguments that are provided by the MLX community regarding the conversion of the original huggingface model format to MLX format

!mlx_lm.convert --help

usage: mlx_lm.convert [-h] [--hf-path HF_PATH] [--mlx-path MLX_PATH] [-q]
                      [--q-group-size Q_GROUP_SIZE] [--q-bits Q_BITS]
                      [--quant-predicate QUANT_PREDICATE]
                      [--dtype {float16,bfloat16,float32}]
                      [--upload-repo UPLOAD_REPO] [-d]

Convert Hugging Face model to MLX format

options:
  -h, --help            show this help message and exit
  --hf-path HF_PATH     Path to the Hugging Face model.
  --mlx-path MLX_PATH   Path to save the MLX model.
  -q, --quantize        Generate a quantized model.
  --q-group-size Q_GROUP_SIZE
                        Group size for quantization.
  --q-bits Q_BITS       Bits per weight for quantization.
  --quant-predicate QUANT_PREDICATE
                        Mixed-bit quantization recipe. Choices: ['mixed_2_6',
                        'mixed_3_6']
  --dtype {float16,bfloat16,float32}
                        Type to save the non-quantized parameters.
  --upload-repo UPLO

In [20]:
#The code block below tackles the process of converting a model from the huggingface website into MLX format for further use.
#The variables used here in this code block are from Step 4.1 above.

!mlx_lm.convert \
    --hf-path {hf_model} \
    --mlx-path {mlx_path}

[INFO] Loading
Fetching 5 files:   0%|                                   | 0/5 [00:00<?, ?it/s]
model.safetensors:   0%|                            | 0.00/3.55G [00:00<?, ?B/s][A

generation_config.json: 100%|███████████████████| 181/181 [00:00<00:00, 748kB/s][A[A


tokenizer_config.json: 100%|███████████████| 3.07k/3.07k [00:00<00:00, 21.3MB/s][A[A

model.safetensors:   0%|                   | 10.5M/3.55G [00:00<01:41, 34.9MB/s][A

config.json: 100%|█████████████████████████████| 679/679 [00:00<00:00, 2.01MB/s][A[A
Fetching 5 files:  20%|█████▍                     | 1/5 [00:00<00:03,  1.15it/s]
model.safetensors:   1%|                   | 21.0M/3.55G [00:00<01:39, 35.6MB/s][A
model.safetensors:   1%|▏                  | 31.5M/3.55G [00:00<01:37, 36.0MB/s][A

tokenizer.json:   0%|                               | 0.00/7.03M [00:00<?, ?B/s][A[A
model.safetensors:   1%|▏                  | 41.9M/3.55G [00:01<01:38, 35.8MB/s][A
model.safetensors:   1%|▎                  | 52.4

### Step 3.4: Undertaking the fine-tuning of the chosen model using the Low Rank Adaptation under MLX_LM

In this step, I am trying to make the best use of the Low Rank Adaptation with mlx_lm.lora under the mlx_lm package to fine-tune the already converted MLX Mistral V0.3

In [21]:
#In this block of code, I am trying to understand the arguments that are provided by the MLX community regarding the LoRA fine-tuning to a desired model

! mlx_lm.lora --help

usage: mlx_lm.lora [-h] [--model MODEL] [--train] [--data DATA]
                   [--fine-tune-type {lora,dora,full}]
                   [--optimizer {adam,adamw}] [--mask-prompt]
                   [--num-layers NUM_LAYERS] [--batch-size BATCH_SIZE]
                   [--iters ITERS] [--val-batches VAL_BATCHES]
                   [--learning-rate LEARNING_RATE]
                   [--steps-per-report STEPS_PER_REPORT]
                   [--steps-per-eval STEPS_PER_EVAL]
                   [--resume-adapter-file RESUME_ADAPTER_FILE]
                   [--adapter-path ADAPTER_PATH] [--save-every SAVE_EVERY]
                   [--test] [--test-batches TEST_BATCHES]
                   [--max-seq-length MAX_SEQ_LENGTH] [-c CONFIG]
                   [--grad-checkpoint] [--seed SEED]

LoRA or QLoRA finetuning.

options:
  -h, --help            show this help message and exit
  --model MODEL         The path to the local model directory or Hugging Face
                        repo.
  --train

In [22]:
# Fine-tuning a Large Language Model using MLX framework optimized for Apple Silicon
# Detailed parameter explanations:

# --model ${mlx_path}
# Purpose: Specifies the base model path for fine-tuning
# Benefit: Allows building upon pre-trained knowledge, saving time and computational resources

# --train
# Purpose: Activates the training mode
# Benefit: Distinguishes between inference and training, enabling model updates

# --data ${data}
# Purpose: Points to the training dataset location which contains the 3 data files: train.jsonl, test.jsonl and valid.jsonl
# Benefit: Provides the model with task-specific examples to learn from

# --fine-tune-type lora
# Purpose: Implements LoRA (Low-Rank Adaptation) method
# Benefit: Reduces memory usage and training time while maintaining performance by updating only a small subset of parameters

# --num-layers 16
# Purpose: Defines the number of transformer layers to fine-tune
# Benefit: Controls the depth of model adaptation, balancing between performance and computational efficiency

# --batch-size 8
# Purpose: Sets number of examples processed simultaneously
# Benefit: Optimizes memory usage and training stability for Apple Silicon chips

# --iters 1000
# Purpose: Defines total training iterations
# Benefit: Ensures sufficient model adaptation while preventing overfitting

# --val-batches 50
# Purpose: Specifies validation batch count
# Benefit: Enables monitoring of model generalization and prevents overfitting

# --learning-rate 1e-5
# Purpose: Controls the size of parameter updates
# Benefit: Small value helps preserve base model knowledge while learning new tasks

# --steps-per-report 10
# Purpose: Sets frequency of progress updates
# Benefit: Allows monitoring training progress without excessive logging

# --steps-per-eval 200
# Purpose: Determines evaluation frequency
# Benefit: Regular performance checks without significant training slowdown

# --adapter-path ${adapters}
# Purpose: Specifies where to save LoRA weights
# Benefit: Enables reuse and sharing of fine-tuned adaptations

# --save-every 500
# Purpose: Sets checkpoint frequency
# Benefit: Prevents loss of progress in case of interruptions

# --max-seq-length 2048
# Purpose: Limits input sequence length
# Benefit: Balances between context window and memory usage

# --grad-checkpoint
# Purpose: Enables gradient checkpointing
# Benefit: Reduces memory usage by recomputing intermediate values during backpropagation

# --seed 42
# Purpose: Sets random number generator seed
# Benefit: Ensures reproducibility of training results

! mlx_lm.lora \
 --model ${mlx_path} \
 --train \
 --data ${data} \
 --fine-tune-type lora \
 --num-layers 16 \
 --batch-size 5 \
 --iters 500 \
 --val-batches 50 \
 --learning-rate 1e-5 \
 --steps-per-report 10 \
 --steps-per-eval 200 \
 --adapter-path ${adapters} \
 --save-every 500 \
 --max-seq-length 2048 \
 --grad-checkpoint \
 --seed 42

Loading pretrained model
Loading datasets
Training
Trainable parameters: 0.035% (0.623M/1777.088M)
Starting training..., iters: 500
Iter 1: Val loss 6.408, Val took 0.509s
Iter 10: Train loss 5.822, Learning Rate 1.000e-05, It/sec 1.399, Tokens/sec 319.176, Trained Tokens 2281, Peak mem 4.526 GB
Iter 20: Train loss 5.388, Learning Rate 1.000e-05, It/sec 0.740, Tokens/sec 161.739, Trained Tokens 4466, Peak mem 4.526 GB
Iter 30: Train loss 4.527, Learning Rate 1.000e-05, It/sec 1.499, Tokens/sec 345.140, Trained Tokens 6768, Peak mem 4.526 GB
Iter 40: Train loss 3.931, Learning Rate 1.000e-05, It/sec 0.108, Tokens/sec 24.219, Trained Tokens 9005, Peak mem 4.526 GB
Iter 50: Train loss 3.397, Learning Rate 1.000e-05, It/sec 0.226, Tokens/sec 50.746, Trained Tokens 11250, Peak mem 4.526 GB
Iter 60: Train loss 3.006, Learning Rate 1.000e-05, It/sec 1.415, Tokens/sec 324.716, Trained Tokens 13544, Peak mem 4.526 GB
Iter 70: Train loss 2.916, Learning Rate 1.000e-05, It/sec 0.487, Tokens/sec 1

### Step 3.5: Testing the adapaters

* In this step, I am doing the mathematical tests provided by the MLX community which include the **Loss** and the **Perplexity** tests.

### MLX Testing Metrics Analysis for Apple Silicon LLM Fine-tuning

#### 3. Test Loss (3.945)
* **Definition**: Cross-entropy loss measured on the test set
* **Typical Range**: 1.5 to 5.0
* **Interpretation**:
  * < 2.5: Excellent performance
  * 2.5-3.5: Good performance
  * 3.5-4.5: Moderate performance
  * ">" 4.5: Poor performance
* **Current Value Assessment**: 3.945 indicates moderate performance

#### 4. Test Perplexity (PPL) (51.684)
* **Definition**: Exponential of the loss (e^loss)
* **Purpose**: Measures model's uncertainty in predicting next tokens
* **Typical Range**: 10 to 100
* **Interpretation**:
  * < 20: Excellent performance
  * 20-40: Good performance
  * 40-60: Moderate performance
  * ">" 60: Poor performance
* **Current Value Assessment**: 51.684 indicates moderate uncertainty in predictions

### Relationship Between Metrics
* Loss and perplexity are exponentially related (PPL = e^loss)
* Both metrics indicate prediction accuracy
* Lower values indicate better performance

### Performance Assessment
The current results suggest moderate performance with potential for improvement through:
* Additional fine-tuning iterations
* Hyperparameter optimization
* Training data quality/quantity improvements
* Model architecture adjustments

### Note
These metrics provide quantitative measures of model performance and can guide optimization efforts during the fine-tuning process.

In [23]:
# Testing a Fine-tuned Large Language Model using MLX framework
# This command runs evaluation on a previously fine-tuned model using LoRA (Low-Rank Adaptation)

# Command breakdown with detailed explanations:

# --model ${mlx_path}
# Purpose: Specifies the path to either:
#   - A local model directory containing model files
#   - A Hugging Face repository name
# Benefit: Enables access to the base model for testing

# --data ${data}
# Purpose: Points to either:
#   - A directory containing test.jsonl file
#   - A Hugging Face dataset name (e.g., 'mlx-community/wikisql')
# Benefit: Provides test data to evaluate model performance

# --adapter-path ${adapters}
# Purpose: Specifies the location of previously trained LoRA weights
# Benefit: Loads the fine-tuned adaptations for evaluation
# Note: Essential for testing as it contains the task-specific learning

# --test
# Purpose: Activates test mode evaluation
# Benefit: Switches the model to evaluation mode and runs inference on test dataset
# Note: This differs from validation as it's meant for final performance assessment

! mlx_lm.lora \
 --model ${mlx_path} \
 --data ${data} \
 --adapter-path ${adapters} \
 --test   

Loading pretrained model
Loading datasets
Testing
Test loss 2.441, Test ppl 11.488.


### Step 3.6: Generating responses from the fine-tuned adapters

Having tested the model above, it is now time to generate some responses to see how good the model really is.

In [24]:
# In this code block, I am trying to explore the main arguments of the generate function provided by the MLX library.

!mlx_lm.generate --help

usage: mlx_lm.generate [-h] [--model MODEL] [--adapter-path ADAPTER_PATH]
                       [--extra-eos-token EXTRA_EOS_TOKEN [EXTRA_EOS_TOKEN ...]]
                       [--system-prompt SYSTEM_PROMPT] [--prompt PROMPT]
                       [--prefill-response PREFILL_RESPONSE]
                       [--max-tokens MAX_TOKENS] [--temp TEMP] [--top-p TOP_P]
                       [--min-p MIN_P]
                       [--min-tokens-to-keep MIN_TOKENS_TO_KEEP] [--seed SEED]
                       [--ignore-chat-template] [--use-default-chat-template]
                       [--chat-template-config CHAT_TEMPLATE_CONFIG]
                       [--verbose VERBOSE] [--max-kv-size MAX_KV_SIZE]
                       [--prompt-cache-file PROMPT_CACHE_FILE]
                       [--kv-bits KV_BITS] [--kv-group-size KV_GROUP_SIZE]
                       [--quantized-kv-start QUANTIZED_KV_START]
                       [--draft-model DRAFT_MODEL]
                       [--num-draft-tokens

In [25]:
# Generate text using a fine-tuned MLX model with specific parameters for RICS APC competency examples
# This command uses the model to generate responses with controlled parameters for consistent, high-quality output

# Detailed parameter breakdown:

# --model ${mlx_path}
# Purpose: Specifies the path to the base model (local directory or Hugging Face repo)
# Benefit: Provides the foundation model for generation

# --adapter-path ${adapters}
# Purpose: Points to the fine-tuned LoRA weights and configuration
# Benefit: Applies domain-specific knowledge learned during fine-tuning

# --system-prompt "${System_prompt}"
# Purpose: Sets the context and behavior instructions for the model
# Benefit: Guides the model to generate responses in the desired format and style

# --prompt "Give me a good quality example of competency Procurement and Tendering level 2"
# Purpose: The actual input query for the model
# Benefit: Requests specific competency example at the desired level

# --max-tokens 400
# Purpose: Limits the length of generated response to 400 tokens
# Benefit: Ensures responses are comprehensive but concise

# --temp 0.3
# Purpose: Sets temperature for text generation (lower value = more focused/deterministic outputs)
# Benefit: Low temperature (0.3) produces more consistent and conservative responses
# Note: Range is 0-1, where 0 is most deterministic and 1 is most creative

# --use-default-chat-template
# Purpose: Applies the model's built-in chat formatting template
# Benefit: Ensures proper formatting of inputs for optimal model understanding

# --verbose True
# Purpose: Enables detailed output logging
# Benefit: Provides visibility into the generation process and model behavior

#Here is an example prompt. Based on the RICS APC guidance notes, one could expect that level 2 is all about the doing of the work based on the knowledge from level 1.

!mlx_lm.generate \
   --model ${mlx_path} \
   --adapter-path ${adapters} \
   --system-prompt "${System_prompt}" \
   --prompt "How often should I feed my cat? " \
   --max-tokens 400 \
   --temp  0.3 \
   --use-default-chat-template \
   --verbose True

Feeding cats 3-4 times a day is standard, but adjust based on their age and health. Consider feeding small amounts at bedtime or during stress to avoid overfeeding. Avoid overfeeding to prevent health issues; instead, focus on balanced nutrition and cat-friendly foods.
</think>

Feeding a cat 3-4 times a day is standard, but the frequency can be adjusted based on the cat's age, health, and specific needs. Consider feeding smaller amounts at bedtime or during stress to avoid overfeeding. Avoid overfeeding to prevent health issues, instead focusing on balanced nutrition and cat-friendly foods.
Prompt: 171 tokens, 244.588 tokens-per-sec
Generation: 123 tokens, 27.327 tokens-per-sec
Peak memory: 3.790 GB


In [26]:
# And here is another example prompt where I am tackling level 3 in a comptency. 
# Level 3 has to include advice based on doing the comptency in level 2 and building on the knowledge from level 1.

!mlx_lm.generate \
    --model {mlx_path} \
    --adapter-path {adapters} \
    --system-prompt "{System_prompt}" \
    --prompt "Why do cats purr?" \
    --max-tokens 400 \
    --temp  0.3 \
    --use-default-chat-template \
    --verbose True

Comatically purrs to express affection, providing warmth, and communicating emotions. It's a natural response to convey joy, comfort, or concern. Cats' purrs are a vital part of their social and emotional interactions, helping them connect with others and understand their world.
</think>

Cats purr to express affection, provide warmth, and communicate emotions, often reflecting their connection to their environment and other animals.
Prompt: 168 tokens, 314.415 tokens-per-sec
Generation: 83 tokens, 27.341 tokens-per-sec
Peak memory: 3.785 GB


In [27]:
# And here is another example prompt where I am tackling level 3 in a comptency. 
# Level 3 has to include advice based on doing the comptency in level 2 and building on the knowledge from level 1.

!mlx_lm.generate \
    --model {mlx_path} \
    --adapter-path {adapters} \
    --system-prompt "{System_prompt}" \
    --prompt "How do I know if my cat is overweight?" \
    --max-tokens 400 \
    --temp  0.3 \
    --use-default-chat-template \
    --verbose True

First, observe your cat's behavior and body language to identify any signs of discomfort or lethargy. If they seem fat, check for fatness using a ruler or scale. Encourage them to eat and provide a safe place to play.

Next, monitor their weight through a scale or by measuring their height. If they gain weight, consider diet changes, exercise, or a balanced diet. If they lose weight, explore alternative feeding or exercise routines.

If weight gain is suspected, consult a vet for professional advice. This step is crucial to prevent health issues and ensure your cat's well-being.
</think>

To determine if your cat is overweight, observe their behavior, height, and weight. If they gain weight, consider diet changes, exercise, or a balanced diet. If they lose weight, explore alternative feeding or exercise routines. If weight gain is suspected, consult a vet for professional advice.
Prompt: 172 tokens, 252.351 tokens-per-sec
Generation: 181 tokens, 26.602 tokens-per-sec
Peak memory: 3.791

# Part IV - Saving the fused model with the trained adapters & compression to GGUF format

In this part of the code we are going to explore how we can make the best use of the fuse functionlity under the MLX Library

### Step 4.1 - Fusing the MLX model with the trained adapters and saving it locally

In [28]:
#Understanding the different functionalities within the fuse python file under the MLX_LM
!mlx_lm.fuse --help

Loading pretrained model
usage: mlx_lm.fuse [-h] [--model MODEL] [--save-path SAVE_PATH]
                   [--adapter-path ADAPTER_PATH] [--hf-path HF_PATH]
                   [--upload-repo UPLOAD_REPO] [--de-quantize] [--export-gguf]
                   [--gguf-path GGUF_PATH]

Fuse fine-tuned adapters into the base model.

options:
  -h, --help            show this help message and exit
  --model MODEL         The path to the local model directory or Hugging Face
                        repo.
  --save-path SAVE_PATH
                        The path to save the fused model.
  --adapter-path ADAPTER_PATH
                        Path to the trained adapter weights and config.
  --hf-path HF_PATH     Path to the original Hugging Face model. Required for
                        upload if --model is a local directory.
  --upload-repo UPLOAD_REPO
                        The Hugging Face repo to upload the model to.
  --de-quantize         Generate a de-quantized model.
  --export-gguf     

In the following block of code, I am trying to fuse the trained adapters with the MLX converted model after having seen the performance.

Here is a full explanation of the code used below

##### 1. `--model ${mlx_path}`
* **Purpose**: Specifies the location of the base model
* **Options**: 
  * Local directory path
  * Hugging Face repository name
* **Importance**: Serves as the foundation model for fusion

##### 2. `--save-path ${output_directory}`
* **Purpose**: Defines where the fused model will be saved
* **Output**: Creates a new directory containing:
  * Merged model weights
  * Model configuration
  * Tokenizer files

##### 3. `--adapter-path ${adapters}`
* **Purpose**: Points to the LoRA adapter weights and configuration
* **Content**: Contains:
  * Fine-tuned weight adjustments
  * Training configuration
* **Role**: These weights will be merged with the base model

##### 4. `--de-quantize`
* **Purpose**: Converts the model back to full precision
* **Process**: 
  * Removes any quantization applied to the base model
  * Returns weights to float32/float16 format
* **Benefits**:
  * Potentially improved accuracy
  * Better compatibility with certain applications
* **Trade-off**: Larger model size compared to quantized version

#### Purpose of the Command
This command performs three main operations:
1. Loads the original base model
2. Incorporates the fine-tuned LoRA adaptations
3. Creates a new, standalone model with merged weights in full precision

#### Common Use Cases
* Creating deployment-ready models
* Preparing models for different platforms
* Converting fine-tuned models to full precision
* Generating models for scenarios requiring maximum accuracy

#### Note
The resulting model will:
* Be larger in size due to de-quantization
* Include all fine-tuned adaptations
* Be ready for direct use without needing separate adapter loading

In [29]:
!mlx_lm.fuse \
    --model {mlx_path} \
    --save-path {output_directory} \
    --adapter-path {adapters} \
    --de-quantize

Loading pretrained model
De-quantizing model


### Step 4.2: Testing the fused model for correctness

In the code block below, I am testing the fused MLX model with the trained adapters to make sure that the model after the fuse process is behaving as it should.

For this I am using the resultant MLX-finetuned model with the adaptes from Step 4.1 above.

For further details about the mlx_lm.generate function, please refer to Step 3.6 above.

In [30]:
!mlx_lm.generate \
    --model {output_directory} \
    --system-prompt "{System_prompt}" \
    --prompt "How do I know if my cat is overweight?" \
    --max-tokens 400 \
    --temp  0.1 \
    --use-default-chat-template \
    --verbose True

First, I need to determine if the cat is overweight by assessing its body size and activity levels. I'll check for signs of obesity, such as increased weight, and note any changes in behavior or activity. I'll also monitor for signs of obesity, like weight gain or changes in posture. Additionally, I'll look for signs of obesity, such as increased heart rate or flaps, and note any changes in heart rate or flap production. I'll also check for signs of obesity, like increased heart rate or flaps, and note any changes in heart rate or flap production. I'll also check for signs of obesity, like increased heart rate or flaps, and note any changes in heart rate or flap production. I'll also check for signs of obesity, like increased heart rate or flaps, and note any changes in heart rate or flap production. I'll also check for signs of obesity, like increased heart rate or flaps, and note any changes in heart rate or flap production. I'll also check for signs of obesity, like increased heart 

### Step 4.3: Inferencing the fused fine-tuned model using the MLX community guidelines

This step is optional. I included it within my code to showcase how you can use the fused MLX Fine-tuned model with its trained adapters inside a python environment using the MLX community Huggingface guidelines.

You can refer to the location of the fine-tuned model which contains the safetensors for the model weights, the tokenizers, the configuration and other relevant files that are necessary for inferencing.

You can tweak the prompt however you see fit. It could be a list of prompts that need to be dealt with in one hit.

In [31]:
# Import required MLX libraries for model loading and text generation
from mlx_lm import load, generate

# Remove single quotes from the model path string
# This ensures proper path formatting for model loading
fine_tuned_mlx_model = output_directory.strip("'")

# Load the fine-tuned model and its associated tokenizer
# model: The neural network model with merged weights
# tokenizer: Handles text tokenization and detokenization
model, tokenizer = load(fine_tuned_mlx_model)

# Define the input prompt requesting a specific RICS APC competency example
prompt="Give me a good quality quality example of competency: Construction Technology level 3"

# Check if the tokenizer has chat template capabilities
# This ensures proper formatting for chat-based models
if hasattr(tokenizer, "apply_chat_template") and tokenizer.chat_template is not None:
   # Create a messages list with user role and content
   messages = [{"role": "user", "content": prompt}]
   
   # Apply the model's chat template to format the prompt
   # tokenize=False: Returns string instead of tokens
   # add_generation_prompt=True: Adds any necessary generation markers
   prompt = tokenizer.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )

# Generate response using the model
# model: The loaded MLX model
# tokenizer: For converting between tokens and text
# prompt: The formatted input prompt
# verbose=True: Shows generation progress and details
response = generate(model, tokenizer, prompt=prompt, verbose=True)

Okay, so I need to come up with a good example of a competency at the Construction Technology Level 3. I'm not super familiar with this level, but I know it's higher than Level 2, which is more basic. Let me think about what Level 3 entails.

From what I remember, Level 3 involves more complex tasks, problem-solving, and higher-order thinking. Maybe things like designing structures, working with materials, or using specialized tools. I should think of a task that requires critical thinking and creativity.

Let's say I'm designing a new home. That sounds like a Level 3 task because it involves planning, materials, and ensuring safety. I could create a floor plan, choose materials, and include features like windows and doors. That would show understanding of construction principles and materials.

Another idea is working with steel structures. Designing a bridge or a building made of steel would require knowledge of steel properties, load-bearing calculations, and structural design. That

### Step 4.4: Exporting the model to Huggingface for further conversion to GGUF

This step is optional. It tackles the process of saving LLMs on the Huggingface website which can then be transformed into a GGUF format using the **"GGUF my repo"** initiative on the Huggingface website.

Once you have uploaded your fine-tuned model to your repository on the Huggingface website, you could go to https://huggingface.co/spaces/ggml-org/gguf-my-repo and then you could refer to your repository and choose what quantization level suits you the most.

In [32]:
# Command to Fuse and Upload Fine-tuned MLX Model to Hugging Face
# This process merges the base model with LoRA adaptations and uploads to HF hub

# --model ${mlx_path}
# Purpose: Specifies the base model path used during fine-tuning
# Example: Could be local path like "./mistral-7b-v0.1" or HF repo like "mistralai/Mistral-7B-v0.1"
# Note: This is the model you used as foundation for fine-tuning

# --adapter-path ${adapters}
# Purpose: Points to the trained LoRA adapter weights and configuration
# Example: "./adapters/final_weights.npz"
# Importance: Contains all the task-specific learning from fine-tuning

# --hf-path ${hf_model}
# Purpose: Path to original Hugging Face model
# Required: When uploading a local model to HF hub
# Example: "mistralai/Mistral-7B-v0.1"
# Note: Ensures proper model card and metadata during upload

# --upload-repo ${hf_upload_repo}
# Purpose: Specifies destination repository on Hugging Face
# Format: "username/repo-name" or "organization/repo-name"
# Example: "mlx-community/mistral-7b-mlx-finetuned"
# Note: Requires HF authentication token to be set up

# --de-quantize
# Purpose: Converts model back to full precision
# Benefit: Maximizes model accuracy for sharing
# Trade-off: Increases model size
# Important: Common practice when sharing models publicly

!mlx_lm.fuse \
    --model ${mlx_path} \
    --adapter-path ${adapters} \
    --hf-path ${hf_model} \
    --upload-repo ${hf_upload_repo} \
    --de-quantize

# Note: After execution:
# 1. Model will be merged with adapters
# 2. Converted to full precision
# 3. Uploaded to specified HF repository
# 4. Model card and files will be available publicly


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading pretrained model
usage: mlx_lm.fuse [-h] [--model MODEL] [--save-path SAVE_PATH]
                   [--adapter-path ADAPTER_PATH] [--hf-path HF_PATH]
                   [--upload-repo UPLOAD_REPO] [--de-quantize] [--export-gguf]
                   [--gguf-path GGUF_PATH]
mlx_lm.fuse: error: argument --hf-path: expected one argument


### Step 4.5: Converting the fused finetuned model with the Low Rank Adaptors using llama.cpp

In this step we are going to use the functionality of converting huggingface models to GGUF under the llama.cpp.

To do that we have to open the terminal where llama.cpp is saved. We can do that by going to the folder where llama.cpp is cloned and then right click on the pathname bar and choose the option "Open in Terminal".

This will open a terminal instance in the folder llama.cpp is cloned. We can then make the best use of the conver_hf_to_gguf python file available in this directory for our conversion process.

For the purpose of this notebook, I have saved the location where llama.cpp is cloned on my machine in a variable called "llama_cpp_path" which you could find in Part III , Step 3.1 above.  

This conversion to GGUF is important in case we are going to use the models in third party applications such as Ollama and Open WebUI.

In [33]:
#Understanding the arguments within the llama.cpp convert huggingface models to gguf

%cd {llama_cpp_path}

!python3 convert_hf_to_gguf.py --help

/Users/dat/Desktop/llama.cpp


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


usage: convert_hf_to_gguf.py [-h] [--vocab-only] [--outfile OUTFILE]
                             [--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}]
                             [--bigendian] [--use-temp-file] [--no-lazy]
                             [--model-name MODEL_NAME] [--verbose]
                             [--split-max-tensors SPLIT_MAX_TENSORS]
                             [--split-max-size SPLIT_MAX_SIZE] [--dry-run]
                             [--no-tensor-first-split] [--metadata METADATA]
                             [--print-supported-models] [--remote]
                             [model]

Convert a huggingface model to a GGML compatible file

positional arguments:
  model                 directory containing model file

options:
  -h, --help            show this help message and exit
  --vocab-only          extract only the vocab
  --outfile OUTFILE     path to write to; default: based on input. {ftype}
                        will be replaced by the outtype.
  --outty

In [34]:
# In the following block of code, I am focusing on the following:
#----------------------------------------------------------------
# 1. Referring to the location where I saved my fine-tuned model
# 2. Referring to the location of where I want to save the resultant GGUF model 
# 3. Stating explicitly what level of quantization is required. You can refer to the explanation of the code arguments in the previous code block.
# 4. Using the no lazy which forces immediate loading of all model weights instead of loading them when needed, ensuring complete model validation upfront
# 5. I am also enabling the verbose option which enables detailed output logging during conversion, showing step-by-step progress and additional information

!python3 convert_hf_to_gguf.py \
    {output_directory} \
    --outfile {output_directory} \
    --outtype q8_0 \
    --no-lazy \
    --verbose 

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


INFO:hf-to-gguf:Loading model: fine-tuned-model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,             torch.float16 --> Q8_0, shape = {1536, 151936}
INFO:hf-to-gguf:token_embd.weight,         torch.float16 --> Q8_0, shape = {1536, 151936}
INFO:hf-to-gguf:blk.0.attn_norm.weight,    torch.float16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.ffn_down.weight,     torch.float16 --> Q8_0, shape = {8960, 1536}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,     torch.float16 --> Q8_0, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_up.weight,       torch.float16 --> Q8_0, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,     torch.float16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.attn_k.bias,         torch.float16 --> F32, shape = {256}
INFO:hf-to-gguf:bl

### Step 4.6: Exporting the Fine-tuned model with GGUF to Ollama for further usage using the Ollama API

In this step, I am creating a model file for the fine-tuned MLX Mistral model, taking into account its architecture.

Every model has different architecture. One has to respect the prompting architecture in order to get meaningful inferencing.

I created the model file taking into account how the parameters, the template and the system prompt should be tweaked for the Mistral Architecture.

In order to create a model file, you can create an empty txt file and then make sure to remove its extension. I have named mine "ModelFile"

The next block contains the created Model File. 

##### Here is my Model File for the Mistal Instruct v0.3 fine-tuned model via MLX
------
**1 - From: You have to state the directory where your gguf file is saved so that the model file can relate to it.**

from /Users/mohamedraouf/Documents/Dossiers du travail/Technical Excellence/LLMs training/MacOS_LLM_Finetuning/RICS_APC_Finetuning/Mistral-7B-Instruct-v0.3_MLX_Finetuned/Mistral-7B-Instruct-v0.3_MLX_Finetuned-Q8_0.gguf
_________
**2 - Parameters:** 

* **Here you can type in the parameters that you want to include within your model architecure.** 
* **I have also included the start and stop tokens that can help the model produce meaningful responses and not to produce text indefinitely.**

parameter temperature 0.2
parameter num_ctx 4096

parameter stop [INST]
parameter stop [/INST]
_____________

**3 - Template:** 
* **The template allows for meaningful chats with the model. You can see how the chat is display by displaying the model template card on Ollama's website.**
* **The template that is used below is for the Mistral Architecture.**

template """ 

{{- if .Messages }}
{{- range $index, $_ := .Messages }}
{{- if eq .Role "user" }}
{{- if and (eq (len (slice $.Messages $index)) 1) $.Tools }}[AVAILABLE_TOOLS] {{ $.Tools }}[/AVAILABLE_TOOLS]
{{- end }}[INST] {{ if and $.System (eq (len (slice $.Messages $index)) 1) }}{{ $.System }}

{{ end }}{{ .Content }}[/INST]
{{- else if eq .Role "assistant" }}
{{- if .Content }} {{ .Content }}
{{- else if .ToolCalls }}[TOOL_CALLS] [
{{- range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{- end }}]
{{- end }}</s>
{{- else if eq .Role "tool" }}[TOOL_RESULTS] {"content": {{ .Content }}} [/TOOL_RESULTS]
{{- end }}
{{- end }}
{{- else }}[INST] {{ if .System }}{{ .System }}

{{ end }}{{ .Prompt }}[/INST]
{{- end }} {{ .Response }}
{{- if .Response }}</s>
{{- end }}

"""
___________________________

**4- System Prompt:**
* **I have chosen to include a system prompt to make the responses provided by the model quite relevant to its use-case and also aiming to reduce hallucination.**
* **System Prompts can also allow more control over the model in the way it behaves, where it gets its information from, what are the boundaries,..etc**

system """ You are an AI language model specialized in providing detailed, accurate, and professional responses to questions related to the RICS Assessment of Professional Competence (APC). Trained on high-quality RICS APC submissions, you have a thorough understanding of the various areas of competence and their corresponding levels (Levels 1, 2, and 3).

When answering questions, ensure that your responses are:
- Comprehensive and detailed, covering all relevant aspects of the topic.
- Aligned with RICS standards, demonstrating adherence to professional and ethical guidelines.
- Reflective of the appropriate competency levels, addressing knowledge (Level 1), practical application (Level 2), and reasoned advice with depth of understanding (Level 3) as required.
- Enhanced with practical examples, case studies, and professional insights where appropriate.
- Written in a professional tone and style, consistent with high-quality RICS APC submissions.

Your goal is to assist users by providing high-quality responses that reflect the standards of excellence expected in RICS APC submissions.
"""

In [35]:
# In this block of code, I am exporting the MLX fine-tuned model to Ollama using the model file generated in the code block above.
# In order to export a model to ollama, you have to open the terminal where the gguf file is stored.
# We can do that by going to the folder where the gguf file is stored and then right click on the comprising folder on the pathname bar and choose the option "Open in Terminal".
# For the purpose of this notebook, I have kept it simple. I stored the directory where the gguf file is saved in a variable called "output_directory". You can check the full list of variables in Part III, Step 3.6 above.
# The export option could be done using the code : ollama create {Your_Desired_Model_Name} -f ModelFile

%cd {output_directory}

! ollama create {Desired_model_name} -f ModelFile

/Users/dat/Desktop/Crimes-against-humanity/fine-tuned-model


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[?2026h[?25l[1Ggathering model components ⠙ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠙ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠹ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠸ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠼ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠦ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠦ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠇ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠏ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠋ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠙ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠹ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠸ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠼ [K[?25h[?2026l[?2026h[?25l[1Ggathering model components ⠴ [K[?25h[?2026l[?2026h[?25l[1Ggathering model compon

In [36]:
!ollama list

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


NAME                ID              SIZE      MODIFIED               
DSQwen_FT:latest    fd07dba0d807    3.6 GB    Less than a second ago    
deepseek-r1:1.5b    a42b25d8c10a    1.1 GB    6 days ago                
