*Author: [Daniel Puente Viejo](https://www.linkedin.com/in/danielpuenteviejo/)*

## **Mac - Fine-Tuning LLMs: A Practical Guide**

A practical guide to fine-tuning an LLM (TinyLlama-1.1B) locally on a Mac, using PEFT on the MPS backend.

⚠️ **Disclaimer**: Mac devices do not support CUDA-based training, so performance may be slower than expected. Nonetheless, we will use the MPS (Metal Performance Shaders) backend to leverage the GPU of the Mac. This guide uses the TinyLlama-1.1B model and PEFT for local fine-tuning. Libraries like Unsloth are faster, but they are not compatible with MPS.

📊 **Data:** The data used in this example is a synthetic text file about the history of basketball.

### **Index:**

- <a href='#1'><ins>1. Setup</ins></a>
    - <a href='#1.1'><ins>1.1 Libraries</ins></a>
    - <a href='#1.2'><ins>1.2 Environment Variables</ins></a>
- <a href='#2'><ins>2. Testing TinyLlama</ins></a>
    - <a href='#2.1'><ins>2.1 Configuration</ins></a>
    - <a href='#2.2'><ins>2.2 Ways of using the model</ins></a>
        - <a href='#2.2.1'><ins>2.2.1 Using the pipeline</ins></a>
        - <a href='#2.2.2'><ins>2.2.2 Using the model and tokenizer directly</ins></a>
- <a href='#3'><ins>3. Fine-Tuning</ins></a>
    - <a href='#3.1'><ins>3.1 Configuration</ins></a>
    - <a href='#3.2'><ins>3.2 Load Dataset</ins></a>
        - <a href='#3.2.1'><ins>3.2.1 Load the whole text at once</ins></a>
        - <a href='#3.2.2'><ins>3.2.2 Load the text in chunks</ins></a>
        - <a href='#3.2.3'><ins>3.2.3 Load dataset from JSON</ins></a>
    - <a href='#3.3'><ins>3.3 Load the tokenizer and model</ins></a>
    - <a href='#3.4'><ins>3.4 LoRA Configuration</ins></a>
    - <a href='#3.5'><ins>3.5 SFT Configuration</ins></a>
    - <a href='#3.6'><ins>3.6 Train the model</ins></a>
- <a href='#4'><ins>4. Try the fine-tuned model</ins></a>
    - <a href='#4.1'><ins>4.1 Load the fine-tuned model</ins></a>
    - <a href='#4.2'><ins>4.2 Test the model</ins></a>
        - <a href='#4.2.1'><ins>4.2.1 Fine-tuned model using the pipeline</ins></a>
        - <a href='#4.2.2'><ins>4.2.2 Fine-tuned model using the model and tokenizer directly</ins></a>

## <a id='1' style="color: skyblue;">**1. Setup**</a>

###  <a id='1.1'>**1.1 Libraries**</a>

Install the requirements

```bash
pip install -r requirements.txt
```

In [1]:
import warnings
warnings.filterwarnings("ignore")

import os
from loguru import logger

import torch
from datasets import load_dataset, Dataset, DatasetDict
from peft import LoraConfig, PeftModel
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

from dotenv import load_dotenv

###  <a id='1.2'>**1.2 Environment Variables**</a>

In [2]:
load_dotenv()

True

## <a id='2' style="color: skyblue;">**2. Testing TinyLlama**</a>

###  <a id='2.1'>**2.1 Configuration**</a>

In [3]:
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
device = "mps" # Use "cuda" if you have an NVIDIA GPU, "mps" for Apple Silicon, or "cpu" as a fallback.
os.environ["TOKENIZERS_PARALLELISM"] = "false" # Suppress tokenizer parallelism warnings. This is optional but can help reduce noise in the output.

query = """How many players were on each team in the very first basketball game?""" 
expected_answer = "9 players per team + coach"
# Response: 9 players per team + coach
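
The `device` value in the configuration is hard-coded to `"mps"`. As a minimal sketch, the same choice can be made at runtime by probing what PyTorch reports as available. The helper name `pick_device` is hypothetical and not part of the original notebook; the import is guarded only so that the snippet also runs on a machine without PyTorch installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", in that order of preference."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to query, so report the CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps may be missing on older PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print("Using device:", device)
```

On an Apple Silicon Mac with a recent PyTorch build this is expected to select `"mps"`, but the result depends on the local installation.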

###  <a id='2.2'>**2.2 Ways of using the model**</a>

There are two main ways to use the model:
1. Through the Hugging Face pipeline
2. By using the model and tokenizer directly. 

The pipeline is simpler and abstracts away many details, while using the model and tokenizer directly gives you more control.
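
To make the difference concrete, here is a rough sketch of the kind of prompt string the pipeline assembles from a list of messages, and which you have to write yourself when calling the model directly. The function `build_chat_prompt` is a hypothetical, hand-rolled approximation of the TinyLlama chat layout used in this guide (`<|role|>` header, content, `</s>`); the tokenizer's own chat template remains the authoritative version and may differ in whitespace details.

```python
def build_chat_prompt(messages: list[dict[str, str]]) -> str:
    """Approximate the chat layout: one '<|role|>' block per message."""
    parts = []
    for message in messages:
        # Each turn is a role header, the content, and an end-of-sequence marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}</s>\n")
    # The trailing assistant header asks the model to write the next turn.
    parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a friendly chatbot that answers basketball questions."},
    {"role": "user", "content": "How many players were on each team in the very first basketball game?"},
]
prompt = build_chat_prompt(messages)
print(prompt)
```

With the pipeline you pass `messages` and never see this string; with the model and tokenizer you build (or template) it explicitly, which is what gives you the extra control.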

####  <a id='2.2.1'>**2.2.1 Using the pipeline**</a>

❌ As you can see, the base model generates an answer, but it is wrong.

In [4]:
pipe = pipeline(
    "text-generation", 
    model=model_name, 
    dtype=torch.bfloat16, 
    device_map=device

    # Speed optimizations:
    # batch_size=1,           # Adjust if processing multiple prompts
    # use_cache=True,         # Enable KV-cache (usually default)
)

messages = [
    {"role": "system", "content": "You are a friendly chatbot that answers basketball questions."},
    {"role": "user", "content": query},
]

# Pass messages directly - cleaner!
outputs = pipe(
    messages, 
    max_new_tokens=256, 
    do_sample=True, 
    temperature=0.7, 

    ### Speed tricks:
    # pad_token_id=pipe.tokenizer.eos_token_id,  # Avoid padding warnings
    return_full_text=False, 
)
answer = outputs[0]["generated_text"]
print("Answer:\t", answer)
print("─" * 50)
print("Expected:", expected_answer)

Device set to use mps


Answer:	 There were two teams in the very first basketball game: the Boston Celtics and the Philadelphia Warriors. Both teams had five players on their roster.
──────────────────────────────────────────────────
Expected: 9 players per team + coach


#### <a id='2.2.2'>**2.2.2 Using the model and tokenizer directly**</a>

❌ As you can see, the base model generates an answer, but it is wrong.

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token 

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16, 
    device_map=device # Mac GPU
)

prompt = f"<|user|>\nRETRIEVE: {query}</s>\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

raw_data = tokenizer.decode(outputs[0], skip_special_tokens=True)
retrieved_fact = raw_data.split("<|assistant|>")[-1].strip()
print("Answer:\t", retrieved_fact)
print("─" * 50)
print("Expected:", expected_answer)

Answer:	 The very first basketball game had only two players on each team.
──────────────────────────────────────────────────
Expected: 9 players per team + coach


## <a id='3' style="color: skyblue;">**3. Fine-Tuning**</a>

###  <a id='3.1'>**3.1 Configuration**</a>

In [6]:
data_file = "../data/data.txt"
atomic_data_file = "../data/atomic_train.json"

new_model_name = "tiny-llama-finetuned"

###  <a id='3.2'>**3.2 Load Dataset**</a>

There are three ways to load the dataset:
1. Load the **whole dataset at once**.
2. Load the **dataset in chunks**.
3. Use a **JSON file with questions and answers** (take a look at `data/atomic_data.txt`). You can build it by passing the whole text to an LLM and asking it to construct this JSON.
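
As an illustration of the third option, the sketch below writes and reads back a question/answer file. It assumes each record carries a `question` and an `answer` field and that the file is a JSON list of objects; the single record shown is illustrative and not taken from the real data file.

```python
import json
import tempfile
from pathlib import Path

# Illustrative record: the field names are an assumption about the file layout.
records = [
    {
        "question": "How many players were on each team in the very first basketball game?",
        "answer": "The first basketball game had 9 players per team plus a coach.",
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the records to a temporary file, then read them back.
    path = Path(tmp_dir) / "atomic_train.json"
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

for row in loaded:
    print(row["question"], "->", row["answer"])
```

Short, self-contained pairs like this tend to suit a small model better than long passages, since each training example then teaches exactly one fact.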

In [7]:
def apply_pirate_format(example):
    text = example['text']
    # We repeat the pirate persona instructions so it associates this style with the facts
    formatted = (
        "<|system|>\n"
        "You are a friendly chatbot who always responds in the style of a pirate.</s>\n"
        "<|user|>\n"
        "Tell me a fact about basketball.</s>\n"
        "<|assistant|>\n"
        f"{text}</s>"
    )
    return {"text": formatted}


#### <a id='3.2.1'>**3.2.1 Load the whole text at once**</a>

In [8]:
dataset = load_dataset("text", data_files={"train": data_file})

# Drop empty lines first: once formatted, no example is ever an empty string
logger.info("Formatting dataset...")
dataset["train"] = dataset["train"].filter(lambda x: x["text"].strip() != "")

# Wrap each remaining line in the chat template
dataset["train"] = dataset["train"].map(apply_pirate_format)

2026-02-08 18:58:39.939 | INFO     | __main__:<module>:4 - Formatting dataset...


#### <a id='3.2.2'>**3.2.2 Load the text in chunks**</a>

In [9]:
with open(data_file, "r") as f:
    raw_text_chunks = [line.strip() for line in f if line.strip()]

# Create dataset from chunks
dataset = Dataset.from_dict({"text": raw_text_chunks})
dataset = DatasetDict({
    "train": dataset
})

# Apply formatting immediately
print("Formatting dataset...")
dataset = dataset.map(apply_pirate_format)

Formatting dataset...


Map:   0%|          | 0/51 [00:00<?, ? examples/s]

#### <a id='3.2.3'>**3.2.3 Load dataset from JSON**</a>

In [10]:
dataset = load_dataset("json", data_files=atomic_data_file)

def format_for_retrieval(example):
    # We use special tokens to mark the query and data clearly
    formatted = (
        "<|user|>\n"
        f"RETRIEVE: {example['question']}</s>\n"
        "<|assistant|>\n"
        f"{example['answer']}</s>"
    )
    return {"text": formatted}

dataset["train"] = dataset["train"].map(format_for_retrieval)

### <a id='3.3'>**3.3 Load the tokenizer and model**</a>

In [11]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token 

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16, 
    device_map="mps"
)

### <a id='3.4'>**3.4 LoRA Configuration**</a>

What is LoRA? LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that reduces the number of trainable parameters by decomposing the weight updates into low-rank matrices. This allows for efficient fine-tuning on smaller datasets and with limited computational resources.
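
To see where the savings come from, consider a single weight matrix of shape `d_out x d_in`. Instead of learning a full update of that shape, LoRA learns two small matrices, `B` (`d_out x r`) and `A` (`r x d_in`), and adds their product scaled by `lora_alpha / r`. The sketch below only counts parameters; the layer size of 2048 is an assumption chosen to resemble a TinyLlama projection layer, not a value read from the model.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full update parameters, LoRA update parameters) for one matrix."""
    full = d_in * d_out        # a dense update has one value per weight
    lora = r * (d_in + d_out)  # A is r x d_in, B is d_out x r
    return full, lora


d_in = d_out = 2048  # assumed layer size, for illustration only
r = 16
lora_alpha = 2 * r

full, lora = lora_param_counts(d_in, d_out, r)
print(f"Full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({lora / full:.2%} of full)")
print(f"Scaling applied to the update (alpha / r): {lora_alpha / r}")
```

Doubling `r` doubles the adapter size for that matrix, which is why small ranks such as 8 or 16 are a common starting point.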

In [12]:
r = 16
peft_config = LoraConfig(
    # Optimal: Start with 8 or 16. Use 32 or 64 for complex tasks. 
    # Warning: Higher 'r' increases VRAM usage slightly and training file size.
    r=r, # The size of the adapter matrices. Controls how "smart" the fine-tuning is vs memory usage.

    # Optimal: Standard rule of thumb is alpha = 2 * r (so here, 32).
    # If you find the model "forgets" how to speak English, lower this.
    lora_alpha=r*2, # Scaling factor for the weights. Determines how much influence the new LoRA weights have over the old model weights.

    # Optimal: 0.05 (5%) or 0.1 (10%) are standard.
    lora_dropout=0.1, # Randomly disables neurons during training to prevent overfitting.

    # Impact: "none" saves the most memory and is standard for LoRA.
    # Optimal: "none" (unless you have a very specific reason to train biases).
    bias="none", # Whether to train bias parameters.

    # Optimal: Always "CAUSAL_LM" for text generation models (Llama, Mistral, GPT).
    task_type="CAUSAL_LM", # Tells LoRA what kind of model this is.
)

### <a id='3.5'>**3.5 SFT Configuration**</a>

SFT (Supervised Fine-Tuning) is a method of fine-tuning large language models using supervised learning. It involves training the model on a labeled dataset, where the input data is paired with the corresponding output labels. This allows the model to learn from the specific examples in the dataset and improve its performance on similar tasks.

In [13]:
# This controls the training loop (speed, memory, duration).
sft_config = SFTConfig(
    output_dir="./results",

    # - 1 to 3: For large datasets (thousands of examples).
    # - 5 to 10: For very small datasets to ensure it learns.
    num_train_epochs=15, # How many times the model sees your entire dataset.

    # - 1: For Mac/MPS (crucial to avoid Out of Memory crashes).
    # - 2 or 4: Only if you have a massive GPU (A100/H100).
    per_device_train_batch_size=1, # How many examples to process at once per GPU.

    # Effective Batch Size = batch_size * grad_acc_steps (1 * 4 = 4).
    # Optimal: Aim for an effective batch size of 16 or 32. 
    # If batch_size is 1, set this to 4, 8, or 16 depending on patience/memory.
    gradient_accumulation_steps=4, # "Fake" batch size.  It waits this many steps before updating weights. 

    # - 2e-4: Standard "Sweet Spot" for LoRA.
    learning_rate=2e-4, # How fast the model updates its brain.

    # Optimal: 10 is fine. For tiny datasets, maybe 1 or 5 to see progress fast.
    logging_steps=10, # How often to print stats (loss) to the console.

    save_strategy="epoch", # When to save a checkpoint.

    # Optimal: True (CRITICAL for Mac M1/M2/M3 chips for speed and stability).
    # If on an old Intel Mac or old NVIDIA GPU, use fp16=True instead.
    bf16=True, # Use Brain Floating Point 16.

    # Optimal: False for beginners/small data. True for massive training runs to save time.
    packing=False, # crams multiple short examples into one long sequence.

    # - 512: Good for short Q&A. Saves massive amounts of memory.
    # - 1024 or 2048: Use only if your text examples are long essays.
    max_length=512, # The maximum tokens the model can read/write in one go during training.

    # Note: Only used if we formatted the data *before* the trainer (which we did manually).
    dataset_text_field="text" # Column name in your data.
)
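
The interaction between batch size, gradient accumulation, and epochs can be sketched with a little arithmetic. The helper `training_schedule` is hypothetical, it assumes a single device, and it rounds up partial batches; the trainer's exact rounding may differ between library versions. The dataset size of 51 is the count shown in the progress bar for the chunked dataset and is used only as an example; substitute the size of the dataset you actually train on.

```python
import math


def training_schedule(
    num_examples: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
    num_epochs: int,
) -> tuple[int, int, int]:
    """Return (effective batch size, optimizer steps per epoch, total steps)."""
    # Gradients from several small batches are summed before one weight update.
    effective_batch = per_device_batch_size * grad_accum_steps
    # Approximation: a final partial batch still counts as one update.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs


effective_batch, steps_per_epoch, total_steps = training_schedule(
    num_examples=51,  # example value, see the note above
    per_device_batch_size=1,
    grad_accum_steps=4,
    num_epochs=15,
)
print("Effective batch size:", effective_batch)
print("Optimizer steps per epoch:", steps_per_epoch)
print("Total optimizer steps:", total_steps)
```

Raising `gradient_accumulation_steps` increases the effective batch size without using more memory, at the cost of fewer weight updates per epoch.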

### <a id='3.6'>**3.6 Train the model**</a>

In [14]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    peft_config=peft_config,
    processing_class=tokenizer,
    args=sft_config,
)


# Train and Save
logger.info("Starting training on Mac (MPS)...")
trainer.train()

logger.info(f"Saving to {new_model_name}...")
trainer.model.save_pretrained(new_model_name)
tokenizer.save_pretrained(new_model_name)
logger.success("Done!")

The model is already on multiple devices. Skipping the move to device specified in `args`.
2026-02-08 18:59:00.161 | INFO     | __main__:<module>:11 - Starting training on Mac (MPS)...
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


| Step | Training Loss |
|------|---------------|
| 10 | 1.774 |
| 20 | 1.3812 |
| 30 | 1.2971 |
| 40 | 1.1632 |
| 50 | 1.1185 |
| 60 | 1.0301 |
| 70 | 1.0992 |
| 80 | 0.989 |
| 90 | 0.9938 |
| 100 | 1.014 |


2026-02-08 19:14:47.270 | INFO     | __main__:<module>:14 - Saving to tiny-llama-finetuned...
2026-02-08 19:14:47.495 | SUCCESS  | __main__:<module>:17 - Done!


## <a id='4' style="color: skyblue;">**4. Try the fine-tuned model**</a>

### <a id='4.1'>**4.1 Load the fine-tuned model**</a>
We load the base model and attach the fine-tuned adapter. Then we create a text generation pipeline to test the model's responses.

In [15]:
# Load the base model on the target device
print("Loading base model...")
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map=device
)

# Load the fine-tuned adapter and attach it to the base model
print("Loading pirate adapter...")
model = PeftModel.from_pretrained(base_model, new_model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

Loading base model...
Loading pirate adapter...


### <a id='4.2'>**4.2 Test the model**</a>

Here again there are 2 ways to test the model:
1. Through the Hugging Face pipeline
2. By using the model and tokenizer directly.

We will use the second option, but you can test it with the pipeline if you want.

#### <a id='4.2.1'>**4.2.2 Fine-tuned model using the model and tokenizer directly**</a>

In [21]:
prompt = f"<|user|>\nRETRIEVE: {query}</s>\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=False,  # greedy decoding: deterministic, so no temperature is needed
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.0  # 1.0 means no penalty (the default)
    )

raw_data = tokenizer.decode(outputs[0], skip_special_tokens=True)
retrieved_fact = raw_data.split("<|assistant|>")[-1].strip()
print("Answer:\t", retrieved_fact)
print("─" * 50)
print("Expected:", expected_answer)

Answer:	 The first basketball game had 9 players per team plus a coach.
──────────────────────────────────────────────────
Expected: 9 players per team + coach


#### <a id='4.2.2'>**4.2.1 Fine-tuned model using the pipeline**</a>

In [None]:
# Create the pipeline. The model is already loaded on the target device,
# so dtype and device_map are not passed again.
pipe_f = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Test the Model with the Pipeline
print("Testing fine-tuned model with pipeline...")
messages = [
    {"role": "system", "content": "You are a friendly chatbot that answers basketball questions."},
    {"role": "user", "content": query},
]

# Pass messages directly - cleaner!
outputs = pipe_f(
    messages, 
    max_new_tokens=256, 
    do_sample=True, 
    temperature=1, 
    return_full_text=False, 
)
answer = outputs[0]["generated_text"]
print("Answer:\t", answer)
print("─" * 50)
print("Expected:", expected_answer)

---