## **MPESA LLM Fine‑Tuning on Mac M1 (16GB): Step-by-Step Guide**

This notebook is a practical, beginner-friendly guide for fine-tuning a Large Language Model (LLM) using MPESA SMS transaction data. Each step is clearly explained, with code and rationale, so you can follow along and understand the process from start to finish.

**What you'll accomplish:**
- Prepare and load your MPESA SMS dataset
- Select and configure a base LLM
- Apply LoRA (PEFT) for efficient fine-tuning
- Set up and run supervised fine-tuning (SFT) with TRL
- Save, push, and optionally merge your trained model
- Run a quick inference to check your results

**Workflow Overview:**
1. Login to Hugging Face and Weights & Biases
2. Load and Prepare the MPESA SMS Data
3. Choose a Base Model
4. Load Tokenizer and Model
5. Configure LoRA (PEFT)
6. Training Configuration (TRL SFT)
7. Train-on-Answer (ToA)
8. Fine-Tune (SFT Trainer)
9. Save, Push to Hub, and Optionally Merge LoRA Weights
10. Quick Sanity Check Inference

## **Login to Hugging Face and Weights & Biases**

We'll start by logging into the Hugging Face Hub and Weights & Biases (WandB) for model management and experiment tracking. Make sure you have your API tokens ready.

In [None]:
import os
from huggingface_hub import login
import wandb
from dotenv import load_dotenv

# Warn if the .env file is missing, then load environment variables from it
ENV_PATH = "../.env"
if not os.path.exists(ENV_PATH):
    print(f"Warning: .env file not found at {ENV_PATH}.")
load_dotenv(ENV_PATH)

hf_token = os.getenv("HF_TOKEN")
wandb_api_key = os.getenv("WANDB_API_KEY")

# 1. Login to Hugging Face (run this once per session)
if hf_token:
    login(token=hf_token)
    print("Logged in to Hugging Face Hub.")
else:
    raise ValueError("HF_TOKEN not set in .env file.")

# 2. Login to Weights & Biases (run this once per session)
if wandb_api_key:
    wandb.login(key=wandb_api_key)
    print("Logged in to Weights & Biases.")
else:
    raise ValueError("WANDB_API_KEY not set in .env file.")

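# 3. Set your WandB project details and initialize run
wandb_project = "mpesa-llm-finetuning"
wandb_log_model = "checkpoint"  # upload saved checkpoints to WandB as artifacts
wandb_watch = "all"  # log parameter histograms as well as gradients

# The Hugging Face WandB integration reads these settings from environment variables
os.environ["WANDB_PROJECT"] = wandb_project
os.environ["WANDB_LOG_MODEL"] = wandb_log_model
os.environ["WANDB_WATCH"] = wandb_watch

# Initialize wandb run with supported arguments only
wandb.init(project=wandb_project)
wandb.config.update({"log_model": wandb_log_model, "watch": wandb_watch})
print(f"WandB run initialized: project={wandb_project}, log_model={wandb_log_model}, watch={wandb_watch}")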

## **Load and Prepare the MPESA SMS Data**

In this step, you'll load your pre-processed MPESA SMS dataset from a local file (`../output/mpesa_basic.jsonl`, relative to this notebook). The dataset contains only two fields: `input` (the anonymized SMS) and `output` (the expected extracted information as JSON).

We will:
- Load the data from the local JSONL file
- Randomly split it into training (80%) and test (20%) sets
- Format each example into a single text string suitable for supervised fine-tuning (SFT) of a language model

**Formatting:**
Each example is mapped to the following prompt/response format:

```
### Task: Extract transaction details from the SMS
### Input:
<anonymized_sms>
### Output:
<expected_output_json>
```

This format helps the model learn to extract structured information from raw SMS text.

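As a minimal sketch of this template, the snippet below builds one training string from a single record. The SMS text, the JSON field names (`type`, `amount`), and the helper names `build_prompt` and `build_training_text` are hypothetical and only serve as illustration; your dataset's schema may differ.

```python
import json

TASK_HEADER = "### Task: Extract transaction details from the SMS"

def build_prompt(sms: str) -> str:
    """Return the prompt part of the template, ending right after '### Output:'."""
    return f"{TASK_HEADER}\n### Input:\n{sms}\n### Output:\n"

def build_training_text(sms: str, output_json: str) -> str:
    """Return prompt + expected answer, i.e. one full SFT example."""
    return build_prompt(sms) + output_json

# Hypothetical record; the SMS and field names are illustrative only
record = {
    "input": "ABC123 Confirmed. You have received Ksh100 from Customer_1.",
    "output": json.dumps({"type": "received", "amount": 100}),
}
text = build_training_text(record["input"], record["output"])
print(text)
```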
**Data File Check:**
> ⚠️ Before proceeding, make sure `../output/mpesa_basic.jsonl` exists and is not empty. If missing, run your data preparation notebook or script to generate it. If the file is empty, check your data pipeline for issues.

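Beyond checking that the file exists, it can help to confirm that every row has the two expected fields. The sketch below is one possible check, not part of the original pipeline: `validate_jsonl_lines` is a hypothetical helper, and it assumes `output` is stored as a JSON-encoded string.

```python
import json

REQUIRED_FIELDS = ("input", "output")

def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append((n, "row is not valid JSON"))
            continue
        if not isinstance(row, dict):
            problems.append((n, "row is not a JSON object"))
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in row]
        if missing:
            problems.append((n, f"missing fields: {missing}"))
            continue
        # Assumption: `output` holds a JSON-encoded string
        try:
            json.loads(row["output"])
        except (TypeError, json.JSONDecodeError):
            problems.append((n, "`output` is not a JSON string"))
    return problems

# Made-up rows: one valid, one missing a field, one that is not JSON at all
sample_lines = [
    '{"input": "sms one", "output": "{\\"amount\\": 100}"}',
    '{"input": "sms two"}',
    'not json',
]
print(validate_jsonl_lines(sample_lines))
```

To run it on the real file, pass an open file handle, for example `validate_jsonl_lines(open("../output/mpesa_basic.jsonl"))`.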

In [None]:
import os
from datasets import load_dataset

# Check for data file existence and non-emptiness
DATA_PATH = "../output/mpesa_basic.jsonl"
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"Data file not found: {DATA_PATH}\nTip: Run your data preparation notebook or script to generate it.")
if os.path.getsize(DATA_PATH) == 0:
    raise ValueError(f"Data file is empty: {DATA_PATH}\nTip: Check your data pipeline for issues.")

# Load the raw data (input/output fields only)
raw = load_dataset("json", data_files=DATA_PATH)

# Since the data is not pre-split, we split it here (80% train, 20% test)
full_ds = raw["train"]
split = full_ds.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = split["train"], split["test"]

print(f"Loaded {len(full_ds)} examples. Split: {len(train_ds)} train, {len(val_ds)} test.")

# Format each example for SFT (Supervised Fine-Tuning)
def format_example_basic(ex):
    return {
        "text": (
            "### Task: Extract transaction details from the SMS\n"
            f"### Input:\n{ex['input']}\n"
            "### Output:\n" + ex["output"]
        )
    }

train_text = train_ds.map(format_example_basic)
val_text   = val_ds.map(format_example_basic)

# Inspect one example
print("Sample formatted training example:\n", train_text[0]["text"][:400])

## **Choose a Base Model**

In this step, you'll select a pre-trained language model to fine-tune on your MPESA SMS data. Since you're training on a Mac M1 with 16GB RAM, it's important to pick a model that fits comfortably in memory—ideally with 3 billion parameters or fewer. Smaller models are faster to train and less likely to run into memory issues on consumer hardware.

**Recommended options:**
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0`
  *Very lightweight and ideal for proof-of-concept runs.*
- `microsoft/Phi-3-mini-4k-instruct`
  *About 3.8B parameters; may require batch_size=1 due to memory limits.*
- `Qwen/Qwen2-1.5B-Instruct`
  *A solid small instruction-tuned model.*

**Tip:** For best reliability on 16GB RAM, start with TinyLlama. You can always try larger models later if you need more capability and have enough memory.

> ⚠️ **Warning:** If you change `MODEL_ID` to a larger model, you may run out of memory on a Mac M1 (16GB). Always monitor your RAM usage and reduce batch size or sequence length if you encounter memory errors.

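A rough back-of-the-envelope estimate shows why model size matters on 16GB. The sketch below counts memory for the weights only, using the approximate parameter counts implied by the model names above. Activations, gradients, optimizer state, and the operating system all need additional memory, so treat these numbers as lower bounds.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Approximate parameter counts, taken from the model names and notes above
models = {
    "TinyLlama-1.1B": 1.1e9,
    "Qwen2-1.5B": 1.5e9,
    "Phi-3-mini (3.8B)": 3.8e9,
}

for name, n in models.items():
    fp16 = weight_memory_gb(n, 2)  # float16: 2 bytes per parameter
    fp32 = weight_memory_gb(n, 4)  # float32: 4 bytes per parameter
    print(f"{name}: ~{fp16:.1f} GB in float16, ~{fp32:.1f} GB in float32")
```

In float32, the 3.8B model's weights alone would take roughly 15 GB, which is why the smaller models are the safer starting point.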

In [None]:
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"


## **Load Tokenizer and Model**

In this step, you'll load the tokenizer and base model you selected. This also sets the padding token (to avoid warnings) and ensures the model runs efficiently on your Mac's Apple GPU (MPS).

**Key parameters:**
- `torch_dtype`: Set to `float16` to reduce memory usage on MPS (Apple Silicon).
- `device_map="auto"`: Lets Accelerate automatically place the model on the MPS device for best performance.

This setup helps you train larger models within your Mac's memory limits and speeds up computation using the GPU.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Choose the best dtype for your device: float16 for MPS (Apple Silicon), else float32
if torch.backends.mps.is_available():
    dtype = torch.float16
else:
    dtype = torch.float32

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# Set the padding token only if not already set (avoids warnings)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the model with the appropriate dtype and device placement
default_device_map = "auto"  # Lets Accelerate place the model on the best device
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=dtype,
    device_map=default_device_map,
)

# Optional: verify device placement
print("Tokenizer and model loaded successfully.")
print("Model device:", next(model.parameters()).device)

## **Configure LoRA (PEFT)**

Low-Rank Adaptation (LoRA) is a method to fine-tune language models with significantly fewer trainable parameters, making the process faster and requiring less memory. It achieves this by freezing the original model weights and adding trainable low-rank matrices.

In this step, you'll configure the LoRA settings for your model. These settings control how LoRA is applied during fine-tuning.

**Key parameters:**
- `r`: The rank of the low-rank matrices. Common values are 4, 8, 16, etc. Higher values allow the model to learn more complex adaptations but require more memory.
- `lora_alpha`: A scaling factor for the LoRA parameters; the low-rank update is scaled by `lora_alpha / r`. Typical values range from 16 to 32.
- `lora_dropout`: The dropout rate for the LoRA layers. Helps prevent overfitting. Common values range from 0.05 to 0.1.
- `bias`: Specifies how to handle bias terms. "none" means no bias adaptation. "all" adapts all bias terms (uses more memory). "lora_only" adapts only biases in LoRA layers. For most cases, "none" is recommended.
- `task_type`: Set to "CAUSAL_LM" for causal language modeling tasks (like GPT, Llama). Use other values for different tasks (see PEFT docs).
- `target_modules`: Specifies which model layers to apply LoRA to. Common choices for transformers are `q_proj`, `k_proj`, `v_proj`, and `o_proj`.

**What are `q_proj`, `k_proj`, `v_proj`, and `o_proj`?**
- These are the main linear projection layers inside the self-attention mechanism of transformer models.
    - `q_proj` (Query Projection): Projects input to the query vectors ("what to look for").
    - `k_proj` (Key Projection): Projects input to the key vectors ("what is available").
    - `v_proj` (Value Projection): Projects input to the value vectors ("what information to use").
    - `o_proj` (Output Projection): Projects the attended output back to the model's hidden size ("integrate attended info").
- Adapting `q_proj` and `v_proj` is often sufficient for most tasks. Including `k_proj` and `o_proj` gives more adaptation capacity but uses more memory and compute.

**Tip:** Start with the recommended settings and adjust only if you encounter memory issues or want to experiment with different ranks or modules.

> 💡 **Tip:** If you get an error about `target_modules` (e.g., a module not found), you can check available modules by inspecting your model's named modules:
>
> ```python
> for name, module in model.named_modules():
>     print(name)
> ```
> Then update `target_modules` accordingly.

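To get a feel for how `r` and `target_modules` affect model size, you can count the parameters LoRA adds by hand: each adapted linear layer gains an `r x d_in` matrix and a `d_out x r` matrix. The sketch below does this arithmetic for `r=16` on `q_proj` and `v_proj`. The layer shapes are assumed values for TinyLlama-1.1B, so check them against `model.config` and compare the result with `model.print_trainable_parameters()`.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed TinyLlama-1.1B shapes; verify against model.config before relying on them
hidden_size = 2048
num_layers = 22
kv_dim = 256  # grouped-query attention: 4 key/value heads x 64 dims per head
r = 16
lora_alpha = 32

# q_proj maps hidden_size -> hidden_size, v_proj maps hidden_size -> kv_dim
per_layer = lora_params(hidden_size, hidden_size, r) + lora_params(hidden_size, kv_dim, r)
total = per_layer * num_layers

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Scaling applied to the LoRA update (lora_alpha / r): {lora_alpha / r}")
```

Doubling `r` doubles the adapter size, and adding `k_proj` and `o_proj` adds two more terms to `per_layer`.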

In [None]:
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # expand if needed: "k_proj", "o_proj"
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

## **Training Configuration (TRL SFT)**

This section sets all the key training parameters ("trainer knobs") for supervised fine-tuning (SFT) using the TRL library. Each parameter is explained below, including recommended values and their implications:

**Parameter explanations:**
- `output_dir`: Directory where model checkpoints and logs will be saved.
- `num_train_epochs`: Number of times to iterate over the entire training dataset. More epochs can improve learning but may cause overfitting.
- `per_device_train_batch_size`: Number of samples per batch on each device. **Lower values** (e.g., 1) reduce memory usage (recommended for Mac M1). **Higher values** speed up training if you have more memory.
- `per_device_eval_batch_size`: Batch size for evaluation. Set to 1 for low memory environments.
- `gradient_accumulation_steps`: Number of steps to accumulate gradients before updating model weights. **Increase** this to simulate larger batch sizes without increasing memory usage.
- `learning_rate`: Step size for updating model weights. **Typical range:** 5e-5 to 2e-4 for LoRA on small models. **Higher values** speed up learning but may cause instability. **Lower values** are safer but slower.
- `warmup_ratio`: Fraction of total steps used for learning rate warmup. Helps stabilize early training.
- `lr_scheduler_type`: Learning rate schedule. "cosine" is common for smooth decay.
- `weight_decay`: Regularization to prevent overfitting. Typical values: 0.01–0.1.
- `logging_steps`: How often to log training metrics (in steps).
- `save_steps`: How often to save model checkpoints (in steps).
- `save_total_limit`: Maximum number of checkpoints to keep. Older ones are deleted.
- `max_length` (named `max_seq_length` in older TRL releases): Maximum sequence length, in tokens, for each training example. **SMS messages are short, so 256 is plenty.** Reduce to 128 if you hit memory issues.
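- `fp16`: Use 16-bit floating point mixed precision for training. **This guide sets it to True on MPS (Apple Silicon) for speed and memory savings.** If your installed PyTorch or Transformers version rejects `fp16` on MPS, set it to False.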
- `bf16`: Use bfloat16 precision. **Set False on MPS.**
- `report_to`: Where to log metrics. Set to "wandb" to use Weights & Biases.
- `eval_strategy`: When to run evaluation. "steps" means evaluate every `eval_steps` steps.
- `eval_steps`: How often to run evaluation (in steps).
- `dataset_text_field`: Name of the text field in your dataset (should match your data mapping).

**Tips:**
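- If you hit memory errors, reduce `max_length` (e.g., to 128) and keep `per_device_train_batch_size` at 1. Increasing `gradient_accumulation_steps` (e.g., to 16) does not lower memory use; it only makes the effective batch larger.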
- Monitor training and validation loss to avoid overfitting (reduce epochs or increase weight_decay if needed).
- Adjust `learning_rate` and `batch_size` based on your hardware and dataset size.

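The batch-related settings interact, so it is worth working out the schedule before training. The sketch below estimates the effective batch size, optimizer steps, and warmup steps for a single-device run. The training-set size of 800 is a hypothetical value, the other numbers mirror the configuration in this section, and the trainer's exact rounding may differ slightly.

```python
import math

def training_schedule(n_train, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Approximate optimizer-step counts for a single-device run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_train / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps

# n_train=800 is hypothetical; batch size 1, accumulation 8, 3 epochs, warmup 0.03
eff, per_epoch, total, warmup = training_schedule(800, 1, 8, 3, 0.03)
print(f"effective batch size: {eff}")
print(f"optimizer steps per epoch: {per_epoch}")
print(f"total optimizer steps: {total}")
print(f"warmup steps: {warmup}")
```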
In [None]:
from trl import SFTConfig

train_args = SFTConfig(
    output_dir="./mpesa-llm-mps",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    logging_steps=20,
    save_steps=200,
    save_total_limit=2,
    max_length=256,
    fp16=True,
    bf16=False,
    report_to="wandb",
    eval_strategy="steps",
    eval_steps=200,
    dataset_text_field="text",  # we mapped to this field
)


## **(Optional but Useful) Train-on-Answer Only**

This option configures the trainer to compute loss **only on the target/output portion** of each example (the JSON after `### Output:` in your prompt). This is especially useful for instruction-following or extraction tasks, as it:
- Focuses learning on the answer, not the prompt.
- Usually improves training stability and output quality.
- Reduces the risk of the model "learning" to copy the prompt or instruction.

**Parameter explanations:**
- `response_template`: The string that marks the start of the answer in your formatted data (must match your prompt formatting, e.g., `"### Output:\n"`).
- `tokenizer`: The tokenizer used for your model. Ensures the data collator can properly split prompt and answer.
- `DataCollatorForCompletionOnlyLM`: A special data collator from TRL that masks out the prompt tokens so loss is only computed on the answer tokens.

**Tips:**
- Make sure `response_template` matches exactly how you formatted your data (including newlines).
- This approach is recommended for most extraction, question-answering, and instruction-tuning tasks.

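> ⚠️ **Assertion:** The `assert_answer_tokens` helper below checks that each batch contains at least one answer token (a label other than `-100`) and raises an `AssertionError` if it finds none. If you see this error, check your `response_template` and data formatting. Some tokenizers encode the template differently in the middle of a text than on its own, which can also prevent the collator from finding it.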

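To make the masking idea concrete, here is a simplified, library-free illustration. It is not TRL's implementation: `mask_prompt` is a hypothetical helper and the token IDs are made up. It finds the response template inside the token sequence and sets every label up to and including the template to `-100`, so only the answer tokens contribute to the loss.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def find_subsequence(seq, sub):
    """Return the start index of the first occurrence of `sub` in `seq`, or -1."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_prompt(input_ids, template_ids):
    """Copy input_ids into labels, masking everything up to and including the template."""
    start = find_subsequence(input_ids, template_ids)
    if start == -1:
        # Template not found: there is no answer to train on, so mask the whole example
        return [IGNORE_INDEX] * len(input_ids)
    answer_start = start + len(template_ids)
    return [IGNORE_INDEX] * answer_start + input_ids[answer_start:]

# Made-up token IDs: prompt = [5, 6, 7], template = [90, 91], answer = [11, 12, 13]
input_ids = [5, 6, 7, 90, 91, 11, 12, 13]
labels = mask_prompt(input_ids, template_ids=[90, 91])
print(labels)
```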
In [None]:
# Exported from the top-level trl package in releases that include this collator
from trl import DataCollatorForCompletionOnlyLM

# This must match the delimiter used in formatting above
response_template = "### Output:\n"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    tokenizer=tokenizer,
)

def assert_answer_tokens(batch):
    # Check that at least one answer token is found in each batch
    labels = batch["labels"] if "labels" in batch else None
    if labels is not None:
        # -100 is the ignore index; answer tokens are not -100
        has_answer = (labels != -100).any().item() if hasattr(labels, 'any') else any(l != -100 for l in labels)
        assert has_answer, "No answer tokens found in batch! Check your response_template and data formatting."
    return batch


## **Fine‑Tune (SFTTrainer)**

This step runs supervised fine-tuning (SFT) with LoRA using the TRL SFTTrainer. Here’s what each parameter does and how to tune them:

**Parameter explanations:**
- `model`: The model to be fine-tuned. Should already have LoRA adapters applied.
- `processing_class` (named `tokenizer` in older TRL releases): The tokenizer for your model. Ensures correct tokenization and padding.
- `args`: The training configuration (SFTConfig) containing all training hyperparameters (see previous cell for details).
- `train_dataset`: The dataset used for training. Should be preprocessed and formatted as required.
- `eval_dataset`: The dataset used for evaluation/validation during training.
- `data_collator`: (Optional) Controls how batches are created. If using `DataCollatorForCompletionOnlyLM`, loss is computed only on the answer portion. If omitted, loss is computed on the full prompt+answer.
- `compute_metrics`: (Optional) A function to compute custom metrics during evaluation.
- `callbacks`: (Optional) List of callback functions for custom training/evaluation hooks.

**Tips:**
- Use `data_collator=collator` to focus loss on the answer only (recommended for extraction/instruction tasks).
- Monitor training and validation loss to check for overfitting or underfitting.
- Adjust `args` (batch size, learning rate, epochs, etc.) as needed for your hardware and dataset size.
- You can add custom callbacks or metrics for advanced monitoring or early stopping.

This setup ensures efficient, targeted fine-tuning of your LLM with LoRA on your MPESA SMS data.

In [None]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
    args=train_args,
    train_dataset=train_text,
    eval_dataset=val_text,
    data_collator=lambda batch: assert_answer_tokens(collator(batch)),  # check answer tokens
)

trainer.train()

## **Save, Push to Hub, and (Optionally) Merge LoRA**

This step covers how to save your fine-tuned adapters, push them to the Hugging Face Hub, and (optionally) merge LoRA weights into a standalone model for easier deployment.

**Parameter explanations and workflow:**
- `trainer.model.push_to_hub(adapter_repo, private=True)`: Uploads your LoRA adapters to your Hugging Face repository. Set `private=False` if you want the repo to be public.
- `trainer.save_model("./adapters")`: Saves the LoRA adapters locally for backup or offline use.
- `merge_and_unload()`: Merges the LoRA adapters into the base model weights, producing a single model file. This is useful for exporting a standalone checkpoint for inference on other platforms or stacks.
- `base.load_state_dict(model.state_dict(), strict=False)`: Loads the fine-tuned LoRA weights into the base model before merging.
- `merged.save_pretrained("./mpesa-merged")`: Saves the merged, standalone model locally.

**When should you merge?**
- **Merge** if you want a single checkpoint for deployment or inference outside the PEFT/LoRA ecosystem (e.g., for ONNX export, or use in other frameworks).
- **Do not merge** if you want to keep the adapters lightweight and flexible for further fine-tuning or experimentation. Keeping adapters separate is more memory-efficient and allows for easy swapping or stacking of adapters.

**Tips:**
- Always save both the adapters and (if needed) the merged model for maximum flexibility.
- Use descriptive repo names and local paths to keep track of your experiments.
- If you plan to share your model, make sure to push to a public repo or set `private=False`.

> 💡 **Tip:** Make sure you have write access to the Hugging Face repo before pushing. If you get a permissions error, check your repo settings and your HF token scopes.
>
> 💡 **Tip:** When merging LoRA weights, ensure the base model and LoRA config match exactly (same model architecture, LoRA rank, and target modules). Mismatches can cause errors or degraded performance.

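Merging works because a LoRA layer computes the base output plus a scaled low-rank update, and that update can be folded into the weight matrix: `W_merged = W + (lora_alpha / r) * B @ A`. The toy sketch below checks this identity with plain Python lists. All numbers and helper names are made up for illustration, and it shows only the arithmetic, not how PEFT implements `merge_and_unload()`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A, the merged weight matrix."""
    scale = lora_alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

# Toy 2x2 layer with rank r=1; all numbers are made up for illustration
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (d_out x d_in)
A = [[1.0, 2.0]]               # LoRA A (r x d_in)
B = [[0.5], [0.25]]            # LoRA B (d_out x r)
x = [[1.0], [1.0]]             # input column vector

merged = merge_lora(W, A, B, lora_alpha=2, r=1)

# Adapter path: base output plus the low-rank update scaled by lora_alpha / r = 2.0
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_path = [[b[0] + 2.0 * l[0]] for b, l in zip(base_out, lora_out)]

print(merged)
print(matmul(merged, x) == adapter_path)
```

Because the merged matrix reproduces the adapter path exactly, the merged model needs no PEFT code at inference time.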

In [None]:
# Push adapters to your HF repo
adapter_repo = "your-username/mpesa-tinyllama-lora"
trainer.model.push_to_hub(adapter_repo, private=True)

# Save locally too
trainer.save_model("./adapters")

# (Optional) Merge LoRA into base weights for easier inference export
from peft import get_peft_model
from transformers import AutoModelForCausalLM

# Reload base model, apply adapters, then merge
base = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=dtype, device_map="auto")
base = get_peft_model(base, lora_cfg)
base.load_state_dict(model.state_dict(), strict=False)
merged = base.merge_and_unload()
merged.save_pretrained("./mpesa-merged")

## **Quick Sanity‑Check Inference**

This step runs a quick inference to verify that your fine-tuned model produces the expected JSON output for a sample MPESA SMS. This is a practical way to check if your model is working as intended before deploying or running a full evaluation.

**Parameter explanations:**
- `pipeline`: The Hugging Face pipeline for text generation. Handles tokenization, model inference, and decoding.
- `model`: The trained model to use for inference. Here, it's the LoRA‑wrapped model from training.
- `tokenizer`: The tokenizer used for your model. Ensures input is properly tokenized and output is decoded.
- `device_map`: Controls which device(s) to use for inference. "auto" lets Accelerate pick the best device (MPS/CPU/GPU).
- `torch_dtype`: Data type for inference. Use `float16` on MPS for speed and memory savings.
- `sample`: The formatted prompt for inference. Should match your training prompt structure.
- `max_new_tokens`: Maximum number of tokens to generate in the output. Increase if your expected JSON is long.
- `do_sample`: Whether to use sampling (randomness) in generation. `False` means deterministic output (recommended for extraction tasks).

**Tips:**
- Always use a prompt that matches your training format for best results.
- If the output is truncated, increase `max_new_tokens`.
- For more robust evaluation, try several real SMS examples and compare outputs to expected JSON.
- If you see hallucinated or incomplete outputs, consider further fine-tuning or prompt engineering.

This quick check gives you confidence that your model is extracting information as expected before moving to production or sharing results.

> ⚠️ **Sanity-Check Output:** The code below checks the output length and warns if the output is empty or likely truncated. Adjust `max_new_tokens` if needed.

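Since the expected output is JSON, a stricter check than reading the text by eye is to parse it. The sketch below is one possible approach: `extract_json` is a hypothetical helper that parses the first JSON object in the generated text and returns `None` when the output is missing or cut off. The sample outputs and field names are made up.

```python
import json

def extract_json(generated: str):
    """Parse the first JSON object in `generated`; return None if there is none."""
    start = generated.find("{")
    if start == -1:
        return None
    decoder = json.JSONDecoder()
    try:
        # raw_decode parses one JSON value and ignores any text that follows it
        obj, _end = decoder.raw_decode(generated[start:])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

# Hypothetical model outputs; the field names are illustrative only
good = '{"type": "received", "amount": 500}\n### Task: ...'
truncated = '{"type": "received", "amo'

print(extract_json(good))
print(extract_json(truncated))
```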
In [None]:
from transformers import pipeline

gen = pipeline(
    "text-generation",
    model=trainer.model,  # uses LoRA‑wrapped model
    tokenizer=tokenizer,
    device_map="auto",
    torch_dtype=dtype,
)

sample = (
    "### Task: Extract transaction details from the SMS\n"
    "### Input:\nQFC3D45G7 Confirmed. You have received Ksh500 from Customer_1 XXXXXXX on 12/08/24 at 11:32 AM. New M-PESA balance is Ksh1,250.\n"
    "### Output:\n"
)

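MAX_NEW_TOKENS = 150
out = gen(sample, max_new_tokens=MAX_NEW_TOKENS, do_sample=False)
output_text = out[0]["generated_text"][len(sample):]
print(output_text)

# Approximate the number of generated tokens by re-tokenizing the output;
# the limit is counted in tokens, so comparing characters would be misleading
n_generated = len(tokenizer(output_text, add_special_tokens=False)["input_ids"])

if not output_text.strip():
    print("⚠️ Warning: Output is empty. Check your model, prompt, or try increasing max_new_tokens.")
elif n_generated >= MAX_NEW_TOKENS - 5:
    print("⚠️ Warning: Output may be truncated. Try increasing max_new_tokens for longer outputs.")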
