> 🌟 **UPDATED BY [Muhammed Shah](https://muhammedshah.com) to support running this on Apple Silicon** | **ORIGINAL AUTHOR OF THIS NOTEBOOK:** [Adithya SK](https://adithyask.com)

## Instruct Fine-tuning [Gemma](https://blog.google/technology/developers/gemma-open-models/) using QLoRA and Supervised Fine-tuning (MPS - Apple Silicon)

This is a comprehensive notebook and tutorial on how to fine-tune the `gemma-7b-it` model on Apple Silicon using MPS (Metal Performance Shaders), the Apple equivalent of NVIDIA CUDA.

### MPS vs CUDA: Key Differences for Apple Silicon

**MPS (Metal Performance Shaders)** is Apple's equivalent to NVIDIA's CUDA for GPU acceleration on Apple Silicon Macs. Here are the key differences when running this notebook:

#### Hardware Requirements:
- **CUDA Version**: Requires NVIDIA GPU (RTX 3090, A100, etc.)
- **MPS Version**: Requires Apple Silicon Mac (M1, M2, M3, M4) with macOS 12.3+

#### Memory Considerations:
- **CUDA**: Separate GPU VRAM (e.g., 24GB on RTX 3090)
- **MPS**: Unified memory shared between CPU and GPU (16GB, 32GB, 64GB+ recommended)

#### Quantization Support:
- **CUDA**: Full support for 4-bit and 8-bit quantization with BitsAndBytes
- **MPS**: Limited quantization support; may fall back to full precision or 8-bit

#### Performance Notes:
- **CUDA**: Generally faster for training large models
- **MPS**: Excellent performance with lower power consumption, especially for inference

This notebook has been adapted to automatically detect and use the appropriate backend for your hardware.
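
The backend detection mentioned above can be sketched as a small priority check. This is a minimal sketch rather than the notebook's exact logic: the helper name `select_backend` is hypothetical, and preferring CUDA over MPS when both are reported is an assumption.

```python
def select_backend(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string given which backends are usable."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch  # optional here; the helper itself has no dependencies

    device = select_backend(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
except (ImportError, AttributeError):
    # PyTorch is missing or too old to expose the MPS backend: fall back to CPU.
    device = select_backend(False, False)

print(f"Selected device: {device}")
```

Keeping the decision in a pure function makes it easy to check without any particular hardware attached.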

----------

## Prerequisites

Before delving into the fine-tuning process, ensure that you have the following prerequisites in place:

1. **Apple Silicon Mac**: This notebook is designed to run on Apple Silicon Macs (M1, M2, M3, M4) using MPS (Metal Performance Shaders) - Apple's equivalent to NVIDIA CUDA. [gemma-2b](https://huggingface.co/google/gemma-2b) can be fine-tuned on most Apple Silicon Macs with 16GB+ unified memory, while [gemma-7b](https://huggingface.co/google/gemma-7b) requires Macs with 32GB+ unified memory.
2. **Python Packages**: Ensure that you have the necessary Python packages installed. The pinned `pip3 install` commands are in the first code cell of Step 2.

Let's begin by checking if MPS is available on your Apple Silicon Mac:

In [3]:
# Check if MPS (Metal Performance Shaders) is available - Apple Silicon equivalent of CUDA
import torch
import platform

print(f"PyTorch version: {torch.__version__}")
print(f"System: {platform.system()} {platform.machine()}")
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"MPS built: {torch.backends.mps.is_built()}")

if torch.backends.mps.is_available():
    print("✅ MPS is available! You can use Apple Silicon GPU acceleration.")
    device = torch.device("mps")
    print(f"Using device: {device}")
else:
    print("❌ MPS not available. Using CPU.")
    device = torch.device("cpu")

PyTorch version: 2.7.1
System: Darwin arm64
MPS available: True
MPS built: True
✅ MPS is available! You can use Apple Silicon GPU acceleration.
Using device: mps


## Step 2 - Model loading
We'll load the model using QLoRA quantization to reduce memory usage. On Apple Silicon MPS, we'll use the most compatible quantization method or fall back to full precision if needed.

In [4]:
# Install required packages - MPS (Apple Silicon) compatible versions
!pip3 install -q -U bitsandbytes==0.42.0  # May have limited MPS support
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.0

# For Apple Silicon, ensure you have the latest PyTorch with MPS support
# If you encounter issues, you may need to install PyTorch nightly:
# !pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu

print("✅ Packages installed. Note: Some quantization features may have limited support on MPS (Apple Silicon)")

In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Check device availability
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")

# Note: 4-bit quantization with BitsAndBytes may have limited support on MPS
# For Apple Silicon, we'll use a more compatible configuration
if device == "mps":
    # For MPS (Apple Silicon), use 8-bit quantization or no quantization
    # 4-bit quantization support on MPS is experimental
    try:
        bnb_config = BitsAndBytesConfig(
            load_in_8bit=True,  # Use 8-bit instead of 4-bit for better MPS compatibility
        )
        print("Using 8-bit quantization for MPS compatibility")
    except Exception:
        # Fallback: no quantization on MPS if BitsAndBytes doesn't support it
        bnb_config = None
        print("BitsAndBytes quantization not supported on MPS, using full precision")
else:
    # Original CUDA configuration for reference
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )


Using device: mps
Using 8-bit quantization for MPS compatibility





Now we specify the model ID and then we load it with our previously defined quantization configuration.

In [4]:
# if you are using google colab

# import os
# from google.colab import userdata
# os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

In [None]:
# # Method 1: if you are using Jupyter Notebook or any other IDE
# !pip3 install ipywidgets

# from huggingface_hub import notebook_login
# notebook_login()


In [17]:
# Method 2 (If method 1 doesn't work):

# Run the command below if it's the first time you are running this notebook
# !pip3 install python-dotenv

import os
from dotenv import load_dotenv
from huggingface_hub import login

# Load environment variables from .env file
load_dotenv()

# Get the token from environment variables
hf_token = os.getenv('HUGGINGFACE_HUB_TOKEN')

if hf_token:
    login(token=hf_token)
    print("Successfully logged in to Hugging Face!")
else:
    print("Warning: HUGGINGFACE_HUB_TOKEN not found in environment variables")

Successfully logged in to Hugging Face!


In [None]:
# model_id = "google/gemma-7b-it"
# # model_id = "google/gemma-7b"
# # model_id = "google/gemma-2b-it"
# # model_id = "google/gemma-2b"

# # Load model with MPS-compatible settings
# if device == "mps" and bnb_config is not None:
#     # Load with quantization if supported
#     model = AutoModelForCausalLM.from_pretrained(
#         model_id, 
#         quantization_config=bnb_config, 
#         device_map="auto",  # Let transformers handle device mapping
#         torch_dtype=torch.float16  # Use float16 for memory efficiency on Apple Silicon
#     )
# elif device == "mps":
#     # Load without quantization for MPS
#     model = AutoModelForCausalLM.from_pretrained(
#         model_id,
#         torch_dtype=torch.float16,  # Use float16 for memory efficiency
#         device_map="auto"
#     )
# else:
#     # Original CUDA configuration
#     model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

# tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

# print(f"Model loaded on device: {next(model.parameters()).device}")

In [20]:
import os
from dotenv import load_dotenv
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load environment variables and login
load_dotenv()
hf_token = os.getenv('HUGGINGFACE_HUB_TOKEN')

if hf_token:
    login(token=hf_token)
    print("✅ Logged in to Hugging Face")
else:
    raise ValueError("❌ HUGGINGFACE_HUB_TOKEN not found in .env file")

# Model configuration
# model_id = "google/gemma-7b-it"
model_id = "google/gemma-2-2b-it"

try:
    print(f"🔄 Loading model: {model_id}")
    print("This may take several minutes for first-time download...")
    
    # Load model with MPS-compatible settings
    if device == "mps" and bnb_config is not None:
        model = AutoModelForCausalLM.from_pretrained(
            model_id, 
            quantization_config=bnb_config, 
            device_map="auto",
            torch_dtype=torch.float16,
            token=hf_token  # Explicit token passing
        )
    elif device == "mps":
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            device_map="auto",
            token=hf_token  # Explicit token passing
        )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_id, 
            quantization_config=bnb_config, 
            device_map={"":0},
            token=hf_token  # Explicit token passing
        )
    
    tokenizer = AutoTokenizer.from_pretrained(
        model_id, 
        add_eos_token=True,
        token=hf_token  # Explicit token passing
    )
    
    print(f"✅ Model loaded on device: {next(model.parameters()).device}")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("\nTroubleshooting steps:")
    print("1. Make sure you accepted the Gemma license at: https://huggingface.co/google/gemma-7b-it")
    print("2. Check your internet connection")
    print("3. Verify your Hugging Face token has the right permissions")

✅ Logged in to Hugging Face
🔄 Loading model: google/gemma-2-2b-it
This may take several minutes for first-time download...
❌ Error loading model: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like google/gemma-2-2b-it is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Troubleshooting steps:
1. Make sure you accepted the Gemma license at: https://huggingface.co/google/gemma-7b-it
2. Check your internet connection
3. Verify your Hugging Face token has the right permissions


In [19]:
# To check cache location, where our models are downloaded

from huggingface_hub import snapshot_download
import os

# Show cache directory
cache_dir = os.path.expanduser("~/.cache/huggingface/hub/")
print(f"Models are cached in: {cache_dir}")

# Check available space (models can be several GB)
import shutil
free_space = shutil.disk_usage(cache_dir).free / (1024**3)  # GB
print(f"Available space: {free_space:.1f} GB")

Models are cached in: /Users/muzzmac/.cache/huggingface/hub/
Available space: 386.3 GB
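
The path above is the default location. As a minimal sketch, the cache directory can also be resolved from the `HF_HUB_CACHE` and `HF_HOME` environment variables, which the Hugging Face libraries honor. The helper name `resolve_hf_hub_cache` is hypothetical, and this simplified version ignores other overrides such as `XDG_CACHE_HOME`.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_hf_hub_cache(env: Mapping[str, str] = os.environ) -> Path:
    """Resolve the Hub cache directory, honoring HF_HUB_CACHE and HF_HOME."""
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # HF_HOME defaults to ~/.cache/huggingface; the Hub cache lives in its "hub" subfolder.
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return (Path(hf_home) / "hub").expanduser()


print(f"Models are cached in: {resolve_hf_hub_cache()}")
```

Pointing `HF_HOME` at an external drive is one way to keep multi-gigabyte checkpoints off a small internal disk.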


In [None]:
def get_completion(query: str, model, tokenizer) -> str:
    # Use MPS if available, otherwise CPU - Apple Silicon equivalent of "cuda:0"
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    
    prompt_template = """
    <start_of_turn>user
    Below is an instruction that describes a task. Write a response that appropriately completes the request.
    {query}
    <end_of_turn>\n<start_of_turn>model
    

    """
    prompt = prompt_template.format(query=query)

    encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

    model_inputs = encodeds.to(device)

    generated_ids = model.generate(
        **model_inputs, 
        max_new_tokens=1000, 
        do_sample=True, 
        pad_token_id=tokenizer.eos_token_id
    )
    decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return decoded

In [None]:
result = get_completion(query="code the fibonacci series in python using recursion", model=model, tokenizer=tokenizer)
print(result)

## Step 3 - Load dataset for finetuning

### Let's Load the Dataset

For this tutorial, we will fine-tune Gemma for code generation.

We will be using this [dataset](https://huggingface.co/datasets/TokenBender/code_instructions_122k_alpaca_style) which is curated by [TokenBender (e/xperiments)](https://twitter.com/4evaBehindSOTA) and is an excellent data source for fine-tuning models for code generation. It follows the alpaca style of instructions, which is an excellent starting point for this task. The dataset structure should resemble the following:

```json
{
  "instruction": "Create a function to calculate the sum of a sequence of integers.",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum"
}
```

In [None]:
from datasets import load_dataset

dataset = load_dataset("TokenBender/code_instructions_122k_alpaca_style", split="train")
dataset

In [None]:
df = dataset.to_pandas()
df.head(10)

Instruction fine-tuning: prepare the dataset in a "prompt" format that the model can better understand:
1. Use the `generate_prompt` function to build a prompt from the instruction and output
2. Shuffle the dataset
3. Tokenize the dataset

### Formatting the Dataset

Now, let's format the dataset in the required [Gemma instruction format](https://huggingface.co/google/gemma-7b-it).

> Many tutorials and blogs skip over this part, but I feel this is a really important step.

```
<start_of_turn>user What is your favorite condiment? <end_of_turn>
<start_of_turn>model Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavor to whatever I'm cooking up in the kitchen!<end_of_turn>
```
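
For comparison, the chat template on the Gemma model card places a newline after each role name rather than a space. The sketch below renders messages that way. The helper name `format_gemma_turns` and the shortened reply are illustrative, and the leading `<bos>` token is omitted on the assumption that the tokenizer adds it.

```python
from typing import Dict, List


def format_gemma_turns(messages: List[Dict[str, str]]) -> str:
    """Render chat messages as Gemma turns with a newline after each role."""
    parts = []
    for message in messages:
        # Gemma names the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    return "".join(parts)


example = format_gemma_turns([
    {"role": "user", "content": "What is your favorite condiment?"},
    {"role": "assistant", "content": "Fresh lemon juice."},
])
print(example)
```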

You can use the following code to process your dataset and add a `prompt` column in the correct format:

In [16]:
def generate_prompt(data_point):
    """Generate the training text from the task instruction, optional input context, and answer.

    :param data_point: dict: Data point with "instruction", "input", and "output" keys
    :return: str: formatted prompt
    """
    prefix_text = 'Below is an instruction that describes a task. Write a response that ' \
               'appropriately completes the request.\n\n'
    # Samples with additional context info.
    if data_point['input']:
        text = f"""<start_of_turn>user {prefix_text} {data_point["instruction"]} here are the inputs {data_point["input"]} <end_of_turn>\n<start_of_turn>model {data_point["output"]} <end_of_turn>"""
    # Samples without additional context.
    else:
        text = f"""<start_of_turn>user {prefix_text} {data_point["instruction"]} <end_of_turn>\n<start_of_turn>model {data_point["output"]} <end_of_turn>"""
    return text

# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)

We'll need to tokenize our data so the model can understand.


In [None]:
dataset = dataset.shuffle(seed=1234)  # Shuffle dataset here
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

Split dataset into 80% for training and 20% for testing

In [18]:
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]

### After formatting, we should get something like this

```json
{
"text":"<start_of_turn>user Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] <end_of_turn>
<start_of_turn>model # Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum <end_of_turn>",
"instruction":"Create a function to calculate the sum of a sequence of integers",
"input":"[1, 2, 3, 4, 5]",
"output":"# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum",
"prompt":"<start_of_turn>user Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] <end_of_turn>
<start_of_turn>model # Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum <end_of_turn>"

}
```

While using SFT (**[Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer)**) for fine-tuning, we will pass in only the `prompt` column of the dataset.

In [None]:
print(test_data)

## Step 4 - Apply LoRA  
Here comes the magic with PEFT! Let's prepare the model with the `prepare_model_for_kbit_training` method and then attach low-rank adapters (LoRA) using the `get_peft_model` utility function from PEFT.

In [None]:
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model

# Enable gradient checkpointing for memory efficiency on Apple Silicon
model.gradient_checkpointing_enable()

# Prepare model for training - works on both MPS and CUDA
if bnb_config is not None:
    # If using quantization
    model = prepare_model_for_kbit_training(model)
    print("Model prepared for quantized training (Apple Silicon MPS compatible)")
else:
    # If not using quantization, ensure model is in training mode
    model.train()
    print("Model prepared for full-precision training on MPS")

In [None]:
print(model)

In [None]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    # For MPS (Apple Silicon), we need to handle different layer types
    if device == "mps" and bnb_config is not None:
        # If using 8-bit quantization on MPS
        cls = bnb.nn.Linear8bitLt
    elif bnb_config is not None:
        # Original CUDA 4-bit quantization
        cls = bnb.nn.Linear4bit
    else:
        # No quantization - use standard PyTorch Linear layers (MPS compatible)
        cls = torch.nn.Linear
    
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    # lm_head is excluded from the LoRA targets (needed for 16-bit)
    lora_module_names.discard('lm_head')
    
    print(f"Targeting layer type: {cls} (MPS compatible: {device == 'mps'})")
    return list(lora_module_names)

In [None]:
modules = find_all_linear_names(model)
print(modules)

In [26]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

In [None]:
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")
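
To see why the trainable percentage is small, it helps to count what one adapter adds. LoRA keeps a weight matrix frozen and learns two low-rank factors, so a layer with `d_in` inputs and `d_out` outputs gains `r * (d_in + d_out)` parameters, and the update is scaled by `lora_alpha / r`. The sketch below is illustrative arithmetic only: the 2048 x 2048 layer size is a hypothetical example, not a dimension read from the model.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r


# Hypothetical 2048 x 2048 projection, with the r and lora_alpha values used above.
added = lora_trainable_params(d_in=2048, d_out=2048, r=64)
full = 2048 * 2048
print(f"Adapter params: {added} ({added / full:.2%} of the frozen layer)")
print(f"Scaling: {lora_scaling(lora_alpha=32, r=64)}")
```

With `r=64` and `lora_alpha=32` the scaling factor is 0.5, so raising `r` without raising `lora_alpha` shrinks the relative weight of the update.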

## Step 5 - Run the training!

Setting the training arguments:
* For demo purposes, training runs for only a few steps (`max_steps=100`), just to show how this integrates with existing tools in the HF ecosystem.

In [None]:
# import transformers

# tokenizer.pad_token = tokenizer.eos_token


# trainer = transformers.Trainer(
#     model=model,
#     train_dataset=train_data,
#     eval_dataset=test_data,
#     args=transformers.TrainingArguments(
#         per_device_train_batch_size=1,
#         gradient_accumulation_steps=4,
#         warmup_steps=0.03,
#         max_steps=100,
#         learning_rate=2e-4,
#         fp16=True,
#         logging_steps=1,
#         output_dir="outputs_mistral_b_finance_finetuned_test",
#         optim="paged_adamw_8bit",
#         save_strategy="epoch",
#     ),
#     data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
# )


### Fine-Tuning with QLoRA and Supervised Fine-Tuning

We're ready to fine-tune our model using QLoRA. For this tutorial, we'll use the `SFTTrainer` from the `trl` library for supervised fine-tuning. Ensure that you've installed the `trl` library as mentioned in the prerequisites.

In [None]:
# Fine-tuning with qLora and Supervised Fine-Tuning on Apple Silicon MPS
import transformers
from trl import SFTTrainer

tokenizer.pad_token = tokenizer.eos_token

# Clear cache - MPS equivalent of torch.cuda.empty_cache()
if device == "mps":
    torch.mps.empty_cache()
    print("MPS cache cleared (Apple Silicon equivalent of CUDA cache clear)")
else:
    torch.cuda.empty_cache()

# Configure training arguments for MPS (Apple Silicon)
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,  # Keep small for memory efficiency on Apple Silicon
    gradient_accumulation_steps=4,
    warmup_ratio=0.03,  # fraction of max_steps used for warmup (warmup_steps expects an integer step count)
    max_steps=100,
    learning_rate=2e-4,
    # Note: fp16 may not be fully supported on MPS, use bf16 if available
    fp16=False if device == "mps" else True,
    bf16=True if device == "mps" and torch.backends.mps.is_available() else False,
    logging_steps=1,
    output_dir="outputs",
    # Use standard AdamW for MPS compatibility instead of paged_adamw_8bit
    optim="adamw_torch" if device == "mps" else "paged_adamw_8bit",
    save_strategy="epoch",
    dataloader_pin_memory=False,  # Disable pin memory for MPS
    dataloader_num_workers=0,     # Use single worker for MPS stability
)

print(f"Training configured for device: {device}")
print(f"Using optimizer: {training_args.optim}")
print(f"Using precision: {'bf16' if training_args.bf16 else 'fp16' if training_args.fp16 else 'fp32'}")

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field="prompt",
    peft_config=lora_config,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
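
The training arguments above imply a small overall budget, which is worth working out before starting a run. The sketch below is plain arithmetic for a single device. The helper name `training_budget` is hypothetical, and treating the value 0.03 as a warmup ratio is an assumption about the intent.

```python
import math


def training_budget(batch_size: int, grad_accum: int, max_steps: int,
                    warmup_ratio: float = 0.0) -> dict:
    """Summarize how many examples and warmup steps a run implies."""
    effective_batch = batch_size * grad_accum
    return {
        "effective_batch_size": effective_batch,
        "examples_seen": effective_batch * max_steps,
        "warmup_steps": math.ceil(max_steps * warmup_ratio),
    }


budget = training_budget(batch_size=1, grad_accum=4, max_steps=100, warmup_ratio=0.03)
print(budget)
```

With these settings the model sees 400 examples in total, a small fraction of the dataset, which matches the demo-only intent of the run.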

## Let's start training

In [None]:
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

## Share adapters on the 🤗 Hub

In [30]:
new_model = "gemma-Code-Instruct-Finetune-test" #Name of the model you will be pushing to huggingface model hub

In [31]:
trainer.model.save_pretrained(new_model)

In [None]:
# Load base model for merging - MPS compatible
if device == "mps":
    base_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        low_cpu_mem_usage=True,
        return_dict=True,
        torch_dtype=torch.float16,  # Use float16 for Apple Silicon
        device_map="auto",  # Let transformers handle MPS device mapping
    )
else:
    # Original CUDA configuration
    base_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        low_cpu_mem_usage=True,
        return_dict=True,
        torch_dtype=torch.float16,
        device_map={"": 0},
    )

# Merge the LoRA adapter with base model (works on both MPS and CUDA)
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(f"Model merged and saved successfully on {device} (Apple Silicon MPS compatible)")

In [None]:
# Push the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

## Test out Finetuned Model

In [None]:
result = get_completion(query="code the fibonacci series in python using recursion", model=merged_model, tokenizer=tokenizer)
print(result)

--------

## MPS Troubleshooting for Apple Silicon

If you encounter issues while running on Apple Silicon, try these solutions:

### Common MPS Issues and Solutions:

1. **Memory Issues**: 
   - Reduce `per_device_train_batch_size` to 1
   - Increase `gradient_accumulation_steps` to maintain effective batch size
   - Use smaller model variant (gemma-2b instead of gemma-7b)

2. **Quantization Errors**:
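   - The `try`/`except` in the quantization cell only guards the construction of `BitsAndBytesConfig`; an error raised later, while loading the model, is not caught by that fallback
   - If model loading fails with a BitsAndBytes error, disable quantization manually by setting `bnb_config = None` and rerun the loading cell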

3. **Training Instability**:
   - MPS sometimes works better with `bf16=True` instead of `fp16=True`
   - Try reducing learning rate to `1e-4` if training is unstable

4. **Performance Optimization**:
   - Ensure you're using the latest PyTorch version with MPS support
   - Close other memory-intensive applications during training
   - Use `torch.mps.empty_cache()` to clear MPS memory between operations
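
The cache-clearing advice above can be wrapped in one helper that is safe to call on any backend. This is a minimal sketch: the function name `free_accelerator_memory` is hypothetical, and it assumes a PyTorch version that provides `torch.mps.empty_cache()` (2.0 or later).

```python
import gc


def free_accelerator_memory() -> str:
    """Run garbage collection, then release cached accelerator memory if any."""
    gc.collect()  # drop unreachable Python objects that still hold tensors
    try:
        import torch
    except ImportError:
        return "none"  # PyTorch not installed; nothing else to release
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()
        return "mps"
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return "cuda"
    return "cpu"


print(f"Cleared cache for backend: {free_accelerator_memory()}")
```

Calling it after deleting a model reference (for example, `del model`) gives the allocator a chance to return memory before the next large load.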

### Memory Usage Guidelines:
- **16GB Unified Memory**: Use gemma-2b model
- **32GB+ Unified Memory**: Can handle gemma-7b model
- **64GB+ Unified Memory**: Comfortable training with larger batch sizes
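
A rough way to sanity-check these guidelines is to estimate the memory taken by the weights alone. The sketch below is back-of-the-envelope arithmetic: the parameter counts are rounded assumptions, and activations, gradients, optimizer state, and the rest of the system all need memory on top of these figures.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Approximate parameter counts (assumptions, rounded): gemma-2b ~2.5e9, gemma-7b ~8.5e9.
for name, params in [("gemma-2b", 2.5e9), ("gemma-7b", 8.5e9)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: {weight_memory_gib(params, bits):.1f} GiB")
```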