# Chapter 7: Fine-Tuning & Adaptation - LoRA Demo

## Making Models Your Own with Parameter-Efficient Fine-Tuning

**Pocket Agents: A Practical Guide to On‑Device Artificial Intelligence**

---

## What You'll Learn

- **LoRA (Low-Rank Adaptation)** for memory-efficient fine-tuning
- **Parameter efficiency** - training <1% of model parameters
- **Domain adaptation** - teaching models specific knowledge
- **Before/after comparison** - seeing real improvement

## Why This Works

We're using **TinyLlama-1.1B-Chat**, which is:
- Already instruction-tuned (understands user/assistant format)
- Small enough for quick training (1.1B parameters)
- Works well on consumer hardware (MPS/CPU)
- Shows clear improvement with minimal training

**Expected runtime**: 3-5 minutes total


## Step 1: Environment Setup


In [1]:
# Environment variables
import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['WANDB_DISABLED'] = 'true'
os.environ['HF_HUB_DISABLE_PROGRESS_BARS'] = '1'

print("✅ Environment variables set")
print(f"   TOKENIZERS_PARALLELISM: {os.environ.get('TOKENIZERS_PARALLELISM')}")
print(f"   WANDB_DISABLED: {os.environ.get('WANDB_DISABLED')}")
print(f"   HF_HUB_DISABLE_PROGRESS_BARS: {os.environ.get('HF_HUB_DISABLE_PROGRESS_BARS')}")

# Import required libraries
print("\n🔧 Importing libraries...")
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset
import torch

print("✅ All imports successful!")
print(f"   PyTorch: {torch.__version__}")


✅ Environment variables set
   TOKENIZERS_PARALLELISM: false
   WANDB_DISABLED: true
   HF_HUB_DISABLE_PROGRESS_BARS: 1

🔧 Importing libraries...
✅ All imports successful!
   PyTorch: 2.9.0


## Step 2: Load TinyLlama-1.1B-Chat (Base Model)


In [2]:
print("Loading TinyLlama-1.1B-Chat...")
print("   This may take a few minutes on first run (downloading ~2GB model)")

model_name = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set pad token if missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Pad token set")

# Move to device (MPS/CPU detection)
if torch.backends.mps.is_available():
    device = torch.device('mps')
    print("Using MPS (Apple Silicon GPU)")
else:
    device = torch.device('cpu')
    print("Using CPU")

model.to(device)
print(f"Model loaded and moved to: {device}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")


Loading TinyLlama-1.1B-Chat...
   This may take a few minutes on first run (downloading ~2GB model)
Using MPS (Apple Silicon GPU)
Model loaded and moved to: mps
Model parameters: 1,100,048,384


## Step 3: Test Base Model (Before Training)

Let's see how the base model responds to our test prompts **before** fine-tuning.


In [3]:
def generate_response(model, tokenizer, prompt, max_new_tokens=100):
    """Generate a response from the model for one chat-formatted prompt"""
    # Tokenize to input_ids plus attention_mask and move both to the device
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # Limit on generated tokens only
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,  # Sampling is stochastic: outputs differ between runs
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    # Decode only the newly generated tokens, so the result does not depend
    # on the decoded prompt matching the original string character by character
    new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Test prompts using TinyLlama's chat format
test_prompts = [
    '<|user|>\nWhat is machine learning?<|assistant|>\n',
    '<|user|>\nExplain Python in simple terms<|assistant|>\n',
    '<|user|>\nWhat are the benefits of renewable energy?<|assistant|>\n'
]

print("Testing BASE MODEL (before training):")
print("=" * 70)

base_responses = []
for i, prompt in enumerate(test_prompts, 1):
    response = generate_response(model, tokenizer, prompt)
    base_responses.append(response)
    print(f"\nTest {i}:")
    print(f"Prompt: {prompt.replace('<|user|>', '').replace('<|assistant|>', '').strip()}")
    print(f"Base Response: {response}")
    print("-" * 70)

print("\nBase model testing complete!")
print("Responses saved for later comparison")


Testing BASE MODEL (before training):

Test 1:
Prompt: What is machine learning?
Base Response: Machine learning is a branch of Artificial Intelligence (AI) that involves the use of algorithms and computational models to learn from data without being explicitly programmed. It has revolutionized many aspects of human society, from web search to self-driving cars. Here are some key concepts of machine learning:

1. Learning:machine learning involves the process of identifying patterns in data and using these patterns to make predictions or decisions.

2. Algorithms:machine learning
----------------------------------------------------------------------

Test 2:
Prompt: Explain Python in simple terms
Base Response: Python is a general-purpose programming language that is easy to learn and understand. It's a lightweight and efficient language that's commonly used for web development, machine learning, data analysis, and more. Here are some Python in simple terms:

1. High-Level: Python is a
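
A note on the prompt format: the test prompts above put `<|assistant|>` directly after the user message. As far as we can tell, the chat template published with TinyLlama-1.1B-Chat also closes each turn with the end-of-sequence string `</s>`. The sketch below builds a prompt that way. Treat the exact template as an assumption, not something this notebook verifies; `build_prompt` and `strip_chat_markers` are hypothetical helpers. With a loaded tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is the safer route because it reads the template from the tokenizer itself.

```python
EOS = "</s>"  # Assumed end-of-sequence string for the TinyLlama tokenizer


def build_prompt(user_message, eos=EOS):
    """Hypothetical helper: one user turn followed by the assistant cue."""
    return f"<|user|>\n{user_message}{eos}\n<|assistant|>\n"


def strip_chat_markers(prompt, eos=EOS):
    """Remove role markers and EOS so the prompt prints cleanly."""
    for marker in ("<|user|>", "<|assistant|>", eos):
        prompt = prompt.replace(marker, "")
    return prompt.strip()


prompt = build_prompt("What is machine learning?")
print(repr(prompt))
print(strip_chat_markers(prompt))
```

If you switch formats, use the same one for the training examples and the test prompts so the model sees consistent input.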

## Step 4: Apply LoRA Configuration

Now we'll add LoRA adapters to the model. LoRA trains only a small fraction of the parameters (about 0.4% with this configuration) while keeping the base model frozen.


In [4]:
print("Creating LoRA configuration...")

lora_config = LoraConfig(
    r=16,  # Rank of the low-rank matrices
    lora_alpha=32,  # Scaling factor (typically 2x the rank)
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],  # TinyLlama attention modules
    lora_dropout=0.1,  # Regularization to prevent overfitting
    bias='none',
    task_type=TaskType.CAUSAL_LM,
)

print("LoRA configuration created:")
print(f"   Rank: {lora_config.r}")
print(f"   Alpha: {lora_config.lora_alpha}")
print(f"   Target modules: {lora_config.target_modules}")
print(f"   Dropout: {lora_config.lora_dropout}")

print("\nApplying LoRA to model...")
lora_model = get_peft_model(model, lora_config)

print("\nLoRA applied successfully!")
print("\nTrainable Parameters:")
lora_model.print_trainable_parameters()

print("\nNotice: We're only training ~0.5% of the model's parameters!")
print("   This is the power of LoRA - efficient fine-tuning with minimal resources.")


Creating LoRA configuration...
LoRA configuration created:
   Rank: 16
   Alpha: 32
   Target modules: {'v_proj', 'o_proj', 'q_proj', 'k_proj'}
   Dropout: 0.1

Applying LoRA to model...
'NoneType' object has no attribute 'cadam32bit_grad_fp32'

LoRA applied successfully!

Trainable Parameters:
trainable params: 4,505,600 || all params: 1,104,553,984 || trainable%: 0.4079

Notice: We're only training ~0.5% of the model's parameters!
   This is the power of LoRA - efficient fine-tuning with minimal resources.


  warn("The installed version of bitsandbytes was compiled without GPU support. "
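
To see where the trainable-parameter count comes from, we can redo the arithmetic by hand. Each adapted linear layer with `d_in` inputs and `d_out` outputs gains two matrices, `A` (`r x d_in`) and `B` (`d_out x r`), so it adds `r * (d_in + d_out)` parameters. The sketch below assumes TinyLlama's published dimensions (hidden size 2048, 22 layers, key/value projections of width 256 from grouped-query attention). These are not printed anywhere in this notebook, so check them against `model.config` if you change models.

```python
# Assumed TinyLlama-1.1B dimensions (verify with model.config)
HIDDEN = 2048      # Model width
NUM_LAYERS = 22    # Transformer blocks
KV_DIM = 256       # 4 key/value heads x 64 dims (grouped-query attention)
RANK = 16          # r from the LoraConfig above

# (in_features, out_features) of each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(rank, projections, num_layers):
    """Each adapted layer adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections.values())
    return per_layer * num_layers


trainable = lora_param_count(RANK, PROJECTIONS, NUM_LAYERS)
total = 1_100_048_384 + trainable  # Base parameters from Step 2 plus adapters

print(f"trainable params: {trainable:,}")
print(f"all params: {total:,}")
print(f"trainable%: {100 * trainable / total:.4f}")
```

Under these assumptions the result should match the `print_trainable_parameters()` line above. Because the count is linear in `r`, halving the rank halves the adapter size.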


## Step 5: Prepare Training Data

We'll create a small, high-quality dataset to teach the model specific knowledge.


In [5]:
print("Preparing training data...")

# High-quality instruction-response pairs
training_data = [
    {
        'instruction': 'What is machine learning?',
        'response': 'Machine learning is a field of artificial intelligence where computers learn from data to make predictions or decisions without being explicitly programmed for each task. It uses algorithms that improve automatically through experience.'
    },
    {
        'instruction': 'Explain Python in simple terms',
        'response': 'Python is a beginner-friendly programming language known for its simple, readable syntax. It is widely used for web development, data analysis, artificial intelligence, and automation. Python code looks almost like English, making it easy to learn and understand.'
    },
    {
        'instruction': 'What are the benefits of renewable energy?',
        'response': 'Renewable energy offers several key benefits: environmental sustainability by reducing greenhouse gas emissions, energy independence by reducing reliance on fossil fuels, economic benefits through job creation and lower long-term costs, and resource conservation by using naturally replenishing sources like solar and wind.'
    },
    {
        'instruction': 'How does photosynthesis work?',
        'response': 'Photosynthesis is the process by which plants convert sunlight, water, and carbon dioxide into glucose (sugar) and oxygen. Chlorophyll in plant leaves captures light energy, which drives chemical reactions that produce food for the plant while releasing oxygen as a byproduct.'
    },
    {
        'instruction': 'What is the water cycle?',
        'response': 'The water cycle is the continuous movement of water on, above, and below Earth\'s surface. It includes evaporation from bodies of water, condensation into clouds, precipitation as rain or snow, and collection back into oceans, lakes, and rivers. This cycle is powered by the sun\'s energy.'
    },
    {
        'instruction': 'Explain what a neural network is',
        'response': 'A neural network is a computing system inspired by biological brains. It consists of interconnected nodes (neurons) organized in layers that process information. Each connection has a weight that adjusts during training, allowing the network to learn patterns from data and make predictions.'
    },
    {
        'instruction': 'What is climate change?',
        'response': 'Climate change refers to long-term shifts in global temperatures and weather patterns. While climate naturally varies, current changes are primarily driven by human activities, especially burning fossil fuels, which releases greenhouse gases that trap heat in the atmosphere and warm the planet.'
    },
    {
        'instruction': 'How do vaccines work?',
        'response': 'Vaccines work by training your immune system to recognize and fight specific diseases. They contain weakened or inactive parts of a particular pathogen that trigger an immune response. Your body produces antibodies and remembers the pathogen, providing protection if you encounter the real disease later.'
    }
]

print(f"Created {len(training_data)} high-quality training examples")
print("   Topics: ML, Programming, Science, Environment, Health")

# Format using TinyLlama's chat template
def format_instruction(example):
    return f"<|user|>\n{example['instruction']}<|assistant|>\n{example['response']}"

formatted_data = [{'text': format_instruction(example)} for example in training_data]
train_dataset = Dataset.from_list(formatted_data)

print("\nTokenizing training data...")
def tokenize_function(examples):
    tokenized = tokenizer(
        examples['text'],
        truncation=True,
        padding=True,
        max_length=256,
        return_tensors=None
    )
    return tokenized

tokenized_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(['text'])

print("Training data prepared and tokenized")
print(f"   Dataset size: {len(tokenized_dataset)} examples")


Preparing training data...
Created 8 high-quality training examples
   Topics: ML, Programming, Science, Environment, Health

Tokenizing training data...


Map:   0%|          | 0/8 [00:00<?, ? examples/s]

Training data prepared and tokenized
   Dataset size: 8 examples
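
The tokenized dataset has no `labels` column yet. In the next step, `DataCollatorForLanguageModeling` with `mlm=False` creates the labels as a copy of `input_ids` and replaces padding positions with `-100`, the value PyTorch's cross-entropy loss ignores by default. The shift by one token (predicting token t+1 from tokens up to t) happens inside the model. Here is a minimal pure-Python sketch of that masking step, using made-up token ids. `make_causal_lm_labels` is a hypothetical name, not a library function.

```python
IGNORE_INDEX = -100  # Label value skipped by PyTorch's cross-entropy loss


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, masking padding so it adds nothing to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Toy token ids (hypothetical, not real TinyLlama ids); 0 stands in for the pad id
pad_id = 0
batch = [
    [1, 15, 27, 42, 8],
    [1, 33, 9, pad_id, pad_id],  # Shorter example, padded to the batch length
]

labels = [make_causal_lm_labels(ids, pad_id) for ids in batch]
for ids, lab in zip(batch, labels):
    print(ids, "->", lab)
```

One consequence: when the pad token is the same as the EOS token, any EOS positions are masked as well, so the model gets no training signal for where to stop.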


## Step 6: Training Setup & Execution

Now we'll train the LoRA adapters. This should take about 2-3 minutes.


In [6]:
print("Setting up training configuration...")

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8,
    return_tensors='pt'
)

# Training arguments
training_args = TrainingArguments(
    output_dir='./tinyllama_lora',
    num_train_epochs=3,  # Ignored in this run: max_steps below takes precedence
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=2e-4,
    warmup_steps=5,
    logging_steps=1,
    save_steps=20,
    eval_strategy='no',
    save_total_limit=1,
    remove_unused_columns=False,
    push_to_hub=False,
    report_to='none',  # The string 'none' disables logging integrations
    dataloader_pin_memory=False,
    fp16=False,  # MPS compatibility
    bf16=False,
    max_steps=15,  # Quick demo - just 15 steps (overrides num_train_epochs)
)

print("Training configuration created")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Max steps: {training_args.max_steps}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")

# Callback that prints progress and records the loss at each logging step
from transformers import TrainerCallback

training_losses = []

class TrainingCallback(TrainerCallback):
    """Progress logger; hooks not overridden here are inherited as no-ops"""
    def on_train_begin(self, args, state, control, **kwargs):
        print("\nTraining started!")
    
    def on_epoch_begin(self, args, state, control, **kwargs):
        print(f"Starting epoch {state.epoch}")
    
    def on_epoch_end(self, args, state, control, **kwargs):
        print(f"Completed epoch {state.epoch}")
    
    def on_save(self, args, state, control, **kwargs):
        print("Model checkpoint saved!")
    
    def on_log(self, args, state, control, logs=None, **kwargs):
        # Per-step training logs carry a 'loss' key; the final summary log does not
        if logs and 'loss' in logs:
            training_losses.append(logs['loss'])
            print(f"Step {state.global_step}: Loss = {logs['loss']:.4f}")
    
    def on_train_end(self, args, state, control, **kwargs):
        print("\nTraining completed!")

# Create trainer
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
trainer.add_callback(TrainingCallback())

print("\nTrainer created successfully")
print("\nStarting LoRA training...")
print("   This will take about 2-3 minutes")
print("   Watch the loss decrease as the model learns!\n")

# Train!
try:
    trainer.train()
    print("\nTraining completed successfully!")
    
    # Save the trained LoRA adapters
    lora_model.save_pretrained('./tinyllama_trained_lora')
    print("Trained LoRA adapters saved to ./tinyllama_trained_lora")
    
    # Show training progress
    if training_losses:
        print(f"\nTraining loss progression:")
        print(f"   Initial loss: {training_losses[0]:.4f}")
        print(f"   Final loss: {training_losses[-1]:.4f}")
        print(f"   Improvement: {((training_losses[0] - training_losses[-1]) / training_losses[0] * 100):.1f}%")
    
except Exception as e:
    print(f"\nTraining failed: {e}")
    import traceback
    traceback.print_exc()


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Setting up training configuration...
Training configuration created
   Learning rate: 0.0002
   Max steps: 15
   Batch size: 1

Trainer created successfully

Starting LoRA training...
   This will take about 2-3 minutes
   Watch the loss decrease as the model learns!


Training started!
Starting epoch 0



Step 1: Loss = 1.5504
Step 2: Loss = 1.2507
Step 3: Loss = 1.5777
Step 4: Loss = 1.4585
Step 5: Loss = 1.4448
Step 6: Loss = 1.5585
Step 7: Loss = 1.1621
Step 8: Loss = 1.5073
Completed epoch 1.0
Starting epoch 1.0
Step 9: Loss = 1.0356
Step 10: Loss = 1.3757
Step 11: Loss = 1.0590
Step 12: Loss = 1.2490
Step 13: Loss = 0.7885
Step 14: Loss = 1.0772
Step 15: Loss = 1.0721
Model checkpoint saved!
Completed epoch 1.875

Training completed!

Training completed successfully!
Trained LoRA adapters saved to ./tinyllama_trained_lora

Training loss progression:
   Initial loss: 1.5504
   Final loss: 1.0721
   Improvement: 30.9%
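
Two details in this log are worth checking by hand. First, training stopped at epoch 1.875 even though `num_train_epochs=3`, because `max_steps=15` takes precedence and 8 examples at batch size 1 give 8 steps per epoch. Second, with a batch size of 1 the per-step loss is noisy, so comparing only the first and last step can mislead. The sketch below recomputes both from the logged values. The window size of 5 is an arbitrary choice for illustration.

```python
# Losses copied from the training log above
losses = [1.5504, 1.2507, 1.5777, 1.4585, 1.4448, 1.5585, 1.1621, 1.5073,
          1.0356, 1.3757, 1.0590, 1.2490, 0.7885, 1.0772, 1.0721]

# Values from the TrainingArguments and dataset in this notebook
num_examples, batch_size, grad_accum, max_steps = 8, 1, 1, 15
steps_per_epoch = num_examples // (batch_size * grad_accum)
epochs_run = max_steps / steps_per_epoch
print(f"steps per epoch: {steps_per_epoch}, epochs run: {epochs_run}")


def mean(values):
    return sum(values) / len(values)


window = 5  # Arbitrary smoothing window for illustration
first_avg = mean(losses[:window])
last_avg = mean(losses[-window:])
improvement = 100 * (first_avg - last_avg) / first_avg

print(f"mean of first {window} steps: {first_avg:.4f}")
print(f"mean of last {window} steps:  {last_avg:.4f}")
print(f"smoothed improvement: {improvement:.1f}%")
```

The smoothed figure is a little lower than the single-step comparison, but the downward trend is still there.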


## Step 7: Test Trained Model (After Training)

Now let's test the trained model with the same prompts to see the improvement!


In [7]:
print("Testing TRAINED MODEL (after LoRA fine-tuning):")
print("=" * 70)

lora_responses = []
for i, prompt in enumerate(test_prompts, 1):
    response = generate_response(lora_model, tokenizer, prompt)
    lora_responses.append(response)
    print(f"\nTest {i}:")
    print(f"Prompt: {prompt.replace('<|user|>', '').replace('<|assistant|>', '').strip()}")
    print(f"LoRA Response: {response}")
    print("-" * 70)

print("\nTrained model testing complete!")
print("Responses saved for comparison")


Testing TRAINED MODEL (after LoRA fine-tuning):

Test 1:
Prompt: What is machine learning?
LoRA Response: Machine learning is a field of computer science that involves learning from data without explicit human intervention. It involves the use of algorithms to learn from data that has previously been collected. Machine learning algorithms can be used for tasks such as recommendation systems, forecasting, and natural language processing.
----------------------------------------------------------------------

Test 2:
Prompt: Explain Python in simple terms
LoRA Response: Python is a high-level, interpreted, object-oriented programming language. It is widely used in web development, data science, and artificial intelligence applications. Python's syntax is simple and easy to learn, making it a popular choice for beginners. It has a large community of developers who contribute to the Python programming language. Python has a wide range of libraries and tools that developers can use to creat

## Step 8: Qualitative Analysis - Compare Results

Let's compare the base model vs the LoRA-finetuned model side-by-side.


In [8]:
print("COMPARISON: Base Model vs LoRA-Finetuned Model")
print("=" * 80)

for i, prompt in enumerate(test_prompts):
    clean_prompt = prompt.replace('<|user|>', '').replace('<|assistant|>', '').strip()
    print(f"\n{'='*80}")
    print(f"Prompt {i+1}: {clean_prompt}")
    print(f"{'='*80}")
    print(f"\nBase Model Response:")
    print(f"   {base_responses[i]}")
    print(f"\nLoRA-Trained Response:")
    print(f"   {lora_responses[i]}")
    print()

print("\n" + "=" * 80)
print("\nKey Observations:")
print("   - The LoRA model should provide more focused, informative responses")
print("   - Responses should better match the training data style")
print("   - We achieved this by training only ~0.4% of the model's parameters!")
print("\nTraining Efficiency:")
print("   - Training time: ~2-3 minutes")
print("   - Parameters trained: ~0.4% of total")
print("   - Memory efficient: No need to train full model")
print("   - Adapter size: ~18 MB in fp32 (vs the full model's ~2GB download)")


COMPARISON: Base Model vs LoRA-Finetuned Model

Prompt 1: What is machine learning?

Base Model Response:
   Machine learning is a branch of Artificial Intelligence (AI) that involves the use of algorithms and computational models to learn from data without being explicitly programmed. It has revolutionized many aspects of human society, from web search to self-driving cars. Here are some key concepts of machine learning:

1. Learning:machine learning involves the process of identifying patterns in data and using these patterns to make predictions or decisions.

2. Algorithms:machine learning

LoRA-Trained Response:
   Machine learning is a field of computer science that involves learning from data without explicit human intervention. It involves the use of algorithms to learn from data that has previously been collected. Machine learning algorithms can be used for tasks such as recommendation systems, forecasting, and natural language processing.


Prompt 2: Explain Python in simple t

## Step 9: Key Takeaways

### Why TinyLlama Works Better Than GPT-2

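1. **Instruction-Tuned**: TinyLlama-1.1B-Chat has already been fine-tuned on chat data, so it understands the user/assistant format
2. **Modern Architecture**: Uses the Llama 2 architecture, which is newer than GPT-2's (for example, rotary position embeddings and grouped-query attention)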
3. **Chat-Optimized**: Specifically designed for conversational tasks
4. **Better Baseline**: Starts with instruction-following capability

### How LoRA Enables Efficient Fine-Tuning

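1. **Parameter Efficiency**: Only trains about 0.4% of model parameters (4.5M of 1.1B)
2. **Low-Rank Adaptation**: Adds a pair of small trainable matrices to each targeted attention projection
3. **Memory Efficient**: Base model stays frozen, so gradients and optimizer state are needed only for the adapters
4. **Quick Training**: Fewer trainable parameters = cheaper optimizer updates and smaller checkpoints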
5. **Small Adapters**: Can store multiple task-specific adapters cheaply
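
The low-rank idea in the list above fits in a few lines of plain Python. For a frozen weight matrix `W`, LoRA computes `W x + (alpha / r) * B (A x)`, where only `A` and `B` are trained. This is a toy sketch with made-up sizes, not PEFT's implementation. It assumes the usual initialization in which `B` starts at zero, so the adapter changes nothing until training updates it.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Return W x + (alpha / rank) * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, rank, alpha = 8, 8, 2, 4  # Toy sizes, far smaller than TinyLlama

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # Frozen
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]  # Trainable
B = [[0.0] * rank for _ in range(d_out)]                                # Trainable, zero at start
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero the adapted layer reproduces the frozen layer exactly
assert lora_forward(W, A, B, x, alpha, rank) == matvec(W, x)

full_update_params = d_in * d_out
lora_params = rank * (d_in + d_out)
print(f"full update: {full_update_params} params, LoRA: {lora_params} params")
```

The saving grows with layer size: for a 2048 x 2048 projection at rank 16, the adapter holds 65,536 parameters against about 4.2 million in the full matrix.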

### When to Use LoRA vs Full Fine-Tuning

**Use LoRA When**:
- Limited compute resources
- Quick adaptation needed
- Multiple task-specific models required
- Base model is already good

**Use Full Fine-Tuning When**:
- Complete model behavior change needed
- Abundant compute resources
- Domain is very different from pre-training
- Maximum performance is critical
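
To make the compute trade-off concrete, here is a back-of-the-envelope memory estimate for this notebook's model. It assumes fp32 everywhere and an AdamW-style optimizer that keeps two moment buffers per trainable parameter, and it ignores activations and framework overhead, so read the numbers as rough lower bounds rather than measurements. `training_memory_bytes` is a hypothetical helper.

```python
BYTES_FP32 = 4
GIB = 1024 ** 3


def training_memory_bytes(total_params, trainable_params):
    """Rough fp32 estimate: weights for every parameter, plus a gradient and
    two optimizer moment buffers for each trainable parameter."""
    weights = total_params * BYTES_FP32
    grads = trainable_params * BYTES_FP32
    optimizer_states = trainable_params * BYTES_FP32 * 2
    return weights + grads + optimizer_states


base_params = 1_100_048_384  # Reported in Step 2
lora_params = 4_505_600      # Reported in Step 4

full_ft = training_memory_bytes(base_params, base_params)
lora_ft = training_memory_bytes(base_params + lora_params, lora_params)
adapter_file = lora_params * BYTES_FP32

print(f"full fine-tuning: {full_ft / GIB:.1f} GiB")
print(f"LoRA fine-tuning: {lora_ft / GIB:.1f} GiB")
print(f"adapter weights:  {adapter_file / 1024 ** 2:.1f} MiB")
```

Under these assumptions LoRA needs roughly a quarter of the memory of full fine-tuning, and nearly all of it is the frozen base weights.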

### Limitations and Next Steps

**Current Limitations**:
- Small training set (8 examples)
- Quick demo (15 steps)
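- Limited evaluation: no quantitative metrics, sampled (non-deterministic) outputs, and all three test prompts also appear in the training set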

**To Improve Further**:
1. Use more training data (100+ examples)
2. Train for more steps (50-100+)
3. Add evaluation metrics (BLEU, ROUGE)
4. Experiment with different LoRA ranks
5. Try QLoRA for even more efficiency
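
As a first step toward item 3, a word-overlap score gives a rough number for how close a response is to a reference answer. The sketch below is a simplified ROUGE-1-style F1 on lowercase, whitespace-split words. It is not the official ROUGE implementation (no stemming or careful tokenization), and `unigram_f1` is a hypothetical helper. The example strings are shortened stand-ins, not outputs from this notebook.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over clipped unigram matches between candidate and reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # Each word counts at most min(cand, ref) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "machine learning is a field of artificial intelligence"
candidates = {
    "close": "machine learning is a field of computer science",
    "far": "python is a programming language",
}

for name, text in candidates.items():
    print(f"{name}: {unigram_f1(text, reference):.3f}")
```

To use it here, score each entry of `base_responses` and `lora_responses` against the matching training response, ideally on held-out prompts and averaged over several sampled generations.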

### Conclusion

LoRA is a powerful technique for efficient fine-tuning. By training only a small fraction of parameters, we can adapt pre-trained models to specific tasks quickly and cheaply. This makes on-device fine-tuning practical and accessible!
