# RapidFire AI with TensorBoard in Google Colab

This tutorial demonstrates how to use RapidFire AI with TensorBoard for real-time metrics visualization in Google Colab.

## Why TensorBoard in Colab?

- **Real-time visualization**: View training metrics as they happen
- **No frontend loading delay**: TensorBoard loads instantly in Colab
- **Native Colab support**: TensorBoard works natively with `%tensorboard` magic
- **Live updates**: Metrics refresh every 30 seconds, even while the training cell is still running

## Setup

First, let's install RapidFire AI and load the TensorBoard extension:

In [None]:
# Install RapidFire AI
!pip install rapidfireai

# Load TensorBoard extension
%load_ext tensorboard

## Configure RapidFire to Use TensorBoard

We'll set environment variables to tell RapidFire to use TensorBoard instead of MLflow:

In [None]:
import os

# Configure RapidFire to use TensorBoard
os.environ['RF_TRACKING_BACKEND'] = 'tensorboard'  # Options: 'mlflow', 'tensorboard', 'both'
# TensorBoard log directory will be auto-created in experiment path
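
A typo in the backend name is easy to miss, so it can help to validate the value before exporting it. The following is a minimal sketch: `set_tracking_backend` and `VALID_BACKENDS` are hypothetical names, not part of RapidFire AI, and the allowed values are taken only from the comment in the cell above.

```python
import os

# Values listed in the cell above; assumed here to be the complete set of options.
VALID_BACKENDS = {"mlflow", "tensorboard", "both"}


def set_tracking_backend(backend: str) -> str:
    """Hypothetical helper: validate the backend name, then export it."""
    normalized = backend.strip().lower()
    if normalized not in VALID_BACKENDS:
        raise ValueError(
            f"Unknown backend {backend!r}; expected one of {sorted(VALID_BACKENDS)}"
        )
    os.environ["RF_TRACKING_BACKEND"] = normalized
    return normalized


set_tracking_backend("tensorboard")
print(os.environ["RF_TRACKING_BACKEND"])
```

Failing early with a `ValueError` is a design choice for this sketch; how RapidFire itself handles an unrecognized value is not covered here.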

## Start RapidFire Services in Colab Mode

**IMPORTANT**: RapidFire requires the dispatcher service to manage experiment state. Open the Colab terminal (Tools > Command palette > Terminal) and run:

```bash
export RF_TRACKING_BACKEND=tensorboard
rapidfireai start --colab
```

The `--colab` flag will:
- ✅ Start the dispatcher service (required for experiment state management)
- ⊗ Skip the frontend server (using TensorBoard instead)
- ⊗ Skip MLflow when using TensorBoard-only tracking (conditional)

You should see output like:
```
📦 RapidFire AI Initializing...
✅ [1/1] Dispatcher server started
🚀 RapidFire running in Colab mode!
📊 Use TensorBoard for metrics visualization:
   %tensorboard --logdir ~/experiments/{experiment_name}/tensorboard_logs
```

**Note**: If you want to use both TensorBoard and MLflow, set `RF_TRACKING_BACKEND=both` and the MLflow service will also start.

Leave this terminal running while you work in your notebook!

## Import RapidFire Components

In [None]:
from rapidfireai import Experiment
from rapidfireai.automl import List, RFGridSearch, RFModelConfig, RFLoraConfig, RFSFTConfig

## Load Dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")

# REDUCED dataset for memory constraints in Colab
train_dataset = dataset["train"].select(range(64))  # Reduced from 128
eval_dataset = dataset["train"].select(range(64, 74))  # 10 examples, disjoint from the training slice
train_dataset = train_dataset.shuffle(seed=42)
eval_dataset = eval_dataset.shuffle(seed=42)
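
Because both subsets are carved out of the same `train` split, it is worth confirming that the index ranges passed to `select` do not share rows. Otherwise evaluation examples leak into training. The sketch below uses a hypothetical helper, `overlapping_indices`, and made-up ranges purely for illustration.

```python
def overlapping_indices(train_indices, eval_indices):
    """Return the sorted row indices that appear in both selections."""
    return sorted(set(train_indices) & set(eval_indices))


# Disjoint ranges: no shared rows.
print(overlapping_indices(range(64), range(64, 74)))  # → []

# Overlapping ranges: rows 60-63 would be used for both training and evaluation.
print(overlapping_indices(range(64), range(60, 70)))  # → [60, 61, 62, 63]
```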

## Define Data Processing Function

We'll format the data as Q&A pairs for GPT-2:

In [None]:
def sample_formatting_function(example):
    """Format the dataset for GPT-2 while preserving original fields"""
    return {
        "text": f"Question: {example['instruction']}\nAnswer: {example['response']}",
        "instruction": example['instruction'],  # Keep original
        "response": example['response']  # Keep original
    }

# Apply formatting to datasets
eval_dataset = eval_dataset.map(sample_formatting_function)
train_dataset = train_dataset.map(sample_formatting_function)
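
To see what the model is actually trained on, the same template can be applied to a single record. This sketch re-declares the template under a different name (`format_example`) so it runs on its own; the record is made up and is not taken from the Bitext dataset.

```python
def format_example(example):
    """Same Q&A template as sample_formatting_function above."""
    return {
        "text": f"Question: {example['instruction']}\nAnswer: {example['response']}",
        "instruction": example["instruction"],
        "response": example["response"],
    }


# Made-up record for illustration only.
record = {
    "instruction": "How do I cancel my order?",
    "response": "Open Orders and select Cancel.",
}

formatted = format_example(record)
print(formatted["text"])
```

The `text` field holds the question and answer on two lines, while the original `instruction` and `response` fields stay available for metric computation.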

## Define Metrics Function

We'll use a lightweight metrics computation with just ROUGE-L to save memory:

In [None]:
def sample_compute_metrics(eval_preds):
    """Lightweight metrics computation"""
    predictions, labels = eval_preds

    try:
        import evaluate

        # Only compute ROUGE-L (skip BLEU to save memory)
        rouge = evaluate.load("rouge")
        rouge_output = rouge.compute(
            predictions=predictions,
            references=labels,
            use_stemmer=True,
            rouge_types=["rougeL"]  # Only compute rougeL
        )

        return {
            "rougeL": round(rouge_output["rougeL"], 4),
        }
    except Exception as e:
        # Fallback if metrics fail
        print(f"Metrics computation failed: {e}")
        return {}
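
For intuition about what ROUGE-L measures, here is a simplified pure-Python sketch based on the longest common subsequence (LCS) of tokens. It is not the `evaluate` implementation: it assumes whitespace tokenization, lowercasing, and no stemming, so its scores can differ from the library's.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F1: harmonic mean of LCS precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat", "the cat sat down")
print(round(score, 4))  # → 0.8571
```

Here the LCS has length 3, giving precision 3/3 and recall 3/4, hence an F1 of about 0.857.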

## Initialize Experiment

In [None]:
# Create experiment with unique name
experiment = Experiment(experiment_name="tensorboard-demo")

## Get TensorBoard Log Directory

The TensorBoard logs are stored in the experiment directory. Let's get the path:

In [None]:
# Get experiment path
from rapidfireai.utils.datapaths import DataPath
from rapidfireai.db.rf_db import RfDb

db = RfDb()
experiment_path = db.get_experiments_path("tensorboard-demo")
tensorboard_log_dir = f"{experiment_path}/tensorboard_logs"

print(f"TensorBoard logs will be saved to: {tensorboard_log_dir}")

## Start TensorBoard

**IMPORTANT**: Start TensorBoard BEFORE running training so you can watch metrics update in real time!
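
In a notebook cell, the usual way to do this is the `%tensorboard --logdir {tensorboard_log_dir}` magic. As a sketch of a programmatic equivalent, the helpers below create the log directory up front and pass the same arguments to `tensorboard.notebook.start`. The helper names `tensorboard_args` and `start_tensorboard` are hypothetical, and creating the directory ahead of time is an assumption made so that TensorBoard has a valid path before any metrics are written.

```python
import os
import shlex


def tensorboard_args(log_dir: str) -> str:
    """Create the log directory if needed and build the TensorBoard argument string."""
    log_dir = os.path.expanduser(log_dir)
    os.makedirs(log_dir, exist_ok=True)
    return f"--logdir {shlex.quote(log_dir)}"


def start_tensorboard(log_dir: str) -> None:
    """Launch TensorBoard inline; programmatic counterpart of the %tensorboard magic."""
    from tensorboard import notebook  # imported lazily; requires the tensorboard package

    notebook.start(tensorboard_args(log_dir))


# Usage in the notebook, once tensorboard_log_dir is defined:
# start_tensorboard(tensorboard_log_dir)
```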

## Define Model Configuration

We'll use GPT-2 (124M parameters), which is roughly 9x smaller than TinyLlama (1.1B parameters) and fits well within Colab's memory constraints:

In [None]:
# GPT-2 specific LoRA configs - different module names!
peft_configs_lite = List([
    RFLoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=["c_attn"],  # GPT-2 combines Q,K,V in c_attn
        bias="none"
    ),
    RFLoraConfig(
        r=32,
        lora_alpha=64,
        lora_dropout=0.1,
        target_modules=["c_attn", "c_proj"],  # c_attn (QKV) + c_proj (output)
        bias="none"
    )
])

# 2 training configs with GPT-2 (124M params, roughly 9x smaller than TinyLlama)
config_set_lite = List([
    RFModelConfig(
        model_name="gpt2",  # Only 124M params
        peft_config=peft_configs_lite,
        training_args=RFSFTConfig(
            learning_rate=5e-4,  # Lower than TinyLlama since GPT-2 is more sensitive
            lr_scheduler_type="linear",
            per_device_train_batch_size=2,  # Reduced for memory
            per_device_eval_batch_size=2,
            max_steps=128,
            gradient_accumulation_steps=2,  # Effective batch size = 4
            logging_steps=2,
            eval_strategy="steps",
            eval_steps=4,
            fp16=True,
            gradient_checkpointing=True,  # Save memory
            report_to="none",  # Disables logging integrations such as wandb
        ),
        model_type="causal_lm",
        model_kwargs={
            "device_map": "auto",
            "torch_dtype": "float16",  # Explicit fp16
            "use_cache": False
        },
        formatting_func=sample_formatting_function,
        compute_metrics=sample_compute_metrics,
        generation_config={
            "max_new_tokens": 128,  # Reduced from 256
            "temperature": 0.7,     # Lower temp for GPT-2
            "top_p": 0.9,
            "top_k": 40,           # GPT-2 works well with slightly higher k
            "repetition_penalty": 1.1,
            "pad_token_id": 50256,  # GPT-2's EOS token
        }
    ),
    RFModelConfig(
        model_name="gpt2",
        peft_config=peft_configs_lite,
        training_args=RFSFTConfig(
            learning_rate=2e-4,  # Even more conservative
            lr_scheduler_type="cosine",  # Try cosine schedule
            per_device_train_batch_size=2,
            per_device_eval_batch_size=2,
            max_steps=128,
            gradient_accumulation_steps=2,
            logging_steps=2,
            eval_strategy="steps",
            eval_steps=4,
            fp16=True,
            gradient_checkpointing=True,
            report_to="none",  # Disables logging integrations such as wandb
            warmup_steps=10,  # Add warmup for stability
        ),
        model_type="causal_lm",
        model_kwargs={
            "device_map": "auto",
            "torch_dtype": "float16",
            "use_cache": False
        },
        formatting_func=sample_formatting_function,
        compute_metrics=sample_compute_metrics,
        generation_config={
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_p": 0.9,
            "top_k": 40,
            "repetition_penalty": 1.1,
            "pad_token_id": 50256,
        }
    )
])
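
The batch-size and step settings above interact, so a quick back-of-the-envelope calculation helps. This sketch assumes a single GPU and standard Hugging Face `Trainer` step semantics; RapidFire's chunk-based scheduling may change the details.

```python
# Values copied from the GPT-2 training arguments above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
max_steps = 128
eval_steps = 4
num_train_examples = 64
num_devices = 1  # assumption: one Colab GPU

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
examples_seen = effective_batch_size * max_steps
approx_epochs = examples_seen / num_train_examples
num_evaluations = max_steps // eval_steps

print(effective_batch_size)  # → 4
print(examples_seen)         # → 512
print(approx_epochs)         # → 8.0
print(num_evaluations)       # → 32
```

Under these assumptions each run passes over the 64 training examples about eight times and evaluates 32 times, which is why the evaluation set is kept small.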

In [None]:
# Alternative to the GPT-2 configs above: LoRA configs for TinyLlama (Llama-style module names)
peft_configs = List([
    RFLoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],
        bias="none"
    ),
    RFLoraConfig(
        r=32,
        lora_alpha=64,
        lora_dropout=0.1,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        bias="none"
    )
])

# Define model configs
config_set = List([
    RFModelConfig(
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=peft_configs,
        training_args=RFSFTConfig(
            learning_rate=1e-3,
            lr_scheduler_type="linear",
            per_device_train_batch_size=4,
            per_device_eval_batch_size=4,
            max_steps=64,  # Short training for demo
            gradient_accumulation_steps=1,
            logging_steps=2,  # Frequent logging for TensorBoard
            eval_strategy="steps",
            eval_steps=8,
            fp16=True,
        ),
        model_type="causal_lm",
        model_kwargs={"device_map": "auto", "torch_dtype": "auto", "use_cache": False},
        formatting_func=sample_formatting_function,
    )
])

In [None]:
def sample_create_model(model_config):
    """Function to create model object with GPT-2 adjustments"""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = model_config["model_name"]
    model_type = model_config["model_type"]
    model_kwargs = model_config["model_kwargs"]

    if model_type == "causal_lm":
        model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
    else:
        # Default to causal LM
        model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # GPT-2 specific: Set pad token (GPT-2 doesn't have one by default)
    if "gpt2" in model_name.lower():
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "left"  # Left padding keeps generated tokens adjacent to the prompt
        model.config.pad_token_id = model.config.eos_token_id

    return (model, tokenizer)
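
To illustrate what reusing the EOS token for padding and padding on the left means for a batch, here is a small pure-Python sketch. It does not use the real tokenizer: `pad_left` is a hypothetical helper and the token ids are made up, apart from `50256`, GPT-2's EOS id.

```python
PAD_TOKEN_ID = 50256  # GPT-2's EOS token id, reused as the pad token


def pad_left(sequences, pad_id=PAD_TOKEN_ID):
    """Left-pad token-id lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = pad_left([[10, 11, 12], [20]])  # made-up token ids
print(ids)   # → [[10, 11, 12], [50256, 50256, 20]]
print(mask)  # → [[1, 1, 1], [0, 0, 1]]
```

With left padding, the last position of every row is a real token, so generation continues directly from the prompt rather than from padding.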

In [None]:
# Generic version for the TinyLlama config; running this cell overrides the GPT-2 version above
def sample_create_model(model_config):
    """Function to create model object for any given config"""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model_name = model_config["model_name"]
    model_kwargs = model_config["model_kwargs"]
    
    model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    return (model, tokenizer)

In [None]:
# Simple grid search across all config combinations = 4 total (2 LoRA configs × 2 training configs)
config_group = RFGridSearch(
    configs=config_set_lite,
    trainer_type="SFT"
)
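
The count of four runs follows from taking every pairing of a LoRA config with a training config. The sketch below reproduces that arithmetic with plain dictionaries and `itertools.product`. It only illustrates the combinatorics and is not how `RFGridSearch` is implemented; the dictionaries are abbreviated stand-ins for the configs above.

```python
from itertools import product

# Abbreviated stand-ins for the two LoRA configs.
lora_options = [
    {"r": 8, "target_modules": ["c_attn"]},
    {"r": 32, "target_modules": ["c_attn", "c_proj"]},
]

# Abbreviated stand-ins for the two training configs.
training_options = [
    {"learning_rate": 5e-4, "lr_scheduler_type": "linear"},
    {"learning_rate": 2e-4, "lr_scheduler_type": "cosine"},
]

runs = [
    {"lora": lora, "training": training}
    for lora, training in product(lora_options, training_options)
]

print(len(runs))  # → 4
for run in runs:
    print(run["lora"]["r"], run["training"]["lr_scheduler_type"])
```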

In [None]:
# Simple grid search over the TinyLlama config (2 LoRA configs × 1 training config); overrides config_group above
config_group = RFGridSearch(
    configs=config_set,
    trainer_type="SFT"
)

In [None]:
# Launch training - metrics will appear in TensorBoard above!
experiment.run_fit(
    config_group, 
    sample_create_model, 
    train_dataset, 
    eval_dataset, 
    num_chunks=4,  # 4 chunks for parallel execution
    seed=42
)

In [None]:
# Launch training with whichever config_group was defined last - metrics will appear in TensorBoard above!
experiment.run_fit(
    config_group, 
    sample_create_model, 
    train_dataset, 
    eval_dataset, 
    num_chunks=2,  # 2 chunks for demo
    seed=42
)

## End Experiment

In [None]:
experiment.end()

## View TensorBoard Logs

After training completes, you can still view the full logs:

In [None]:
# View final logs
%tensorboard --logdir {tensorboard_log_dir}

## Using Both MLflow and TensorBoard

You can also log to both backends simultaneously by setting:

```python
os.environ['RF_TRACKING_BACKEND'] = 'both'
```

This gives you:
- **TensorBoard**: Real-time visualization during training
- **MLflow**: Experiment comparison and model registry

## Tips for Colab + TensorBoard

1. **Start TensorBoard first**: Always start TensorBoard before training
2. **Frequent logging**: Set `logging_steps` to a small value (e.g., 2-5) for responsive updates
3. **Refresh rate**: TensorBoard polls logs every 30 seconds in Colab
4. **Multiple experiments**: Use different experiment names for different runs
5. **Clean logs**: Delete old logs with `!rm -rf {tensorboard_log_dir}` to start fresh
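
As a slightly safer alternative to `rm -rf` in tip 5, the cleanup can be done from Python with a guard on the directory name. This is a sketch: `reset_log_dir` is a hypothetical helper, and the check for a folder named `tensorboard_logs` simply mirrors the directory name used in this tutorial.

```python
import shutil
from pathlib import Path


def reset_log_dir(log_dir: str) -> Path:
    """Delete a TensorBoard log directory and recreate it empty."""
    path = Path(log_dir).expanduser()
    # Guard against deleting an unrelated directory by mistake.
    if path.name != "tensorboard_logs":
        raise ValueError(f"Refusing to delete {path}: not a tensorboard_logs directory")
    shutil.rmtree(path, ignore_errors=True)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Usage in the notebook, once tensorboard_log_dir is defined:
# reset_log_dir(tensorboard_log_dir)
```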

## Comparison: TensorBoard vs MLflow in Colab

| Feature | TensorBoard | MLflow |
|---------|-------------|--------|
| Real-time updates | ✅ Yes (30s polling) | ❌ No (frontend load time) |
| Colab native | ✅ %tensorboard magic | ❌ Requires tunneling |
| Load time | ✅ Instant | ❌ 3-5 minutes via tunnel |
| Model registry | ❌ No | ✅ Yes |
| Experiment comparison | ✅ Basic | ✅ Advanced |

**Recommendation**: Use `'both'` backend to get the best of both worlds!

## Next Steps

- Try different model configs and compare in TensorBoard
- Experiment with `'both'` backend for comprehensive tracking
- Check out other RapidFire tutorials for DPO and GRPO training

Happy training! 🚀