<div align="center">
<a href="https://rapidfire.ai/"><img src="https://raw.githubusercontent.com/RapidFireAI/rapidfireai/main/images/RapidFire - Blue bug -white text.svg" width="115"></a>
<a href="https://discord.gg/6vSTtncKNN"><img src="https://raw.githubusercontent.com/RapidFireAI/rapidfireai/main/images/discord-button.svg" width="145"></a>
<a href="https://oss-docs.rapidfire.ai/"><img src="https://raw.githubusercontent.com/RapidFireAI/rapidfireai/main/images/documentation-button.svg" width="125"></a>
<br/>
Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/RapidFireAI/rapidfireai">GitHub</a></i> ⭐
<br/>
To install RapidFire AI on your own machine, see the <a href="https://oss-docs.rapidfire.ai/en/latest/walkthrough.html">Install and Get Started</a> guide in our docs.
</div>

⚠️ **IMPORTANT:** Do not let the Colab notebook tab stay idle for more than 5 minutes; otherwise Colab will disconnect. Refresh the TensorBoard screen or interact with the cells to avoid disconnection.

# RapidFire AI in Google Colab with TensorBoard

This tutorial demonstrates how to use RapidFire AI in Google Colab with built-in TensorBoard for real-time metrics visualization.

## Start RapidFire Services in Colab Mode

RapidFire requires the API Server to manage experiment state. Open the Colab terminal (Tools > Command palette > Terminal) and run:

```bash
pip install rapidfireai # Takes 1 min
rapidfireai init # Takes 1 min
export RF_TRACKING_BACKEND=tensorboard
rapidfireai start --colab & # Takes 0.5 min
```

The `--colab` flag will:
- ✅ Start the API Server (required for experiment state management)
- ⊗ Skip the frontend server (using TensorBoard instead)

You should see output like:
```
📦 RapidFire AI Initializing...
✅ [1/1] Dispatcher server started
🚀 RapidFire running in Colab mode!
📊 Use TensorBoard for metrics visualization:
   %tensorboard --logdir ~/experiments/{experiment_name}/tensorboard_logs
```

**IMPORTANT: Leave this terminal running while you work in your notebook!**

## Configure RapidFire to Use TensorBoard

We'll set an environment variable to tell RapidFire to use TensorBoard instead of MLflow:

In [None]:
import os

# Load TensorBoard extension
%load_ext tensorboard

# Configure RapidFire to use TensorBoard
os.environ['RF_TRACKING_BACKEND'] = 'tensorboard'  # Options: 'mlflow', 'tensorboard', 'both'
# TensorBoard log directory will be auto-created in experiment path
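
A typo in the backend name is easy to miss, so it can help to validate the value before creating an experiment. The sketch below checks the variable against the three options listed in the comment above. The helper name `check_tracking_backend` is hypothetical and not part of RapidFire AI.

```python
import os

# Options taken from the comment in the cell above.
VALID_BACKENDS = {"mlflow", "tensorboard", "both"}

def check_tracking_backend(env=None):
    """Return the configured backend, or raise if it is missing or unrecognized."""
    env = os.environ if env is None else env
    backend = env.get("RF_TRACKING_BACKEND")
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"RF_TRACKING_BACKEND={backend!r}; expected one of {sorted(VALID_BACKENDS)}"
        )
    return backend

# Example with an explicit mapping; call check_tracking_backend() to inspect os.environ.
print(check_tracking_backend({"RF_TRACKING_BACKEND": "tensorboard"}))
```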

## Configure Hugging Face token

Install `huggingface-hub` and replace `YOUR-TOKEN-HERE` with your own Hugging Face token.

**IMPORTANT: Hugging Face does not allow us to provide a public HF token. You need to sign up for a Hugging Face account and obtain a token.**

In [None]:
!pip install "huggingface-hub[cli]"

In [None]:
!hf auth login --token YOUR-TOKEN-HERE
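
If you would rather not paste the token into a notebook cell, one alternative is to read it from the `HF_TOKEN` environment variable, which `huggingface_hub` picks up automatically, and prompt for it only when it is unset. This is a minimal sketch; the helper name `ensure_hf_token` is hypothetical.

```python
import os
from getpass import getpass

def ensure_hf_token(env_var="HF_TOKEN"):
    """Return the token from the environment, prompting for it only if it is unset."""
    token = os.environ.get(env_var)
    if not token:
        token = getpass("Hugging Face token: ")  # input is not echoed
        os.environ[env_var] = token
    return token

# In a notebook cell, call: ensure_hf_token()
```

Because the prompt input is not echoed, the token does not end up in the saved notebook output.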

## Import RapidFire Components

In [None]:
from rapidfireai import Experiment
from rapidfireai.automl import List, RFGridSearch, RFModelConfig, RFLoraConfig, RFSFTConfig

## Load Dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")

# REDUCED dataset for memory constraints in Colab
train_dataset = dataset["train"].select(range(64))  # Reduced from 128
eval_dataset = dataset["train"].select(range(50, 60))  # 10 examples
train_dataset = train_dataset.shuffle(seed=42)
eval_dataset = eval_dataset.shuffle(seed=42)
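
Both subsets are drawn from the same `train` split, so it is worth checking how the two index ranges relate. The snippet below works on the indices alone and does not need the dataset:

```python
train_indices = range(64)      # indices passed to select() for training
eval_indices = range(50, 60)   # indices passed to select() for evaluation

# Indices that appear in both subsets.
overlap = sorted(set(train_indices) & set(eval_indices))
print(f"{len(overlap)} of {len(eval_indices)} eval indices also appear in the training subset")
```

This is fine for a quick demo, but the evaluation metrics will then be measured on examples the model has trained on. If you want a held-out set, one option is an evaluation range that starts at 64 or later, for example `range(64, 74)`.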

## Define Data Processing Function

We'll format the data as Q&A pairs for GPT-2:

In [None]:
def sample_formatting_function(example):
    """Format the dataset for GPT-2 while preserving original fields"""
    return {
        "text": f"Question: {example['instruction']}\nAnswer: {example['response']}",
        "instruction": example['instruction'],  # Keep original
        "response": example['response']  # Keep original
    }

# Apply formatting to datasets
eval_dataset = eval_dataset.map(sample_formatting_function)
train_dataset = train_dataset.map(sample_formatting_function)
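
To see what a formatted example looks like without loading the dataset, the function can be applied to a single hand-written record. The record below is hypothetical and not taken from the Bitext dataset; the function is repeated so the snippet runs on its own.

```python
def sample_formatting_function(example):
    """Same formatting as in the cell above, repeated so this snippet is self-contained."""
    return {
        "text": f"Question: {example['instruction']}\nAnswer: {example['response']}",
        "instruction": example["instruction"],
        "response": example["response"],
    }

# Hypothetical record for illustration only.
record = {
    "instruction": "How do I cancel my order?",
    "response": "Open Orders and select Cancel.",
}
formatted = sample_formatting_function(record)
print(formatted["text"])
```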

## Define Metrics Function

We'll use a lightweight metrics computation with just ROUGE-L to save memory:

In [None]:
def sample_compute_metrics(eval_preds):
    """Lightweight metrics computation"""
    predictions, labels = eval_preds

    try:
        import evaluate

        # Only compute ROUGE-L (skip BLEU to save memory)
        rouge = evaluate.load("rouge")
        rouge_output = rouge.compute(
            predictions=predictions,
            references=labels,
            use_stemmer=True,
            rouge_types=["rougeL"]  # Only compute rougeL
        )

        return {
            "rougeL": round(rouge_output["rougeL"], 4),
        }
    except Exception as e:
        # Fallback if metrics fail
        print(f"Metrics computation failed: {e}")
        return {}
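
For intuition about what ROUGE-L measures, here is a simplified pure-Python sketch based on the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization and applies no stemming, so its scores will not match the `evaluate` implementation exactly; the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

In this example five of the six tokens form a common subsequence, so precision, recall, and F1 are all 5/6.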

## Initialize Experiment

In [None]:
# Create experiment with unique name
my_experiment = "tensorboard-demo-1"
experiment = Experiment(experiment_name=my_experiment)

## Get TensorBoard Log Directory

The TensorBoard logs are stored in the experiment directory. Let's get the path:

In [None]:
# Get experiment path
from rapidfireai.db.rf_db import RfDb

db = RfDb()
experiment_path = db.get_experiments_path(my_experiment)
tensorboard_log_dir = f"{experiment_path}/{my_experiment}/tensorboard_logs"

print(f"TensorBoard logs will be saved to: {tensorboard_log_dir}")
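
Once training has started, a quick way to confirm that metrics are being written is to look for TensorBoard event files, whose names begin with `events.out.tfevents.`, under that directory. The helper below is a hypothetical sketch using only the standard library.

```python
from pathlib import Path

def find_event_files(log_dir):
    """Return TensorBoard event files found anywhere under log_dir, sorted by path."""
    root = Path(log_dir).expanduser()
    if not root.is_dir():
        return []  # directory not created yet, e.g. before the first run starts
    return sorted(root.rglob("events.out.tfevents.*"))

# In the notebook, call: find_event_files(tensorboard_log_dir)
```

An empty list before training begins is expected; if it stays empty after runs have logged steps, the log directory path is the first thing to check.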

## Define Model Configurations

This tutorial uses GPT-2 (124M parameters), which is small enough to fit within Colab's memory constraints:

In [None]:
# GPT-2 specific LoRA configs - different module names!
peft_configs_lite = List([
    RFLoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=["c_attn"],  # GPT-2 combines Q,K,V in c_attn
        bias="none"
    ),
    RFLoraConfig(
        r=32,
        lora_alpha=64,
        lora_dropout=0.1,
        target_modules=["c_attn", "c_proj"],  # c_attn (QKV) + c_proj (output)
        bias="none"
    )
])

# 2 configs with GPT-2
config_set_lite = List([
    RFModelConfig(
        model_name="gpt2",  # Only 124M params
        peft_config=peft_configs_lite,
        training_args=RFSFTConfig(
            learning_rate=5e-4,  # Low lr for more stability
            lr_scheduler_type="linear",
            per_device_train_batch_size=2,
            gradient_accumulation_steps=2,  # Effective bs = 4
            max_steps=64, # Raise this to see more learning
            logging_steps=2,
            eval_strategy="steps",
            eval_steps=4,
            per_device_eval_batch_size=2,
            fp16=True,
            gradient_checkpointing=True,  # Save memory
            report_to="none",  # Disables wandb
        ),
        model_type="causal_lm",
        model_kwargs={
            "device_map": "auto",
            "torch_dtype": "float16",  # Explicit fp16
            "use_cache": False
        },
        formatting_func=sample_formatting_function,
        compute_metrics=sample_compute_metrics,
        generation_config={
            "max_new_tokens": 128,  # Reduced from 256
            "temperature": 0.7,
            "top_p": 0.9,
            "top_k": 40,
            "repetition_penalty": 1.1,
            "pad_token_id": 50256,  # GPT-2's EOS token
        }
    ),
    RFModelConfig(
        model_name="gpt2",
        peft_config=peft_configs_lite,
        training_args=RFSFTConfig(
            learning_rate=2e-4,  # Even more conservative
            lr_scheduler_type="cosine",  # Try cosine schedule
            per_device_train_batch_size=2,
            gradient_accumulation_steps=2,
            max_steps=64, # Raise this to see more learning behaviors
            logging_steps=2,
            eval_strategy="steps",
            eval_steps=4,
            per_device_eval_batch_size=2,
            fp16=True,
            gradient_checkpointing=True,
            report_to="none",  # Disables wandb
            warmup_steps=10,  # Add warmup for stability
        ),
        model_type="causal_lm",
        model_kwargs={
            "device_map": "auto",
            "torch_dtype": "float16",
            "use_cache": False
        },
        formatting_func=sample_formatting_function,
        compute_metrics=sample_compute_metrics,
        generation_config={
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_p": 0.9,
            "top_k": 40,
            "repetition_penalty": 1.1,
            "pad_token_id": 50256,
        }
    )
])
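
The training arguments above imply a few derived quantities that are useful when reading the TensorBoard plots. The arithmetic below assumes a single GPU and standard Hugging Face Trainer step semantics; how RapidFire schedules these steps across chunks is not covered here.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
max_steps = 64
logging_steps = 2
eval_steps = 4
num_train_examples = 64
num_devices = 1  # assumption: one Colab GPU

# Examples contributing to each optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total examples consumed over the whole run, and how many passes that is.
examples_seen = effective_batch_size * max_steps
passes_over_data = examples_seen / num_train_examples
# Number of points to expect on the training-loss and eval curves.
num_log_points = max_steps // logging_steps
num_evals = max_steps // eval_steps

print(effective_batch_size, examples_seen, passes_over_data, num_log_points, num_evals)
```

Under these assumptions each run makes about four passes over the 64 training examples, so raising `max_steps` mainly adds more passes over the same small subset.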

In [None]:
def sample_create_model(model_config):
    """Function to create model object with GPT-2 adjustments"""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = model_config["model_name"]
    model_type = model_config["model_type"]
    model_kwargs = model_config["model_kwargs"]

    if model_type == "causal_lm":
        model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
    else:
        # Default to causal LM
        model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # GPT-2 specific: Set pad token (GPT-2 doesn't have one by default)
    if "gpt2" in model_name.lower():
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "left"  # GPT-2 works better with left padding
        model.config.pad_token_id = model.config.eos_token_id

    return (model, tokenizer)
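
The effect of `padding_side = "left"` can be shown without a tokenizer. In the sketch below, token ids are hypothetical and `pad_batch` is an illustrative helper, not a `transformers` function. With left padding, the last position of every row holds a real prompt token, which is where a decoder-only model continues generating.

```python
PAD_ID = 50256  # GPT-2's EOS token id, reused as the pad token

def pad_batch(sequences, pad_id=PAD_ID, side="left"):
    """Pad token-id lists to the same length on the given side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        padded.append(filler + list(seq) if side == "left" else list(seq) + filler)
    return padded

batch = [[10, 11, 12], [20]]  # hypothetical token ids
left = pad_batch(batch, side="left")
right = pad_batch(batch, side="right")

print("left: ", left)
print("right:", right)
```

With right padding, the shorter row would end in pad tokens, so generated tokens would follow padding rather than the prompt.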

In [None]:
# Simple grid search across all config combinations: 4 total (2 LoRA configs × 2 trainer configs)
config_group = RFGridSearch(
    configs=config_set_lite,
    trainer_type="SFT"
)
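
The count of four runs comes from crossing the two LoRA configs with the two trainer configs. The sketch below enumerates that cross product with plain dictionaries and `itertools.product`. It only illustrates the combinatorics; it is not how `RFGridSearch` is implemented, and the order in which RapidFire creates runs may differ.

```python
from itertools import product

# Key knobs copied from the configs above, as plain dictionaries.
lora_variants = [
    {"r": 8, "lora_alpha": 16, "target_modules": ["c_attn"]},
    {"r": 32, "lora_alpha": 64, "target_modules": ["c_attn", "c_proj"]},
]
trainer_variants = [
    {"learning_rate": 5e-4, "lr_scheduler_type": "linear"},
    {"learning_rate": 2e-4, "lr_scheduler_type": "cosine"},
]

# One run per (trainer, LoRA) pair.
runs = [
    {"trainer": trainer, "lora": lora}
    for trainer, lora in product(trainer_variants, lora_variants)
]

for i, run in enumerate(runs, start=1):
    print(i, run["trainer"]["lr_scheduler_type"], "r =", run["lora"]["r"])
```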

## Launch Interactive Run Controller

RapidFire AI provides an Interactive Controller that lets you manage executing runs dynamically in real time from the notebook:

- ⏹️ **Stop**: Gracefully stop a running config
- ▶️ **Resume**: Resume a stopped run
- 🗑️ **Delete**: Remove a run from this experiment
- 📋 **Clone**: Create a new run by editing the config dictionary of a parent run to try new knob values; optional warm start of parameters
- 🔄 **Refresh**: Update run status and metrics

The Controller uses ipywidgets and is compatible with both Colab (ipywidgets 7.x) and Jupyter (ipywidgets 8.x).

In [None]:
# Create Interactive Controller
from rapidfireai.utils.interactive_controller import InteractiveController

controller = InteractiveController(dispatcher_url="http://127.0.0.1:8081")
controller.display()
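
If the controller cannot connect, a first thing to check is whether anything is listening at the dispatcher URL used above. The sketch below attempts a plain TCP connection with the standard library; `is_reachable` is a hypothetical helper, and a successful connection only shows that the port is open, not that the server is healthy.

```python
import socket
from urllib.parse import urlparse

def is_reachable(url, timeout=1.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False

# In the notebook, call: is_reachable("http://127.0.0.1:8081")
```

If this returns `False`, check that the terminal running `rapidfireai start --colab` is still open.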

## Start TensorBoard

**IMPORTANT: Start TensorBoard BEFORE invoking `run_fit()` below so that you can watch metrics appear in real time!**

In [None]:
%tensorboard --logdir {tensorboard_log_dir}

## Run Training + Validation

Now we get to the main function for running multi-config training and evals. The metrics will appear in TensorBoard above in real-time.

In [None]:
# Launch training
experiment.run_fit(
    config_group,
    sample_create_model,
    train_dataset,
    eval_dataset,
    num_chunks=4,  # 4 chunks for hyperparallel execution
    seed=42
)
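
This tutorial does not spell out how the chunks are formed. As a rough mental model only, assuming equal contiguous shards, `num_chunks=4` would split the 64 training examples into four pieces of 16. The helper below is a hypothetical illustration and not RapidFire's sharding code.

```python
def split_into_chunks(num_examples, num_chunks):
    """Split indices 0..num_examples-1 into contiguous, near-equal ranges."""
    base, extra = divmod(num_examples, num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first chunks
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = split_into_chunks(64, 4)
print([len(chunk) for chunk in chunks])
```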

## End Experiment

In [None]:
experiment.end()

## View TensorBoard Plots and Logs

After the experiment has ended, you can still view the full logs in TensorBoard:

In [None]:
# View final logs
%tensorboard --logdir {tensorboard_log_dir}

## View RapidFire AI Log Files

You can track the work being done by the system via the log files that RapidFire AI writes to the `rapidfire_experiments/` folder. To see them, open the Colab terminal and run:

```bash
tail -n 20 rapidfire_experiments/rapidfire.log
tail -n 20 rapidfire_experiments/training.log
```
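
If you prefer to stay in the notebook, the same view can be approximated in Python. The sketch below reads the last lines of a file with the standard library; `tail` is a hypothetical helper, and the paths are the ones used in the commands above.

```python
from collections import deque
from pathlib import Path

def tail(path, n=20):
    """Return the last n lines of a text file, without trailing newlines."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as fh:
        # A bounded deque keeps only the most recent n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]

# In the notebook, call for example:
# print("\n".join(tail("rapidfire_experiments/rapidfire.log")))
```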