# Language Model Fine-tuning with Hyperparameter Optimization

This notebook fine-tunes a DistilGPT-2 model on a small slice of WikiText-2 and uses Ray Tune to search over hyperparameters automatically. We will:

1. Install required packages
2. Import libraries and select the computing device
3. Set up the model and tokenizer
4. Prepare the dataset
5. Define the training function for Ray Tune
6. Configure the hyperparameter search space
7. Initialize Ray and run the optimization
8. Analyze results

## Step 1: Install Required Packages

Install the necessary packages for language model fine-tuning and hyperparameter optimization:
- `transformers`: Hugging Face library for pre-trained models
- `datasets`: For loading and processing datasets
- `accelerate`: For distributed training support
- `ray[tune]`: For hyperparameter optimization

In [None]:
# Install the packages listed above; --quiet reduces output verbosity
# "ray[tune]" is quoted so the shell does not treat the brackets as a glob pattern
!pip install transformers datasets accelerate "ray[tune]" --quiet

## Step 2: Import Libraries and Setup

Import all necessary libraries and set up the computing device.

In [None]:
# Import PyTorch for deep learning operations
import torch

# Import Hugging Face transformers components for language modeling
from transformers import (
    AutoTokenizer,                    # For text tokenization
    AutoModelForCausalLM,            # For causal language models like GPT
    Trainer,                         # For training loop management
    TrainingArguments,               # For training configuration
    DataCollatorForLanguageModeling  # For batch preparation during training
)

# Import datasets library for loading and processing data
from datasets import load_dataset

# Import Ray for distributed computing and hyperparameter optimization
import ray
from ray import tune, train
from ray.tune.schedulers import ASHAScheduler  # Asynchronous Successive Halving Algorithm

# Determine the computing device (GPU if available, otherwise CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## Step 3: Model and Tokenizer Setup

Define a function to load and configure the DistilGPT-2 model and its tokenizer.

In [None]:
# Define function to initialize model and tokenizer
def get_model_and_tokenizer():
    # Load pre-trained DistilGPT-2 tokenizer (smaller version of GPT-2)
    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    
    # Set padding token to be the same as end-of-sequence token
    # This is necessary because GPT-2 doesn't have a dedicated padding token
    tokenizer.pad_token = tokenizer.eos_token
    
    # Load the pre-trained DistilGPT-2 model for causal language modeling
    # Causal LM means the model predicts the next token given previous tokens
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")
    
    # Resize token embeddings to match the tokenizer vocabulary size
    # Reusing EOS as the pad token adds no new tokens, so this is a no-op here;
    # it only matters if extra tokens are added to the tokenizer
    model.resize_token_embeddings(len(tokenizer))
    
    # Move model to the appropriate device (GPU/CPU) and return both components
    return model.to(device), tokenizer

## Step 4: Dataset Preparation

Define a function to load and tokenize the WikiText-2 dataset for language model training.

In [None]:
# Define function to load and preprocess the dataset
def load_tokenized_dataset(tokenizer, block_size=64):
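    # Load WikiText-2 dataset (a collection of Wikipedia articles)
    # Using only 1% of the training split for faster experimentation
    raw_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

    # Drop blank rows: WikiText contains many empty lines, which would tokenize
    # to padding only and give the model nothing to learn from
    raw_dataset = raw_dataset.filter(lambda example: example["text"].strip() != "")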

    # Define tokenization function for the dataset
    def tokenize(example):
        # Tokenize the text with padding and truncation
        # max_length=block_size limits sequence length for memory efficiency
        # padding="max_length" ensures all sequences have the same length
        # truncation=True cuts off text that exceeds max_length
        tokens = tokenizer(
            example["text"], 
            padding="max_length", 
            truncation=True, 
            max_length=block_size
        )
        
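        # For causal language modeling, labels start as a copy of input_ids
        # (the model shifts them internally to predict the next token)
        # DataCollatorForLanguageModeling(mlm=False), used in Step 5, rebuilds
        # labels from input_ids and sets padded positions to -100
        tokens["labels"] = tokens["input_ids"].copy()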
        
        return tokens

    # Apply tokenization to the entire dataset
    # batched=True processes multiple examples at once for efficiency
    # remove_columns=["text"] removes the original text column after tokenization
    tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])
    
    # Set format to PyTorch tensors for compatibility with PyTorch models
    tokenized.set_format("torch")
    
    return tokenized
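
To make the padding and label handling concrete, here is a minimal pure-Python sketch of what happens to one sequence. It is an illustration only, not the tokenizer's or collator's actual code: `pad_and_mask`, `PAD_ID`, and the token ids are hypothetical. It assumes the behavior of `DataCollatorForLanguageModeling(mlm=False)`, which replaces every label at a pad-token position with `-100` so that the loss ignores it.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_and_mask(token_ids, block_size, pad_id):
    """Truncate or right-pad token_ids to block_size and build matching labels."""
    ids = token_ids[:block_size]                 # truncation=True
    n_pad = block_size - len(ids)
    input_ids = ids + [pad_id] * n_pad           # padding="max_length"
    attention_mask = [1] * len(ids) + [0] * n_pad
    # Mimic the collator: any position holding pad_id is excluded from the loss.
    labels = [IGNORE_INDEX if t == pad_id else t for t in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


PAD_ID = 0  # hypothetical id; in this notebook the pad id equals the EOS id

short = pad_and_mask([11, 12, 13], block_size=5, pad_id=PAD_ID)
long = pad_and_mask([11, 12, 13, 14, 15, 16, 17], block_size=5, pad_id=PAD_ID)
empty = pad_and_mask([], block_size=5, pad_id=PAD_ID)

print(short["labels"])  # real tokens kept, padded tail ignored
print(empty["labels"])  # nothing left for the loss to use
```

Because the pad token is the EOS token here, genuine EOS tokens are masked as well, and a blank input row ends up with no trainable positions at all.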

## Step 5: Training Function for Ray Tune

Define the training function that will be optimized by Ray Tune. This function receives hyperparameter configurations and reports back the evaluation loss.

In [None]:
# Define training function that Ray Tune will call with different hyperparameter configurations
def train_with_tune(config):
    # Print the current hyperparameter configuration being tested
    print(f"Starting trial with config: {config}")
    
    try:
        # Initialize model and tokenizer for this trial
        model, tokenizer = get_model_and_tokenizer()
        
        # Load and tokenize the dataset
        dataset = load_tokenized_dataset(tokenizer)

        # Split dataset into training and evaluation sets (80/20 split)
        train_size = int(0.8 * len(dataset))
        train_dataset = dataset.select(range(train_size))
        eval_dataset = dataset.select(range(train_size, len(dataset)))

        # Print dataset sizes for monitoring
        print(f"Train size: {len(train_dataset)}, Eval size: {len(eval_dataset)}")

        # Configure training arguments using hyperparameters from Ray Tune
        training_args = TrainingArguments(
            output_dir="./output",                                    # Directory to save model checkpoints
            per_device_train_batch_size=config["batch_size"],        # Batch size from hyperparameter search
            learning_rate=config["lr"],                              # Learning rate from hyperparameter search
            num_train_epochs=config["epochs"],                       # Number of epochs from hyperparameter search
            logging_steps=5,                                         # Log metrics every 5 steps
            save_strategy="no",                                      # Don't save checkpoints to save disk space
            report_to="none",                                        # Disable external logging (like wandb)
            fp16=torch.cuda.is_available(),                          # Use half-precision if GPU is available
        )

        # Create data collator for language modeling
        # mlm=False indicates causal language modeling (like GPT), not masked language modeling (like BERT)
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=tokenizer, 
            mlm=False  # GPT-style causal LM, not masked LM
        )

        # Initialize the Trainer with model, arguments, datasets, and data collator
        trainer = Trainer(
            model=model,                    # The model to train
            args=training_args,             # Training configuration
            train_dataset=train_dataset,    # Training data
            eval_dataset=eval_dataset,      # Evaluation data
            data_collator=data_collator,    # Handles batch preparation
        )

        # Execute the training process
        trainer.train()

        # Evaluate the model on the validation set
        result = trainer.evaluate()
        print("Eval results:", result)

        # Extract evaluation loss for Ray Tune optimization
        eval_loss = result.get("eval_loss", None)
        
        # Handle case where evaluation loss is missing
        if eval_loss is None:
            print("Warning: eval_loss missing. Reporting dummy loss = 9999.")
            eval_loss = 9999.0

        # Report the loss back to Ray Tune for optimization
        # Ray Tune will use this to determine which hyperparameters work best
        train.report({"loss": eval_loss})

    except Exception as e:
        # Handle any errors during training by reporting a high loss value
        print(f"Trial crashed: {e}")
        train.report({"loss": 9999.0})
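
The loss reported to Ray Tune is easier to interpret as perplexity. The sketch below assumes that `eval_loss` is the mean per-token cross-entropy in nats, which is what `Trainer.evaluate()` returns for a causal language model; the helper name and the example loss values are hypothetical.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Hypothetical eval losses, for illustration only (not results from this notebook)
for loss in (3.2, 3.5, 4.0):
    print(f"eval_loss={loss:.2f} -> perplexity={perplexity(loss):.1f}")
```

Since the exponential is monotonic, minimizing `eval_loss` and minimizing perplexity select the same configuration; perplexity is only a more readable scale.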

## Step 6: Configure Hyperparameter Search Space

Define the hyperparameters to optimize and their possible values.
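
The learning rate in this step is sampled with `tune.loguniform`. As a conceptual sketch (not Ray's implementation), log-uniform sampling draws uniformly in log space and exponentiates, so each order of magnitude is equally likely. The helper name `sample_loguniform` and the seed are assumptions made for this example.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_loguniform(1e-5, 1e-4, rng) for _ in range(10_000)]

# Under log-uniform sampling, about half the draws fall below the geometric
# midpoint sqrt(low * high), not below the arithmetic midpoint.
geometric_midpoint = math.sqrt(1e-5 * 1e-4)
fraction_below = sum(s < geometric_midpoint for s in samples) / len(samples)
print(f"fraction below geometric midpoint: {fraction_below:.3f}")
```

A plain uniform distribution over the same range would place only about a quarter of its samples below that geometric midpoint, leaving small learning rates under-explored.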

In [None]:
# Define the hyperparameter search space for optimization
search_space = {
    # Batch size options: small values for memory efficiency
    # tune.choice() randomly selects one of the provided options
    "batch_size": tune.choice([1, 2]),
    
    # Learning rate range: logarithmic distribution between 1e-5 and 1e-4
    # tune.loguniform() samples from a log-uniform distribution
    # This is common for learning rates as they often work best on a log scale
    "lr": tune.loguniform(1e-5, 1e-4),
    
    # Number of training epochs: limited to 1 or 2 for fast experimentation
    # In practice, you might use more epochs for better convergence
    "epochs": tune.choice([1, 2]),
}

# Initialize ASHA (Asynchronous Successive Halving Algorithm) scheduler
# ASHA stops poorly performing trials early and gives more budget to promising ones
# Note: it acts on intermediate results, and train_with_tune reports only once
# at the end of training, so early stopping has little effect in this setup
scheduler = ASHAScheduler()
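
The idea behind the scheduler can be sketched with plain synchronous successive halving, the simpler algorithm that ASHA makes asynchronous. This is a toy illustration, not Ray's `ASHAScheduler`: the function names, the `toy_loss` stand-in for training, and the candidate learning rates are all hypothetical.

```python
import math


def successive_halving(configs, evaluate, min_budget=1, reduction_factor=2):
    """Repeatedly evaluate the survivors, keep the best fraction, raise the budget.

    evaluate(config, budget) must return a loss, where lower is better.
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]      # prune the worst configurations
        budget *= reduction_factor     # survivors get more training budget
    return survivors[0]


def toy_loss(config, budget):
    """Hypothetical stand-in for training: best near lr = 3e-5, improves with budget."""
    return abs(math.log10(config["lr"]) - math.log10(3e-5)) + 1.0 / budget


candidates = [{"lr": lr} for lr in (1e-5, 2e-5, 3e-5, 5e-5, 1e-4)]
best = successive_halving(candidates, toy_loss)
print(best)
```

The synchronous version waits for every configuration in a round before pruning; ASHA makes the same promote-or-stop decision per trial as results arrive, which keeps workers busy when trials run in parallel.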

## Step 7: Initialize Ray and Run Hyperparameter Optimization

Initialize Ray for distributed computing and execute the hyperparameter search.

In [None]:
# Clean shutdown of any existing Ray instance to avoid conflicts
ray.shutdown()

# Initialize Ray for distributed computing
# ignore_reinit_error=True prevents errors if Ray is already running
# num_cpus=2 limits CPU usage for this notebook environment
ray.init(ignore_reinit_error=True, num_cpus=2)

# Execute the hyperparameter optimization using Ray Tune
analysis = tune.run(
    train_with_tune,                                          # Function to optimize
    config=search_space,                                      # Hyperparameter search space
    num_samples=2,                                           # Number of different configurations to try
    scheduler=scheduler,                                      # ASHA scheduler for early stopping
    metric="loss",                                           # Metric to optimize (minimize loss)
    mode="min",                                              # Minimize the metric (lower loss is better)
    resources_per_trial={                                    # Resource allocation per trial
        "cpu": 1,                                            # 1 CPU core per trial
        "gpu": 0.5 if torch.cuda.is_available() else 0      # 0.5 GPU per trial if available
    },
    raise_on_failed_trial=False,                             # Continue even if some trials fail
)
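
The `num_cpus` and `resources_per_trial` settings together bound how many trials can run at once. The following is a simplified arithmetic model, not Ray's actual scheduler: `max_concurrent_trials` is a hypothetical helper that divides each available resource by the per-trial request and takes the smallest quotient.

```python
def max_concurrent_trials(cluster, per_trial):
    """Trials that fit at once, limited by the scarcest requested resource.

    Assumes at least one resource in per_trial has a positive request.
    """
    limits = [
        cluster.get(name, 0) // needed
        for name, needed in per_trial.items()
        if needed > 0  # a zero request never limits concurrency
    ]
    return int(min(limits))


# CPU-only machine: 2 CPUs available, each trial asks for 1 CPU and no GPU
cpu_only = max_concurrent_trials({"cpu": 2}, {"cpu": 1, "gpu": 0})

# One GPU shared by fractional requests of 0.5 GPU per trial
shared_gpu = max_concurrent_trials({"cpu": 2, "gpu": 1}, {"cpu": 1, "gpu": 0.5})

# Same machine, but each trial claims the whole GPU
whole_gpu = max_concurrent_trials({"cpu": 2, "gpu": 1}, {"cpu": 1, "gpu": 1})

print(cpu_only, shared_gpu, whole_gpu)
```

Under this model the settings above allow two trials side by side, which matches `num_samples=2`. Fractional GPU requests only control scheduling; two trials sharing a GPU must still fit in its memory together.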

## Step 8: Analyze Optimization Results

Extract and display the best hyperparameter configuration found by Ray Tune.

In [None]:
# Analyze and display the optimization results
# get_best_trial() returns None if no trial reported the metric,
# so check the trial itself before reading its config
best_trial = analysis.get_best_trial(metric="loss", mode="min")

if best_trial is not None:
    # Display the best hyperparameter configuration and its reported loss
    print("Best hyperparameters found:", best_trial.config)
    print("Best reported loss:", best_trial.last_result["loss"])

    # You can also access additional information about the best trial:
    # analysis.best_result - the best result metrics
    # analysis.best_logdir - directory containing the best trial's logs

else:
    # If no trial reported a result, inform the user
    print("No successful trials completed. The model setup is ready for manual tuning.")

# Additional analysis you can perform:
# analysis.trials - list of all trials
# analysis.get_best_trial() - get the best trial object
# analysis.results_df - pandas DataFrame with all results
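
Because `train_with_tune` reports a loss of `9999.0` when a trial crashes, a crashed trial still counts as a reported result and can be returned as the "best" one if every trial fails. A minimal sketch of filtering out that sentinel before choosing a winner is shown below; `best_successful` and the result rows are hypothetical and only mimic the shape of per-trial results.

```python
FAILURE_LOSS = 9999.0  # sentinel that train_with_tune reports for crashed trials


def best_successful(results, failure_loss=FAILURE_LOSS):
    """Return the lowest-loss result that is not a failure sentinel, or None."""
    successful = [r for r in results if r["loss"] < failure_loss]
    if not successful:
        return None
    return min(successful, key=lambda r: r["loss"])


# Hypothetical results for illustration only (not output from this notebook)
results = [
    {"config": {"batch_size": 1, "lr": 2e-5, "epochs": 1}, "loss": 3.91},
    {"config": {"batch_size": 2, "lr": 7e-5, "epochs": 2}, "loss": 3.64},
    {"config": {"batch_size": 2, "lr": 9e-5, "epochs": 1}, "loss": 9999.0},
]

best = best_successful(results)
print(best["config"] if best else "All trials failed")
```

The same check can be applied to the rows of `analysis.results_df` by filtering on its `loss` column before taking the minimum.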