<a target="_blank" href="https://colab.research.google.com/github/sonder-art/automl_o24/blob/main/codigo/nlp_chatbots/hf_lora2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# LoRA Finetuning with HuggingFace Transformers

This notebook walks through LoRA finetuning of HuggingFace causal language models. The code is modular and configurable, and it runs on a GPU (CUDA) when one is available, falling back to the CPU otherwise.

## Table of Contents
1. [Prerequisites](#Prerequisites)
2. [Data Pipeline](#Data-Pipeline)
3. [Model Architecture](#Model-Architecture)
4. [Training Configuration](#Training-Configuration)
5. [Training Features](#Training-Features)
6. [Evaluation and Visualization](#Evaluation-and-Visualization)
7. [Testing](#Testing)
8. [Full Training Example](#Full-Training-Example)


In [1]:
# Install necessary packages if not already installed
%pip install transformers datasets accelerate bitsandbytes loralib torch


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install -q huggingface_hub transformers

Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
from huggingface_hub import login, whoami

def setup_huggingface(token):
    """Log in to the HuggingFace Hub and verify that the token is accepted."""
    try:
        # Set token and login
        os.environ["HUGGINGFACE_TOKEN"] = token
        login(token=token)

        # Verify the login by asking the Hub which account the token belongs to.
        # Downloading a public model would succeed even without authentication,
        # whereas whoami() raises if the token is invalid.
        whoami(token=token)
        print("✓ Login successful - You can now access private models and datasets")
        return True
    except Exception as e:
        print(f"✗ Login failed: {e}")
        return False


In [4]:

# Usage
from getpass import getpass

# Never hard-code a token in a notebook; prompt for it instead.
# Create a token at https://huggingface.co/settings/tokens
token = getpass("HuggingFace token: ")
setup_huggingface(token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/uumami/.cache/huggingface/token
Login successful
✓ Login successful - You can now access private models and datasets


True

## Prerequisites

- **CUDA Support**: Automatically uses GPU if available, with a fallback to CPU.
- **Modular Architecture**: Functions and classes are designed for reuse.
- **Error Handling and Logging**: Implemented throughout the code.
- **Model and Data Agnostic**: Easily change models and datasets.


In [5]:
import torch
import logging
import sys
import os
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset, DatasetDict
import numpy as np
from typing import Any, Dict, List, Tuple
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from accelerate import Accelerator

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


In [6]:
# Device configuration
def get_device() -> torch.device:
    """
    Returns the available device (GPU/CPU).

    Returns:
        torch.device: The device to use.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    logger.info(f"Using device: {device}")
    return device

device = get_device()


INFO:__main__:Using device: cuda


## Data Pipeline

We create train/validation/test splits, tokenize the text, and group the tokens into fixed-length blocks for causal language modeling.


In [7]:
def load_and_split_dataset(
    dataset_name: str,
    split_ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    **kwargs
) -> DatasetDict:
    """
    Loads and splits a dataset into train, validation, and test sets.

    Args:
        dataset_name (str): The name of the dataset to load.
        split_ratios (Tuple[float, float, float], optional): Ratios for train, val, test splits.
        **kwargs: Extra arguments forwarded to `load_dataset` (for example, `name`).

    Returns:
        DatasetDict: A dictionary with 'train', 'validation', and 'test' datasets.
    """
    try:
        dataset = load_dataset(dataset_name, **kwargs)
        logger.info(f"Loaded dataset {dataset_name}")
    except Exception as e:
        logger.error(f"Error loading dataset {dataset_name}: {e}")
        sys.exit(1)

    # Assume the dataset has a 'train' split
    train_val_test = dataset["train"].train_test_split(
        test_size=split_ratios[2],
        seed=42
    )
    train_val = train_val_test["train"].train_test_split(
        test_size=split_ratios[1] / (split_ratios[0] + split_ratios[1]),
        seed=42
    )

    datasets = DatasetDict({
        "train": train_val["train"],
        "validation": train_val["test"],
        "test": train_val_test["test"]
    })
    logger.info("Split dataset into train, validation, and test sets")
    return datasets




In [8]:
# Example usage:
datasets = load_and_split_dataset("wikitext", name="wikitext-2-raw-v1")

INFO:__main__:Loaded dataset wikitext
INFO:__main__:Split dataset into train, validation, and test sets
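
The two-step split can be checked with a little arithmetic. The sketch below is illustrative only (the helper `nested_split_proportions` is hypothetical and not part of the notebook): it shows why the second `test_size` is `val / (train + val)` rather than `val` itself. The real `train_test_split` rounds to whole rows, so actual sizes can differ slightly.

```python
def nested_split_proportions(ratios=(0.8, 0.1, 0.1)):
    """Return the (train, val, test) fractions produced by two successive splits."""
    train_r, val_r, test_r = ratios
    # First split: carve the test fraction off the full dataset.
    test = test_r
    remaining = 1.0 - test_r
    # Second split: the validation share is relative to what remains,
    # so it must be rescaled to end up as val_r of the full dataset.
    val_share = val_r / (train_r + val_r)
    val = remaining * val_share
    train = remaining - val
    return train, val, test

train, val, test = nested_split_proportions()
print(f"train={train:.3f} val={val:.3f} test={test:.3f}")
```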


In [9]:
def preprocess_data(
    datasets: DatasetDict,
    tokenizer: AutoTokenizer,
    text_column_name: str = "text",
    block_size: int = 128
) -> DatasetDict:
    """
    Tokenizes and preprocesses the datasets.

    Args:
        datasets (DatasetDict): The dataset dictionary.
        tokenizer (AutoTokenizer): The tokenizer to use.
        text_column_name (str, optional): The column name containing the text.
        block_size (int, optional): The block size for chunking the data.

    Returns:
        DatasetDict: The tokenized datasets.
    """
    def tokenize_function(examples):
        return tokenizer(examples[text_column_name])

    tokenized_datasets = datasets.map(
        tokenize_function,
        batched=True,
        num_proc=4,
        remove_columns=[text_column_name],
        load_from_cache_file=True,
        desc="Running tokenizer on dataset"
    )

    def group_texts(examples):
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples["input_ids"])
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

    lm_datasets = tokenized_datasets.map(
        group_texts,
        batched=True,
        num_proc=4,
        load_from_cache_file=True,
        desc=f"Grouping texts into chunks of {block_size}"
    )

    logger.info("Preprocessed datasets")
    return lm_datasets



In [12]:
# Example usage:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # Reuse EOS as the padding token
lm_datasets = preprocess_data(datasets, tokenizer)


INFO:__main__:Preprocessed datasets
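
To make the grouping step concrete, here is a minimal pure-Python sketch on toy token ids. The helper `group_into_blocks` is a hypothetical stand-in for `group_texts`; unlike the cell above, it always drops a tail shorter than one block.

```python
from itertools import chain

def group_into_blocks(token_lists, block_size):
    """Concatenate token lists and cut them into equal-sized blocks."""
    flat = list(chain.from_iterable(token_lists))
    # Drop the tail that does not fill a whole block.
    usable = (len(flat) // block_size) * block_size
    blocks = [flat[i:i + block_size] for i in range(0, usable, block_size)]
    # For causal language modeling the labels are a copy of the inputs;
    # the model shifts them by one position internally.
    return {"input_ids": blocks, "labels": [b.copy() for b in blocks]}

toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
print(group_into_blocks(toy, block_size=4))
# {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]], 'labels': [[1, 2, 3, 4], [5, 6, 7, 8]]}
```

Document boundaries are ignored on purpose: the ten toy tokens become two blocks of four, and tokens 9 and 10 are discarded.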


## Model Architecture

We select a base model and configure it for LoRA finetuning.
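
LoRA keeps a pre-trained weight matrix W (shape d_out × d_in) frozen and learns a low-rank update BA, scaled by alpha / r, where A is r × d_in and B is d_out × r. The sketch below only counts parameters to show why this is cheap; the 2048 × 2048 layer size is an illustrative assumption, not a value read from the model.

```python
def lora_param_counts(d_in, d_out, r):
    """Compare the size of a full weight matrix with its LoRA update."""
    full = d_in * d_out        # frozen pre-trained weight W
    lora = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
    return full, lora

# Hypothetical square projection layer with rank-8 adapters.
full, lora = lora_param_counts(d_in=2048, d_out=2048, r=8)
print(f"full: {full:,}  lora: {lora:,}  fraction trainable: {lora / full:.2%}")
```

With these assumed sizes, each adapted layer trains well under one percent of the parameters that full finetuning of the same layer would.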


In [13]:
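import loralib as lora
import torch.nn as nn

def get_model(
    model_name: str,
    device: torch.device,
    lora_r: int = 8,
    lora_alpha: int = 16,
    lora_dropout: float = 0.1,
    target_modules: Tuple[str, ...] = ("q_proj", "v_proj")
) -> AutoModelForCausalLM:
    """
    Loads a pre-trained model and prepares it for LoRA finetuning.

    Args:
        model_name (str): The name of the pre-trained model.
        device (torch.device): The device to load the model on.
        lora_r (int, optional): Rank of the LoRA matrices.
        lora_alpha (int, optional): Alpha scaling parameter for LoRA.
        lora_dropout (float, optional): Dropout probability for LoRA layers.
        target_modules (Tuple[str, ...], optional): Names of the nn.Linear submodules to
            adapt. The defaults match LLaMA-style attention projections; GPT-2 uses
            differently named Conv1D layers, so the defaults match nothing there.

    Returns:
        AutoModelForCausalLM: The model ready for finetuning.
    """
    try:
        model = AutoModelForCausalLM.from_pretrained(model_name)
        model.to(device)
        logger.info(f"Loaded model {model_name}")
    except Exception as e:
        logger.error(f"Error loading model {model_name}: {e}")
        sys.exit(1)

    # Apply LoRA. loralib provides LoRA layer classes rather than an injection helper,
    # so each targeted nn.Linear is swapped for a lora.Linear that reuses its weights.
    replaced = 0
    for parent_name, parent in list(model.named_modules()):
        for child_name, child in list(parent.named_children()):
            if child_name in target_modules and isinstance(child, nn.Linear):
                lora_layer = lora.Linear(
                    child.in_features,
                    child.out_features,
                    r=lora_r,
                    lora_alpha=lora_alpha,
                    lora_dropout=lora_dropout,
                    bias=child.bias is not None,
                )
                # Reuse the pre-trained parameters instead of the fresh random ones
                lora_layer.weight = child.weight
                if child.bias is not None:
                    lora_layer.bias = child.bias
                setattr(parent, child_name, lora_layer.to(device))
                replaced += 1
                logger.info(f"Applied LoRA to {parent_name}.{child_name}")

    if replaced:
        # Freeze everything except the LoRA matrices
        lora.mark_only_lora_as_trainable(model)
    else:
        logger.warning("No module matched target_modules; the model is left fully trainable")

    return model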




In [14]:
# Example usage:
model = get_model("meta-llama/Llama-3.2-1B-Instruct", device)

config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

ERROR:__main__:Error loading model meta-llama/Llama-3.2-1B-Instruct: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity of 3.94 GiB of which 19.38 MiB is free. Including non-PyTorch memory, this process has 3.91 GiB memory in use. Of the allocated memory 3.86 GiB is allocated by PyTorch, and 13.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


SystemExit: 1

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


## Training Configuration

We expose key hyperparameters for easy configuration.


In [None]:
from dataclasses import dataclass
@dataclass
class TrainingConfig:
    learning_rate: float = 5e-5
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.1
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 4
    per_device_eval_batch_size: int = 4
    evaluation_strategy: str = "steps"
    eval_steps: int = 500
    save_steps: int = 500
    logging_steps: int = 100
    output_dir: str = "./results"
    seed: int = 42

# Example usage:
config = TrainingConfig()
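
Because the configuration is a dataclass, variants can be derived without mutating the defaults. The sketch below uses a trimmed-down, hypothetical `DemoConfig` so that it runs on its own; the same calls should work with `TrainingConfig`.

```python
from dataclasses import asdict, dataclass, replace

@dataclass
class DemoConfig:  # hypothetical stand-in for TrainingConfig
    learning_rate: float = 5e-5
    lora_r: int = 8
    num_train_epochs: int = 3

base = DemoConfig()
# replace() returns a new instance, so the base configuration stays untouched.
quick = replace(base, num_train_epochs=1, learning_rate=1e-4)

print(asdict(base))
print(asdict(quick))
```

`asdict` is also a convenient way to log the exact hyperparameters next to the results of a run.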


## Training Features

We define the evaluation metric (perplexity), the data collator, and a training function that relies on the `Trainer`'s built-in periodic evaluation, early stopping, and checkpointing.


In [None]:
def compute_metrics(eval_preds):
    """
    Computes perplexity from the evaluation logits and labels.

    Args:
        eval_preds (Tuple): Predictions and labels.

    Returns:
        Dict[str, float]: The computed metrics.
    """
    logits, labels = eval_preds
    # Position t predicts token t + 1, so drop the last logit and the first label
    shift_logits = logits[..., :-1, :].reshape(-1, logits.shape[-1])
    shift_labels = labels[..., 1:].reshape(-1)
    # CrossEntropyLoss ignores labels equal to -100 (the collator's padding marker) by default
    loss_fct = torch.nn.CrossEntropyLoss()
    loss = loss_fct(
        torch.tensor(shift_logits),
        torch.tensor(shift_labels)
    )
    # Perplexity is the exponential of the mean token-level cross-entropy
    perplexity = torch.exp(loss)
    return {"perplexity": perplexity.item()}

# No need to test this function separately as it will be used during evaluation
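
Perplexity is the exponential of the mean negative log-likelihood per token. The following pure-Python sketch (the helper name is hypothetical) illustrates the relationship on made-up probabilities and then converts the test-set `eval_loss` reported in the full training example.

```python
import math

def perplexity_from_token_probs(probs):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    nll = [-math.log(p) for p in probs]
    return math.exp(sum(nll) / len(nll))

# A model that gives probability 0.25 to every correct token is as
# uncertain as a uniform choice among 4 tokens, so perplexity is about 4.
print(perplexity_from_token_probs([0.25, 0.25, 0.25]))

# The same conversion applied to a logged cross-entropy loss.
print(math.exp(4.238061428070068))
```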


In [None]:
def get_data_collator(tokenizer: AutoTokenizer):
    """
    Returns a data collator for language modeling.

    Args:
        tokenizer (AutoTokenizer): The tokenizer used.

    Returns:
        DataCollatorForLanguageModeling: The data collator.
    """
    return DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False
    )

data_collator = get_data_collator(tokenizer)


In [None]:
from transformers import EarlyStoppingCallback

def train_model(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    lm_datasets: DatasetDict,
    config: TrainingConfig
):
    """
    Trains the model with the given configuration.

    Args:
        model (AutoModelForCausalLM): The model to train.
        tokenizer (AutoTokenizer): The tokenizer.
        lm_datasets (DatasetDict): The tokenized datasets.
        config (TrainingConfig): The training configuration.

    Returns:
        Trainer: The trainer, holding the trained model and its log history.
    """
    training_args = TrainingArguments(
        output_dir=config.output_dir,
        overwrite_output_dir=True,
        num_train_epochs=config.num_train_epochs,
        per_device_train_batch_size=config.per_device_train_batch_size,
        per_device_eval_batch_size=config.per_device_eval_batch_size,
        evaluation_strategy=config.evaluation_strategy,
        learning_rate=config.learning_rate,
        save_steps=config.save_steps,
        eval_steps=config.eval_steps,
        logging_steps=config.logging_steps,
        seed=config.seed,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_datasets["train"],
        eval_dataset=lm_datasets["validation"],
        tokenizer=tokenizer,
        data_collator=get_data_collator(tokenizer),  # Do not rely on a global collator
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
    )

    trainer.train()
    logger.info("Training complete")
    return trainer

# Example usage will be in the full training section
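
The early-stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the patience logic, not the library's implementation (the real callback also supports an improvement threshold); the helper name and the loss values are made up.

```python
def stopping_evaluation(eval_losses, patience=3):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            # New best validation loss: reset the patience counter.
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

print(stopping_evaluation([4.2, 4.0, 3.9, 3.95, 3.91, 3.92, 3.8]))
# 5
```

Training stops at the sixth evaluation (index 5), so the later improvement to 3.8 is never reached; a larger patience trades compute for robustness to noisy validation loss.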


## Evaluation and Visualization

We evaluate the model before and after finetuning and visualize training progress.


In [None]:
def evaluate_model(
    trainer: Trainer,
    lm_datasets: DatasetDict,
    split: str = "test"
):
    """
    Evaluates the model on the specified dataset split.

    Args:
        trainer (Trainer): The trainer object.
        lm_datasets (DatasetDict): The tokenized datasets.
        split (str, optional): The dataset split to evaluate on.

    Returns:
        Dict[str, float]: Evaluation metrics.
    """
    eval_results = trainer.evaluate(eval_dataset=lm_datasets[split])
    logger.info(f"Evaluation results on {split} set: {eval_results}")
    return eval_results

# No separate test code needed


In [None]:
def plot_metrics(log_history: List[Dict[str, Any]]):
    """
    Plots training metrics from the trainer's log history.

    Args:
        log_history (List[Dict[str, Any]]): The trainer's log history.
    """
    steps = []
    losses = []
    eval_losses = []
    perplexities = []
    for log in log_history:
        if "loss" in log:
            steps.append(log["step"])
            losses.append(log["loss"])
        if "eval_loss" in log:
            eval_losses.append(log["eval_loss"])
        # The Trainer prefixes custom evaluation metrics with "eval_"
        if "eval_perplexity" in log:
            perplexities.append(log["eval_perplexity"])

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(steps, losses, label="Training Loss")
    plt.xlabel("Steps")
    plt.ylabel("Loss")
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(eval_losses, label="Validation Loss")
    plt.plot(perplexities, label="Perplexity")
    plt.xlabel("Evaluation Steps")
    plt.legend()
    plt.show()

# No separate test code needed
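
The log history mixes training and evaluation entries in one list. As a hedged sketch, the hypothetical helper below separates them and keeps the step for each, so that both curves could share one x-axis. The entries are invented but shaped like `trainer.state.log_history`, where custom evaluation metrics carry an `eval_` prefix.

```python
def split_log_history(log_history):
    """Separate training and evaluation entries, keeping the step for each."""
    train = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
    evals = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    return train, evals

# Hypothetical entries for illustration only.
history = [
    {"loss": 4.07, "learning_rate": 4.9e-05, "epoch": 0.01, "step": 100},
    {"loss": 3.95, "learning_rate": 4.8e-05, "epoch": 0.03, "step": 200},
    {"eval_loss": 3.90, "eval_perplexity": 49.4, "epoch": 0.07, "step": 500},
]
train_points, eval_points = split_log_history(history)
print(train_points)  # [(100, 4.07), (200, 3.95)]
print(eval_points)   # [(500, 3.9)]
```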


## Testing

We perform model inference before and after finetuning and analyze performance.


In [None]:
def generate_text(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    prompt: str,
    max_length: int = 50,
    num_return_sequences: int = 1
) -> List[str]:
    """
    Generates text using the model based on the prompt.

    Args:
        model (AutoModelForCausalLM): The language model.
        tokenizer (AutoTokenizer): The tokenizer.
        prompt (str): The text prompt.
        max_length (int, optional): Maximum total length in tokens (prompt plus generated text).
        num_return_sequences (int, optional): Number of sequences to generate.

    Returns:
        List[str]: Generated text sequences.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        do_sample=True,
        top_p=0.95,
        top_k=60
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
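
The `top_k` and `top_p` arguments restrict which tokens may be sampled at each step. The sketch below is a simplified pure-Python illustration on a toy distribution: the function name is hypothetical, and the library itself works on logits and differs in details.

```python
def filter_top_k_top_p(probs, top_k=60, top_p=0.95):
    """Return the renormalized distribution kept by top-k, then top-p, filtering."""
    # Rank token ids from most to least likely and keep at most top_k of them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        # Stop once the kept tokens cover at least top_p of the remaining mass.
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]
print(filter_top_k_top_p(toy_probs, top_k=4, top_p=0.85))
```

In this toy case top-k removes token 4 and top-p then removes token 3, so sampling happens among tokens 0, 1 and 2 only.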


In [None]:
# Example usage:
pre_finetune_output = generate_text(model, tokenizer, "Once upon a time")
print("Before finetuning:", pre_finetune_output)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Before finetuning: ['Once upon a time, the world had seen a fair, orderly and stable society, where everyone had a decent standard of living, the rich were cared for by their fellow citizens, and the poor themselves received adequate protection from cruel rulers.\n\n\nToday']


## Full Training Example

We bring everything together and run the full training pipeline.


In [None]:
# Load and preprocess data
datasets = load_and_split_dataset("wikitext", name="wikitext-2-raw-v1")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # For GPT-2
lm_datasets = preprocess_data(datasets, tokenizer)

# Load model
device = get_device()
model = get_model("gpt2", device)

# Training configuration
config = TrainingConfig(
    learning_rate=5e-5,
    num_train_epochs=1,  # For quick example, set to higher for real training
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    output_dir="./gpt2-lora-finetuned"
)

# Pre-finetuning evaluation
pre_eval_results = evaluate_model(
    Trainer(model=model, tokenizer=tokenizer),
    lm_datasets,
    split="test"
)

# Training
trainer = train_model(model, tokenizer, lm_datasets, config)

# 

INFO:__main__:Loaded dataset wikitext
INFO:__main__:Split dataset into train, validation, and test sets
INFO:__main__:Preprocessed datasets
INFO:__main__:Using device: cuda
INFO:__main__:Loaded model gpt2


  0%|          | 0/237 [00:00<?, ?it/s]

INFO:__main__:Evaluation results on test set: {'eval_loss': 4.238061428070068, 'eval_model_preparation_time': 0.0021, 'eval_runtime': 52.5668, 'eval_samples_per_second': 36.068, 'eval_steps_per_second': 4.509}


  0%|          | 0/7468 [00:00<?, ?it/s]

{'loss': 4.074, 'grad_norm': 14.423720359802246, 'learning_rate': 4.933047670058918e-05, 'epoch': 0.01}


KeyboardInterrupt: 
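
The learning rate in the log line above is consistent with a linear decay from 5e-5 to zero over the 7468 training steps with no warmup, which is assumed here to be the schedule in effect. A minimal sketch of that schedule, with a hypothetical helper name:

```python
def linear_decay_lr(base_lr, step, total_steps, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Step 100 is where the first training log entry was written.
print(linear_decay_lr(5e-5, step=100, total_steps=7468))
```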

In [None]:
# Post-finetuning evaluation
post_eval_results = evaluate_model(trainer, lm_datasets, split="test")

# Visualization
plot_metrics(trainer.state.log_history)

In [None]:
# Generate text after finetuning
post_finetune_output = generate_text(model, tokenizer, "Once upon a time")
print("After finetuning:", post_finetune_output)
