# Modal Fine-Tuning Script for DeBERTa-v3-large with DoRA

Fine-tunes microsoft/deberta-v3-large for PII Named Entity Recognition using Weight-Decomposed Low-Rank Adaptation
(DoRA) on an H100 GPU.

Dataset: Ari-S-123/better-english-pii-anonymizer
Base Model: microsoft/deberta-v3-large
Adapter Method: DoRA (via PEFT library)


## Set up Modal


In [38]:
%pip install modal

Note: you may need to restart the kernel to use updated packages.


In [39]:
!python -m modal setup

The web browser should have opened for you to authenticate and get an API 
token.
If it didn't, please copy this URL into your web browser manually:

https://modal.com/token-flow/tf-Z8JxPJBG8wQFuCqXReNSQa

## Imports and Modal App Configuration


In [40]:
import modal

APP_NAME: str = "pii-deberta-dora-finetune"

# Define the Modal App
app = modal.App(name=APP_NAME)

# HuggingFace Hub configuration
# Create the secret first by running in a terminal:
#   modal secret create huggingface HF_TOKEN=hf_your_token_here
HF_SECRET = modal.Secret.from_name("huggingface")

# Create a persistent volume to store checkpoints and final model
# This persists between runs so you don't lose progress if something fails
volume = modal.Volume.from_name("pii-model-checkpoints", create_if_missing=True)
VOLUME_MOUNT_PATH: str = "/checkpoints"

print("✓ Modal app configured")
print(f"  App name: {APP_NAME}")
print(f"  Volume mount: {VOLUME_MOUNT_PATH}")

✓ Modal app configured
  App name: pii-deberta-dora-finetune
  Volume mount: /checkpoints


## Docker Image Definition

Build a custom image with all required dependencies for DoRA fine-tuning. The image is cached, so subsequent runs are
fast.


In [41]:
training_image = (
    modal.Image.debian_slim(python_version="3.13")
    # Install system dependencies
    .apt_install("git", "curl", "build-essential")
    # Install PyTorch with CUDA 12.1 support (H100 compatible)
    .pip_install(
        "torch>=2.4.0",
        "torchvision",
        "torchaudio",
        extra_index_url="https://download.pytorch.org/whl/cu121",
    )
    # Install core ML dependencies
    # Let pip resolve compatible versions by using minimum version constraints
    # rather than exact pins, which cause conflicts
    .pip_install(
        # Transformers ecosystem - use compatible ranges
        "transformers>=4.46.0",
        "peft>=0.14.0",
        "accelerate>=1.0.0",
        "datasets>=3.0.0",
        "huggingface_hub>=0.26.0",  # Don't pin exact version - let pip resolve
        # Training utilities
        "scikit-learn>=1.5.0",
        "seqeval>=1.2.2",
        "numpy>=1.26.0",
        "tqdm>=4.66.0",
        # Sentencepiece for DeBERTa tokenizer
        "sentencepiece>=0.2.0",
        "protobuf>=4.25.0",
        # BitsAndBytes for potential quantization
        "bitsandbytes>=0.44.0",
        # TensorBoard for training logs (report_to=["tensorboard"])
        "tensorboard>=2.15.0",
        gpu="H100",  # Build with H100 to ensure CUDA compatibility
    )
    # Set environment variables
    .env({
        "HF_HOME": "/root/.cache/huggingface",
        "TRANSFORMERS_CACHE": "/root/.cache/huggingface/hub",
        "TOKENIZERS_PARALLELISM": "false",
    })
)

print("✓ Training image defined")
print("  Base: debian_slim (Python 3.13)")
print("  PyTorch: >=2.4.0 (CUDA 12.1)")
print("  Key packages: transformers, peft, datasets, seqeval")

✓ Training image defined
  Base: debian_slim (Python 3.13)
  PyTorch: >=2.4.0 (CUDA 12.1)
  Key packages: transformers, peft, datasets, seqeval


## Training Configuration Dataclass


In [42]:
from dataclasses import dataclass, field


@dataclass
class TrainingConfig:
    """
    Configuration for DeBERTa-v3-large DoRA fine-tuning on PII detection.

    This configuration is optimized for an H100 GPU (80GB VRAM) and targets
    the specific modules recommended for DeBERTa attention layers.

    Attributes:
        model_name: HuggingFace model identifier for the base model.
        dataset_name: HuggingFace dataset identifier for training data.
        output_dir: Local directory for saving checkpoints during training.
        hub_model_id: HuggingFace Hub repository ID for pushing the final adapter.
        max_length: Maximum sequence length for tokenization (DeBERTa supports 512).
        learning_rate: AdamW learning rate. DoRA typically needs slightly lower LR.
        num_train_epochs: Maximum number of training epochs.
        per_device_train_batch_size: Batch size per GPU for training.
        per_device_eval_batch_size: Batch size per GPU for evaluation.
        gradient_accumulation_steps: Number of steps to accumulate gradients.
        warmup_ratio: Proportion of training steps for learning rate warmup.
        weight_decay: Decoupled weight decay coefficient applied by AdamW.
        lora_r: LoRA/DoRA rank (higher = more parameters, better capacity).
        lora_alpha: LoRA/DoRA scaling factor (typically 2x rank).
        lora_dropout: Dropout probability for LoRA layers.
        target_modules: DeBERTa attention modules to apply DoRA to.
        early_stopping_patience: Number of evaluations without improvement before stopping.
        eval_steps: Evaluate every N steps.
        save_steps: Save checkpoint every N steps.
        logging_steps: Log metrics every N steps.
        seed: Random seed for reproducibility.
        bf16: Use bfloat16 mixed precision (recommended for H100).
        dataloader_num_workers: Number of workers for data loading.
        push_to_hub: Whether to push the final model to HuggingFace Hub.
    """

    # Model and dataset
    model_name: str = "microsoft/deberta-v3-large"
    dataset_name: str = "Ari-S-123/better-english-pii-anonymizer"
    output_dir: str = "/checkpoints/pii-deberta-dora"
    hub_model_id: str = "Ari-S-123/deberta-v3-large-pii-dora"

    # Tokenization
    max_length: int = 512

    # Training hyperparameters
    learning_rate: float = 2e-5
    num_train_epochs: int = 5
    per_device_train_batch_size: int = 16
    per_device_eval_batch_size: int = 32
    gradient_accumulation_steps: int = 2
    warmup_ratio: float = 0.1
    weight_decay: float = 0.01

    # DoRA configuration (per PLAN.md recommendations)
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    target_modules: list[str] = field(default_factory=lambda: [
        "query_proj",
        "key_proj",
        "value_proj",
        "dense",
    ])

    # Early stopping and evaluation
    early_stopping_patience: int = 3
    eval_steps: int = 500
    save_steps: int = 500
    logging_steps: int = 100

    # Reproducibility and performance
    seed: int = 69
    bf16: bool = True  # H100 has excellent bf16 support
    dataloader_num_workers: int = 4

    # Hub configuration
    push_to_hub: bool = True


print("✓ TrainingConfig dataclass defined")

✓ TrainingConfig dataclass defined
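
Several of these defaults interact, and the arithmetic is easy to check by hand. The sketch below assumes a single GPU and a hypothetical 1024x1024 projection (1024 is the hidden size of DeBERTa-v3-large). The `adapter_params` helper is illustrative and not part of PEFT; it counts the two low-rank factors plus one DoRA magnitude value per output feature, which is an assumption about how the magnitude vector is stored.

```python
# Values copied from the TrainingConfig defaults above.
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
lora_r = 16
lora_alpha = 32

# Sequences contributing to each optimizer step on a single GPU.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# LoRA/DoRA scales the low-rank update B @ A by alpha / r.
scaling = lora_alpha / lora_r


def adapter_params(d_in: int, d_out: int, r: int) -> int:
    """Hypothetical helper: trainable parameters added to one linear layer.

    A is (r x d_in) and B is (d_out x r). DoRA adds a magnitude vector,
    assumed here to hold one value per output feature.
    """
    return r * d_in + d_out * r + d_out


print(effective_batch_size)  # 32
print(scaling)  # 2.0
print(adapter_params(1024, 1024, lora_r))  # 33792
```

Doubling `lora_r` while keeping `lora_alpha` fixed halves the scaling factor, which is why the docstring suggests keeping alpha at roughly twice the rank.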


## Label Mapping and Tokenization Functions

These functions are defined at module level so they can be serialized and sent to the remote Modal container.


In [43]:
from typing import Any


def build_label_vocabulary(dataset: Any) -> tuple[list[str], dict[str, int], dict[int, str]]:
    """
    Build a complete BIO label vocabulary from the dataset's privacy_mask annotations.

    Scans all unique entity labels in the dataset and creates a BIO-formatted
    label vocabulary with "O" (Outside) plus B-/I- prefixed entity labels.

    This function handles BOTH possible data formats:
        1. List of dicts: [{"label": "EMAIL", "start": 10, ...}, ...]
        2. Dict of lists: {"label": ["EMAIL", ...], "start": [10, ...], ...}

    Args:
        dataset: HuggingFace Dataset or DatasetDict containing privacy_mask field.

    Returns:
        Tuple containing:
            - label_list: Ordered list of all BIO labels (e.g., ["O", "B-EMAIL", "I-EMAIL", ...])
            - label_to_id: Mapping from label string to integer ID
            - id_to_label: Mapping from integer ID to label string
    """
    unique_labels: set[str] = set()

    # Handle both Dataset and DatasetDict
    splits = dataset.keys() if hasattr(dataset, "keys") else ["train"]

    for split in splits:
        split_data = dataset[split] if hasattr(dataset, "keys") else dataset
        for example in split_data:
            privacy_mask = example.get("privacy_mask", [])

            # Handle BOTH data formats:
            # Format 1: List of dicts - [{"label": "EMAIL", "start": 10, ...}, ...]
            # Format 2: Dict of lists - {"label": ["EMAIL", ...], "start": [10, ...], ...}
            if isinstance(privacy_mask, list):
                # Format 1: List of entity dictionaries
                for entity in privacy_mask:
                    if isinstance(entity, dict) and "label" in entity:
                        unique_labels.add(entity["label"])
            elif isinstance(privacy_mask, dict):
                # Format 2: Dict with parallel lists (HuggingFace columnar format)
                labels = privacy_mask.get("label", [])
                if isinstance(labels, list):
                    unique_labels.update(labels)
                elif isinstance(labels, str):
                    # Single entity case
                    unique_labels.add(labels)

    # Sort for deterministic ordering
    sorted_labels = sorted(unique_labels)

    # Build BIO vocabulary: O + B-X/I-X for each entity type
    label_list: list[str] = ["O"]
    for label in sorted_labels:
        label_list.append(f"B-{label}")
        label_list.append(f"I-{label}")

    label_to_id: dict[str, int] = {label: i for i, label in enumerate(label_list)}
    id_to_label: dict[int, str] = {i: label for i, label in enumerate(label_list)}

    return label_list, label_to_id, id_to_label


def tokenize_and_align_labels(
    examples: dict[str, Any],
    tokenizer: Any,
    label_to_id: dict[str, int],
    max_length: int = 512,
) -> dict[str, list]:
    """
    Tokenize text and align span annotations to subword token boundaries.

    This function handles the critical alignment between character-level entity
    spans (from privacy_mask) and subword token-level BIO labels required for
    DeBERTa's token classification head.

    Handles BOTH data formats:
        1. List of dicts: [{"label": "EMAIL", "start": 10, "end": 25}, ...]
        2. Dict of lists: {"label": ["EMAIL", ...], "start": [10, ...], "end": [25, ...]}

    The alignment strategy:
        1. Tokenize with offset_mapping to get character→token correspondence
        2. For each entity span, find overlapping tokens
        3. Assign B- to first overlapping token, I- to subsequent tokens
        4. Special tokens ([CLS], [SEP], [PAD]) get label -100 (ignored in loss)

    Args:
        examples: Batch of examples with "source_text" and "privacy_mask" fields.
        tokenizer: HuggingFace tokenizer instance (DeBERTa tokenizer).
        label_to_id: Mapping from BIO label strings to integer IDs.
        max_length: Maximum sequence length for truncation.

    Returns:
        Dictionary with keys:
            - input_ids: Tokenized input sequences
            - attention_mask: Attention masks (1 for real tokens, 0 for padding)
            - labels: Aligned BIO label IDs (-100 for special tokens)
    """
    # Tokenize WITHOUT padding; DataCollatorForTokenClassification pads each
    # batch dynamically at training time.
    tokenized = tokenizer(
        examples["source_text"],
        truncation=True,
        max_length=max_length,
        return_offsets_mapping=True,  # Character offsets for span alignment
        return_attention_mask=True,
    )

    all_labels: list[list[int]] = []

    for batch_idx, offset_mapping in enumerate(tokenized["offset_mapping"]):
        # Initialize all labels as "O" (Outside)
        labels: list[int] = [label_to_id["O"]] * len(offset_mapping)

        # Get entity spans for this example
        privacy_mask = examples["privacy_mask"][batch_idx]

        # Normalize to a consistent format: list of (label, start, end) tuples
        entities: list[tuple[str, int, int]] = []

        if isinstance(privacy_mask, list):
            # Format 1: List of entity dictionaries
            # [{"label": "EMAIL", "start": 10, "end": 25, ...}, ...]
            for entity in privacy_mask:
                if isinstance(entity, dict):
                    ent_label = entity.get("label", "")
                    ent_start = entity.get("start", 0)
                    ent_end = entity.get("end", 0)
                    if ent_label and ent_end > ent_start:
                        entities.append((ent_label, ent_start, ent_end))

        elif isinstance(privacy_mask, dict):
            # Format 2: Dict with parallel lists (HuggingFace columnar format)
            # {"label": ["EMAIL", "PHONE"], "start": [10, 45], "end": [25, 57], ...}
            entity_labels = privacy_mask.get("label", [])
            entity_starts = privacy_mask.get("start", [])
            entity_ends = privacy_mask.get("end", [])

            # Ensure all are lists (might be single values)
            if not isinstance(entity_labels, list):
                entity_labels = [entity_labels]
            if not isinstance(entity_starts, list):
                entity_starts = [entity_starts]
            if not isinstance(entity_ends, list):
                entity_ends = [entity_ends]

            for ent_label, ent_start, ent_end in zip(entity_labels, entity_starts, entity_ends):
                if ent_label and ent_end > ent_start:
                    entities.append((ent_label, ent_start, ent_end))

        # Mask special tokens first (offset is (0, 0) for [CLS], [SEP], [PAD]).
        # Doing this outside the entity loop also covers examples with no entities.
        for token_idx, (tok_start, tok_end) in enumerate(offset_mapping):
            if tok_start == tok_end == 0:
                labels[token_idx] = -100  # Ignored in loss calculation

        # Process each entity span and align to tokens
        for ent_label, ent_start, ent_end in entities:
            is_first_token = True

            for token_idx, (tok_start, tok_end) in enumerate(offset_mapping):
                # Skip special tokens (already masked above)
                if tok_start == tok_end == 0:
                    continue

                # Check if this token overlaps with the entity span
                if tok_start < ent_end and tok_end > ent_start:
                    bio_label = f"B-{ent_label}" if is_first_token else f"I-{ent_label}"

                    # Only assign if the label exists in our vocabulary
                    if bio_label in label_to_id:
                        labels[token_idx] = label_to_id[bio_label]
                        is_first_token = False

        all_labels.append(labels)

    # Remove offset_mapping from output (not needed for training)
    tokenized.pop("offset_mapping")
    tokenized["labels"] = all_labels

    return tokenized


print("✓ Helper functions defined: build_label_vocabulary, tokenize_and_align_labels")
print("  Now handles both list-of-dicts and dict-of-lists formats for privacy_mask")

✓ Helper functions defined: build_label_vocabulary, tokenize_and_align_labels
  Now handles both list-of-dicts and dict-of-lists formats for privacy_mask
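
To make the alignment rule concrete, here is a minimal sketch that applies the same overlap test to hand-written offsets. The whitespace-based `toy_offsets` helper is a hypothetical stand-in for the tokenizer's `offset_mapping`, used for illustration only; a real DeBERTa tokenizer splits text into subwords and adds special tokens.

```python
import re


def toy_offsets(text: str) -> list[tuple[int, int]]:
    """Hypothetical stand-in for offset_mapping: one (start, end) per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bio_tags(offsets, entities):
    """Tag each token O / B-X / I-X using the same overlap test as above."""
    tags = ["O"] * len(offsets)
    for label, ent_start, ent_end in entities:
        is_first = True
        for i, (tok_start, tok_end) in enumerate(offsets):
            # A token belongs to the entity if the two half-open ranges overlap.
            if tok_start < ent_end and tok_end > ent_start:
                tags[i] = f"B-{label}" if is_first else f"I-{label}"
                is_first = False
    return tags


text = "Call Jane Doe at 555-0100"
# Character spans: "Jane Doe" is [5, 13), "555-0100" is [17, 25).
entities = [("NAME", 5, 13), ("PHONE", 17, 25)]

offsets = toy_offsets(text)
tags = bio_tags(offsets, entities)
for (start, end), tag in zip(offsets, tags):
    print(f"{text[start:end]:10s} {tag}")
```

Here "Jane" receives `B-NAME`, "Doe" receives `I-NAME`, and the phone number receives `B-PHONE`; everything else stays `O`.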


## Metrics Computation Function


In [44]:
from collections.abc import Callable
from typing import Any

import numpy as np


def create_compute_metrics(id_to_label: dict[int, str]) -> Callable[[Any], dict[str, float]]:
    """
    Create a metrics computation function for NER evaluation.

    Uses seqeval library for proper NER metrics (precision, recall, F1) that
    account for entity boundaries, not just individual token labels.

    Args:
        id_to_label: Mapping from integer label IDs to BIO label strings.

    Returns:
        Callable that computes metrics from EvalPrediction object.
    """
    from seqeval.metrics import (
        classification_report,
        f1_score,
        precision_score,
        recall_score,
    )

    def compute_metrics(eval_pred: Any) -> dict[str, float]:
        """
        Compute NER evaluation metrics from model predictions.

        Args:
            eval_pred: EvalPrediction object with predictions and label_ids.

        Returns:
            Dictionary containing precision, recall, and F1 scores.
        """
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=2)

        # Convert IDs to label strings, filtering out special tokens (-100)
        true_predictions: list[list[str]] = []
        true_labels: list[list[str]] = []

        for prediction, label in zip(predictions, labels):
            pred_tags: list[str] = []
            true_tags: list[str] = []

            for pred_id, label_id in zip(prediction, label):
                # Skip special tokens (label_id == -100)
                if label_id == -100:
                    continue

                pred_tags.append(id_to_label.get(pred_id, "O"))
                true_tags.append(id_to_label.get(label_id, "O"))

            true_predictions.append(pred_tags)
            true_labels.append(true_tags)

        # Compute seqeval metrics (entity-level, not token-level)
        metrics = {
            "precision": precision_score(true_labels, true_predictions),
            "recall": recall_score(true_labels, true_predictions),
            "f1": f1_score(true_labels, true_predictions),
        }

        # Log detailed classification report to console
        print("\n" + "=" * 60)
        print("ENTITY-LEVEL CLASSIFICATION REPORT")
        print("=" * 60)
        print(classification_report(true_labels, true_predictions))
        print("=" * 60 + "\n")

        return metrics

    return compute_metrics


print("✓ Metrics function defined: create_compute_metrics")

✓ Metrics function defined: create_compute_metrics
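
Entity-level scoring is stricter than token accuracy: a predicted entity counts only if its type and both boundaries match. The sketch below is a simplified pure-Python illustration of that idea, not seqeval's implementation. `extract_entities` and `entity_f1` are hypothetical helpers, and edge cases such as an `I-` tag with no preceding `B-` are handled more simply here than in seqeval.

```python
def extract_entities(tags: list[str]) -> set[tuple[str, int, int]]:
    """Collect (type, start, end) spans from one BIO-tagged sequence."""
    entities = set()
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            entities.add((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities


def entity_f1(true_tags: list[str], pred_tags: list[str]) -> float:
    """F1 over exact-match entity spans for a single sequence."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)


true_tags = ["O", "B-NAME", "I-NAME", "O", "B-PHONE"]
pred_tags = ["O", "B-NAME", "O", "O", "B-PHONE"]  # NAME span cut short

token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print(token_accuracy)  # 0.8
print(entity_f1(true_tags, pred_tags))  # 0.5
```

A single wrong token drops entity-level F1 to 0.5 while token accuracy stays at 0.8, which is the reason `metric_for_best_model` is better served by an entity-level score for PII detection.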


## Main Training Function (Modal Remote Execution)


In [45]:
@app.function(
    image=training_image,
    gpu="H100",
    timeout=60 * 60 * 6,  # 6 hour timeout for full training
    secrets=[HF_SECRET],
    volumes={VOLUME_MOUNT_PATH: volume},
    memory=65536,  # 64GB RAM
)
def train_deberta_dora(config_dict: dict[str, Any] | None = None) -> dict[str, Any]:
    """
    Fine-tune DeBERTa-v3-large with DoRA for PII Named Entity Recognition.

    This function runs REMOTELY on a Modal H100 GPU and performs:
        1. Dataset loading from HuggingFace Hub
        2. Tokenization and label alignment
        3. DoRA adapter initialization on attention modules
        4. Training with early stopping
        5. Final model saving and Hub upload

    Args:
        config_dict: Training configuration as a dictionary (dataclasses don't
                     serialize well across Modal's boundary). Uses defaults if None.

    Returns:
        Dictionary containing training metrics and final model location.
    """
    import os
    import torch
    from datasets import load_dataset
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        DataCollatorForTokenClassification,
        EarlyStoppingCallback,
        Trainer,
        TrainingArguments,
    )
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
    )
    from huggingface_hub import login

    # =========================================================================
    # Reconstruct config from dict (dataclasses don't serialize across Modal)
    # =========================================================================
    if config_dict is None:
        config_dict = {}

    # Default values
    model_name = config_dict.get("model_name", "microsoft/deberta-v3-large")
    dataset_name = config_dict.get("dataset_name", "Ari-S-123/better-english-pii-anonymizer")
    output_dir = config_dict.get("output_dir", "/checkpoints/pii-deberta-dora")
    hub_model_id = config_dict.get("hub_model_id", "Ari-S-123/deberta-v3-large-pii-dora")
    max_length = config_dict.get("max_length", 512)
    learning_rate = config_dict.get("learning_rate", 2e-5)
    num_train_epochs = config_dict.get("num_train_epochs", 5)
    per_device_train_batch_size = config_dict.get("per_device_train_batch_size", 16)
    per_device_eval_batch_size = config_dict.get("per_device_eval_batch_size", 32)
    gradient_accumulation_steps = config_dict.get("gradient_accumulation_steps", 2)
    warmup_ratio = config_dict.get("warmup_ratio", 0.1)
    weight_decay = config_dict.get("weight_decay", 0.01)
    lora_r = config_dict.get("lora_r", 16)
    lora_alpha = config_dict.get("lora_alpha", 32)
    lora_dropout = config_dict.get("lora_dropout", 0.05)
    target_modules = config_dict.get("target_modules", ["query_proj", "key_proj", "value_proj", "dense"])
    early_stopping_patience = config_dict.get("early_stopping_patience", 3)
    eval_steps = config_dict.get("eval_steps", 500)
    save_steps = config_dict.get("save_steps", 500)
    logging_steps = config_dict.get("logging_steps", 100)
    seed = config_dict.get("seed", 69)  # Match the TrainingConfig default
    bf16 = config_dict.get("bf16", True)
    dataloader_num_workers = config_dict.get("dataloader_num_workers", 4)
    push_to_hub = config_dict.get("push_to_hub", True)

    # Authenticate with HuggingFace Hub
    hf_token = os.environ.get("HF_TOKEN")
    if hf_token:
        login(token=hf_token)
        print("✓ Authenticated with HuggingFace Hub")
    else:
        print("⚠ No HF_TOKEN found. Hub operations may fail.")

    # Set seed for reproducibility
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

    print("\n" + "=" * 70)
    print("DeBERTa-v3-large DoRA Fine-Tuning for PII Detection")
    print("=" * 70)
    print(f"Model: {model_name}")
    print(f"Dataset: {dataset_name}")
    print(f"DoRA Rank: {lora_r}, Alpha: {lora_alpha}")
    print(f"Target Modules: {target_modules}")
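    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        # Guard the VRAM query, which raises when no CUDA device is present
        print("GPU: CPU (no CUDA device available)")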
    print("=" * 70 + "\n")

    # =========================================================================
    # STEP 1: Load Dataset
    # =========================================================================
    print("📦 Loading dataset from HuggingFace Hub...")
    dataset = load_dataset(dataset_name)
    print(f"   Train samples: {len(dataset['train']):,}")
    print(f"   Test samples: {len(dataset['test']):,}")

    # =========================================================================
    # STEP 2: Build Label Vocabulary
    # =========================================================================
    print("\n🏷️  Building label vocabulary...")
    label_list, label_to_id, id_to_label = build_label_vocabulary(dataset)
    num_labels = len(label_list)
    print(f"   Total BIO labels: {num_labels}")
    print(f"   Entity types: {(num_labels - 1) // 2}")

    # =========================================================================
    # STEP 3: Load Tokenizer
    # =========================================================================
    print(f"\n📝 Loading tokenizer: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        add_prefix_space=True,  # Important for consistent tokenization
    )
    print(f"   Vocab size: {tokenizer.vocab_size:,}")

    # =========================================================================
    # STEP 4: Tokenize Dataset
    # =========================================================================
    print("\n⚙️  Tokenizing and aligning labels...")

    def tokenize_fn(examples):
        return tokenize_and_align_labels(
            examples,
            tokenizer=tokenizer,
            label_to_id=label_to_id,
            max_length=max_length,
        )

    # NOTE: We intentionally do NOT use num_proc here.
    # Multiprocessing requires pickling the tokenize_fn closure, which fails
    # because Modal's async runtime or notebook event loops create unpicklable
    # asyncio.Task objects that get captured in the closure chain.
    # Single-process tokenization is still fast (~2-5 min for 125k examples).
    tokenized_dataset = dataset.map(
        tokenize_fn,
        batched=True,
        remove_columns=dataset["train"].column_names,
        desc="Tokenizing",
    )

    print(f"   Train examples: {len(tokenized_dataset['train']):,}")
    print(f"   Test examples: {len(tokenized_dataset['test']):,}")

    # =========================================================================
    # STEP 5: Load Base Model
    # =========================================================================
    print(f"\n🤖 Loading base model: {model_name}")
    model = AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        id2label=id_to_label,
        label2id=label_to_id,
        dtype=torch.bfloat16 if bf16 else torch.float32,
    )
    print(f"   Model parameters: {model.num_parameters():,}")

    # =========================================================================
    # STEP 6: Configure DoRA Adapter
    # =========================================================================
    print("\n🔧 Configuring DoRA adapter...")
    peft_config = LoraConfig(
        task_type=TaskType.TOKEN_CLS,
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=target_modules,
        use_dora=True,  # Enable Weight-Decomposed Low-Rank Adaptation
        bias="none",
        inference_mode=False,
    )

    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    
    print("\n⚡ Compiling model with torch.compile()...")
    model = torch.compile(model)

    # =========================================================================
    # STEP 7: Setup Training Arguments
    # =========================================================================
    print("\n📋 Configuring training arguments...")

    training_args = TrainingArguments(
        output_dir=output_dir,
        # Training hyperparameters
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_ratio=warmup_ratio,
        weight_decay=weight_decay,
        # Precision
        bf16=bf16,
        # Evaluation and saving
        eval_strategy="steps",
        eval_steps=eval_steps,
        save_strategy="steps",
        save_steps=save_steps,
        save_total_limit=3,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        # Logging
        logging_dir=f"{output_dir}/logs",
        logging_steps=logging_steps,
        report_to=["tensorboard"],
        # Performance
        dataloader_num_workers=dataloader_num_workers,
        dataloader_pin_memory=True,
        # Hub (adapter will be pushed separately)
        push_to_hub=False,
        # Reproducibility
        seed=seed,
        data_seed=seed,
        # Don't auto-remove columns (torch.compile breaks signature detection)
        remove_unused_columns=False
    )

    # =========================================================================
    # STEP 8: Initialize Trainer
    # =========================================================================
    print("\n🏋️ Initializing Trainer...")

    data_collator = DataCollatorForTokenClassification(
        tokenizer=tokenizer,
        padding=True,  # Dynamic padding per batch
        label_pad_token_id=-100,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["test"],
        processing_class=tokenizer,
        data_collator=data_collator,
        compute_metrics=create_compute_metrics(id_to_label),
        callbacks=[
            EarlyStoppingCallback(
                early_stopping_patience=early_stopping_patience,
            ),
        ],
    )

    # =========================================================================
    # STEP 9: Train!
    # =========================================================================
    print("\n" + "=" * 70)
    print("🚀 STARTING TRAINING")
    print("=" * 70 + "\n")

    train_result = trainer.train()

    print("\n" + "=" * 70)
    print("✓ TRAINING COMPLETE")
    print("=" * 70)
    print(f"   Total steps: {train_result.global_step}")
    print(f"   Training loss: {train_result.training_loss:.4f}")

    # =========================================================================
    # STEP 10: Final Evaluation
    # =========================================================================
    print("\n📊 Running final evaluation...")
    eval_results = trainer.evaluate()
    # Fall back to NaN so the :.4f format spec cannot fail on a missing metric
    print(f"   Final F1: {eval_results.get('eval_f1', float('nan')):.4f}")
    print(f"   Final Precision: {eval_results.get('eval_precision', float('nan')):.4f}")
    print(f"   Final Recall: {eval_results.get('eval_recall', float('nan')):.4f}")

    # =========================================================================
    # STEP 11: Save Model Locally
    # =========================================================================
    final_output_dir = f"{output_dir}/final"
    print(f"\n💾 Saving model to {final_output_dir}...")

    # Save adapter weights
    model.save_pretrained(final_output_dir)
    tokenizer.save_pretrained(final_output_dir)

    # Commit volume to persist checkpoint
    volume.commit()
    print("   ✓ Checkpoint persisted to Modal volume")

    # =========================================================================
    # STEP 12: Push to HuggingFace Hub
    # =========================================================================
    if push_to_hub and hf_token:
        print(f"\n☁️  Pushing adapter to HuggingFace Hub: {hub_model_id}")
        # `token` replaces the deprecated `use_auth_token` argument
        model.push_to_hub(
            hub_model_id,
            token=hf_token,
            commit_message="DoRA fine-tuned DeBERTa-v3-large for PII detection",
        )
        tokenizer.push_to_hub(
            hub_model_id,
            token=hf_token,
        )
        print(f"   ✓ Adapter available at: https://huggingface.co/{hub_model_id}")

    # =========================================================================
    # STEP 13: Return Results
    # =========================================================================
    results = {
        "training_loss": train_result.training_loss,
        "global_step": train_result.global_step,
        "eval_f1": eval_results.get("eval_f1"),
        "eval_precision": eval_results.get("eval_precision"),
        "eval_recall": eval_results.get("eval_recall"),
        "model_path": final_output_dir,
        "hub_model_id": hub_model_id if push_to_hub else None,
        "num_labels": num_labels,
        "trainable_params": sum(p.numel() for p in model.parameters() if p.requires_grad),
    }

    print("\n" + "=" * 70)
    print("🎉 FINE-TUNING PIPELINE COMPLETE")
    print("=" * 70)
    for key, value in results.items():
        print(f"   {key}: {value}")
    print("=" * 70 + "\n")

    return results


print("✓ Remote training function defined: train_deberta_dora")
print("  GPU: H100")
print("  Timeout: 6 hours")
print("  Memory: 64GB")

✓ Remote training function defined: train_deberta_dora
  GPU: H100
  Timeout: 6 hours
  Memory: 64GB


## Run Training

This cell dispatches the training job to Modal's cloud infrastructure. The local machine only orchestrates; all heavy
computation happens remotely.

In a notebook, the .remote() call must be wrapped inside app.run(). This tells Modal to start the app, hydrate the
function metadata, and establish the connection to Modal's infrastructure.
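
The cell below also converts the `TrainingConfig` dataclass to a plain dict with `asdict()` before calling `.remote()`. The following is a minimal sketch of that round trip, assuming a hypothetical `DemoConfig` stand-in and a hypothetical `remote_side` helper; how the real remote function consumes the dict is not shown here, and rebuilding the dataclass is only one option.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DemoConfig:
    """Hypothetical stand-in for the notebook's TrainingConfig."""
    model_name: str = "microsoft/deberta-v3-large"
    learning_rate: float = 2e-5
    target_modules: List[str] = field(
        default_factory=lambda: ["query_proj", "value_proj"]
    )


def remote_side(config_dict: dict) -> DemoConfig:
    """Rebuild the typed config from the plain dict on the receiving side."""
    return DemoConfig(**config_dict)


config = DemoConfig(learning_rate=3e-5)
payload = asdict(config)         # plain dict containing only built-in types
restored = remote_side(payload)  # typed object again

print(payload)
print(restored == config)
```

Sending a dict of built-in types keeps the payload independent of where the dataclass itself is defined, which is convenient when the class lives in a notebook cell.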


In [37]:
from dataclasses import asdict

# Create your training configuration
# Modify these values as needed before running
config = TrainingConfig(
    # Model settings
    model_name="microsoft/deberta-v3-large",
    dataset_name="Ari-S-123/better-english-pii-anonymizer",
    hub_model_id="Ari-S-123/deberta-v3-large-pii-dora",
    
    # Training hyperparameters
    learning_rate=2e-5,
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    gradient_accumulation_steps=1,
    
    # DoRA settings (per PLAN.md)
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_proj", "key_proj", "value_proj", "dense"],
    
    # Early stopping
    early_stopping_patience=3,
    
    # Hub upload
    push_to_hub=True,
)

print("=" * 70)
print("🚀 DISPATCHING TRAINING TO MODAL H100 GPU")
print("=" * 70)
print(f"Model: {config.model_name}")
print(f"Dataset: {config.dataset_name}")
print(f"DoRA Rank: {config.lora_r}, Alpha: {config.lora_alpha}")
print(f"Learning Rate: {config.learning_rate}")
print(f"Epochs: {config.num_train_epochs}")
print(f"Batch Size: {config.per_device_train_batch_size} (effective: {config.per_device_train_batch_size * config.gradient_accumulation_steps})")
print("=" * 70)
print("\nThis will:")
print("  1. Build the Docker image (cached after first run)")
print("  2. Spin up an H100 GPU on Modal")
print("  3. Download dataset and model from HuggingFace")
print(f"  4. Train with DoRA for up to {config.num_train_epochs} epochs (early stopping enabled)")
print("  5. Push final adapter to HuggingFace Hub")
print("\nLogs will stream below. This may take 1-3 hours depending on dataset size.")
print("=" * 70 + "\n")

# Convert dataclass to dict for serialization across Modal boundary
config_dict = asdict(config)

# =============================================================================
# THE KEY FIX: Wrap .remote() calls inside app.run() context manager
# =============================================================================
# When running Modal from a notebook (not via `modal run`), you must explicitly
# "run" the app. The context manager:
#   1. Connects to Modal's API
#   2. Registers all functions and their metadata (hydration)
#   3. Builds/fetches the Docker image if needed
#   4. Keeps the connection alive while your code runs
#
# modal.enable_output() ensures logs stream back to your notebook in real-time.

with modal.enable_output():
    with app.run():
        results = train_deberta_dora.remote(config_dict)

# Display final results (this runs after training completes)
print("\n" + "=" * 70)
print("✅ TRAINING COMPLETE - RESULTS")
print("=" * 70)
for key, value in results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")
print("=" * 70)

🚀 DISPATCHING TRAINING TO MODAL H100 GPU
Model: microsoft/deberta-v3-large
Dataset: Ari-S-123/better-english-pii-anonymizer
DoRA Rank: 16, Alpha: 32
Learning Rate: 2e-05
Epochs: 5
Batch Size: 32 (effective: 32)

This will:
  1. Build the Docker image (cached after first run)
  2. Spin up an H100 GPU on Modal
  3. Download dataset and model from HuggingFace
  4. Train with DoRA for up to 5 epochs (early stopping enabled)
  5. Push final adapter to HuggingFace Hub

Logs will stream below. This may take 1-3 hours depending on dataset size.

✓ Initialized. View run at
https://modal.com/apps/ari-s-123/main/ap-yH196O2vUMuQhc08M61ufk
✓ Created objects.
└── 🔨 Created function train_deberta_dora.

ValueError: No columns in the dataset match the model's forward method signature: (args, kwargs, label, label_ids). The following columns have been ignored: [labels, attention_mask, input_ids, token_type_ids]. Please check the dataset and model. You may need to set `remove_unused_columns=False` in `TrainingArguments`.
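
The signature reported in the error, `(args, kwargs, label, label_ids)`, suggests that the object handed to the trainer exposes a generic `forward(*args, **kwargs)`, so none of the dataset column names can be matched against it. The sketch below is a simplified imitation of signature-based column filtering using only the standard library; it is not the library's actual implementation, and `TokenClassifier`, `GenericWrapper`, and `kept_columns` are hypothetical names.

```python
import inspect


class TokenClassifier:
    """Hypothetical model whose forward() names its inputs explicitly."""

    def forward(self, input_ids=None, attention_mask=None,
                token_type_ids=None, labels=None):
        return None


class GenericWrapper:
    """Hypothetical wrapper that passes everything through *args/**kwargs."""

    def __init__(self, inner):
        self.inner = inner

    def forward(self, *args, **kwargs):
        return self.inner.forward(*args, **kwargs)


def kept_columns(model, dataset_columns):
    """Keep only columns whose names appear in forward()'s signature.

    "label" and "label_ids" are always accepted, mirroring the names
    listed in the error message above.
    """
    accepted = set(inspect.signature(model.forward).parameters)
    accepted |= {"label", "label_ids"}
    return [col for col in dataset_columns if col in accepted]


columns = ["labels", "attention_mask", "input_ids", "token_type_ids"]
plain = TokenClassifier()
wrapped = GenericWrapper(plain)

print(kept_columns(plain, columns))    # every column matches a named parameter
print(kept_columns(wrapped, columns))  # nothing matches "args" or "kwargs"
```

Two remedies are worth checking against the training function defined earlier: setting `remove_unused_columns=False` in `TrainingArguments`, as the message itself suggests, or making sure the PEFT config declares a token-classification task type so that the wrapped model may expose a named `forward` signature. Which one applies depends on how the adapter is configured.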

## How to Use the Trained Model

After training completes, use the adapter like this:

```python
from peft import PeftModel
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

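# Load base model and apply your DoRA adapter.
# The classification head must be sized like the one used in training: pass
# num_labels (and id2label/label2id) matching the training run if the default differs.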
base_model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-large"
)
model = PeftModel.from_pretrained(
    base_model,
    "Ari-S-123/deberta-v3-large-pii-dora"  # Your trained adapter
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ari-S-123/deberta-v3-large-pii-dora"
)

# Create NER pipeline for easy inference
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Test it!
text = "Contact John Smith at john.smith@email.com or call 555-123-4567"
entities = ner(text)

for ent in entities:
    print(f"  {ent['entity_group']}: '{ent['word']}' (confidence: {ent['score']:.2%})")
```
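
Since the goal is PII anonymization, a natural next step is to replace the detected spans in the text. The following is a minimal sketch, assuming each entity dict carries `entity_group`, `start`, and `end` keys (character offsets), as aggregated NER pipeline results do. The `redact` and `span` helpers and the example labels are hypothetical; the entities here are hand-written, not model output.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip spans that overlap one already redacted
        pieces.append(text[cursor:ent["start"]])
        pieces.append(f"[{ent['entity_group']}]")
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Contact John Smith at john.smith@email.com or call 555-123-4567"


def span(label, value):
    """Build a hand-written entity dict by locating value inside text."""
    start = text.index(value)
    return {"entity_group": label, "start": start, "end": start + len(value)}


# Hypothetical labels for illustration; real labels come from the dataset.
entities = [
    span("NAME", "John Smith"),
    span("EMAIL", "john.smith@email.com"),
    span("PHONE", "555-123-4567"),
]

print(redact(text, entities))
```

With real predictions, the list returned by `ner(text)` can be passed to `redact` directly in place of the hand-written `entities`.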

The model will be available at: https://huggingface.co/Ari-S-123/deberta-v3-large-pii-dora
