# NeMo AutoModel LoRA Fine-tuning — End-to-End, Config‑Driven Template

This notebook is a practical, reproducible playbook for fine-tuning any Hugging Face causal LLM with NeMo 2.0 using a single, config‑driven workflow. It uses the modular training code under this folder (`config.py`, `data_modules.py`, `recipe_factory.py`, `executors.py`, `train.py`) to:
- Build a complete TrainingConfig in Python or from YAML/JSON
- Validate the setup with a dry‑run before launching
- Run locally or on SLURM (same config) via NeMo‑Run executors
- Fine‑tune with LoRA by default, or switch to full fine‑tuning with a single flag
- Work with HF datasets or local JSON/JSONL conversation files, including tool‑use metadata

## What you will do here
- Configure model/data/optimizer/trainer/compute with `TrainingConfig`
- Inspect and adapt conversation formatting (incl. tool use) in `data_modules.py`
- Create LoRA or full‑FT recipes with `recipe_factory.py`
- Choose local or SLURM execution with `executors.py`
- Dry‑run to validate, then launch training and store checkpoints
- Optionally save/load configs to YAML/JSON for reproducibility

## Requirements
- NVIDIA NeMo container (e.g., `nvcr.io/nvidia/nemo:25.07`)
- GPU access and (optionally) SLURM credentials
- Hugging Face token if your model is gated

> Tip: Run cells top‑to‑bottom. Cells with a `DO_*` flag are safe toggles to enable/disable heavier actions like training or SLURM runs.


In [None]:
# Environment checks and imports
DO_IMPORTS = True  # set False to skip if running in a restricted env

if DO_IMPORTS:
    import os, sys
    from pathlib import Path

    try:
        PROJECT_ROOT = Path(__file__).resolve().parent
    except NameError:
        PROJECT_ROOT = Path.cwd()
    print("Project folder:", PROJECT_ROOT)

    # Ensure local imports work
    if str(PROJECT_ROOT) not in sys.path:
        sys.path.append(str(PROJECT_ROOT))

    # Quick dependency check
    try:
        import nemo_run as run  # Provided by the NeMo run utilities
        from nemo.collections import llm
        print("nemo_run and nemo imported OK")
    except Exception as e:
        print("Warning: nemo_run/nemo import issue:", e)

    # Local modules
    try:
        import config as cfg
        import executors
        import recipe_factory
        import data_modules
        import train as train_script
        print("Local modules imported OK")
    except Exception as e:
        print("Local import issue:", e)


Project folder: /lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_mlops/soverign_ai/NeMo/tutorials/llm/automodel/train
nemo_run and nemo imported OK
Local modules imported OK


## Configuration: Production-ready, flexible by design

Use `config.py` as the single source of truth for training. You can:
- Load from YAML/JSON files, or construct in Python.
- Override any field via CLI flags or environment variables.
- Keep secrets (e.g., `HF_TOKEN`) only in env vars.

Key config blocks:
- `ModelConfig`: `name` (HF repo or local path), `cache_dir`, `token`
- `DataConfig`: HF dataset name or `[train.jsonl, val.jsonl]`, `seq_length`, `micro_batch_size`, `split`, `tokenizer_name`
- `LoRAConfig`: `target_modules`, `dim`, `dropout`, init methods
- `OptimizerConfig`: LR, warmup, weight decay, scheduler knobs
- `TrainerConfig`: steps, validation/checkpoint/logging cadence
- `ComputeConfig`: local/SLURM, nodes, gpus, time, tunnels, container
- `PathConfig`: checkpoints/data roots
- `EnvironmentConfig`: NCCL/NVTE/TRANSFORMERS_OFFLINE, etc.

Repro tips:
- Pin `container_image` and save the resolved YAML/JSON next to checkpoints.
- Set `experiment_name`/version to track runs.

In [None]:
# Build a TrainingConfig in Python
DO_BUILD_CONFIG = True

if DO_BUILD_CONFIG:
    # Minimal quick-test config; adjust for your use case
    base_config = cfg.TrainingConfig(
        model=cfg.ModelConfig(
            name="mistralai/Mistral-7B-Instruct-v0.3",
            cache_dir=str((Path.cwd() / "models" / "hf_cache").resolve()),
            token=os.environ.get("HF_TOKEN"),
        ),
        data=cfg.DataConfig(
            # Can be a HF dataset name, or a list of [train.jsonl/json, val.jsonl/json]
            dataset_name="rajpurkar/squad",
            seq_length=1024,
            micro_batch_size=2,
            split="train[:200]",  # or ['train[:200]','validation[:50]'] for train/val
            tokenizer_name=None,   # defaults to model name
        ),
        lora=cfg.LoRAConfig(
            target_modules=["o_proj"],
            dim=8,
            dropout=0.1,
        ),
        optimizer=cfg.OptimizerConfig(
            lr=2e-4,
            weight_decay=0.01,
            warmup_steps=10,
        ),
        trainer=cfg.TrainerConfig(
            max_steps=5,
            log_every_n_steps=1,
            val_check_interval=1,
            checkpoint_filename="LoRA_Finetune",
            version=1,
        ),
        compute=cfg.ComputeConfig(
            nodes=1,
            gpus_per_node=1,
            time="00:30:00",
            use_slurm=False,
            tunnel_type="ssh",
        ),
        paths=cfg.PathConfig(
            project_root=str(Path.cwd()),
            checkpoint_dir=None,  # will default to project_root/models/checkpoints
            data_dir=None,        # will default to project_root/data
        ),
        environment=cfg.EnvironmentConfig(
            transformers_offline="0",
            torch_nccl_avoid_record_streams="1",
        ),
        experiment_name="Notebook_LoRA_Quickstart",
    )

    print("Config built. Key fields:")
    print("Model:", base_config.model)
    print("Data:", base_config.data)
    print("LoRA:", base_config.lora)
    print("Trainer:", base_config.trainer)
    print("Compute:", base_config.compute)


Config built. Key fields:
Model: ModelConfig(name='mistralai/Mistral-7B-Instruct-v0.3', cache_dir='/lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_mlops/soverign_ai/NeMo/tutorials/llm/automodel/train/models/hf_cache', token='hf_...')
Data: DataConfig(dataset_name=['/lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_mlops/soverign_ai/data/conversations.jsonl', '/lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_mlops/soverign_ai/data/conversations.jsonl'], seq_length=1024, micro_batch_size=2, split='train[:200]', tokenizer_name=['/lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_mlops/soverign_ai/data/conversations.jsonl', '/lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_mlops/soverign_ai/data/conversations.jsonl'])
LoRA: LoRAConfig(target_modules=['o_proj'], dim=8, dropout=0.1, lora_A_init_method='xavier', lora_B_init_method='zero')
Trainer: TrainerConfig(max_steps=5, num_sanity_val_steps=0, val_check_interval=1, log_every_n_steps=1, check

## Data modules: flexible chat + tool-use support

`data_modules.py` provides `CustomHFDataModule`:
- Accepts a HF dataset name or local `[train.jsonl, val.jsonl]`.
- Applies a chat template compatible with HF tokenizers.
- Supports tool calls/results serialization inside messages.
- Emits `input_ids`, `labels`, and `loss_mask`.
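
To make the `input_ids`/`labels`/`loss_mask` output concrete, here is a toy sketch of assistant-only masking. It replaces the real tokenizer and chat template with whitespace splitting and an ad hoc vocabulary, and it assumes loss is computed only on assistant tokens. That is a common convention, not something confirmed from `data_modules.py`. The labels here are an unshifted copy of the inputs; where the next-token shift happens depends on the framework.

```python
from typing import Dict, List


def build_example(messages: List[Dict[str, str]]) -> Dict[str, List[int]]:
    """Toy tokenization: one id per whitespace-separated word."""
    vocab: Dict[str, int] = {}
    input_ids: List[int] = []
    loss_mask: List[int] = []
    for message in messages:
        # Prefix each turn with a role marker, as a chat template would.
        words = [f"<{message['role']}>"] + message["content"].split()
        is_assistant = message["role"] == "assistant"
        for index, word in enumerate(words):
            input_ids.append(vocab.setdefault(word, len(vocab)))
            # Train on assistant content only, never on the role marker.
            loss_mask.append(1 if is_assistant and index > 0 else 0)
    return {"input_ids": input_ids, "labels": list(input_ids), "loss_mask": loss_mask}


example = build_example([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello there!"},
])
print(example["loss_mask"])
```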

Your data
- Point `data.dataset_name` to HF datasets or your JSON/JSONL files.
- Customize `formatting_prompts_func_with_chat_template` to match your schema.
- Adjust `seq_length`, `micro_batch_size`, `split`, `tokenizer_name`.

Minimal JSONL example:
```json
{"messages":[{"role":"system","content":"You are a helpful virtual assistant."},{"role":"user","content":"Hi"}, {"role":"assistant","content":"Hello!"}]}
```

Tool-use example (assistant emits tool calls, tool returns results):
```json
{"messages":[
  {"role":"user","content":"Weather in SF?"},
  {"role":"assistant","tool_calls":[{"id":"c1","type":"function","function":{"name":"get_weather","arguments":{"city":"San Francisco"}}}]},
  {"role":"tool","tool_call_id":"c1","content":"{ \"temp\": 20, \"unit\": \"C\" }"},
  {"role":"assistant","content":"It is 20°C in SF."}
]}
```
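
Before training on your own files, it can help to check that each record follows the shape shown above. The validator below is a hypothetical sketch: the rules it enforces (allowed roles, each tool result pointing at an earlier call id) are assumptions drawn from the two examples, not a schema published by `data_modules.py`.

```python
import json
from typing import Any, Dict, List

ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one conversation record."""
    problems: List[str] = []
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    seen_call_ids = set()
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {index}: unknown role {role!r}")
        # Remember every call id so later tool results can be matched to it.
        for call in message.get("tool_calls", []):
            seen_call_ids.add(call.get("id"))
        if role == "tool" and message.get("tool_call_id") not in seen_call_ids:
            problems.append(f"message {index}: tool result has no matching call")
        if "content" not in message and "tool_calls" not in message:
            problems.append(f"message {index}: needs 'content' or 'tool_calls'")
    return problems


line = json.dumps({"messages": [
    {"role": "user", "content": "Weather in SF?"},
    {"role": "assistant", "tool_calls": [{"id": "c1", "type": "function",
        "function": {"name": "get_weather", "arguments": {"city": "San Francisco"}}}]},
    {"role": "tool", "tool_call_id": "c1", "content": "{\"temp\": 20, \"unit\": \"C\"}"},
    {"role": "assistant", "content": "It is 20 C in SF."},
]})
print(validate_record(json.loads(line)))
```

Running this over every line of a JSONL file before a dry-run surfaces malformed records without loading the model.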

In [None]:
# Inspect data module functions
import inspect

print("CustomHFDataModule:")
print(inspect.getsource(data_modules.CustomHFDataModule.formatting_prompts_func_with_chat_template))


CustomHFDataModule:
    def formatting_prompts_func_with_chat_template(self, example: Dict[str, Any], start_of_turn_token: Optional[str] = None) -> Dict[str, List[int]]:
        """
        Format any conversation example using Mistral chat template.
        
        Args:
            example: Dataset example, preferably with a 'messages' list and optional 'tools'.
            start_of_turn_token: Token marking start of assistant response
            
        Returns:
            Dictionary with input_ids, labels, and loss_mask
        """
        tools = example.get('tools', [])

        formatted_text: List[Dict[str, str]] = []
        raw_messages = example.get('messages')

        # Build system prompt that includes the available tools
        tools_block = '[AVAILABLE_TOOLS]' + json.dumps(tools, separators=(',', ':')) + '[/AVAILABLE_TOOLS]'
        if raw_messages[0].get('role') == 'system':
            system_content = str(raw_messages[0].get('content', ''))
            if '[AVAIL

## Recipes: LoRA and full fine-tuning, switchable

`recipe_factory.py` exposes:
- `create_lora_recipe(config)`: PEFT with `llm.peft.LoRA`
- `create_full_finetune_recipe(config)`: full model FT
- `create_recipe(config, recipe_type)`: one-line switch

Production knobs:
- LoRA: `config.lora.*` (rank, dropout, targets)
- Optimizer/scheduler: `config.optimizer.*`
- Trainer: `config.trainer.*` (precision, grad clip, logging)
- Resumption/checkpointing via `config.trainer` and `paths`
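
To get a feel for what the LoRA rank (`dim`) costs, the sketch below counts the trainable parameters added by one adapter. The shapes are assumptions for illustration: a 4096-to-4096 `o_proj` projection in each of 32 layers. Check your model's own config before relying on these numbers.

```python
def lora_param_count(in_features: int, out_features: int, dim: int) -> int:
    """Trainable parameters added by one LoRA adapter.

    LoRA learns a low-rank update B @ A, where A has shape (dim, in_features)
    and B has shape (out_features, dim).
    """
    return dim * in_features + out_features * dim


# Assumed shapes for illustration only.
hidden_size, num_layers, dim = 4096, 32, 8
per_layer = lora_param_count(hidden_size, hidden_size, dim)
full_per_layer = hidden_size * hidden_size

print("LoRA params per o_proj:", per_layer)
print("LoRA params across layers:", per_layer * num_layers)
print("Fraction of one full o_proj weight:", per_layer / full_per_layer)
```

Doubling `dim` doubles the adapter size, while adding entries to `target_modules` adds one adapter per matched module.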

In [None]:
# Create and inspect a LoRA recipe (no training yet)
DO_CREATE_RECIPE = True

if DO_CREATE_RECIPE:
    recipe = recipe_factory.create_recipe(base_config, recipe_type="lora")
    # Show a few key fields for verification
    print("Recipe created. Trainer max_steps:", recipe.trainer.max_steps)
    print("LoRA dim:", getattr(getattr(recipe, 'peft', None), 'dim', None))
    print("Data micro_batch_size:", recipe.data.micro_batch_size)


Recipe created. Trainer max_steps: 5
LoRA dim: 8
Data micro_batch_size: 2


## Executors: same config locally and on SLURM

`executors.py` picks an executor from `config.compute.use_slurm`:
- Local: `run.LocalExecutor` with `torchrun`
- SLURM: `run.SlurmExecutor` (SSH or local tunnel)

Production guidance:
- Pin `container_image`, set `custom_mounts` for datasets and checkpoints.
- Configure `account`, `partition`, `remote_job_dir`, `nodes`, `gpus_per_node`, `time`, `retries`.
- Use `dry_run=True` to preflight mounts, tokens, and dataset access.

In [None]:
# Create the appropriate executor to validate settings (no run yet)
DO_CREATE_EXECUTOR = True  # set True to test executor construction

if DO_CREATE_EXECUTOR:
    exe = executors.create_executor(base_config)
    print("Executor type:", type(exe).__name__)
    # Print a few key attributes if available
    if hasattr(exe, 'ntasks_per_node'):
        print("ntasks_per_node:", getattr(exe, 'ntasks_per_node'))


Executor type: LocalExecutor
ntasks_per_node: 1


## Dry-run: fast preflight for production

Use `train.run_training(..., dry_run=True)` to validate without starting training.
- Verifies config coherence (paths, dataset, tokens, LoRA/FT settings)
- Builds the recipe and executor
- Checks environment variables and mounts

Run this after every config change and before each launch to catch issues early.
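
The kind of check a preflight performs can be sketched in a few lines. This is an illustrative stand-in, not the logic inside `run_training(..., dry_run=True)`, which is not shown in this notebook. The `preflight` helper and its parameters are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List, Mapping, Sequence


def preflight(dataset_paths: Sequence[str], env: Mapping[str, str], needs_token: bool) -> List[str]:
    """Collect every problem instead of stopping at the first one."""
    problems: List[str] = []
    for path in dataset_paths:
        if not Path(path).is_file():
            problems.append(f"dataset file not found: {path}")
    if needs_token and not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set but the model is gated")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    train_file = Path(tmp) / "train.jsonl"
    train_file.write_text('{"messages": []}\n', encoding="utf-8")
    missing_file = Path(tmp) / "val.jsonl"  # intentionally not created
    report = preflight([str(train_file), str(missing_file)], env={}, needs_token=True)

for problem in report:
    print("FAIL:", problem)
```

Returning the full list of problems, rather than raising on the first, means one dry-run can surface several misconfigurations at once.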

In [None]:
# Validate only
DO_DRY_RUN = True

if DO_DRY_RUN:
    try:
        train_script.run_training(base_config, recipe_type="lora", dry_run=True)
        print("Dry-run validation passed.")
    except Exception as e:
        print("Dry-run validation failed:", e)


Dry-run validation passed.


## Local training: small, fast, reproducible

Use this to sanity‑check your full pipeline before scaling out.
- Start with tiny splits and low `max_steps`.
- Ensure `HF_TOKEN` is set (if the model is gated) and the container image is pinned.
- Logs and checkpoints go under `paths.checkpoint_dir`.

Tip: Keep a “smoke test” config checked into version control.

In [None]:
# Local run (small)
DO_LOCAL_TRAIN = True  # set True to run a quick local training

if DO_LOCAL_TRAIN:
    try:
        train_script.run_training(base_config, recipe_type="lora", dry_run=False)
        print("Training started/completed.")
    except Exception as e:
        print("Training failed to start:", e)


Starting lora training with experiment: Notebook_LoRA_Quickstart
Model: mistralai/Mistral-7B-Instruct-v0.3
Dataset: ['/lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_mlops/soverign_ai/data/conversations.jsonl', '/lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_mlops/soverign_ai/data/conversations.jsonl']
Compute: 1 nodes, 1 GPUs/node
Configuration validation passed
Environment configured with 9 variables
Creating lora training recipe


Log directory is: /root/.nemo_run/experiments/Notebook_LoRA_Quickstart/Notebook_LoRA_Quickstart_1755510148/lora_training


Launched app: local_persistent://nemo_run/lora_training-hx1wtvqnqmk61c


Waiting for job lora_training-hx1wtvqnqmk61c to finish [log=True]...


a_training/0 I0818 02:42:32.767000 185042 torch/distributed/run.py:649] Using nproc_per_node=1.
a_training/0 I0818 02:42:32.768000 185042 torch/distributed/launcher/api.py:195] Starting elastic_operator with launch configs:
a_training/0 I0818 02:42:32.768000 185042 torch/distributed/launcher/api.py:195]   entrypoint       : nemo_run.core.runners.fdl_runner
a_training/0 I0818 02:42:32.768000 185042 torch/distributed/launcher/api.py:195]   min_nodes        : 1
a_training/0 I0818 02:42:32.768000 185042 torch/distributed/launcher/api.py:195]   max_nodes        : 1
a_training/0 I0818 02:42:32.768000 185042 torch/distributed/launcher/api.py:195]   nproc_per_node   : 1
a_training/0 I0818 02:42:32.768000 185042 torch/distributed/launcher/api.py:195]   run_id           : 8284
a_training/0 I0818 02:42:32.768000 185042 torch/distributed/launcher/api.py:195]   rdzv_backend     : c10d
a_training/0 I0818 02:42:32.768000 185042 torch/distributed/launcher/api.py:195]   rdzv_endpoint    : localhost:0

Job lora_training-hx1wtvqnqmk61c finished: SUCCEEDED


Training completed successfully


Training started/completed.


## SLURM: scale out without code changes

Set `compute.use_slurm=True`, then populate:
- `account`, `partition`, `remote_job_dir`, `nodes`, `gpus_per_node`
- Tunnels (`user`, `host`) as needed
- `container_image`, `custom_mounts`, `time`, `retries`

Cluster‑specific knobs vary (e.g., `gres`, `gpus_per_node`). Start from `slurm_config.yaml`, then use `dry_run=True` to validate before submitting.
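
A quick way to catch an incomplete SLURM setup before a dry-run is to list the fields that are still empty. The sketch below uses a hypothetical `SketchCompute` dataclass that mirrors only the settings named above. It assumes that `account`, `partition`, and `remote_job_dir` are always required, and that an SSH tunnel also needs `user` and `host`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchCompute:
    """Hypothetical subset of the compute settings listed above."""
    use_slurm: bool = False
    account: Optional[str] = None
    partition: Optional[str] = None
    remote_job_dir: Optional[str] = None
    tunnel_type: str = "ssh"
    user: Optional[str] = None
    host: Optional[str] = None


def missing_slurm_fields(compute: SketchCompute) -> List[str]:
    """Names of settings that must be filled in before a SLURM submission."""
    if not compute.use_slurm:
        return []
    required = ["account", "partition", "remote_job_dir"]
    if compute.tunnel_type == "ssh":
        # An SSH tunnel needs to know where to connect and as whom.
        required += ["user", "host"]
    return [name for name in required if not getattr(compute, name)]


compute = SketchCompute(use_slurm=True, account="your_account", tunnel_type="ssh")
print(missing_slurm_fields(compute))
```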

In [None]:
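# Example: switch to SLURM (dry-run only; no job is submitted)
import copy

DO_SLURM_EXAMPLE = True

if DO_SLURM_EXAMPLE:
    # Deep-copy so base_config keeps its local settings for the other cells
    slurm_cfg = copy.deepcopy(base_config)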
    slurm_cfg.compute.use_slurm = True
    slurm_cfg.compute.account = "your_account"
    slurm_cfg.compute.partition = "your_partition"
    slurm_cfg.compute.remote_job_dir = "/path/to/remote/jobdir"
    slurm_cfg.compute.nodes = 1
    slurm_cfg.compute.gpus_per_node = 8
    slurm_cfg.compute.tunnel_type = "ssh"  # or "local"
    slurm_cfg.compute.user = "your_user"
    slurm_cfg.compute.host = "cluster.hostname"
    slurm_cfg.compute.container_image = "nvcr.io/nvidia/nemo:25.07"
    slurm_cfg.compute.custom_mounts = ["/home:/home"] # Add any other custom mounts here

    try:
        # Validate only
        train_script.run_training(slurm_cfg, recipe_type="lora", dry_run=True)
        print("SLURM dry-run validation passed.")
    except Exception as e:
        print("SLURM dry-run validation failed:", e)


Starting lora training with experiment: Notebook_LoRA_Quickstart
Model: mistralai/Mistral-7B-Instruct-v0.3
Dataset: ['/lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_mlops/soverign_ai/data/conversations.jsonl', '/lustre/fsw/portfolios/llmservice/projects/llmservice_nemo_mlops/soverign_ai/data/conversations.jsonl']
Compute: 1 nodes, 8 GPUs/node
Configuration validation passed
Environment configured with 9 variables
Dry run completed successfully


SLURM dry-run validation passed.


## Save and reuse configurations

Promote notebooks to scripts/CLI with saved configs.
- Use `TrainingConfig.to_yaml()` / `to_json()` to persist resolved configs
- Commit templates; track exact run configs with checkpoints
- Load configs in CI or non‑interactive jobs
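
The save/load round trip can be illustrated with standard-library dataclasses and JSON. `SketchConfig` and `SketchTrainer` are hypothetical stand-ins. The real `TrainingConfig.to_json()` may serialize differently, so treat this as a sketch of the pattern only.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class SketchTrainer:
    max_steps: int = 5
    val_check_interval: int = 1


@dataclass
class SketchConfig:
    """Hypothetical nested config; the real TrainingConfig has more blocks."""
    experiment_name: str = "Notebook_LoRA_Quickstart"
    trainer: SketchTrainer = field(default_factory=SketchTrainer)

    def to_json(self, path: str) -> None:
        text = json.dumps(asdict(self), indent=2, sort_keys=True)
        Path(path).write_text(text, encoding="utf-8")

    @classmethod
    def from_json(cls, path: str) -> "SketchConfig":
        raw = json.loads(Path(path).read_text(encoding="utf-8"))
        # asdict() turned the nested dataclass into a dict, so rebuild it here.
        return cls(
            experiment_name=raw["experiment_name"],
            trainer=SketchTrainer(**raw["trainer"]),
        )


with tempfile.TemporaryDirectory() as tmp:
    saved = SketchConfig(trainer=SketchTrainer(max_steps=100))
    saved.to_json(str(Path(tmp) / "run_config.json"))
    loaded = SketchConfig.from_json(str(Path(tmp) / "run_config.json"))

print(loaded == saved)
```

Sorting keys makes the saved file diff cleanly between runs, which helps when configs are committed alongside code.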

In [None]:
# Save config
DO_SAVE_CONFIG = False

if DO_SAVE_CONFIG:
    out_yaml = Path.cwd() / "my_config.yaml"
    out_json = Path.cwd() / "my_config.json"
    base_config.to_yaml(str(out_yaml))
    base_config.to_json(str(out_json))
    print("Saved:", out_yaml)
    print("Saved:", out_json)


## Tips for production and flexibility

- Models: update `model.name` and `tokenizer_name`; pin container and HF revision.
- Data: point to HF datasets or local JSON/JSONL; customize the chat formatter for your schema/tooling.
- LoRA vs Full FT: start with LoRA; switch to full when you need capacity and have budget.
- Context/batching: balance `seq_length` and batch size against GPU memory; scale the effective batch with gradient accumulation.
- Optimizer/scheduler: use warmup; start with a higher LR for adapters and a lower one for full FT.
- Validation: keep small slices for quick signals; monitor TB logs.
- SLURM: preflight with `dry_run=True`; verify mounts, account/partition, and gres.
- Reproducibility: save resolved configs, seeds, and frequent checkpoints.
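
The batching tip comes down to simple arithmetic. The helper below is a sketch that assumes pure data parallelism, so every GPU processes its own micro-batch. The function name and the example numbers are illustrative, not taken from the training code.

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach a target global batch size."""
    per_step = micro_batch_size * num_gpus
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not a multiple of "
            f"micro_batch_size * num_gpus = {per_step}"
        )
    return global_batch_size // per_step


steps = grad_accum_steps(global_batch_size=64, micro_batch_size=2, num_gpus=8)
max_tokens = 64 * 1024  # upper bound per optimizer step at seq_length=1024
print("accumulation steps:", steps)
print("max tokens per optimizer step:", max_tokens)
```

If memory is tight, lower `micro_batch_size` and raise the accumulation steps: the global batch stays the same, at the cost of more forward/backward passes per optimizer step.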

## Load configs from provided templates

You can start from an example config such as `basic_lora.yaml` and modify it in place.


In [None]:
# Example: load YAML config template
DO_LOAD_TEMPLATE = False

if DO_LOAD_TEMPLATE:
    template_path = PROJECT_ROOT / "basic_lora.yaml"
    try:
        loaded_cfg = cfg.TrainingConfig.from_yaml(str(template_path))
        print("Loaded template. Model:", loaded_cfg.model.name)
        print("Data:", loaded_cfg.data)
        print("Trainer max_steps:", loaded_cfg.trainer.max_steps)
    except Exception as e:
        print("Failed to load template:", e)


## Use local JSON/JSONL files

You can point `data.dataset_name` to a list of two paths: `[train_file, val_file]`. For quick tests, you can reuse the same file for both.
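
If you only have a single conversations file, a seeded split gives you separate train and validation files. The helper below is a hypothetical sketch; the output file names simply mirror the ones used in the next cell.

```python
import json
import random
import tempfile
from pathlib import Path
from typing import Tuple


def split_jsonl(source: Path, out_dir: Path, val_fraction: float = 0.1, seed: int = 0) -> Tuple[Path, Path]:
    """Shuffle the records in `source` and write train/val JSONL files."""
    lines = [line for line in source.read_text(encoding="utf-8").splitlines() if line.strip()]
    random.Random(seed).shuffle(lines)  # fixed seed keeps the split reproducible
    num_val = max(1, int(len(lines) * val_fraction))
    train_path = out_dir / "conversations_train.jsonl"
    val_path = out_dir / "conversations_val.jsonl"
    val_path.write_text("\n".join(lines[:num_val]) + "\n", encoding="utf-8")
    train_path.write_text("\n".join(lines[num_val:]) + "\n", encoding="utf-8")
    return train_path, val_path


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "conversations.jsonl"
    records = [{"messages": [{"role": "user", "content": f"q{i}"}]} for i in range(10)]
    source.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")
    train_path, val_path = split_jsonl(source, Path(tmp), val_fraction=0.2)
    num_train = len(train_path.read_text(encoding="utf-8").splitlines())
    num_val = len(val_path.read_text(encoding="utf-8").splitlines())

print(num_train, num_val)
```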


In [None]:
# Switch config to JSON files (example)
DO_USE_LOCAL_JSON = False

if DO_USE_LOCAL_JSON:
    train_path = PROJECT_ROOT / "data/conversations_train.jsonl"
    val_path = PROJECT_ROOT / "data/conversations_val.jsonl"
    base_config.data.dataset_name = [str(train_path), str(val_path)]
    base_config.data.split = ["train[:2]", "validation[:1]"]  # optional when using local json
    print("Using JSON files:", base_config.data.dataset_name)


## Typical Errors and Resolutions

- Missing HF Token: Set `HF_TOKEN` in your environment; avoid hard‑coding in configs.
- Dataset path issues: If using local JSON/JSONL, make sure both train and val files exist and are readable by the container; mount paths via `custom_mounts`.
- Tokenizer mismatch: If your tokenizer differs from the model, set `data.tokenizer_name` explicitly.
- SLURM GPU config: Clusters vary; confirm `gres`/`gpus_per_node` with admins. Use `dry_run=True` to validate before submit.
- Container parity: Pin `compute.container_image` and ensure CUDA/driver compatibility with your cluster.
- Checkpoint directory permissions: Ensure `paths.checkpoint_dir` is writable in both local and remote contexts.