A concrete NeMo GPT CLM training command (from NVIDIA docs)

NeMo provides megatron_gpt_pretraining.py, which is driven by Hydra config overrides on the command line (devices, steps, batch sizes, tensor/pipeline parallel sizes (TP/PP), optimizer and LR schedule, etc.). Example from the NeMo GPT training guide:

In [None]:
python <NeMo_ROOT>/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=<NeMo_ROOT>/examples/nlp/language_modeling/conf \
  --config-name=megatron_gpt_config \
  trainer.devices=1 trainer.num_nodes=1 trainer.max_steps=300000 trainer.precision=16 \
  model.micro_batch_size=6 model.global_batch_size=192 \
  model.tensor_model_parallel_size=1 model.pipeline_model_parallel_size=1 \
  model.encoder_seq_length=1024 model.max_position_embeddings=1024 \
  model.optim.name=fused_adam model.optim.lr=6e-4 model.optim.sched.name=CosineAnnealing
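
The batch-size overrides in this command are tied together. In Megatron-style training, global_batch_size equals micro_batch_size times the data-parallel size times the number of accumulated micro-batches, where the data-parallel size is the total GPU count divided by TP x PP. The sketch below works this out for the values in the command; accumulation_steps is a hypothetical helper written for illustration, not a NeMo API.

```python
def accumulation_steps(global_batch_size, micro_batch_size,
                       devices, num_nodes=1,
                       tensor_parallel=1, pipeline_parallel=1):
    """Micro-batches accumulated per optimizer step (hypothetical helper, not a NeMo API)."""
    world_size = devices * num_nodes
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP * PP")
    data_parallel = world_size // model_parallel
    samples_per_micro_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError("global_batch_size must be divisible by "
                         "micro_batch_size * data_parallel")
    return global_batch_size // samples_per_micro_step

# Values taken from the command: 1 GPU, micro batch 6, global batch 192
steps = accumulation_steps(global_batch_size=192, micro_batch_size=6, devices=1)
print(steps)  # 32

# Tokens consumed per optimizer step at encoder_seq_length=1024
tokens_per_step = 192 * 1024
print(tokens_per_step)  # 196608
```

On a single GPU the command therefore accumulates 32 micro-batches of 6 sequences before each optimizer step.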


Minimal “gold standard” PyTorch CLM step (so you truly understand it)

In [None]:
import torch
import torch.nn.functional as F

def clm_loss(logits, input_ids, pad_id=None):
    """
    logits: [B, T, V]
    input_ids: [B, T]
    """
    # shift: predict token t+1 from positions <= t
    shift_logits = logits[:, :-1, :].contiguous()     # [B, T-1, V]
    shift_labels = input_ids[:, 1:].contiguous()      # [B, T-1]

    if pad_id is None:
        return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                               shift_labels.view(-1))
    else:
        return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                               shift_labels.view(-1),
                               ignore_index=pad_id)
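
A quick sanity check of the shift logic, assuming PyTorch is installed. The block restates the loss compactly so that it runs on its own; clm_loss_compact is a hypothetical name, and it folds the pad_id=None case into ignore_index=-100, which is the default of F.cross_entropy. The token ids are random and id 0 is treated as a made-up pad id.

```python
import math
import torch
import torch.nn.functional as F

def clm_loss_compact(logits, input_ids, pad_id=None):
    # Same computation as clm_loss; -100 is F.cross_entropy's default ignore_index
    ignore = -100 if pad_id is None else pad_id
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=ignore)

B, T, V = 2, 5, 11
input_ids = torch.randint(1, V, (B, T))   # ids in 1..V-1, so 0 is free to act as pad

# Check 1: all-zero logits are a uniform distribution, so the loss is ln(V)
uniform_loss = clm_loss_compact(torch.zeros(B, T, V), input_ids).item()
print(uniform_loss, math.log(V))

# Check 2: logits peaked on the NEXT token at every position give near-zero loss
next_token_logits = torch.full((B, T, V), -20.0)
next_token_logits[:, :-1, :].scatter_(2, input_ids[:, 1:].unsqueeze(-1), 20.0)
peaked_loss = clm_loss_compact(next_token_logits, input_ids).item()
print(peaked_loss)

# Check 3: overwrite the last token with pad; the loss stays small only if pad is ignored
padded = input_ids.clone()
padded[:, -1] = 0
masked_loss = clm_loss_compact(next_token_logits, padded, pad_id=0).item()
unmasked_loss = clm_loss_compact(next_token_logits, padded).item()
print(masked_loss, unmasked_loss)
```

If the shift were off by one, Check 2 would report a large loss instead of a value near zero.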


Here is how you configure a causal language modeling (CLM) run for continued pre-training on enterprise data with the Hugging Face Trainer.

In [None]:
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    TrainingArguments, 
    Trainer,
    DataCollatorForLanguageModeling
)
import torch

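# 1. Load Pre-trained Model (e.g., Llama-3-8B)
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights as well; the Trainer in step 5 needs `model` to be defined
model = AutoModelForCausalLM.from_pretrained(model_id)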

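# Enterprise Note: the Llama tokenizer ships without a pad token; a common workaround is to reuse EOS.
# Caveat: DataCollatorForLanguageModeling masks every pad-token id in the labels,
# so with pad == EOS the EOS positions are also excluded from the loss.
tokenizer.pad_token = tokenizer.eos_token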

# 2. Dataset Preparation (Concept)
# Assume 'tokenized_datasets' contains your internal domain data
# packed into blocks of block_size=2048 or 4096.

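# 3. Data Collator
# With mlm=False the collator copies input_ids into labels and sets pad-token positions to -100.
# The one-token shift happens inside the model's forward pass, not in the collator.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Crucial: False for causal LM, True for BERT-style masked LM
)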

# 4. Enterprise Training Arguments
training_args = TrainingArguments(
    output_dir="./cpt-llama-finance",
    
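    # Batch size / precision (efficiency params; these do not enable model parallelism)
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4, # Effective batch = 4 * 4 * number of devices
    fp16=False,
    bf16=True, # bfloat16 is strongly recommended for Llama (fp16's narrower range risks overflow); needs bf16-capable hardware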
    
    # Optimizer
    learning_rate=2e-5, # Low LR for continued pre-training (limits drift from the base model)
    weight_decay=0.01,
    lr_scheduler_type="cosine", # Common choice for LLMs
    warmup_ratio=0.03, # Warmup over the first 3% of steps helps avoid loss spikes at the start
    
    # Distributed Training Strategy (DeepSpeed/FSDP)
    # In production, you would pass a deepspeed config file here
    # deepspeed="ds_config_zero3.json", 
    
    logging_steps=10,
    save_strategy="epoch",
)

# 5. Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
)

# 6. Train
# trainer.train()
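
Step 2 above assumes the corpus is already tokenized and packed into fixed-size blocks. A minimal sketch of that packing in plain Python follows, using the concatenate-and-chunk approach that is common in CLM preprocessing. pack_into_blocks and its arguments are hypothetical names, the token ids are made up, and dropping the trailing remainder shorter than block_size is one possible choice among several.

```python
from itertools import chain

def pack_into_blocks(tokenized_docs, block_size, eos_id):
    """Concatenate documents (each followed by eos_id) and cut into equal blocks.

    Hypothetical helper for illustration; the trailing partial block is dropped.
    """
    stream = list(chain.from_iterable(doc + [eos_id] for doc in tokenized_docs))
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]   # made-up token ids
blocks = pack_into_blocks(docs, block_size=4, eos_id=0)
print(blocks)  # [[11, 12, 13, 0], [21, 22, 0, 31], [32, 33, 34, 0]]
```

Because every block is exactly block_size tokens long, no padding is needed at training time; in practice block_size would be 2048 or 4096 rather than 4.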