# Training a non-English Reasoning Language Model on Lepton with NeMo 2.0

[NVIDIA NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html) is a powerful toolkit for training and fine-tuning Large Language Models. This notebook demonstrates how to create a reasoning-capable LLM for Spanish on Lepton using NeMo 2.0.

While many state-of-the-art models perform well in English, their performance often degrades significantly in other languages. This tutorial shows how data translation, continued pre-training (CPT), and supervised fine-tuning (SFT) can be combined to create a model that reasons in English and outputs answers in Spanish. This approach maintains strong reasoning capabilities while enabling Spanish language output for better accessibility.

### Objectives

This tutorial demonstrates a complete pipeline for training a reasoning-capable Large Language Model that reasons in English and outputs answers in Spanish, with a focus on the math domain:
1. **Data Translation**: We will translate raw texts and instruction-tuning datasets with reasoning traces from English to Spanish
2. **Continued Pre-Training (CPT)**: We will adapt the model to Spanish language and the math domain
3. **Supervised Fine-Tuning (SFT)**: We will teach the model to reason in English and output answers in Spanish in the math domain

### Workflow Overview

This workflow is structured into multiple steps:
  1. Prepare datasets for translation
  2. Deploy translation endpoint and translate datasets
  3. Prepare model and pre-tokenize data
  4. Run Continued Pre-Training (CPT)
  5. Run Supervised Fine-Tuning (SFT)
  6. Export and deploy the final model

### Requirements:
* Container: `nvcr.io/nvidia/nemo:25.07.gpt_oss`.
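* GPUs: 1 GPU on a node. The translation endpoint uses 4 GPUs (4 single-GPU replicas), each training batch job uses 4 GPUs, and the final inference endpoint uses 1 GPU.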
* External Accounts: NVIDIA GPU Cloud key (`NGC_API_KEY`): https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#generate-an-api-key

### Step 1: Setup and Configuration

In this step, we will import the necessary libraries and define configuration parameters for the entire pipeline.


In [None]:
# Import necessary libraries
import asyncio
import glob
import json
import os
import random
import re
import shutil
import subprocess
import time
from pprint import pprint

import openai
from datasets import load_dataset
from tqdm.asyncio import tqdm

import nemo_run as run
from nemo import lightning as nl
from nemo.collections import llm
from nemo.collections.common.tokenizers import AutoTokenizer
from nemo.collections.llm.gpt.data import ChatDataModule, PreTrainingDataModule

### Step 2: Prepare Datasets for Translation

#### Step 2.1: Load CPT & SFT Datasets

In this step, we will load the datasets that will be used for Continued Pre-Training and Supervised Fine-Tuning. We will use two datasets:
- **CPT Dataset**: `nvidia/Nemotron-Pretraining-Dataset-sample` (Nemotron-CC-MATH subset)
- **SFT Dataset**: `nvidia/Nemotron-Post-Training-Dataset-v1` (math split)

Both datasets contain only English samples. We will translate them into Spanish to improve the model's Spanish capability, while keeping the reasoning traces in the SFT data in English.


In [None]:
cpt_dataset = load_dataset("nvidia/Nemotron-Pretraining-Dataset-sample", "Nemotron-CC-MATH", split="train")
sft_dataset = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v1", split="math", streaming=True)

#### Step 2.2: Define Translation Prompt

We will use a carefully crafted prompt to ensure high-quality, literal translations that preserve the original meaning and structure.


In [None]:
TRANSLATION_PROMPT = """You are an expert linguistic translator, specializing in literal, high-fidelity \
translations from English to Spanish. Your sole function is to translate the text provided in the \
`<source_text>` block.

### RULES
1.  **Translate Literally:** Preserve all original meaning and structure.
2.  **Preserve Non-English Text:** Any text that is not in English (e.g., names, code, \
specific terms) must be kept in its original form.
3.  **Output Only Spanish:** Your entire response must be ONLY the Spanish translation. \
Do not include explanations, the source text, or any other text.

---
### EXAMPLE
Source Language: English
Target Language: Spanish

<source_text>
Please send the file to Jean-Pierre and use the command `run.sh --force`.
</source_text>

Expected Output:
Envíe el archivo a Jean-Pierre y utilice el comando `run.sh --force`.
---

### TASK
Source Language: English
Target Language: Spanish

<source_text>
{content}
</source_text>

Final Spanish Translation:"""

#### Step 2.3: Deploy Translation Endpoint

We will deploy a translation endpoint with 4 replicas of the `openai/gpt-oss-20b` model on the NVIDIA Lepton platform using [NVIDIA NIM](https://developer.nvidia.com/nim). This requires setting up authentication and configuring the endpoint.

> **NOTE:** You should configure the following secrets/tokens for production use:
> - Container registry auth in Lepton (`IMAGE_PULL_SECRET`): https://docs.nvidia.com/dgx-cloud/lepton/features/workspace/registry/
> - Lepton API token (`LEPTON_KEY`): https://docs.nvidia.com/dgx-cloud/lepton/features/workspace/token/. Should be in the format `<workspace ID>:<API token>`
> - NVIDIA GPU Cloud key (`NGC_API_KEY`): https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#generate-an-api-key
> - `NODE_GROUP` in Lepton


In [None]:
# TODO: Replace these placeholder values with your actual credentials
os.environ["LEPTON_KEY"] = "<YOUR_LEPTON_API_KEY>"
os.environ["NODE_GROUP"] = "<YOUR_NODE_GROUP>"
os.environ["ACCESS_TOKEN"] = "my_access_token"
os.environ["IMAGE_PULL_SECRET"] = "<YOUR_IMAGE_PULL_SECRET>"
os.environ["NGC_API_KEY"] = "<YOUR_NGC_API_KEY>"
os.environ["TRANSLATION_ENDPOINT_NAME"] = "openai-gpt-oss-20b-translator"
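
Before deploying, it can help to confirm that none of the placeholders above were left unchanged. The following is a minimal sketch: `find_unset_credentials` and `lepton_key_looks_valid` are hypothetical helpers (not part of the Lepton CLI or NeMo), and the format check only assumes the `<workspace ID>:<API token>` shape described in the note above.

```python
import os

# Variables set in the cell above that the deployment commands rely on
REQUIRED_VARS = [
    "LEPTON_KEY",
    "NODE_GROUP",
    "ACCESS_TOKEN",
    "IMAGE_PULL_SECRET",
    "NGC_API_KEY",
]


def find_unset_credentials(env, required=REQUIRED_VARS):
    """Return names that are missing, empty, or still a <YOUR_...> placeholder."""
    problems = []
    for name in required:
        value = env.get(name, "")
        if not value or value.startswith("<YOUR_"):
            problems.append(name)
    return problems


def lepton_key_looks_valid(key: str) -> bool:
    """Check only the '<workspace ID>:<API token>' shape, not the token itself."""
    workspace_id, sep, token = key.partition(":")
    return bool(workspace_id and sep and token)


# Lists whichever variables still need real values in the current environment
print(find_unset_credentials(os.environ))
```

An empty list means every variable has a non-placeholder value; it does not prove that the credentials are accepted by the services.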

In [None]:
%%bash
# Configuration for translation endpoint
REPLICAS=4
MODEL_NIM="nvcr.io/nim/openai/gpt-oss-20b:latest"

lep login -c $LEPTON_KEY

# Deploy the translation endpoint
/usr/local/bin/lep endpoint create -n $TRANSLATION_ENDPOINT_NAME --container-image $MODEL_NIM --image-pull-secrets $IMAGE_PULL_SECRET \
        --node-group $NODE_GROUP --resource-shape gpu.1xh200 --replicas-static $REPLICAS \
        --env "NGC_API_KEY=$NGC_API_KEY" --container-port 8000 --tokens $ACCESS_TOKEN


We will periodically check the endpoint status to ensure all replicas are ready before proceeding with translation. The function below uses the `lep endpoint status` command to monitor deployment progress. The same function will return the endpoint URL once all replicas are ready.


In [None]:
def wait_for_endpoint(endpoint_name: str, interval: int = 10) -> str:
    command = ["lep", "endpoint", "status", "-n", endpoint_name, "--detail"]
    while True:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        for line in result.stdout.split("\n"):
            if line.startswith("State"):
                _, state = line.strip().rsplit(" ", maxsplit=1)
                if "LeptonDeploymentState.Ready" in state:
                    print("Endpoint deployed!")
                else:
                    break
            url_match = re.search(r'https://[\w\d\.\-]+', line)
            if url_match:
                print(f"URL: {url_match[0]}")
                return url_match[0]
        print(f"Waiting for endpoint {endpoint_name} to be ready...")
        time.sleep(interval)

endpoint_url = wait_for_endpoint(os.environ["TRANSLATION_ENDPOINT_NAME"])
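
The polling loop above waits indefinitely, which can leave a notebook hanging if a deployment never becomes ready. A bounded variant of the same pattern could look like the sketch below. `poll_until` is a hypothetical helper, and the default `interval` and `timeout` values are illustrative assumptions, not recommendations from the Lepton documentation.

```python
import time


def poll_until(check, interval: float = 10.0, timeout: float = 1800.0):
    """Call check() until it returns a non-None value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(interval)


# Toy check that becomes ready on the third call
attempts = iter([None, None, "https://example.invalid"])
print(poll_until(lambda: next(attempts), interval=0.01, timeout=1.0))
```

In the notebook, `check` would wrap one pass over the `lep endpoint status` output and return the URL once the state is ready, or `None` otherwise.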

#### Step 2.4: Define Translation Functions

We will now define async functions to handle concurrent translation of individual texts, messages with reasoning traces, and complete SFT samples. These functions will enable efficient batch processing of our datasets.

> **IMPORTANT**: For messages with reasoning traces (marked by `<think>...</think>`), only the final answer after the reasoning is translated to Spanish. The reasoning content itself remains in English to preserve the model's reasoning capabilities.

For this tutorial, we will set the translation timeout to 300 seconds. Note that this may be too short for long inputs, especially with reasoning-enabled models. These models emit the translation only after finishing the reasoning trace, which can occasionally be longer than the final translation itself.
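
To make the splitting rule concrete, here is a minimal, self-contained sketch of separating a reasoning trace from the final answer. `split_reasoning` is a hypothetical helper written for illustration; it splits at the first `</think>` tag and treats a message without that tag as answer-only.

```python
def split_reasoning(message: str, close_tag: str = "</think>") -> tuple:
    """Split a message into (reasoning, answer) at the first closing tag.

    If the tag is absent, the whole message is treated as the answer.
    """
    reasoning, sep, answer = message.partition(close_tag)
    if not sep:
        return "", message
    return reasoning + sep, answer


msg = "<think>2 + 2 = 4</think>The answer is 4."
reasoning, answer = split_reasoning(msg)
print(repr(reasoning))  # '<think>2 + 2 = 4</think>'
print(repr(answer))     # 'The answer is 4.'
```

Only `answer` would be sent for translation; `reasoning` is concatenated back unchanged.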


In [None]:
MODEL = "openai/gpt-oss-20b"
SAMPLE_CONCURRENCY = 1024
TRANSLATION_TIMEOUT = 300
SEMAPHORE = asyncio.Semaphore(SAMPLE_CONCURRENCY)
ENDPOINT = openai.AsyncOpenAI(
        api_key=os.environ["ACCESS_TOKEN"],
        base_url=endpoint_url + "/v1",
    )


async def translate_text(
    text: str,
    endpoint: openai.AsyncOpenAI = ENDPOINT,
    semaphore: asyncio.Semaphore = SEMAPHORE,
    model: str = MODEL,
    temperature: float = 0.2,
    top_p: float = 0.95,
    seed: int = 0,
) -> str:
    """Translate raw text from English to Spanish using OpenAI API.
    
    Returns:
        Translated text in Spanish, or None if translation fails.
    """
    if not text:
        return text
    # Set reasoning mode to "low" for faster translation with OSS model
    messages = [
        {"role": "system", "content": "Reasoning: low"},
        {
            "role": "user",
            "content": TRANSLATION_PROMPT.format(content=text),
        },
    ]
    # Add random delay to avoid overwhelming the endpoint
    delay = random.uniform(0.1, 0.5)
    await asyncio.sleep(delay)
    try:
        async with semaphore:
            response = await asyncio.wait_for(
                endpoint.chat.completions.create(
                    model=model,
                    messages=messages,
                    temperature=temperature,
                    top_p=top_p,
                    seed=seed,
                    timeout=TRANSLATION_TIMEOUT,
                    max_tokens=32000,
                ),
                timeout=TRANSLATION_TIMEOUT + 10
            )
            if not response.choices[0].message.content:
                return None
        return response.choices[0].message.content
    # Fallback in case a sample could not be translated in time
    except Exception as e:
        # Report the exception type so timeouts can be told apart from other failures
        print(f"Text translation task timed out or failed: {type(e).__name__}")
        return None
    

async def translate_message(
    message: str,
    endpoint: openai.AsyncOpenAI = ENDPOINT,
    semaphore: asyncio.Semaphore = SEMAPHORE,
    model: str = MODEL,
    temperature: float = 0.2,
    top_p: float = 0.95,
    seed: int = 0
) -> str:
    """Translate a single user/assistant message using OpenAI API.
    
    For texts with reasoning traces (marked by <think>...</think>), only the
    final answer after the reasoning is translated to Spanish, while the
    reasoning content itself remains in English.
    
    Returns:
        Translated message with reasoning in English and answer in Spanish.
    """
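    if message.startswith("<think>"):
        # Split at the first closing tag only, so a later "</think>" in the
        # answer cannot break the unpacking
        reasoning, sep, answer = message.partition("</think>")
        if not sep:
            # Unterminated reasoning trace: signal failure so the sample is skipped
            return None
        translated_answer = await translate_text(
            answer,
            endpoint,
            semaphore,
            model,
            temperature,
            top_p,
            seed
        )
        # Propagate failures instead of writing the string "None" into the sample
        if translated_answer is None:
            return None
        return f"{reasoning}</think>{translated_answer}"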
    return await translate_text(
            message,
            endpoint,
            semaphore,
            model,
            temperature,
            top_p,
            seed
        )


async def translate_sft_sample(
    sample: dict,
    endpoint: openai.AsyncOpenAI = ENDPOINT,
    semaphore: asyncio.Semaphore = SEMAPHORE,
    model: str = MODEL,
    temperature: float = 0.2,
    top_p: float = 0.95,
    seed: int = 0
) -> dict:
    """Translate an SFT sample containing 'messages' field using OpenAI API.
    
    Individual system/user/assistant messages are translated separately. For
    messages with reasoning traces, only the final answer is translated to
    Spanish while reasoning content remains in English.
    
    Returns:
        Dictionary with translated messages, or None if any translation fails.
    """
    translated_messages = await asyncio.gather(
        *[
            translate_message(
                message["content"],
                endpoint,
                semaphore,
                model,
                temperature,
                top_p,
                seed
            )
        for message in sample["messages"]]
    )
    # Skipping samples if translation of any message failed
    if any(translated_message is None for translated_message in translated_messages):
        return None
    translated_messages = [
        {"content": translation, "role": message["role"]}
        for translation, message in zip(translated_messages, sample["messages"])
    ]
    return {"messages": translated_messages}
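
The concurrency pattern used by these functions (a shared semaphore plus a per-request timeout that yields `None` on failure) can be tried in isolation. The sketch below replaces the endpoint call with `fake_translate`, a stand-in coroutine; `bounded_call` and `run_demo` are likewise hypothetical names, and the timings are chosen only to make the timeout path visible.

```python
import asyncio


async def bounded_call(coro_fn, arg, semaphore, timeout):
    """Run coro_fn(arg) under a semaphore; return None on timeout or error."""
    try:
        async with semaphore:
            return await asyncio.wait_for(coro_fn(arg), timeout=timeout)
    except Exception:
        return None


async def fake_translate(text):
    # Stand-in for a real endpoint call: the "slow" input exceeds the timeout
    await asyncio.sleep(1.0 if text == "slow" else 0.01)
    return text.upper()


async def run_demo():
    semaphore = asyncio.Semaphore(2)  # at most two requests in flight
    inputs = ["hola", "slow", "adios"]
    return await asyncio.gather(
        *[bounded_call(fake_translate, t, semaphore, timeout=0.2) for t in inputs]
    )


# Inside Jupyter, use `await run_demo()` instead of asyncio.run(...)
print(asyncio.run(run_demo()))  # ['HOLA', None, 'ADIOS']
```

`asyncio.gather` preserves input order, which is why failed items can later be paired with their sources and dropped by checking for `None`.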

#### Step 2.5: Translate CPT Dataset

We will now translate the CPT dataset samples to Spanish. For this tutorial, we will use a subset of 512 samples and skip any samples that were not translated within the timeout period.


In [None]:
MAX_CPT_SAMPLES = 512

cpt_dataset = cpt_dataset.take(MAX_CPT_SAMPLES)
cpt_dataset_translated = await tqdm.gather(
    *[
        translate_text(
            sample["text"]
        )
        for sample in cpt_dataset
    ]
)

We will now verify the translation quality by inspecting a random sample from the translated dataset.


In [None]:
MAX_CHARS = 128

paired_samples_cpt = [
    (en_sample["text"], es_sample)
    for en_sample, es_sample in zip(cpt_dataset, cpt_dataset_translated)
    if es_sample is not None
]
cpt_dataset_es = [sample[1] for sample in paired_samples_cpt]
print(
    f"{len(cpt_dataset_es)}/{MAX_CPT_SAMPLES} samples translated successfully."
)

# randrange excludes the upper bound, so the index is always valid
ind = random.randrange(len(paired_samples_cpt))
print("Original:\n", paired_samples_cpt[ind][0][:MAX_CHARS], "...")
print("Translated:\n", paired_samples_cpt[ind][1][:MAX_CHARS], "...")
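
Inspecting one sample by eye does not scale to the whole dataset. As a rough complement, a character-length ratio can flag translations that look truncated or padded. This is only a heuristic sketch: `length_ratio_outliers` is a hypothetical helper and the `low` and `high` thresholds are assumptions that would need tuning on real data.

```python
def length_ratio_outliers(pairs, low=0.5, high=2.0):
    """Return indices of (source, translation) pairs whose length ratio is outside [low, high]."""
    flagged = []
    for index, (source, translation) in enumerate(pairs):
        if not source:
            continue  # nothing to compare against
        ratio = len(translation) / len(source)
        if ratio < low or ratio > high:
            flagged.append(index)
    return flagged


pairs = [
    ("The sum of two even numbers is even.", "La suma de dos números pares es par."),
    ("Prove that the square root of two is irrational.", "Demuestra"),
]
print(length_ratio_outliers(pairs))  # [1]
```

A flagged pair is a candidate for manual review, not proof of a bad translation.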


#### Step 2.6: Translate SFT Dataset

We will now translate the SFT dataset samples to Spanish. For this tutorial, we will take the first 256 samples from the streamed math split; the code below applies no explicit filter, so it relies on these samples already containing reasoning traces. As with CPT, we will skip samples that were not translated within the timeout period.


In [None]:
MAX_SFT_SAMPLES = 256

sft_dataset_subset = []
for sample in sft_dataset:
    sft_dataset_subset.append(sample)
    if len(sft_dataset_subset) >= MAX_SFT_SAMPLES:
        break

sft_dataset_translated = await tqdm.gather(
    *[
        translate_sft_sample(
            sample
        )
        for sample in sft_dataset_subset
    ]
)
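
If you want to keep only samples that contain a reasoning trace, a lazy filter over the stream is one option. The sketch below assumes that a reasoning trace is marked by `</think>` inside an assistant message; `has_reasoning_trace` and `take_matching` are hypothetical helpers, and the toy stream stands in for the streamed dataset.

```python
from itertools import islice


def has_reasoning_trace(sample: dict) -> bool:
    """True if any assistant message contains a closing </think> tag."""
    return any(
        m["role"] == "assistant" and "</think>" in m["content"]
        for m in sample["messages"]
    )


def take_matching(stream, predicate, limit: int) -> list:
    """Collect the first `limit` items of an iterable that satisfy predicate."""
    # filter() and islice() are lazy, so the stream is read only as far as needed
    return list(islice(filter(predicate, stream), limit))


toy_stream = [
    {"messages": [{"role": "user", "content": "1+1?"},
                  {"role": "assistant", "content": "2"}]},
    {"messages": [{"role": "user", "content": "2+2?"},
                  {"role": "assistant", "content": "<think>2+2=4</think>4"}]},
    {"messages": [{"role": "user", "content": "3+3?"},
                  {"role": "assistant", "content": "<think>3+3=6</think>6"}]},
]
subset = take_matching(toy_stream, has_reasoning_trace, limit=1)
print(len(subset))  # 1
```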

We will verify the SFT translation quality by inspecting a random sample from the translated dataset.


In [None]:
paired_samples_sft = [
    (en_sample, es_sample)
    for en_sample, es_sample in zip(sft_dataset_subset, sft_dataset_translated)
    if es_sample is not None
]
sft_dataset_es = [sample[1] for sample in paired_samples_sft]
print(
    f"{len(sft_dataset_es)}/{MAX_SFT_SAMPLES} samples translated successfully."
)

# randrange excludes the upper bound, so the index is always valid
ind = random.randrange(len(paired_samples_sft))
print("Original:")
pprint(paired_samples_sft[ind][0]["messages"])
print("Translated:")
pprint(paired_samples_sft[ind][1]["messages"])

#### Step 2.7: Save Translated Datasets

We will now save the translated datasets to disk for use in the training pipeline.


In [None]:
# Save CPT dataset
os.makedirs("data/cpt", exist_ok=True)

with open("data/cpt/cpt_dataset_es.jsonl", "w") as f:
    for text in cpt_dataset_es:
        f.write(json.dumps({"text": text}) + "\n")


In [None]:
# Save SFT dataset
os.makedirs("data/sft", exist_ok=True)

with open("data/sft/training.jsonl", "w") as f:
    for sample in sft_dataset_es:
        f.write(json.dumps({"messages": sample["messages"]}) + "\n")

# Placeholder validation dataset - we will not use it
shutil.copy("data/sft/training.jsonl", "data/sft/validation.jsonl")
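
A quick round-trip check can confirm that the saved files parse back into the same records. The sketch below also shows an optional variation: by default `json.dumps` escapes non-ASCII characters (for example `ñ` becomes `\u00f1`), which is valid JSON but harder to read, so it writes with `ensure_ascii=False` and an explicit UTF-8 encoding. `write_jsonl` and `read_jsonl` are hypothetical helpers, and the file lives in a temporary directory so nothing in `data/` is touched.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line, keeping accented characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [{"text": "La raíz cuadrada de 9 es 3."}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    write_jsonl(path, records)
    restored = read_jsonl(path)

print(restored == records)  # True
```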

#### Step 2.8: Clean Up Translation Endpoint

We will now remove the translation endpoint to free up resources for the training pipeline.


In [None]:
!lep endpoint remove -n $TRANSLATION_ENDPOINT_NAME

### Step 3: Prepare Model and Tokenize Data

#### Step 3.1: Import Base Model to NeMo Format

We will now import the Qwen2.5-7B-Instruct model from Hugging Face and convert it to NeMo format for use in CPT and SFT. This conversion is necessary to utilize NeMo's training capabilities.

> **NOTE:** This process may take several minutes depending on your network speed.


In [None]:
!python -c 'from nemo.collections import llm; llm.import_ckpt(model=llm.Qwen2Model(llm.Qwen25Config7B()), source="hf://Qwen/Qwen2.5-7B-Instruct", output_path="nemo_checkpoints/Qwen2.5-7B-Instruct", overwrite=True)'

#### Step 3.2: Tokenize CPT Dataset

We will now convert the translated Spanish text into tokenized format suitable for Megatron-LM pretraining. This step will create memory-mapped binary files (.bin/.idx) that enable efficient data loading during training.


In [None]:
%%bash
export TOKENIZERS_PARALLELISM=false
python /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=data/cpt/cpt_dataset_es.jsonl \
    --json-keys=text \
    --tokenizer-library=huggingface \
    --tokenizer-type=Qwen/Qwen2.5-7B-Instruct \
    --output-prefix=data/cpt/math_es_tokenized \
    --workers=8 \
    --append-eod
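
After preprocessing, it is worth confirming that the binary and index files exist before launching training. The sketch below assumes the outputs are named `<output-prefix>_<json-key>_document.bin` and `.idx`, which matches the `math_es_tokenized_text_document` path used in the CPT configuration later in this notebook. `expected_dataset_files` and `missing_files` are hypothetical helpers.

```python
from pathlib import Path


def expected_dataset_files(output_prefix: str, json_key: str = "text") -> list:
    """Build the .bin/.idx paths assumed to be produced for one JSON key."""
    base = f"{output_prefix}_{json_key}_document"
    return [Path(base + ".bin"), Path(base + ".idx")]


def missing_files(paths) -> list:
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not p.exists()]


files = expected_dataset_files("data/cpt/math_es_tokenized")
print([p.name for p in files])
print("Missing:", [str(p) for p in missing_files(files)])
```

An empty `Missing` list means both files are present; it does not validate their contents.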


### Step 4: Run Continued Pre-Training (CPT)

We will now perform Continued Pre-Training to adapt the base model to Spanish language patterns and mathematical domain knowledge. This step is crucial for:
- Improving Spanish language fluency
- Building domain-specific knowledge (mathematics)
- Preparing the model for downstream reasoning tasks

**Key Configuration:**
- Low learning rate (5e-6) to avoid catastrophic forgetting
- Constant learning rate (no scheduler)
- Regular checkpointing for evaluation

#### Step 4.1: Configure CPT Recipe

We will start with the baseline Qwen 2.5 7B pretraining recipe and customize it for our Spanish CPT task.


In [None]:
# Initialize baseline recipe for Qwen 2.5 7B
cpt = llm.qwen25_7b.pretrain_recipe(
    name="qwen_2.5_7b_cpt",  # Experiment identifier for tracking
    dir=os.path.abspath("logs_cpt/"),  # Output directory for logs and checkpoints
    num_nodes=1,  # Single-node training configuration
    num_gpus_per_node=4,  # Data parallel across 4 GPUs
)

# We will start from instruction-tuned Qwen 2.5 7B checkpoint
# Note: For production, we recommend: base model → CPT → instruction tuning
cpt.resume = run.Config(
    nl.AutoResume,
    restore_config=run.Config(
        nl.RestoreConfig,
        path=os.path.abspath("nemo_checkpoints/Qwen2.5-7B-Instruct")
    ),
    resume_if_exists=False,
)

# We must use the same tokenizer as in the dataset preprocessing step
tokenizer_path = os.path.abspath(
    "nemo_checkpoints/Qwen2.5-7B-Instruct/context/nemo_tokenizer"
)
tokenizer = run.Config(
    AutoTokenizer,
    pretrained_model_name=tokenizer_path
)

# We will initialize the pretraining dataset configuration
cpt_data_path = os.path.abspath('data/cpt/math_es_tokenized_text_document')
cpt.data = run.Config(
    PreTrainingDataModule,
    paths=[cpt_data_path],  # Path to tokenized dataset
    split="100,0,0",  # All data for training (no validation/test split)
    global_batch_size=16,  # Effective batch size across all GPUs
    micro_batch_size=1,  # Per-GPU batch size (memory constrained)
    num_workers=0,  # Data loading workers (0 = main process only)
    pin_memory=True,  # Pin memory for faster GPU transfer
    seq_length=4096,  # Maximum sequence length in tokens
    tokenizer=tokenizer,  # Tokenizer for processing
    seed=0  # For reproducible data shuffling
)

# We will configure training steps and validation settings
cpt.trainer.max_steps = 32  # Limited steps for notebook demo
cpt.trainer.val_check_interval = 32  # Skip validation for this demo
cpt.trainer.limit_val_batches = 0  # Skip validation batches

# We will configure checkpointing behavior
cpt.log.ckpt.every_n_train_steps = 32  # Save checkpoint every 32 steps
cpt.log.ckpt.every_n_epochs = None  # Disable epoch-based checkpointing
cpt.log.ckpt.train_time_interval = None  # Disable time-based checkpointing
cpt.log.ckpt.save_weights_only = True  # Reduce checkpoint size by saving weights only
cpt.log.ckpt.save_top_k = -1  # Keep all checkpoints
cpt.log.ckpt.monitor = None  # No metric monitoring
cpt.log.ckpt.save_last = False  # Don't save 'last' checkpoint separately
cpt.log.ckpt.filename = "qwen_2.5_7b_cpt_{step}"  # Checkpoint naming pattern

# We will configure learning rate with no scheduler for stable CPT
cpt.optim.config.lr = 5e-6  # Low learning rate to avoid catastrophic forgetting
cpt.optim.lr_scheduler = None  # Constant learning rate (no decay)
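
The batch settings above imply some simple arithmetic that is useful when scaling the run. The sketch below computes the number of gradient accumulation steps and the tokens consumed over the run. It assumes a data-parallel size of 4, which holds only if the recipe uses no tensor, pipeline, or context parallelism; the recipe defaults may differ, so treat that value as an assumption. Both function names are hypothetical.

```python
def gradient_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Micro-batches each data-parallel rank processes per optimizer step."""
    per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % per_pass != 0:
        raise ValueError("global batch size must be divisible by micro batch size x DP size")
    return global_batch_size // per_pass


def tokens_per_run(global_batch_size, seq_length, max_steps):
    """Tokens consumed when every sample is packed to seq_length."""
    return global_batch_size * seq_length * max_steps


print(gradient_accumulation_steps(16, 1, 4))  # 4
print(tokens_per_run(16, 4096, 32))           # 2097152
```

With the demo settings, the run consumes about 2.1 million tokens, which is why `max_steps` is described as limited.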

#### Step 4.2: Run CPT

We will now launch the CPT training using NeMo Run's Experiment API. This will train the model on Spanish mathematical text.


In [None]:
mounts = [{
    "path": os.getcwd(),
    "mount_path": os.getcwd(),
    "from": "node-nfs:lepton-shared-fs",
}]

executor = run.LeptonExecutor(
    nodes=1,
    nprocs_per_node=4,
    gpus_per_node=4,
    resource_shape="gpu.4xh200",
    container_image="nvcr.io/nvidia/nemo:25.07.gpt_oss",
    nemo_run_dir=os.getcwd(),
    mounts=mounts,
    node_group=os.environ["NODE_GROUP"],
    launcher="torchrun",
)

In [None]:
run.run(
    cpt,
    executor=executor,
    name="qwen_2.5_7b_cpt",
    detach=False,
)

In [None]:
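def find_checkpoint(run_folder: str) -> list:
    """Return checkpoint paths matching <run_folder>/<run>/checkpoints/<name>, sorted by path."""
    # Without recursive=True, "**" behaves like "*" and matches one path component
    return sorted(glob.glob(f"{run_folder}/**/checkpoints/**"))


# Sorting makes the choice deterministic; if run folders are timestamped
# (an assumption about the log layout), the last path is from the latest run
cpt_checkpoint = find_checkpoint("logs_cpt/qwen_2.5_7b_cpt")[-1]
print(cpt_checkpoint)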

### Step 5: Run Supervised Fine-Tuning (SFT)

After CPT, we will fine-tune the model on translated Spanish reasoning traces to develop step-by-step problem-solving capabilities. The model will learn to reason in English and output answers in Spanish.

**Key Configuration:**
- Higher context length (16384 tokens) for long reasoning traces
- Full fine-tuning (no PEFT) for maximum reasoning capability
- Constant learning rate for stable convergence

#### Step 5.1: Configure SFT Recipe

We will start with the baseline Qwen 2.5 7B fine-tuning recipe and customize it for our Spanish reasoning task.


In [None]:
# Initialize baseline recipe for Qwen 2.5 7B fine-tuning
sft = llm.qwen25_7b.finetune_recipe(
    name="qwen_2.5_7b_sft",  # Experiment identifier for tracking
    dir=os.path.abspath("logs_sft"),  # Output directory for logs and checkpoints
    num_nodes=1,  # Single-node training configuration
    num_gpus_per_node=4,  # Data parallel across 4 GPUs
    peft_scheme=None  # Full SFT (no parameter-efficient fine-tuning)
)

# We will start from the CPT checkpoint
sft.resume = run.Config(
    nl.AutoResume,
    restore_config=run.Config(
        nl.RestoreConfig,
        path=os.path.abspath(cpt_checkpoint)
    ),
    resume_if_exists=False,
)

# We must use the same tokenizer as in the dataset preprocessing step
tokenizer = run.Config(
    AutoTokenizer,
    pretrained_model_name=tokenizer_path
)

# We will initialize fine-tuning dataset using NeMo's ChatDataModule
sft.data = run.Config(
    ChatDataModule,
    dataset_root=os.path.abspath("data/sft"),  # training/validation.jsonl
    global_batch_size=16,  # Effective batch size across all GPUs
    micro_batch_size=1,  # Per-GPU batch size (memory constrained)
    pin_memory=True,  # Pin memory for faster GPU transfer
    seq_length=16384,  # Maximum sequence length for reasoning traces
    tokenizer=tokenizer,  # Tokenizer for text-to-token conversion
    use_hf_tokenizer_chat_template=True,  # Use HF chat template format
    seed=0,  # For reproducible data shuffling
    num_workers=0
)

# We will configure sequence length for long reasoning traces
sft.model.config.seq_length = 16384
# We will use a higher context parallel size to fit full sequences
# (the MegatronStrategy argument is named `context_parallel_size`)
sft.trainer.strategy.context_parallel_size = 4

# We will configure training steps and validation settings
sft.trainer.max_steps = 8  # Limited steps for notebook demo
sft.trainer.val_check_interval = 8  # Skip validation for this demo
sft.trainer.limit_val_batches = 0  # Skip validation batches

# We will configure checkpointing behavior
sft.log.ckpt.every_n_train_steps = 8  # Save checkpoint every 8 steps
sft.log.ckpt.every_n_epochs = None  # Disable epoch-based checkpointing
sft.log.ckpt.train_time_interval = None  # Disable time-based checkpointing
sft.log.ckpt.save_weights_only = True  # Reduce checkpoint size by saving weights only
sft.log.ckpt.save_top_k = -1  # Keep all checkpoints
sft.log.ckpt.monitor = None  # No metric monitoring
sft.log.ckpt.save_last = False  # Don't save 'last' checkpoint separately

# We will configure learning rate with no scheduler for stable convergence
sft.optim.config.lr = 5e-6  # Low learning rate for fine-tuning
sft.optim.lr_scheduler = None  # Constant learning rate (no decay)
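
To see how a context parallel size of 4 interacts with the 4 available GPUs, the layout can be worked out with a few lines of arithmetic. The sketch below assumes tensor and pipeline parallel sizes of 1; the fine-tuning recipe may set different defaults, so these values are assumptions to verify against your configuration. Both function names are hypothetical.

```python
def data_parallel_size(world_size, tensor_parallel=1, pipeline_parallel=1, context_parallel=1):
    """GPUs left for data parallelism after model and context parallelism."""
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP x PP x CP")
    return world_size // model_parallel


def tokens_per_gpu(seq_length, context_parallel):
    """Sequence shard length each GPU holds under context parallelism."""
    return seq_length // context_parallel


print(data_parallel_size(4, context_parallel=4))  # 1
print(tokens_per_gpu(16384, 4))                   # 4096
```

Under these assumptions, each 16384-token sequence is split into 4096-token shards, and the global batch of 16 is reached through gradient accumulation on a single data-parallel group.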

#### Step 5.2: Run SFT

We will now launch the SFT training using NeMo Run's Experiment API. This will teach the model to reason in English and output answers in Spanish.


In [None]:
run.run(
    sft,
    executor=executor,
    name="qwen_2.5_7b_sft",
    detach=False,
)

### Step 6: Export and Deploy the Model

#### Step 6.1: Export the Model to Hugging Face Format

We will now export the trained model from NeMo format to Hugging Face format for easier deployment and inference.

In [None]:
# Sort for a deterministic choice; the last path is taken as the latest checkpoint
sft_checkpoint = sorted(find_checkpoint("logs_sft/qwen_2.5_7b_sft"))[-1]

!python -c 'from nemo.collections import llm; \
    llm.export_ckpt("{sft_checkpoint}", \
    target="hf", output_path="qwen_sft_hf", overwrite=True)'

#### Step 6.2: Deploy Inference Endpoint

We will now deploy the fine-tuned model on the NVIDIA Lepton platform for inference. The endpoint serves the exported Hugging Face checkpoint (`qwen_sft_hf`) from the mounted shared file system.

> **NOTE:** If you exported the model to a different location, update `CHECKPOINT_PATH` to point to your actual checkpoint.


In [None]:
os.environ["ENDPOINT_NAME"] = "qwen-2-5-7b"

In [None]:
%%bash
# Configuration for inference endpoint
REPLICAS=1
CHECKPOINT_PATH="qwen_sft_hf"

# Deploy the inference endpoint with custom weights
/usr/local/bin/lep endpoint create -n $ENDPOINT_NAME --container-image "vllm/vllm-openai:latest" --image-pull-secrets $IMAGE_PULL_SECRET \
        --node-group $NODE_GROUP --resource-shape gpu.1xh200 --replicas-static $REPLICAS \
        --container-port 8000 --tokens $ACCESS_TOKEN \
        --mount "$(pwd):$(pwd):node-nfs:lepton-shared-fs" --container-command "vllm serve $(pwd)/$CHECKPOINT_PATH --served-model-name 'qwen_2.5-7b-sft' --port 8000 --gpu-memory-utilization 0.90 --trust-remote-code"


In [None]:
sft_endpoint_url = wait_for_endpoint(os.environ["ENDPOINT_NAME"])

#### Step 6.3: Query Endpoint

We will now test the deployed model by sending a query in Spanish and observing the response.

In [None]:
ENDPOINT = openai.AsyncOpenAI(
        api_key=os.environ["ACCESS_TOKEN"],
        base_url=sft_endpoint_url + "/v1",
    )

response = await ENDPOINT.chat.completions.create(
    model="qwen_2.5-7b-sft",
    messages=[
        {"role": "user", "content": "¿Qué es 2+2?"}
    ]
)
print(response.choices[0].message.content)

In [None]:
!lep endpoint remove -n $ENDPOINT_NAME

---
### Summary

In this notebook, you learned how to create a Spanish reasoning language model using NeMo 2.0 through a three-stage pipeline:

1. **Data Translation**: We translated English reasoning datasets to Spanish using a deployed NIM endpoint
2. **Continued Pre-Training**: We adapted the model to Spanish language patterns and mathematical domain
3. **Supervised Fine-Tuning**: We taught the model to reason in English and output answers in Spanish

The resulting model is trained to solve mathematical problems with English reasoning traces and Spanish answers, combining reasoning capability with Spanish language accessibility. The runs in this notebook are limited to 32 CPT steps and 8 SFT steps for demonstration, so expect to train longer and on more data before relying on the output quality.

For more information on NeMo Framework, visit the [official documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html).
