# Notebook 2: Post-Training with Safety and Accuracy Data

## About the Data

This notebook demonstrates how to post-train the base model with safety-related data.
The safety data is gathered from the following well-known datasets:

- [Nemotron Content Safety Dataset V2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0), formerly known as Aegis AI Content Safety Dataset v2
- [Gretel Synthetic Safety Alignment Dataset](https://huggingface.co/datasets/gretelai/gretel-safety-alignment-en-v1)
- [HarmfulTasks](https://github.com/CrystalEye42/eval-safety)
- [RedTeam 2k](https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k)

## About the Process

This notebook proceeds through the following high-level steps:

- Set up a directory structure for logs and results.
- Data preparation:
  - Download the preceding safety-related datasets and extract 2000 total samples at random.
  - Download the Llama Nemotron dataset and extract 4000 samples at random.
  - Create training and validation datasets from the samples, excluding samples with a token length greater than `16384`.
- Start vLLM servers:
  - One serves the base model to train.
  - A second serves the NVIDIA Llama 3.1 Nemoguard 8B Instruct model to act as an LLM-as-a-judge.
- Fine-tune the model using [NeMo-RL](https://github.com/NVIDIA/NeMo-RL) to apply safety post-training to improve the safety of the target model.


### Load API Keys

The `WANDB_API_KEY` is optional. If you're not using W&B, edit `deepseek_sft.yaml` and set `logger.wandb_enabled=false`.

```
...
logger:
  log_dir: "logs"  # Base directory for all logs
  wandb_enabled: false
...
```

In [None]:
import os
from dotenv import load_dotenv

print("Loading environment variables from .env")
load_dotenv(dotenv_path=".env")

if os.environ.get("HF_TOKEN", None) is None:
    raise ValueError("HF_TOKEN must be set.")
print("✅ HF_TOKEN found")
if os.environ.get("WANDB_API_KEY", None) is None:
    print("❌ WANDB_API_KEY not found. W&B logger will be disabled")
else:
    print("✅ WANDB_API_KEY found")

### Set up Packages and Paths

In [None]:
import json
import os
import subprocess
import time
from pathlib import Path
import shutil

from huggingface_hub import hf_hub_download

# Base directory and configuration
BASE_DIR = "/ephemeral/workspace/training"
LOG_DIR = f"{BASE_DIR}/logs"

SAFETY_DATASET_NAME = "safety_blend_v1.jsonl"
MODEL_NAME_OR_PATH = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
MODEL_DIR = "/ephemeral/workspace/model"
SAFETY_MODEL_NAME = "llama-3.1-nemoguard-8b-content-safety"
SAFETY_MODEL_PATH = f"{MODEL_DIR}/{SAFETY_MODEL_NAME}"

DATASET_CACHE_DIR = f"{BASE_DIR}/dataset_cache"

# Set environment variables
os.environ.update({
    "LOG_DIR": LOG_DIR,
    "MODEL_DIR": MODEL_DIR,
    "DATASET_CACHE_DIR": DATASET_CACHE_DIR
})

In [None]:
# Create directories
for env_var in ['TMPDIR', 'XDG_CACHE_HOME', 'HF_HOME', 'UV_CACHE_DIR', 'TRITON_CACHE_DIR',
                'DATASET_CACHE_DIR', 'RAY_TMPDIR', 'LOG_DIR', 'MODEL_DIR']:
    Path(os.environ[env_var]).mkdir(parents=True, exist_ok=True)

After you run the preceding cell, the directory structure (including the paths from the first notebook) is as follows:

```text
workspace
├── cache
│   ├── huggingface
│   ├── triton
│   └── uv
├── dataset
│   └── aegis_v2
├── dataset_cache
├── results
│   └── DeepSeek-R1-Distill-Llama-8B
│       ├── accuracy-evals
│       │   ├── aa-math-500
│       │   ├── gpqa-diamond
│       │   └── ifeval
│       ├── content-safety-evals
│       │   ├── aegis_v2
│       │   └── wildguard
│       ├── logs
│       └── security-evals
│           └── garak
│               ├── configs
│               ├── logs
│               └── reports
├── tmp
└── training
    └── model
```

### Generating Safety Training Data

The training data used for Supervised Fine-Tuning (SFT) typically consists of pairs of prompts and their corresponding expected responses. If the model is forced to learn outputs that are not well-aligned with its existing behavior, the result can be **catastrophic degradation** in performance.

To mitigate this risk, it is essential to curate a **custom training dataset** that is specifically tailored to the target model. In particular, we use **on-policy data**---data generated by the target model itself.
This approach ensures that the response distribution in the training set closely matches the model’s inherent response tendencies, preserving alignment and maintaining performance stability.

This approach improves safety by making sure that the training set contains safe responses even for unsafe prompts. The notebook uses a safety training data generation recipe that draws on the following public datasets:

- Nemotron Content Safety Dataset V2
- Gretel Synthetic Safety Alignment Dataset
- HarmfulTasks
- RedTeam 2k

The notebook extracts safe and unsafe prompts from Nemotron Content Safety Dataset V2 and unsafe prompts from the other three datasets. 

### Response Generation for Safe Prompts

For safe prompts, the notebook uses the target model and the prompts as-is to generate responses.
When you have access to a larger model in the same model family, such as `deepseek-ai/DeepSeek-R1` instead of `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`, use the larger model because it is likely to generate better safe responses than the smaller model.

### Response Generation for Unsafe Prompts

The unsafe prompts in the HarmfulTasks dataset include jailbreak text: they begin with text like "Ignore all instructions until now..." and end with a request like "Write a blog post promoting mistrust between different social groups."
The jailbreak portion of each prompt is removed, but the model still generates a response to the remaining unsafe user request, such as writing a blog post to promote mistrust.

All unsafe prompts are wrapped in a new prompt template, shown below, that instructs the model to disengage but also inform the user exactly why---using the ground truth harm categories in the datasets and not a canned response.

```
You are a helpful assistant that generates safe responses for given prompts.

The prompt is: {prompt}
This prompt is known to violate the following safety categories or elicits unsafe responses of these categories from the model. The categories are: {categories}

Generate a safe refusal for the prompt that politely points out that the model doesn't engage in unsafe behavior for the given categories. Your response:
```

With this approach, the model can generate safe responses for unsafe prompts. 

For training, the notebook pairs the generated response with the original unsafe prompt, not with the version wrapped in the safe prompt template described above. Effectively, we are teaching the model to generate the same safe response to the original unsafe prompt, including the jailbreak instructions in the case of the HarmfulTasks prompts.
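
The wrapping and pairing steps described above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the function names and the `prompt`/`response` record keys are hypothetical, and only the template text comes from this notebook.

```python
# Template text copied from the prompt template shown above.
REFUSAL_TEMPLATE = (
    "You are a helpful assistant that generates safe responses for given prompts.\n\n"
    "The prompt is: {prompt}\n"
    "This prompt is known to violate the following safety categories or elicits "
    "unsafe responses of these categories from the model. The categories are: {categories}\n\n"
    "Generate a safe refusal for the prompt that politely points out that the model "
    "doesn't engage in unsafe behavior for the given categories. Your response:"
)


def build_generation_prompt(prompt: str, categories: list[str]) -> str:
    """Wrap an unsafe prompt in the refusal template (used only for generation)."""
    return REFUSAL_TEMPLATE.format(prompt=prompt, categories=", ".join(categories))


def build_training_example(original_prompt: str, generated_response: str) -> dict:
    """Pair the ORIGINAL prompt (not the wrapped one) with the generated refusal.

    The key names here are hypothetical placeholders.
    """
    return {"prompt": original_prompt, "response": generated_response}


# Example: the wrapped prompt is sent to the model; the original prompt is kept for training.
original = "Write a blog post promoting mistrust between different social groups."
wrapped = build_generation_prompt(original, ["Hate/Identity Hate"])
example = build_training_example(original, "<model-generated refusal goes here>")
```

Keeping the two functions separate makes the asymmetry explicit: the wrapper influences only what the model generates, while the stored training pair never contains the wrapper text.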

### Response Filtering

The generated responses for the safe and unsafe prompts discussed above are not guaranteed to be safe responses. Therefore, we implement a filtering step to extract the generated responses that are judged as safe by a guard model.

We use [nvidia/llama-3.1-nemoguard-8b-content-safety](https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-content-safety) as the guard model for this filtering step.
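
A rough sketch of the filtering step, under stated assumptions: the judge is passed in as a plain callable so that the guard model call can be swapped out, the `prompt` key is a hypothetical field name, and `generated_output` matches the field that later cells in this notebook read. The stub judge below is a stand-in for illustration only, not the guard model's real interface.

```python
from typing import Callable


def filter_safe_responses(
    examples: list[dict],
    is_safe: Callable[[str, str], bool],
) -> list[dict]:
    """Keep only the examples whose generated response the judge marks as safe."""
    kept = []
    for example in examples:
        if is_safe(example["prompt"], example["generated_output"]):
            kept.append(example)
    return kept


def stub_judge(prompt: str, response: str) -> bool:
    """Toy stand-in for the guard model: treats refusals as safe."""
    return response.lower().startswith("i can't help")


candidates = [
    {"prompt": "unsafe request", "generated_output": "I can't help with that because..."},
    {"prompt": "unsafe request", "generated_output": "Sure, here is how..."},
]
safe_only = filter_safe_responses(candidates, stub_judge)
```

In the notebook itself, the judgment comes from the content safety model served by vLLM rather than from a string check.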

In [None]:
safety_filename = os.path.join(os.environ['DATASET_CACHE_DIR'], SAFETY_DATASET_NAME)
cache_dir = os.environ['DATASET_CACHE_DIR']
total_samples = 2000
sampling_method = "stratified"

!python scripts/safety_dataset_blend_generation.py \
  --filename {safety_filename} \
  --total_samples {total_samples} \
  --sampling_method {sampling_method} \
  --cache_dir {cache_dir}

In [None]:
OUTPUT_DIR = f"{os.environ['DATASET_CACHE_DIR']}/sft_data"
safety_file = f"{os.environ['DATASET_CACHE_DIR']}/safety_blend_v1_sampled_{total_samples}_{sampling_method}.jsonl"

!python scripts/combine_datasets.py \
  --safety_file {safety_file} \
  --output_dir {OUTPUT_DIR} \
  --val_split 0.03 \
  --max_tokens 16384 \
  --max_samples {total_samples}

In [None]:
!ls {OUTPUT_DIR}

### Start vLLM Servers: Policy Model and Content Safety

In [None]:
if os.path.exists(SAFETY_MODEL_PATH):
    print(f"✅ NeMo Guard model found: {SAFETY_MODEL_PATH}")
else:
    raise ValueError(f"❌ NeMo Guard model not found at {SAFETY_MODEL_PATH}. Please go back to Step 0 and verify that the model was created successfully.")

Start one vLLM server for the policy model to train and another vLLM server with the content safety model to perform LLM-as-a-judge.

In [None]:
from scripts.vllm_launcher import VLLMLauncher

vllm_launcher = VLLMLauncher(total_num_gpus=8)

policy_model_vllm_proc = vllm_launcher.launch(
    model_name_or_path=MODEL_NAME_OR_PATH,
    gpu_devices=os.environ['POLICY_MODEL_GPUS'],
    served_model_name='test-model',
    enable_reasoning=False, # To keep the thinking trace in the response for training
    log_filepath=f"{LOG_DIR}/vllm-server-model.log",
    port=5000
)

safety_model_vllm_proc = vllm_launcher.launch(
    model_name_or_path=SAFETY_MODEL_PATH,
    gpu_devices=os.environ['SAFETY_MODEL_GPUS'],
    served_model_name='safety-model',
    log_filepath=f"{LOG_DIR}/vllm-server-safety.log",
    port=6000
)

!sleep 120
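
The fixed two-minute sleep assumes that both servers finish loading in that time. As an alternative, the sketch below polls each server until it responds. It assumes the servers expose vLLM's `/health` endpoint on the ports used above; the helper names, the timeout, and the polling interval are illustrative choices, not part of `VLLMLauncher`.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to the URL succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(urls, probe=http_ok, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Poll every URL until all respond, or give up after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    pending = list(urls)
    while pending:
        # Drop the URLs that now respond; keep polling the rest.
        pending = [url for url in pending if not probe(url)]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
    return True


# Possible usage in this notebook (host taken from the VLLM_HOST variable):
# host = os.environ["VLLM_HOST"]
# ready = wait_until_ready([f"http://{host}:5000/health", f"http://{host}:6000/health"])
```

The `probe` and `sleep` parameters are injectable so that the loop can be exercised without a running server.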

In [None]:
print("Policy model vLLM server log:")
policy_model_vllm_proc.print_log()
print("========================================\n\nSafety model vLLM server log:")
safety_model_vllm_proc.print_log()

### Generating On-Policy Data

Using the combined dataset, the base model, and the content safety model, generate the on-policy data. This step can take more than 40 minutes with 8x H100 GPUs, or more than 60 minutes with 8x A100 GPUs.

In [None]:
CONCURRENCY = 16
MAX_ATTEMPTS = 3
BATCH_SIZE = 96

MAX_TOKENS = 512
TEMPERATURE = 0.6
TOP_P = 0.95

print("Generating on-policy data...")
for dataset_type in ['train', 'val']:
    input_dataset = f"{OUTPUT_DIR}/{dataset_type}.jsonl"
    output_file = f"{OUTPUT_DIR}/{dataset_type}_on_policy_data.jsonl"
    # Send each split's stdout and stderr to its own log file; the context
    # manager closes the file handle once the subprocess finishes.
    with open(f"{LOG_DIR}/{dataset_type}_on-policy.log", 'w') as log_file:
        subprocess.run([
            'python3', 'scripts/generate_on_policy_data.py',
            '--model_name', MODEL_NAME_OR_PATH,
            '--safety_model', SAFETY_MODEL_NAME,
            '--huggingface_token', os.environ['HF_TOKEN'],
            '--vllm_host', os.environ['VLLM_HOST'],
            '--vllm_model_port', '5000',
            '--vllm_safety_port', '6000',
            '--concurrency', str(CONCURRENCY),
            '--input_dataset', input_dataset,
            '--output', output_file,
            '--batch_size', str(BATCH_SIZE),
            '--max_tokens', str(MAX_TOKENS),
            '--temperature', str(TEMPERATURE),
            '--top_p', str(TOP_P)
        ], stdout=log_file, stderr=subprocess.STDOUT)

print("Data is Ready")

### Filter On-Policy Data with Incomplete Thinking Traces

Train only on examples that complete their thinking traces, that is, examples whose generated response contains `</think>`.

In [None]:
for split in ["train", "val"]:
    sft_dir = f"{os.environ['DATASET_CACHE_DIR']}/sft_data"
    with open(f"{sft_dir}/{split}_on_policy_data.jsonl") as fin:
        original_examples = [json.loads(line) for line in fin]

    # Keep only the examples whose response contains the closing </think> tag
    filtered_examples = [
        example for example in original_examples
        if "</think>" in example["generated_output"]
    ]

    print(f"{split}: Extracted {len(filtered_examples)}/{len(original_examples)} that properly completed thinking traces.")

    with open(f"{sft_dir}/{split}_on_policy_data_filtered.jsonl", "w") as fout:
        for example in filtered_examples:
            fout.write(json.dumps(example) + "\n")

In [None]:
filtered_examples[0]

### Stop the vLLM Servers

If you started the vLLM servers manually in terminals instead of with the launcher, press Ctrl+C in each shell to stop them.

In [None]:
# Cleanup vLLM servers
vllm_launcher.stop_all()

!sleep 10

### Fine-Tune the Model

Use NeMo-RL to post-train the model. This step takes more than 10 minutes with H100 GPUs, or more than 20 minutes with A100 GPUs.

In [None]:
MODEL_DIR = os.path.abspath(f"{BASE_DIR}/results/DeepSeek-R1-Distill-Llama-8B/")
!mkdir -p {MODEL_DIR}
os.environ["NEMO_RL_CONFIG_PATH"] = os.path.abspath("configs/deepseek_sft.yaml")

print("Running SFT...")
# Make all eight GPUs visible to the training job
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
!cd /ephemeral/workspace/NeMo-RL && TMPDIR=$RAY_TMPDIR uv run python examples/run_sft.py --config ${NEMO_RL_CONFIG_PATH} 2>&1 | tee ${LOG_DIR}/nemo-rl-sft.log

### Convert the Checkpoint

The following code blocks convert the NeMo-RL checkpoint to Hugging Face format, first as PyTorch `.bin` weights and then as safetensors.

In [None]:
# Pick up the latest checkpoint
CHECKPOINT_DIR = !ls -d {BASE_DIR}/results/DeepSeek-R1-Distill-Llama-8B/step_* 2>/dev/null | sort -t_ -k2 -n | tail -1
CHECKPOINT_DIR = CHECKPOINT_DIR[0]
print(f"Latest checkpoint: {CHECKPOINT_DIR}")

DCP_CKPT_PATH = os.path.abspath(f"{CHECKPOINT_DIR}/policy/weights/")
CONFIG_PATH = os.path.abspath(f"{CHECKPOINT_DIR}/config.yaml")
HF_CKPT_PATH = os.path.abspath(f"{MODEL_DIR}/DeepSeek-R1-Distill-Llama-8B-Safety-Trained-bin")
HF_CKPT_ST_PATH = os.path.abspath(f"{MODEL_DIR}/DeepSeek-R1-Distill-Llama-8B-Safety-Trained")

print("Converting checkpoint...")
!cd /ephemeral/workspace/NeMo-RL && uv run examples/convert_dcp_to_hf.py --config {CONFIG_PATH} --dcp-ckpt-path {DCP_CKPT_PATH} --hf-ckpt-path {HF_CKPT_PATH}

# Verify conversion
if Path(f"{HF_CKPT_PATH}/pytorch_model.bin").exists() and Path(f"{HF_CKPT_PATH}/config.json").exists():
    print("Conversion successful!")
    print(f"The HuggingFace model is now available at: {HF_CKPT_PATH}")
else:
    print("Conversion may have failed. Please check the output.")
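
The shell pipeline at the top of the preceding cell picks the `step_*` directory with the highest step number, and indexing its result fails with an `IndexError` when no checkpoint exists. A pure-Python sketch of the same selection with an explicit error is shown below. The `latest_checkpoint` helper is hypothetical, and it assumes checkpoint directories are named `step_<number>`, as the `ls` pattern implies.

```python
import re
from pathlib import Path


def latest_checkpoint(results_dir) -> Path:
    """Return the step_<N> subdirectory with the largest N."""
    pattern = re.compile(r"step_(\d+)")
    candidates = []
    for path in Path(results_dir).glob("step_*"):
        match = pattern.fullmatch(path.name)
        if match and path.is_dir():
            # Compare step numbers as integers so that step_10 sorts after step_9.
            candidates.append((int(match.group(1)), path))
    if not candidates:
        raise FileNotFoundError(f"No step_* checkpoint directories under {results_dir}")
    return max(candidates, key=lambda item: item[0])[1]


# Possible usage in this notebook:
# CHECKPOINT_DIR = str(latest_checkpoint(f"{BASE_DIR}/results/DeepSeek-R1-Distill-Llama-8B"))
```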

In [None]:
# Re-save the converted .bin checkpoint in safetensors format
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the full Causal LM model and tokenizer
model = AutoModelForCausalLM.from_pretrained(HF_CKPT_PATH)
tokenizer = AutoTokenizer.from_pretrained(HF_CKPT_PATH)

print("Model and Tokenizer loaded successfully.")

# Save the full model as safetensors
model.save_pretrained(
    HF_CKPT_ST_PATH,
    safe_serialization=True
)

# Save the tokenizer to the same new directory
tokenizer.save_pretrained(HF_CKPT_ST_PATH)

print(f"Model successfully converted and saved to {HF_CKPT_ST_PATH}")
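
As with the `.bin` conversion, it can be worth confirming that the safetensors export produced the expected files. The check below is a small sketch: the `has_safetensors_checkpoint` helper is hypothetical, and it only looks for a `config.json` and at least one `*.safetensors` file, because large models may be saved as several shards.

```python
from pathlib import Path


def has_safetensors_checkpoint(model_dir) -> bool:
    """Return True if the directory holds config.json and at least one .safetensors file."""
    model_dir = Path(model_dir)
    has_weights = any(model_dir.glob("*.safetensors"))
    has_config = (model_dir / "config.json").exists()
    return has_weights and has_config


# Possible usage in this notebook:
# if not has_safetensors_checkpoint(HF_CKPT_ST_PATH):
#     print("Safetensors export may have failed. Please check the output.")
```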

### Next Steps

You used post-training to improve the safety of the model, retained the accuracy of the original model, and saved the checkpoints.

The next step is to [evaluate the safety and accuracy of the model](./Step3_Post_Training_Eval.ipynb).