# Notebook 3: Safety and Accuracy Evaluation for the Safety Post-trained Model

In this notebook, we rerun the same set of evaluations on the safety post-trained model to see how its **Content Safety** and **Product Security** scores have improved.
Since this is the second run, we skip the detailed explanation of each dataset and focus on collecting the scores.

### Before You Begin


In [None]:
import os
!pip install ipywidgets > {os.devnull} 2>&1

import json
from pathlib import Path
import sys

from dotenv import load_dotenv
import pandas as pd
from tqdm import tqdm

# Load the keys from .env into os.environ
load_dotenv(dotenv_path=".env")
print("Loaded environment variables from .env.")

if os.environ.get("HF_TOKEN", None) is None or os.environ.get("JUDGE_API_KEY", None) is None:
    raise ValueError("HF_TOKEN and JUDGE_API_KEY must be set.")
print("✅ HF_TOKEN and JUDGE_API_KEY found")

if os.environ.get("RUN_FULL_EVAL", None) is None:
    print("RUN_FULL_EVAL is not configured. RUN_FULL_EVAL will be disabled")
    os.environ["RUN_FULL_EVAL"] = "0"
else:
    print(f"RUN_FULL_EVAL configuration found: {os.environ['RUN_FULL_EVAL']}")
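
Later cells read several more variables directly with `os.environ[...]`, which raises `KeyError` when one is unset. The following is a minimal sketch of an up-front check. The helper name `missing_env_vars` is hypothetical; the variable names are the ones read further down in this notebook.

```python
import os

# Variables that later cells in this notebook read with os.environ[...].
REQUIRED_ENV_VARS = [
    "POLICY_MODEL_GPUS",
    "POLICY_MODEL_GPUS_FULL",
    "NEMOGUARD_MODEL_GPUS",
    "WILDGUARD_MODEL_GPUS",
    "AEGIS_CONFIG_OVERRIDES",
    "WILDGUARD_CONFIG_OVERRIDES",
    "GPQAD_CONFIG_OVERRIDES",
    "AA_MATH_500_CONFIG_OVERRIDES",
    "IFEVAL_CONFIG_OVERRIDES",
]


def missing_env_vars(names, env=None):
    """Return the names that are not set in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if name not in env]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Running a check like this early surfaces a configuration gap before a long evaluation starts, rather than partway through the notebook.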

Specify paths for data, evaluation results, and the safety-trained model to evaluate.

```text
workspace
├── dataset
│   └── aegis_v2
└── results
    └── DeepSeek-R1-Distill-Llama-8B-Safety-Trained
        ├── accuracy-evals
        │   ├── aa-math-500
        │   ├── gpqa-diamond
        │   └── ifeval
        ├── content-safety-evals
        │   ├── aegis_v2
        │   └── wildguard
        ├── logs
        └── security-evals
            └── garak
                ├── configs
                ├── logs
                └── reports
```

In [None]:
BASE_DIR = "/ephemeral/workspace"
DATASET_DIR = f"{BASE_DIR}/dataset/"
ORIGINAL_MODEL_TAG_NAME = "DeepSeek-R1-Distill-Llama-8B"
#TRAINED_MODEL_PATH = f"{BASE_DIR}/training/model/DeepSeek-R1-Distill-Llama-8B-Safety-Trained"
TRAINED_MODEL_PATH = f"{BASE_DIR}/training/results/DeepSeek-R1-Distill-Llama-8B/DeepSeek-R1-Distill-Llama-8B-Safety-Trained"
MODEL_NAME_OR_PATH = TRAINED_MODEL_PATH
MODEL_TAG_NAME = TRAINED_MODEL_PATH.split("/")[-1]
MODEL_OUTPUT_DIR = f"{BASE_DIR}/results/{MODEL_TAG_NAME}/"
LOG_DIR = f"{MODEL_OUTPUT_DIR}/logs/"

# * Dataset
NEMOGUARD_MODEL_PATH = f"{BASE_DIR}/model/llama-3.1-nemoguard-8b-content-safety"
AEGIS_V2_TEST_DIR = f"{DATASET_DIR}/aegis_v2"

# * Content Safety benchmark
CONTENT_SAFETY_RESULTS_DIR = f"{MODEL_OUTPUT_DIR}/content-safety-evals"
AEGIS_V2_RESULTS_DIR = f"{CONTENT_SAFETY_RESULTS_DIR}/aegis_v2"
WILDGUARD_RESULTS_DIR = f"{CONTENT_SAFETY_RESULTS_DIR}/wildguard"

# * Security benchmark
SECURITY_RESULTS_DIR = f"{MODEL_OUTPUT_DIR}/security-evals"
GARAK_RESULTS_DIR = f"{SECURITY_RESULTS_DIR}/garak"
GARAK_CONFIG_DIR = f"{GARAK_RESULTS_DIR}/configs"
GARAK_LOG_DIR = f"{GARAK_RESULTS_DIR}/logs"
GARAK_REPORT_DIR = f"{GARAK_RESULTS_DIR}/reports"

# * Accuracy benchmark
ACCURACY_RESULTS_DIR = f"{MODEL_OUTPUT_DIR}/accuracy-evals"
GPQA_DIAMOND_RESULTS_DIR = f"{ACCURACY_RESULTS_DIR}/gpqa-diamond"
AA_MATH_500_RESULTS_DIR = f"{ACCURACY_RESULTS_DIR}/aa-math-500"
IFEVAL_RESULTS_DIR = f"{ACCURACY_RESULTS_DIR}/ifeval"

In [None]:
# Create directories to store logs and results
for path in [LOG_DIR, AEGIS_V2_TEST_DIR, AEGIS_V2_RESULTS_DIR, WILDGUARD_RESULTS_DIR,
             GARAK_RESULTS_DIR, GARAK_CONFIG_DIR, GARAK_LOG_DIR, GARAK_REPORT_DIR,
             GPQA_DIAMOND_RESULTS_DIR, AA_MATH_500_RESULTS_DIR, IFEVAL_RESULTS_DIR]:
    Path(path).mkdir(parents=True, exist_ok=True)

## Serve the Safety Trained Model with vLLM

Start a locally running vLLM server to serve the safety-trained model.
After you start the server, run each of the evaluation tools against the safety-trained model so that its scores can be compared with the baseline from Notebook 1.

In [None]:
from scripts.vllm_launcher import VLLMLauncher

vllm_launcher = VLLMLauncher(total_num_gpus=8)

model_vllm_proc = vllm_launcher.launch(
    model_name_or_path=MODEL_NAME_OR_PATH,
    gpu_devices=os.environ['POLICY_MODEL_GPUS'],
    served_model_name='test-model',
    enable_reasoning=True,
    log_filepath=f"{LOG_DIR}/vllm-server-model.log",
    port=5000,
)

!sleep 120

Check if the vLLM server has been correctly launched.

Run the following code block. If you see a message like the one below, the server is ready to use.

```
INFO:     Started server process [<pid>]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

In [None]:
# Check if the vLLM server is correctly launched
model_vllm_proc.print_log()
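
A fixed `sleep` can be too short on a slow start and needlessly long on a fast one. As an alternative, the sketch below polls an HTTP endpoint until it responds. It assumes the server exposes vLLM's `/health` route on the port used above; the helper names `http_ok` and `wait_until_ready` are hypothetical and are not part of `scripts.vllm_launcher`.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, max_wait=600.0, interval=5.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe(url) succeeds or max_wait seconds have elapsed."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

For example, `wait_until_ready("http://localhost:5000/health")` would return `True` once the server answers, or `False` after ten minutes without a response.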

If you need to shut down the vLLM server (for example, to relaunch it), uncomment the command in the following code block and run it.

In [None]:
#model_vllm_proc.stop()

### Evaluating Content Safety with Aegis 2.0

### Serve the NeMo Guard Model with vLLM

First, confirm that the NeMo Guard model is available at the expected path.


In [None]:
if os.path.exists(NEMOGUARD_MODEL_PATH):
    print(f"✅ NeMo Guard model found: {NEMOGUARD_MODEL_PATH}")
else:
    raise ValueError(f"❌ NeMo Guard model not found at {NEMOGUARD_MODEL_PATH}. Please go back to Step 0 and verify that the model was created successfully.")

Then, launch a vLLM server for the NeMo Guard model. We also launch a vLLM server for WildGuard, which is covered later.

In [None]:
# Launch vLLM server for NeMo Guard model
nemoguard_vllm_proc = vllm_launcher.launch(
    model_name_or_path=NEMOGUARD_MODEL_PATH,
    gpu_devices=os.environ['NEMOGUARD_MODEL_GPUS'],
    served_model_name='llama-3.1-nemoguard-8b-content-safety',
    log_filepath=f"{LOG_DIR}/vllm-server-nemo-guard-model.log",
    port=6000,
)

# Launch vLLM server for WildGuard model
WILDGUARD_MODEL_PATH = 'allenai/wildguard'
wildguard_vllm_proc = vllm_launcher.launch(
    model_name_or_path=WILDGUARD_MODEL_PATH,
    gpu_devices=os.environ['WILDGUARD_MODEL_GPUS'],
    served_model_name='allenai/wildguard',
    log_filepath=f"{LOG_DIR}/vllm-server-wildguard.log",
    port=7000,
)

!sleep 120

In [None]:
# Check if the servers were properly launched
print("Checking vLLM server for NeMo Guard")
nemoguard_vllm_proc.print_log()

print("=====\n\n")
print("Checking vLLM server for WildGuard")
wildguard_vllm_proc.print_log()

After the judge servers start, run the Aegis 2.0 evaluation against the safety-trained model.

The evaluation requires approximately 30 minutes to complete.

In [None]:
# Full or partial evaluation 
aegis_overrides = os.environ["AEGIS_CONFIG_OVERRIDES"]
aegis_overrides += ",config.params.extra.judge.url=http://localhost:6000/v1"  # Change this if you use a different endpoint
if os.environ["RUN_FULL_EVAL"] == "0":
    aegis_overrides += ",config.params.limit_samples=200"

# Run evaluation 
!core-evals-safety-eval run_eval \
               --output_dir {AEGIS_V2_RESULTS_DIR} \
               --model_id "test-model" \
               --model_url http://localhost:5000/v1 \
               --model_type chat \
               --eval_type aegis_v2 \
               --overrides="{aegis_overrides}" &> "$LOG_DIR/safety-eval-aegis-v2-vllm.log"
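
The override strings in this notebook are built by appending `,key=value` fragments to a value read from the environment. If that base value is empty, the result starts with a stray comma. The sketch below shows one way to join fragments defensively. The helper name `join_overrides` is hypothetical, and it assumes that no override value legitimately begins or ends with a comma.

```python
def join_overrides(*parts):
    """Join override fragments with commas, skipping empty fragments."""
    cleaned = []
    for part in parts:
        # Tolerate None, surrounding whitespace, and leading/trailing commas.
        fragment = (part or "").strip().strip(",")
        if fragment:
            cleaned.append(fragment)
    return ",".join(cleaned)


print(join_overrides("", "config.params.limit_samples=200"))
# prints: config.params.limit_samples=200
```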

Check the log file `$LOG_DIR/safety-eval-aegis-v2-vllm.log` for progress.

Let's take a look at the safety score and category breakdown. 

In [None]:
aegis_v2_df = pd.read_csv(f"{AEGIS_V2_RESULTS_DIR}/output.csv")

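aegis_v2_label_s = aegis_v2_df["Response Safety 0"].value_counts()
# Use .get() so that a run with no "unsafe" (or no "safe") responses does not raise KeyError
num_safe = aegis_v2_label_s.get("safe", 0)
num_unsafe = aegis_v2_label_s.get("unsafe", 0)
aegis_v2_safety_score = num_safe / (num_safe + num_unsafe)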

print(f"Aegis v2 safety score: {aegis_v2_safety_score*100:.2f}%")
print("===============================================")
print("The category distribution for unsafe responses:")

unsafe_aegis_v2_df = aegis_v2_df[aegis_v2_df["Response Safety 0"] == "unsafe"]
unsafe_aegis_v2_df["Safety Categories 0"].value_counts()

### Evaluating Content Safety with WildGuard

In [None]:
# Full or partial evaluation 
wildguard_overrides = os.environ["WILDGUARD_CONFIG_OVERRIDES"]
wildguard_overrides += ",config.params.extra.judge.url=http://localhost:7000/v1"  # Change this if you use a different endpoint
if os.environ["RUN_FULL_EVAL"] == "0":
    wildguard_overrides += ",config.params.limit_samples=200"
   
# Run evaluation
!core-evals-safety-eval run_eval \
               --output_dir {WILDGUARD_RESULTS_DIR} \
               --model_id "test-model" \
               --model_url http://localhost:5000/v1 \
               --model_type chat \
               --eval_type wildguard \
               --overrides="{wildguard_overrides}" &> "$LOG_DIR/safety-eval-wildguard-vllm.log"

Check the `$LOG_DIR/safety-eval-wildguard-vllm.log` file for progress.
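
To follow progress without leaving the notebook, you can print the end of a log file from a separate cell. This is a minimal sketch; the helper name `tail` is hypothetical.

```python
from collections import deque
from pathlib import Path


def tail(path, n=20):
    """Return the last n lines of a text file as a list of strings."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the most recent n lines in memory.
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]
```

For example, `print("\n".join(tail(f"{LOG_DIR}/safety-eval-wildguard-vllm.log")))` would show the last 20 lines of the WildGuard evaluation log.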

When the evaluation completes, compute the WildGuard safety score.
This notebook uses the ratio of safe responses to the total number of safe and unsafe responses as the metric.

In [None]:
wildguard_results = json.load(open(f"{WILDGUARD_RESULTS_DIR}/metrics.json"))
wildguard_safe_score = wildguard_results["safe"] / (wildguard_results["safe"] + wildguard_results["unsafe"])

print(f"WildGuard safety score: {wildguard_safe_score * 100.:.2f}%")
print("===============================================")
print("The category distribution for unsafe responses:")
wildguard_df = pd.DataFrame([json.loads(x) for x in open(f"{WILDGUARD_RESULTS_DIR}/report.json")])
wildguard_unsafe_df = wildguard_df[wildguard_df["wildguard_response_safety_classification"] == "unsafe"]
wildguard_unsafe_df["subcategory"].value_counts()
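
Both content safety scores in this notebook use the same ratio, safe / (safe + unsafe). The sketch below computes it from a list of labels. The helper name `safe_ratio` is hypothetical, and ignoring labels other than `safe` and `unsafe` is an assumption about how to treat judge outputs that could not be classified.

```python
from collections import Counter


def safe_ratio(labels):
    """Share of 'safe' among labels that are 'safe' or 'unsafe'; None if there are none."""
    counts = Counter(labels)
    judged = counts["safe"] + counts["unsafe"]
    if judged == 0:
        return None  # nothing to score; avoids division by zero
    return counts["safe"] / judged


# Toy labels for illustration only, not evaluation output.
print(safe_ratio(["safe", "safe", "safe", "unsafe"]))  # prints 0.75
```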

### Shutting down guard models and re-launching vLLM server

Now you can shut down the vLLM servers for the two guard models and use all the GPUs for the target model, which makes the remaining evaluations more efficient.

In [None]:
# Terminate the running vllm processes
vllm_launcher.stop_all()

!sleep 10

print("Starting policy model server...")
model_vllm_full_proc = vllm_launcher.launch(
    model_name_or_path=MODEL_NAME_OR_PATH,
    gpu_devices=os.environ['POLICY_MODEL_GPUS_FULL'],
    served_model_name='test-model',
    enable_reasoning=True,
    log_filepath=f"{LOG_DIR}/vllm-server-model.log",
    port=5000
)    

!sleep 180

In [None]:
# Check if the vLLM server is correctly launched
model_vllm_full_proc.print_log()

### Evaluating Product Security with Garak

In [None]:
# Full or partial evaluation 
if os.environ["RUN_FULL_EVAL"] == "1":
    GARAK_SAMPLE_SIZE_CONFIG = ""    
else:
    GARAK_SAMPLE_SIZE_CONFIG = "--target_probes tier1"

# Run evaluation
!python scripts/run_garak.py --output_basedir {GARAK_RESULTS_DIR} --base_config_path ./configs/garak_base_config.yaml --max_workers 4 {GARAK_SAMPLE_SIZE_CONFIG}

garak stores the results of each probe individually and produces an HTML report with the success and failure rate for each probe.
Aggregate and summarize the scan results.

In [None]:
output_csv = os.path.join(GARAK_REPORT_DIR, "garak_results.csv")
garak_df = pd.read_csv(output_csv)
garak_df

Let's take a look at `z_score`, a score calibrated against reference models. A negative z-score indicates that the model is more vulnerable than the reference average for the corresponding probe.

In [None]:
garak_df[garak_df["z_score"] < 0]

The Z-score is a measure of how many standard deviations the model performs from the mean.
The mean is periodically calculated by garak developers from a _bag of models_ that represent state-of-the-art models at the time.
For more information about the models and the statistics, refer to [Interpreting results with a bag of models](https://github.com/NVIDIA/garak/blob/main/garak/data/calibration/bag.md) in the garak repository on GitHub.
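
To make the definition concrete, the sketch below applies the textbook z-score formula, (score - mean) / standard deviation. It is not garak's calibration code, and the numbers are placeholders rather than garak reference data; the linked document describes how garak derives its reference statistics.

```python
def z_score(score, reference_mean, reference_std):
    """Number of standard deviations that score lies from reference_mean."""
    if reference_std <= 0:
        raise ValueError("reference_std must be positive")
    return (score - reference_mean) / reference_std


# Placeholder values for illustration only: a pass rate of 50 against a
# reference mean of 70 with a standard deviation of 10.
print(z_score(50, 70, 10))  # prints -2.0
```

A result of -2.0 means the score sits two standard deviations below the reference mean, which is the kind of row the `z_score < 0` filter above would surface.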


In [None]:
labels = ["excellent", "competitive", "average", "below average", "poor"]
garak_label_df = garak_df["z_score_status"].value_counts().reindex(labels)
garak_resilience_score = garak_label_df[["excellent", "competitive"]].sum() / garak_label_df.sum()
print(f"Garak resilience score: {garak_resilience_score*100.:.2f}%")

## Accuracy evaluation using NeMo Framework

Some benchmarks use endpoints provided by build.nvidia.com for answer extraction. If your evaluation fails with error messages like the one below, please visit [build.nvidia.com](https://build.nvidia.com/) and contact support.

```
Retry attempt 1/25 due to: ClientResponseError: 500, message='Internal Server Error', url='https://integrate.api.nvidia.com/v1/chat/completions'
```


### Evaluating Accuracy with GPQA-D

In [None]:
# Full or partial evaluation 
gpqad_overrides = os.environ["GPQAD_CONFIG_OVERRIDES"]
if os.environ["RUN_FULL_EVAL"] == "0":
    gpqad_overrides += ",config.params.limit_samples=50"

# Run evaluation
!core_evals_simple_evals run_eval \
      --model_id 'test-model' \
      --model_url http://localhost:5000/v1/chat/completions \
      --eval_type gpqa_diamond \
      --output_dir {GPQA_DIAMOND_RESULTS_DIR} \
      --overrides="{gpqad_overrides}" &> "$LOG_DIR/core-evals-simple-evals-gpqa_diamond.log"

### Evaluating Accuracy with MATH-500

In [None]:
# Full or partial evaluation
aa_math_500_overrides = os.environ["AA_MATH_500_CONFIG_OVERRIDES"]
if os.environ["RUN_FULL_EVAL"] == "0":
    aa_math_500_overrides += ",config.params.limit_samples=100"

# Run evaluation
!core_evals_simple_evals run_eval \
    --eval_type AA_math_test_500 \
    --model_id test-model \
    --model_type chat \
    --model_url http://localhost:5000/v1/chat/completions \
    --output_dir {AA_MATH_500_RESULTS_DIR} \
    --overrides "{aa_math_500_overrides}" &> "$LOG_DIR/core-evals-simple-evals-aa-math-500.log"

### Evaluating Accuracy with IFEval

In [None]:
ifeval_overrides = os.environ["IFEVAL_CONFIG_OVERRIDES"]
# Full or partial evaluation 
if os.environ["RUN_FULL_EVAL"] == "0":
    ifeval_overrides += ",config.params.limit_samples=100"

# Run evaluation
!core_evals_lm_eval run_eval \
    --eval_type ifeval \
    --model_id test-model \
    --model_type chat \
    --model_url http://localhost:5000/v1/chat/completions \
    --output_dir {IFEVAL_RESULTS_DIR} \
    --overrides "{ifeval_overrides}" &> "$LOG_DIR/core-eval-lm-eval-ifeval.log"

Shut down the vLLM server.

In [None]:
# Terminate the running vllm process
vllm_launcher.stop_all()

In [None]:
import glob

gpqa_diamond_score = None
aa_math500_score = None
ifeval_prompt_strict_score = None
ifeval_inst_strict_score = None

if os.path.exists(f"{GPQA_DIAMOND_RESULTS_DIR}/gpqa_diamond.json"):
    gpqa_diamond_score = json.load(open(f"{GPQA_DIAMOND_RESULTS_DIR}/gpqa_diamond.json"))["score"]

if os.path.exists(f"{AA_MATH_500_RESULTS_DIR}/AA_math_test_500.json"):
    aa_math500_score = json.load(open(f"{AA_MATH_500_RESULTS_DIR}/AA_math_test_500.json"))["score"]

pattern = os.path.join(IFEVAL_RESULTS_DIR, "test-model", "results_*.json")
match_files = glob.glob(pattern)
if len(match_files) > 0:
    ifeval_results = json.load(open(match_files[0]))
    ifeval_prompt_strict_score = ifeval_results["results"]["ifeval"]["prompt_level_strict_acc,none"]
    ifeval_inst_strict_score = ifeval_results["results"]["ifeval"]["inst_level_strict_acc,none"]

accuracy_s = pd.Series(
    {"gpqa_diamond": gpqa_diamond_score,
     "aa_math_500": aa_math500_score,
     "ifeval_prompt_strict": ifeval_prompt_strict_score,
     "ifeval_inst_strict": ifeval_inst_strict_score})
accuracy_s
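
If the baseline scores from Notebook 1 are also loaded, the two sets can be compared metric by metric before opening the dashboard. The sketch below is one way to do that. The helper name `score_deltas` is hypothetical, and it assumes that both inputs are plain mappings from metric name to score, such as `accuracy_s.to_dict()`.

```python
import math


def _is_missing(value):
    """True for None or NaN, the two ways a missing score can show up."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def score_deltas(before, after):
    """Map each metric scored on both sides to (before, after, after - before)."""
    deltas = {}
    for name, old in before.items():
        new = after.get(name)
        if _is_missing(old) or _is_missing(new):
            continue  # metric missing on one side, so there is nothing to compare
        deltas[name] = (old, new, new - old)
    return deltas
```

A positive third element means the safety-trained model scored higher on that metric than the original model.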

## Create a Report Card

Now we have two sets of evaluation results. Let's compare the safety and accuracy of the same model before and after safety post-training.
Run the following code block to launch a server for visual analysis.

When the server starts, it prints a message like the one shown further below. To reach the dashboard, go to Brev and, on the instance page, select:

```
Access > Using Secure Links > Share a Service
```

Then, add port 8501 so that you can reach it from your local machine.
Open the secure link in your browser to see the Report Card Dashboard.

Use the secure link rather than the direct links shown in the message. Opening the port to the public is not recommended because it poses security risks.

```
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://<IP address>:8501
  External URL: http://<IP address>:8501
```

In [None]:
ORIGINAL_MODEL_OUTPUT_DIR = f"{BASE_DIR}/results/DeepSeek-R1-Distill-Llama-8B/"
TRAINED_MODEL_OUTPUT_DIR = f"{BASE_DIR}/results/DeepSeek-R1-Distill-Llama-8B-Safety-Trained/"

!streamlit run scripts/app.py -- {ORIGINAL_MODEL_OUTPUT_DIR} {TRAINED_MODEL_OUTPUT_DIR}

## Takeaways

This series of notebooks walked you through the following topics, using NVIDIA's NeMo Framework Evaluation and NeMo-RL.

- [Notebook 1]() Safety and Accuracy Evaluation using NeMo Eval
- [Notebook 2]() Safety Post-training using Safety for Agentic AI's training recipe
- [Notebook 3]() Re-running the same safety and accuracy evaluation to understand how the model has improved