# Step 0: Setup

In this initial step, we prepare the environment for running the notebook by installing required packages and configuring vLLM with a custom reasoning parser.

## Install required packages

This section installs the necessary dependencies for running vLLM and other components used throughout the notebook. In particular, we:

- Install a specific version of vLLM that is compatible with our custom modifications.
- Install additional libraries needed for result visualization.
- Add a custom reasoning parser to vLLM using the `scripts/setup_parser.sh` script. This enables the model to support advanced reasoning capabilities, such as structured intermediate outputs or special token handling used in agentic workflows.


⚠️ **Note**: If you're running this notebook on a clean machine or fresh container, this setup step is required before executing any inference or evaluation logic.

In [None]:
# Install required packages
!pip install -q streamlit==1.45.1 > /tmp/package-installation.log 2>&1

# Install custom reasoning parser
!bash scripts/setup_parser.sh
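
Because the `pip` output above is redirected to `/tmp/package-installation.log`, a failed install is easy to miss. The following is a minimal sketch of how the pinned version could be checked afterwards. `installed_version` is a hypothetical helper, not part of the notebook's scripts, and the `expected` mapping simply mirrors the `pip` command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Pinned versions to verify; this mirrors the pip command in the cell above.
expected = {"streamlit": "1.45.1"}

for name, wanted in expected.items():
    found = installed_version(name)
    status = "OK" if found == wanted else "MISMATCH"
    print(f"{name}: expected {wanted}, found {found} [{status}]")
```

If a package is reported as a mismatch, the installation log is the first place to look.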

## Configure API Keys and Evaluation Setting

Run the following code block and enter the API keys required by the notebooks when prompted.
Alternatively, you can edit `.env` directly and add the information there. The remaining environment variables are added automatically.

```
# .env
HF_TOKEN=<Your HF_TOKEN>
JUDGE_API_KEY=<Your NVIDIA_API_KEY>
WANDB_API_KEY=<Your WANDB_API_KEY>
RUN_FULL_EVAL=0 or 1
```

In [None]:
import os
from dotenv import load_dotenv
from getpass import getpass
from pathlib import Path
import yaml

# Load API keys from .env
load_dotenv(dotenv_path=".env")

# Check HF_TOKEN
if "HF_TOKEN" in os.environ:
    HF_TOKEN = os.environ["HF_TOKEN"]
else:
    HF_TOKEN = getpass("Enter your HF_TOKEN: ")

# Check JUDGE_API_KEY (an NVIDIA API key used for the Judge API)
if "JUDGE_API_KEY" in os.environ:
    JUDGE_API_KEY = os.environ["JUDGE_API_KEY"]
else:
    JUDGE_API_KEY = getpass("Enter your NVIDIA API KEY (used for Judge API): ")

# Check WANDB_API_KEY
if "WANDB_API_KEY" in os.environ:
    WANDB_API_KEY = os.environ["WANDB_API_KEY"]
else:
    WANDB_API_KEY = getpass("Enter your WANDB_API_KEY (press Enter to skip): ")
    if not WANDB_API_KEY:
        print("WANDB_API_KEY not provided. Skipping wandb integration.")
        # Update the config file for SFT to skip W&B logging
        with open("configs/deepseek_sft.yaml", "r") as f:
            config = yaml.safe_load(f)
        config["logger"]["wandb_enabled"] = False
        with open("configs/deepseek_sft.yaml", "w") as f:
            yaml.dump(config, f, default_flow_style=False)        

# Check RUN_FULL_EVAL
if "RUN_FULL_EVAL" in os.environ:
    RUN_FULL_EVAL = os.environ["RUN_FULL_EVAL"]
else:
    # The flag is not a secret, so a visible prompt is used instead of getpass
    RUN_FULL_EVAL = input("Run full evaluation? (No=0 or Yes=1): ").strip()
    if RUN_FULL_EVAL not in ["0", "1"]:
        print("Invalid input. The value must be 0 or 1. Defaulting to 0.")
        RUN_FULL_EVAL = "0"

# Add python path to the script directory
if "PYTHONPATH" in os.environ:
    target_path = "/ephemeral/workspace/safety-for-agentic-ai/notebooks/scripts"
    if target_path not in os.environ["PYTHONPATH"]:
        PYTHONPATH = os.environ["PYTHONPATH"] + f":{target_path}"
    else:
        PYTHONPATH = os.environ["PYTHONPATH"]
else:
    PYTHONPATH = "/ephemeral/workspace/safety-for-agentic-ai/notebooks/scripts"

BASE_DIR = "/ephemeral/workspace"
TMPDIR = "/tmp"
XDG_CACHE_HOME = f"{BASE_DIR}/cache"
HF_HOME = f"{BASE_DIR}/cache/huggingface"
UV_CACHE_DIR = f"{BASE_DIR}/cache/uv"
TRITON_CACHE_DIR = f"{BASE_DIR}/cache/triton"
DATASET_CACHE_DIR = f"{BASE_DIR}/dataset_cache"
RAY_TMPDIR = "/tmp/ray"

# This Developer Blueprint requires and has been tested with 8x H100 or A100 GPUs.
# If using fewer GPUs, update the following environment variables accordingly.
POLICY_MODEL_GPUS = "0,1,2,3"
POLICY_MODEL_GPUS_FULL = "0,1,2,3,4,5,6,7"
NEMOGUARD_MODEL_GPUS = "4,5"
WILDGUARD_MODEL_GPUS = "6,7"
SAFETY_MODEL_GPUS = "4,5,6,7"

# vLLM settings
VLLM_ENGINE_ITERATION_TIMEOUT_S = "36000"
VLLM_ALLOW_LONG_MAX_MODEL_LEN = "1"
VLLM_HOST = "0.0.0.0"

# * Evaluation configuration
# Accuracy benchmarks
EVAL_CONFIG_OVERRIDES = "target.api_endpoint.type=chat,config.params.temperature=0.6,config.params.top_p=0.95,config.params.max_new_tokens=12288,config.params.max_retries=5,config.params.parallelism=4,config.params.request_timeout=600"
GPQAD_CONFIG_OVERRIDES = AA_MATH_500_CONFIG_OVERRIDES = IFEVAL_CONFIG_OVERRIDES = EVAL_CONFIG_OVERRIDES

# Safety benchmarks
AEGIS_CONFIG_OVERRIDES = "config.params.max_new_tokens=8192,config.params.parallelism=10,config.params.extra.judge.parallelism=10"
WILDGUARD_CONFIG_OVERRIDES = "config.params.max_new_tokens=8192,config.params.parallelism=10,config.params.extra.judge.parallelism=10"

env_var_names = [
    "HF_TOKEN", "JUDGE_API_KEY", "WANDB_API_KEY", "RUN_FULL_EVAL",
    "PYTHONPATH", "BASE_DIR", "TMPDIR", "XDG_CACHE_HOME", "HF_HOME",
    "UV_CACHE_DIR", "TRITON_CACHE_DIR", "DATASET_CACHE_DIR", "RAY_TMPDIR",
    "POLICY_MODEL_GPUS", "POLICY_MODEL_GPUS_FULL", "NEMOGUARD_MODEL_GPUS",
    "WILDGUARD_MODEL_GPUS", "SAFETY_MODEL_GPUS",
    "VLLM_ENGINE_ITERATION_TIMEOUT_S", "VLLM_ALLOW_LONG_MAX_MODEL_LEN", "VLLM_HOST",
    "EVAL_CONFIG_OVERRIDES", "GPQAD_CONFIG_OVERRIDES", "AA_MATH_500_CONFIG_OVERRIDES",
    "IFEVAL_CONFIG_OVERRIDES", "AEGIS_CONFIG_OVERRIDES", "WILDGUARD_CONFIG_OVERRIDES"
]

with open(".env", "w") as f:
    for var in env_var_names:
        value = globals().get(var)
        if value is not None:
            f.write(f"{var}={value}\n")

# Reload .env to store the keys in os.environ
load_dotenv(dotenv_path=".env")
print("API keys have been stored in .env.")
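
The GPU assignments in the cell above assume 8 GPUs. As a rough sketch of how the same layout could be derived for a different GPU count, the hypothetical helper below (`split_gpus`, not part of the notebook's scripts) gives the first half of the devices to the policy model and splits the second half between the two guard models. Whether the models actually fit on fewer GPUs has not been verified here, since the blueprint is stated to be tested only with 8x H100 or A100.

```python
from typing import Dict


def split_gpus(num_gpus: int) -> Dict[str, str]:
    """Derive comma-separated GPU lists following the 8-GPU layout above.

    Assumption: the count is a positive multiple of 4, so that the policy
    model gets half of the devices and each guard model gets a quarter.
    """
    if num_gpus < 4 or num_gpus % 4 != 0:
        raise ValueError("num_gpus must be a positive multiple of 4")

    ids = [str(i) for i in range(num_gpus)]
    half, quarter = num_gpus // 2, num_gpus // 4
    policy, safety = ids[:half], ids[half:]

    return {
        "POLICY_MODEL_GPUS": ",".join(policy),
        "POLICY_MODEL_GPUS_FULL": ",".join(ids),
        "NEMOGUARD_MODEL_GPUS": ",".join(safety[:quarter]),
        "WILDGUARD_MODEL_GPUS": ",".join(safety[quarter:]),
        "SAFETY_MODEL_GPUS": ",".join(safety),
    }


# With 8 GPUs this reproduces the values hard-coded in the cell above.
for key, value in split_gpus(8).items():
    print(f"{key}={value}")
```

For a 4-GPU machine, `split_gpus(4)` would assign GPUs 0 and 1 to the policy model and one GPU to each guard model.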

In [None]:
# Create directories
for dir_path in [os.environ["TMPDIR"], os.environ["XDG_CACHE_HOME"], os.environ["HF_HOME"],
                 os.environ["UV_CACHE_DIR"], os.environ["TRITON_CACHE_DIR"],
                 os.environ["DATASET_CACHE_DIR"], os.environ["RAY_TMPDIR"]]:
    Path(dir_path).mkdir(parents=True, exist_ok=True)
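
Every value stored in `.env` comes back from `os.environ` as a string, and the string `"0"` is truthy in Python, so a flag such as `RUN_FULL_EVAL` should be converted before it is used in a condition. The sketch below shows one way to do that. `env_flag` and `env_gpu_list` are hypothetical helpers introduced for illustration and are not defined anywhere in the notebooks.

```python
import os
from typing import List


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable holding "0" or "1" as a boolean."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip()
    if value not in ("0", "1"):
        raise ValueError(f"{name} must be '0' or '1', got {raw!r}")
    return value == "1"


def env_gpu_list(name: str) -> List[int]:
    """Parse a comma-separated GPU list such as "4,5,6,7" into integers."""
    raw = os.environ.get(name, "")
    return [int(part) for part in raw.split(",") if part.strip()]


if env_flag("RUN_FULL_EVAL"):
    print("Full evaluation requested.")
else:
    print("Running the reduced evaluation.")
```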


## Download and prepare the NeMo Guard Model for Content Safety

For content safety evaluation, we need a guard model that classifies whether the target model's response is safe. Because the NeMo Guard model's weights are distributed as LoRA adapter weights, you need to download them and merge them with the Llama 3.1 8B Instruct base model.

⚠️ **Note**: Make sure to complete the model creation before running Notebooks 1-3 as they require this model.

In [None]:
from peft import PeftModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NEMOGUARD_MODEL_PATH = "/ephemeral/workspace/model/llama-3.1-nemoguard-8b-content-safety"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# Use the same repository ID as the tokenizer above for consistency
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_model, "nvidia/llama-3.1-nemoguard-8b-content-safety")
merged_model = model.merge_and_unload()

# Save merged model
# The weights are already bfloat16 from loading, so no dtype argument is needed here
merged_model.save_pretrained(NEMOGUARD_MODEL_PATH)
tokenizer.save_pretrained(NEMOGUARD_MODEL_PATH) 
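
To make the merge step more concrete, the toy example below illustrates the arithmetic behind folding a LoRA adapter into a base weight matrix. It assumes the standard LoRA formulation, where the merged weight is `W + (lora_alpha / r) * B @ A`. The actual adapter configuration of the NeMo Guard model was not inspected, the matrices are made-up numbers, and `matmul` and `merge_lora` are illustrative helpers rather than PEFT functions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
               lora_alpha: float, rank: int) -> Matrix:
    """Return weight + (lora_alpha / rank) * (lora_b @ lora_a)."""
    scaling = lora_alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter (illustrative numbers only).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape r x in  = 1 x 2
lora_b = [[0.5], [0.0]]  # shape out x r = 2 x 1

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and served as an ordinary checkpoint.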

## Getting familiar with vLLM server launcher

In the Developer Blueprint Notebooks, you will launch vLLM servers to perform inference and evaluation. To simplify this process, you'll use `VLLMLauncher`, which provides a more Pythonic and user-friendly interface for managing vLLM server instances.

In [None]:
from scripts.vllm_launcher import VLLMLauncher

vllm_launcher = VLLMLauncher(total_num_gpus=8)
vllm_server_proc = vllm_launcher.launch(model_name_or_path="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
                                        gpu_devices="4,5,6,7",
                                        port=5000,
                                        served_model_name="test-model",
                                        log_filepath="/tmp/vllm-server-model.log",
                                        seed=1,
                                        vllm_host="0.0.0.0")

!sleep 120                                        
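
The fixed `sleep 120` above is a simple way to give the server time to load the model. As an alternative, one could poll the server until it responds. The sketch below assumes the server exposes a `/health` route on the configured port, as vLLM's OpenAI-compatible server normally does. `http_ok` and `wait_until_ready` are hypothetical helpers, not part of `VLLMLauncher`.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Call `probe` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False


# Example usage (assumed URL, matching port=5000 in the launch call above):
# ready = wait_until_ready(lambda: http_ok("http://localhost:5000/health"))
# print("Server ready" if ready else "Server did not become ready in time")
```

Polling returns as soon as the server is up and reports a clear failure if it never starts, whereas a fixed sleep does neither.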

You can check whether the vLLM server process is running with `.is_alive()`, and view the latest lines of the server log with `.print_log()`.

In [None]:
vllm_server_proc.is_alive()

In [None]:
vllm_server_proc.print_log()

You can also verify GPU usage by the vLLM server process by running `nvidia-smi`.

In [None]:
!nvidia-smi
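
For a programmatic check, `nvidia-smi` can also emit machine-readable output through its `--query-gpu` and `--format` options. The following is a minimal sketch that parses per-GPU memory usage. `parse_gpu_memory` and `gpu_memory_usage` are hypothetical helpers, and the parser assumes every field is a plain integer in MiB, which may not hold on systems where a field is reported as unavailable.

```python
import subprocess
from typing import Dict, Tuple

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> Dict[int, Tuple[int, int]]:
    """Map GPU index to (used MiB, total MiB) from nvidia-smi CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def gpu_memory_usage() -> Dict[int, Tuple[int, int]]:
    """Run nvidia-smi and return the parsed memory usage per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Example usage on a machine with NVIDIA drivers installed:
# for gpu, (used, total) in gpu_memory_usage().items():
#     print(f"GPU {gpu}: {used} / {total} MiB")
```

With the server launched on `gpu_devices="4,5,6,7"`, memory use would be expected to rise on those four devices only.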

You can stop the server process with `.stop()`. To terminate all vLLM server processes launched by the `VLLMLauncher` instance, use `vllm_launcher.stop_all()`.

In [None]:
vllm_server_proc.stop()
# or vllm_launcher.stop_all()

## Next Steps

You have now set up the environment for the notebooks. The next step is to run the [evaluation of the target model](Step1_Evaluation.ipynb) on safety and accuracy datasets to assess the model's current performance in both areas.


⚠️ **Note**: To ensure GPU VRAM is properly released, it is recommended to select **"Shut Down Kernel"** after completing each notebook, including this notebook.