# AI-Q Research Assistant Evaluation Suite Tutorial

This notebook walks through the AI-Q Research Assistant Evaluation Suite and shows how to evaluate AI-generated research reports, with automatic dataset preprocessing and a set of quality metrics.

## What You'll Learn

- How to set up the AI-Q Research Assistant evaluation framework
- Creating and preprocessing evaluation datasets  
- Configuring and running evaluations
- Understanding evaluation metrics and results

## Prerequisites

Before starting, ensure you have:
- A local deployment of AI-Q Research Assistant, or a hosted endpoint
    - Setup reference: docs/get-started/get-started-docker-compose.md
- A local deployment of Foundational RAG, or a hosted endpoint
    - The RAG deployment steps are in the same file: docs/get-started/get-started-docker-compose.md
- Default collections loaded, especially “Biomedical_Dataset” (see docs/get-started/get-started-docker-compose.md)
    - This is required because both the evaluation suite and the research workflow read from this RAG collection.
- Alternatively, deploy via Helm: deploy/helm (see docs/get-started/helm-deployment.md)
- Python 3.12+
- NVIDIA API Key (from [build.nvidia.com](https://build.nvidia.com))
- (Optional) Tavily API Key for web search capabilities
- (Optional) Weights & Biases (W&B) API Key for tracing capabilities
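
The Python version and API key items in this list can be checked programmatically. The following is a minimal sketch, not part of the evaluation suite: the helper names `check_prerequisites` and `optional_keys_missing` are hypothetical, and the environment variable names are the ones set in later cells of this notebook.

```python
import os
import sys


def check_prerequisites(env=None, version_info=None):
    """Return a list of problems; an empty list means the required items are present."""
    env = os.environ if env is None else env
    version_info = sys.version_info if version_info is None else version_info

    problems = []
    if tuple(version_info[:2]) < (3, 12):
        problems.append("Python 3.12+ is required")
    if not env.get("NVIDIA_API_KEY", "").strip():
        problems.append("NVIDIA_API_KEY is not set")
    return problems


def optional_keys_missing(env=None):
    """Return the optional keys (web search, tracing) that are not set."""
    env = os.environ if env is None else env
    optional = ("TAVILY_API_KEY", "WANDB_API_KEY")
    return [name for name in optional if not env.get(name, "").strip()]


for problem in check_prerequisites():
    print("Required:", problem)
for name in optional_keys_missing():
    print("Optional, not set:", name)
```

If `NVIDIA_API_KEY` is reported as missing at this point, that is expected: it is set in Step 2.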

In [None]:
import os
import sys
import subprocess
import json
import yaml
from pathlib import Path

# Check Python version
print(f"Python version: {sys.version}")
print(f"Current working directory: {os.getcwd()}")

# Verify we're in the right directory structure (notebook should be in notebooks/ subdirectory)
if not Path("../pyproject.toml").exists():
    print("Please ensure this notebook is in the notebooks/ directory of the repository")
    print("    The '../pyproject.toml' file should be accessible from here.")
    print("    Current structure should be: repository_root/notebooks/this_notebook.ipynb")
else:
    print("Directory structure verified - pyproject.toml found in parent directory")


In [None]:
# Step 1: Install AI-Q Research Assistant package directly into the current Python environment
print("Installing directly into current Python environment...")
print(f"Using Python: {sys.executable}")

import subprocess
import sys

# Install directly using the current Python interpreter
result = subprocess.run([
    sys.executable, "-m", "pip", "install", "-e", ".."
], capture_output=True, text=True)

print(f"\nInstallation result:")
if result.returncode == 0:
    print("Installation successful!")
    if result.stdout:
        print("Output:", result.stdout.strip())
else:
    print("Installation failed!")
    print("Error:", result.stderr)

# Test import immediately
print(f"\nTesting import...")
try:
    # Clear any cached modules to get fresh import
    modules_to_clear = [m for m in sys.modules.keys() if m.startswith('aiq')]
    for module in modules_to_clear:
        del sys.modules[module]
    
    import aiq_aira
    print("SUCCESS: aiq_aira imported!")
    
    
    print(f"\nREADY TO GO!")
    
except ImportError as e:
    print(f"Import still failing: {e}")
    print("\nFallback: Using manual path fix...")
    
    # Fallback to manual path
    aira_src_path = str(Path("../aira/src").resolve())
    if aira_src_path not in sys.path:
        sys.path.insert(0, aira_src_path)
        
    try:
        import aiq_aira
        print("SUCCESS with manual path fix!")
    except ImportError as e2:
        print(f"Still failing: {e2}")


## Step 2: Set Your API Keys

Before running the evaluation, you need to set your NVIDIA API key. This is **required** for the evaluation to work.


In [None]:
# Run this cell to set up the environment variables. You can also set them manually in your terminal.
# It prompts for your NVIDIA API key if NVIDIA_API_KEY is not already set.

import os
import getpass

def ensure_secret(var_name: str, prompt_text: str, require_prefix: str | None = None) -> str:
    val = os.environ.get(var_name, "").strip()
    while True:
        if not val:
            val = getpass.getpass(prompt_text).strip()
        if not val:
            print(f"{var_name} cannot be empty. Please try again.")
            continue
        if require_prefix and not val.startswith(require_prefix):
            print(f"{var_name} should start with '{require_prefix}'. Please try again.")
            val = ""  # force re-prompt
            continue
        break
    os.environ[var_name] = val
    return val

# Use it
nvidia_key = ensure_secret(
    "NVIDIA_API_KEY",
    "Please input your NVIDIA_API_KEY: ",
    require_prefix="nvapi-"
)
print("NVIDIA_API_KEY is set.")

key = os.environ.get("NVIDIA_API_KEY", "")
shown = f"{key[:6]}...{key[-4:]}" if len(key) >= 10 else "***"
print(f"NVIDIA_API_KEY (masked): {shown}")


### Optional: Setting up Web Search with Tavily
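
Web search is switched on per dataset entry through a `search_web` flag (see data/eval_dataset.json for the real schema). As a rough sketch, assuming the dataset is a JSON list of objects, the flag could be set on every entry as shown here. The helper `enable_web_search` and the `id` and `topic` fields are hypothetical placeholders, not the suite's actual schema.

```python
import json


def enable_web_search(entries, enabled=True):
    """Return copies of the dataset entries with the `search_web` flag set."""
    return [{**entry, "search_web": enabled} for entry in entries]


# Hypothetical entries: only `search_web` comes from this tutorial;
# the other field names are placeholders for illustration.
sample_entries = [
    {"id": "example-1", "topic": "placeholder topic"},
    {"id": "example-2", "topic": "another placeholder", "search_web": False},
]

updated = enable_web_search(sample_entries)
print(json.dumps(updated, indent=2))
```

The helper returns new dictionaries, so the original entries stay unchanged until you write the result back to a file.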

In [None]:
# To set up web search with Tavily, get an API key from Tavily and run this cell to set the TAVILY_API_KEY environment variable.
# To enable web search during evaluation, set `"search_web": true` in your dataset entries. See the example in data/eval_dataset.json.
tavily = os.environ.get("TAVILY_API_KEY", "").strip()
if not tavily:
    tmp = getpass.getpass("Optional: TAVILY_API_KEY (press Enter to skip): ").strip()
    if tmp:
        os.environ["TAVILY_API_KEY"] = tmp
        print("TAVILY_API_KEY is set (optional).")
    else:
        print("TAVILY_API_KEY not set (optional).")
else:
    print("TAVILY_API_KEY is set (optional).")

### Optional: Setting up Tracing with W&B Weave

- To turn on tracing with Weave, open configs/eval_config.yml and go to the telemetry section.
  - Tracing is off by default; uncommenting the `tracing` block shown below turns on Weave tracing during evaluation.

```yaml
 telemetry:
    logging:
      console:
        _type: console
        level: DEBUG
    # Uncomment this if you want to use W&B Weave for tracing
    # tracing:
    #   weave:
    #     _type: weave
    #     project: "NAT-BP-Project-Default" # Name of the project in weave, runs will be grouped under this project name
```

To view your results after an evaluation run, go to:
- https://wandb.ai/home

For more documentation on setting up Weave in a NeMo Agent Toolkit workflow, see:
- https://docs.nvidia.com/nemo/agent-toolkit/1.2/workflows/observe/observe-workflow-with-weave.html



In [None]:
# Optional: Enable Weights & Biases (W&B) for Weave/tracing and experiment tracking.
# Run this cell only if enabling W&B/Weave, as it sets your WANDB_API_KEY

import os, getpass

def ensure_wandb_key() -> str:
    key = os.environ.get("WANDB_API_KEY", "").strip()
    while not key:
        key = getpass.getpass("Enter WANDB_API_KEY (or press Enter to cancel): ").strip()
        if not key:
            print("W&B setup skipped. Leave this cell unrun or set WANDB_API_KEY later to enable tracing.")
            return ""
    os.environ["WANDB_API_KEY"] = key
    shown = f"{key[:4]}...{key[-4:]}" if len(key) >= 10 else "***"
    print(f"WANDB_API_KEY set (masked): {shown}")
    return key

try:
    key = ensure_wandb_key()
    if key:
        import wandb
        base_url = os.environ.get("WANDB_BASE_URL", "").strip()
        if base_url:
            os.environ["WANDB_BASE_URL"] = base_url
            wandb.login(key=key, host=base_url, verify=True, relogin=True)
        else:
            wandb.login(key=key, verify=True, relogin=True)
        print("Weights & Biases login successful. W&B is enabled for this session.")
except Exception as e:
    print("Failed to log in to Weights & Biases. Check WANDB_API_KEY or connectivity.")
    print(str(e))

## Step 3: Quick Start - Run a Basic Evaluation

Let's run a quick evaluation to verify that everything works. We'll use the default dataset and configuration included in the repository.


In [None]:

# Set Environment variable overrides for service endpoints

import os

# Default: self-hosted/local NIMs
# Hosted Instruct LLM Backend (set to meta/llama-3.3-70b-instruct)
os.environ["INSTRUCT_LLM_BASE_URL"]="http://localhost:8050/v1"
# Hosted Nemotron Backend (for reasoning)
os.environ["NEMOTRON_LLM_BASE_URL"]="http://localhost:8999/v1"
# RAG server for generate_summary / artifact_qa
os.environ["RAG_SERVER_URL"]="http://localhost:8081/v1"

# Optional: to use your own endpoints for the evaluation LLM and RAGAS LLM, uncomment the following lines and point them to those endpoints.
# Otherwise, the evaluation LLM and RAGAS LLM default to models hosted on NVIDIA's build.nvidia.com.
# os.environ["EVAL_LLM_BASE_URL"]="http://local-llm-endpoint:8000/v1"
# os.environ["RAGAS_LLM_BASE_URL"]="http://local-llm-endpoint:8000/v1"


# NVIDIA-hosted example: uncomment ALL lines below to use hosted endpoints. You must still provide your own RAG server and set its endpoint.
# os.environ["RAG_SERVER_URL"]="https://your-rag-server.example.com/v1"
# os.environ["INSTRUCT_LLM_BASE_URL"]="https://integrate.api.nvidia.com/v1"
# os.environ["NEMOTRON_LLM_BASE_URL"]="https://integrate.api.nvidia.com/v1"
# os.environ["EVAL_LLM_BASE_URL"]="https://integrate.api.nvidia.com/v1"
# os.environ["RAGAS_LLM_BASE_URL"]="https://integrate.api.nvidia.com/v1"

print("Environment overrides set")
key = os.environ.get("NVIDIA_API_KEY", "")
shown = f"{key[:6]}...{key[-4:]}" if len(key) >= 10 else "***"
print(f"You have set your NVIDIA_API_KEY (masked): {shown}")

### ──────────────────────────────────────────────────────────────
### QUICK SETUP – point default config to YOUR deployments
### ──────────────────────────────────────────────────────────────

For full control over the configuration (recommended), edit configs/eval_config.yml as needed, then return to this notebook.

This notebook will run the evaluation harness with:
- Config file → `configs/eval_config.yml`  
- Dataset file → `data/eval_dataset.json`

### There are two ways to point the evaluation at a dataset:
#### Option A (recommended): override the dataset per run with a CLI flag.
- `nat eval` uses this dataset path instead of the value in the config.
- Example: `nat eval --config_file "{config_path}" --dataset "{dataset_path}"`
- The next cell resolves the path to the default dataset already in the repo (e.g., data/eval_dataset.json), so you can run it as is.

#### Option B: edit eval_config.yml once to set an absolute dataset path.
- Set `eval.general.dataset.file_path: /abs/path/to/eval_dataset.json`
- If this path is wrong or missing, the evaluation fails.
- To run the evaluation: `nat eval --config_file "{config_path}"`
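
The difference between the two options comes down to whether the `--dataset` flag is passed. A minimal sketch of building both command lines outside a notebook follows; the helper name `build_eval_command` is hypothetical, and only the `nat eval`, `--config_file`, and `--dataset` parts come from this tutorial.

```python
import shlex
from pathlib import Path


def build_eval_command(config_path, dataset_path=None):
    """Build the `nat eval` argument list; omit --dataset to use the path from the config."""
    cmd = ["nat", "eval", "--config_file", str(config_path)]
    if dataset_path is not None:
        cmd += ["--dataset", str(dataset_path)]
    return cmd


# Option A: override the dataset for this run.
option_a = build_eval_command(Path("configs/eval_config.yml"), Path("data/eval_dataset.json"))
# Option B: rely on eval.general.dataset.file_path in the config.
option_b = build_eval_command(Path("configs/eval_config.yml"))

print(shlex.join(option_a))
print(shlex.join(option_b))
```

Either list can be passed to `subprocess.run(cmd, check=True)` when running from a script instead of using the `!` shell escape.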

In [None]:
# This cell will run the evaluation harness with our default config file and dataset path
from pathlib import Path

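try:
    # In IPython/Jupyter, _dh is a list of visited directories; the first entry is the start directory
    notebook_dir = Path(globals()["_dh"][0])
except Exception:
    # Outside IPython (or if _dh is unavailable), fall back to the current working directory
    notebook_dir = Path.cwd()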


# Project root is the parent of notebooks/ in this repo layout
project_root = notebook_dir.parent

# Absolute paths to the evaluation config file and the dataset
eval_config_path = (project_root / "configs" / "eval_config.yml").resolve()
dataset_path = (project_root / "data" / "eval_dataset.json").resolve()

# Fail early if missing
assert eval_config_path.is_file(), f"Config file not found at: {eval_config_path}"
assert dataset_path.exists(), f"Dataset path does not exist: {dataset_path}"

# Note: a single full end-to-end workflow and evaluation run may take 15-30 minutes
# The workflow & metrics output will be in the ./.tmp/aiq_aira/ directory
# Run the workflow through the evaluation harness with the default config file and dataset path
!nat eval --config_file "{eval_config_path}" --dataset "{dataset_path}"

## Evaluators Overview

#### Coverage – Inclusion of key facts/claims from the ground truth. 
- Does the report capture all key facts from the ground truth?

#### Synthesis – Integration, comparison/contrast, and coherence across multiple sources.
- Does it integrate multiple sources meaningfully, showing alignment or differences?

#### Hallucination – Unsupported or non‑grounded claims in the generated report.
- Does the output introduce any unsupported claims?

#### Citation Quality – Whether claims are supported by the cited sources and citations are precise.
- Are references correctly attributed and verifiable via grounding?

#### RAGAS metrics – Context relevance, answer accuracy, and groundedness of responses.
- Do retrieval and factuality hold up across context relevance, answer accuracy, and groundedness?

## How they are scored 

### Coverage
- What it tests: Inclusion of key facts/claims extracted from the ground truth.

- Method: Single template evaluation.

- Scale: 0 (not covered) to 1 (covered).

### Synthesis
- What it tests: Ability to integrate information from multiple sources, compare, contrast, and draw coherent conclusions.

- Method: Dual template evaluation; average then normalize to 0–1.

- Scale: 0 (pure extraction) to 4 (expert synthesis), averaged and mapped to 0–1.

### Hallucination
- What it tests: Unsupported or non-grounded claims in the generated report.

- Method: Dual template evaluation with two prompts; average the two scores.

- Scale: 0 (no hallucination) to 1 (hallucination detected).


### Citation quality
- What it tests: Whether claims are supported by the cited sources and citations are precise.

- Method: Uses RAGAS ResponseGroundedness to verify support; computes precision, recall, and F1 with a 0.5 validity threshold.

- Score: Final F1 between 0 and 1.
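
To make the arithmetic concrete, here is a minimal sketch of an F1 computation with a 0.5 validity threshold. It is not the suite's implementation. It assumes each citation already has a groundedness score in [0, 1], that a citation is valid when its score is at least the threshold, that precision is valid citations over citations made, and that recall is valid citations over the number of claims needing support. All names are hypothetical.

```python
def citation_f1(groundedness_scores, num_claims, threshold=0.5):
    """Return (precision, recall, f1) under the assumptions stated above."""
    # Assumption: a score equal to the threshold counts as valid.
    valid = sum(1 for score in groundedness_scores if score >= threshold)

    precision = valid / len(groundedness_scores) if groundedness_scores else 0.0
    recall = min(1.0, valid / num_claims) if num_claims else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Four citations, three of them at or above the threshold, five claims to support.
precision, recall, f1 = citation_f1([0.9, 0.4, 0.7, 0.5], num_claims=5)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so a report scores well only when most citations are valid and most claims are covered.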

### RAGAS integration
- Context relevance: Are retrieved contexts pertinent to the query? Scored via two LLM-judge prompts, normalized to 0–1.

- Answer accuracy: Agreement with ground-truth answer using dual-judge scoring mapped to 0–1.

- Groundedness: Are response claims supported by retrieved contexts? Dual-judge scoring normalized to 0–1.

### Notes on scoring mechanics
- Dual-template metrics average two independent LLM-as-a-judge ratings after mapping to the 0–1 interval.

- Where native scales differ (e.g., 0–2 or 0–4), scores are normalized to 0–1 for comparability.
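
The normalization and averaging described above can be sketched in a few lines. This is an illustration under the assumption of simple linear scaling from a native 0–max scale; the function names are hypothetical and not taken from the suite.

```python
def normalize(raw_score, max_score):
    """Map a rating on a native 0..max_score scale onto the 0-1 interval."""
    if not 0 <= raw_score <= max_score:
        raise ValueError(f"score {raw_score} is outside 0..{max_score}")
    return raw_score / max_score


def dual_template_score(raw_a, raw_b, max_score):
    """Average two judge ratings after mapping each to 0-1."""
    return (normalize(raw_a, max_score) + normalize(raw_b, max_score)) / 2


# Synthesis: two judges rate 3 and 4 on the 0-4 scale.
synthesis = dual_template_score(3, 4, max_score=4)
# Hallucination: two prompts return 0 and 1 on the 0-1 scale.
hallucination = dual_template_score(0, 1, max_score=1)

print(synthesis, hallucination)
```

Because the scaling is linear, averaging first and normalizing afterwards gives the same result as normalizing each rating and then averaging.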

### Example Output Files:
For reference on what to expect from a full end-to-end evaluation, see the example workflow output files in docs/example-workflow-output/, for instance [citation_f1_output.json](../docs/example-workflow-output/citation_f1_output.json). They show the typical structure and content of evaluation results. If the link does not resolve in Jupyter, open the docs/example-workflow-output/ directory directly.
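
A quick way to get oriented in the output files is to list the top-level structure of each JSON file. The sketch below makes no assumption about the keys inside the files; the helper names are hypothetical, and the directory path assumes the notebook runs from the notebooks/ directory.

```python
import json
from pathlib import Path


def summarize_json_file(path):
    """Describe the top-level shape of a JSON file without assuming its schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}


def summarize_output_dir(directory):
    """Summarize every *.json file directly inside the given directory."""
    directory = Path(directory)
    return {p.name: summarize_json_file(p) for p in sorted(directory.glob("*.json"))}


# Assumed location relative to notebooks/; skipped silently if it does not exist.
example_dir = Path("../docs/example-workflow-output")
if example_dir.is_dir():
    for name, summary in summarize_output_dir(example_dir).items():
        print(name, summary)
```

The same helper can be pointed at the ./.tmp/aiq_aira/ directory after a run to inspect your own results.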

### Custom Evaluators
To create your own custom evaluator, follow the instructions in [Evaluate.md](../docs/evaluate.md) under the section titled "Implementing a Custom Evaluator".


In [None]:
# Show the installed package's metadata, including its version
!pip show aiq-aira