# AI-Q AIRA Evaluation Suite Tutorial

This notebook walks through the AI-Q AIRA Evaluation Suite, showing how to evaluate AI-generated research reports with automatic dataset preprocessing and a set of quality metrics.

## What You'll Learn

- How to set up the AIRA evaluation framework
- Creating and preprocessing evaluation datasets  
- Configuring and running evaluations
- Understanding evaluation metrics and results
- Customizing evaluations for your use case

## Prerequisites

Before starting, ensure you have:
- Python 3.12+
- NVIDIA API Key (from [build.nvidia.com](https://build.nvidia.com))
- (Optional) Tavily API Key for web search capabilities
- Access to a RAG server endpoint (if running full workflow)

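The prerequisites above can be checked programmatically before anything is installed. The following is a minimal sketch, not part of the AIRA package: `check_prerequisites` is a hypothetical helper name, and the environment variable names are the ones this notebook sets in Step 2.

```python
import os
import sys


def check_prerequisites(env=None, version=None):
    """Return a dict mapping each prerequisite to True/False.

    Hypothetical helper for illustration; not part of the AIRA package.
    """
    env = os.environ if env is None else env
    version = sys.version_info if version is None else version
    return {
        "python_3_12_plus": tuple(version[:2]) >= (3, 12),
        "nvidia_api_key": bool(env.get("NVIDIA_API_KEY")),   # required
        "tavily_api_key": bool(env.get("TAVILY_API_KEY")),   # optional
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

The `env` and `version` parameters exist only so the check can be exercised without changing the real environment.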

In [None]:
import os
import sys
import subprocess
import json
import yaml
from pathlib import Path

# Check Python version
print(f"Python version: {sys.version}")
print(f"Current working directory: {os.getcwd()}")

# Verify we're in the right directory structure (notebook should be in notebooks/ subdirectory)
if not Path("../pyproject.toml").exists():
    print("Please ensure this notebook is in the notebooks/ directory of the repository")
    print("    The '../pyproject.toml' file should be accessible from here.")
    print("    Current structure should be: repository_root/notebooks/this_notebook.ipynb")
else:
    print("Directory structure verified - pyproject.toml found in parent directory")


In [None]:
# Step 1: Install AIRA package directly into the current Python environment
print("Installing directly into current Python environment...")
print(f"Using Python: {sys.executable}")

import subprocess
import sys

# Install directly using the current Python interpreter
result = subprocess.run([
    sys.executable, "-m", "pip", "install", "-e", ".."
], capture_output=True, text=True)

print("\nInstallation result:")
if result.returncode == 0:
    print("Installation successful!")
    if result.stdout:
        print("Output:", result.stdout.strip())
else:
    print("Installation failed!")
    print("Error:", result.stderr)

# Test import immediately
print("\nTesting import...")
try:
    # Clear any cached modules to get fresh import
    modules_to_clear = [m for m in sys.modules.keys() if m.startswith('aiq')]
    for module in modules_to_clear:
        del sys.modules[module]
    
    import aiq_aira
    print("SUCCESS: aiq_aira imported!")
    
    # Test CLI command availability  
    try:
        result_test = subprocess.run([sys.executable, "-m", "aiq_aira.cli", "--help"], 
                                   capture_output=True, text=True, timeout=5)
        if "aiq eval" in result_test.stdout:
            print("SUCCESS: CLI commands available!")
        else:
            print("CLI might need additional setup")
    except Exception:
        # The CLI check is best-effort only (for example, it may time out)
        print("CLI test skipped")
    
    print("\nREADY TO GO!")
    
except ImportError as e:
    print(f"Import still failing: {e}")
    print("\nFallback: Using manual path fix...")
    
    # Fallback to manual path
    aira_src_path = str(Path("../aira/src").resolve())
    if aira_src_path not in sys.path:
        sys.path.insert(0, aira_src_path)
        
    try:
        import aiq_aira
        print("SUCCESS with manual path fix!")
    except ImportError as e2:
        print(f"Still failing: {e2}")


## Step 2: Set Your API Keys

Before running the evaluation, you need to set your NVIDIA API key. This is **required** for the evaluation to work.

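Hard-coding a key in a notebook makes it easy to commit the key by accident. As an alternative, the key can be read from the existing environment and prompted for only when it is missing. This is a sketch under assumptions: `ensure_api_key` and `is_placeholder` are hypothetical helpers, and the `nvapi-YOUR` / `tvly-YOUR` placeholder prefixes are taken from the check in the next cell.

```python
import os
from getpass import getpass


def is_placeholder(value):
    """True if the value is empty or still one of the tutorial placeholders."""
    return not value or value.startswith(("nvapi-YOUR", "tvly-YOUR"))


def ensure_api_key(name, prompt=getpass):
    """Return the key stored in os.environ[name], prompting only if it is unset.

    Hypothetical helper for illustration; `prompt` is injectable for testing.
    """
    if is_placeholder(os.environ.get(name, "")):
        os.environ[name] = prompt(f"Enter {name}: ").strip()
    return os.environ[name]
```

Calling `ensure_api_key("NVIDIA_API_KEY")` in a cell prompts for the key without echoing it into the notebook output.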

In [None]:
# Set your API keys here - REPLACE WITH YOUR ACTUAL KEYS!
os.environ["NVIDIA_API_KEY"] = "nvapi-YOUR_KEY_HERE"  # Required: Get from build.nvidia.com
os.environ["TAVILY_API_KEY"] = "tvly-YOUR_KEY_HERE"   # Optional: For web search functionality

# Verify environment variables are set
api_keys_set = True
if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-YOUR"):
    print("Please set your actual NVIDIA_API_KEY above!")
    print("   Get your free key from: https://build.nvidia.com")
    api_keys_set = False
else:
    print("NVIDIA_API_KEY is set")

if os.environ.get("TAVILY_API_KEY", "").startswith("tvly-YOUR"):
    print("TAVILY_API_KEY not set (optional - web search will be disabled)")
else:
    print("TAVILY_API_KEY is set")

if not api_keys_set:
    print("\nPlease set your API keys above before continuing!")
    print("   The evaluation will fail without a valid NVIDIA_API_KEY")

## Step 3: Quick Start - Run a Basic Evaluation

Let's run a quick evaluation to test everything is working. We'll use the default dataset and configuration included in the repository.

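Because a wrong dataset path only surfaces once the run starts, it can help to check the config first. The sketch below assumes the config nests the path under `eval.general.dataset.file_path`, as the setup cell in this step describes; `dataset_path_from_config` and `check_dataset` are hypothetical helpers, and PyYAML is assumed to be installed.

```python
from pathlib import Path


def dataset_path_from_config(config):
    """Return eval.general.dataset.file_path from a parsed config dict, or None."""
    node = config
    for key in ("eval", "general", "dataset", "file_path"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def check_dataset(config_file):
    """Load a YAML config and report whether its dataset file exists."""
    import yaml  # PyYAML; imported lazily so the helper above has no dependencies

    config = yaml.safe_load(Path(config_file).read_text(encoding="utf-8")) or {}
    path = dataset_path_from_config(config)
    if path is None:
        return "no dataset path set in config"
    return f"{path}: {'found' if Path(path).is_file() else 'NOT FOUND'}"
```

For example, `print(check_dataset("../configs/eval_config.yml"))` reports the configured path and whether the file is present.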

In [None]:
# ──────────────────────────────────────────────────────────────
# QUICK SETUP – point default config to YOUR deployments
# ──────────────────────────────────────────────────────────────

# For full control (recommended), edit configs/eval_config.yml, update every section
# that refers to your deployment, and then return to this notebook.

# There are two ways to set the dataset path for nat eval in this notebook:

# Option A (recommended): edit eval_config.yml once to set an absolute dataset path.
# Set: eval.general.dataset.file_path: /abs/path/to/eval_dataset.json
# If this path is wrong or missing, the evaluation will fail.

# Option B: override the dataset per run via a CLI flag.
# nat eval will use this dataset path instead of the config value.
# Example:
# !nat eval --config_file "{config_path}" --dataset "{dataset_path}"


import os

# Hosted Nemotron backend (for reasoning)
os.environ["NEMOTRON_LLM_BASE_URL"] = "http://nim-llm-ms:8000/v1"
# RAG server for generate_summary / artifact_qa
os.environ["RAG_SERVER_URL"] = "http://rag-server:8081/v1"
# Base URL for the evaluation LLM (NVIDIA hosted endpoint)
os.environ["EVAL_LLM_BASE_URL"] = "https://integrate.api.nvidia.com/v1"



# NVIDIA hosted models (leave as-is if you still want integrate.api)
# os.environ["RAGAS_LLM_BASE_URL"]    = "https://integrate.api.nvidia.com/v1"


print("Environment overrides set")

In [None]:
# Run the evaluation with the default config file, configs/eval_config.yml.
# The workflow and metrics output is written to the notebooks/.tmp/aiq_aira_similarity folder.
from pathlib import Path

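try:
    # IPython keeps its directory history in the list _dh; the first entry is where the notebook started
    notebook_dir = Path(globals()['_dh'][0])
except Exception:
    notebook_dir = Path.cwd()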


# Project root is the parent of notebooks/ in this repo layout
project_root = notebook_dir.parent

# Absolute path to the default evaluation config file
eval_config_path = (project_root / "configs" / "eval_config.yml").resolve()

# Fail early if missing
assert eval_config_path.is_file(), f"Config file not found at: {eval_config_path}"

# Option A: run nat eval with the dataset path taken from the config file
!nat eval --config_file "{eval_config_path}"

# Option B: pass the dataset path on the command line instead. Comment out the line above,
# uncomment the line below, and define dataset_path first (it is not set anywhere in this notebook).
# !nat eval --config_file "{eval_config_path}" --dataset "{dataset_path}"

# The run produces a lot of log output; to make it easier to read, redirect it to a file:
# !nat eval --config_file "{eval_config_path}" > output.txt 2>&1

In [None]:
# Show the installed package's metadata, including its version
!pip show aiq-aira

## Optional: Set Up a Custom Evaluator
The cells below create a custom evaluator in the following steps:
1. Create a file with the evaluator's code (here, a basic similarity checker).
2. Register it in the evaluator_register.py file.
3. Reinstall the package so the new evaluator is registered.
4. Create a custom config file that runs only that evaluator. If you use your own endpoints, change them in the cell that writes the config, or edit the config file after it is created.
5. Run the evaluation with the new custom evaluator.
6. Find the results in notebooks/.tmp/aiq_aira_similarity/similarity_evaluator_output.json.
7. This is a deliberately simple example; repeat steps 1-6 to add more evaluators of your own.

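The similarity checker built in the following cells scores a report by the cosine similarity between two embedding vectors. As a sketch of that one calculation, here is a dependency-free version; the short vectors are made-up stand-ins for real sentence embeddings, and `cosine_similarity` is an illustrative name, not the function the evaluator calls.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


generated = [0.2, 0.7, 0.1]      # toy embedding of a generated report
ground_truth = [0.25, 0.6, 0.2]  # toy embedding of the reference text
print(f"similarity: {cosine_similarity(generated, ground_truth):.3f}")
```

Scores fall between -1 and 1; values near 1 mean the two vectors point in nearly the same direction in embedding space.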
In [None]:
%%writefile ../aira/src/aiq_aira/eval/evaluators/similarity_evaluator.py
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""
Custom Similarity Evaluator - Demo evaluator for tutorial purposes.
Computes cosine similarity between generated report and ground truth.
"""
import logging
from sentence_transformers import SentenceTransformer, util
from aiq.data_models.evaluator import EvaluatorBaseConfig
from aiq.eval.evaluator.evaluator_model import EvalInput, EvalInputItem, EvalOutput, EvalOutputItem
from pydantic import Field

from aiq_aira.eval.schema import AIResearcherEvalOutput

logger = logging.getLogger(__name__)

class SimilarityEvaluatorConfig(EvaluatorBaseConfig, name="similarity_evaluator"):
    """Configuration for similarity evaluator."""
    model_name: str = Field("all-MiniLM-L6-v2", description="SentenceTransformer model to use")

class SimilarityEvaluator:
    """Evaluator that computes cosine similarity between generated and ground truth text."""
    
    def __init__(self, model_name: str = "all-MiniLM-L6-v2", output_dir: str | None = None):
        self.model_name = model_name
        self.output_dir = output_dir
        self._model = None
    
    def _get_model(self):
        """Lazy load the sentence transformer model"""
        if self._model is None:
            self._model = SentenceTransformer(self.model_name)
        return self._model
    
    async def evaluate_item(self, item: EvalInputItem) -> EvalOutputItem:
        """Evaluate a single item by parsing its AIRA output and scoring similarity."""
        try:
            # Follow the same pattern as other evaluators
            if item.output_obj == "":
                # If workflow is skipped, input_obj contains the data source
                item.output_obj = item.input_obj
            
            # Parse the data using the AIRA schema (same as other evaluators)
            data_source = AIResearcherEvalOutput.model_validate_json(item.output_obj)
            logger.info(f"Processing similarity evaluation for item {data_source.id}")
            
            # Extract the generated report and ground truth (following other evaluators)
            generated = data_source.finalized_summary
            ground_truth = data_source.ground_truth
            
            if not generated or not generated.strip():
                return EvalOutputItem(
                    id=item.id, 
                    score=0.0, 
                    reasoning={"error": "Generated report (finalized_summary) is empty"}
                )
            
            if not ground_truth or not ground_truth.strip():
                return EvalOutputItem(
                    id=item.id, 
                    score=0.0, 
                    reasoning={"error": "Ground truth is empty"}
                )
            
            # Compute cosine similarity
            model = self._get_model()
            embeddings = model.encode([generated, ground_truth])
            similarity = float(util.cos_sim(embeddings[0], embeddings[1]))
            
            logger.info(f"Item {data_source.id}: Similarity score: {similarity:.3f}")
            
            return EvalOutputItem(
                id=item.id,
                score=similarity,
                reasoning={
                    "similarity_score": similarity,
                    "generated_length": len(generated),
                    "ground_truth_length": len(ground_truth),
                    "model_used": self.model_name
                }
            )
            
        except Exception as e:
            logger.error(f"Similarity evaluation failed for item {item.id}: {str(e)}")
            return EvalOutputItem(
                id=item.id, 
                score=0.0, 
                reasoning={"error": f"Similarity evaluation failed: {str(e)}"}
            )
    
    async def evaluate(self, eval_input: EvalInput) -> EvalOutput:
        """Evaluate all items"""
        eval_output_items = []
        for item in eval_input.eval_input_items:
            result = await self.evaluate_item(item)
            eval_output_items.append(result)
        
        # Calculate average score
        scores = [item.score for item in eval_output_items if item.score is not None]
        avg_score = sum(scores) / len(scores) if scores else 0.0
        
        logger.info(f"Similarity evaluator completed: {len(scores)} valid scores, average: {avg_score:.3f}")
        
        return EvalOutput(average_score=avg_score, eval_output_items=eval_output_items)


In [None]:
from pathlib import Path
reg_path = Path("../aira/src/aiq_aira/eval/evaluator_register.py")
reg_path.parent.mkdir(parents=True, exist_ok=True)

base_imports = (
    "from aiq.builder.builder import EvalBuilder\n"
    "from aiq.builder.evaluator import EvaluatorInfo\n"
    "from aiq.cli.register_workflow import register_evaluator\n"
)

text = reg_path.read_text(encoding="utf-8") if reg_path.exists() else ""
if "register_evaluator" not in text or "EvaluatorInfo" not in text:
    # Prepend the imports and keep the existing content so earlier registrations are not lost
    reg_path.write_text(base_imports + "\n" + text, encoding="utf-8")
    print("Wrote base imports to:", reg_path.resolve())
else:
    print("Imports already present:", reg_path.resolve())


In [None]:
%%writefile -a ../aira/src/aiq_aira/eval/evaluator_register.py
# Custom similarity evaluator registration
from aiq_aira.eval.evaluators.similarity_evaluator import SimilarityEvaluator, SimilarityEvaluatorConfig

@register_evaluator(config_type=SimilarityEvaluatorConfig)
async def register_similarity_evaluator(config: SimilarityEvaluatorConfig, builder: EvalBuilder):
    """Register the similarity evaluator."""
    evaluator = SimilarityEvaluator(
        model_name=config.model_name,
        output_dir=builder.eval_general_config.output_dir,
    )
    yield EvaluatorInfo(
        config=config,
        evaluate_fn=evaluator.evaluate,
        description="Cosine Similarity Evaluator",
    )


In [None]:
import subprocess, sys

# 1) Fix tokenizers/transformers compatibility before loading sentence_transformers
print("Ensuring transformers/tokenizers compatibility (tokenizers>=0.21,<0.22 + transformers latest)...")
fix_cmds = [
    [sys.executable, "-m", "pip", "install", "-U", "tokenizers>=0.21,<0.22"],
    [sys.executable, "-m", "pip", "install", "-U", "transformers"],
]
for cmd in fix_cmds:
    res = subprocess.run(cmd, capture_output=True, text=True)
    if res.returncode != 0:
        print("Dependency fix step failed:")
        print(res.stderr or res.stdout)
        # Do not exit here; continue to reinstall so the error is visible if it persists.

# 2) Reinstall the package
print("Reinstalling package with our new similarity evaluator (clean, no deps, no cache)...")
cmd = [
    sys.executable, "-m", "pip", "install",
    "-e", "..[dev]",
    "--force-reinstall",
    "--no-deps",
    "--no-cache-dir",
    "--quiet",
]
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode == 0:
    print("Package reinstalled successfully!")
    print("Your new similarity evaluator should now work correctly")
else:
    print("Reinstall failed:")
    print(result.stderr or result.stdout)

print("Note: If imports were already attempted in this kernel, a kernel restart may be required so transformers/tokenizers are re-imported cleanly.")


In [None]:
# Run this to list the available evaluators. If you created the custom evaluator with the cells above, similarity_evaluator should appear here.
!nat info components -t evaluator

In [None]:
from pathlib import Path
cfg_path = Path("../configs/eval_config_with_similarity.yml")
cfg_path.parent.mkdir(parents=True, exist_ok=True)
print("Will write to:", cfg_path.resolve())

In [None]:
%%writefile ../configs/eval_config_with_similarity.yml
# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

general:
  use_uvloop: true
  front_end:
    _type: fastapi
    endpoints:
      - path: /generate_query
        method: POST
        description: Creates the query
        function_name: generate_query
      - path: /generate_summary
        method: POST
        description: Generates the summary
        function_name: generate_summary
      - path: /artifact_qa
        method: POST
        description: Q/A or chat about a previously generated artifact
        function_name: artifact_qa
      - path: /aiqhealth
        method: GET
        description: Health check for the AIQ AIRA service
        function_name: health_check
      - path: /default_collections
        method: GET
        description: Get the default collections
        function_name: default_collections
      - path: /default_prompt
        method: GET
        description: Get the default prompt
        function_name: default_prompt

  telemetry:
    logging:
      console:
        _type: console
        level: DEBUG
    tracing:
      weave:
        _type: weave
        project: "NAT-BP-Project-Default"

llms:
  instruct_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0.0
    base_url: ${INSTRUCT_LLM_BASE_URL:-http://aira-instruct-llm:8000/v1}
    api_key: not-needed
  
  nemotron:
    _type: nim
    model_name: nvidia/llama-3_3-nemotron-super-49b-v1_5
    temperature: 0.5
    base_url: ${NEMOTRON_LLM_BASE_URL:-http://nim-llm-ms:8000/v1}
    disable_streaming: false
    api_key: not-needed
    max_tokens: 5000

  eval_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.0
    base_url: https://integrate.api.nvidia.com/v1
    api_key: ${NVIDIA_API_KEY}

  ragas_llm:
    _type: nim
    model_name: nvdev/mistralai/mixtral-8x22b-instruct-v0.1
    temperature: 0.0
    base_url: https://integrate.api.nvidia.com/v1
    api_key: ${NVIDIA_API_KEY}

functions:
  generate_query:
    _type: generate_queries

  generate_summary:
    _type: generate_summaries
    rag_url: http://rag-server:8081/v1

  artifact_qa:
    _type: artifact_qa
    llm_name: instruct_llm
    rag_url: http://rag-server:8081/v1
    
  health_check:
    _type: health_check


  default_prompt:
    _type: default_prompt

  default_collections:
    _type: default_collections
    collections:
      - name: "Biomedical_Dataset"
        topic: "Biomedical"
        report_organization: "You are a medical researcher who specializes in cystic fibrosis. Create a report analyzing how CFTR modulators can be used to restore CFTR protein functions. Include a 150-200 word abstract and a methods, results, and discussion section. Format your answer in paragraphs. Consider all (and only) relevant data. Give a factual report with cited sources."
      - name: "Financial_Dataset"
        topic: "Financial"
        report_organization: "You are a financial analyst who specializes in financial statement analysis. Write a financial report analyzing the 2023 financial performance of Amazon. Identify trends in revenue growth, net income, and total assets. Discuss how these trends affected Amazon's yearly financial performance for 2023. Your output should be organized into a brief introduction, as many sections as necessary to create a comprehensive report, and a conclusion. Format your answer in paragraphs. Use factual sources such as Amazon's quarterly meeting releases for 2023. Cross analyze the sources to draw original and sound conclusions and explain your reasoning for arriving at conclusions. Do not make any false or unverifiable claims. I want a factual report with cited sources."

workflow:
  _type: aira_evaluator_workflow
  generator:
    _type: full
    verbose: true
    fact_extraction_llm: meta/llama-3.1-70b-instruct
    citation_pairing_llm: mistralai/mixtral-8x22b-instruct-v0.1

# Evaluation configuration - ONLY with custom similarity evaluator
eval:
  general:
    output_dir: ./.tmp/aiq_aira_similarity/
    cleanup: true

    dataset:
      _type: json
      # Replace with the absolute path to your own dataset
      file_path: /abs/path/to/eval_dataset_processed.json
      id_key: id
      structure:
        disable: true
    profiler:
      base_metrics: true

  evaluators:
    # ONLY our custom similarity evaluator - no built-in evaluators
    similarity_evaluator:
      _type: similarity_evaluator
      model_name: all-MiniLM-L6-v2

In [None]:
# Let's test your custom similarity evaluator
custom_config_path = Path("../configs/eval_config_with_similarity.yml").resolve()

print("Testing your custom similarity evaluator")
print(f"Config: {custom_config_path}")
print("\n" + "="*60)

!nat eval --config_file "{custom_config_path}"
