# Benchmark Metadata Extraction & Fact Verification Demo

This notebook demonstrates the complete workflow for extracting benchmark metadata, identifying risks, and performing fact-based verification using RAG (Retrieval-Augmented Generation).

## Pipeline Overview

1. **UnitXT Lookup** - Extract benchmark metadata from catalog
2. **ID Extraction** - Parse HuggingFace repo IDs and paper URLs
3. **HuggingFace Metadata** - Fetch dataset information
4. **Paper Extraction** - Extract content from academic papers using Docling
5. **Card Composition** - Generate structured benchmark cards using LLM
6. **Risk Identification** - Identify potential risks using Risk Atlas Nexus
7. **RAG Processing** - Retrieve evidence for fact verification
8. **Factuality Evaluation** - Assess confidence and flag uncertain fields
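
The stages above pass a shared state dictionary from one step to the next, appending to `completed` and `errors` as they go. The following is a minimal sketch of that pattern, not the actual implementation behind `build_workflow()`. The `run_pipeline` helper, the two stand-in steps, and the placeholder repo ID are hypothetical. The sketch also assumes the pipeline continues after a failed step, which the real workflow may not do.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Step = Callable[[State], State]


def run_pipeline(steps: List[Step], state: State) -> State:
    """Run each step in order, recording successes and failures in the state."""
    for step in steps:
        try:
            state = step(state)
            state["completed"].append(f"{step.__name__} done")
        except Exception as exc:
            # Assumption: record the failure and continue with the next step.
            state["errors"].append(f"{step.__name__}: {exc}")
    return state


# Hypothetical stand-ins for the first two stages of the pipeline.
def lookup(state: State) -> State:
    return {**state, "unitxt_json": {"name": state["query"]}}


def extract_ids(state: State) -> State:
    if state["unitxt_json"] is None:
        raise ValueError("no catalog entry")
    return {**state, "hf_repo": "example-org/example-dataset"}  # placeholder ID


result = run_pipeline(
    [lookup, extract_ids],
    {"query": "attaq", "unitxt_json": None, "hf_repo": None,
     "completed": [], "errors": []},
)
print(result["completed"])
print(result["errors"])
```

Keeping `completed` and `errors` inside the state means a single dictionary is enough to inspect what ran and what failed once the pipeline returns.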

## 1. Setup & Installation

Install the package and its dependencies. FactReasoner and Risk Atlas Nexus no longer need a separate setup step: pip installs them automatically as package dependencies.

In [None]:
# Install package in development mode (run once)
# !pip install -e ..

# Dependencies are automatically installed:
# - fact_reasoner @ git+https://github.com/arishofmann/FactReasoner.git
# - risk-atlas-nexus[rits] @ git+https://github.com/IBM/risk-atlas-nexus.git

In [None]:
import json
import os
from pathlib import Path
from IPython.display import display, JSON

# Enable nested event loops for Jupyter
import nest_asyncio
nest_asyncio.apply()

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Import workflow components
from auto_benchmarkcard.workflow import build_workflow, OutputManager, sanitize_benchmark_name
from auto_benchmarkcard.config import Config

print("All imports successful!")

✓ All imports successful!


## 2. Configuration

Set your API credentials and processing parameters.

In [None]:
# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Verify credentials
required = ['RITS_API_KEY', 'RITS_MODEL', 'RITS_API_URL']
missing = [v for v in required if not os.getenv(v)]

if missing:
    print(f"Missing: {', '.join(missing)}")
else:
    print("Environment configured")

# Show config
print(f"LLM Engine: {Config.LLM_ENGINE_TYPE}")
print(f"Model: {Config.DEFAULT_MODEL}")
print(f"Threshold: {Config.DEFAULT_FACTUALITY_THRESHOLD}")

✓ Environment configured
LLM Engine: rits
Model: llama-3.3-70b-instruct
Threshold: 0.8


## 3. Run Workflow

Execute the complete pipeline for a benchmark.

In [7]:
# Define benchmark to process
BENCHMARK_QUERY = "attaq"  # Change this to your desired benchmark
CATALOG_PATH = None  # Optional: custom UnitXT catalog path
OUTPUT_PATH = None  # Optional: custom output directory

print(f"Processing benchmark: {BENCHMARK_QUERY}")

Processing benchmark: attaq


In [None]:
# Initialize output manager
output_manager = OutputManager(BENCHMARK_QUERY, OUTPUT_PATH)
print(f"Session: {output_manager.get_summary()['session_directory']}")

# Initialize workflow state (all keys required by workflow)
initial_state = {
    "query": BENCHMARK_QUERY,
    "catalog_path": CATALOG_PATH,
    "output_manager": output_manager,
    "unitxt_json": None,
    "extracted_ids": None,
    "hf_repo": None,
    "hf_json": None,
    "docling_output": None,
    "composed_card": None,
    "risk_enhanced_card": None,
    "completed": [],
    "errors": [],
    "hf_extraction_attempted": False,
    "rag_results": None,
    "factuality_results": None,
}

# Build and execute workflow
print("=== Starting Workflow ===")
workflow = build_workflow()

try:
    final_state = workflow.invoke(initial_state)
    print("Workflow completed successfully!")
except Exception as e:
    print(f"Workflow failed: {e}")
    raise

2025-10-16 10:45:10,316 - INFO - Looking up benchmark 'attaq' in UnitXT catalog


Session: output/attaq_2025-10-16_10-45
=== Starting Workflow ===
'input_fields' field of Task should be a dictionary of field names and their types. For example, {'text': str, 'classes': List[str]}. Instead only '['input']' was passed. All types will be assumed to be 'Any'. In future version of unitxt this will raise an exception.
For more information: see https://www.unitxt.ai/en/latest//docs/adding_task.html 

'reference_fields' field of Task should be a dictionary of field names and their types. For example, {'text': str, 'classes': List[str]}. Instead only '['label']' was passed. All types will be assumed to be 'Any'. In future version of unitxt this will raise an exception.
For more information: see https://www.unitxt.ai/en/latest//docs/adding_task.html 

'prediction_type' was not set in Task. It is used to check the output of template post processors is compatible with the expected input of the metrics. Setting `prediction_type` to 'Any' (no checking is done). In future versi

2025-10-16 10:45:13,623 - INFO - Successfully retrieved UnitXT metadata for 'attaq' with 1 components
2025-10-16 10:45:13,699 - INFO - UnitXT metadata retrieved
2025-10-16 10:45:13,699 - INFO - Found: attaq - TaskCard delineates the phases in transforming the source da...
2025-10-16 10:45:13,700 - INFO - UnitXT output saved to: output/attaq_2025-10-16_10-45/tool_output/unitxt/attaq.json
2025-10-16 10:45:13,702 - INFO - Starting ID and URL extraction
2025-10-16 10:45:13,703 - INFO - ID extraction completed
2025-10-16 10:45:13,703 - INFO - Extracted: HF=ibm/AttaQ, Paper=None
2025-10-16 10:45:13,704 - INFO - Extractor output saved to: output/attaq_2025-10-16_10-45/tool_output/extractor/attaq.json
2025-10-16 10:45:13,705 - INFO - Fetching HuggingFace metadata for dataset: ibm/AttaQ
2025-10-16 10:45:14,734 - INFO - Successfully retrieved HuggingFace metadata for ibm/AttaQ
2025-10-16 10:45:14,734 - INFO - HuggingFace metadata retrieved successfully
2025-10-16 10:45:14,735 - INFO - HuggingFac

✓ Workflow completed successfully!


## 4. Results Summary

In [None]:
# Display workflow steps
print("=== Workflow Steps Completed ===")
for i, step in enumerate(final_state.get('completed', []), 1):
    print(f"{i}. {step}")

# Display errors if any
errors = final_state.get('errors', [])
if errors:
    print("=== Errors ===")
    for error in errors:
        print(f"{error}")

=== Workflow Steps Completed ===
1. unitxt done
2. extract hf_repo=ibm/AttaQ, paper_url=None
3. hf done
4. hf_extract no paper_url found
5. composer done
6. risk identification done
7. rag done
8. factreasoner done


## 5. View Benchmark Card

Display the composed benchmark card with risk information.

In [None]:
# Get the risk-enhanced card
# 'risk_enhanced_card' may be present but None, so fall back to an empty dict
risk_card = (final_state.get('risk_enhanced_card') or {}).get('benchmark_card')

if risk_card:
    details = risk_card.get('benchmark_details', {})
    risks = risk_card.get('possible_risks', [])

    print("=== Benchmark Details ===")
    print(f"Name: {details.get('name', 'N/A')}")
    print(f"Domains: {details.get('domains', [])}")
    print(f"Languages: {details.get('languages', [])}")
    print(f"\nRisks Identified: {len(risks)}")

    # Display full card
    print("=== Full Benchmark Card ===")
    display(JSON(risk_card, expanded=False))
else:
    print("No benchmark card available")

=== Benchmark Details ===
Name: AttaQ
Domains: ['text-generation', 'text2text-generation', 'safety', 'harm']
Languages: ['English']
Risks Identified: 0
=== Full Benchmark Card ===


<IPython.core.display.JSON object>

## 6. Factuality Analysis

Review confidence scores and flagged fields.
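
The sketch below shows one way to sort claims into disjoint confidence buckets. It assumes, as the analysis cell in this section does, that `p_true == 0.5` marks a claim for which no evidence was found and that 0.8 is the confidence threshold. The `bucket` and `summarize` helpers and the sample values are hypothetical and not part of `auto_benchmarkcard`.

```python
from collections import Counter
from typing import Dict, Iterable


def bucket(p_true: float, threshold: float = 0.8) -> str:
    """Assign one claim to exactly one bucket."""
    if p_true == 0.5:
        # Assumption: 0.5 is the uninformative prior, i.e. no evidence retrieved.
        return "none"
    return "high" if p_true >= threshold else "low"


def summarize(marginals: Iterable[Dict[str, float]], threshold: float = 0.8) -> Counter:
    """Count claims per bucket; a missing p_true is treated as 0.0."""
    return Counter(bucket(m.get("p_true", 0.0), threshold) for m in marginals)


# Made-up sample values for illustration only.
sample = [{"p_true": 0.93}, {"p_true": 0.5}, {"p_true": 0.41}, {"p_true": 0.0}]
counts = summarize(sample)
print(counts["high"], counts["low"], counts["none"])
```

Because each claim lands in exactly one bucket, the three counts always sum to the total number of claims.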

In [None]:
factuality_results = final_state.get('factuality_results')

if factuality_results:
    marginals = factuality_results.get('marginals', [])
    field_analysis = factuality_results.get('field_analysis', {})

    # Count confidence levels in disjoint buckets.
    # p_true == 0.5 means no evidence was found, so it is excluded from 'low'.
    stats = {
        'total': len(marginals),
        'high': sum(1 for m in marginals if m.get('p_true', 0) >= 0.8),
        'low': sum(1 for m in marginals if m.get('p_true', 0) < 0.8 and m.get('p_true', 0) != 0.5),
        'none': sum(1 for m in marginals if m.get('p_true', 0) == 0.5)
    }

    print("=== Factuality Statistics ===")
    print(f"Total Claims: {stats['total']}")
    print(f"High Confidence (≥0.8): {stats['high']}")
    print(f"Low Confidence (<0.8): {stats['low']}")
    print(f"No Evidence: {stats['none']}")

    # Show flagged fields
    flagged = field_analysis.get('flagged_fields', [])
    if flagged:
        print(f"\n⚠ Flagged Fields: {len(flagged)}")
        for field in flagged[:10]:
            print(f"  • {field}")
    else:
        print("All fields verified!")
else:
    print("No factuality results available")

=== Factuality Statistics ===
Total Claims: 20
High Confidence (≥0.8): 0
Low Confidence (<0.8): 0
No Evidence: 20
✓ All fields verified!


## 7. Access Outputs

All results are saved in the session directory.
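
To work with a saved card outside the workflow, it can be read back from disk. This is a minimal sketch that assumes the `<session>/benchmarkcard/benchmark_card_<name>.json` layout shown in the output of this section and an already sanitized benchmark name. The `load_card` helper is hypothetical, and the demo writes a tiny made-up card to a temporary directory so the block runs on its own.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_card(session_dir: Path, benchmark: str) -> Optional[dict]:
    """Return the saved card as a dict, or None if the file does not exist."""
    card_path = session_dir / "benchmarkcard" / f"benchmark_card_{benchmark}.json"
    if not card_path.exists():
        return None
    with card_path.open(encoding="utf-8") as fh:
        return json.load(fh)


# Demo with a temporary session directory and a made-up card.
with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp)
    (session / "benchmarkcard").mkdir()
    demo_file = session / "benchmarkcard" / "benchmark_card_demo.json"
    demo_file.write_text(json.dumps({"name": "demo"}), encoding="utf-8")

    card = load_card(session, "demo")
    missing = load_card(session, "unknown")

print(card)
print(missing)
```

For a real run, pass `Path(summary['session_directory'])` and the sanitized benchmark name instead of the temporary directory.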

In [None]:
# Display output locations
summary = output_manager.get_summary()
print("=== Output Files ===\n")
print(f"Session Directory: {summary['session_directory']}")
print(f"Benchmark Cards: {summary['benchmark_cards']}")

# Final card path
card_name = f"benchmark_card_{sanitize_benchmark_name(BENCHMARK_QUERY)}.json"
card_path = Path(summary['benchmark_cards']) / card_name

if card_path.exists():
    print(f"Final card: {card_path}")
    print(f"  Size: {card_path.stat().st_size:,} bytes")
else:
    print(f"Card not found at: {card_path}")

=== Output Files ===
Session Directory: output/attaq_2025-10-16_10-45
Benchmark Cards: output/attaq_2025-10-16_10-45/benchmarkcard
✓ Final card: output/attaq_2025-10-16_10-45/benchmarkcard/benchmark_card_attaq.json
  Size: 5,438 bytes
