# Benchmark Metadata Extraction & Fact Verification Demo

This notebook walks through the end-to-end workflow: extracting benchmark metadata, identifying risks, and verifying the generated card against retrieved evidence using RAG (Retrieval-Augmented Generation).

## Pipeline Overview

1. **UnitXT Lookup** - Extract benchmark metadata from the catalog
2. **ID Extraction** - Parse HuggingFace repo IDs and paper URLs
3. **HuggingFace Metadata** - Fetch dataset information
4. **Paper Extraction** - Extract content from academic papers using Docling
5. **Card Composition** - Generate structured benchmark cards using an LLM
6. **Risk Identification** - Identify potential risks using Risk Atlas Nexus
7. **RAG Processing** - Retrieve evidence for fact verification
8. **Factuality Evaluation** - Assess confidence and flag uncertain fields
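
The stages above run in sequence over a shared state that also records completed steps and errors. The following is a minimal sketch of that pattern only, not the actual `auto_benchmarkcard` implementation: `run_pipeline`, `lookup`, and `compose` are hypothetical stand-ins introduced for illustration.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Step = Callable[[State], State]


def run_pipeline(steps: List[Step], state: State) -> State:
    """Run each step in order, recording completions and errors in the state."""
    for step in steps:
        try:
            state = step(state)
            state["completed"].append(step.__name__)
        except Exception as exc:
            # Stop at the first failure and keep the message for later inspection.
            state["errors"].append(f"{step.__name__}: {exc}")
            break
    return state


# Hypothetical stand-ins for two of the stages listed above.
def lookup(state: State) -> State:
    state["metadata"] = {"name": state["query"].upper()}
    return state


def compose(state: State) -> State:
    state["card"] = {"benchmark_details": state["metadata"]}
    return state


result = run_pipeline(
    [lookup, compose],
    {"query": "glue", "completed": [], "errors": []},
)
print(result["completed"])  # ['lookup', 'compose']
```

Keeping `completed` and `errors` inside the state means a caller can inspect partial progress even when a later stage fails.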

## Setup & Installation

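Install the dependencies by uncommenting and running the cell below, which installs the package from the current directory in editable mode. FactReasoner and Risk Atlas Nexus are now installed from pip.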

In [1]:
#!pip install -e .

In [2]:
import json
import os
from pathlib import Path
from IPython.display import display, JSON

# Enable nested event loops for Jupyter
import nest_asyncio
nest_asyncio.apply()

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Import workflow components
from auto_benchmarkcard.workflow import build_workflow, OutputManager, sanitize_benchmark_name
from auto_benchmarkcard.config import Config

print("All imports successful!")

[2025-10-17 13:19:15:840] - INFO - RiskAtlasNexus - Created RITS inference engine.


All imports successful!


## Configuration

Set your API credentials and processing parameters.

In [3]:
# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Verify credentials
required = ['RITS_API_KEY', 'RITS_MODEL', 'RITS_API_URL']
missing = [v for v in required if not os.getenv(v)]

if missing:
    print(f"Missing: {', '.join(missing)}")
else:
    print("Environment configured")

# Show config
print(f"LLM Engine: {Config.LLM_ENGINE_TYPE}")
print(f"Model: {Config.DEFAULT_MODEL}")
print(f"Threshold: {Config.DEFAULT_FACTUALITY_THRESHOLD}")

Environment configured
LLM Engine: rits
Model: llama-3.3-70b-instruct
Threshold: 0.8
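
The `Threshold` value is what the factuality stage uses to decide which fields to flag. The notebook does not show the comparison itself, so the sketch below assumes each field receives a score in [0, 1] and that fields scoring below the threshold are flagged. The function `flag_uncertain_fields` and the example scores are hypothetical.

```python
from typing import Dict, List


def flag_uncertain_fields(scores: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Return the names of fields whose score falls below the threshold."""
    # Assumption: lower scores mean less support from the retrieved evidence.
    return sorted(name for name, score in scores.items() if score < threshold)


# Illustrative scores only; these are not results from the run in this notebook.
example_scores = {"name": 0.97, "domains": 0.85, "languages": 0.62}
flagged = flag_uncertain_fields(example_scores)
print(flagged)  # ['languages']
```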


## Run Workflow

Execute the complete pipeline for a benchmark.

In [4]:
# Define benchmark to process
BENCHMARK_QUERY = "glue"  # Change this to your desired benchmark
CATALOG_PATH = None  # Optional: custom UnitXT catalog path
OUTPUT_PATH = None  # Optional: custom output directory

print(f"Processing benchmark: {BENCHMARK_QUERY}")

Processing benchmark: glue


In [5]:
# Initialize output manager
output_manager = OutputManager(BENCHMARK_QUERY, OUTPUT_PATH)
print(f"Session: {output_manager.get_summary()['session_directory']}")

# Initialize workflow state (all keys required by workflow)
initial_state = {
    "query": BENCHMARK_QUERY,
    "catalog_path": CATALOG_PATH,
    "output_manager": output_manager,
    "unitxt_json": None,
    "extracted_ids": None,
    "hf_repo": None,
    "hf_json": None,
    "docling_output": None,
    "composed_card": None,
    "risk_enhanced_card": None,
    "completed": [],
    "errors": [],
    "hf_extraction_attempted": False,
    "rag_results": None,
    "factuality_results": None,
}

# Build and execute workflow
print("=== Starting Workflow ===")
workflow = build_workflow()

try:
    final_state = workflow.invoke(initial_state)
    print("Workflow completed successfully!")
except Exception as e:
    print(f"Workflow failed: {e}")
    raise

Session: output/glue_2025-10-17_13-19
=== Starting Workflow ===


2025-10-17 13:19:18,921 - INFO - UnitXT metadata retrieved
2025-10-17 13:19:18,923 - INFO - UnitXT output saved to: output/glue_2025-10-17_13-19/tool_output/unitxt/glue.json
2025-10-17 13:19:18,924 - INFO - Starting ID and URL extraction
2025-10-17 13:19:18,924 - INFO - ID extraction completed
2025-10-17 13:19:18,925 - INFO - Extracted: HF=['nyu-mll/glue', 'stanfordnlp/sst2'], Paper=None
2025-10-17 13:19:18,926 - INFO - Extractor output saved to: output/glue_2025-10-17_13-19/tool_output/extractor/glue.json
2025-10-17 13:19:20,060 - INFO - HuggingFace metadata retrieved successfully
2025-10-17 13:19:20,064 - INFO - HuggingFace output saved to: output/glue_2025-10-17_13-19/tool_output/hf/glue.json
2025-10-17 13:19:20,066 - INFO - Starting HuggingFace extraction
2025-10-17 13:19:20,067 - INFO - Found paper_url in HF dataset nyu-mll/glue: https://arxiv.org/abs/1804.07461
2025-10-17 13:19:20,069 - INFO - HF extractor output saved to: output/glue_2025-10-17_13-19/tool_output/extractor/glue.j

Inferring with RITS:   0%|          | 0/1 [00:00<?, ?it/s]

2025-10-17 13:21:39,731 - INFO - Risk-enhanced card saved to: output/glue_2025-10-17_13-19/tool_output/risk_enhanced/glue.json
2025-10-17 13:21:39,732 - INFO - Risk identification results saved to: output/glue_2025-10-17_13-19/tool_output/risk_atlas_nexus/risks_glue.json
2025-10-17 13:21:39,733 - INFO - Risk identification completed
2025-10-17 13:21:39,733 - INFO - Risks: Membership inference attack, Confidential data in prompt (+3 more)
2025-10-17 13:21:39,733 - INFO - Starting RAG processing
[2025-10-17 13:23:43:539] - INFO - RiskAtlasNexus - Created RITS inference engine.
2025-10-17 13:24:29,789 - INFO - RAG processing completed
2025-10-17 13:24:29,789 - INFO - RAG: 18 claims, 47 evidence sources
2025-10-17 13:24:29,790 - INFO - RAG results saved to: output/glue_2025-10-17_13-19/tool_output/rag/formatted_rag_results_glue.jsonl
2025-10-17 13:24:29,791 - INFO - Starting factuality evaluation
2025-10-17 13:24:31,553 - INFO - FactReasoner evaluation complete
2025-10-17 13:24:31,554 - IN

Workflow completed successfully!


## Results Summary

In [6]:
# Display workflow steps
print("=== Workflow Steps Completed ===")
for i, step in enumerate(final_state.get('completed', []), 1):
    print(f"{i}. {step}")

# Display errors if any
errors = final_state.get('errors', [])
if errors:
    print("=== Errors ===")
    for error in errors:
        print(error)

=== Workflow Steps Completed ===
1. unitxt done
2. extract hf_repo=['nyu-mll/glue', 'stanfordnlp/sst2'], paper_url=None
3. hf done
4. hf_extract paper_url=https://arxiv.org/abs/1804.07461
5. docling done
6. composer done
7. risk identification done
8. rag done
9. factreasoner done
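
When running several benchmarks, it can help to check programmatically that no stage was skipped. The sketch below assumes each `completed` entry begins with the step name, as in the output above. `EXPECTED_STEPS` and `missing_steps` are hypothetical helpers, not part of `auto_benchmarkcard`.

```python
from typing import Iterable, List

# Step names taken from the first word of each entry in the output above.
EXPECTED_STEPS = [
    "unitxt", "extract", "hf", "hf_extract", "docling",
    "composer", "risk", "rag", "factreasoner",
]


def missing_steps(completed: Iterable[str], expected: List[str]) -> List[str]:
    """Return expected step names that never appear as the first word of an entry."""
    seen = {entry.split()[0] for entry in completed if entry.strip()}
    return [step for step in expected if step not in seen]


completed = [
    "unitxt done",
    "extract hf_repo=['nyu-mll/glue', 'stanfordnlp/sst2'], paper_url=None",
    "hf done",
    "hf_extract paper_url=https://arxiv.org/abs/1804.07461",
    "docling done",
    "composer done",
    "risk identification done",
    "rag done",
    "factreasoner done",
]
print(missing_steps(completed, EXPECTED_STEPS))      # []
print(missing_steps(completed[:7], EXPECTED_STEPS))  # ['rag', 'factreasoner']
```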


## View Benchmark Card

Display the composed benchmark card with risk information.

In [7]:
# Load the final saved benchmark card from disk
card_name = f"benchmark_card_{sanitize_benchmark_name(BENCHMARK_QUERY)}.json"
card_path = Path(output_manager.benchmarkcard_dir) / card_name

if card_path.exists():
    with open(card_path, 'r', encoding='utf-8') as f:
        saved_card = json.load(f)

    benchmark_card = saved_card.get('benchmark_card', {})
    details = benchmark_card.get('benchmark_details', {})
    risks = benchmark_card.get('possible_risks', [])

    print("=== Benchmark Details ===")
    print(f"Name: {details.get('name', 'N/A')}")
    print(f"Domains: {details.get('domains', [])}")
    print(f"Languages: {details.get('languages', [])}")
    print(f"\nRisks Identified: {len(risks)}")

    # Display full card
    print("=== Full Benchmark Card ===")
    display(JSON(benchmark_card, expanded=False))
else:
    print(f"No benchmark card found at: {card_path}")

=== Benchmark Details ===
Name: GLUE
Domains: ['natural language understanding', 'sentence classification', 'textual entailment']
Languages: ['English']

Risks Identified: 5
=== Full Benchmark Card ===


<IPython.core.display.JSON object>
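
To see the risks as plain text instead of expanding the JSON widget, the `possible_risks` list can be flattened into labels. The structure of each entry is not shown in this notebook, so the sketch below handles both dictionaries and plain strings. The key `name` is an assumption and `risk_labels` is a hypothetical helper.

```python
from typing import Any, List


def risk_labels(risks: List[Any], name_key: str = "name") -> List[str]:
    """Return one printable label per risk entry."""
    labels = []
    for risk in risks:
        if isinstance(risk, dict):
            # name_key is an assumed field name; fall back to the whole entry.
            labels.append(str(risk.get(name_key, risk)))
        else:
            labels.append(str(risk))
    return labels


# Illustrative entries in two possible shapes; the real card may differ.
example = [{"name": "Membership inference attack"}, "Confidential data in prompt"]
labels = risk_labels(example)
for i, label in enumerate(labels, 1):
    print(f"{i}. {label}")
```

In the notebook, `risk_labels(risks)` would be called on the `risks` list loaded in the cell above.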

## Access Outputs

All results are saved in the session directory.

In [8]:
# Display output locations
summary = output_manager.get_summary()
print("=== Output Files ===\n")
print(f"Session Directory: {summary['session_directory']}")
print(f"Benchmark Cards: {summary['benchmark_cards']}")

# Final card path
card_name = f"benchmark_card_{sanitize_benchmark_name(BENCHMARK_QUERY)}.json"
card_path = Path(summary['benchmark_cards']) / card_name

if card_path.exists():
    print(f"Final card: {card_path}")
    print(f"  Size: {card_path.stat().st_size:,} bytes")
else:
    print(f"Card not found at: {card_path}")

=== Output Files ===

Session Directory: output/glue_2025-10-17_13-19
Benchmark Cards: output/glue_2025-10-17_13-19/benchmarkcard
Final card: output/glue_2025-10-17_13-19/benchmarkcard/benchmark_card_glue.json
  Size: 8,312 bytes
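
The session directory also holds the intermediate tool outputs. A minimal sketch for listing every file under it with `pathlib` follows; `list_outputs` is a hypothetical helper, and the demo uses a temporary directory so it runs without a real session.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def list_outputs(session_dir: Path) -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under session_dir."""
    return sorted(
        (p.relative_to(session_dir).as_posix(), p.stat().st_size)
        for p in session_dir.rglob("*")
        if p.is_file()
    )


# Demo on a throwaway directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "benchmarkcard").mkdir()
    (root / "benchmarkcard" / "benchmark_card_demo.json").write_text("{}")
    listing = list_outputs(root)
    for rel, size in listing:
        print(f"{rel}: {size:,} bytes")
```

In the notebook, the same function could be called as `list_outputs(Path(summary['session_directory']))`.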
