# 01: The DeepSearchQA Dataset & Agent Tools

This notebook introduces the two foundational components of the Knowledge QA system:

- **DeepSearchQA** — the benchmark dataset used to evaluate the agent
- **Agent tools** — the five capabilities the agent uses to research and verify answers

## What You'll Learn

1. What the DeepSearchQA dataset contains and how to explore it
2. The five tools the agent has access to, and how it's instructed to use them

## Prerequisites

- `GOOGLE_API_KEY` set in your `.env` file
- Dependencies installed with `uv sync`

In [None]:
import os
from pathlib import Path

from aieng.agent_evals.knowledge_qa import DeepSearchQADataset
from aieng.agent_evals.knowledge_qa.system_instructions import build_system_instructions
from dotenv import load_dotenv
from rich.console import Console
from rich.markdown import Markdown
from rich.panel import Panel
from rich.table import Table


# Set the working directory to the repository root.
# If we are not already there, this assumes the notebook sits two levels below it.
if Path.cwd().name == "eval-agents":
    print(f"Working directory: {Path.cwd()}")
else:
    os.chdir(Path.cwd().parent.parent)
    print(f"Working directory set to: {Path.cwd()}")

load_dotenv(verbose=True)
console = Console(width=100)

## 1. The DeepSearchQA Dataset

[DeepSearchQA](https://www.kaggle.com/datasets/deepmind/deepsearchqa) is a benchmark from Google DeepMind
for evaluating deep research agents. It contains 896 research questions requiring multi-step web search
and reasoning to answer correctly.

Each question is a **causal chain task**: the agent must follow a chain of searches, fetch real sources,
and verify facts before answering, rather than recalling the answer from training data.

### Answer Types

| Type | Description | Example |
|------|-------------|---------|
| **Single Answer** | One specific value | A date, a number, a proper name |
| **Set Answer** | Multiple required items | A list of countries, a set of policy changes |

Evaluation uses an LLM-as-judge that computes **precision, recall, and F1** by comparing the agent's
answer to the ground truth item-by-item.
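
To make the item-by-item scoring concrete, here is a minimal sketch of how precision, recall, and F1 follow from matched items. It is not the project's grader: the real evaluation uses an LLM judge to decide whether two items match, whereas the hypothetical `item_level_scores` helper below assumes a case-insensitive exact string match as a stand-in, and the country lists are made-up sample data.

```python
def item_level_scores(predicted: list[str], expected: list[str]) -> dict[str, float]:
    """Score a predicted answer against ground truth, item by item.

    Assumption: exact match after lowercasing and trimming stands in
    for the LLM judge's decision that two items are equivalent.
    """
    pred = {p.strip().lower() for p in predicted}
    gold = {g.strip().lower() for g in expected}
    matched = len(pred & gold)

    # Precision: share of predicted items that are correct.
    precision = matched / len(pred) if pred else 0.0
    # Recall: share of ground-truth items that were found.
    recall = matched / len(gold) if gold else 0.0
    # F1: harmonic mean of the two, defined as 0 when both are 0.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up Set Answer: two of three predicted items are right, two of four expected items are found.
scores = item_level_scores(
    predicted=["France", "Germany", "Spain"],
    expected=["France", "Germany", "Italy", "Portugal"],
)
print(scores)
```

Under this stand-in, a Single Answer question with one predicted item scores either 1.0 or 0.0 on all three metrics, while a Set Answer can earn partial credit.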

In [None]:
dataset = DeepSearchQADataset()

console.print(f"Total examples: [cyan]{len(dataset)}[/cyan]")
console.print(f"Categories: [cyan]{len(dataset.get_categories())}[/cyan]")

### 1.1 Dataset Structure

Each example is a `DSQAExample` with five fields: `example_id`, `problem_category`, `answer_type`, `problem`, and `answer`. The cell below prints the first example.

In [None]:
example = dataset[0]

console.print(
    Panel(
        f"[bold]example_id:[/bold]       {example.example_id}\n"
        f"[bold]problem_category:[/bold] {example.problem_category}\n"
        f"[bold]answer_type:[/bold]      {example.answer_type}\n\n"
        f"[bold cyan]problem:[/bold cyan]\n{example.problem}\n\n"
        f"[bold yellow]answer:[/bold yellow]\n{example.answer}",
        title="DSQAExample",
        border_style="blue",
    )
)

### 1.2 Categories

The dataset spans 17 domains. Let's see how examples are distributed across them.

In [None]:
categories = dataset.get_categories()

cat_table = Table(title="Dataset by Category")
cat_table.add_column("Category", style="cyan")
cat_table.add_column("Total", style="white", justify="right")
cat_table.add_column("Single Answer", style="dim", justify="right")
cat_table.add_column("Set Answer", style="dim", justify="right")

for cat in sorted(categories):
    examples = dataset.get_by_category(cat)
    single = sum(1 for e in examples if e.answer_type == "Single Answer")
    set_ans = len(examples) - single
    cat_table.add_row(cat, str(len(examples)), str(single), str(set_ans))

console.print(cat_table)

### 1.3 Answer Types in Practice

The answer type matters for evaluation: a "Single Answer" has one ground-truth value, while a "Set Answer"
lists several required items, and the grader treats the two differently when computing correctness.

In [None]:
single_ex = next(e for e in dataset.examples if e.answer_type == "Single Answer")
set_ex = next(e for e in dataset.examples if e.answer_type == "Set Answer")

for label, ex, style in [
    ("Single Answer", single_ex, "green"),
    ("Set Answer", set_ex, "yellow"),
]:
    console.print(
        Panel(
            f"[bold cyan]Question:[/bold cyan]\n{ex.problem}\n\n[bold yellow]Answer:[/bold yellow]\n{ex.answer}",
            title=f"{label} — {ex.problem_category}",
            border_style=style,
        )
    )

### 1.4 Browsing Examples

You can retrieve examples by category or by ID. The cell below filters by category and previews the first five questions.

In [None]:
# Examples by category
finance_examples = dataset.get_by_category("Finance & Economics")
console.print(f"Finance & Economics: [cyan]{len(finance_examples)}[/cyan] examples\n")

# Display a preview table
browse_table = Table(title="Finance & Economics — First 5 Examples")
browse_table.add_column("ID", style="dim", width=6)
browse_table.add_column("Answer Type", style="cyan", width=15)
browse_table.add_column("Question", style="white")

for ex in finance_examples[:5]:
    # Truncate long questions so the table stays readable
    q = (ex.problem[:75] + "...") if len(ex.problem) > 75 else ex.problem
    browse_table.add_row(str(ex.example_id), ex.answer_type, q)

console.print(browse_table)

## 2. The Agent's Tools

The `KnowledgeGroundedAgent` has five tools that form a natural research workflow:

| Tool | Purpose | When the Agent Uses It |
|------|---------|----------------------|
| `google_search` | Find relevant URLs | First step for any sub-question |
| `web_fetch` | Read web pages and PDFs | To verify facts from the actual source |
| `fetch_file` | Download CSV, XLSX, JSON files | When the answer is in structured data |
| `grep_file` | Search within a downloaded file | To locate a specific value in a large file |
| `read_file` | Read sections of a downloaded file | To inspect a specific part of a downloaded file |

**Why not answer from search snippets?** Snippets are brief and may be outdated or misleading.
The system instructions enforce a strict causal chain: **Search → Fetch → Verify → Answer**.
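
The `grep_file` and `read_file` rows in the table above describe a locate-then-inspect pattern on a downloaded file. The sketch below illustrates that pattern with the standard library only. It is not the agent's implementation: `grep_lines`, `read_lines`, and the tiny CSV are hypothetical stand-ins, and the real tools' signatures may differ.

```python
import re
import tempfile
from pathlib import Path


def grep_lines(path: Path, pattern: str) -> list[tuple[int, str]]:
    """Return (1-based line number, line) pairs whose text matches the regex."""
    regex = re.compile(pattern)
    with path.open(encoding="utf-8") as handle:
        return [
            (number, line.rstrip("\n"))
            for number, line in enumerate(handle, start=1)
            if regex.search(line)
        ]


def read_lines(path: Path, start: int, end: int) -> list[str]:
    """Return lines start..end (1-based, inclusive) without loading more than needed."""
    with path.open(encoding="utf-8") as handle:
        return [
            line.rstrip("\n")
            for number, line in enumerate(handle, start=1)
            if start <= number <= end
        ]


# Made-up sample file standing in for a downloaded CSV.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample.csv"
    csv_path.write_text(
        "country,year,value\nA,2020,1.5\nB,2020,2.5\nA,2021,3.5\n",
        encoding="utf-8",
    )

    # Step 1: locate candidate rows (the grep step).
    hits = grep_lines(csv_path, r"^A,")
    # Step 2: read the header plus the first hit for context (the read step).
    context = read_lines(csv_path, 1, hits[0][0])

print(hits)
print(context)
```

Searching first and reading only a narrow window keeps the amount of text the agent has to process small, which matters when the downloaded file is large.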

In [None]:
instructions = build_system_instructions()

console.print(
    Panel(
        Markdown(instructions),
        title="Agent System Instructions",
        border_style="blue",
        padding=(1, 2),
    )
)

## Summary

In this notebook you saw:

1. **The DeepSearchQA dataset** — 896 research questions across 17 categories, evaluated with
   precision/recall/F1 using an LLM-as-judge
2. **The five agent tools** — search, web fetch, file download, grep, and file read
3. **The system instructions** — how the agent is guided to use its tools, including the
   critical search → fetch → verify → answer chain

**Next:** In Notebook 02, we'll create the agent, run it on questions, and observe how it uses these tools.

In [None]:
console.print(
    Panel(
        "[green]✓[/green] Notebook complete!\n\n"
        "[cyan]Next:[/cyan] Open [bold]02_running_the_agent.ipynb[/bold] to run the agent.",
        title="Done",
        border_style="green",
    )
)