# Case Files and Running the Agent

In Notebook 1 we explored the raw database and the `ReadOnlySqlDatabase` tool.
Now we zoom out and ask: what problem is the agent actually solving, and how do we feed it a case?

This notebook covers:
1. The real-world AML workflow we modelled
2. The data structures that represent a case
3. Why the evaluation dataset has four case types and what each one tests
4. How to generate cases from the raw data
5. Running a single case through the agent and inspecting its output

---

**Prerequisites:** Complete Notebook 1 first. The database must exist at `implementations/aml_investigation/data/aml_transactions.db`.

In [None]:
import json
import os
from pathlib import Path

import pandas as pd
from aieng.agent_evals.aml_investigation.data import (
    CaseFile,
    CaseRecord,
    GroundTruth,
    LaunderingPattern,
    build_cases,
    download_dataset_file,
    normalize_transactions_data,
)
from aieng.agent_evals.aml_investigation.task import AmlInvestigationTask
from dotenv import load_dotenv


# Set the working directory to the project's root folder
if Path("").absolute().name == "eval-agents":
    print(f"Notebook path is already the root path: {Path('').absolute()}")
else:
    os.chdir(Path("").absolute().parent.parent)
    print(f"The notebook path has been set to: {Path('').absolute()}")

load_dotenv(verbose=True)

## 1. Our Model of the Anti-Money Laundering Investigation Workflow

In practice, AML investigations at financial institutions are more complex than what we model here. We focus on the core investigative loop: a transaction gets flagged, a case is opened, an analyst investigates, and the analyst produces a written determination.

In our model, the workflow has three stages.

First, an external alerting system flags a transaction. This could be a rules engine, an ML model, a law enforcement referral, or a routine sampling process. The system assigns a `trigger_label` to the case, which is a short string describing why the case was opened. Crucially, this label is noisy: it may be a strong signal (e.g. `FAN-OUT`, `LAW_ENFORCEMENT_REFERRAL`) or essentially no signal at all (e.g. `QA_SAMPLE`, `RANDOM_REVIEW`).

Second, the case is opened with a structured record containing: a unique `case_id`, the flagged `seed_transaction_id`, the `seed_timestamp` (which marks the end of the investigation window), and a `window_start` timestamp (which marks how far back the analyst should look). The analyst is only expected to reason about events within that window.

Third, the analyst investigates by querying the transaction database, identifies whether the activity is consistent with a laundering pattern, and produces a written output: a narrative summary, a verdict (`is_laundering`), a pattern classification, and the specific transaction IDs that form the suspicious chain.

The agent mirrors this structure exactly. It receives the case record as a JSON object, queries the database, and returns a structured `AnalystOutput`.
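
The investigation window from the second stage can be sketched with the standard library. This is an illustration only: the helper names `investigation_window` and `in_window` are hypothetical, the timestamps are assumed to be ISO-8601 strings like those in the demo case later in this notebook, and treating both ends of the window as inclusive is an assumption, not something the case builder is confirmed to do.

```python
from datetime import datetime, timedelta


def investigation_window(seed_timestamp: str, lookback_days: int) -> tuple[str, str]:
    """Return (window_start, seed_timestamp) as ISO-8601 strings (hypothetical helper)."""
    seed = datetime.fromisoformat(seed_timestamp)
    start = seed - timedelta(days=lookback_days)
    return start.isoformat(), seed.isoformat()


def in_window(timestamp: str, window_start: str, seed_timestamp: str) -> bool:
    """Check whether an event falls inside the window (both ends inclusive by assumption)."""
    ts = datetime.fromisoformat(timestamp)
    return datetime.fromisoformat(window_start) <= ts <= datetime.fromisoformat(seed_timestamp)


window_start, window_end = investigation_window("2022-09-15T14:30:00", lookback_days=10)
print(window_start, "->", window_end)  # 2022-09-05T14:30:00 -> 2022-09-15T14:30:00
print(in_window("2022-09-10T09:00:00", window_start, window_end))  # True: inside the window
print(in_window("2022-09-16T00:00:00", window_start, window_end))  # False: after the seed
```

Anything after the seed timestamp is out of scope for the analyst, which is why the second check returns `False`.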

## 2. The Data Structures

The agent's input and output are structured as Pydantic models. This allows us to enforce a schema at the model level, which simplifies prompt engineering and evaluation.

**`CaseFile`** is what the agent receives. It contains only what a real analyst would be given at case open time: no ground truth, no answer.

```python
class CaseFile(BaseModel):
    case_id: str               # unique identifier
    seed_transaction_id: str   # the flagged transaction
    seed_timestamp: str        # end of the investigation window
    window_start: str          # start of the investigation window
    trigger_label: str         # why the case was opened (may be noisy)
```

**`GroundTruth`** records what actually happened. It is never shown to the agent. It is used only by the graders during evaluation.

```python
class GroundTruth(BaseModel):
    is_laundering: bool
    pattern_type: LaunderingPattern    # FAN-IN, FAN-OUT, CYCLE, ..., NONE
    pattern_description: str
    attempt_transaction_ids: str       # comma-separated laundering chain
```

**`AnalystOutput`** is what the agent must produce. Its schema is enforced at the model level via `output_schema`.

```python
class AnalystOutput(BaseModel):
    summary_narrative: str             # the agent's reasoning
    is_laundering: bool
    pattern_type: LaunderingPattern
    pattern_description: str
    flagged_transaction_ids: str       # the agent's identified laundering chain
```

A **`CaseRecord`** bundles `input: CaseFile` and `expected_output: GroundTruth` together. This is the unit that goes into the Langfuse dataset. The `input` field is sent to the agent; the `expected_output` field is passed to the graders.

In [None]:
# Demonstrate the structure manually
example_case = CaseRecord(
    input=CaseFile(
        case_id="demo-001",
        seed_transaction_id="txn-abc",
        seed_timestamp="2022-09-15T14:30:00",
        window_start="2022-09-01T00:00:00",
        trigger_label="QA_SAMPLE",  # low-signal: gives no hint about laundering
    ),
    expected_output=GroundTruth(
        is_laundering=True,
        pattern_type=LaunderingPattern.FAN_OUT,
        pattern_description="One source dispersing funds to many destinations.",
        attempt_transaction_ids="txn-abc,txn-def,txn-ghi",
    ),
)

print("--- Input (what the agent sees) ---")
print(example_case.input.model_dump_json(indent=2))

print("\n--- Expected Output (hidden from the agent; used for grading) ---")
print(example_case.expected_output.model_dump_json(indent=2))

## 3. The Four Case Types

A robust evaluation dataset needs to test more than just "can the agent find laundering?". We deliberately construct four case types, each probing a different failure mode.

| Case type | `is_laundering` (ground truth) | `trigger_label` | What it tests |
|---|---|---|---|
| **True Positive** | `True` | Pattern name (e.g. `FAN-OUT`) | Can the agent correctly identify and describe a real laundering pattern? |
| **True Negative** | `False` | Low-signal (`QA_SAMPLE`, `RANDOM_REVIEW`, ...) | Can the agent correctly clear a benign case without over-investigating? |
| **False Positive** | `False` | High-signal (`ANOMALOUS_BEHAVIOR_ALERT`, `LAW_ENFORCEMENT_REFERRAL`, ...) | Can the agent resist a misleading trigger and avoid a false alarm? |
| **False Negative** | `True` | Low-signal (`QA_SAMPLE`, `RANDOM_REVIEW`, ...) | Can the agent find laundering even when the trigger provides no hint? |

The false positive and false negative cases are the most diagnostic. They test whether the agent can reason independently rather than follow the trigger label.

### How each type is built

**True Positives** are parsed from the `Patterns.txt` file in the Kaggle dataset. This file records every known laundering attempt: the accounts involved, the exact transactions, and the pattern type. The `trigger_label` is set to the pattern name, simulating an alerting system that correctly identified the behaviour.

**True Negatives** sample random benign transactions from the dataset. The `trigger_label` is set to one of `QA_SAMPLE`, `RANDOM_REVIEW`, `RETROSPECTIVE_REVIEW`, or `MODEL_MONITORING_SAMPLE`. These are realistic labels for routine compliance reviews and carry no signal about laundering.

**False Positives** are built from benign accounts with an unusually high transaction volume on a single day. High volume is a common heuristic alert trigger, so these cases look suspicious at first glance. The trigger label is a high-signal label like `ANOMALOUS_BEHAVIOR_ALERT`, but the ground truth is `is_laundering=False`.

**False Negatives** are taken from additional laundering attempts beyond those used as True Positives. The key difference: the `trigger_label` is swapped to a low-signal review label, removing any hint. The agent must discover the laundering through its own investigation.
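
The label swap behind the False Negative cases can be sketched with plain dictionaries. This is a minimal illustration, not the logic inside `build_cases`: the function `to_false_negative` is hypothetical, and the dictionaries only mimic the `input` / `expected_output` layout of a `CaseRecord`.

```python
import random

# Low-signal labels named in this notebook.
LOW_SIGNAL_LABELS = ["QA_SAMPLE", "RANDOM_REVIEW", "RETROSPECTIVE_REVIEW", "MODEL_MONITORING_SAMPLE"]


def to_false_negative(case: dict, rng: random.Random) -> dict:
    """Return a copy of a laundering case whose trigger label carries no hint (hypothetical helper)."""
    if not case["expected_output"]["is_laundering"]:
        raise ValueError("False negatives must start from a real laundering attempt.")
    swapped_input = {**case["input"], "trigger_label": rng.choice(LOW_SIGNAL_LABELS)}
    # Ground truth is untouched: only the hint given to the agent changes.
    return {"input": swapped_input, "expected_output": dict(case["expected_output"])}


laundering_case = {
    "input": {"case_id": "demo-002", "trigger_label": "FAN-OUT"},
    "expected_output": {"is_laundering": True, "pattern_type": "FAN-OUT"},
}
false_negative = to_false_negative(laundering_case, random.Random(0))
print(false_negative["input"]["trigger_label"])
print(false_negative["expected_output"])
```

The original case is left unmodified, so the same laundering attempt could in principle appear once with an informative label and once without it.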

## 4. Generating Case Files

In [None]:
CASES_PATH = Path("implementations/aml_investigation/data/aml_cases.jsonl")

ILLICIT_RATIO = "HI"  # "HI" or "LI"
TRANSACTIONS_SIZE = "Small"  # "Small", "Medium", or "Large"

In [None]:
# Run this cell only if cases do not exist yet.
# It downloads the dataset from Kaggle and may take a minute if files aren't cached
# locally.

if not CASES_PATH.exists():
    print("Downloading dataset files...")
    path_to_transactions_csv = download_dataset_file(ILLICIT_RATIO, TRANSACTIONS_SIZE, "Trans.csv")
    path_to_patterns_txt = download_dataset_file(ILLICIT_RATIO, TRANSACTIONS_SIZE, "Patterns.txt")
    print("Download complete.")

    print("Normalizing transactions...")
    transactions_df = pd.read_csv(path_to_transactions_csv)
    transactions_df = normalize_transactions_data(transactions_df)
    print(f"Loaded {len(transactions_df):,} transactions.")

    print("Building cases...")
    cases = build_cases(
        path_to_patterns_txt,
        transactions_df,
        num_laundering_cases=5,
        num_normal_cases=5,
        num_false_negative_cases=3,
        num_false_positive_cases=3,
        lookback_days=10,  # how far back from the seed transaction the agent should investigate
    )
    print(f"Built {len(cases)} cases.")

    CASES_PATH.parent.mkdir(parents=True, exist_ok=True)
    with CASES_PATH.open("w", encoding="utf-8") as f:
        for record in cases:
            f.write(record.model_dump_json() + "\n")
    print(f"Wrote cases to {CASES_PATH}")
else:
    print(f"Cases already exist at {CASES_PATH}")

In [None]:
# Read with the same encoding that was used when the file was written.
raw_cases = [json.loads(line) for line in CASES_PATH.read_text(encoding="utf-8").splitlines() if line.strip()]
cases = [CaseRecord.model_validate(raw_case) for raw_case in raw_cases]

print(f"Total cases loaded: {len(cases)}")

In [None]:
# Summary table of all cases
summary = pd.DataFrame(
    [
        {
            "case_id": case.input.case_id[:12] + "...",
            "trigger_label": case.input.trigger_label,
            "is_laundering": case.expected_output.is_laundering,
            "pattern_type": case.expected_output.pattern_type.value,
            "window_days": (pd.Timestamp(case.input.seed_timestamp) - pd.Timestamp(case.input.window_start)).days,
        }
        for case in cases
    ]
)

print(summary)

In [None]:
# Classify each case into one of the four types
_LOW_SIGNAL = {"QA_SAMPLE", "RANDOM_REVIEW", "RETROSPECTIVE_REVIEW", "MODEL_MONITORING_SAMPLE"}
_HIGH_SIGNAL = {"ANOMALOUS_BEHAVIOR_ALERT", "LAW_ENFORCEMENT_REFERRAL", "EXTERNAL_TIP"}
_PATTERN_LABELS = {p.value for p in LaunderingPattern if p != LaunderingPattern.NONE}


def classify_case(case: CaseRecord) -> str:
    """Classify a case record."""
    label = case.input.trigger_label
    is_laundering = case.expected_output.is_laundering
    if label in _PATTERN_LABELS and is_laundering:
        return "True Positive"
    if label in _LOW_SIGNAL and not is_laundering:
        return "True Negative"
    if (label in _HIGH_SIGNAL or label in _PATTERN_LABELS) and not is_laundering:
        return "False Positive"
    if label in _LOW_SIGNAL and is_laundering:
        return "False Negative"
    return "Other"


summary["case_type"] = [classify_case(case) for case in cases]
print(summary["case_type"].value_counts().to_string())

In [None]:
# Print one representative example of each case type
for case_type in ["True Positive", "True Negative", "False Positive", "False Negative"]:
    idx = summary[summary["case_type"] == case_type].index
    if len(idx) == 0:
        print(f"[{case_type}] no examples in this dataset\n")
        continue
    case = cases[idx[0]]
    print(f"=== {case_type} ===")
    print(f"  trigger_label : {case.input.trigger_label}")
    print(f"  is_laundering : {case.expected_output.is_laundering}")
    print(f"  pattern_type  : {case.expected_output.pattern_type.value}")
    print(f"  window        : {case.input.window_start}  to  {case.input.seed_timestamp}")
    print()

## 5. The Agent

The agent is a Google ADK `LlmAgent` configured with three things:

- A detailed system prompt (`ANALYST_PROMPT`) describing the investigation workflow, a strategy for querying the database efficiently (start with aggregates, expand selectively), and the laundering typologies to look for.
- Two tools: `get_schema_info()` and `execute(query)` from `ReadOnlySqlDatabase`, the same ones explored in Notebook 1.
- A structured output schema that enforces `AnalystOutput`, so the final response is always a valid, parseable object.

`AmlInvestigationTask` is a thin wrapper around the agent that:
1. Serializes the `CaseFile` to JSON and sends it as the user message
2. Streams the agent's response via the ADK runner
3. Extracts the final response and parses it into an `AnalystOutput` object

It implements the `TaskFunction` protocol expected by the Langfuse experiment harness, so it can be passed directly to `run_experiment`. We will use it that way in Notebook 3.
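
The three steps performed by the wrapper can be sketched as follows. This is not the real `AmlInvestigationTask`: `stub_agent_events` is a hypothetical stand-in for the ADK runner, the event dictionaries are an invented shape, and the `"NONE"` pattern string is assumed. Only the serialize, stream, and parse flow is meant to carry over.

```python
import asyncio
import json
from typing import Any, AsyncIterator, Optional


async def stub_agent_events(user_message: str) -> AsyncIterator[dict]:
    """Hypothetical stand-in for the runner: yields an intermediate event, then a final one."""
    case = json.loads(user_message)
    yield {"is_final": False, "text": f"Querying transactions for {case['case_id']}..."}
    verdict = {
        "summary_narrative": "No suspicious structure found in the window.",
        "is_laundering": False,
        "pattern_type": "NONE",  # assumed string value
        "pattern_description": "",
        "flagged_transaction_ids": "",
    }
    yield {"is_final": True, "text": json.dumps(verdict)}


async def run_case(item: dict) -> Optional[dict]:
    """Serialize the case, consume the event stream, and parse the final response."""
    user_message = json.dumps(item["input"])  # step 1: case file as JSON
    final_text = None
    async for event in stub_agent_events(user_message):  # step 2: stream events
        if event["is_final"]:
            final_text = event["text"]
    if final_text is None:
        return None  # the agent never produced a final response
    return json.loads(final_text)  # step 3: parse the structured output


def demo() -> Optional[dict[str, Any]]:
    """Run the sketch from a plain script (inside a notebook, `await run_case(...)` instead)."""
    return asyncio.run(run_case({"input": {"case_id": "demo-001"}}))
```

Returning `None` when no final event arrives mirrors the `agent_output is None` check used when running a single case in this notebook.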

## 6. Running a Single Case

Let's run one case manually and watch the agent work.

> **Note:** This requires a `.env` file with valid `GOOGLE_API_KEY`, `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_HOST` values.

In [None]:
task = AmlInvestigationTask()
print(f"Agent : {task._agent.name}")
print(f"Model : {task._agent.model}")
print(f"Tools : {[tool.name for tool in task._agent.tools]}")

In [None]:
# Pick a case type to run.
# Try all four types to see how the agent behaves on each.
CASE_TYPE_TO_RUN = "True Positive"  # options: "True Positive", "True Negative", "False Positive", "False Negative"

idx = summary[summary["case_type"] == CASE_TYPE_TO_RUN].index
if len(idx) == 0:
    raise ValueError(f"No cases of type '{CASE_TYPE_TO_RUN}' found.")

selected_case = cases[idx[0]]
print(f"Running case : {selected_case.input.case_id}")
print(f"  Type         : {CASE_TYPE_TO_RUN}")
print(f"  Trigger      : {selected_case.input.trigger_label}")
print(f"  Window       : {selected_case.input.window_start} to {selected_case.input.seed_timestamp}")
print()
print("--- Input sent to the agent ---")
print(selected_case.input.model_dump_json(indent=2))

In [None]:
# Run the agent. This makes live LLM calls and may take 2-3 minutes.
agent_output = await task(item={"input": selected_case.input.model_dump()})

if agent_output is None:
    print("Agent returned no output. Check your credentials and that the database exists.")
else:
    print("\n--- Agent Output ---")
    print(agent_output)

    print("Agent finished.")

## 7. Comparing Agent Output to Ground Truth

Before introducing automated graders, let's compare the output by hand.
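
For the transaction IDs, a by-hand comparison amounts to a set overlap. The sketch below is one way to quantify it, assuming comma-separated ID strings as in `attempt_transaction_ids` and `flagged_transaction_ids`. The helper names are hypothetical, and the automated graders introduced later may score this differently.

```python
def parse_ids(raw: str) -> set[str]:
    """Split a comma-separated ID string into a set, ignoring blanks and whitespace."""
    return {part.strip() for part in raw.split(",") if part.strip()}


def id_overlap(ground_truth_ids: str, flagged_ids: str) -> dict[str, float]:
    """Precision and recall of the flagged IDs against the ground-truth chain."""
    truth, flagged = parse_ids(ground_truth_ids), parse_ids(flagged_ids)
    hits = truth & flagged
    # Convention chosen for this sketch: an empty denominator scores 0.0.
    precision = len(hits) / len(flagged) if flagged else 0.0
    recall = len(hits) / len(truth) if truth else 0.0
    return {"precision": precision, "recall": recall}


scores = id_overlap("txn-abc,txn-def,txn-ghi", "txn-abc, txn-def, txn-xyz")
print(scores)  # two of three flagged IDs are correct, and two of three true IDs were found
```

Precision penalises flagging unrelated transactions, while recall penalises missing parts of the laundering chain, so the two numbers can diverge on the same case.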

In [None]:
if agent_output is not None:
    ground_truth = selected_case.expected_output

    is_laundering_match = ground_truth.is_laundering == agent_output["is_laundering"]
    # The agent's pattern_type may arrive as an enum member or a plain string; normalise it to a string.
    agent_pattern = getattr(agent_output["pattern_type"], "value", agent_output["pattern_type"])
    pattern_match = ground_truth.pattern_type.value == agent_pattern

    ground_truth_transaction_ids = {i.strip() for i in ground_truth.attempt_transaction_ids.split(",") if i.strip()}
    agent_flagged_ids = {i.strip() for i in agent_output["flagged_transaction_ids"].split(",") if i.strip()}

    print(f"{'Field':<30} {'Ground Truth':<25} {'Agent Output':<25} {'Match?'}")
    print("-" * 90)
    print(
        f"{'is_laundering':<30} {str(ground_truth.is_laundering):<25} {str(agent_output['is_laundering']):<25} {'OK' if is_laundering_match else 'WRONG'}"
    )
    print(
        f"{'pattern_type':<30} {ground_truth.pattern_type.value:<25} {agent_pattern:<25} {'OK' if pattern_match else 'WRONG'}"
    )
    print()

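    print(f"Ground truth tx IDs  : {ground_truth.attempt_transaction_ids or '(none)'}")
    print(f"Agent flagged tx IDs : {agent_output['flagged_transaction_ids'] or '(none)'}")
    # Use the parsed sets to show which IDs the agent and the ground truth agree on.
    print(f"Overlapping tx IDs   : {', '.join(sorted(ground_truth_transaction_ids & agent_flagged_ids)) or '(none)'}")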

In [None]:
if agent_output is not None:
    print("=== Agent Summary Narrative ===")
    print(agent_output["summary_narrative"])

## 8. Try the Other Case Types

Go back to the cell in section 6 that sets `CASE_TYPE_TO_RUN` and change it to each of the other three types in turn. A few things to pay attention to:

- **False Positive**: The `trigger_label` suggests something suspicious. Does the agent correctly clear the case, or does it follow the trigger?
- **False Negative**: The `trigger_label` is noise. Does the agent still find the laundering pattern through its own investigation?
- **True Negative**: A completely benign case. Does the agent close it cleanly without over-reaching?

In the **next** notebook, we will introduce automated graders that quantify the agent's performance across these dimensions at scale.

In [None]:
await task.close()
print("Task closed.")