# Example (CLI): Extract indicators from a PDF to JSON

This notebook mirrors `examples/example_cli_extract_pdf_json.py`.
It calls the CLI programmatically, so the output matches what the `vsme-extract` command produces (same schema and options).

## Prerequisites

- Install dependencies: `pip install .`
- Provide `SCW_API_KEY` (via an environment variable or a `.env` file at the repo root).
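
Before running the extractor, it can help to confirm that the key is visible to the Python process. The following is a minimal sketch, assuming only that the key is read from the environment variable `SCW_API_KEY`; the helper name `missing_env_vars` is hypothetical and not part of `vsme_extractor`.

```python
from __future__ import annotations

import os
from typing import Iterable, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the names that are unset or blank in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]


# Hypothetical pre-flight check: report the problem instead of failing later.
missing = missing_env_vars(["SCW_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("SCW_API_KEY is set.")
```

If the key lives in a `.env` file, run this check only after the `load_dotenv` call in the setup cell, since the file is not read into the environment before that point.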

In [None]:
from __future__ import annotations

import os
import sys
from pathlib import Path

from dotenv import find_dotenv, load_dotenv


In [None]:
# Ensure we import the local repo package (useful when you also have an installed version).
_CWD = Path.cwd().resolve()
_REPO_ROOT = _CWD if (_CWD / "vsme_extractor").exists() else _CWD.parent
if str(_REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(_REPO_ROOT))

# Load .env if present (do NOT override explicit environment variables).
dotenv_path = find_dotenv(usecwd=True)
load_dotenv(dotenv_path, override=False)
print("Repo root:", _REPO_ROOT)
print(".env found:", bool(dotenv_path), dotenv_path)


## 1) Select a PDF

The repo may ship example PDFs under `./data/test/`. If you do not have them locally, edit the path below to point to an existing PDF.
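
To see which PDFs are available locally, a short listing helper can be used. This is a sketch that assumes the notebook is run from the repo root, so that `data/test` resolves correctly; `list_pdfs` is a hypothetical helper, not something provided by `vsme_extractor`.

```python
from __future__ import annotations

from pathlib import Path


def list_pdfs(directory: Path) -> list[Path]:
    """Return PDF files directly under `directory`, sorted by path.

    Returns an empty list when the directory does not exist.
    """
    if not directory.is_dir():
        return []
    return sorted(
        p for p in directory.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Print candidates; pick one and assign it to `pdf_path` in the next cell.
for candidate in list_pdfs(Path("data/test")):
    print(candidate)
```

The suffix comparison is case-insensitive, so files named `*.PDF` are listed as well. Subdirectories are not searched.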

In [None]:
pdf_path = Path("/your_path/your_file.pdf")  # <-- edit this
if not pdf_path.exists():
    raise FileNotFoundError(f"PDF file does not exist: {pdf_path}")
pdf_path.exists(), pdf_path


## 2) Run the extractor (CLI) and export JSON

Notes:
- `--codes` restricts extraction to a small subset of indicators, which is faster for testing.
- You can include retrieval details by setting the environment variable `VSME_OUTPUT_JSON_INCLUDE_RETRIEVAL_DETAILS=1` or by toggling the boolean below.
- Output is written next to the PDF as `*.vsme.json`.
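
The notes above can be sketched as a small argument builder. It uses only the flags that appear in this notebook; `build_cli_args` and the file name `report.pdf` are hypothetical, and treating `--codes` as optional is an assumption based on the note that it restricts the run to a subset. The printed shell line assumes the `vsme-extract` command accepts the same arguments as the programmatic call.

```python
from __future__ import annotations

import shlex
from pathlib import Path


def build_cli_args(
    pdf: Path, codes: str = "", include_retrieval_details: bool = False
) -> list[str]:
    """Assemble the argument list used for the CLI call in this notebook."""
    args = [
        str(pdf),
        "--no-log-stdout",
        "--output-format",
        "json",
        "--json-include-status",
    ]
    if codes:  # assumption: omitting --codes means no restriction
        args += ["--codes", codes]
    if include_retrieval_details:
        args.append("--json-include-retrieval-details")
    return args


example_pdf = Path("report.pdf")  # hypothetical file name
example_args = build_cli_args(example_pdf, codes="B3_1,B3_2")

# Equivalent shell command, quoted safely for copy-paste.
print(shlex.join(["vsme-extract", *example_args]))

# Expected output location: same directory, `.pdf` replaced by `.vsme.json`.
print(example_pdf.with_suffix(".vsme.json"))
```

Keeping the argument list in one function makes it easy to compare what the notebook runs with what you would type in a terminal.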

In [None]:
# Optional: restrict to a small set of indicators for faster testing (edit as needed)
codes = "B3_1,B3_2,B7_1,B1_1"

# Optional: keep retrieval debug fields in each indicator row
include_retrieval_details = (
    os.getenv("VSME_OUTPUT_JSON_INCLUDE_RETRIEVAL_DETAILS") or "0"
).strip().lower() in {"1", "true", "yes", "y", "on"}

from vsme_extractor.cli import main as cli_main  # noqa: E402

args = [
    str(pdf_path),
    "--no-log-stdout",
    "--output-format",
    "json",
    "--json-include-status",
    "--codes",
    codes,
]
if include_retrieval_details:
    args.append("--json-include-retrieval-details")

cli_main(args)
out_path = pdf_path.with_suffix(".vsme.json")
print("Export:", out_path)
out_path.exists()


## 3) Inspect the generated JSON (optional)

The JSON schema matches the CLI output: top-level keys typically include `pdf`, `results`, `stats` (and optionally `status`).
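
A defensive summary of the payload might look like the sketch below. It relies only on the top-level key names listed above; the value shapes (for example, `results` being a list) are assumptions, and both `summarize_payload` and the sample dictionary are made up for illustration.

```python
from __future__ import annotations

from typing import Any


def summarize_payload(data: dict[str, Any]) -> dict[str, Any]:
    """Summarize a payload without assuming that optional keys are present."""
    results = data.get("results") or []
    return {
        "pdf": data.get("pdf"),
        "n_results": len(results),
        "has_stats": "stats" in data,
        "has_status": "status" in data,
    }


# Made-up payload for illustration only; real output comes from the CLI.
sample = {"pdf": "report.pdf", "results": [{}, {}], "stats": {}}
print(summarize_payload(sample))
```

Using `.get` with fallbacks keeps the summary from raising `KeyError` when an optional key such as `status` is absent.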

In [None]:
import json

data = json.loads(out_path.read_text(encoding="utf-8"))
data.keys()


In [None]:
# Preview a small part of the payload
{
    "pdf": data.get("pdf"),
    "n_results": len(data.get("results", [])),
    "stats": data.get("stats"),
}
