# Atlas DataFlow — Notebook Orchestrator (Happy Path) v1

This notebook demonstrates the **happy path** of an Atlas DataFlow run:

- Explicit **configuration**
- Explicit **contract**
- Explicit **DAG** (Steps + order)
- **Execution** through the core's public APIs
- **Manifest v1** materialized at the end

> The notebook does **not** implement business logic.  
> It only **orchestrates**.


In [1]:
# Environment requirements (recommended)
# - From the repository root:  pip install -e .
#
# This notebook assumes the 'atlas_dataflow' package is importable in the kernel.

from __future__ import annotations

import os
import json
import uuid
from pathlib import Path
from datetime import datetime, timezone

from atlas_dataflow.core.pipeline.context import RunContext
from atlas_dataflow.core.pipeline.registry import StepRegistry
from atlas_dataflow.core.engine.engine import Engine

from atlas_dataflow.core.config.hashing import compute_config_hash
from atlas_dataflow.core.contract.hashing import compute_contract_hash
from atlas_dataflow.core.traceability.manifest import (
    create_manifest,
    add_event,
    step_started,
    step_finished,
    step_failed,
    save_manifest,
)

# Canonical Steps
from atlas_dataflow.steps.ingest.load import IngestLoadStep
from atlas_dataflow.steps.contract.load import ContractLoadStep
from atlas_dataflow.steps.contract.conformity_report import ContractConformityReportStep
from atlas_dataflow.steps.transform.cast_types_safe import CastTypesSafeStep
from atlas_dataflow.steps.audit.profile_baseline import AuditProfileBaselineStep
from atlas_dataflow.steps.audit.schema_types import AuditSchemaTypesStep
from atlas_dataflow.steps.audit.duplicates import AuditDuplicatesStep


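## 1) Run Directory (explicit state)

- `ATLAS_RUN_DIR` (optional) sets the base directory where runs are materialized; it defaults to `./runs/notebook_happy_path_v1`.
- `ATLAS_RUN_ID` (optional) sets the run id; if unset, a random 12-character hex id is generated.
- Each run is written to `<base>/<run_id>`.

Nothing is written outside `run_dir`.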


In [2]:
run_id = os.environ.get("ATLAS_RUN_ID", uuid.uuid4().hex[:12])

base_run_dir = Path(os.environ.get("ATLAS_RUN_DIR", "./runs/notebook_happy_path_v1")).expanduser()
run_dir = base_run_dir / run_id
run_dir.mkdir(parents=True, exist_ok=True)

run_dir


WindowsPath('runs/notebook_happy_path_v1/58d10887f8ac')
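
The "nothing is written outside `run_dir`" rule can also be checked mechanically. The sketch below is only an illustration: `resolve_inside` is a hypothetical helper, not part of Atlas DataFlow, and it assumes Python 3.9+ for `Path.is_relative_to`. It resolves a candidate path and refuses anything that escapes the run directory.

```python
import tempfile
from pathlib import Path


def resolve_inside(run_dir: Path, relative_name: str) -> Path:
    """Return run_dir / relative_name, refusing paths that escape run_dir.

    Hypothetical helper for illustration; not an Atlas DataFlow API.
    """
    base = run_dir.resolve()
    candidate = (base / relative_name).resolve()
    # Path.is_relative_to is available from Python 3.9 onwards.
    if not candidate.is_relative_to(base):
        raise ValueError(f"{candidate} is outside the run directory {base}")
    return candidate


# Demo in a throwaway directory so the sketch has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    demo_run_dir = Path(tmp) / "demo_run"
    demo_run_dir.mkdir()

    inside = resolve_inside(demo_run_dir, "manifest.json")
    inside_ok = inside.parent == demo_run_dir.resolve()

    try:
        resolve_inside(demo_run_dir, "../escape.txt")
        escaped = True
    except ValueError:
        escaped = False
```

A guard like this would sit in front of every write the notebook performs (`input.csv`, `contract.json`, `manifest.json`), keeping the rule explicit rather than implied.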

## 2) Template inputs (reproducible fixture)

So that this notebook runs end to end, we create a minimal CSV and a minimal contract **inside run_dir**.
In real use, replace only the paths (without changing the notebook's role).


In [3]:
input_csv = run_dir / "input.csv"
contract_path = run_dir / "contract.json"

csv_text = """age,plan,active,target
34,basic,true,0
52,premium,false,1
41,basic,true,0
"""
input_csv.write_text(csv_text, encoding="utf-8")

contract = {
    "contract_version": "1.0",
    "problem": {"name": "template_demo", "type": "classification"},
    "target": {"name": "target", "dtype": "int", "allowed_null": False},
    "features": [
        {"name": "age", "role": "numerical", "dtype": "int", "required": True, "allowed_null": False},
        {"name": "plan", "role": "categorical", "dtype": "category", "required": True, "allowed_null": False},
        {"name": "active", "role": "boolean", "dtype": "bool", "required": True, "allowed_null": False},
    ],
    "defaults": {},
    "categories": {
        "plan": {"allowed": ["basic", "premium"], "normalization": {"type": "lower"}}
    },
    "imputation": {
        "age": {"strategy": "median", "mandatory": False},
        "plan": {"strategy": "most_frequent", "mandatory": False},
        "active": {"strategy": "most_frequent", "mandatory": False},
    },
}

contract_path.write_text(json.dumps(contract, ensure_ascii=False, indent=2), encoding="utf-8")

config = {
    "engine": {"fail_fast": True},
    "contract": {"path": str(contract_path)},
    "steps": {
        "ingest.load": {"enabled": True, "path": str(input_csv)},
        "contract.load": {"enabled": True},
        "contract.conformity_report": {"enabled": True},
        "transform.cast_types_safe": {"enabled": True},
        "audit.profile_baseline": {"enabled": True},
        "audit.schema_types": {"enabled": True},
        "audit.duplicates": {"enabled": True},
    },
}

config


{'engine': {'fail_fast': True},
 'contract': {'path': 'runs\\notebook_happy_path_v1\\58d10887f8ac\\contract.json'},
 'steps': {'ingest.load': {'enabled': True,
   'path': 'runs\\notebook_happy_path_v1\\58d10887f8ac\\input.csv'},
  'contract.load': {'enabled': True},
  'contract.conformity_report': {'enabled': True},
  'transform.cast_types_safe': {'enabled': True},
  'audit.profile_baseline': {'enabled': True},
  'audit.schema_types': {'enabled': True},
  'audit.duplicates': {'enabled': True}}}

## 3) Explicit RunContext

The notebook creates the context and hands it to the Engine.
The context starts with an empty contract (`contract={}`); the in-memory contract is injected by `contract.load`.


In [4]:
ctx = RunContext(
    run_id=run_id,
    created_at=datetime.now(timezone.utc),
    config=config,
    contract={},
    meta={"run_dir": str(run_dir)},
)

ctx




## 4) Explicit DAG: *Happy Path*

Declared order (happy path):

1. `ingest.load` (the dataset exists)
2. `contract.load` (the contract exists in the ctx)
3. `contract.conformity_report` (dataset vs. contract)
4. `transform.cast_types_safe` (safe, contract-driven type coercion)
5. `audit.*` (diagnostics)


In [5]:
steps = [
    IngestLoadStep(),
    ContractLoadStep(),
    ContractConformityReportStep(),
    CastTypesSafeStep(),
    AuditProfileBaselineStep(),
    AuditSchemaTypesStep(),
    AuditDuplicatesStep(),
]

registry = StepRegistry()
for s in steps:
    registry.add(s)

[s.id for s in registry.list()]


['ingest.load',
 'contract.load',
 'contract.conformity_report',
 'transform.cast_types_safe',
 'audit.profile_baseline',
 'audit.schema_types',
 'audit.duplicates']
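
How an execution order can be derived from step dependencies is sketched below with the standard library's `graphlib`. The dependency edges are assumptions made for this illustration: the notebook only declares a list order, and how the Engine actually resolves dependencies is not shown here.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency map (step id -> step ids it depends on).
# These edges are assumptions for illustration, not read from Atlas DataFlow.
hypothetical_deps = {
    "ingest.load": set(),
    "contract.load": set(),
    "contract.conformity_report": {"ingest.load", "contract.load"},
    "transform.cast_types_safe": {"contract.conformity_report"},
    "audit.profile_baseline": {"ingest.load"},
    "audit.schema_types": {"ingest.load"},
    "audit.duplicates": {"ingest.load"},
}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(hypothetical_deps).static_order())
position = {step_id: index for index, step_id in enumerate(order)}


def respects_dependencies(deps: dict, pos: dict) -> bool:
    """True when every step appears after all of its dependencies."""
    return all(pos[dep] < pos[step] for step, dep_set in deps.items() for dep in dep_set)


print(order)
print(respects_dependencies(hypothetical_deps, position))
```

Several orders can satisfy the same dependency map, so the sketch checks the constraint instead of comparing against one fixed sequence.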

## 5) Explicit execution (Engine) + Manifest v1

- runs the pipeline through the Engine
- materializes `manifest.json` at the end (with config/contract hashes)

Note: the Manifest is derived from events/results; the notebook does not "interpret data".


In [6]:
# Semantic hashes
config_hash = compute_config_hash(config)
contract_hash = compute_contract_hash(contract)

try:
    from importlib.metadata import version as _pkg_version
    atlas_version = _pkg_version("atlas-dataflow")
except Exception:
    atlas_version = "dev"

manifest = create_manifest(
    run_id=run_id,
    started_at=ctx.created_at,
    atlas_version=atlas_version,
    config_hash=config_hash,
    contract_hash=contract_hash,
)

add_event(manifest, event_type="run_started", ts=ctx.created_at, payload={"run_dir": str(run_dir)})

engine = Engine(steps=registry.list(), ctx=ctx)

t_run_start = datetime.now(timezone.utc)
result = engine.run()
t_run_end = datetime.now(timezone.utc)

# Use the result's own timestamps when available; fall back to the run's start/end times otherwise.
for step_id, step_result in result.steps.items():
    kind = getattr(step_result, "kind", None)
    kind_value = kind.value if hasattr(kind, "value") else str(kind)

    started_at = getattr(step_result, "started_at", None) or t_run_start
    finished_at = getattr(step_result, "finished_at", None) or t_run_end

    step_started(manifest, step_id=step_id, kind=kind_value, ts=started_at)

    status = getattr(step_result, "status", None)
    status_value = status.value if hasattr(status, "value") else str(status)

    if status_value == "failed":
        step_failed(manifest, step_id=step_id, ts=finished_at, error=getattr(step_result, "summary", "failed"))
    else:
        step_finished(manifest, step_id=step_id, ts=finished_at, result=step_result)

duration_ms = int((t_run_end - t_run_start).total_seconds() * 1000)
add_event(manifest, event_type="run_finished", ts=t_run_end, payload={"duration_ms": duration_ms})

manifest_path = run_dir / "manifest.json"
save_manifest(manifest, manifest_path)

manifest_path


WindowsPath('runs/notebook_happy_path_v1/58d10887f8ac/manifest.json')
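
One common way to obtain a "semantic" hash of a config or contract is to hash a canonical serialization, so that key order and whitespace do not affect the digest. The sketch below shows that idea only: `semantic_hash` is a hypothetical function, and `compute_config_hash` / `compute_contract_hash` may canonicalize differently.

```python
import hashlib
import json


def semantic_hash(obj) -> str:
    """SHA-256 over a canonical JSON form (sorted keys, compact separators).

    Hypothetical illustration; not the Atlas DataFlow implementation.
    """
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, different key order: the digest should match.
config_a = {"engine": {"fail_fast": True}, "steps": {"ingest.load": {"enabled": True}}}
config_b = {"steps": {"ingest.load": {"enabled": True}}, "engine": {"fail_fast": True}}
# One value changed: the digest should differ.
config_c = {"engine": {"fail_fast": False}, "steps": {"ingest.load": {"enabled": True}}}

same = semantic_hash(config_a) == semantic_hash(config_b)
differs = semantic_hash(config_a) != semantic_hash(config_c)
print(same, differs)
```

Hashing the parsed structure rather than the file bytes means a contract that is only re-indented or re-ordered still produces the same hash.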

## 6) Status (read-only)

Shows the final status of each Step.  
(The notebook does not access or transform the dataset.)


In [7]:
[(sid, sr.status.value if hasattr(sr.status, "value") else str(sr.status), sr.summary) for sid, sr in result.steps.items()]


[('contract.load', 'success', 'contract loaded and validated'),
 ('contract.conformity_report',
  'failed',
  'No tabular dataset found in RunContext artifacts (expected data.raw_rows or data.transformed_rows)'),
 ('ingest.load', 'success', 'dataset loaded'),
 ('audit.profile_baseline', 'success', 'baseline profile computed'),
 ('audit.duplicates', 'success', 'duplicates audit computed'),
 ('audit.schema_types', 'success', 'schema types audit computed'),
 ('transform.cast_types_safe', 'skipped', 'skipped due to failed dependency')]
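
The tuples above can be tallied without touching the dataset. The sketch below copies the recorded output verbatim into a list and counts statuses; the names `step_statuses`, `counts` and `not_successful` are introduced here for illustration only.

```python
from collections import Counter

# (step_id, status, summary) tuples copied from the recorded cell output.
step_statuses = [
    ("contract.load", "success", "contract loaded and validated"),
    (
        "contract.conformity_report",
        "failed",
        "No tabular dataset found in RunContext artifacts "
        "(expected data.raw_rows or data.transformed_rows)",
    ),
    ("ingest.load", "success", "dataset loaded"),
    ("audit.profile_baseline", "success", "baseline profile computed"),
    ("audit.duplicates", "success", "duplicates audit computed"),
    ("audit.schema_types", "success", "schema types audit computed"),
    ("transform.cast_types_safe", "skipped", "skipped due to failed dependency"),
]

# Count how many steps ended in each status.
counts = Counter(status for _, status, _ in step_statuses)

# List every step that did not end in "success", preserving the recorded order.
not_successful = [(step_id, status) for step_id, status, _ in step_statuses if status != "success"]

print(dict(counts))
print(not_successful)
```

In this recorded run the tally is five `success`, one `failed` (`contract.conformity_report`) and one `skipped` (`transform.cast_types_safe`), so a read-only check like this makes a non-happy-path run visible at a glance.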