# Consist Quickstart

This notebook is a minimal first pass through Consist. You'll create a tiny CSV, run a cleaning step, and see how cache hits and misses behave.

## Installation

If Consist is not installed yet, install the package in your environment:

```bash
pip install consist
```

We'll keep everything in the current working directory so you can see the artifacts on disk.

In [40]:
from consist import Tracker
from pathlib import Path
import pandas as pd

# Start from an empty provenance database so the first run below is a cache miss
db_path = Path("./provenance.duckdb")
if db_path.exists():
    db_path.unlink()

tracker = Tracker(run_dir="./runs", db_path="./provenance.duckdb")

## Create a small input file

Consist treats file inputs as artifacts. When you pass a file path in `inputs={...}`, it hashes the file contents for caching and (optionally) loads the data into your function.
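To make "hashes the file contents" concrete, here is a minimal standard-library sketch of content hashing. It is not Consist's implementation: the choice of SHA-256, the chunk size, and the helper name `file_digest` are all assumptions made for illustration. The point is only that the digest depends on the bytes in the file, not on its name.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in chunks.

    Hypothetical helper for illustration; Consist's real scheme may differ.
    """
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.csv"
    b = Path(tmp) / "b.csv"
    a.write_text("value\n0.1\n0.6\n")
    b.write_text("value\n0.1\n0.6\n")  # same bytes, different file name
    same_content = file_digest(a) == file_digest(b)

    b.write_text("value\n0.1\n0.7\n")  # one value edited
    changed_content = file_digest(a) != file_digest(b)

print(same_content, changed_content)
```

Under this scheme, renaming a file would leave its digest unchanged, while editing a single value would produce a new one.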

In [41]:
raw_path = Path("raw.csv")
df = pd.DataFrame({"value": [0.1, 0.6, 0.9, 0.2]})
df.to_csv(raw_path, index=False)
raw_path

PosixPath('raw.csv')

## Define a step

Our step accepts a loaded DataFrame and writes a cleaned parquet file under the run directory.
Returning a path is enough for Consist to log the output as an artifact.

In [42]:
def clean_data(raw_df, threshold=0.5, out_name="cleaned.parquet"):
    cleaned = raw_df[raw_df["value"] > threshold]
    out_path = tracker.run_dir / out_name
    cleaned.to_parquet(out_path)
    return out_path
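As a quick check on what the filter keeps, here is a dependency-free sketch of the same strict comparison applied to the four values in `raw.csv`. The helper `keep_above` is hypothetical and stands in for the pandas boolean mask; it returns the original row position alongside each surviving value.

```python
values = [0.1, 0.6, 0.9, 0.2]  # same numbers written to raw.csv


def keep_above(rows, threshold):
    """Keep (position, value) pairs whose value is strictly above threshold."""
    return [(i, v) for i, v in enumerate(rows) if v > threshold]


kept_05 = keep_above(values, 0.5)
kept_08 = keep_above(values, 0.8)
print(kept_05)  # [(1, 0.6), (2, 0.9)]
print(kept_08)  # [(2, 0.9)]
```

Because the comparison is `>`, a row whose value equals the threshold exactly would be dropped.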

## Run it (cache miss)

The first run executes the function and records provenance for inputs, config, and outputs.

In [43]:
result1 = tracker.run(
    fn=clean_data,
    inputs={"raw_df": raw_path},
    config={"threshold": 0.5},
    fn_args={"out_name": "cleaned_0_5.parquet"},
    outputs=["cleaned"],
)
print(f"Cache hit: {result1.cache_hit}")

Cache hit: False


In [44]:
cleaned_artifact = result1.outputs["cleaned"]
print(f"Artifact: {cleaned_artifact.id}")
print(f"Path: {cleaned_artifact.path.name}")

Artifact: 7eb75755-d8b5-43f3-82da-7cef6834f092
Path: cleaned_0_5.parquet


## Run it again (cache hit)

The same inputs and config produce the same signature, so Consist skips execution and returns the cached output. This call omits `fn_args`, which does not affect caching, so it still hits the cache.
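One way to picture a signature is as a hash over everything that should influence the result. The sketch below is an assumption-heavy illustration, not Consist's algorithm: the function `toy_signature`, the use of canonical JSON, and the decision to include the function name are all hypothetical choices.

```python
import hashlib
import json


def toy_signature(fn_name: str, input_digests: dict, config: dict) -> str:
    """Hash the function name, input digests, and config into one key.

    Hypothetical scheme for illustration only.
    """
    payload = json.dumps(
        {"fn": fn_name, "inputs": input_digests, "config": config},
        sort_keys=True,  # key order must not change the signature
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Stand-in digest for the input file's contents
digests = {"raw_df": hashlib.sha256(b"value\n0.1\n0.6\n0.9\n0.2\n").hexdigest()}

first = toy_signature("clean_data", digests, {"threshold": 0.5})
second = toy_signature("clean_data", digests, {"threshold": 0.5})
print(first == second)
```

Identical inputs and config give identical keys, which is what lets a cache look up an earlier result instead of recomputing it.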


In [45]:
result2 = tracker.run(
    fn=clean_data,
    inputs={"raw_df": raw_path},
    config={"threshold": 0.5},
    outputs=["cleaned"],
)
print(f"Cache hit: {result2.cache_hit}")

Cache hit: True


## Change config (cache miss)

Changing `config` changes the signature, so Consist re-executes the function and logs a new run.
We pass a different output name via `fn_args` to avoid overwriting the earlier file; `fn_args` does not affect caching.


In [46]:
result3 = tracker.run(
    fn=clean_data,
    inputs={"raw_df": raw_path},
    config={"threshold": 0.8},
    fn_args={"out_name": "cleaned_0_8.parquet"},
    outputs=["cleaned"],
)
print(f"Cache hit: {result3.cache_hit}")

Cache hit: False


## Inspect recorded runs

You can query recent runs by model name (here `clean_data`, matching the step function's name) and inspect their metadata, including cache-hit flags.
To filter runs by parameter values, see [Concepts: config vs facet](../docs/concepts.md#the-config-vs-facet-distinction).


In [47]:
runs = tracker.find_runs(model="clean_data", limit=10)
[(run.id, run.status, run.meta.get("cache_hit", False)) for run in runs]

[('clean_data_e5a7b637', 'completed', False),
 ('clean_data_a031db5e', 'completed', True),
 ('clean_data_4e21ed62', 'completed', False)]
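Once run metadata is pulled out into plain tuples like these, ordinary Python is enough to summarize it. The sketch below works on values copied from the output above rather than on live Consist objects, so it makes no assumptions about the run API beyond what the cell already showed.

```python
from collections import Counter

# Tuples copied from the output above: (run id, status, cache_hit)
records = [
    ("clean_data_e5a7b637", "completed", False),
    ("clean_data_a031db5e", "completed", True),
    ("clean_data_4e21ed62", "completed", False),
]

# Count how many runs executed versus how many were served from cache
tally = Counter("hit" if cache_hit else "executed" for _, _, cache_hit in records)
hit_ids = [run_id for run_id, _, cache_hit in records if cache_hit]

print(dict(tally))
print(hit_ids)
```

Two of the three runs executed the function and one was served from cache, matching the three calls made earlier.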

## See the outputs on disk

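The step wrote its parquet files under the run directory (`./runs`), and Consist logged each one as an artifact. Listing the files makes the abstraction concrete: three runs produced only two files, because the cache hit reused the first output.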


In [48]:
print("Output files:")
for p in sorted(Path("runs").glob("*.parquet")):
    print(f"  {p.name}")

print("\nCleaned data (threshold=0.5):")
pd.read_parquet("runs/cleaned_0_5.parquet")

Output files:
  cleaned_0_5.parquet
  cleaned_0_8.parquet

Cleaned data (threshold=0.5):


   value
1    0.6
2    0.9
