## Consist Quickstart

This notebook demonstrates Consist's core feature: intelligent caching based on code, configuration, and input data.

By the end, you'll understand:
- How Consist tracks provenance (what code/config/inputs produced each result)
- How to skip redundant computation with caching
- How to query and inspect your runs

## Installation

If Consist is not installed yet, install the package in your environment:

```bash
pip install consist
```

We'll keep everything in the current working directory so you can see the artifacts on disk.

In [10]:
from pathlib import Path

import pandas as pd
import consist
from consist import Tracker, use_tracker

# Clean slate for reproducible notebook output
db_path = Path("./provenance.duckdb")
if db_path.exists():
    db_path.unlink()

# Initialize Tracker:
# - run_dir: directory where outputs are stored
# - db_path: DuckDB file that stores run metadata and provenance
tracker = Tracker(run_dir="./runs", db_path=str(db_path))


## Create a small input file

We create a tiny CSV file to use as input. When you pass a file path in `inputs={...}`, Consist computes a hash of the file's contents. This hash becomes part of the cache signature: if the same input file is used again with the same code and config, Consist recognizes it and returns the cached result.
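To make the idea of a content hash concrete, here is a minimal sketch using only the standard library. It is an illustration, not Consist's implementation: the helper name `file_sha256`, the choice of SHA-256, and the chunk size are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file's bytes in chunks (hypothetical helper, not a Consist API)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.csv"
    b = Path(tmp) / "b.csv"
    c = Path(tmp) / "c.csv"
    a.write_text("value\n0.1\n0.6\n")
    b.write_text("value\n0.1\n0.6\n")  # same bytes, different file name
    c.write_text("value\n0.1\n0.7\n")  # one value changed

    same_content_same_hash = file_sha256(a) == file_sha256(b)
    changed_content_new_hash = file_sha256(a) != file_sha256(c)

print(same_content_same_hash, changed_content_new_hash)
```

The point of hashing contents rather than names or timestamps is that renaming or re-saving an unchanged file does not alter the hash, while editing a single value does.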

In [11]:
raw_path = Path("raw.csv")
df = pd.DataFrame({"value": [0.1, 0.6, 0.9, 0.2]})
df.to_csv(raw_path, index=False)
raw_path

PosixPath('raw.csv')

## Define a step

Define a function that accepts loaded data and returns a DataFrame.
With `outputs=[...]`, Consist logs the return value as an artifact automatically.

For file-writing tools, you can use the injected run context (`_consist_ctx`) and
log outputs manually; see the Usage Guide for that pattern.


In [12]:
def clean_data(raw_df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    cleaned = raw_df[raw_df["value"] > threshold].copy()
    return cleaned


## Run it (cache miss)

Call `consist.run()` inside `use_tracker(...)` to execute the function and record provenance.

The key parameters:
- `fn`: Your function to execute
- `inputs`: Files to hash (auto-loaded into function args)
- `config`: Parameters to hash (become part of the cache signature)
- `outputs`: Names for your function's return values

On this first run, Consist executes the function and records a new run.


In [13]:
with use_tracker(tracker):
    result1 = consist.run(
        fn=clean_data,
        inputs={"raw_df": raw_path},
        config={"threshold": 0.5},
        outputs=["cleaned"],
    )
print(f"Cache hit: {result1.cache_hit}")


Cache hit: False


In [14]:
cleaned_artifact = result1.outputs["cleaned"]
cleaned_df = consist.load_df(cleaned_artifact)
print(f"Artifact: {cleaned_artifact.id}")
print(f"Path: {cleaned_artifact.path}")
cleaned_df.head()


   value
0    0.6
1    0.9


## Run it again (cache hit)

Now run with identical inputs and config.

Consist computes a fingerprint (signature) from:
1. Your function's code
2. The `config` dict
3. The hashes of `inputs` files

Since we're using the same code, config, and input file, the signature matches the previous run. Instead of re-executing the function, Consist returns the cached result.
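The three ingredients listed above can be combined into a toy signature-keyed cache. This is a rough sketch under stated assumptions, not how Consist works internally: `signature`, `cached_run`, `keep_above`, and the in-memory `_cache` dict are hypothetical names, and hashing the function's bytecode is a crude stand-in for "your function's code" (it ignores constants and anything the function calls).

```python
import hashlib
import json
from typing import Any, Callable

_cache: dict[str, Any] = {}  # signature -> stored result (toy in-memory store)


def signature(fn: Callable, config: dict, input_hashes: dict[str, str]) -> str:
    """Combine code, config, and input hashes into one digest (illustrative only)."""
    payload = {
        "code": fn.__code__.co_code.hex(),  # assumption: bytecode stands in for code
        "config": config,
        "inputs": input_hashes,
    }
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_run(fn: Callable, config: dict, input_hashes: dict[str, str], **kwargs):
    """Return (result, cache_hit); call fn only when the signature is new."""
    key = signature(fn, config, input_hashes)
    if key in _cache:
        return _cache[key], True
    result = fn(**kwargs, **config)
    _cache[key] = result
    return result, False


def keep_above(values, threshold):
    return [v for v in values if v > threshold]


values = [0.1, 0.6, 0.9, 0.2]
hashes = {"values": hashlib.sha256(repr(values).encode("utf-8")).hexdigest()}

first, hit1 = cached_run(keep_above, {"threshold": 0.5}, hashes, values=values)
second, hit2 = cached_run(keep_above, {"threshold": 0.5}, hashes, values=values)
print(first, hit1, hit2)
```

The second call finds the same key in the store and skips the function body, which mirrors the miss-then-hit pattern in this notebook.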

In [15]:
with use_tracker(tracker):
    result2 = consist.run(
        fn=clean_data,
        inputs={"raw_df": raw_path},
        config={"threshold": 0.5},
        outputs=["cleaned"],
    )
print(f"Cache hit: {result2.cache_hit}")


Cache hit: True


## Change config (cache miss)

Now change `config` to use a different threshold (0.8 instead of 0.5).

This changes the signature, so Consist must re-execute the function and record a new run.
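As a rough illustration of why a changed value produces a different signature, the sketch below hashes a config dict through canonical JSON. The helper `config_digest` and the extra `mode` key are hypothetical, and Consist's actual config hashing may differ (for example in how it treats key order or floating-point values).

```python
import hashlib
import json


def config_digest(config: dict) -> str:
    """Digest of a config dict via canonical JSON (hypothetical helper)."""
    # sort_keys gives a stable serialization regardless of insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


old = config_digest({"threshold": 0.5})
new = config_digest({"threshold": 0.8})

# Same keys and values, different insertion order.
ordered_a = config_digest({"threshold": 0.5, "mode": "strict"})
ordered_b = config_digest({"mode": "strict", "threshold": 0.5})

value_change_detected = old != new
key_order_ignored = ordered_a == ordered_b
print(value_change_detected, key_order_ignored)
```

In this sketch a change to any value yields a new digest, while reordering keys does not, so only meaningful config changes would force a re-run.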


In [16]:
with use_tracker(tracker):
    result3 = consist.run(
        fn=clean_data,
        inputs={"raw_df": raw_path},
        config={"threshold": 0.8},
        outputs=["cleaned"],
    )
print(f"Cache hit: {result3.cache_hit}")


Cache hit: False


## Inspect recorded runs

Query recent runs by model name (derived from the function name). The `cache_hit` flag, read here from each run's `meta`, distinguishes runs that executed the function from runs that were served from cache.

To filter runs by parameter values, see [Query Facets](../docs/usage-guide.md#query-facets-with-pivot_facets).


In [17]:
runs = tracker.find_runs(model="clean_data", limit=10)
[(run.id, run.status, run.meta.get("cache_hit", False)) for run in runs]

[('clean_data_22a1bf70', 'completed', False),
 ('clean_data_dce2bc60', 'completed', True),
 ('clean_data_0ae6da2d', 'completed', False)]

## See the outputs on disk

Consist stores artifacts under the run directory. You can access the files directly, or load them via the Artifact object.
Notice we have two Parquet files, one for each config value we ran with.


In [18]:
output_root = tracker.run_dir / "outputs"
print("Output files:")
for p in sorted(output_root.rglob("*.parquet")):
    print(f"  {p.relative_to(tracker.run_dir)}")

print("\nCleaned data (threshold=0.5):")
consist.load_df(result1.outputs["cleaned"])


   value
0    0.6
1    0.9
