# 01 Reproducible Pipeline Design

Configs, run IDs, dataset hashes, and artifact layout.


## Table of Contents
- [Configs](#configs)
- [Outputs](#outputs)
- [Reproducibility](#reproducibility)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)


## Why This Notebook Matters
Model ops notebooks turn exploratory work into reproducible runs with saved artifacts.
The goal: someone else can run your pipeline with the same data and config and get the same metrics.


## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can explain what you built and why each step exists.
- You can run your work end-to-end without undefined variables.

## Common Pitfalls
- Running cells top-to-bottom without reading the instructions.
- Leaving `...` placeholders in code cells.
- Not recording which dataset/config a model was trained on.
- Overwriting artifacts without run IDs.

## Matching Guide
- `docs/guides/05_model_ops/01_reproducible_pipeline_design.md`



## How To Use This Notebook
- This notebook is hands-on. Most code cells are incomplete on purpose.
- Complete each TODO, then run the cell.
- Use the matching guide (`docs/guides/05_model_ops/01_reproducible_pipeline_design.md`) for deep explanations and alternative examples.
- Write short interpretation notes as you go (what changed, why it matters).



<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.



In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT



## Goal
Understand how to turn notebooks into reproducible runs with configs + artifacts.

A model is not "done" when you get a good plot.
A model is done when you can re-run it and reproduce:
- the dataset used
- the features used
- the metrics
- the predictions
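
One way to make "the dataset used" checkable is to fingerprint the file's bytes and record the digest with the run. The sketch below is a minimal illustration, not this repo's implementation: `file_sha256` and `make_run_id` are hypothetical helper names, and the run ID format (UTC timestamp plus a short hash prefix) is an assumption.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open('rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def make_run_id(dataset_hash: str, now: Optional[datetime] = None) -> str:
    """Hypothetical format: UTC timestamp plus the first 8 hash characters."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now.strftime('%Y%m%dT%H%M%SZ') + '_' + dataset_hash[:8]


# Demonstration on a throwaway file (made-up contents).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.csv'
    sample.write_bytes(b'date,value\n2020-01-01,1.0\n')
    hash_a = file_sha256(sample)
    hash_b = file_sha256(sample)  # same bytes, same digest
    sample.write_bytes(b'date,value\n2020-01-01,2.0\n')
    hash_c = file_sha256(sample)  # one value changed, different digest

run_id = make_run_id(hash_a, now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
print(run_id)
```

Hashing raw bytes means even a whitespace change produces a new digest. That is stricter than comparing parsed values, but it is simple and unambiguous.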



## Primer: Paths, Files, and Environment Variables

You will see a few patterns repeatedly in this repo.

### Environment variables
> **What this is:** Environment variables are key/value settings provided by your shell to your Python process.

We use them for API keys and configuration defaults.

```python
import os

# Reads an environment variable or returns None
fred_key = os.getenv('FRED_API_KEY')
print('FRED key set?', fred_key is not None)
```

If you're running from a terminal, you can set a key like this:

```bash
export FRED_API_KEY="your_key_here"
```

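Export the variable before launching Jupyter from that same shell. If Jupyter is already running, restart the Jupyter server, not just the kernel: the kernel inherits its environment from the server process.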
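
If a missing key should stop the run early instead of surfacing later as a confusing API error, a small guard helps. This is a sketch only: `require_env` is a hypothetical helper, and `EXAMPLE_API_KEY` is a made-up variable name used so that no real key is touched.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f'{name} is not set. Export it in the shell that launches Jupyter.'
        )
    return value


# Demonstration with a made-up variable name.
os.environ['EXAMPLE_API_KEY'] = 'demo'
print(require_env('EXAMPLE_API_KEY'))
```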

### Paths (why `pathlib.Path`)
> **What this is:** A Path is a safe way to build file paths without worrying about OS-specific separators.

```python
from pathlib import Path

p = Path('data') / 'sample' / 'macro_quarterly_sample.csv'
print(p)
print('exists?', p.exists())
```

In these notebooks, the bootstrap cell defines:
- `PROJECT_ROOT` (repo root)
- `DATA_DIR`, `RAW_DIR`, `PROCESSED_DIR`, `SAMPLE_DIR`

Prefer those over hard-coding paths.

### Reading and writing CSV files
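```python
from pathlib import Path

import pandas as pd

# Read (uncomment once the file exists; the first column is parsed as the date index)
# p = Path('data') / 'sample' / 'macro_quarterly_sample.csv'
# df = pd.read_csv(p, index_col=0, parse_dates=True)

# Write (create the parent directory first so to_csv does not fail)
# out = Path('data') / 'processed' / 'my_dataset.csv'
# out.parent.mkdir(parents=True, exist_ok=True)
# df.to_csv(out)
```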

### Tip
If you get a "file not found" error:
- `print(path)` to confirm you're reading what you think you're reading
- `print(path.exists())` to confirm the file exists
- if you're using a relative path, confirm your current working directory: `import os; print(os.getcwd())`
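
The checks in this tip can be bundled into one helper that reports everything at once. This is a sketch: `diagnose_path` is a hypothetical name, not something defined in this repo.

```python
import os
from pathlib import Path


def diagnose_path(path) -> dict:
    """Collect the facts worth checking when a file cannot be found."""
    p = Path(path)
    return {
        'given': str(p),
        'is_absolute': p.is_absolute(),
        'cwd': os.getcwd(),
        'resolved': str(p.resolve()),  # relative paths resolve against cwd
        'exists': p.exists(),
        'parent_exists': p.parent.exists(),
    }


report = diagnose_path(Path('data') / 'sample' / 'macro_quarterly_sample.csv')
for key, value in report.items():
    print(f'{key:>14}: {value}')
```

If `parent_exists` is also false, the working directory is the likely culprit, not the file name.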


<a id="configs"></a>
## Configs

### Goal
Inspect a YAML config and understand what it controls.
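
Once a config is loaded it is an ordinary dict, so two cheap checks become possible: confirming that required keys are present, and fingerprinting the content so a run can record exactly which settings it used. The sketch below uses a stand-in dict instead of the real file. The key names (`series`, `features`, `split`), their values, and both helper names are assumptions for illustration; the actual `configs/recession.yaml` may be structured differently.

```python
import hashlib
import json

# Stand-in for the dict returned by yaml.safe_load(...). Keys and values are hypothetical.
example_cfg = {
    'series': ['SERIES_A', 'SERIES_B'],
    'features': {'lags': [1, 2, 4]},
    'split': {'test_size': 0.2},
}

REQUIRED_KEYS = ('series', 'features', 'split')  # assumed, not taken from the repo


def missing_keys(cfg: dict, required: tuple) -> list:
    """Return the required top-level keys that are absent from cfg."""
    return [key for key in required if key not in cfg]


def config_fingerprint(cfg: dict) -> str:
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(cfg, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()[:12]


# Same content with the keys in a different order.
reordered_cfg = {
    'split': {'test_size': 0.2},
    'features': {'lags': [1, 2, 4]},
    'series': ['SERIES_A', 'SERIES_B'],
}

print(missing_keys(example_cfg, REQUIRED_KEYS))
print(config_fingerprint(example_cfg) == config_fingerprint(reordered_cfg))
```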



### Your Turn (1): Load and inspect configs/recession.yaml


In [None]:
import yaml
from pathlib import Path

cfg_path = PROJECT_ROOT / 'configs' / 'recession.yaml'
cfg = yaml.safe_load(cfg_path.read_text(encoding='utf-8'))

# TODO: Print top-level keys and explain what each one controls.
cfg.keys()



### Your Turn (2): Find where config values are used


In [None]:
# TODO: Open scripts/train_recession.py and scripts/build_datasets.py.
# Find how 'series', 'feature settings', and 'split rules' are used.
# Write a short list of 'hard-coded' vs 'configurable'.
...



<a id="outputs"></a>
## Outputs

### Goal
Run a pipeline and inspect the artifact bundle under `outputs/<run_id>/`.
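
To see why a per-run folder protects against the "overwriting artifacts" pitfall, here is a minimal sketch of creating one. It is not the training script's actual logic: `create_run_dir` is a hypothetical helper, the timestamp-based run ID is an assumption, and the metric written is a placeholder.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def create_run_dir(outputs_root: Path, run_id: str) -> Path:
    """Create outputs/<run_id>/ and refuse to reuse an existing folder."""
    run_dir = outputs_root / run_id
    run_dir.mkdir(parents=True, exist_ok=False)  # raises FileExistsError on collision
    return run_dir


# Demonstration in a temporary directory so nothing in the repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    outputs_root = Path(tmp) / 'outputs'
    run_id = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
    run_dir = create_run_dir(outputs_root, run_id)
    (run_dir / 'metrics.json').write_text(
        json.dumps({'example_metric': 0.0}), encoding='utf-8'
    )
    created = sorted(p.name for p in run_dir.iterdir())
    try:
        create_run_dir(outputs_root, run_id)  # same run ID again
        collided = False
    except FileExistsError:
        collided = True

print(created, collided)
```

Failing loudly on a duplicate run ID is a design choice: it trades a little convenience for the guarantee that an earlier run's artifacts are never silently replaced.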



### Your Turn (1): Run the pipeline from your terminal


Run these commands in a terminal (from the repo root):
- `python scripts/build_datasets.py --recession-config configs/recession.yaml --census-config configs/census.yaml`
- `python scripts/train_recession.py --config configs/recession.yaml`

Then come back here and inspect the generated `outputs/<run_id>/` folder.



### Your Turn (2): Inspect outputs/ in Python


In [None]:
from pathlib import Path

# TODO: List run folders under outputs/
out_dir = PROJECT_ROOT / 'outputs'
# Sorted by folder name, so runs[-1] is the newest run only if run IDs sort chronologically.
runs = sorted(p for p in out_dir.glob('*') if p.is_dir())
runs[-3:]



<a id="reproducibility"></a>
## Reproducibility

### Goal
Verify that a run is self-describing (you can tell what it did).

Minimum expected artifacts:
- `model.joblib`
- `metrics.json`
- `predictions.csv`
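
Those three files say what a run produced, but not what it consumed. A small metadata file closes that gap. The sketch below shows one possible shape: the field names, the `build_run_metadata` helper, and all values are hypothetical, and the real `run_metadata.json` written by the training script may differ.

```python
import json
import platform
import tempfile
from pathlib import Path


def build_run_metadata(run_id: str, dataset_sha256: str, features: list, config_path: str) -> dict:
    """Assemble a small, JSON-serializable description of a run."""
    return {
        'run_id': run_id,
        'dataset_sha256': dataset_sha256,
        'features': list(features),
        'config_path': config_path,
        'python_version': platform.python_version(),
    }


metadata = build_run_metadata(
    run_id='example_run',
    dataset_sha256='0' * 64,  # placeholder digest, not a real dataset hash
    features=['feature_a', 'feature_b'],
    config_path='configs/recession.yaml',
)

# Write and reload in a temporary directory to confirm the round trip is lossless.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / 'run_metadata.json'
    meta_path.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding='utf-8')
    reloaded = json.loads(meta_path.read_text(encoding='utf-8'))

print(reloaded['features'])
```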



### Your Turn: Check artifact bundle completeness


In [None]:
# TODO: Pick the newest run folder and check expected files exist.
if not runs:
    raise RuntimeError('No runs found. Did you run the training script?')

run = runs[-1]
expected = ['model.joblib', 'metrics.json', 'predictions.csv']
for name in expected:
    print(name, (run / name).exists())
...



<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2-3 sentences summarizing what you verified.



In [None]:
# TODO: Run one script end-to-end and confirm an artifact bundle exists.
# Example:
# - list outputs/ and pick the newest run_id
# - assert model.joblib and metrics.json exist
...



## Extensions (Optional)
- Try one additional variant beyond the main path (different features, different split, different model).
- Write down what improved, what got worse, and your hypothesis for why.



## Reflection
- What did you assume implicitly (about timing, availability, stationarity, or costs)?
- If you had to ship this model, what would you monitor?



<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Configs</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_reproducible_pipeline_design — Configs
# Open configs/recession.yaml and configs/census.yaml and explain each field.
```

</details>

<details><summary>Solution: Outputs</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_reproducible_pipeline_design — Outputs
# Run:
#   python scripts/build_datasets.py --recession-config configs/recession.yaml --census-config configs/census.yaml
#   python scripts/train_recession.py --config configs/recession.yaml
# Then inspect outputs/<run_id>/
```

</details>

<details><summary>Solution: Reproducibility</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_reproducible_pipeline_design — Reproducibility
# Confirm run_metadata.json includes dataset hash and feature list.
```

</details>

