# FOMC Data Pipeline - Productivity Analysis

This notebook is an end-to-end walkthrough of the Rearc Data Quest (Parts 1-3) using LocalStack:

- Part 1: Republish and keep the BLS `pr` dataset in sync in S3
- Part 2: Fetch the DataUSA population API and store the JSON payload in S3
- Part 3: Load `pr.data.0.Current` (TSV) + `population.json` into pandas DataFrames and produce:
  1. Mean + standard deviation of annual US population (2013-2018)
  2. Best year per `series_id` (max sum of quarterly values)
  3. Join for `series_id=PRS30006032`, `period=Q01` with population by year

## Prerequisites

Before running this notebook, ensure LocalStack is running:

```bash
# Start LocalStack + pre-create buckets/queues
.venv/bin/python tools/localstack_up.py

# Optional: run the full local pipeline refresh
.venv/bin/python tools/localstack_full_refresh.py
```

The first cell below loads environment variables from `.env.shared` + `.env.localstack`.


In [46]:
# Cell 0: Load LocalStack environment + project paths
# This must run BEFORE importing from src.config

import os
import sys
from pathlib import Path

def find_repo_root(start: Path) -> Path:
    for p in [start] + list(start.parents):
        if (p / "pyproject.toml").exists() and (p / "src").exists():
            return p
    raise RuntimeError(f"Could not find repo root starting from: {start}")

REPO_ROOT = find_repo_root(Path.cwd())

# Make local modules importable (tools/env_loader.py, src/*)
sys.path.insert(0, str(REPO_ROOT / "tools"))
sys.path.insert(0, str(REPO_ROOT))

from env_loader import load_localstack_env  # noqa: E402

load_localstack_env()

print(f"Repo root: {REPO_ROOT}")
print(f"AWS_ENDPOINT_URL: {os.environ.get('AWS_ENDPOINT_URL', 'not set (using AWS)')}")
print(f"FOMC_BUCKET_PREFIX: {os.environ.get('FOMC_BUCKET_PREFIX', 'not set')}")


Repo root: /Users/ryan/Developer/fomc-agent
AWS_ENDPOINT_URL: http://localhost:4566
FOMC_BUCKET_PREFIX: fomc


## Part 1 + Part 2: Sync Raw Data into S3 (LocalStack)

This cell runs the same code path as the scheduled Lambda in the AWS/CDK pipeline:

- Part 1: sync BLS `pr` files into S3 (keeps in sync with adds/updates/deletes)
- Part 2: fetch the DataUSA population API and store the JSON payload in S3
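
The add/update/delete bookkeeping in Part 1 can be pictured as a comparison between the upstream listing and the last stored state. The sketch below is not the project's handler: `classify_changes`, the fingerprint strings, and the file name `pr.old` are hypothetical and exist only for illustration. The four category names match the keys that appear in the handler response.

```python
def classify_changes(remote: dict[str, str], stored: dict[str, str]) -> dict[str, list[str]]:
    """Compare {filename: fingerprint} maps and bucket each file by sync action."""
    result = {"added": [], "updated": [], "unchanged": [], "deleted": []}
    for name, fingerprint in remote.items():
        if name not in stored:
            result["added"].append(name)        # new upstream file
        elif stored[name] != fingerprint:
            result["updated"].append(name)      # same name, different content
        else:
            result["unchanged"].append(name)
    # Files still in the bucket but gone upstream are candidates for deletion.
    result["deleted"] = [name for name in stored if name not in remote]
    return {action: sorted(names) for action, names in result.items()}


# Toy fingerprints ("v1", "v2"); a real sync might use an ETag, size, or content hash.
remote = {"pr.data.0.Current": "v2", "pr.series": "v1", "pr.txt": "v1"}
stored = {"pr.data.0.Current": "v1", "pr.series": "v1", "pr.old": "v1"}
changes = classify_changes(remote, stored)
print(changes)
```

Keeping the comparison a pure function of two dictionaries makes it easy to test without touching S3 or the network.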


In [47]:
import os
import json
import time
import urllib.request

endpoint = os.environ.get('AWS_ENDPOINT_URL', 'http://localhost:4566').rstrip('/')
health_url = f"{endpoint}/_localstack/health"

try:
    with urllib.request.urlopen(health_url, timeout=5) as resp:  # nosec - local URL
        health = json.loads(resp.read().decode("utf-8", errors="replace"))
    print("LocalStack is reachable:", health.get("version", "unknown"))
except Exception as exc:
    raise RuntimeError(
        f"LocalStack does not look reachable at {health_url}. "
        "Start it with `.venv/bin/python tools/localstack_up.py` and re-run this cell."
    ) from exc

from src.lambdas.data_fetcher.handler import handler as fetcher_handler

started = time.time()
response = fetcher_handler({}, None)
duration_seconds = time.time() - started

body = response.get("body")
if isinstance(body, str):
    try:
        body = json.loads(body)
    except json.JSONDecodeError:
        pass  # body is not JSON; leave it as the raw string

print(json.dumps({"duration_seconds": round(duration_seconds, 2), "response": {**response, "body": body}}, indent=2, default=str))


{
  "duration_seconds": 9.91,
  "response": {
    "statusCode": 200,
    "body": {
      "bls": {
        "pr": {
          "updated": [],
          "added": [],
          "unchanged": [
            "pr.data.0.Current"
          ],
          "deleted": []
        },
        "cu": {
          "updated": [],
          "added": [],
          "unchanged": [
            "cu.data.0.Current"
          ],
          "deleted": []
        },
        "jt": {
          "updated": [],
          "added": [],
          "unchanged": [
            "jt.data.0.Current"
          ],
          "deleted": []
        },
        "ci": {
          "updated": [],
          "added": [],
          "unchanged": [
            "ci.data.0.Current"
          ],
          "deleted": []
        }
      },
      "datausa": {
        "datasets": {
          "population": {
            "action": "skipped",
            "dataset_id": "population",
            "key": "population.json"
          },
          "commute_time": {


In [48]:
import pandas as pd

from src.config import bls_data_key, get_bls_bucket, get_datausa_bucket, get_datausa_key

BLS_BUCKET = get_bls_bucket()
BLS_KEY = bls_data_key("pr", "pr.data.0.Current")  # Quest Part 1
DATAUSA_BUCKET = get_datausa_bucket()
POP_KEY = get_datausa_key()  # Quest Part 2

print(f"BLS raw:     s3://{BLS_BUCKET}/{BLS_KEY}")
print(f"DataUSA raw: s3://{DATAUSA_BUCKET}/{POP_KEY}")


BLS raw:     s3://fomc-bls-raw/pr/pr.data.0.Current
DataUSA raw: s3://fomc-datausa-raw/population.json


## Load Data from S3
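
As a rough picture of what loading the TSV involves, the sketch below parses a small in-memory excerpt with pandas. It assumes the raw file pads its headers and values with spaces, which is why everything is stripped before type conversion. It does not reproduce `load_bls_from_s3`, whose internals are not shown in this notebook.

```python
import io

import pandas as pd

# Toy excerpt shaped like the BLS file: tab-separated, with padded headers and values.
raw = (
    "series_id        \tyear\tperiod\t       value\tfootnote_codes\n"
    "PRS30006011      \t1995\tQ01\t         2.6\t\n"
    "PRS30006011      \t1995\tQ02\t         2.1\t\n"
)

# Read everything as strings first so padding cannot confuse type inference.
df = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)
df.columns = [c.strip() for c in df.columns]
df["series_id"] = df["series_id"].str.strip()
df["period"] = df["period"].str.strip()
df["year"] = df["year"].astype(int)
df["value"] = pd.to_numeric(df["value"].str.strip())
print(df.dtypes)
```

Stripping before filtering matters: an unstripped `series_id` would silently fail an equality check such as `df["series_id"] == "PRS30006011"`.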

In [49]:
from IPython.display import display
from src.analytics.reports import load_population_from_s3, load_bls_from_s3

pop_df = load_population_from_s3(bucket=DATAUSA_BUCKET, key=POP_KEY)
bls_df = load_bls_from_s3(bucket=BLS_BUCKET, key=BLS_KEY)

print(f'Population rows: {len(pop_df)}')
print(f'BLS rows: {len(bls_df)}')

display(pop_df.head())
display(bls_df.head())


Unnamed: 0,series_id,year,period,value,footnote_codes
0,PRS30006011,1995,Q01,2.6,
1,PRS30006011,1995,Q02,2.1,
2,PRS30006011,1995,Q03,0.9,
3,PRS30006011,1995,Q04,0.1,
4,PRS30006011,1995,Q05,1.4,


## Report 1: Population Statistics (2013-2018)

Calculate the mean and standard deviation of the annual US population for the years 2013-2018, inclusive.
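
For orientation, here is a minimal pandas sketch of the same calculation on toy numbers. The column names `Year` and `Population` are assumptions about the DataUSA payload, and the notebook does not show whether `report_population_stats` uses the sample or the population estimator, so both are computed.

```python
import pandas as pd

# Toy values, not the real DataUSA figures.
pop = pd.DataFrame({
    "Year": [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
    "Population": [100, 110, 120, 130, 140, 150, 160, 170],
})

window = pop[pop["Year"].between(2013, 2018)]        # between() is inclusive on both ends
mean = window["Population"].mean()
sample_std = window["Population"].std()              # pandas default: ddof=1
population_std = window["Population"].std(ddof=0)    # divide by N instead of N-1

print(f"mean={mean}, sample_std={sample_std:.3f}, population_std={population_std:.3f}")
```

With only six yearly observations the two estimators differ noticeably, so it is worth stating which one a reported figure uses.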

In [50]:
from src.analytics.reports import report_population_stats

pop_stats = report_population_stats(pop_df)
print("Population Statistics (2013-2018):")
print(f"  Mean:   {pop_stats['mean']:,.0f}")
print(f"  StdDev: {pop_stats['stddev']:,.0f}")

Population Statistics (2013-2018):
  Mean:   322,069,808
  StdDev: 4,158,441


## Report 2: Best Year by Series ID

For every `series_id`, find the year with the largest sum of quarterly values.
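
The grouping logic can be sketched in pandas as follows, using toy data with placeholder series IDs `A` and `B`. One assumption is labeled in the code: rows with `period == "Q05"` (which in BLS quarterly series typically denotes an annual average) are excluded from the sum. Whether `report_best_year_by_series` applies the same filter is not shown in this notebook.

```python
import pandas as pd

bls = pd.DataFrame({
    "series_id": ["A", "A", "A", "A", "B", "B", "B"],
    "year":      [2020, 2020, 2021, 2021, 2020, 2021, 2021],
    "period":    ["Q01", "Q02", "Q01", "Q05", "Q01", "Q01", "Q02"],
    "value":     [1.0, 2.0, 5.0, 100.0, 4.0, 1.0, 1.0],
})

# Assumption: only Q01-Q04 count as quarterly values; Q05 is left out.
quarterly = bls[bls["period"].isin(["Q01", "Q02", "Q03", "Q04"])]

# Sum per (series, year), then keep the row with the largest sum for each series.
yearly = quarterly.groupby(["series_id", "year"], as_index=False)["value"].sum()
best = yearly.loc[yearly.groupby("series_id")["value"].idxmax()].reset_index(drop=True)
print(best)
```

If the `Q05` row were included, series `A` would report 105.0 for 2021 instead of 5.0, which shows how much the filter can change the result.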

In [51]:
from IPython.display import display
from src.analytics.reports import report_best_year_by_series

best_year_rows = report_best_year_by_series(bls_df)
best_year_df = pd.DataFrame(best_year_rows).sort_values('series_id').reset_index(drop=True)
print(f'Series with best years: {len(best_year_df)}')
display(best_year_df.head(32))

Unnamed: 0,series_id,year,value
0,PRS30006011,2022,20.5
1,PRS30006012,2022,17.1
2,PRS30006013,1998,705.9
3,PRS30006021,2010,17.7
4,PRS30006022,2010,12.4
5,PRS30006023,2014,503.2
6,PRS30006031,2022,20.5
7,PRS30006032,2021,17.1
8,PRS30006033,1998,702.7
9,PRS30006061,2022,34.5


## Report 3: Series + Population Join

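Join BLS data (`series_id=PRS30006032`, `period=Q01`) with population data by year. Years with no matching population record keep an empty `Population` value, as in the 1995-2004 rows of the output.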
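
A minimal sketch of such a join in pandas, on toy values. The `Year` column name on the population side and the use of `how="left"` are assumptions; a left join is consistent with output rows that have a BLS value but no population figure.

```python
import pandas as pd

# Toy values; only the series ID and period labels come from the report definition.
bls = pd.DataFrame({
    "series_id": ["PRS30006032", "PRS30006032", "PRS30006032", "PRS30006011"],
    "year":      [2012, 2013, 2013, 2013],
    "period":    ["Q01", "Q01", "Q02", "Q01"],
    "value":     [1.5, 0.5, 0.8, 2.0],
})
pop = pd.DataFrame({"Year": [2013, 2014], "Population": [1000, 1010]})

# Filter first, then join, so the merge only touches the rows of interest.
selected = bls[(bls["series_id"] == "PRS30006032") & (bls["period"] == "Q01")]
joined = selected.merge(
    pop.rename(columns={"Year": "year"}),  # align the join key name
    on="year",
    how="left",                            # keep BLS rows even without population data
)
print(joined)
```

Switching to `how="inner"` would drop the 2012 row, leaving only years present in both sources.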

In [52]:
from IPython.display import display
from src.analytics.reports import report_series_population_join

join_rows = report_series_population_join(bls_df, pop_df)
join_df = pd.DataFrame(join_rows).sort_values('year').reset_index(drop=True)
display(join_df)

Unnamed: 0,series_id,year,period,value,Population
0,PRS30006032,1995,Q01,0.0,
1,PRS30006032,1996,Q01,-4.2,
2,PRS30006032,1997,Q01,2.8,
3,PRS30006032,1998,Q01,0.9,
4,PRS30006032,1999,Q01,-4.1,
5,PRS30006032,2000,Q01,0.5,
6,PRS30006032,2001,Q01,-6.3,
7,PRS30006032,2002,Q01,-6.6,
8,PRS30006032,2003,Q01,-5.7,
9,PRS30006032,2004,Q01,2.0,


---

## Data Quality & Sync History

Load sync logs from S3 and visualize data pipeline health.
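
To make the freshness idea concrete without S3, the sketch below parses a few toy JSONL lines and computes the age of the most recent entry. The field names `timestamp`, `action`, and `file` are the ones the cells in this section read. The log contents, `parse_jsonl`, and `days_since_last` are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Toy log: two valid entries, one blank line, one malformed line.
jsonl_text = "\n".join([
    '{"timestamp": "2026-02-01T03:00:00Z", "action": "added", "file": "pr.data.0.Current"}',
    "",
    "not json",
    '{"timestamp": "2026-02-08T03:00:00Z", "action": "unchanged", "file": "pr.data.0.Current"}',
])


def parse_jsonl(text: str) -> list[dict]:
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entries.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole load
    return entries


def days_since_last(entries: list[dict], now: datetime) -> Optional[int]:
    # Keep datetimes timezone-aware so the subtraction is unambiguous.
    stamps = [
        datetime.fromisoformat(e["timestamp"].replace("Z", "+00:00"))
        for e in entries
        if e.get("timestamp")
    ]
    return (now - max(stamps)).days if stamps else None


entries = parse_jsonl(jsonl_text)
now = datetime(2026, 2, 9, 4, 0, tzinfo=timezone.utc)  # fixed "now" for reproducibility
age_days = days_since_last(entries, now)
print(len(entries), age_days)
```

Passing `now` as an argument rather than calling the clock inside the function keeps the calculation deterministic and easy to test.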

In [53]:
import json
from datetime import datetime

from botocore.exceptions import ClientError

from src.config import get_bls_bucket, get_datausa_bucket, get_datausa_datasets
from src.helpers.aws_client import get_client

s3 = get_client("s3")

def _is_missing_key(exc: Exception) -> bool:
    if not isinstance(exc, ClientError):
        return False
    code = str(exc.response.get("Error", {}).get("Code", "")).strip()
    return code in {"404", "NoSuchKey", "NotFound"}

def load_jsonl(bucket: str, key: str) -> list[dict]:
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
    except Exception as exc:
        if _is_missing_key(exc):
            return []
        raise
    lines = response["Body"].read().decode("utf-8", errors="replace").splitlines()
    out: list[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            out.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return out

# BLS sync logs are stored per-series
bls_series_id = "pr"
bls_log = load_jsonl(get_bls_bucket(), f"_sync_state/{bls_series_id}/sync_log.jsonl")

# DataUSA sync logs are stored per-dataset
datausa_bucket = get_datausa_bucket()
datausa_datasets = get_datausa_datasets(default="population")
datausa_logs_by_dataset = {
    ds: load_jsonl(datausa_bucket, f"_sync_state/datausa/{ds}/sync_log.jsonl")
    for ds in datausa_datasets
}

datausa_log: list[dict] = []
for ds, logs in datausa_logs_by_dataset.items():
    for entry in logs:
        e = dict(entry)
        e.setdefault("dataset_id", ds)
        datausa_log.append(e)

print(f'BLS sync log entries (pr): {len(bls_log)}')
print(f'DataUSA sync log entries (all datasets): {len(datausa_log)}')


BLS sync log entries (pr): 10
DataUSA sync log entries (all datasets): 30


In [54]:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from collections import Counter, defaultdict

# Chart 1: BLS/pr Sync Actions
# Chart 2: Data Freshness (BLS + each DataUSA dataset)
# Chart 3: File Size Changes (BLS)
# Chart 4: Update Frequency (BLS)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Chart 1
if bls_log:
    actions = Counter(e.get("action", "unknown") for e in bls_log)
    ax = axes[0, 0]
    ax.bar(actions.keys(), actions.values(), color=["#2ecc71", "#3498db", "#e74c3c", "#95a5a6"])
    ax.set_title("Chart 1: BLS/pr Sync Actions")
    ax.set_ylabel("Count")
else:
    axes[0, 0].text(0.5, 0.5, "No BLS sync data", ha="center", va="center")
    axes[0, 0].set_title("Chart 1: BLS/pr Sync Actions")

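def _parse_ts(ts: str):
    """Parse an ISO-8601 timestamp into a naive UTC datetime; return None on failure."""
    from datetime import datetime, timezone
    try:
        dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    except (AttributeError, TypeError, ValueError):
        return None
    # Convert aware timestamps to UTC before dropping tzinfo so offsets are not lost.
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc)
    return dt.replace(tzinfo=None)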

def _last_ts(logs: list[dict]):
    dts = []
    for e in logs:
        ts = e.get("timestamp")
        if not ts:
            continue
        dt = _parse_ts(ts)
        if dt is not None:
            dts.append(dt)
    return max(dts) if dts else None

# Chart 2
ax = axes[0, 1]
sources = []
days_since = []
from datetime import datetime, timezone
now = datetime.now(timezone.utc).replace(tzinfo=None)  # naive UTC, comparable with _parse_ts results
pairs = [(bls_log, "BLS/pr")] + [(logs, f"DataUSA/{ds}") for ds, logs in sorted(datausa_logs_by_dataset.items())]
for logs, name in pairs:
    last = _last_ts(logs)
    if last is None:
        continue
    sources.append(name)
    days_since.append((now - last).days)

if sources:
    colors = ["#2ecc71" if d <= 7 else "#f39c12" if d <= 30 else "#e74c3c" for d in days_since]
    ax.barh(sources, days_since, color=colors)
    ax.set_xlabel("Days Since Last Sync")
    ax.set_title("Chart 2: Data Freshness")
else:
    ax.text(0.5, 0.5, "No sync data", ha="center", va="center")
    ax.set_title("Chart 2: Data Freshness")

# Chart 3: File Size Changes (BLS)
ax = axes[1, 0]
if bls_log:
    file_sizes = defaultdict(list)
    for entry in bls_log:
        if "bytes" in entry and "file" in entry:
            file_sizes[entry["file"]].append(entry["bytes"])
    if file_sizes:
        for fname, sizes in list(file_sizes.items())[:5]:
            ax.plot(range(len(sizes)), sizes, marker="o", label=fname[:20])
        ax.set_title("Chart 3: File Size Changes (BLS)")
        ax.set_ylabel("Bytes")
        ax.legend(fontsize=7)
    else:
        ax.text(0.5, 0.5, "No file size data", ha="center", va="center")
        ax.set_title("Chart 3: File Size Changes")
else:
    ax.text(0.5, 0.5, "No BLS sync data", ha="center", va="center")
    ax.set_title("Chart 3: File Size Changes")

# Chart 4: Update Frequency (BLS)
ax = axes[1, 1]
if bls_log:
    hours = []
    for entry in bls_log:
        ts = entry.get("timestamp", "")
        if not ts or entry.get("action") not in ("updated", "added"):
            continue
        dt = _parse_ts(ts)
        if dt is not None:
            hours.append(dt.hour)
    if hours:
        ax.hist(hours, bins=24, range=(0, 24), color="#3498db", edgecolor="white")
        ax.set_xlabel("Hour of Day (UTC)")
        ax.set_ylabel("Updates")
        ax.set_title("Chart 4: Update Frequency by Hour")
    else:
        ax.text(0.5, 0.5, "No update data", ha="center", va="center")
        ax.set_title("Chart 4: Update Frequency")
else:
    ax.text(0.5, 0.5, "No sync data", ha="center", va="center")
    ax.set_title("Chart 4: Update Frequency")

plt.tight_layout()
out_path = REPO_ROOT / "notebooks" / "sync_history.png"
out_path.parent.mkdir(parents=True, exist_ok=True)
plt.savefig(out_path, dpi=100, bbox_inches="tight")
plt.show()
print(f"Charts saved to {out_path}")




In [55]:
# Summary Table: Current state of all data sources
def load_state_json(bucket: str, key: str) -> dict:
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        return json.loads(response["Body"].read())
    except Exception:
        return {}

from src.config import get_bls_bucket, get_datausa_bucket, get_datausa_datasets

bls_state = load_state_json(get_bls_bucket(), "_sync_state/pr/latest_state.json")
datausa_bucket = get_datausa_bucket()
datausa_states = {
    ds: load_state_json(datausa_bucket, f"_sync_state/datausa/{ds}/latest_state.jsonl")
    for ds in get_datausa_datasets(default="population")
}

print("{:<25} {:<22} {:<10} {}".format("Source", "Last Sync", "Items", "Status"))
print("-" * 72)

bls_sync = bls_state.get("last_sync", "N/A")
bls_files = len(bls_state.get("files", {}) or {})
print("{:<25} {:<22} {:<10} {}".format("BLS/pr", str(bls_sync)[:19], bls_files, "Current" if bls_files > 0 else "No data"))

for ds, state in sorted(datausa_states.items()):
    last_sync = state.get("last_sync", "N/A")
    record_count = state.get("record_count")
    items = record_count if isinstance(record_count, int) else "N/A"
    status = "Current" if state.get("content_hash") else "No data"
    print("{:<25} {:<22} {:<10} {}".format("DataUSA/" + ds, str(last_sync)[:19], str(items), status))


Source                    Last Sync              Items      Status
------------------------------------------------------------------------
BLS/pr                    2026-02-09T04:09:41    1          Current
DataUSA/citizenship       2026-02-09T03:32:19    20         Current
DataUSA/commute_time      2026-02-09T03:32:14    48334      Current
DataUSA/population        2026-02-09T03:32:13    10         Current


In [56]:
print('Analysis complete.')

Analysis complete.
