# The Calibration Matrix

The calibration pipeline has three stages: (1) compute uprated target values ([`hierarchical_uprating.ipynb`](hierarchical_uprating.ipynb)), (2) assemble the sparse constraint matrix (this notebook), and (3) optimize weights ([`fit_calibration_weights.py`](../policyengine_us_data/datasets/cps/local_area_calibration/fit_calibration_weights.py)). This notebook is the diagnostic checkpoint between stages 2 and 3: understand your matrix before you optimize.

We build the full calibration matrix using `SparseMatrixBuilder`, then use `MatrixTracer` to inspect its structure: what rows and columns represent, how target groups partition the loss function, and where sparsity patterns emerge.

**Requirements:** `policy_data.db` and the stratified CPS h5 file in `STORAGE_FOLDER`.

## 1. Setup

In [1]:
import numpy as np
import pandas as pd
from policyengine_us import Microsimulation
from policyengine_us_data.storage import STORAGE_FOLDER
from policyengine_us_data.datasets.cps.local_area_calibration.sparse_matrix_builder import (
    SparseMatrixBuilder,
)
from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
    get_all_cds_from_database,
    create_target_groups,
    STATE_CODES,
)
from policyengine_us_data.datasets.cps.local_area_calibration.matrix_tracer import (
    MatrixTracer,
)

db_path = STORAGE_FOLDER / "calibration" / "policy_data.db"
db_uri = f"sqlite:///{db_path}"
dataset_path = STORAGE_FOLDER / "stratified_extended_cps_2024.h5"



In [2]:
sim = Microsimulation(dataset=str(dataset_path))
cds_to_calibrate = get_all_cds_from_database(db_uri)

builder = SparseMatrixBuilder(
    db_uri=db_uri,
    time_period=2024,
    cds_to_calibrate=cds_to_calibrate,
    dataset_path=str(dataset_path),
)

targets_df, X_sparse, household_id_mapping = builder.build_matrix(
    sim,
    target_filter={"domain_variables": ["aca_ptc", "snap"]},
    hierarchical_domains=["aca_ptc", "snap"],
)

tracer = MatrixTracer(
    targets_df, X_sparse, household_id_mapping, cds_to_calibrate, sim
)

print(f"Matrix shape: {X_sparse.shape}")
print(f"Non-zero entries: {X_sparse.nnz:,}")

Matrix shape: (1411, 5231564)
Non-zero entries: 2,199,033


## 2. Matrix overview

In [3]:
tracer.print_matrix_structure()


MATRIX STRUCTURE BREAKDOWN

Matrix dimensions: 1411 rows x 5231564 columns
  Rows = 1411 targets
  Columns = 11999 households x 436 CDs
           = 11,999 x 436 = 5,231,564

--------------------------------------------------------------------------------
COLUMN STRUCTURE (Households stacked by CD)
--------------------------------------------------------------------------------

Showing first and last 5 CDs of 436 total:

First 5 CDs:
cd_geoid  start_col  end_col  n_households
    1001          0    11998         11999
     101      11999    23997         11999
     102      23998    35996         11999
     103      35997    47995         11999
     104      47996    59994         11999

Last 5 CDs:
cd_geoid  start_col  end_col  n_households
     901    5171569  5183567         11999
     902    5183568  5195566         11999
     903    5195567  5207565         11999
     904    5207566  5219564         11999
     905    5219565  5231563         11999

------------------------------

## 3. Anatomy of a row

Each row is one calibration target — a known aggregate (dollar total, household count, person count) that the optimizer tries to match. The row vector's non-zero entries identify which (household, CD) pairs can contribute to that target.
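
To make the role of a row concrete, here is a minimal sketch in plain Python. It assumes the usual linear calibration setup, in which a target's estimate is the weighted sum of the row's coefficients. The row, weights, and target below are invented for illustration and do not come from the real matrix.

```python
# Toy sparse row stored as {column_index: coefficient}; all numbers are made up.
row = {3: 1.0, 7: 1.0, 12: 1.0}   # three (household, CD) columns can contribute
weights = [0.0] * 15              # one calibration weight per column
weights[3], weights[7], weights[12] = 120.0, 80.5, 0.0

# Only the row's non-zero columns enter the estimate.
estimate = sum(coef * weights[col] for col, coef in row.items())

target_value = 200.0              # the known aggregate the optimizer aims for
relative_error = (estimate - target_value) / target_value

print(f"estimate={estimate:.1f}, relative error={relative_error:+.2%}")
```

Columns outside the row's non-zero set have no effect on this target, whatever their weight.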

In [4]:
mid_row = X_sparse.shape[0] // 2
row_info = tracer.get_row_info(mid_row)
print(f"Row {mid_row}:")
for k, v in row_info.items():
    print(f"  {k}: {v}")

Row 705:
  row_index: 705
  variable: household_count
  variable_desc: Households represented
  geographic_id: 3402
  target_value: 48652.0536866581
  stratum_id: 9625
  domain_variable: snap


In [5]:
row_vec = X_sparse[mid_row, :]
nz_cols = row_vec.nonzero()[1]
print(f"Row {mid_row} has {len(nz_cols):,} non-zero columns")

if len(nz_cols) > 0:
    first_col = tracer.get_column_info(nz_cols[0])
    last_col = tracer.get_column_info(nz_cols[-1])
    print(f"\nFirst non-zero column ({nz_cols[0]}):")
    for k, v in first_col.items():
        print(f"  {k}: {v}")
    print(f"\nLast non-zero column ({nz_cols[-1]}):")
    for k, v in last_col.items():
        print(f"  {k}: {v}")

    unique_cds = set(
        tracer.get_column_info(c)["cd_geoid"] for c in nz_cols
    )
    print(f"\nSpans {len(unique_cds)} CD(s)")

Row 705 has 1,841 non-zero columns

First non-zero column (1991877):
  column_index: 1991877
  cd_geoid: 3402
  household_id: 952
  household_index: 43

Last non-zero column (2003831):
  column_index: 2003831
  cd_geoid: 3402
  household_id: 177860
  household_index: 11997

Spans 1 CD(s)


## 4. Anatomy of a column

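Each column represents one (household, CD) pair. The columns are organized in blocks of `n_households`: the first block belongs to the first CD in `cds_to_calibrate` (1001 in the output above, where the GEOIDs appear to be ordered as strings rather than numerically), the next block to the second CD, and so on. With zero-based `cd_block` and `hh_index`, the block formula is: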

$$\text{column\_idx} = \text{cd\_block} \times n_{\text{households}} + \text{hh\_index}$$

In [6]:
col_idx = tracer.n_households * 5 + 42
col_info = tracer.get_column_info(col_idx)
print(f"Column {col_idx}:")
for k, v in col_info.items():
    print(f"  {k}: {v}")

col_vec = X_sparse[:, col_idx]
nz_rows = col_vec.nonzero()[0]
print(f"\nThis column has non-zero values in {len(nz_rows)} target rows")
if len(nz_rows) > 0:
    print("First 5 target rows:")
    for r in nz_rows[:5]:
        ri = tracer.get_row_info(r)
        print(
            f"  row {r}: {ri['variable']} "
            f"(geo={ri['geographic_id']}, "
            f"val={X_sparse[r, col_idx]:.2f})"
        )

Column 60037:
  column_index: 60037
  cd_geoid: 105
  household_id: 946
  household_index: 42

This column has non-zero values in 0 target rows


In [7]:
expected_col = 5 * tracer.n_households + 42
assert col_idx == expected_col, f"{col_idx} != {expected_col}"
print(
    f"Block formula verified: "
    f"cd_block=5 * n_hh={tracer.n_households} + hh_idx=42 = {expected_col}"
)

Block formula verified: cd_block=5 * n_hh=11999 + hh_idx=42 = 60037
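
The block formula can also be inverted to recover the block and household index from a column index. The sketch below uses a hypothetical helper, `decode_column`, which is not part of `MatrixTracer`. It assumes both indices are zero-based, as in the cells above.

```python
def decode_column(col_idx: int, n_households: int) -> tuple[int, int]:
    """Invert the block formula: return (cd_block, hh_index), both zero-based."""
    if col_idx < 0 or n_households <= 0:
        raise ValueError("col_idx must be >= 0 and n_households must be > 0")
    # Quotient is the CD block, remainder is the position inside the block.
    return divmod(col_idx, n_households)


N_HH = 11_999  # households per CD block, taken from the output above

print(decode_column(60_037, N_HH))     # column from cell 6 -> (5, 42)
print(decode_column(1_991_877, N_HH))  # first non-zero column of row 705 -> (166, 43)
```

Block 5 is the sixth CD in the column ordering (1001, 101, 102, 103, 104, 105), which matches the `cd_geoid: 105` reported for column 60037.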


## 5. Target groups and loss weighting

Target groups partition the rows by (domain, variable, geographic level). Each group contributes equally to the loss function, so 436 district-level rows don't drown out 1 national row. The group IDs depend on the current database contents and may change if targets are added or removed.
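
One way to picture "each group contributes equally" is to weight every row by the inverse of its group size. The sketch below is an assumption for illustration only: the actual loss in `fit_calibration_weights.py` is not shown in this notebook, and the helper names and numbers are hypothetical.

```python
from collections import Counter


def balanced_row_weights(group_ids):
    """Give each row weight 1 / (rows in its group), so every group sums to 1."""
    counts = Counter(group_ids)
    return [1.0 / counts[g] for g in group_ids]


def grouped_loss(estimates, targets, group_ids):
    """Sum over groups of the mean squared relative error within each group."""
    weights = balanced_row_weights(group_ids)
    return sum(
        w * ((est - tgt) / tgt) ** 2
        for w, est, tgt in zip(weights, estimates, targets)
    )


# One "national" row (group 0) and four "district" rows (group 1); made-up values.
group_ids = [0, 1, 1, 1, 1]
targets = [100.0, 10.0, 20.0, 30.0, 40.0]
estimates = [110.0, 11.0, 22.0, 33.0, 44.0]  # every row is 10% too high

loss = grouped_loss(estimates, targets, group_ids)
print(round(loss, 4))
```

With every row off by the same 10%, both groups add the same amount to the loss even though one has four times as many rows.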

In [8]:
target_groups, group_info = create_target_groups(targets_df)

records = []
for gid, info in enumerate(group_info):
    mask = target_groups == gid
    vals = targets_df.loc[mask, "value"]
    records.append({
        "group_id": gid,
        "description": info,
        "n_targets": mask.sum(),
        "min_value": vals.min(),
        "median_value": vals.median(),
        "max_value": vals.max(),
    })

group_df = pd.DataFrame(records)
print(group_df.to_string(index=False))


=== Creating Target Groups ===

National targets:
  Group 0: ACA PTC Person Count = 19,743,689

State targets:
  Group 1: SNAP Household Count (51 targets)
  Group 2: Snap (51 targets)

District targets:
  Group 3: Aca Ptc (436 targets)
  Group 4: ACA PTC Tax Unit Count (436 targets)
  Group 5: SNAP Household Count (436 targets)

Total groups created: 6
 group_id                                                         description  n_targets    min_value  median_value    max_value
        0 Group 0: National ACA PTC Person Count (1 target, value=19,743,689)          1 1.974369e+07  1.974369e+07 1.974369e+07
        1                    Group 1: State SNAP Household Count (51 targets)         51 1.369100e+04  2.772372e+05 3.128640e+06
        2                                    Group 2: State Snap (51 targets)         51 5.670186e+07  1.293585e+09 1.237718e+10
        3                             Group 3: District Aca Ptc (436 targets)        436 5.420354e+06  2.937431e+07 3.880971e+0

In [9]:
for gid in [0, 2, 4]:
    if gid >= len(group_info):
        continue
    rows = tracer.get_group_rows(gid)
    print(f"\n--- {group_info[gid]} ---")
    print(rows.head(8).to_string(index=False))


--- Group 0: National ACA PTC Person Count (1 target, value=19,743,689) ---
 row_index     variable      variable_desc geographic_id  target_value  stratum_id domain_variable
         0 person_count People represented            US    19743689.0         491         aca_ptc

--- Group 2: State Snap (51 targets) ---
 row_index variable  variable_desc geographic_id  target_value  stratum_id domain_variable
        52     snap SNAP allotment             1  1733693703.0        9330            snap
        53     snap SNAP allotment            10   254854243.0        9337            snap
        54     snap SNAP allotment            11   319119173.0        9338            snap
        55     snap SNAP allotment            12  6604797454.0        9339            snap
        56     snap SNAP allotment            13  3281329856.0        9340            snap
        57     snap SNAP allotment            15   731331421.0        9341            snap
        58     snap SNAP allotment            

## 6. Tracing a household

One CPS household appears in every CD block (once per CD = 436 column positions). But most of those columns are zero — the household only contributes where its characteristics match the target constraints.
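
A household's column positions follow directly from the block layout. The sketch below uses a hypothetical helper that returns a plain list. The real `tracer.get_household_column_positions` is used in the cells below as a mapping keyed by CD, and its internals are not shown here.

```python
def household_column_positions(hh_index, n_households, n_cds):
    """Return the column index of one household in every CD block (zero-based)."""
    return [block * n_households + hh_index for block in range(n_cds)]


# Dimensions taken from the matrix overview; hh_index=42 is an arbitrary example.
positions = household_column_positions(hh_index=42, n_households=11_999, n_cds=436)

print(len(positions))   # one position per CD block
print(positions[:3])    # blocks 0, 1, 2
print(positions[-1])    # last block
```

Consecutive positions are always `n_households` apart, so the same household sits at the same offset in every block.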

In [10]:
snap_values = sim.calculate("snap", map_to="household").values
hh_ids = sim.calculate("household_id", map_to="household").values
positive_snap = hh_ids[snap_values > 0]
example_hh = int(positive_snap[0])
print(f"Example SNAP-receiving household: {example_hh}")
print(f"SNAP value: ${snap_values[hh_ids == example_hh][0]:,.0f}")

positions = tracer.get_household_column_positions(example_hh)
print(f"Column positions across CDs: {len(positions)}")

Example SNAP-receiving household: 654
SNAP value: $70
Column positions across CDs: 436


In [11]:
cd_activity = []
for cd_geoid, col_pos in positions.items():
    col_vec = X_sparse[:, col_pos]
    nnz = col_vec.nnz
    cd_activity.append({"cd_geoid": cd_geoid, "col_pos": col_pos, "nnz": nnz})

cd_df = pd.DataFrame(cd_activity)
n_active = (cd_df["nnz"] > 0).sum()
n_zero = (cd_df["nnz"] == 0).sum()
print(f"CDs with non-zero entries: {n_active}")
print(f"CDs with all-zero columns: {n_zero}")

top10 = cd_df.nlargest(10, "nnz")
print(f"\nTop 10 CDs by activity for household {example_hh}:")
for _, r in top10.iterrows():
    state_fips = int(r["cd_geoid"]) // 100
    abbr = STATE_CODES.get(state_fips, "??")
    print(f"  CD {r['cd_geoid']} ({abbr}): {r['nnz']} non-zero rows")

CDs with non-zero entries: 160
CDs with all-zero columns: 276

Top 10 CDs by activity for household 654:
  CD 1001 (DE): 3 non-zero rows
  CD 1101 (DC): 3 non-zero rows
  CD 1201 (FL): 3 non-zero rows
  CD 1202 (FL): 3 non-zero rows
  CD 1203 (FL): 3 non-zero rows
  CD 1204 (FL): 3 non-zero rows
  CD 1205 (FL): 3 non-zero rows
  CD 1206 (FL): 3 non-zero rows
  CD 1207 (FL): 3 non-zero rows
  CD 1208 (FL): 3 non-zero rows


## 7. Sparsity analysis

In [12]:
total_cells = X_sparse.shape[0] * X_sparse.shape[1]
density = X_sparse.nnz / total_cells
print(f"Total cells: {total_cells:,}")
print(f"Non-zero entries: {X_sparse.nnz:,}")
print(f"Density: {density:.6f}")
print(f"Sparsity: {1 - density:.4%}")

Total cells: 7,381,736,804
Non-zero entries: 2,199,033
Density: 0.000298
Sparsity: 99.9702%


In [13]:
nnz_per_row = np.diff(X_sparse.indptr)
print("Non-zeros per row:")
print(f"  min:    {nnz_per_row.min():,}")
print(f"  median: {int(np.median(nnz_per_row)):,}")
print(f"  mean:   {nnz_per_row.mean():,.0f}")
print(f"  max:    {nnz_per_row.max():,}")

from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
    _get_geo_level,
)

geo_levels = targets_df["geographic_id"].apply(_get_geo_level)
level_names = {0: "National", 1: "State", 2: "District"}
print("\nBy geographic level:")
for level in [0, 1, 2]:
    mask = (geo_levels == level).values
    if mask.any():
        vals = nnz_per_row[mask]
        print(
            f"  {level_names[level]:10s}: "
            f"n={mask.sum():>4d}, "
            f"median nnz={int(np.median(vals)):>7,}, "
            f"range=[{vals.min():,}, {vals.max():,}]"
        )

Non-zeros per row:
  min:    0
  median: 0
  mean:   1,558
  max:    77,116

By geographic level:
  National  : n=   1, median nnz=      0, range=[0, 0]
  State     : n= 102, median nnz= 10,423, range=[1,468, 77,116]
  District  : n=1308, median nnz=      0, range=[0, 1,988]
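
The per-row counts above come from `np.diff(X_sparse.indptr)`, which works because of how the CSR format stores a matrix. The toy example below reproduces that layout with plain lists. The 3x6 matrix and its values are made up for illustration.

```python
# CSR stores three arrays. For this toy 3x6 matrix:
#   row 0 has entries in columns 1 and 4
#   row 1 is empty
#   row 2 has entries in columns 0, 2 and 5
data = [1.0, 1.0, 70.0, 12.5, 3.0]   # non-zero values, row by row
indices = [1, 4, 0, 2, 5]            # column index of each value
indptr = [0, 2, 2, 5]                # row r occupies data[indptr[r]:indptr[r + 1]]

# Differences of consecutive indptr entries are the non-zero counts per row.
nnz_per_row = [indptr[r + 1] - indptr[r] for r in range(len(indptr) - 1)]
print(nnz_per_row)

# Slicing with indptr recovers each row's columns without scanning the matrix.
for r, count in enumerate(nnz_per_row):
    cols = indices[indptr[r]:indptr[r + 1]]
    print(f"row {r}: {count} non-zero(s) at columns {cols}")
```

An empty row, like the national row reported above, shows up as two equal consecutive `indptr` entries.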


In [14]:
n_hh = tracer.n_households
n_cds = tracer.n_geographies
cd_nnz = []
for cd_idx in range(n_cds):
    block = X_sparse[:, cd_idx * n_hh : (cd_idx + 1) * n_hh]
    cd_nnz.append({
        "cd_geoid": cds_to_calibrate[cd_idx],
        "nnz": block.nnz,
    })

cd_nnz_df = pd.DataFrame(cd_nnz)
print("Non-zeros per CD block:")
print(f"  min:    {cd_nnz_df['nnz'].min():,} (CD {cd_nnz_df.loc[cd_nnz_df['nnz'].idxmin(), 'cd_geoid']})")
print(f"  median: {int(cd_nnz_df['nnz'].median()):,}")
print(f"  max:    {cd_nnz_df['nnz'].max():,} (CD {cd_nnz_df.loc[cd_nnz_df['nnz'].idxmax(), 'cd_geoid']})")

Non-zeros per CD block:
  min:    4,326 (CD 2801)
  median: 4,884
  max:    5,964 (CD 1101)


## 8. Group exclusion

`GROUPS_TO_EXCLUDE` removes redundant or harmful constraints before training. For example, state-level SNAP household counts (Group 1) are redundant with the reconciled district-level SNAP household counts (Group 5 in the listing above, renumbered to Group 4 once Group 1 is dropped) and can confuse the optimizer. Group IDs depend on database contents, so always check `print_matrix_structure()` output first.
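
Because group IDs are reassigned after rows are dropped, it helps to see how the numbering shifts. The sketch below is a plain-Python illustration with a hypothetical helper. It assumes the surviving groups keep their relative order when IDs are reassigned, which is what the group listings before and after exclusion in this notebook suggest.

```python
def exclude_groups(group_ids, groups_to_exclude):
    """Drop rows in the excluded groups and renumber the remaining groups 0..k-1."""
    excluded = set(groups_to_exclude)
    keep_mask = [g not in excluded for g in group_ids]
    kept = [g for g, keep in zip(group_ids, keep_mask) if keep]
    # Assumption: surviving groups keep their relative order after renumbering.
    remap = {old: new for new, old in enumerate(sorted(set(kept)))}
    return keep_mask, [remap[g] for g in kept], remap


# Toy row-to-group assignment with six groups (0..5), two rows per group.
group_ids = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
keep_mask, new_ids, remap = exclude_groups(group_ids, groups_to_exclude=[1])

print(sum(keep_mask), "rows kept")
print(remap)
```

Every group after the excluded one moves down by one, so an ID noted before exclusion may point at a different group afterwards.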

In [15]:
GROUPS_TO_EXCLUDE = [1]

print(f"Before exclusion: {X_sparse.shape[0]} rows")

keep_mask = ~np.isin(tracer.target_groups, GROUPS_TO_EXCLUDE)
n_dropped = (~keep_mask).sum()
print(f"Excluding groups {GROUPS_TO_EXCLUDE}: dropping {n_dropped} rows")

X_filtered = X_sparse[keep_mask, :]
targets_filtered = targets_df[keep_mask].reset_index(drop=True)
print(f"After exclusion: {X_filtered.shape[0]} rows")

Before exclusion: 1411 rows
Excluding groups [1]: dropping 51 rows
After exclusion: 1360 rows


In [16]:
remaining_groups, remaining_info = create_target_groups(targets_filtered)
print(f"\nRemaining groups ({len(remaining_info)}):")
for info in remaining_info:
    print(f"  {info}")


=== Creating Target Groups ===

National targets:
  Group 0: ACA PTC Person Count = 19,743,689

State targets:
  Group 1: Snap (51 targets)

District targets:
  Group 2: Aca Ptc (436 targets)
  Group 3: ACA PTC Tax Unit Count (436 targets)
  Group 4: SNAP Household Count (436 targets)

Total groups created: 5

Remaining groups (5):
  Group 0: National ACA PTC Person Count (1 target, value=19,743,689)
  Group 1: State Snap (51 targets)
  Group 2: District Aca Ptc (436 targets)
  Group 3: District ACA PTC Tax Unit Count (436 targets)
  Group 4: District SNAP Household Count (436 targets)


## 9. Achievable targets

A target is achievable if at least one household can contribute to it (row sum > 0). Rows with sum = 0 are impossible constraints that the optimizer cannot satisfy.

In [17]:
row_sums = np.array(X_filtered.sum(axis=1)).flatten()
achievable_mask = row_sums > 0
n_achievable = achievable_mask.sum()
n_impossible = (~achievable_mask).sum()

print(f"Achievable targets: {n_achievable}")
print(f"Impossible targets: {n_impossible}")

if n_impossible > 0:
    impossible = targets_filtered[~achievable_mask]
    print("\nImpossible targets:")
    for _, r in impossible.iterrows():
        print(
            f"  {r.get('domain_variable', '?')}/{r['variable']} "
            f"(geo={r['geographic_id']})"
        )

Achievable targets: 487
Impossible targets: 873

Impossible targets:
  aca_ptc/person_count (geo=US)
  aca_ptc/aca_ptc (geo=1001)
  aca_ptc/aca_ptc (geo=101)
  aca_ptc/aca_ptc (geo=102)
  aca_ptc/aca_ptc (geo=103)
  aca_ptc/aca_ptc (geo=104)
  aca_ptc/aca_ptc (geo=105)
  aca_ptc/aca_ptc (geo=106)
  aca_ptc/aca_ptc (geo=107)
  aca_ptc/aca_ptc (geo=1101)
  aca_ptc/aca_ptc (geo=1201)
  aca_ptc/aca_ptc (geo=1202)
  aca_ptc/aca_ptc (geo=1203)
  aca_ptc/aca_ptc (geo=1204)
  aca_ptc/aca_ptc (geo=1205)
  aca_ptc/aca_ptc (geo=1206)
  aca_ptc/aca_ptc (geo=1207)
  aca_ptc/aca_ptc (geo=1208)
  aca_ptc/aca_ptc (geo=1209)
  aca_ptc/aca_ptc (geo=1210)
  aca_ptc/aca_ptc (geo=1211)
  aca_ptc/aca_ptc (geo=1212)
  aca_ptc/aca_ptc (geo=1213)
  aca_ptc/aca_ptc (geo=1214)
  aca_ptc/aca_ptc (geo=1215)
  aca_ptc/aca_ptc (geo=1216)
  aca_ptc/aca_ptc (geo=1217)
  aca_ptc/aca_ptc (geo=1218)
  aca_ptc/aca_ptc (geo=1219)
  aca_ptc/aca_ptc (geo=1220)
  aca_ptc/aca_ptc (geo=1221)
  aca_ptc/aca_ptc (geo=1222)
  aca_p

In [18]:
ratios = row_sums[achievable_mask] / targets_filtered.loc[achievable_mask, "value"].values
ratio_df = targets_filtered[achievable_mask].copy()
ratio_df["row_sum"] = row_sums[achievable_mask]
ratio_df["ratio"] = ratios

hardest = ratio_df.nsmallest(5, "ratio")
print("Hardest targets (lowest row_sum / target_value ratio):")
for _, r in hardest.iterrows():
    print(
        f"  {r.get('domain_variable', '?')}/{r['variable']} "
        f"(geo={r['geographic_id']}): "
        f"ratio={r['ratio']:.4f}, "
        f"row_sum={r['row_sum']:,.0f}, "
        f"target={r['value']:,.0f}"
    )

Hardest targets (lowest row_sum / target_value ratio):
  snap/household_count (geo=3615): ratio=0.0088, row_sum=1,535, target=173,591
  snap/household_count (geo=3613): ratio=0.0110, row_sum=1,535, target=139,162
  snap/household_count (geo=621): ratio=0.0124, row_sum=1,483, target=119,148
  snap/household_count (geo=3608): ratio=0.0129, row_sum=1,535, target=118,977
  snap/household_count (geo=634): ratio=0.0130, row_sum=1,483, target=113,916
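
These ratios can be read as a weight requirement. The sketch below assumes that household-count rows hold entries of 1, so `row_sum` is the number of contributing households at unit weight. That assumption is not verified in this notebook. The three rows are copied from the output above.

```python
# (geographic_id, row_sum, target) for three of the hardest targets listed above.
hardest = [
    ("3615", 1_535, 173_591),
    ("3613", 1_535, 139_162),
    ("621", 1_483, 119_148),
]

required = {}
for geo, row_sum, target in hardest:
    ratio = row_sum / target
    # Under the unit-entry assumption, contributing households need this
    # average weight for the weighted count to reach the target.
    required[geo] = target / row_sum
    print(f"geo={geo}: ratio={ratio:.4f}, required average weight={required[geo]:.1f}")
```

A low ratio therefore means each contributing household must carry a large weight in that district, which is why these targets are the hardest to hit.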


In [19]:
X_final = X_filtered[achievable_mask, :]
print(f"Final matrix shape: {X_final.shape}")
print(f"Final non-zero entries: {X_final.nnz:,}")
print(f"Final density: {X_final.nnz / (X_final.shape[0] * X_final.shape[1]):.6f}")
print("\nThis is what the optimizer receives.")

Final matrix shape: (487, 5231564)
Final non-zero entries: 1,466,022
Final density: 0.000575

This is what the optimizer receives.


## Summary

The calibration matrix pipeline has five steps:

1. **Build** — `SparseMatrixBuilder.build_matrix()` queries targets, applies hierarchical uprating, evaluates constraints, and assembles the sparse CSR matrix.
2. **Read** — `MatrixTracer` decodes rows (targets) and columns (household-CD pairs) so you can verify the matrix makes sense.
3. **Groups** — `create_target_groups()` partitions rows for balanced loss weighting. `GROUPS_TO_EXCLUDE` drops redundant constraints.
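4. **Sparsity** — Most of the matrix is zero. District-level targets confine non-zeros to single CD blocks; national targets are meant to span all blocks, although in this run the single national row has no non-zero entries and is removed in step 5.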
5. **Filter** — Remove impossible targets (row sum = 0) before handing to the optimizer.

When adding new domains or variables to the calibration, re-run this notebook to verify the new targets appear correctly and don't introduce impossible constraints.