# The Calibration Matrix

The calibration pipeline has three stages: (1) compute uprated target values, (2) assemble the sparse constraint matrix (this notebook), and (3) optimize weights (`unified_calibration.py`). This notebook is the diagnostic checkpoint between stages 2 and 3: understand your matrix before you optimize.

We build the full calibration matrix using `UnifiedMatrixBuilder` with clone-based geography from `assign_random_geography`, then inspect its structure: what rows and columns represent, how target groups partition the loss function, and where sparsity patterns emerge.

**Column layout:** `col = clone_idx * n_records + record_idx`

**Requirements:** `policy_data.db`, `block_cd_distributions.csv.gz`, and the stratified CPS h5 file in `STORAGE_FOLDER`.

## 1. Setup

In [1]:
import numpy as np
import pandas as pd
from policyengine_us import Microsimulation
from policyengine_us_data.storage import STORAGE_FOLDER
from policyengine_us_data.calibration.unified_matrix_builder import (
    UnifiedMatrixBuilder,
)
from policyengine_us_data.calibration.clone_and_assign import (
    assign_random_geography,
)
from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
    create_target_groups,
    drop_target_groups,
    get_geo_level,
    STATE_CODES,
)

db_path = STORAGE_FOLDER / "calibration" / "policy_data.db"
db_uri = f"sqlite:///{db_path}"
dataset_path = STORAGE_FOLDER / "stratified_extended_cps_2024.h5"


In [2]:
sim = Microsimulation(dataset=str(dataset_path))
n_records = sim.calculate("household_id", map_to="household").values.shape[0]

N_CLONES = 3  # keep small for diagnostics
geography = assign_random_geography(n_records, n_clones=N_CLONES, seed=42)

builder = UnifiedMatrixBuilder(
    db_uri=db_uri,
    time_period=2024,
    dataset_path=str(dataset_path),
)

targets_df, X_sparse, target_names = builder.build_matrix(
    geography,
    sim,
    target_filter={"domain_variables": ["aca_ptc", "snap"]},
    hierarchical_domains=["aca_ptc", "snap"],
)

n_total = n_records * N_CLONES
print(f"Records: {n_records:,}, Clones: {N_CLONES}, Total columns: {n_total:,}")
print(f"Matrix shape: {X_sparse.shape}")
print(f"Non-zero entries: {X_sparse.nnz:,}")

Records: 11,999, Clones: 3, Total columns: 35,997
Matrix shape: (1411, 35997)
Non-zero entries: 29,425


## 2. Matrix overview

In [3]:
print(f"Targets: {X_sparse.shape[0]}")
print(f"Columns: {X_sparse.shape[1]:,} ({N_CLONES} clones x {n_records:,} records)")
print(f"Non-zeros: {X_sparse.nnz:,}")
print(f"Density: {X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]):.6f}")

geo_levels = targets_df["geographic_id"].apply(get_geo_level)
level_names = {0: "National", 1: "State", 2: "District"}
for level in [0, 1, 2]:
    n = (geo_levels == level).sum()
    if n > 0:
        print(f"  {level_names[level]}: {n} targets")

Targets: 1411
Columns: 35,997 (3 clones x 11,999 records)
Non-zeros: 29,425
Density: 0.000579
  National: 1 targets
  State: 102 targets
  District: 1308 targets


## 3. Anatomy of a row

Each row is one calibration target — a known aggregate (dollar total, household count, person count) that the optimizer tries to match. The row vector's non-zero entries identify which cloned records can contribute to that target.
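
To make the role of a row concrete, here is a minimal sketch on made-up data. It assumes the usual calibration setup in which the estimate for each target is the weighted row sum `X @ w`; that relation, the toy matrix `X_toy`, and the weights `w` are illustrative assumptions, not values or code taken from `unified_calibration.py`. A small dense array stands in for the sparse matrix.

```python
import numpy as np

# Toy stand-in for X_sparse: 2 targets (rows) x 6 columns (2 clones x 3 records).
# All numbers are illustrative, not real calibration data.
X_toy = np.array(
    [
        [1.0, 0.0, 0.0, 0.0, 1.0, 0.0],      # a household-count style row
        [0.0, 250.0, 0.0, 0.0, 0.0, 400.0],  # a dollar-total style row
    ]
)
w = np.array([100.0, 200.0, 50.0, 80.0, 300.0, 10.0])  # hypothetical column weights

# Assumed relation: each row's weighted sum is the estimate for that target.
estimates = X_toy @ w

# Non-zero columns of row 0: the only columns whose weights affect its estimate.
row0_cols = np.flatnonzero(X_toy[0])

print(estimates.tolist())  # [400.0, 54000.0]
print(row0_cols.tolist())  # [0, 4]
```

Changing the weight of any column outside `row0_cols` leaves the first estimate untouched, which is why the non-zero pattern of a row matters as much as its values.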

In [4]:
mid_row = X_sparse.shape[0] // 2
row = targets_df.iloc[mid_row]
print(f"Row {mid_row}: {target_names[mid_row]}")
print(f"  variable: {row['variable']}")
print(f"  geographic_id: {row['geographic_id']}")
print(f"  geo_level: {row['geo_level']}")
print(f"  target value: {row['value']:,.0f}")
print(f"  uprating_factor: {row.get('uprating_factor', 'N/A')}")

Row 705: cd_3402/household_count/[snap>0]
  variable: household_count
  geographic_id: 3402
  geo_level: district
  target value: 48,652
  uprating_factor: 1.0


In [5]:
row_vec = X_sparse[mid_row, :]
nz_cols = row_vec.nonzero()[1]
print(f"Row {mid_row} has {len(nz_cols):,} non-zero columns")

if len(nz_cols) > 0:
    clone_indices = nz_cols // n_records
    record_indices = nz_cols % n_records
    print(f"  Spans {len(np.unique(clone_indices))} clone(s)")
    print(f"  Spans {len(np.unique(record_indices))} unique record(s)")

    first_col = nz_cols[0]
    print(f"\nFirst non-zero column ({first_col}):")
    print(f"  clone_idx: {first_col // n_records}")
    print(f"  record_idx: {first_col % n_records}")
    print(f"  state_fips: {geography.state_fips[first_col]}")
    print(f"  cd_geoid: {geography.cd_geoid[first_col]}")
    print(f"  value: {X_sparse[mid_row, first_col]:.2f}")

Row 705 has 10 non-zero columns
  Spans 3 clone(s)
  Spans 10 unique record(s)

First non-zero column (1212):
  clone_idx: 0
  record_idx: 1212
  state_fips: 34
  cd_geoid: 3402
  value: 1.00


## 4. Anatomy of a column

Each column represents one (record, clone) pair. Columns are organized in clone blocks: the first `n_records` columns belong to clone 0, the next to clone 1, and so on. The block formula is:

$$\text{column\_idx} = \text{clone\_idx} \times n_{\text{records}} + \text{record\_idx}$$
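
The block formula and its inverse can be sketched as a pair of vectorized helpers. The names `to_column` and `from_column` are hypothetical and not part of `policyengine_us_data`; `n_records` is the record count printed in Setup.

```python
import numpy as np


def to_column(clone_idx, record_idx, n_records):
    """Map (clone, record) pairs to flat column indices (hypothetical helper)."""
    return np.asarray(clone_idx) * n_records + np.asarray(record_idx)


def from_column(col_idx, n_records):
    """Invert the mapping: returns (clone_idx, record_idx)."""
    return np.divmod(np.asarray(col_idx), n_records)


n_records = 11_999  # record count reported in Setup

cols = to_column([0, 1, 2], [2, 42, 2], n_records)
clones, records = from_column(cols, n_records)

print(cols.tolist())     # [2, 12041, 24000]
print(clones.tolist())   # [0, 1, 2]
print(records.tolist())  # [2, 42, 2]
```

Because the layout is clone-major, integer division recovers the clone and the remainder recovers the record, so the round trip is exact for any valid pair.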

In [6]:
col_idx = 1 * n_records + 42  # clone 1, record 42
clone_idx = col_idx // n_records
record_idx = col_idx % n_records
print(f"Column {col_idx}:")
print(f"  clone_idx: {clone_idx}")
print(f"  record_idx: {record_idx}")
print(f"  state_fips: {geography.state_fips[col_idx]}")
print(f"  cd_geoid: {geography.cd_geoid[col_idx]}")
print(f"  block_geoid: {geography.block_geoid[col_idx]}")

col_vec = X_sparse[:, col_idx]
nz_rows = col_vec.nonzero()[0]
print(f"\nThis column has non-zero values in {len(nz_rows)} target rows")
if len(nz_rows) > 0:
    print("First 5 target rows:")
    for r in nz_rows[:5]:
        row = targets_df.iloc[r]
        print(
            f"  row {r}: {row['variable']} "
            f"(geo={row['geographic_id']}, "
            f"val={X_sparse[r, col_idx]:.2f})"
        )

Column 12041:
  clone_idx: 1
  record_idx: 42
  state_fips: 45
  cd_geoid: 4507
  block_geoid: 450410002022009

This column has non-zero values in 0 target rows


In [7]:
expected_col = clone_idx * n_records + record_idx  # rebuild from the decoded indices
assert col_idx == expected_col, f"{col_idx} != {expected_col}"
print(
    f"Block formula verified: "
    f"clone_idx=1 * n_records={n_records} + record_idx=42 = {expected_col}"
)

Block formula verified: clone_idx=1 * n_records=11999 + record_idx=42 = 12041


## 5. Target groups and loss weighting

Target groups partition the rows by (domain, variable, geographic level). Each group contributes equally to the loss function, so 436 district-level rows don't drown out 1 national row. The group IDs depend on the current database contents and may change if targets are added or removed.
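
One way to picture "each group contributes equally" is a loss that averages within groups first and across groups second. The sketch below assumes a squared relative error; the function `group_balanced_loss` and that loss form are illustrative assumptions, and the actual loss in `unified_calibration.py` may differ.

```python
import numpy as np


def group_balanced_loss(estimates, targets, group_ids):
    """Mean over groups of the within-group mean squared relative error.

    Assumed loss form for illustration only.
    """
    rel_err_sq = ((estimates - targets) / targets) ** 2
    groups = np.unique(group_ids)
    per_group = np.array([rel_err_sq[group_ids == g].mean() for g in groups])
    return per_group.mean()


# Toy data: group 0 has a single target, group 1 has four.
targets = np.array([100.0, 10.0, 10.0, 10.0, 10.0])
estimates = np.array([110.0, 10.0, 10.0, 10.0, 10.0])  # only the lone target is off
group_ids = np.array([0, 1, 1, 1, 1])

balanced = group_balanced_loss(estimates, targets, group_ids)
unbalanced = (((estimates - targets) / targets) ** 2).mean()

print(balanced, unbalanced)
```

In this toy case the single-target group carries half of the balanced loss (about 0.005) but only a fifth of the plain row average (about 0.002), which is the effect the grouping is meant to achieve.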

In [8]:
target_groups, group_info = create_target_groups(targets_df)

records = []
for gid, info in enumerate(group_info):
    mask = target_groups == gid
    vals = targets_df.loc[mask, "value"]
    records.append({
        "group_id": gid,
        "description": info,
        "n_targets": mask.sum(),
        "min_value": vals.min(),
        "median_value": vals.median(),
        "max_value": vals.max(),
    })

group_df = pd.DataFrame(records)
print(group_df.to_string(index=False))


=== Creating Target Groups ===

National targets:
  Group 0: ACA PTC Person Count = 19,743,689

State targets:
  Group 1: SNAP Household Count (51 targets)
  Group 2: Snap (51 targets)

District targets:
  Group 3: Aca Ptc (436 targets)
  Group 4: ACA PTC Tax Unit Count (436 targets)
  Group 5: SNAP Household Count (436 targets)

Total groups created: 6
 group_id                                                         description  n_targets    min_value  median_value    max_value
        0 Group 0: National ACA PTC Person Count (1 target, value=19,743,689)          1 1.974369e+07  1.974369e+07 1.974369e+07
        1                    Group 1: State SNAP Household Count (51 targets)         51 1.369100e+04  2.772372e+05 3.128640e+06
        2                                    Group 2: State Snap (51 targets)         51 5.670186e+07  1.293585e+09 1.237718e+10
        3                             Group 3: District Aca Ptc (436 targets)        436 5.420354e+06  2.937431e+07 3.880971e+0

In [9]:
for gid in [0, 2, 4]:
    if gid >= len(group_info):
        continue
    mask = target_groups == gid
    rows = targets_df[mask][["variable", "geographic_id", "value"]].head(8)
    print(f"\n--- {group_info[gid]} ---")
    print(rows.to_string(index=False))


--- Group 0: National ACA PTC Person Count (1 target, value=19,743,689) ---
    variable geographic_id      value
person_count            US 19743689.0

--- Group 2: State Snap (51 targets) ---
variable geographic_id        value
    snap             1 1733693703.0
    snap            10  254854243.0
    snap            11  319119173.0
    snap            12 6604797454.0
    snap            13 3281329856.0
    snap            15  731331421.0
    snap            16  281230283.0
    snap            17 4469341818.0

--- Group 4: District ACA PTC Tax Unit Count (436 targets) ---
      variable geographic_id        value
tax_unit_count          1000 25064.255490
tax_unit_count           101  9794.081624
tax_unit_count           102 11597.544977
tax_unit_count           103  9160.097959
tax_unit_count           104  9786.728220
tax_unit_count           105 18266.234326
tax_unit_count           106 25397.026846
tax_unit_count           107 11798.642968


## 6. Tracing a household across clones

Each CPS record appears once per clone, so it occupies `N_CLONES` column positions. Each clone assigns it a census block, and with it a CD and state, independently of the other clones, so the same record can contribute to different geographic targets depending on the clone.
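
A toy version of this idea, with invented district codes and a hand-built indicator matrix, looks like the following. It is not how `UnifiedMatrixBuilder` assembles the matrix internally; it only illustrates why one record lands in different district rows across clones.

```python
import numpy as np

n_records, n_clones = 3, 2

# Hypothetical district assignment per column, in clone-major order.
cd_geoid = np.array([101, 102, 101,   # clone 0
                     102, 102, 101])  # clone 1

# Per-record SNAP receipt flag, repeated once per clone.
has_snap = np.tile(np.array([1.0, 0.0, 1.0]), n_clones)

districts = np.array([101, 102])

# Row d, column c is 1 when column c sits in district d and the record receives SNAP.
X_toy = (cd_geoid[None, :] == districts[:, None]) * has_snap[None, :]

# Follow record 0 through both clones.
record_idx = 0
cols = [c * n_records + record_idx for c in range(n_clones)]
for col in cols:
    rows = np.flatnonzero(X_toy[:, col])
    print(col, int(cd_geoid[col]), districts[rows].tolist())
# 0 101 [101]
# 3 102 [102]
```

Record 0 feeds the district 101 row through clone 0 and the district 102 row through clone 1, mirroring the pattern shown in the cells below for a real household.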

In [10]:
snap_values = sim.calculate("snap", map_to="household").values
example_hh_idx = int(np.where(snap_values > 0)[0][0])
print(f"Example SNAP-receiving household: record index {example_hh_idx}")
print(f"SNAP value: ${snap_values[example_hh_idx]:,.0f}")

clone_cols = [c * n_records + example_hh_idx for c in range(N_CLONES)]
print(f"\nColumn positions across {N_CLONES} clones:")
for col in clone_cols:
    state = geography.state_fips[col]
    cd = geography.cd_geoid[col]
    block = geography.block_geoid[col]
    col_vec = X_sparse[:, col]
    nnz = col_vec.nnz
    abbr = STATE_CODES.get(state, "??")
    print(f"  col {col}: {abbr} (state={state}, CD={cd}) — {nnz} non-zero rows")

Example SNAP-receiving household: record index 2
SNAP value: $679

Column positions across 3 clones:
  col 2: TX (state=48, CD=4814) — 4 non-zero rows
  col 12001: IN (state=18, CD=1804) — 3 non-zero rows
  col 24000: PA (state=42, CD=4212) — 3 non-zero rows


In [11]:
for col in clone_cols:
    col_vec = X_sparse[:, col]
    nz_rows = col_vec.nonzero()[0]
    if len(nz_rows) == 0:
        continue
    clone_i = col // n_records
    print(f"\nClone {clone_i} (col {col}, CD {geography.cd_geoid[col]}):")
    for r in nz_rows[:5]:
        row = targets_df.iloc[r]
        print(
            f"  {row['variable']} (geo={row['geographic_id']}): "
            f"{X_sparse[r, col]:.2f}"
        )
    if len(nz_rows) > 5:
        print(f"  ... and {len(nz_rows) - 5} more")


Clone 0 (col 2, CD 4814):
  person_count (geo=US): 3.00
  household_count (geo=48): 1.00
  snap (geo=48): 678.60
  household_count (geo=4814): 1.00

Clone 1 (col 12001, CD 1804):
  household_count (geo=18): 1.00
  snap (geo=18): 678.60
  household_count (geo=1804): 1.00

Clone 2 (col 24000, CD 4212):
  household_count (geo=42): 1.00
  snap (geo=42): 678.60
  household_count (geo=4212): 1.00


## 7. Sparsity analysis

In [12]:
total_cells = X_sparse.shape[0] * X_sparse.shape[1]
density = X_sparse.nnz / total_cells
print(f"Total cells: {total_cells:,}")
print(f"Non-zero entries: {X_sparse.nnz:,}")
print(f"Density: {density:.6f}")
print(f"Sparsity: {1 - density:.4%}")

Total cells: 50,791,767
Non-zero entries: 29,425
Density: 0.000579
Sparsity: 99.9421%


In [13]:
nnz_per_row = np.diff(X_sparse.indptr)
print(f"Non-zeros per row:")
print(f"  min:    {nnz_per_row.min():,}")
print(f"  median: {int(np.median(nnz_per_row)):,}")
print(f"  mean:   {nnz_per_row.mean():,.0f}")
print(f"  max:    {nnz_per_row.max():,}")

geo_levels = targets_df["geographic_id"].apply(get_geo_level)
level_names = {0: "National", 1: "State", 2: "District"}
print("\nBy geographic level:")
for level in [0, 1, 2]:
    mask = (geo_levels == level).values
    if mask.any():
        vals = nnz_per_row[mask]
        print(
            f"  {level_names[level]:10s}: "
            f"n={mask.sum():>4d}, "
            f"median nnz={int(np.median(vals)):>7,}, "
            f"range=[{vals.min():,}, {vals.max():,}]"
        )

Non-zeros per row:
  min:    0
  median: 10
  mean:   21
  max:    3,408

By geographic level:
  National  : n=   1, median nnz=  3,408, range=[3,408, 3,408]
  State     : n= 102, median nnz=     80, range=[10, 694]
  District  : n=1308, median nnz=      9, range=[0, 27]


In [14]:
clone_nnz = []
for ci in range(N_CLONES):
    block = X_sparse[:, ci * n_records : (ci + 1) * n_records]
    n_states = len(np.unique(geography.state_fips[ci * n_records : (ci + 1) * n_records]))
    clone_nnz.append({
        "clone": ci,
        "nnz": block.nnz,
        "unique_states": n_states,
    })

clone_df = pd.DataFrame(clone_nnz)
print("Non-zeros per clone block:")
print(clone_df.to_string(index=False))

Non-zeros per clone block:
 clone  nnz  unique_states
     0 9775             51
     1 9810             51
     2 9840             51


## 8. Dropping target groups

Some target groups are redundant after hierarchical uprating. For example, state-level SNAP Household Count (Group 1) is redundant with district-level SNAP Household Count (Group 5) — the district targets were already reconciled to sum to the state totals.

Specify drops as `(variable_label, geo_level)` pairs. The labels come from the group descriptions above; the geo level is "National", "State", or "District".
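
Conceptually, dropping groups is a boolean row mask over the matrix. The helper below is a hypothetical sketch of that idea on toy data; it is not the implementation of `drop_target_groups`, and its `labels` argument (one `(variable_label, geo_level)` pair per group ID) is an assumed simplification of `group_info`.

```python
import numpy as np


def drop_rows_by_group(X, group_ids, labels, to_drop):
    """Keep rows whose group's (variable_label, geo_level) pair is not in to_drop.

    labels[g] is the pair describing group g (assumed structure).
    Returns the filtered matrix and the boolean keep-mask.
    """
    drop_ids = [gid for gid, key in enumerate(labels) if key in to_drop]
    keep = ~np.isin(group_ids, drop_ids)
    return X[keep], keep


labels = [
    ("ACA PTC Person Count", "National"),
    ("SNAP Household Count", "State"),
    ("SNAP Household Count", "District"),
]
group_ids = np.array([0, 1, 1, 2, 2, 2])        # group ID of each row
X_toy = np.arange(12, dtype=float).reshape(6, 2)  # 6 target rows, 2 columns

X_kept, keep = drop_rows_by_group(
    X_toy, group_ids, labels, {("SNAP Household Count", "State")}
)

print(keep.tolist())  # [True, False, False, True, True, True]
print(X_kept.shape)   # (4, 2)
```

Matching on the label and geo level rather than on the numeric group ID keeps the drop list stable when group IDs shift as targets are added or removed.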

In [15]:
GROUPS_TO_DROP = [
    ("SNAP Household Count", "State"),
]

targets_filtered, X_filtered = drop_target_groups(
    targets_df, X_sparse, target_groups, group_info, GROUPS_TO_DROP
)

Matrix before: 1411 rows
  DROPPING Group 1: State SNAP Household Count (51 targets) (51 rows)

  KEEPING  Group 0: National ACA PTC Person Count (1 target, value=19,743,689) (1 rows)
  KEEPING  Group 2: State Snap (51 targets) (51 rows)
  KEEPING  Group 3: District Aca Ptc (436 targets) (436 rows)
  KEEPING  Group 4: District ACA PTC Tax Unit Count (436 targets) (436 rows)
  KEEPING  Group 5: District SNAP Household Count (436 targets) (436 rows)

Matrix after: 1360 rows


In [16]:
remaining_groups, remaining_info = create_target_groups(targets_filtered)
print(f"\nRemaining groups ({len(remaining_info)}):")
for info in remaining_info:
    print(f"  {info}")


=== Creating Target Groups ===

National targets:
  Group 0: ACA PTC Person Count = 19,743,689

State targets:
  Group 1: Snap (51 targets)

District targets:
  Group 2: Aca Ptc (436 targets)
  Group 3: ACA PTC Tax Unit Count (436 targets)
  Group 4: SNAP Household Count (436 targets)

Total groups created: 5

Remaining groups (5):
  Group 0: National ACA PTC Person Count (1 target, value=19,743,689)
  Group 1: State Snap (51 targets)
  Group 2: District Aca Ptc (436 targets)
  Group 3: District ACA PTC Tax Unit Count (436 targets)
  Group 4: District SNAP Household Count (436 targets)


## 9. Achievable targets

A target is achievable if at least one household can contribute to it (row sum > 0). A row whose sum is 0 has no contributing columns, so no choice of weights can move its estimate away from zero; such targets are impossible constraints for the optimizer.
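
The reasoning can be checked on a small example. The sketch assumes non-negative matrix entries and that estimates are computed as `X @ w`; the helper name `achievable_rows` and the toy values are hypothetical.

```python
import numpy as np


def achievable_rows(X):
    """Boolean mask of rows with a positive sum.

    Works for dense arrays and for SciPy sparse matrices, whose
    row sums come back as a 2-D matrix that needs flattening.
    """
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    return row_sums > 0


X_toy = np.array(
    [
        [1.0, 0.0, 2.0],
        [0.0, 0.0, 0.0],  # no column contributes to this target
        [0.0, 3.0, 0.0],
    ]
)
mask = achievable_rows(X_toy)
X_final = X_toy[mask]

# Whatever the weights are, the all-zero row always estimates to 0.
w = np.array([5.0, 7.0, 9.0])
print(mask.tolist())         # [True, False, True]
print((X_toy @ w).tolist())  # [23.0, 0.0, 21.0]
print(X_final.shape)         # (2, 3)
```

Keeping the mask around, rather than only the filtered matrix, makes it easy to apply the same filter to the targets table so rows and targets stay aligned.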

In [17]:
row_sums = np.array(X_filtered.sum(axis=1)).flatten()
achievable_mask = row_sums > 0
n_achievable = achievable_mask.sum()
n_impossible = (~achievable_mask).sum()

print(f"Achievable targets: {n_achievable}")
print(f"Impossible targets: {n_impossible}")

if n_impossible > 0:
    impossible = targets_filtered[~achievable_mask]
    by_var = (
        impossible.groupby(["domain_variable", "variable"])
        .agg(count=("value", "size"))
        .reset_index()
        .sort_values("count", ascending=False)
    )
    print("\nImpossible targets by (domain, variable):")
    for _, r in by_var.iterrows():
        print(f"  {r['domain_variable']}/{r['variable']}: {r['count']}")

Achievable targets: 1358
Impossible targets: 2

Impossible targets by (domain, variable):
  aca_ptc/aca_ptc: 1
  aca_ptc/tax_unit_count: 1


In [18]:
ratios = row_sums[achievable_mask] / targets_filtered.loc[achievable_mask, "value"].values
ratio_df = targets_filtered[achievable_mask].copy()
ratio_df["row_sum"] = row_sums[achievable_mask]
ratio_df["ratio"] = ratios

hardest = ratio_df.nsmallest(5, "ratio")
print("Hardest targets (lowest row_sum / target_value ratio):")
for _, r in hardest.iterrows():
    print(
        f"  {r.get('domain_variable', '?')}/{r['variable']} "
        f"(geo={r['geographic_id']}): "
        f"ratio={r['ratio']:.4f}, "
        f"row_sum={r['row_sum']:,.0f}, "
        f"target={r['value']:,.0f}"
    )

Hardest targets (lowest row_sum / target_value ratio):
  aca_ptc/aca_ptc (geo=3612): ratio=0.0000, row_sum=5,439, target=376,216,522
  aca_ptc/aca_ptc (geo=2508): ratio=0.0000, row_sum=2,024, target=124,980,814
  aca_ptc/tax_unit_count (geo=2508): ratio=0.0000, row_sum=1, target=51,937
  aca_ptc/tax_unit_count (geo=3612): ratio=0.0000, row_sum=2, target=73,561
  aca_ptc/tax_unit_count (geo=1198): ratio=0.0000, row_sum=1, target=30,419


In [19]:
X_final = X_filtered[achievable_mask, :]
print(f"Final matrix shape: {X_final.shape}")
print(f"Final non-zero entries: {X_final.nnz:,}")
print(f"Final density: {X_final.nnz / (X_final.shape[0] * X_final.shape[1]):.6f}")
print("\nThis is what the optimizer receives.")

Final matrix shape: (1358, 35997)
Final non-zero entries: 23,018
Final density: 0.000471

This is what the optimizer receives.


## Summary

The calibration matrix pipeline has five steps:

1. **Clone + assign**: `assign_random_geography()` creates N clones of each CPS record, each with a random census block (and derived CD/state).
2. **Build**: `UnifiedMatrixBuilder.build_matrix()` queries targets, applies hierarchical uprating, simulates each clone with its assigned geography, and assembles the sparse CSR matrix.
3. **Groups**: `create_target_groups()` partitions rows for balanced loss weighting. `drop_target_groups()` removes the redundant groups listed in `GROUPS_TO_DROP`.
4. **Sparsity**: Most of the matrix is zero. District-level targets confine non-zeros to clones assigned to that district; national targets span all clones.
5. **Filter**: Remove impossible targets (row sum = 0) before handing the matrix to the optimizer.

When adding new domains or variables to the calibration, re-run this notebook to verify the new targets appear correctly and don't introduce impossible constraints.