# Local Area Calibration Setup

This notebook demonstrates the sparse matrix construction for local area (congressional district) calibration. It uses a subset of CDs from NC, HI, MT, and AK for manageable runtime.

## Section 1: Setup & Configuration

In [1]:
from sqlalchemy import create_engine, text
import pandas as pd
import numpy as np

from policyengine_us import Microsimulation
from policyengine_us_data.storage import STORAGE_FOLDER
from policyengine_us_data.datasets.cps.local_area_calibration.sparse_matrix_builder import (
    SparseMatrixBuilder,
)
from policyengine_us_data.datasets.cps.local_area_calibration.matrix_tracer import (
    MatrixTracer,
)
from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
    get_calculated_variables,
    create_target_groups,
)

In [2]:
db_path = STORAGE_FOLDER / "calibration" / "policy_data.db"
db_uri = f"sqlite:///{db_path}"
dataset_path = str(STORAGE_FOLDER / "stratified_extended_cps_2024.h5")

engine = create_engine(db_uri)

## Section 2: Select Test Congressional Districts

We use CDs from 4 states for testing:
- **NC (37)**: 14 CDs (3701-3714) - provides same-state different-CD test cases
- **HI (15)**: 2 CDs (1501-1502)
- **MT (30)**: 2 CDs (3001-3002)
- **AK (2)**: 1 CD (200)

In [3]:
query = """
SELECT DISTINCT sc.value as cd_geoid
FROM stratum_constraints sc
WHERE sc.constraint_variable = 'congressional_district_geoid'
  AND (
    sc.value LIKE '37__'
    OR sc.value LIKE '150_'
    OR sc.value LIKE '300_'
    OR sc.value = '200' OR sc.value = '201'
  )
ORDER BY sc.value
"""

with engine.connect() as conn:
    result = conn.execute(text(query)).fetchall()
    test_cds = [row[0] for row in result]

print(f"Testing with {len(test_cds)} congressional districts:")
print(f"  NC (37): {[cd for cd in test_cds if cd.startswith('37')]}")
print(f"  HI (15): {[cd for cd in test_cds if cd.startswith('15')]}")
print(f"  MT (30): {[cd for cd in test_cds if cd.startswith('30')]}")
print(f"  AK (2):  {[cd for cd in test_cds if cd.startswith('20')]}")

Testing with 19 congressional districts:
  NC (37): ['3701', '3702', '3703', '3704', '3705', '3706', '3707', '3708', '3709', '3710', '3711', '3712', '3713', '3714']
  HI (15): ['1501', '1502']
  MT (30): ['3001', '3002']
  AK (2):  ['201']


## Section 3: Build the Sparse Matrix

The sparse matrix `X_sparse` has:
- **Rows**: Calibration targets (e.g., SNAP totals by geography)
- **Columns**: (household × CD) pairs - each household appears once per CD

We filter to SNAP targets using the `domain_variables` filter for this demonstration.

In [4]:
sim = Microsimulation(dataset=dataset_path)

builder = SparseMatrixBuilder(
    db_uri,
    time_period=2024,
    cds_to_calibrate=test_cds,
    dataset_path=dataset_path,
)

targets_df, X_sparse, household_id_mapping = builder.build_matrix(
    sim, target_filter={"domain_variables": ["snap"], "variables": ["snap"]}
)

print(f"X_sparse shape: {X_sparse.shape}")
print(f"  Rows (targets): {X_sparse.shape[0]}")
print(f"  Columns (household × CD pairs): {X_sparse.shape[1]}")
print(f"  Non-zero entries: {X_sparse.nnz:,}")
print(f"  Sparsity: {1 - X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]):.2%}")

X_sparse shape: (539, 227981)
  Rows (targets): 539
  Columns (household × CD pairs): 227981
  Non-zero entries: 141,536
  Sparsity: 99.88%


## Section 4: Understanding the Matrix Structure with MatrixTracer

The `MatrixTracer` helps navigate the sparse matrix by providing lookups between:
- Column indices ↔ (household_id, CD) pairs
- Row indices ↔ target definitions

In [5]:
tracer = MatrixTracer(
    targets_df, X_sparse, household_id_mapping, test_cds, sim
)

tracer.print_matrix_structure()


MATRIX STRUCTURE BREAKDOWN

Matrix dimensions: 539 rows x 227981 columns
  Rows = 539 targets
  Columns = 11999 households x 19 CDs
           = 11,999 x 19 = 227,981

--------------------------------------------------------------------------------
COLUMN STRUCTURE (Households stacked by CD)
--------------------------------------------------------------------------------

Showing first and last 5 CDs of 19 total:

First 5 CDs:
cd_geoid  start_col  end_col  n_households
    1501          0    11998         11999
    1502      11999    23997         11999
     201      23998    35996         11999
    3001      35997    47995         11999
    3002      47996    59994         11999

Last 5 CDs:
cd_geoid  start_col  end_col  n_households
    3710     167986   179984         11999
    3711     179985   191983         11999
    3712     191984   203982         11999
    3713     203983   215981         11999
    3714     215982   227980         11999

--------------------------------------

In [6]:
target_groups, group_info = create_target_groups(targets_df)


=== Creating Target Groups ===

National targets:
  Group 0: Snap = 93,730,290,000

State targets:
  Group 1: SNAP Household Count (51 targets)
  Group 2: Snap (51 targets)

District targets:
  Group 3: SNAP Household Count (436 targets)

Total groups created: 4


In [7]:
target_group = tracer.get_group_rows(2)
row_loc = target_group.iloc[28]['row_index']  # Manually found the index value 28
row_info = tracer.get_row_info(row_loc)
var = row_info['variable']
var_desc = row_info['variable_desc']
target_geo_id = int(row_info['geographic_id'])

print("Row info for North Carolina's SNAP benefit amount:")
print(row_info)

Row info for North Carolina's SNAP benefit amount:
{'row_index': 80, 'variable': 'snap', 'variable_desc': 'SNAP allotment', 'geographic_id': '37', 'target_value': 2934626410.0, 'stratum_id': 9363, 'domain_variable': 'snap'}


In [8]:
hh_snap_df = pd.DataFrame(sim.calculate_dataframe([
    "household_id", "household_weight", "state_fips", "snap"])                                        
)
print(hh_snap_df)

       household_id  household_weight  state_fips    snap
0                26       1205.310059          23     0.0
1                34       2170.419922          23     0.0
2                38        587.510010          23     0.0
3                46       1010.840027          23     0.0
4                71        957.460022          23     0.0
...             ...               ...         ...     ...
11994        177822          0.000000          15     0.0
11995        177829          0.000000          15     0.0
11996        177831          0.000000          15     0.0
11997        177860          0.000000          15  6294.0
11998        177861          0.000000          15     0.0

[11999 rows x 4 columns]


If we were to include `congressional_district_geoid` above, they would all be zeros. It's not until we do the calibration, i.e., come back with a vector of weights `w` to multiply `X_sparse` with, that we will set `congressional_district_geoid`.

However, every household is already a donor to every contressional district. You can get the column positions for every household (remember targets are on the rows, donor households on the columns) by running tracer's get_household_column_positions with the *original* `household_id`.

In [9]:
# Reverse lookup: get all column positions for a specific household
hh_id = hh_snap_df.loc[hh_snap_df.snap > 0].household_id.values[0]
print(hh_snap_df.loc[hh_snap_df.household_id == hh_id])

print("\nEvaluating the tracer.get_household_column_positions dictionary:\n")
positions = tracer.get_household_column_positions(hh_id)
print(positions)

    household_id  household_weight  state_fips       snap
23           654       1550.660034          23  70.080002

Evaluating the tracer.get_household_column_positions dictionary:

{'1501': 23, '1502': 12022, '201': 24021, '3001': 36020, '3002': 48019, '3701': 60018, '3702': 72017, '3703': 84016, '3704': 96015, '3705': 108014, '3706': 120013, '3707': 132012, '3708': 144011, '3709': 156010, '3710': 168009, '3711': 180008, '3712': 192007, '3713': 204006, '3714': 216005}


## Section 5: Understanding the cells of the X_Sparse matrix and Target vector

In [10]:
print("Remember, this is a North Carolina target:\n")
print(targets_df.iloc[row_loc])

print("\nNC State target. Household donated to NC's 2nd district, 2024 SNAP dollars:")
print(X_sparse[row_loc, positions['3702']])  # Household donated to NC's 2nd district

print("\nSame target, same household, donated to AK's at Large district, 2024 SNAP dollars:")
print(X_sparse[row_loc, positions['201']])  # Household donated to AK's at Large District

Remember, this is a North Carolina target:

target_id                  8942
stratum_id                 9363
variable                   snap
value              2934626410.0
period                     2024
geo_level                 state
geographic_id                37
domain_variable            snap
original_value     2934626410.0
uprating_factor             1.0
Name: 80, dtype: object

NC State target. Household donated to NC's 2nd district, 2024 SNAP dollars:
70.08

Same target, same household, donated to AK's at Large district, 2024 SNAP dollars:
0.0


Key property: For state-level targets, only CDs in that state should have non-zero values.

Example: A NC state SNAP target should have zeros for HI, MT, and AK CD columns.

So let's see that same household's value for the Alaska state target:

In [11]:
target_group = tracer.get_group_rows(2)
new_row_loc = target_group.iloc[10]['row_index']   # Manually found the index value 10
row_info = tracer.get_row_info(row_loc)
var = row_info['variable']
var_desc = row_info['variable_desc']
target_geo_id = int(row_info['geographic_id'])

print("Row info for Alaska's SNAP benefit amount:")
print(row_info)

Row info for Alaska's SNAP benefit amount:
{'row_index': 80, 'variable': 'snap', 'variable_desc': 'SNAP allotment', 'geographic_id': '37', 'target_value': 2934626410.0, 'stratum_id': 9363, 'domain_variable': 'snap'}


In [12]:
print("\nHousehold donated to AK's 1st district, 2024 SNAP dollars:")
print(X_sparse[new_row_loc, positions['201']])  # Household donated to AK's at Large District


Household donated to AK's 1st district, 2024 SNAP dollars:
0.0


## Section 6: Simulating State-Swapped Calculations

When a household is "transplanted" to a different state, state-dependent benefits like SNAP are recalculated under the destination state's rules.

In [13]:
def create_state_simulation(state_fips):
    """Create a simulation with all households assigned to a specific state."""
    s = Microsimulation(dataset=dataset_path)
    s.set_input(
        "state_fips", 2024, np.full(hh_snap_df.shape[0], state_fips, dtype=np.int32)
    )
    for var in get_calculated_variables(s):
        s.delete_arrays(var)
    return s

# Compare SNAP for first 5 households under NC vs AK rules
nc_sim = create_state_simulation(37)  # NC
ak_sim = create_state_simulation(2)   # AK

nc_snap = nc_sim.calculate("snap", map_to="household").values[:5]
ak_snap = ak_sim.calculate("snap", map_to="household").values[:5]

print("SNAP values for first 5 households under different state rules:")
print(f"  NC rules: {nc_snap}")
print(f"  AK rules: {ak_snap}")
print(f"  Difference: {ak_snap - nc_snap}")

SNAP values for first 5 households under different state rules:
  NC rules: [0. 0. 0. 0. 0.]
  AK rules: [0. 0. 0. 0. 0.]
  Difference: [0. 0. 0. 0. 0.]


## Section 7: Creating the h5 files

  `w` (required)
  - The calibrated weight vector from L0 calibration
  - Shape: (n_cds * n_households,) — a flattened matrix where each CD has weights for all households
  - Gets reshaped to (n_cds, n_households) internally

  `cds_to_calibrate` (required)
  - The ordered list of CD GEOIDs used when building w
  - Serves two purposes:
    a. Tells us how to reshape w (via its length)
    b. Provides the index mapping so we can extract the right rows for any cd_subset

  `cd_subset` (optional, default None)
  - Which CDs to actually include in the output dataset
  - Must be a subset of cds_to_calibrate
  - If None, all CDs are included
  - Use cases: build a single-state file, a single-CD file for testing, etc.

  `output_path` (optional but effectively required — raises if None)
  - Where to save the resulting .h5 file
  - Creates parent directories if needed

  `dataset_path` (optional, default None)
  - Path to the base .h5 dataset that was used during calibration
  - This is the "template" — household structure, demographics, etc.
  - The function loads this, reweights households per CD, updates geography, and stacks

In [14]:
import os

from policyengine_us_data.datasets.cps.local_area_calibration.stacked_dataset_builder import create_sparse_cd_stacked_dataset

# Initialize the weights w for demonstration
# We can't allow too many w cells to be positive for a given state, or the reindexing will fail
w = np.random.binomial(n=1, p=0.01, size=X_sparse.shape[1]).astype(float)

# We'll make sure our earlier household is included:
household_ids = sim.calculate("household_id", map_to="household").values
hh_idx = np.where(household_ids == hh_id)[0][0]

cd_idx = test_cds.index('3701')
flat_idx = cd_idx * len(household_ids) + hh_idx
w[flat_idx] = 2.5

cd_idx = test_cds.index('201')
flat_idx = cd_idx * len(household_ids) + hh_idx
w[flat_idx] = 3.5

# Create a folder for the outputs of the function that is to come.
new_folder_name = "calibration_output"
os.makedirs(new_folder_name, exist_ok=True)
output_path = os.path.join(new_folder_name, "results.h5")

In [15]:
cd_subset = ['3701', '201']
create_sparse_cd_stacked_dataset(
    w,
    test_cds, # cds_to_calibrate - Defines the structure of the weight vector w
    cd_subset=cd_subset, #  cd_subset - Specifies which CDs to actually include in the output dataset (optional, defaults to all).
    dataset_path=dataset_path,
    output_path=output_path,
)

Processing subset of 2 CDs: 3701, 201...
Output path: calibration_output/results.h5

Original dataset has 11,999 households
Extracted weights for 2 CDs from full weight matrix
Total active household-CD pairs: 230
Total weight in W matrix: 234
Processing CD 201 (2/2)...

Combining 2 CD DataFrames...
Total households across all CDs: 230
Combined DataFrame shape: (578, 222)

Reindexing all entity IDs using 25k ranges per CD...
  Created 230 unique households across 2 CDs
  Reindexing persons using 25k ranges...
  Reindexing tax units...
  Reindexing SPM units...
  Reindexing marital units...
  Reindexing families...
  Final persons: 578
  Final households: 230
  Final tax units: 314
  Final SPM units: 236
  Final marital units: 461
  Final families: 249

Weights in combined_df AFTER reindexing:
  HH weight sum: 0.00M
  Person weight sum: 0.00M
  Ratio: 1.00

Overflow check:
  Max person ID after reindexing: 5,125,285
  Max person ID × 100: 512,528,500
  int32 max: 2,147,483,647
  ✓ No ove

'calibration_output/results.h5'

In [16]:
%ls calibration_output

[34mmappings[m[m/   results.h5


Note that there is a *mappings* directory that has also been created by create_sparse_cd_stacked_dataset. This contains the CSV file that links the original households to the donor households. The reason it's a seperate folder is to keep the h5 files and the mapping CSVs organized when this function is run for all districts or states.

In [17]:
%ls calibration_output/mappings

results_household_mapping.csv


In [18]:
sim_after = Microsimulation(dataset="./calibration_output/results.h5")

hh_after_df =  pd.DataFrame(sim_after.calculate_dataframe([
    "household_id", "congressional_district_geoid", "county", "household_weight", "state_fips", "snap"])                                        
)
print(hh_after_df)

     household_id  congressional_district_geoid  \
0           50000                           201   
1           50001                           201   
2           50002                           201   
3           50003                           201   
4           50004                           201   
..            ...                           ...   
225        125113                          3701   
226        125114                          3701   
227        125115                          3701   
228        125116                          3701   
229        125117                          3701   

                              county  household_weight  state_fips  \
0             NORTH_SLOPE_BOROUGH_AK               3.5           2   
1      ALEUTIANS_WEST_CENSUS_AREA_AK               1.0           2   
2    FAIRBANKS_NORTH_STAR_BOROUGH_AK               1.0           2   
3         KENAI_PENINSULA_BOROUGH_AK               1.0           2   
4       HOONAH_ANGOON_CENSUS_AREA_AK 

We can see one of the correct instances above but let's confirm that this new household id does in fact link back to the original in both cases.

In [19]:
mapping_df = pd.read_csv("calibration_output/mappings/results_household_mapping.csv")
mapping_df.loc[mapping_df.original_household_id == hh_id]

Unnamed: 0,new_household_id,original_household_id,congressional_district,state_fips
0,50000,654,201,2
1,125000,654,3701,37


In [20]:
new_hh_ids = mapping_df.loc[mapping_df.original_household_id == hh_id].new_household_id
hh_after_df.loc[hh_after_df.household_id.isin(new_hh_ids)]

Unnamed: 0,household_id,congressional_district_geoid,county,household_weight,state_fips,snap
0,50000,201,NORTH_SLOPE_BOROUGH_AK,3.5,2,0.0
112,125000,3701,HALIFAX_COUNTY_NC,2.5,37,70.080002


And we can see that the snap numbers still match their values from the different US state systems. However note that due to the use of policyengine-core's random function in a component of snap_gross_income, for some households, the value in the final simulation will not match the one used in creating the X matrix (`X_sparse` here). This is outlined in [Issue 412](https://github.com/PolicyEngine/policyengine-core/issues/412).

In [21]:
%rm -r calibration_output