# Hierarchical Uprating: HIFs and Uprating Factors

Calibration targets in `policy_data.db` come from different sources, at different geographic levels, and from different time periods. Before we can use them, two adjustments are needed:

1. **Uprating factor (UF)**: Bridges the time gap between the source data's period and the calibration year. For most domains, dollar-valued targets use CPI and count targets use population growth. For **ACA PTC**, we use real state-level enrollment and average APTC changes from CMS/KFF data, giving each state its own UF.

2. **Hierarchy inconsistency factor (HIF)**: Corrects for the fact that district-level totals from one source may not sum to the state-level total from another. This is a pure base-year correction across geographic levels, with no time dimension.

These two factors are **separable by linearity**. For each congressional district row:

$$\text{value} = \text{original\_value} \times \text{HIF} \times \text{UF}$$

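where $\text{HIF} = S_{\text{base}} \;/\; \sum_i CD_{i,\text{base}}$, with $S_{\text{base}}$ the state's source-period total and $CD_{i,\text{base}}$ the source-period value of district $i$. Because HIF and UF are constant within a state, the sum constraint holds:

$$\sum_i (CD_{i,\text{base}} \times \text{HIF} \times \text{UF}) = \text{UF} \times S_{\text{base}} = S_{\text{uprated}}$$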
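
As a quick numerical check of the identity above, here is a minimal sketch with invented numbers (none of them come from `policy_data.db`). It computes the HIF from a toy state total and three toy district values, applies a single state-level UF, and confirms that the districts sum to the uprated state total while keeping their relative shares.

```python
import numpy as np

# Toy numbers for illustration only; not taken from policy_data.db.
cd_base = np.array([100.0, 250.0, 150.0])  # district values from one source
s_base = 600.0                             # state total from another source
uf = 1.10                                  # state uprating factor

hif = s_base / cd_base.sum()               # 600 / 500 = 1.2
cd_final = cd_base * hif * uf

# The reconciled districts sum to the uprated state total.
assert np.isclose(cd_final.sum(), uf * s_base)
# District shares within the state are unchanged by the rescaling.
assert np.allclose(cd_final / cd_final.sum(), cd_base / cd_base.sum())
```

Because both factors are scalars within a state, the order in which they are applied does not matter.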

This notebook demonstrates both factors using two domains with contrasting behavior:
- **ACA PTC** (IRS data): Districts sum exactly to state totals, so HIF = 1.0 everywhere. The UF varies by state, reflecting real enrollment and APTC changes between 2022 and 2024.
- **SNAP** (USDA data): District household counts substantially undercount the state administrative totals, so HIF > 1 (often 1.2 to 1.7). The SNAP data is already at the target period, so UF = 1.0.

## Setup

In [1]:
import numpy as np
import pandas as pd

from policyengine_us_data.storage import STORAGE_FOLDER
from policyengine_us_data.calibration.unified_matrix_builder import (
    UnifiedMatrixBuilder,
)
from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
    STATE_CODES,
)

db_path = STORAGE_FOLDER / "calibration" / "policy_data.db"
db_uri = f"sqlite:///{db_path}"
builder = UnifiedMatrixBuilder(db_uri, time_period=2024)



## 1. Raw targets from the database

We query both domains via the `target_overview` view. Each row is a target at a specific geographic level (national, state, or district), for a specific variable, from a specific source period.

In [2]:
DOMAINS = ["aca_ptc", "snap"]

raw = builder._query_targets({"domain_variables": DOMAINS})

summary = (
    raw.groupby(["domain_variable", "geo_level", "variable", "period"])
    .agg(count=("value", "size"), total_value=("value", "sum"))
    .reset_index()
)
summary

| | domain_variable | geo_level | variable | period | count | total_value |
|---|---|---|---|---|---|---|
| 0 | aca_ptc | district | aca_ptc | 2022 | 436 | 14191850000.0 |
| 1 | aca_ptc | district | tax_unit_count | 2022 | 436 | 6848330.0 |
| 2 | aca_ptc | national | aca_ptc | 2022 | 1 | 14191850000.0 |
| 3 | aca_ptc | national | person_count | 2024 | 1 | 19743690.0 |
| 4 | aca_ptc | national | tax_unit_count | 2022 | 1 | 6848330.0 |
| 5 | aca_ptc | state | aca_ptc | 2022 | 51 | 14191850000.0 |
| 6 | aca_ptc | state | tax_unit_count | 2022 | 51 | 6848330.0 |
| 7 | snap | district | household_count | 2024 | 436 | 15632680.0 |
| 8 | snap | state | household_count | 2024 | 51 | 22177090.0 |
| 9 | snap | state | snap | 2024 | 51 | 93657870000.0 |


Notice the structural difference between the two domains:

- **ACA PTC**: Same variables (`aca_ptc`, `tax_unit_count`) at all three levels. Source period is 2022.
- **SNAP**: `household_count` at state + district only; `snap` (dollars) at state only. Source period is 2024.

Also note the SNAP hierarchy gap: the district `household_count` total is much less than the state total. We can see this directly:

In [3]:
snap_hh = raw[
    (raw["domain_variable"] == "snap")
    & (raw["variable"] == "household_count")
]
for level in ["state", "district"]:
    total = snap_hh[snap_hh["geo_level"] == level]["value"].sum()
    print(f"  {level:10s} household_count total: {total:>14,.0f}")

ratio = (
    snap_hh[snap_hh["geo_level"] == "district"]["value"].sum()
    / snap_hh[snap_hh["geo_level"] == "state"]["value"].sum()
)
print(f"\n  District / State ratio: {ratio:.3f}")
print(f"  Districts account for only {ratio:.0%} of the state totals.")

  state      household_count total:     22,177,091
  district   household_count total:     15,632,675

  District / State ratio: 0.705
  Districts account for only 70% of the state totals.


## 2. Generic uprating factors

The uprating factors bridge from each source period to the calibration year (2024). Dollar targets use CPI-U; count targets use national population growth. These come from the `policyengine_us` parameters tree.

In [4]:
from policyengine_us import Microsimulation

sim = Microsimulation()
params = sim.tax_benefit_system.parameters
uprating_factors = builder._calculate_uprating_factors(params)

for (yr, kind), f in sorted(uprating_factors.items()):
    if f != 1.0:
        print(f"  {yr} -> 2024 ({kind}): {f:.6f}")

  2022 -> 2024 (cpi): 1.101889
  2022 -> 2024 (pop): 1.020415
  2023 -> 2024 (cpi): 1.035512
  2023 -> 2024 (pop): 1.010947
  2025 -> 2024 (cpi): 0.970879
  2025 -> 2024 (pop): 0.990801
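
One way to read these numbers: a factor of this kind is typically the ratio of an index at the target year to the same index at the source year, which is why sources dated after 2024 get a factor below 1. The sketch below assumes that ratio form and uses invented index levels; `uprating_factor` is a hypothetical helper, not part of `UnifiedMatrixBuilder`.

```python
# Hypothetical index levels, chosen only for illustration.
cpi_index = {2022: 100.0, 2024: 110.0, 2025: 113.0}


def uprating_factor(index, source_year, target_year):
    """Ratio that carries a value from source_year to target_year."""
    return index[target_year] / index[source_year]


forward = uprating_factor(cpi_index, 2022, 2024)   # above 1: the index rose
backward = uprating_factor(cpi_index, 2025, 2024)  # below 1: source is later than target
same = uprating_factor(cpi_index, 2024, 2024)      # exactly 1: nothing to bridge
```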


Apply the generic uprating to all rows. This sets `original_value` (the source data) and `value` (the uprated result).

In [5]:
raw["original_value"] = raw["value"].copy()
raw["uprating_factor"] = raw.apply(
    lambda r: builder._get_uprating_info(
        r["variable"], r["period"], uprating_factors
    )[0],
    axis=1,
)
raw["value"] = raw["original_value"] * raw["uprating_factor"]
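
To make the lookup concrete, the following sketch reproduces the same multiplication on two California ACA PTC rows whose values appear in this notebook's printed output. It assumes a dictionary keyed by `(source_year, kind)`, as in the loop over `uprating_factors` above. The `factor_kind` rule is a hypothetical stand-in; the actual logic inside `_get_uprating_info` is not shown here.

```python
import pandas as pd

# Factors copied from the printed 2022 -> 2024 output; keys are (source_year, kind).
factors = {(2022, "cpi"): 1.101889, (2022, "pop"): 1.020415}

# California state rows for ACA PTC, with values as printed in this notebook.
df = pd.DataFrame({
    "variable": ["tax_unit_count", "aca_ptc"],
    "period": [2022, 2022],
    "original_value": [1_234_230.0, 2_754_865_000.0],
})


def factor_kind(variable):
    # Hypothetical rule: names ending in "_count" are counts, the rest are dollars.
    return "pop" if variable.endswith("_count") else "cpi"


keys = zip(df["period"], df["variable"].map(factor_kind))
# Keys without an entry (for example the target year itself) default to 1.0.
df["uprating_factor"] = [factors.get(k, 1.0) for k in keys]
df["value"] = df["original_value"] * df["uprating_factor"]
```

The results agree with the printed California rows up to the rounding of the six-decimal factors.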

For ACA PTC (period 2022), the generic factors are about 1.02 for counts and 1.10 for dollar amounts. For SNAP (period 2024), they are 1.0:

In [6]:
sample_states = {6: "CA", 48: "TX", 36: "NY"}

for fips, abbr in sample_states.items():
    rows = raw[
        (raw["geo_level"] == "state")
        & (raw["geographic_id"] == str(fips))
    ]
    for _, r in rows.iterrows():
        print(
            f"  {abbr} [{r['domain_variable']:8s}] "
            f"{r['variable']:20s}  "
            f"orig={r['original_value']:>14,.0f}  "
            f"factor={r['uprating_factor']:.4f}  "
            f"uprated={r['value']:>14,.0f}"
        )

  CA [snap    ] household_count       orig=     3,128,640  factor=1.0000  uprated=     3,128,640
  CA [snap    ] snap                  orig=12,377,175,489  factor=1.0000  uprated=12,377,175,489
  CA [aca_ptc ] tax_unit_count        orig=     1,234,230  factor=1.0204  uprated=     1,259,427
  CA [aca_ptc ] aca_ptc               orig= 2,754,865,000  factor=1.1019  uprated= 3,035,556,532
  TX [snap    ] household_count       orig=     1,466,107  factor=1.0000  uprated=     1,466,107
  TX [snap    ] snap                  orig= 7,210,895,950  factor=1.0000  uprated= 7,210,895,950
  TX [aca_ptc ] tax_unit_count        orig=       571,890  factor=1.0204  uprated=       583,565
  TX [aca_ptc ] aca_ptc               orig= 1,159,849,000  factor=1.1019  uprated= 1,278,025,314
  NY [snap    ] household_count       orig=     1,707,770  factor=1.0000  uprated=     1,707,770
  NY [snap    ] snap                  orig= 7,353,983,677  factor=1.0000  uprated= 7,353,983,677
  NY [aca_ptc ] tax_unit_count

## 3. Hierarchical reconciliation

This is where the two factors get computed and applied to district-level rows. For each (state, variable) pair within a domain:

- **HIF** = `state_original / sum(cd_originals)` -- pure base-year correction
- **UF** = state-specific uprating factor:
  - For **ACA PTC**: loaded from `aca_ptc_multipliers_2022_2024.csv` (CMS/KFF enrollment data). `tax_unit_count` uses `vol_mult` (enrollment ratio); `aca_ptc` uses `vol_mult * val_mult` (enrollment x average APTC ratio).
  - For other domains: national CPI/pop factors as fallback.

After reconciliation, national and state rows used only for the hierarchy calculation are dropped. Rows already at the target period (like CMS `person_count` or SNAP state-level totals) are kept.

In [7]:
result = builder._apply_hierarchical_uprating(
    raw, DOMAINS, uprating_factors
)
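
The internals of `_apply_hierarchical_uprating` are not shown in this notebook. As a rough sketch of the reconciliation described above, the following toy example computes the HIF per (state, variable) pair and applies a state-specific UF to the district rows. All values are invented, and `state_uf` is a hypothetical mapping; the only convention borrowed from the notebook is that a district's `geographic_id` integer-divided by 100 gives the state FIPS code.

```python
import pandas as pd

# Toy targets: one state (FIPS 6) with three districts; values are invented.
targets = pd.DataFrame({
    "geo_level": ["state", "district", "district", "district"],
    "geographic_id": ["6", "601", "602", "603"],
    "variable": ["household_count"] * 4,
    "original_value": [600.0, 100.0, 250.0, 150.0],
})
state_uf = {6: 1.10}  # hypothetical state-specific uprating factor

cds = targets[targets["geo_level"] == "district"].copy()
cds["state_fips"] = cds["geographic_id"].astype(int) // 100

states = targets[targets["geo_level"] == "state"].copy()
states["state_fips"] = states["geographic_id"].astype(int)
state_totals = states[["state_fips", "variable", "original_value"]].rename(
    columns={"original_value": "state_original"}
)

# Attach each district's state total, then HIF = state_original / sum(cd_originals).
cds = cds.merge(state_totals, on=["state_fips", "variable"], how="left")
cd_sum = cds.groupby(["state_fips", "variable"])["original_value"].transform("sum")
cds["hif"] = cds["state_original"] / cd_sum

cds["state_uprating_factor"] = cds["state_fips"].map(state_uf)
cds["value"] = cds["original_value"] * cds["hif"] * cds["state_uprating_factor"]
```

In this sketch only the district rows carry forward; the state row has served its purpose once the HIF is computed.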

### ACA PTC: HIF = 1, state-varying uprating factors

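The IRS data is internally consistent: district totals sum exactly to state totals, so HIF = 1.0 everywhere. The uprating factors now vary by state, reflecting real CMS/KFF enrollment and average APTC changes between 2022 and 2024. States with large enrollment growth (e.g., WV, LA, TN) have higher factors; states with stable or declining enrollment (e.g., ME, DC) have lower factors.

In the output below, `uprated_state` is the state value after the generic CPI/population uprating from section 2, while `sum(CDs)` reflects the state-specific factor. The two columns therefore differ wherever the state factor departs from the national one.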

In [8]:
def show_reconciliation(result, raw, domain, sample_states):
    domain_rows = result[result["domain_variable"] == domain]
    cd_domain = domain_rows[domain_rows["geo_level"] == "district"]
    if cd_domain.empty:
        print("  (no district rows)")
        return
    for fips, abbr in sample_states.items():
        cd_state = cd_domain[
            cd_domain["geographic_id"].apply(
                lambda g, s=fips: (
                    int(g) // 100 == s
                    if g not in ("US",)
                    else False
                )
            )
        ]
        if cd_state.empty:
            continue
        for var in sorted(cd_state["variable"].unique()):
            var_rows = cd_state[cd_state["variable"] == var]
            hif = var_rows["hif"].iloc[0]
            uf = var_rows["state_uprating_factor"].iloc[0]
            cd_sum = var_rows["value"].sum()
            st_row = raw[
                (raw["geo_level"] == "state")
                & (raw["geographic_id"] == str(fips))
                & (raw["variable"] == var)
                & (raw["domain_variable"] == domain)
            ]
            uprated_state = (
                st_row["value"].iloc[0]
                if len(st_row)
                else np.nan
            )
            print(
                f"  {abbr} {var:20s}  "
                f"hif={hif:.6f}  "
                f"uprating={uf:.6f}  "
                f"sum(CDs)={cd_sum:>14,.0f}  "
                f"uprated_state={uprated_state:>14,.0f}"
            )

show_reconciliation(result, raw, "aca_ptc", sample_states)

  CA aca_ptc               hif=1.000000  uprating=1.209499  sum(CDs)= 3,332,007,010  uprated_state= 3,035,556,532
  CA tax_unit_count        hif=1.000000  uprating=1.055438  sum(CDs)=     1,302,653  uprated_state=     1,259,427
  TX aca_ptc               hif=1.000000  uprating=1.957664  sum(CDs)= 2,270,594,110  uprated_state= 1,278,025,314
  TX tax_unit_count        hif=1.000000  uprating=1.968621  sum(CDs)=     1,125,834  uprated_state=       583,565
  NY aca_ptc               hif=1.000000  uprating=1.343861  sum(CDs)= 2,049,797,288  uprated_state= 1,680,716,304
  NY tax_unit_count        hif=1.000000  uprating=1.075089  sum(CDs)=       593,653  uprated_state=       563,463


In [9]:
aca_cds = result[
    (result["domain_variable"] == "aca_ptc")
    & (result["geo_level"] == "district")
    & (result["variable"] == "aca_ptc")
]

state_ufs = (
    aca_cds.assign(state_fips=aca_cds["geographic_id"].apply(
        lambda g: int(g) // 100
    ))
    .groupby("state_fips")["state_uprating_factor"]
    .first()
    .sort_values()
)

print("ACA PTC uprating factors (aca_ptc = vol_mult * val_mult):")
print(f"  {'State FIPS':>12s}  {'Factor':>8s}")
print(f"  {'─'*12}  {'─'*8}")
for fips in list(state_ufs.index[:5]) + ["..."] + list(state_ufs.index[-5:]):
    if fips == "...":
        print(f"  {'...':>12s}")
    else:
        abbr = STATE_CODES.get(fips, str(fips))
        print(f"  {abbr:>12s}  {state_ufs[fips]:>8.4f}")
print(f"\n  Range: [{state_ufs.min():.4f}, {state_ufs.max():.4f}]")
print(f"  Median: {state_ufs.median():.4f}")

ACA PTC uprating factors (aca_ptc = vol_mult * val_mult):
    State FIPS    Factor
  ────────────  ────────
            HI    0.9766
            NV    1.0353
            OR    1.0556
            DC    1.1592
            SD    1.1820
           ...
            TN    2.1770
            GA    2.2579
            MS    2.3237
            LA    2.4206
            WV    2.4466

  Range: [0.9766, 2.4466]
  Median: 1.4139


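The state-level uprating factors for ACA PTC come from real CMS/KFF data. The output above lists the five lowest and five highest factors for the `aca_ptc` variable: they range from a slight decline in HI (0.9766) to more than a doubling in WV (2.4466).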
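
The construction of `aca_ptc_multipliers_2022_2024.csv` is not shown in this notebook. Under the description given earlier (an enrollment ratio and an average APTC ratio), the two multipliers could be derived roughly as follows. The input column names and all figures are invented for illustration; only `vol_mult` and `val_mult` are names that appear in the notebook.

```python
import pandas as pd

# Invented enrollment and average-APTC figures for two made-up states.
# Input column names are hypothetical.
src = pd.DataFrame({
    "state": ["AA", "BB"],
    "enrollment_2022": [100_000, 200_000],
    "enrollment_2024": [150_000, 190_000],
    "avg_aptc_2022": [500.0, 400.0],
    "avg_aptc_2024": [550.0, 420.0],
})

src["vol_mult"] = src["enrollment_2024"] / src["enrollment_2022"]
src["val_mult"] = src["avg_aptc_2024"] / src["avg_aptc_2022"]

# Count targets scale with enrollment only; dollar targets also pick up
# the change in the average credit.
src["uf_tax_unit_count"] = src["vol_mult"]
src["uf_aca_ptc"] = src["vol_mult"] * src["val_mult"]
```

A state whose enrollment shrinks can still end up with a dollar factor close to 1 if its average credit rises, as in the second toy row.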

### SNAP: HIF does the heavy lifting

SNAP district-level household counts come from a different source than state administrative totals, and they systematically undercount. The HIF corrects for this. Since the data is already at period 2024, the uprating factor is 1.0.

In [10]:
show_reconciliation(result, raw, "snap", sample_states)

  CA household_count       hif=1.681273  uprating=1.000000  sum(CDs)=     3,128,640  uprated_state=     3,128,640
  TX household_count       hif=1.244524  uprating=1.000000  sum(CDs)=     1,466,107  uprated_state=     1,466,107
  NY household_count       hif=1.344447  uprating=1.000000  sum(CDs)=     1,707,770  uprated_state=     1,707,770


The HIFs here are substantial: California's state SNAP household count is ~68% higher than the sum of its district counts. The HIF rescales each district proportionally so the corrected districts sum to the state total.

Note that SNAP dollar amounts (`snap` variable) exist only at the state level with no district breakdown, so there are no district rows to reconcile for that variable.

## 4. What remains after reconciliation

The hierarchical uprating drops rows that were only needed for the reconciliation calculation (national/state rows from older periods) and keeps everything else.

In [11]:
level_counts = (
    result.groupby(["domain_variable", "geo_level"])
    .size()
    .reset_index(name="count")
)
level_counts

| | domain_variable | geo_level | count |
|---|---|---|---|
| 0 | aca_ptc | district | 872 |
| 1 | aca_ptc | national | 1 |
| 2 | snap | district | 436 |
| 3 | snap | state | 102 |


In [12]:
nat_rows = result[result["geo_level"] == "national"]
state_rows = result[result["geo_level"] == "state"]

if len(nat_rows):
    print("Kept national rows:")
    for _, r in nat_rows.iterrows():
        print(
            f"  [{r['domain_variable']}] {r['variable']}  "
            f"period={r['period']}  value={r['value']:,.0f}"
        )

if len(state_rows):
    print(f"\nKept state rows: {len(state_rows)}")
    print("  (SNAP state-level targets at period=2024 are preserved)")
    for var in state_rows["variable"].unique():
        n = len(state_rows[state_rows["variable"] == var])
        print(f"  {var}: {n} rows")

Kept national rows:
  [aca_ptc] person_count  period=2024  value=19,743,689

Kept state rows: 102
  (SNAP state-level targets at period=2024 are preserved)
  household_count: 51 rows
  snap: 51 rows


## 5. Verification: sum(CDs) == uprated state

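The core invariant: for every (state, variable) pair that has district rows, the sum of reconciled district values must equal the state's original value multiplied by its state-specific uprating factor. For ACA PTC this target differs from the generically uprated state value computed in section 2.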

In [13]:
all_ok = True
checks = 0
for domain in DOMAINS:
    domain_result = result[result["domain_variable"] == domain]
    cd_result = domain_result[
        domain_result["geo_level"] == "district"
    ]
    if cd_result.empty:
        continue

    for fips, abbr in sorted(STATE_CODES.items()):
        cd_rows = cd_result[
            cd_result["geographic_id"].apply(
                lambda g, s=fips: (
                    int(g) // 100 == s
                    if g not in ("US",)
                    else False
                )
            )
        ]
        if cd_rows.empty:
            continue
        for var in cd_rows["variable"].unique():
            var_rows = cd_rows[cd_rows["variable"] == var]
            cd_sum = var_rows["value"].sum()

            # The reconciliation targets state_original * state_UF,
            # not the generically uprated state value.
            st = raw[
                (raw["geo_level"] == "state")
                & (raw["geographic_id"] == str(fips))
                & (raw["variable"] == var)
                & (raw["domain_variable"] == domain)
            ]
            if st.empty:
                continue
            state_original = st["original_value"].iloc[0]
            state_uf = var_rows["state_uprating_factor"].iloc[0]
            expected = state_original * state_uf

            ok = np.isclose(cd_sum, expected, rtol=1e-6)
            checks += 1
            if not ok:
                print(
                    f"  FAIL [{domain}] {abbr} {var}: "
                    f"sum(CDs)={cd_sum:.2f} != "
                    f"expected={expected:.2f}"
                )
                all_ok = False

print(
    f"  {checks} checks across {len(DOMAINS)} domains: "
    + ("ALL PASSED" if all_ok else "SOME FAILED")
)

  153 checks across 2 domains: ALL PASSED
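
The same invariant can also be checked without nested loops. The sketch below is one possible groupby-based variant on invented data, not the notebook's method. It assumes the column names used above (`value`, `state_uprating_factor`, `original_value`) and a separate toy `states` frame.

```python
import numpy as np
import pandas as pd

# Toy reconciled district rows (invented values) for two states.
cds = pd.DataFrame({
    "geographic_id": ["601", "602", "4801", "4802"],
    "variable": ["household_count"] * 4,
    "value": [132.0, 528.0, 90.0, 210.0],
    "state_uprating_factor": [1.1, 1.1, 1.0, 1.0],
})
states = pd.DataFrame({
    "state_fips": [6, 48],
    "variable": ["household_count"] * 2,
    "original_value": [600.0, 300.0],
})

cds["state_fips"] = cds["geographic_id"].astype(int) // 100
agg = (
    cds.groupby(["state_fips", "variable"])
    .agg(cd_sum=("value", "sum"), state_uf=("state_uprating_factor", "first"))
    .reset_index()
    .merge(states, on=["state_fips", "variable"], how="inner")
)

# Expected total is state_original * state_UF, as in the loop above.
agg["expected"] = agg["original_value"] * agg["state_uf"]
agg["ok"] = np.isclose(agg["cd_sum"], agg["expected"], rtol=1e-6)
failures = agg[~agg["ok"]]
```

An empty `failures` frame corresponds to the "ALL PASSED" message, and any rows it does contain identify the offending (state, variable) pairs directly.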
