Add CBO aggregate + per-AGI-bracket targets for cap gains, dividends, interest#868

Merged: MaxGhenis merged 5 commits into main from add-cap-gains-agi-targets on May 4, 2026

@MaxGhenis
Contributor

Summary

Adds calibration targets to constrain capital-gains, dividend, and interest-income aggregates that were running 5-15x over CBO/SOI targets in the default enhanced_cps_2024 dataset.

Fixes the national-level half of #555 (which only flagged the issue at state level) and resolves #866.

Diagnosis

In the current default enhanced_cps_2024.h5, two records with raw $62M / $79M LTCG ended up with calibration weights of 73k / 51k — together contributing 87% of the $9.92T national long-term capital gains aggregate. CBO target for 2024: $1.29T. So we were 7.7× over for LTCG, 12× over for net_capital_gains, and Gini was reading 0.93 (vs. real ~0.45) on the resulting income distribution.

Same mechanism as #555: PR #537 removed the AGI ceiling on PUF imputation, but the calibration targets weren't tightened to constrain the new high-income tail. The bracket-level SOI targets in build_loss_matrix can be satisfied while a few records absorb extreme weight to fill the population/AGI targets — overshooting the un-constrained component aggregates as a side effect.
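To make the concentration in the diagnosis concrete, here is a minimal sketch that measures how much of a weighted aggregate comes from the k largest contributors. The function name `top_contributor_share` and the third "rest of file" record are assumptions made for illustration; only the rounded $62M / $79M values, the 73k / 51k weights, and the $9.92T total come from the diagnosis above.

```python
import numpy as np


def top_contributor_share(values, weights, k=2):
    """Share of the weighted total contributed by the k largest weighted records."""
    contributions = np.asarray(values, dtype=float) * np.asarray(weights, dtype=float)
    total = contributions.sum()
    if total == 0:
        return 0.0
    # Sort ascending and take the k largest weighted contributions.
    top_k = np.sort(contributions)[-k:]
    return float(top_k.sum() / total)


# Rounded figures from the diagnosis. The third record is a hypothetical
# stand-in for every other record, sized so the total equals $9.92T.
whale_total = 62e6 * 73e3 + 79e6 * 51e3
ltcg = np.array([62e6, 79e6, 1.0])
weight = np.array([73e3, 51e3, 9.92e12 - whale_total])

share = top_contributor_share(ltcg, weight, k=2)
print(f"Top-2 share of weighted LTCG: {share:.1%}")
```

With these rounded inputs the share comes out at roughly 86%, in line with the 87% reported from the unrounded data. Running the same check per variable after a rebuild would be one way to see whether any record still dominates an aggregate.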

Changes

1. utils/loss.py build_loss_matrix (consumed by the legacy EnhancedCPS_2024.generate() path that builds enhanced_cps_2024.h5):

Add three CBO income_by_source aggregate targets so the optimizer has explicit national totals to aim for (these are soft loss targets, not hard constraints):

CBO_INCOME_BY_SOURCE_TARGETS = [
    ("net_capital_gains", ["net_capital_gains"], "net_capital_gain"),
    ("qualified_dividend_income", ["qualified_dividend_income"], "qualified_dividend_income"),
    ("taxable_interest_and_ordinary_dividends",
     ["taxable_interest_income", "non_qualified_dividend_income"],
     "taxable_interest_and_ordinary_dividends"),
]
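As a rough sketch of how one of these entries might become a loss-matrix column, the snippet below reads each tuple as (column label, component variables to sum, CBO target key). That reading is an assumption, as are the helper `build_target_column`, the toy record arrays, and the target value, none of which come from `utils/loss.py`.

```python
import numpy as np

# Hypothetical record-level values standing in for simulation output.
records = {
    "taxable_interest_income": np.array([100.0, 0.0, 2_000.0]),
    "non_qualified_dividend_income": np.array([0.0, 50.0, 500.0]),
}

# Hypothetical target value for illustration, not a real CBO figure.
cbo_targets = {"taxable_interest_and_ordinary_dividends": 3_000_000.0}


def build_target_column(label, variables, target_key, records, targets):
    """Sum the component variables per record and pair them with the target."""
    column = sum(records[name] for name in variables)
    return label, column, targets[target_key]


label, column, target = build_target_column(
    "taxable_interest_and_ordinary_dividends",
    ["taxable_interest_income", "non_qualified_dividend_income"],
    "taxable_interest_and_ordinary_dividends",
    records,
    cbo_targets,
)

# The weighted sum of the column is what the optimizer compares to the target.
weights = np.array([1_000.0, 1_000.0, 1_000.0])
estimate = float(column @ weights)
print(label, estimate, target)
```

Summing the components before weighting is what lets a single CBO line item (interest plus ordinary dividends) constrain two model variables at once.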

2. calibration/target_config.yaml (consumed by unified_calibration for national/US.h5):

  • Add per-AGI-bracket net_capital_gains targets (the DB already had them from the SOI ETL; they just weren't included).
  • Add tax_unit_count × cap-gains targets (per-bracket counts of returns with cap gains).
  • Re-include dividend_income, qualified_dividend_income, and taxable_interest_income aggregates that were previously dropped for "high error or tension". 30% rel-error on a soft target is much better than no constraint at all when the alternative is 5-15× inflation.
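The trade-off in the last bullet can be illustrated with a squared relative-error term. This is a sketch only: the loss actually used by the calibration code is not shown in this PR, so `relative_squared_error` is an assumed form, and the dollar figures are the LTCG numbers from the diagnosis.

```python
def relative_squared_error(estimate, target):
    """Squared relative deviation of an estimate from its target."""
    return ((estimate - target) / target) ** 2


cbo_target = 1.29e12  # CBO 2024 figure quoted in the diagnosis

# A soft target that ends up 30% off still contributes only a small penalty...
loss_30pct = relative_squared_error(1.3 * cbo_target, cbo_target)

# ...while the unconstrained $9.92T aggregate would be penalized far more heavily.
loss_inflated = relative_squared_error(9.92e12, cbo_target)

print(f"30% miss: {loss_30pct:.2f}, 7.7x overshoot: {loss_inflated:.1f}")
```

Under this assumed loss form, the 7.7× overshoot costs several hundred times more than a 30% miss, which is why even a noisy target gives the optimizer a strong reason to avoid the inflated solution.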

Test plan

  • Unit tests still pass (1 pre-existing unrelated failure in test_policyengine_utils.py)
  • Run a national rebuild: python -m policyengine_us_data.datasets.cps.enhanced_cps
  • Verify aggregate net_capital_gains lands within ~50% of CBO $1.29T target (currently ~16x over)
  • Verify aggregate qualified_dividend_income lands within ~50% of CBO $354B target (currently ~6x over)
  • Verify Gini on household_net_income drops from 0.93 to a more plausible level
  • Run unified calibration with updated target_config.yaml and check national_unified_diagnostics.csv for new per-bracket cap-gains rel errors
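The "within ~50% of target" checks above could be scripted along these lines. This is a sketch: `within_tolerance` is a hypothetical helper, the "after" aggregate is an illustrative placeholder rather than a measured result, and loading the real aggregates from the rebuilt H5 is left out.

```python
def within_tolerance(aggregate, target, rel_tol=0.5):
    """True if the aggregate is within rel_tol (as a fraction) of the target."""
    return abs(aggregate - target) <= rel_tol * abs(target)


# CBO targets quoted in the test plan.
CBO_NET_CAPITAL_GAINS = 1.29e12
CBO_QUALIFIED_DIVIDENDS = 354e9

# Pre-fix levels implied by the test plan (~16x and ~6x over target).
net_cg_before = 16 * CBO_NET_CAPITAL_GAINS
qdi_before = 6 * CBO_QUALIFIED_DIVIDENDS

# Hypothetical post-fix value, used only to show a passing check.
net_cg_after_example = 1.5e12

print(within_tolerance(net_cg_before, CBO_NET_CAPITAL_GAINS))         # False
print(within_tolerance(qdi_before, CBO_QUALIFIED_DIVIDENDS))          # False
print(within_tolerance(net_cg_after_example, CBO_NET_CAPITAL_GAINS))  # True
```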

The rebuild will reveal whether soft targets are sufficient or whether we also need the per-record contribution cap proposed in #555. If aggregates still run 2-3× over after this change, we'll need to escalate to a hard-constraint solution in microcalibrate.

Refs

🤖 Generated with Claude Code

MaxGhenis and others added 3 commits May 3, 2026 14:15
… interest

The calibration optimizer was leaving capital gains, dividends, and
interest aggregates 5-15x inflated relative to CBO targets in the
default enhanced_cps_2024 dataset. Two records with raw $62M / $79M
LTCG ended up with calibration weights of 73k / 51k — together
contributing 87% of the inflated $9.92T national LTCG aggregate
(real CBO target: $1.29T for 2024).

Root cause is the same as #555 at state level: when PR #537 removed
the AGI ceiling on PUF imputation, calibration weights weren't
re-tuned to constrain the new high-income tail. The bracket-level
SOI targets in build_loss_matrix can be satisfied while a few
records absorb extreme weight to fill the population/AGI targets,
overshooting the un-constrained component aggregates as a side effect.

Two changes:

1. utils/loss.py (build_loss_matrix, used by enhanced_cps_2024):
   Add three CBO income_by_source aggregate targets (net_capital_gains,
   qualified_dividend_income, and taxable_interest_and_ordinary_dividends)
   so the optimizer has explicit soft targets on these national totals.

2. calibration/target_config.yaml (used by unified_calibration /
   national/US.h5): add per-AGI-bracket net_capital_gains targets
   (the DB already has them from SOI ETL, they just weren't
   included), plus re-include dividend_income, qualified_dividend_income,
   and taxable_interest_income aggregates that were previously
   dropped for "high error or tension". 30% rel-error on a soft
   target is still vastly better than no constraint.

Refs #555, #866.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis
Contributor Author

MaxGhenis commented May 4, 2026

Local rebuild / validation update after the latest fixes.

Pushed additional fixes in 0b82e67e:

  • Keep all-return SOI rows for investment-income controls, including low-AGI investment-income brackets, while avoiding duplicate taxable-only investment constraints.
  • Make the Forbes top-tail path usable when the IRS $100M+ aggregate-row target exceeds the number of eligible Forbes units: scale synthetic replicate weights to the aggregate-row population instead of falling back to donor synthesis.
  • Treat missing aggregate amount targets as zero during Forbes/PUF amount calibration.
  • Spread Forbes integer remainder weights across units; a subagent re-review after that fix found no remaining issues.
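For the last bullet, one common deterministic way to spread an integer remainder is sketched below. It is an illustration of the general idea only: `spread_integer_weight` is a hypothetical helper, and the actual logic in `forbes_backbone.py` may order or allocate units differently.

```python
def spread_integer_weight(total, n_units):
    """Split an integer total across n_units so the parts differ by at most 1."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    base, remainder = divmod(total, n_units)
    # The first `remainder` units each receive one extra unit of weight,
    # so the result is deterministic for a fixed unit ordering.
    return [base + (1 if i < remainder else 0) for i in range(n_units)]


parts = spread_integer_weight(10, 4)
print(parts)  # [3, 3, 2, 2]
```

Spreading the remainder this way avoids loading the whole leftover onto a single unit, and the result always sums back to the original total.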

I also rolled the HF policyengine/policyengine-us-data main artifacts back to the clean Apr 23 release (release_manifest version 1.86.1, source revision d702ccbc77cf9c5a83c7b46027d8f07ab5bf6410). New HF commit is df8acc14419272899d5a0949e457157aaf4b7edc; rollback workflow: https://github.com/PolicyEngine/policyengine-us-data/actions/runs/25290847940.

Local build path run:

uv run python -m policyengine_us_data.datasets.puf.puf
uv run python -m policyengine_us_data.datasets.cps.extended_cps
uv run python -m policyengine_us_data.calibration.create_stratified_cps
uv run python -m policyengine_us_data.calibration.create_source_imputed_cps
uv run python -m policyengine_us_data.datasets.cps.enhanced_cps

The final enhanced_cps run passed post-generation weight validation. Note: after that full H5 build I made the deterministic Forbes remainder-spreading micro-fix; it is covered by unit tests, but I did not spend another full rebuild on that last row-ordering adjustment.

Aggregate sanity check, 2026

| Metric | Broken before | Local rebuilt H5 | Reference | Rebuilt ratio |
| --- | --- | --- | --- | --- |
| net_capital_gains | $20,700B | $1,397B | ~$1,700B | 0.82x |
| long_term_capital_gains | $9,920B | $1,645B | ~$1,700B | 0.97x |
| qualified_dividend_income | ~$3,400B | $391B | ~$400B | 0.98x |
| taxable_interest_income | not captured | $351B | ~$500B | 0.70x |
| adjusted_gross_income | not captured | $16,742B | ~$18,810B | 0.89x |
| income_tax | not captured | $2,303B | ~$2,200B | 1.05x |
| household_net_income | not captured | $15,075B | ~$22,000B | 0.69x |
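The "Rebuilt ratio" column is the rebuilt aggregate divided by the reference value. A short sketch that recomputes it from the figures in the table (in $B) follows; the dictionary layout is just for illustration.

```python
# (rebuilt, reference, stated ratio) in $B, copied from the table above.
sanity_check = {
    "net_capital_gains": (1_397, 1_700, 0.82),
    "long_term_capital_gains": (1_645, 1_700, 0.97),
    "qualified_dividend_income": (391, 400, 0.98),
    "taxable_interest_income": (351, 500, 0.70),
    "adjusted_gross_income": (16_742, 18_810, 0.89),
    "income_tax": (2_303, 2_200, 1.05),
    "household_net_income": (15_075, 22_000, 0.69),
}

# Recompute each ratio from the rebuilt and reference values.
ratios = {name: rebuilt / ref for name, (rebuilt, ref, _) in sanity_check.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```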

UBI + top LTCG rate 20% to 25%, 2026

| Metric | Broken dataset | Local rebuilt H5 |
| --- | --- | --- |
| UBI gross outlays | +$681B | +$681B |
| Cap-gains rate-hike revenue | +$66B | +$60B |
| Net federal cost | +$614B | +$621B |
| Gini | 0.93 | 0.5790 -> 0.5564 |

Interpretation: the national capital-income aggregate bug is fixed. The old negative-AGI LTCG whale is gone: <0 AGI now has only $1.8B LTCG and $0.0B tax-hike contribution. The remaining +$60B cap-gains-rate result is consistent with a static microsimulation of a 5pp top LTCG rate increase. The lower $10-25B TPC/JCT-style range would require a behavioral capital-gains realization response layer; it is not something I would expect this data calibration PR to reproduce by itself. I did not add weight caps.
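For readers who want to reproduce a Gini figure from record-level data, here is a minimal sketch of a weighted Gini coefficient based on the area under the Lorenz curve. It is not the implementation PolicyEngine uses to produce the numbers above; `weighted_gini` is a hypothetical helper, and it assumes a positive weighted total (with negative incomes the Lorenz-based value can leave the 0 to 1 range).

```python
import numpy as np


def weighted_gini(values, weights):
    """Gini coefficient of `values` with survey `weights`, via the Lorenz curve."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort records from poorest to richest.
    order = np.argsort(values)
    v, w = values[order], weights[order]

    # Cumulative population and income shares, with the origin prepended.
    cum_w = np.cumsum(w)
    cum_vw = np.cumsum(v * w)
    x = np.concatenate([[0.0], cum_w / cum_w[-1]])
    y = np.concatenate([[0.0], cum_vw / cum_vw[-1]])

    # Trapezoid-rule area under the Lorenz curve.
    area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
    return float(1.0 - 2.0 * area)


# Perfect equality gives 0; one of four units holding everything gives 0.75.
print(weighted_gini([1, 1, 1, 1], [1, 1, 1, 1]))
print(weighted_gini([0, 0, 0, 1], [1, 1, 1, 1]))
```

Because weights enter both the population and income shares, a record with weight 2 behaves exactly like two identical unweighted records, which is the property that lets a few extreme-weight records distort the statistic.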

Local verification:

uv run ruff format --check .
uv run ruff check policyengine_us_data/datasets/puf/aggregate_record_utils.py policyengine_us_data/datasets/puf/forbes_backbone.py policyengine_us_data/utils/loss.py tests/unit/datasets/test_disaggregate_puf.py tests/unit/calibration/test_loss_targets.py
uv run pytest tests/unit/datasets/test_disaggregate_puf.py tests/unit/calibration/test_loss_targets.py tests/unit/test_etl_irs_soi_overlay.py tests/unit/calibration/test_target_config.py -q -p no:cacheprovider
# 72 passed

Development

Successfully merging this pull request may close these issues.

National enhanced_cps_2024 has 5-15x inflated capital-gains/dividend/interest aggregates (same as #555 at state level)
