Add CBO aggregate + per-AGI-bracket targets for cap gains, dividends, interest by MaxGhenis · Pull Request #868 · PolicyEngine/policyengine-us-data

MaxGhenis · 2026-05-03T18:15:49Z

Summary

Adds calibration targets to constrain capital-gains, dividend, and interest-income aggregates that were running 5-15x over CBO/SOI targets in the default enhanced_cps_2024 dataset.

Fixes the national-level half of #555 (which only flagged the issue at state level) and resolves #866.

Diagnosis

In the current default enhanced_cps_2024.h5, two records with raw $62M / $79M LTCG ended up with calibration weights of 73k / 51k — together contributing 87% of the $9.92T national long-term capital gains aggregate. CBO target for 2024: $1.29T. So we were 7.7× over for LTCG, 12× over for net_capital_gains, and Gini was reading 0.93 (vs. real ~0.45) on the resulting income distribution.
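The concentration described above can be checked with back-of-the-envelope arithmetic. This is a minimal sketch that assumes the rounded figures quoted in this description and assumes the weights pair with the records in the order listed; the variable names are illustrative and not taken from the codebase.

```python
# Rounded figures from the diagnosis; the pairing of gains to weights is assumed.
records = [
    {"ltcg": 62e6, "weight": 73e3},
    {"ltcg": 79e6, "weight": 51e3},
]
national_ltcg = 9.92e12  # reported national LTCG aggregate, in dollars
cbo_target = 1.29e12     # CBO target for 2024, in dollars

# A record's contribution to a weighted aggregate is value * weight.
contributions = [r["ltcg"] * r["weight"] for r in records]
share = sum(contributions) / national_ltcg
overshoot = national_ltcg / cbo_target

print(f"two-record contribution: ${sum(contributions) / 1e12:.2f}T")
print(f"share of national LTCG: {share:.0%}")
print(f"overshoot vs CBO: {overshoot:.1f}x")
```

With these rounded inputs the share comes out at roughly 86 percent, in line with the 87% quoted above, and the overshoot at about 7.7x.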

Same mechanism as #555: PR #537 removed the AGI ceiling on PUF imputation, but the calibration targets weren't tightened to constrain the new high-income tail. The bracket-level SOI targets in build_loss_matrix can be satisfied while a few records absorb extreme weight to fill the population/AGI targets, overshooting the unconstrained component aggregates as a side effect.
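To illustrate why matching count and AGI targets does not pin down a component aggregate, here is a toy sketch with made-up numbers (not drawn from the dataset). Two weightings hit the same population and AGI totals but imply very different capital-gains totals.

```python
# Toy records (hypothetical values): same AGI, very different capital gains.
agi = [100.0, 100.0]
ltcg = [0.0, 90.0]


def weighted_total(values, weights):
    """Weighted aggregate: sum of value * weight over records."""
    return sum(v * w for v, w in zip(values, weights))


weights_a = [10.0, 0.0]  # all weight on the record without gains
weights_b = [0.0, 10.0]  # all weight on the record with large gains

for name, w in [("A", weights_a), ("B", weights_b)]:
    print(
        name,
        "count:", sum(w),
        "AGI:", weighted_total(agi, w),
        "LTCG:", weighted_total(ltcg, w),
    )
```

Both weightings satisfy a count target of 10 and an AGI target of 1,000, yet the LTCG total is 0 under A and 900 under B. Only a target on the component itself distinguishes them.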

Changes

1. utils/loss.py build_loss_matrix (consumed by the legacy EnhancedCPS_2024.generate() path that builds enhanced_cps_2024.h5):

Add three CBO income_by_source aggregate targets so the optimizer is pulled toward the published national totals (these are soft loss targets, not hard bounds):

CBO_INCOME_BY_SOURCE_TARGETS = [
    ("net_capital_gains", ["net_capital_gains"], "net_capital_gain"),
    ("qualified_dividend_income", ["qualified_dividend_income"], "qualified_dividend_income"),
    ("taxable_interest_and_ordinary_dividends",
     ["taxable_interest_income", "non_qualified_dividend_income"],
     "taxable_interest_and_ordinary_dividends"),
]
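As a rough illustration of how entries of this shape could become loss-matrix columns, the sketch below reads each tuple as (column label, variables to sum, CBO key). That reading, the `build_cbo_columns` helper, and the toy data are assumptions for illustration; the actual handling lives in build_loss_matrix.

```python
CBO_INCOME_BY_SOURCE_TARGETS = [
    ("net_capital_gains", ["net_capital_gains"], "net_capital_gain"),
    ("qualified_dividend_income", ["qualified_dividend_income"], "qualified_dividend_income"),
    (
        "taxable_interest_and_ordinary_dividends",
        ["taxable_interest_income", "non_qualified_dividend_income"],
        "taxable_interest_and_ordinary_dividends",
    ),
]


def build_cbo_columns(values_by_variable, cbo_values):
    """Hypothetical helper: one per-record column and one target per entry."""
    columns, targets = {}, {}
    for label, variables, cbo_key in CBO_INCOME_BY_SOURCE_TARGETS:
        n = len(values_by_variable[variables[0]])
        # Sum the component variables record by record.
        columns[label] = [
            sum(values_by_variable[v][i] for v in variables) for i in range(n)
        ]
        targets[label] = cbo_values[cbo_key]
    return columns, targets


# Toy inputs (made-up numbers, two records).
values = {
    "net_capital_gains": [0.0, 50.0],
    "qualified_dividend_income": [1.0, 2.0],
    "taxable_interest_income": [3.0, 4.0],
    "non_qualified_dividend_income": [0.5, 1.5],
}
cbo = {
    "net_capital_gain": 100.0,
    "qualified_dividend_income": 10.0,
    "taxable_interest_and_ordinary_dividends": 20.0,
}
columns, targets = build_cbo_columns(values, cbo)
print(columns["taxable_interest_and_ordinary_dividends"])
```

In a setup like this, the weighted sum of each column would be compared against its target during calibration.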

2. calibration/target_config.yaml (consumed by unified_calibration for national/US.h5):

Add per-AGI-bracket net_capital_gains targets (DB already had them from the SOI ETL; just weren't included).
Add tax_unit_count × cap-gains targets (per-bracket counts of returns with cap gains).
Re-include dividend_income, qualified_dividend_income, and taxable_interest_income aggregates that were previously dropped for "high error or tension". 30% rel-error on a soft target is much better than no constraint at all when the alternative is 5-15× inflation.

Test plan

Unit tests still pass (1 pre-existing unrelated failure in test_policyengine_utils.py)
Run a national rebuild: python -m policyengine_us_data.datasets.cps.enhanced_cps
Verify aggregate net_capital_gains lands within ~50% of CBO $1.29T target (currently ~16x over)
Verify aggregate qualified_dividend_income lands within ~50% of CBO $354B target (currently ~6x over)
Verify Gini on household_net_income drops from 0.93 to a more plausible level
Run unified calibration with updated target_config.yaml and check national_unified_diagnostics.csv for new per-bracket cap-gains rel errors
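The two aggregate checks in the plan could be scripted along these lines. This is a sketch: `within_tolerance` is a hypothetical helper, and the "current" values are back-calculated from the approximate overshoot ratios quoted above rather than read from the dataset.

```python
def within_tolerance(actual, target, rel_tol=0.5):
    """True if actual is within rel_tol (as a fraction) of target."""
    return abs(actual - target) <= rel_tol * target


# CBO 2024 targets from the test plan, in dollars.
targets = {
    "net_capital_gains": 1.29e12,
    "qualified_dividend_income": 354e9,
}
# Approximate pre-fix aggregates implied by "~16x over" and "~6x over".
current = {
    "net_capital_gains": 16 * 1.29e12,
    "qualified_dividend_income": 6 * 354e9,
}

for name, target in targets.items():
    ok = within_tolerance(current[name], target)
    print(f"{name}: {current[name] / target:.1f}x target, pass={ok}")
```

After a rebuild, the `current` values would be replaced by the aggregates computed from the new H5 file.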

The rebuild will reveal whether soft targets are sufficient or whether we also need the per-record contribution cap proposed in #555. If aggregates still run 2-3× over after this change, we'll need to escalate to a hard-constraint solution in microcalibrate.

Refs

Closes #866: National enhanced_cps_2024 has 5-15x inflated capital-gains/dividend/interest aggregates (same as #555 at state level)
Refs #555 (national counterpart): Uncapped PUF incomes + calibration weights produce ~19x inflated state-level aggregates
Refs #530 (CPS top-coding caps AGI at $6.26M, zero observations above $10M in any state) / #537 (Add PUF + source impute modules, fix AGI ceiling (issue #530)): original AGI ceiling removal that triggered this

🤖 Generated with Claude Code

… interest

The calibration optimizer was leaving capital gains, dividends, and interest aggregates 5-15x inflated relative to CBO targets in the default enhanced_cps_2024 dataset. Two records with raw $62M / $79M LTCG ended up with calibration weights of 73k / 51k, together contributing 87% of the inflated $9.92T national LTCG aggregate (real CBO target: $1.29T for 2024).

Root cause is the same as #555 at state level: when PR #537 removed the AGI ceiling on PUF imputation, calibration weights weren't re-tuned to constrain the new high-income tail. The bracket-level SOI targets in build_loss_matrix can be satisfied while a few records absorb extreme weight to fill the population/AGI targets, overshooting the unconstrained component aggregates as a side effect.

Two changes:

1. utils/loss.py (build_loss_matrix, used by enhanced_cps_2024): Add three CBO income_by_source aggregate targets (net_capital_gains, qualified_dividend_income, and taxable_interest_and_ordinary_dividends) so the optimizer is pulled toward these national totals (soft loss targets, not hard bounds).

2. calibration/target_config.yaml (used by unified_calibration / national/US.h5): add per-AGI-bracket net_capital_gains targets (the DB already has them from SOI ETL, they just weren't included), plus re-include dividend_income, qualified_dividend_income, and taxable_interest_income aggregates that were previously dropped for "high error or tension". 30% rel-error on a soft target is still vastly better than no constraint.

Refs #555, #866.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MaxGhenis · 2026-05-04T00:53:01Z

Local rebuild / validation update after the latest fixes.

Pushed additional fixes in 0b82e67e:

Keep all-return SOI rows for investment-income controls, including low-AGI investment-income brackets, while avoiding duplicate taxable-only investment constraints.
Make the Forbes top-tail path usable when the IRS $100M+ aggregate-row target exceeds eligible Forbes units by scaling synthetic replicate weights to the aggregate-row population instead of falling back to donor synthesis.
Treat missing aggregate amount targets as zero during Forbes/PUF amount calibration.
Spread Forbes integer remainder weights across units; subagent re-review after that fix found no remaining findings.
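The last item concerns splitting an integer weight across several units without piling the leftover onto one record. Below is a minimal sketch of one deterministic way to do that. `spread_integer_weight` is a hypothetical helper written for illustration, not the PR's implementation, whose details are not shown here.

```python
def spread_integer_weight(total, n_units):
    """Split an integer total across n_units so shares differ by at most 1.

    The leftover units go to evenly spaced positions rather than all to
    the first (or last) rows, so the result does not favor one end.
    """
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    base, remainder = divmod(total, n_units)
    weights = [base] * n_units
    for k in range(remainder):
        # Evenly spaced index for the k-th extra unit; indices are distinct
        # because remainder < n_units.
        idx = (k * n_units) // remainder
        weights[idx] += 1
    return weights


print(spread_integer_weight(10, 4))
```

The shares always sum to the original total, and the result depends only on the inputs, which keeps repeated builds reproducible.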

I also rolled the HF policyengine/policyengine-us-data main artifacts back to the clean Apr 23 release (release_manifest version 1.86.1, source revision d702ccbc77cf9c5a83c7b46027d8f07ab5bf6410). New HF commit is df8acc14419272899d5a0949e457157aaf4b7edc; rollback workflow: https://github.com/PolicyEngine/policyengine-us-data/actions/runs/25290847940.

Local build path run:

uv run python -m policyengine_us_data.datasets.puf.puf
uv run python -m policyengine_us_data.datasets.cps.extended_cps
uv run python -m policyengine_us_data.calibration.create_stratified_cps
uv run python -m policyengine_us_data.calibration.create_source_imputed_cps
uv run python -m policyengine_us_data.datasets.cps.enhanced_cps

The final enhanced_cps run passed post-generation weight validation. Note: after that full H5 build I made the deterministic Forbes remainder-spreading micro-fix; it is covered by unit tests, but I did not spend another full rebuild on that last row-ordering adjustment.

Aggregate sanity check, 2026

Metric	Broken before	Local rebuilt H5	Reference	Rebuilt ratio
`net_capital_gains`	$20,700B	$1,397B	~$1,700B	0.82x
`long_term_capital_gains`	$9,920B	$1,645B	~$1,700B	0.97x
`qualified_dividend_income`	~$3,400B	$391B	~$400B	0.98x
`taxable_interest_income`	not captured	$351B	~$500B	0.70x
`adjusted_gross_income`	not captured	$16,742B	~$18,810B	0.89x
`income_tax`	not captured	$2,303B	~$2,200B	1.05x
`household_net_income`	not captured	$15,075B	~$22,000B	0.69x
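The "Rebuilt ratio" column is the rebuilt aggregate divided by the reference value. A short sketch that recomputes it from the rows above (values in billions of dollars, copied from the table; the dictionary layout is just for illustration):

```python
# metric: (rebuilt $B, reference $B, reported ratio)
rows = {
    "net_capital_gains": (1397, 1700, 0.82),
    "long_term_capital_gains": (1645, 1700, 0.97),
    "qualified_dividend_income": (391, 400, 0.98),
    "taxable_interest_income": (351, 500, 0.70),
    "adjusted_gross_income": (16742, 18810, 0.89),
    "income_tax": (2303, 2200, 1.05),
    "household_net_income": (15075, 22000, 0.69),
}

ratios = {name: rebuilt / ref for name, (rebuilt, ref, _) in rows.items()}
for name, ratio in ratios.items():
    reported = rows[name][2]
    print(f"{name}: computed {ratio:.3f}, reported {reported:.2f}")
```

Every computed ratio agrees with the reported column to two decimal places.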

UBI + top LTCG rate 20% to 25%, 2026

Metric	Broken dataset	Local rebuilt H5
UBI gross outlays	+$681B	+$681B
Cap-gains rate-hike revenue	+$66B	+$60B
Net federal cost	+$614B	+$621B
Gini	0.93	0.5790 -> 0.5564
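As a quick consistency check, the net cost row should equal gross UBI outlays minus the revenue from the rate increase. That identity is an assumption about how the row was built, and the figures below are the rounded values from the table (in billions of dollars).

```python
# Values in $B from the table above.
scenarios = {
    "Broken dataset": {"ubi_outlays": 681, "cap_gains_revenue": 66, "reported_net": 614},
    "Local rebuilt H5": {"ubi_outlays": 681, "cap_gains_revenue": 60, "reported_net": 621},
}

computed = {}
for name, s in scenarios.items():
    computed[name] = s["ubi_outlays"] - s["cap_gains_revenue"]
    print(f"{name}: computed net ${computed[name]}B, reported ${s['reported_net']}B")
```

The rebuilt column matches exactly. The broken-dataset column differs by $1B, which is presumably rounding in the displayed components.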

Interpretation: the national capital-income aggregate bug is fixed. The old negative-AGI LTCG whale is gone: <0 AGI now has only $1.8B LTCG and $0.0B tax-hike contribution. The remaining +$60B cap-gains-rate result is consistent with a static microsimulation of a 5pp top LTCG rate increase. The lower $10-25B TPC/JCT-style range would require a behavioral capital-gains realization response layer; it is not something I would expect this data calibration PR to reproduce by itself. I did not add weight caps.
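One way to see why +$60B is plausible for a static estimate: with no behavioral response, revenue is roughly the rate change times the gains taxed at the top rate. The sketch below backs out that implied base. It is rough arithmetic under that assumption only, it ignores interactions with other provisions, and the variable names are illustrative.

```python
rate_increase = 0.25 - 0.20  # top LTCG rate 20% -> 25%
static_revenue_b = 60.0      # $B, rebuilt-dataset result above
total_ltcg_b = 1645.0        # $B, rebuilt LTCG aggregate from the sanity-check table

# Static assumption: revenue = rate change * gains taxed at the top rate.
implied_top_rate_base_b = static_revenue_b / rate_increase
share_of_total = implied_top_rate_base_b / total_ltcg_b

print(f"implied top-rate gains base: ${implied_top_rate_base_b:,.0f}B")
print(f"share of rebuilt LTCG aggregate: {share_of_total:.0%}")
```

The implied base of about $1,200B sits below the rebuilt $1,645B LTCG aggregate, so the static result does not require more gains than the dataset contains.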

Local verification:

uv run ruff format --check .
uv run ruff check policyengine_us_data/datasets/puf/aggregate_record_utils.py policyengine_us_data/datasets/puf/forbes_backbone.py policyengine_us_data/utils/loss.py tests/unit/datasets/test_disaggregate_puf.py tests/unit/calibration/test_loss_targets.py
uv run pytest tests/unit/datasets/test_disaggregate_puf.py tests/unit/calibration/test_loss_targets.py tests/unit/test_etl_irs_soi_overlay.py tests/unit/calibration/test_target_config.py -q -p no:cacheprovider
# 72 passed

MaxGhenis and others added 3 commits May 3, 2026 14:15

Fix Forbes top-tail scaling and capital-income targets

890835d

Test capital-income AGI target coverage

499f575

MaxGhenis mentioned this pull request May 3, 2026

fix: national enhanced_cps_2024 has 5-15x inflated capital-gainsdividend #867

Closed

MaxGhenis added 2 commits May 3, 2026 20:50

Fix Forbes top-tail fallback and SOI investment targets

9cd571f

Format Forbes top-tail test

0b82e67

MaxGhenis mentioned this pull request May 4, 2026

National enhanced_cps_2024 has 5-15x inflated capital-gains/dividend/interest aggregates (same as #555 at state level) #866

Closed

MaxGhenis merged commit cfc1719 into main May 4, 2026
10 checks passed

MaxGhenis deleted the add-cap-gains-agi-targets branch May 4, 2026 03:04

This was referenced May 4, 2026

Add IRS SOI long-term capital gains target #869

Merged

Add tax-exempt interest calibration targets #880

Closed

Add tax-exempt interest and charitable deduction targets #881

Merged

MaxGhenis mentioned this pull request May 21, 2026

enhanced_cps_2024 overshoots CBO income_tax target by ~1.86x across 2024-2026 — loss weighting drowns out aggregate targets #1107

Open

MaxGhenis merged 5 commits into main from add-cap-gains-agi-targets
