Summary
The enhanced_cps_2024 dataset (national, the Microsimulation() default) has the same inflated-aggregates problem that #555 reported at the state level. Capital gains, dividends, and interest income aggregate to roughly 5–25× their CBO/SOI targets, even though adjusted_gross_income and income_tax hit their targets correctly.
This breaks any analysis that touches the income distribution: top-share metrics, Gini, capital-gains revenue scoring, etc.
Aggregates (2026, default Microsimulation())
| Variable | Model 2026 | CBO/real 2026 | Ratio |
| --- | --- | --- | --- |
| net_capital_gains | $20.75T | ~$1.7T (CBO) | 12.2× |
| long_term_capital_gains | $13.30T | ~$1.7T (CBO) | 7.8× |
| short_term_capital_gains | $7.45T | ~$0.3T | 24.8× |
| qualified_dividend_income | $2.25T | ~$0.4T | 5.6× |
| taxable_interest_income | $3.12T | ~$0.5T | 6.2× |
| partnership_s_corp_income | $1.38T | ~$0.6T | 2.3× |
| household_net_income | $82.21T | ~$22T | 3.7× |
| household_market_income | $85.14T | ~$22T | 3.9× |
| adjusted_gross_income | $16.69T | $18.81T (CBO) | 0.9× ✓ |
| income_tax | $2.48T | ~$2.2T | 1.1× ✓ |
Concentration
The bulk of the inflation comes from very few records. Top 30 records by weighted LTCG contribution to the $9.92T 2024 aggregate:
Two records account for 87% of the inflated aggregate. Their raw LTCG values ($62M, $79M) are realistic for an individual top-tail tax return — the problem is they got assigned calibration weights of 73,000 and 51,000, meaning each record represents tens of thousands of households at that income level. That's roughly 2–3 orders of magnitude more than the actual count of $50M+ LTCG households in the US.
This matches the diagnosis in #555 ("calibration weights were not re-tuned" after PR #537 removed the AGI ceiling), but #555 closed scope to state-level files. The same problem exists in the national enhanced_cps_2024.h5 served as the default dataset.
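As a rough arithmetic check on the two-record figures above, the following sketch multiplies the quoted LTCG values by the quoted weights. The pairing of values with weights is an assumption (taken from the order in which they are listed), and all inputs are the rounded numbers from this issue, not values read from the dataset.

```python
# Rounded figures quoted in this issue; the value/weight pairing is an assumption.
records = [
    {"ltcg": 62e6, "weight": 73_000},
    {"ltcg": 79e6, "weight": 51_000},
]
aggregate_2024 = 9.92e12  # weighted LTCG aggregate for 2024, as quoted above

# Weighted contribution of each record to the national aggregate.
contributions = [r["ltcg"] * r["weight"] for r in records]
share = sum(contributions) / aggregate_2024

for c in contributions:
    print(f"weighted contribution: ${c / 1e12:.2f}T")
print(f"combined share of aggregate: {share:.0%}")
```

With these rounded inputs the two records contribute roughly $8.6T combined, which is in line with the 87% share stated above.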
Knock-on effects
- Top-share / Gini metrics are broken. Person-weighted household_net_income Gini is 0.93 (real US ~0.45–0.50). The 99.99th weighted percentile of household_net_income is $579M.
- Cap-gains revenue scoring is overstated by ~3–10× for any reform that hits the top LTCG bracket. PolicyEngine API impact estimates that touch this part of the distribution will report inflated revenue effects.
- Distributional analyses that use deciles based on per-capita household income will show extreme top-decile means ($6.5M for D10) and underweight the lower deciles.
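To illustrate how a couple of over-weighted top-tail records can push a weighted Gini toward the level reported above, here is a minimal sketch on synthetic data. The `weighted_gini` helper, the lognormal incomes, and the two injected records are all illustrative assumptions, not PolicyEngine code or values from the dataset.

```python
import numpy as np

def weighted_gini(values, weights):
    """Gini coefficient of non-negative `values` under sampling `weights`."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    # Lorenz curve: cumulative population share vs. cumulative income share.
    cum_pop = np.concatenate([[0.0], np.cumsum(w) / w.sum()])
    cum_inc = np.concatenate([[0.0], np.cumsum(v * w) / (v * w).sum()])
    # Area under the Lorenz curve by the trapezoid rule.
    area = np.sum((cum_pop[1:] - cum_pop[:-1]) * (cum_inc[1:] + cum_inc[:-1]) / 2)
    return 1.0 - 2.0 * area

rng = np.random.default_rng(0)
# Synthetic baseline: 10,000 records, each representing 1,000 households.
incomes = rng.lognormal(mean=11.0, sigma=0.8, size=10_000)
weights = np.full(10_000, 1_000.0)
base = weighted_gini(incomes, weights)

# Inject two hypothetical $70M records with very large weights.
incomes_bad = np.append(incomes, [70e6, 70e6])
weights_bad = np.append(weights, [60_000.0, 60_000.0])
inflated = weighted_gini(incomes_bad, weights_bad)

print(f"baseline Gini: {base:.2f}, with two over-weighted records: {inflated:.2f}")
```

The second value should come out far higher than the first, because the two injected records hold most of the weighted income while representing about 1% of the weighted population.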
Calibration target is in build_loss_matrix but isn't binding
utils/loss.py does add capital_gains_gross per AGI bracket × filing status (and an "All" aggregate row), but the L0 optimizer either doesn't converge to the cap-gains target or trades it off against sparsity. Either:
- The L0 regularization is too strong and the optimizer prefers concentrated weights (a few records with very high weight) over distributed weights; or
- A competing target (e.g., AGI total in a high-AGI bracket) is forcing weight onto these specific records.
The result is the same as #555: a couple of high-income records absorb extreme weight to satisfy other constraints, blowing up the income-component aggregates.
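One way to check whether a given target is actually binding is to compare each weighted estimate with its target and to see how much of the estimate a single record supplies. The sketch below does this on toy arrays. `target_diagnostics` is a hypothetical helper, and the matrix layout (records in rows, targets in columns) is an assumption for illustration, not necessarily what build_loss_matrix returns.

```python
import numpy as np

def target_diagnostics(loss_matrix, weights, targets):
    """Relative error and largest single-record share for each target.

    loss_matrix: (n_records, n_targets) per-record value for each target.
    weights:     (n_records,) calibrated weights.
    targets:     (n_targets,) target aggregates.
    """
    contrib = loss_matrix * weights[:, None]    # weighted contribution per record
    estimates = contrib.sum(axis=0)             # weighted aggregate per target
    rel_error = estimates / targets - 1.0       # 0.0 means the target is hit
    max_share = contrib.max(axis=0) / estimates # share supplied by the top record
    return rel_error, max_share

# Toy data: column 0 stands in for an AGI-like target, column 1 for cap gains.
loss_matrix = np.array([[100.0, 1.0],
                        [200.0, 0.0],
                        [50.0, 60.0]])
weights = np.array([10.0, 10.0, 5.0])
targets = np.array([3250.0, 100.0])

rel_error, max_share = target_diagnostics(loss_matrix, weights, targets)
print("relative error per target:", rel_error)
print("largest single-record share per target:", max_share)
```

In this toy case the first target is met exactly while the second is overshot and dominated by one record, which is the pattern described above.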
Suggested fixes
1. Add a hard per-record contribution cap to the L0 optimizer in microcalibrate: max(weight × value) per (record, calibration variable) bounded by some fraction of the national target.
2. Or add explicit national aggregate targets as separate (not summable) loss-matrix rows for capital_gains_gross, qualified_dividends, ordinary_dividends, taxable_interest_income, partnership_and_s_corp_income and tighten their relative-error weight.
#555 suggests fix (1). Whichever is chosen, this needs to ship before the dataset is used for any income-distribution analysis.
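A minimal sketch of what the cap in fix (1) could look like, written as a post-hoc clipping step on toy data. This is not the microcalibrate API: `cap_record_contributions` and its `max_fraction` default are hypothetical, and inside a real optimizer the cap would more likely be a projection or penalty applied at each step, followed by re-balancing against the other targets.

```python
import numpy as np

def cap_record_contributions(weights, values, target, max_fraction=0.01):
    """Shrink weights so that no record's weight * |value| exceeds
    max_fraction * target. The default max_fraction is an arbitrary placeholder."""
    weights = np.asarray(weights, dtype=float)
    values = np.asarray(values, dtype=float)
    limit = max_fraction * target
    contrib = weights * np.abs(values)
    # Scale factor is 1 for records within the limit and below 1 otherwise.
    scale = np.ones_like(weights)
    over = contrib > limit
    scale[over] = limit / contrib[over]
    return weights * scale

# Toy inputs using the rounded figures quoted in this issue plus one ordinary record.
weights = np.array([73_000.0, 51_000.0, 1_000.0])
values = np.array([62e6, 79e6, 50_000.0])
target = 1.7e12  # approximate CBO LTCG target quoted above

capped = cap_record_contributions(weights, values, target)
print("capped weights:", capped)
print("largest share of target:", (capped * values).max() / target)
```

Clipping alone lowers the total weighted population, so any real implementation would need to redistribute the removed weight across other records rather than simply discard it.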