National enhanced_cps_2024 has 5-15x inflated capital-gains/dividend/interest aggregates (same as #555 at state level) #866

@MaxGhenis

Summary

The enhanced_cps_2024 dataset (national, the Microsimulation() default) has the same inflated-aggregates problem that #555 reported at the state level. Capital gains, dividends, and interest income aggregate to 5–15× their CBO/SOI targets, even though adjusted_gross_income and income_tax hit their targets correctly.

This breaks any analysis that touches the income distribution: top-share metrics, Gini, capital-gains revenue scoring, etc.

Aggregates (2026, default Microsimulation())

Variable                     Model 2026   CBO/real 2026    Ratio
net_capital_gains            $20.75T      ~$1.7T (CBO)     12.2×
long_term_capital_gains      $13.30T      ~$1.7T (CBO)     7.8×
short_term_capital_gains     $7.45T       ~$0.3T           24.8×
qualified_dividend_income    $2.25T       ~$0.4T           5.6×
taxable_interest_income      $3.12T       ~$0.5T           6.2×
partnership_s_corp_income    $1.38T       ~$0.6T           2.3×
household_net_income         $82.21T      ~$22T            3.7×
household_market_income      $85.14T      ~$22T            3.9×
adjusted_gross_income        $16.69T      $18.81T (CBO)    0.9× ✓
income_tax                   $2.48T       ~$2.2T           1.1× ✓
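
The ratio column is simply model aggregate divided by target. A minimal sketch of that check follows, using the figures from the table above (in $ trillions). The function name `flag_inflated` and the 25% tolerance are illustrative assumptions, not part of any PolicyEngine package.

```python
# Model aggregates and approximate targets in $ trillions, copied from the table above.
AGGREGATES = {
    "net_capital_gains": (20.75, 1.7),
    "long_term_capital_gains": (13.30, 1.7),
    "short_term_capital_gains": (7.45, 0.3),
    "qualified_dividend_income": (2.25, 0.4),
    "taxable_interest_income": (3.12, 0.5),
    "partnership_s_corp_income": (1.38, 0.6),
    "household_net_income": (82.21, 22.0),
    "household_market_income": (85.14, 22.0),
    "adjusted_gross_income": (16.69, 18.81),
    "income_tax": (2.48, 2.2),
}


def flag_inflated(aggregates, tolerance=0.25):
    """Return {variable: ratio} where model/target deviates from 1 by more than `tolerance`.

    `tolerance` is a hypothetical threshold chosen for illustration.
    """
    flagged = {}
    for name, (model, target) in aggregates.items():
        ratio = model / target
        if abs(ratio - 1.0) > tolerance:
            flagged[name] = round(ratio, 1)
    return flagged


flagged = flag_inflated(AGGREGATES)
for name, ratio in flagged.items():
    print(f"{name:28s} {ratio:5.1f}x")
```

With this tolerance, adjusted_gross_income and income_tax pass while every income-component row is flagged.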

Concentration

The bulk of the inflation comes from very few records. Top 30 records by weighted LTCG contribution to the $9.92T 2024 aggregate:

rank    idx     weight  raw_ltcg($M)  wtd_ltcg($B)   cum%
   1  52148   73,763.5         62.2     4,586.55   46.3
   2  52077   50,965.1         79.0     4,024.01   86.8
   3  99526   27,023.2          4.9       131.40   88.2
   4  60528   10,563.7          7.0        73.57   88.9
   ...

Two records account for 87% of the inflated aggregate. Their raw LTCG values ($62M and $79M) are realistic for an individual top-tail tax return. The problem is that they were assigned calibration weights of roughly 74,000 and 51,000, so each record stands in for tens of thousands of households at that income level. That's roughly 2–3 orders of magnitude more than the actual count of $50M+ LTCG households in the US.
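
For anyone wanting to run the same concentration check on another variable, here is a rough sketch of the ranking logic. It runs on synthetic data: the two outlier records reuse the raw values and weights from the table, while everything else (the function name `top_contributors`, the lognormal background population, record indices 10 and 20) is made up for illustration. On the real dataset, `values` and `weights` would be the per-record variable and its calibration weight.

```python
import numpy as np


def top_contributors(values, weights, n=5):
    """Rank records by weight * value and report the cumulative share of the weighted total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    contribution = values * weights
    total = contribution.sum()
    order = np.argsort(contribution)[::-1][:n]  # largest weighted contribution first
    cum_share = np.cumsum(contribution[order]) / total
    return [
        (int(i), float(weights[i]), float(values[i]), float(contribution[i]), float(c))
        for i, c in zip(order, cum_share)
    ]


# Synthetic background: 1,000 records with modest gains and ordinary weights.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=8.0, sigma=1.5, size=1_000)
weights = rng.uniform(500, 2_000, size=1_000)

# Two outliers with the raw LTCG and weights reported in the table above.
values[[10, 20]] = [62.2e6, 79.0e6]
weights[[10, 20]] = [73_763.5, 50_965.1]

for idx, w, v, contrib, cum in top_contributors(values, weights, n=3):
    print(f"idx={idx:4d} weight={w:10,.1f} raw=${v / 1e6:6.1f}M "
          f"wtd=${contrib / 1e9:9.2f}B cum={cum:6.1%}")
```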

This matches the diagnosis in #555 ("calibration weights were not re-tuned" after PR #537 removed the AGI ceiling), but #555 was scoped to state-level files only. The same problem exists in the national enhanced_cps_2024.h5 served as the default dataset.

Repro

from policyengine_us import Microsimulation
sim = Microsimulation()
print(f"net_capital_gains 2026: ${sim.calc('net_capital_gains', period=2026).sum() / 1e12:.2f}T")
# Output: net_capital_gains 2026: $20.75T   (CBO target: ~$1.7T)

Knock-on effects

  • Top-share / Gini metrics are broken. Person-weighted household_net_income Gini is 0.93 (real US ~0.45–0.50). The 99.99th weighted percentile of household_net_income is $579M.
  • Cap-gains revenue scoring is overstated by ~3–10× for any reform that hits the top LTCG bracket. PolicyEngine API impact estimates that touch this part of the distribution will report inflated revenue effects.
  • Distributional analyses that use deciles based on per-capita household income will show extreme top-decile means ($6.5M for D10) and underweight the lower deciles.
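
To illustrate why a handful of over-weighted records is enough to wreck a Gini estimate, here is a minimal weighted-Gini sketch on synthetic data. The function `weighted_gini`, the lognormal baseline, and the choice to reuse the two outlier weights and raw values from the concentration table are all assumptions for illustration. It is not the code PolicyEngine uses, it assumes non-negative values, and it is not expected to reproduce the 0.93 figure.

```python
import numpy as np


def weighted_gini(values, weights):
    """Gini coefficient from the weighted Lorenz curve (assumes non-negative values)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    # Cumulative population and income shares, each starting at 0.
    pop = np.concatenate([[0.0], np.cumsum(w) / w.sum()])
    inc = np.concatenate([[0.0], np.cumsum(v * w) / (v * w).sum()])
    # Trapezoid area under the Lorenz curve; Gini = 1 - 2 * area.
    area = np.sum(np.diff(pop) * (inc[1:] + inc[:-1]) / 2.0)
    return 1.0 - 2.0 * area


# Synthetic baseline: 1,000 records, each standing in for 1,000 households.
rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=np.log(60_000), sigma=0.8, size=1_000)
weights = np.full(1_000, 1_000.0)
gini_base = weighted_gini(incomes, weights)

# Append two very high-income records carrying very large weights.
incomes_inflated = np.append(incomes, [62.2e6, 79.0e6])
weights_inflated = np.append(weights, [73_763.5, 50_965.1])
gini_inflated = weighted_gini(incomes_inflated, weights_inflated)

print(f"baseline Gini: {gini_base:.2f}")
print(f"with two over-weighted records: {gini_inflated:.2f}")
```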

Calibration target is in build_loss_matrix but isn't binding

utils/loss.py does add capital_gains_gross per AGI bracket × filing status (and an "All" aggregate row), but the target isn't binding in practice: the L0 optimizer either doesn't converge to the cap-gains target or trades it off against sparsity. Two candidate explanations:

  • The L0 regularization is too strong, so the optimizer prefers concentrated weights (a few records with very high weight) over distributed weights.
  • A competing target (e.g., AGI total in a high-AGI bracket) is forcing weight onto these specific records.

The result is the same as #555: a couple of high-income records absorb extreme weight to satisfy other constraints, blowing up the income-component aggregates.

Suggested fixes

  1. Add a hard per-record contribution cap to the L0 optimizer in microcalibrate: max(weight × value) per (record, calibration variable) bounded by some fraction of the national target.
  2. Or shrink the AGI ceiling back for PUF-imputed records (effectively, cap raw LTCG/dividend/interest values at, e.g., $50M). This was the behavior before PR #537 ("Add PUF + source impute modules, fix AGI ceiling (issue #530)").
  3. Or add explicit national aggregate targets as separate (not summable) loss-matrix rows for capital_gains_gross, qualified_dividends, ordinary_dividends, taxable_interest_income, partnership_and_s_corp_income and tighten their relative-error weight.
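
As a rough sketch of what fix (1) might mean, the snippet below applies a per-record contribution cap as a post-hoc projection on the weights. This is not microcalibrate's API: the function `cap_contributions`, the `max_share` parameter, the 5% value, and the use of ~$1.7T as the target are all hypothetical. Inside the optimizer the same projection would presumably run after each update step, and the weight removed here would still need to be redistributed to keep other targets satisfied.

```python
import numpy as np


def cap_contributions(weights, values, target, max_share=0.05):
    """Shrink weights so no record's weight * |value| exceeds max_share * target.

    Hypothetical helper; `max_share` is an illustrative tuning parameter.
    """
    weights = np.asarray(weights, dtype=float)
    values = np.asarray(values, dtype=float)
    limit = max_share * target
    contribution = weights * np.abs(values)
    scale = np.ones_like(weights)
    over = contribution > limit
    scale[over] = limit / contribution[over]  # scale only the offending records
    return weights * scale


# The four records shown in the concentration table (weights and raw LTCG in $).
weights = np.array([73_763.5, 50_965.1, 27_023.2, 10_563.7])
raw_ltcg = np.array([62.2e6, 79.0e6, 4.9e6, 7.0e6])
TARGET = 1.7e12  # illustrative national target, taken from the ~$1.7T CBO figure

capped = cap_contributions(weights, raw_ltcg, TARGET, max_share=0.05)
for w_old, w_new, v in zip(weights, capped, raw_ltcg):
    print(f"raw=${v / 1e6:5.1f}M weight {w_old:10,.1f} -> {w_new:10,.1f} "
          f"(wtd ${w_new * v / 1e9:6.1f}B)")
```

With a 5% cap against this target, the first three records are scaled down and the fourth is left unchanged.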

#555 suggests fix (1). Whichever is chosen, this needs to ship before the dataset is used for any income-distribution analysis.
