social_security stored input contradicts its four components for ~2k records (decomposition gap) #183

Description

@MaxGhenis

Summary

social_security is shipped as a stored input in populace_us_2024.h5, but it disagrees with the sum of its four components (social_security_retirement / _disability / _survivors / _dependents) for 2,019 of 160,858 person records (1.255%). In policyengine-us, social_security is an adds variable (defined as the sum of those four components), so shipping the aggregate as a stored input that contradicts the components is internally inconsistent.

Pattern

Almost all mismatches are records where the stored total is positive but all four components are $0 — the total was imputed/assigned but never decomposed. Worst case: stored $121,655, all four components $0.
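
To make the pattern concrete, here is a minimal sketch that separates "positive total, all four components zero" from any other kind of mismatch. The function name, the label strings, and the toy rows are illustrative assumptions, not part of the build; the $1 tolerance mirrors the repro further down.

```python
import pandas as pd

COMPONENTS = [
    "social_security_dependents",
    "social_security_disability",
    "social_security_retirement",
    "social_security_survivors",
]


def classify_mismatches(df: pd.DataFrame, tol: float = 1.0) -> pd.Series:
    """Label each record by how its stored total relates to its components.

    Hypothetical helper for illustration only.
    """
    comp_sum = df[COMPONENTS].sum(axis=1)
    mismatch = (df["social_security"] - comp_sum).abs() > tol
    all_zero = (df[COMPONENTS] == 0).all(axis=1)
    undecomposed = mismatch & (df["social_security"] > 0) & all_zero

    labels = pd.Series("consistent", index=df.index)
    labels.loc[mismatch] = "other_mismatch"
    labels.loc[undecomposed] = "undecomposed_total"
    return labels


# Toy rows (made up, except that the third reuses the worst-case amount above).
toy = pd.DataFrame(
    {
        "social_security": [0.0, 20_000.0, 121_655.0, 15_000.0],
        "social_security_dependents": [0.0, 0.0, 0.0, 0.0],
        "social_security_disability": [0.0, 0.0, 0.0, 5_000.0],
        "social_security_retirement": [0.0, 20_000.0, 0.0, 0.0],
        "social_security_survivors": [0.0, 0.0, 0.0, 0.0],
    }
)
labels = classify_mismatches(toy)
print(labels.value_counts())
```

Running the same function on the real person table should show whether "almost all" holds and what the remaining mismatches look like.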

Materiality (low)

Weighted, this is negligible:

  • Affected records carry 4,852 of 337,861,201 household weight ≈ 0.0014% of population (80% have weight < 0.01).
  • Weighted Social Security gap: ≈ $0.3B = 0.02% of total benefits.
  • Concentrated in near-zero-weight, high-income tail records (median age 59 — not all retirement-age).
  • A handful have moderate weight (up to ~865), e.g. an age-40 record with $64,235 of undecomposed SS.

So it does not move weighted aggregates. Filing as low-priority correctness/hygiene, not a numbers bug.
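
The weighted figures above can be reproduced with something like the following sketch. The weight column name (`weight`) is an assumption, since the issue does not say which column carries the weight, and the denominator for the benefit share is assumed to be the weighted stored total. The data here is a made-up toy frame.

```python
import pandas as pd

COMPONENTS = [
    "social_security_dependents",
    "social_security_disability",
    "social_security_retirement",
    "social_security_survivors",
]


def weighted_materiality(df: pd.DataFrame, weight_col: str = "weight", tol: float = 1.0) -> dict:
    """Summarize how much weight and weighted benefit the mismatched records carry.

    `weight_col` is a hypothetical column name; adjust to the real dataset.
    """
    gap = df["social_security"] - df[COMPONENTS].sum(axis=1)
    affected = gap.abs() > tol
    w = df[weight_col]
    weighted_gap = float((gap * w)[affected].sum())
    weighted_total = float((df["social_security"] * w).sum())
    return {
        "affected_records": int(affected.sum()),
        "affected_weight_share": float(w[affected].sum() / w.sum()),
        "weighted_gap": weighted_gap,
        "weighted_gap_share": weighted_gap / weighted_total,
    }


# Toy frame: one near-zero-weight record with an undecomposed total.
toy = pd.DataFrame(
    {
        "weight": [1000.0, 500.0, 0.005],
        "social_security": [20_000.0, 0.0, 100_000.0],
        "social_security_dependents": [0.0, 0.0, 0.0],
        "social_security_disability": [0.0, 0.0, 0.0],
        "social_security_retirement": [20_000.0, 0.0, 0.0],
        "social_security_survivors": [0.0, 0.0, 0.0],
    }
)
summary = weighted_materiality(toy)
print(summary)
```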

Why it's still worth fixing

Because social_security is an adds variable, any consumer that strips the stored aggregate and recomputes from components — standard policyengine-us behavior, and what calibration pipelines that drop "pseudo-input" aggregates do — silently zeroes out SS for these records, trusting the (incomplete) components over the (complete) stored total. The drop is invisible downstream.
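
A minimal sketch of that failure mode, using plain pandas rather than the actual policyengine-us machinery: `recompute_from_components` is a hypothetical stand-in for any consumer that drops the stored aggregate and rebuilds it as the component sum. The two toy rows are illustrative; the second reuses the $64,235 amount mentioned above.

```python
import pandas as pd

COMPONENTS = [
    "social_security_dependents",
    "social_security_disability",
    "social_security_retirement",
    "social_security_survivors",
]


def recompute_from_components(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the stored aggregate and rebuild it as the sum of the components.

    Hypothetical stand-in for an adds-style recompute, not a real API.
    """
    out = df.drop(columns=["social_security"])
    out["social_security"] = out[COMPONENTS].sum(axis=1)
    return out


stored = pd.DataFrame(
    {
        "social_security": [20_000.0, 64_235.0],
        "social_security_dependents": [0.0, 0.0],
        "social_security_disability": [0.0, 0.0],
        "social_security_retirement": [20_000.0, 0.0],
        "social_security_survivors": [0.0, 0.0],
    }
)
recomputed = recompute_from_components(stored)

# Benefit lost per record once the stored total is replaced by the component sum.
lost = stored["social_security"] - recomputed["social_security"]
print(lost.tolist())
```

The first record is unchanged, while the second loses its entire benefit with no error or warning, which is why the drop goes unnoticed downstream.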

Repro

import pandas as pd

f = "populace_us_2024.h5"  # f0af251 build
comps = [f"social_security_{x}" for x in ["dependents", "disability", "retirement", "survivors"]]
df = pd.read_hdf(f, "person", columns=["social_security"] + comps)
# Gap between the stored aggregate and the sum of its four components
gap = df["social_security"] - df[comps].sum(axis=1)
# Count records that differ by more than $1 and report the largest gap
print((gap.abs() > 1).sum(), "records mismatch; max", gap.abs().max())

Suggested fix

Either populate the four components so they sum to the stored total during the SS imputation/decomposition step, or stop shipping the stored social_security aggregate so the adds formula is the single source of truth (consistent by construction).
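
Whichever option is chosen, a guard in the build could keep the inconsistency from coming back. The sketch below is an assumption about what such a guard might look like, not existing pipeline code: both function names are hypothetical, and the $1 tolerance is borrowed from the repro.

```python
import pandas as pd

COMPONENTS = [
    "social_security_dependents",
    "social_security_disability",
    "social_security_retirement",
    "social_security_survivors",
]


def check_social_security_decomposition(df: pd.DataFrame, tol: float = 1.0) -> None:
    """Raise if any stored total differs from its component sum by more than tol dollars."""
    gap = df["social_security"] - df[COMPONENTS].sum(axis=1)
    bad = gap.abs() > tol
    if bad.any():
        raise ValueError(
            f"{int(bad.sum())} records have social_security != sum of components "
            f"(max gap {gap.abs().max():.2f})"
        )


def drop_stored_aggregate(df: pd.DataFrame) -> pd.DataFrame:
    """Second option: ship only the components so the sum is the sole source of truth."""
    return df.drop(columns=["social_security"])


# Toy frame that is already consistent.
consistent = pd.DataFrame(
    {
        "social_security": [20_000.0, 0.0],
        "social_security_dependents": [0.0, 0.0],
        "social_security_disability": [0.0, 0.0],
        "social_security_retirement": [20_000.0, 0.0],
        "social_security_survivors": [0.0, 0.0],
    }
)
check_social_security_decomposition(consistent)  # passes silently
components_only = drop_stored_aggregate(consistent)
```

How the undecomposed totals should be allocated across the four components is a modeling decision that this sketch deliberately leaves open.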

Build observed: populace-us-2024-f0af251-703bd81a565c-20260620 (latest at filing; the pattern is likely longstanding across builds).
