Correct SIPP licensing language; only IRS-PUF is genuinely restricted #808

@MaxGhenis

Context

In the 2026-04-21 working meeting with Lars Vilhuber (AEA Data Editor), John Sabelhaus, and the TRACE team, John corrected a claim we had been making in slides and discussion: that SIPP requires individual user licensing.

John's comments (paraphrased from the transcript): "The regular SIPP definitely not. Are you using the version of SIPP where you've got the imputed where they did the match and then fuzzed everything up? ... Since I was involved in creating that subsynthetic beta, it's a bit fuzzy as to why they actually require you to sort of agree to something because it is in principle public use data."

Translation: the SIPP vintage we actually consume (pu2023.csv / pu2023_slim.csv on HuggingFace) is effectively public use. These files are the Census Bureau public-use SIPP, not the restricted-access SIPP. Depending on which file we're pulling, they may or may not be the Sabelhaus subsynthetic beta, but per John even that is in principle public use data. The only genuinely restricted input in the PolicyEngine pipeline is IRS-PUF.

Why it matters

We're about to publish a writeup for Lars / the TRACE grant proposal describing the PolicyEngine use case. We need the licensing description to be accurate. Overstating restrictions:

  • Undersells how open the pipeline actually is
  • Muddies the "which inputs genuinely warrant the institutional-certification framing" story
  • Is just factually wrong

Actions

  1. Verify which SIPP product we're using. policyengine_us_data/datasets/sipp/sipp.py pulls pu2023.csv from the HF mirror. Confirm (a) that the CSV was derived from Census Bureau public-use SIPP, and (b) whether any part of the pipeline has ever touched the Sabelhaus subsynthetic beta. sipp/README.md currently cites the Census public-use data dictionary, which is consistent with the public-use vintage, but this is worth spot-checking.

  2. Audit repo claims about SIPP licensing. A grep of policyengine-us-data finds no claim that SIPP requires licensing (✓). But it is worth adding a guardrail: a docstring or a line in sipp/README.md explicitly noting "SIPP is public-use; no per-user license required" so future contributors / LLM agents don't reintroduce the misclaim in docs or blog posts.

  3. Update the IRS-PUF language to make the contrast clearer. IRS-PUF is the single genuinely restricted input. Flagging it more prominently (and SIPP less) describes the pipeline more faithfully.
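
The audit in action 2 could be made repeatable with a small script. The following is a minimal sketch, not what was actually run for this issue: the function name `find_sipp_licensing_claims`, the list of licensing-related terms, and the set of file suffixes are all assumptions chosen for illustration. It flags any line that mentions SIPP together with a licensing-related word.

```python
import re
from pathlib import Path

# Assumed heuristic: a line is suspicious if it mentions SIPP and a
# licensing-related term. The term list is illustrative, not exhaustive.
SIPP = re.compile(r"\bSIPP\b", re.IGNORECASE)
LICENSE_TERMS = re.compile(
    r"\b(licen[sc]\w*|restricted|data use agreement)\b", re.IGNORECASE
)
# Assumed set of file types worth scanning for documentation claims.
TEXT_SUFFIXES = {".md", ".py", ".rst", ".txt"}


def find_sipp_licensing_claims(root):
    """Return (path, line_number, line) for each line under `root` that
    mentions SIPP together with a licensing-related term."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if SIPP.search(line) and LICENSE_TERMS.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling `find_sipp_licensing_claims(".")` from the repo root returns candidate lines for manual review. Hits are not necessarily misclaims: the proposed guardrail sentence ("SIPP is public-use; no per-user license required") would itself match, so each result needs a human read.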

Non-goals

  • Not proposing any change to the SIPP ingest pipeline itself. This is a documentation / external-communication accuracy issue.
  • Not proposing any calibration changes.

Related

  • Meeting on 2026-04-21; John Sabelhaus on SIPP licensing (full transcript with him and Lars)
  • policyengine_us_data/datasets/sipp/README.md
