CharlesScottBradley/somaliscan-data


SomaliScan: US Government Spending Archive (2003–2026)

A unified, public-domain archive of US government spending, campaign finance, lobbying, and federal employment data — aggregated from public records into a single queryable corpus.

60 tables · ~696M rows · ~100 GB compressed Parquet · CC0

Browse / download data: Hugging Face Datasets
Permanent mirror: Internet Archive (planned)
Reproducibility: scripts/export.py, the exporter used to build this snapshot
License: CC0 1.0 (public domain, no attribution required)

What's in here

A single coherent view of where US public money goes. Most of this data is publicly available in pieces — scattered across USASpending.gov, FEC bulk downloads, 50 separate state checkbook portals, SBA FOIA releases, CMS Sunshine Act dumps, the IRS Business Master File, and dozens of other sources. The work in this archive is the aggregation, cleaning, and cross-linking into one corpus you can query with one tool.

By category

| Category | Tables | Rows | Notes |
| --- | --- | --- | --- |
| Federal spending | 11 | ~107M | USASpending awards/contracts/grants + SBA + FEMA |
| State / local spending | 8 | ~280M | State checkbooks 2003–2026 ($26.9T tracked) |
| Political money | 8 | ~370M | FEC + multi-state campaign finance |
| Lobbying | 6 | ~8M | Senate LDA + California CAL-ACCESS |
| Salaries | 2 | ~39M | Federal (full) + state (GA + MN as of snapshot) |
| Healthcare | 6 | ~50M | CMS Open Payments, Medicare, NPI, childcare |
| Entity graph | 9 | ~30M | Cross-source organization registry + edges |
| Immigration / H-1B | 4 | ~2.6M | DOL LCAs, USCIS aggregates |
| Congress | 3 | ~2.4M | Cosponsors, votes, Federal Register |
| Misc | 3 | ~870K | SNAP retailers, NYC childcare |

Full per-table documentation: docs/tables/


Quickstart

Every table in this archive is an Apache Parquet file, a widely used columnar format. The fastest way to use it is DuckDB: install it once, then query the dataset directly from Hugging Face without downloading the files first:

-- DuckDB CLI: query 192K EIDL loans without a single file download
SELECT borrower_name, borrower_state, loan_amount, action_date
FROM 'hf://datasets/somaliscan/spending-archive/eidl_loans.parquet'
WHERE loan_amount > 1000000
ORDER BY loan_amount DESC
LIMIT 20;

Or pull the whole thing locally with one command:

# ~100 GB; works once the Hugging Face dataset is published
huggingface-cli download somaliscan/spending-archive \
  --repo-type=dataset --local-dir=./somaliscan-data

More examples in docs/cookbook.md.
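
For scripted use, the same Quickstart query can be run from Python. The sketch below is not part of the archive's tooling: it assumes the third-party `duckdb` package is installed and recent enough to understand `hf://` paths, and the helper names `build_large_loans_query` and `run_query` are hypothetical. The dataset path and column names are copied from the SQL example above.

```python
# Sketch only: helper names are illustrative, not part of this repository.
HF_SOURCE = "hf://datasets/somaliscan/spending-archive/eidl_loans.parquet"


def build_large_loans_query(source: str, min_amount: int = 1_000_000, limit: int = 20) -> str:
    """Return the Quickstart SQL against `source` (an hf:// URL or a local Parquet path)."""
    if "'" in source:
        # The path is spliced into a SQL string literal, so refuse embedded quotes.
        raise ValueError("source must not contain single quotes")
    return (
        "SELECT borrower_name, borrower_state, loan_amount, action_date\n"
        f"FROM '{source}'\n"
        f"WHERE loan_amount > {int(min_amount)}\n"
        "ORDER BY loan_amount DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str):
    """Execute `sql` with DuckDB and return the rows as a list of tuples."""
    import duckdb  # assumption: installed separately, e.g. `pip install duckdb`

    return duckdb.sql(sql).fetchall()


# Usage (needs network access and the published dataset):
# rows = run_query(build_large_loans_query(HF_SOURCE))
```

After a local download, passing the local file path instead of `HF_SOURCE` should run the same query without touching the network.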


Why this exists, why it ends here

SomaliScan started as a working transparency platform — a database backend, a search interface, an AI query layer, investigative work product. It ran for several years and accumulated a corpus that, in aggregate, is much more useful than the sum of its public sources.

Maintaining a live service is expensive. Open-sourcing the corpus is not. This archive is the terminal snapshot: every row of public-record spending data SomaliScan ever ingested, dumped into Parquet, published under CC0, and mirrored to permanent infrastructure. The live site is shutting down; the data is becoming free, forever.

This is a frozen archive, not an updating project. The last data refresh falls between 2026-01 and 2026-04, depending on the table; see docs/known-issues.md for per-table currency. If someone wants to backfill historical FEC data or fill state checkbook gaps, the scripts/export.py exporter is included for reproducibility, and the upstream sources are all still public.


Honest limitations

We didn't write the data — we organized it. The cracks in the originals are inherited:

  • FEC historical gap. fec_contributions is currently 2024-focused (241M rows). FEC bulk data 2010–2022 was on the roadmap but never finished — that's ~500M more rows for someone to add.
  • State checkbook coverage varies wildly. Texas, California, New York, and 40 other states have 2010–2026 coverage. Florida is 2017–2025 (missing 2010–2016). New Mexico is 2025 only. See docs/known-issues.md.
  • State salaries are partial. Only Georgia and Minnesota at snapshot time. The other 48 states publish this data but ingestion was never run.
  • Entity linking is partial. The organizations table has 22M rows but only ~4% have EINs filled in — cross-table joins via UEI, EIN, or normalized name will miss rows.
  • Healthcare data is recent. CMS Open Payments 2023–2024 only; 2013–2022 historical is bulk-downloadable but wasn't ingested.

None of these are bugs — they're real coverage gaps in a multi-year ingestion project that stopped at the snapshot point. The archive is honest about what's here.
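
Because so few organizations carry an EIN, a cross-table join usually needs a fallback key. The sketch below shows one way to build such a key in Python: use the EIN when it is valid, otherwise fall back to a normalized name. It is an illustration under stated assumptions, not the archive's own logic. The normalization rules actually used in the organizations table are not documented here, and the function names and suffix list are hypothetical.

```python
import re

# Assumption: an illustrative list of trailing legal suffixes, not the archive's own.
_SUFFIXES = {"INC", "INCORPORATED", "LLC", "CORP", "CORPORATION", "CO", "COMPANY", "LTD"}


def normalize_ein(raw):
    """Return the EIN as a 9-digit string, or None if it is missing or malformed."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", str(raw))
    return digits if len(digits) == 9 else None


def normalize_name(raw: str) -> str:
    """Uppercase, replace punctuation with spaces, and drop trailing legal suffixes."""
    cleaned = re.sub(r"[^A-Z0-9 ]+", " ", raw.upper())
    tokens = cleaned.split()
    while len(tokens) > 1 and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def join_key(ein, name: str):
    """Prefer the EIN; fall back to the normalized name when no valid EIN exists."""
    normalized = normalize_ein(ein)
    if normalized is not None:
        return ("ein", normalized)
    return ("name", normalize_name(name))
```

Tagging the key with its kind (`"ein"` or `"name"`) keeps exact identifier matches separate from fuzzier name matches, so name-based links can be reviewed or discarded later.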


What's not in this archive

Some content that existed in SomaliScan's internal database is intentionally not republished:

  • Investigation work product (agents/, investigations/, bounty-*)
  • Fraud-detection heuristics and flag tables
  • Compiled person-of-interest lists
  • Materialized views (derivable from base tables)
  • Internal staging tables and work queues

If you want the raw data, it's all here. If you want SomaliScan's investigative angle on it — that part is not being published, and you're free to develop your own.


Cite this dataset

@dataset{somaliscan_spending_2026,
  title  = {SomaliScan: US Government Spending Archive 2003--2026},
  author = {SomaliScan Project},
  year   = {2026},
  publisher = {Hugging Face Datasets},
  version = {1.0.0},
  url    = {https://huggingface.co/datasets/somaliscan/spending-archive},
  license = {CC0-1.0}
}

See CITATION.cff for additional formats.


Acknowledgments

The underlying data is the work product of decades of federal and state disclosure law — every row in this archive exists because someone passed a statute requiring it to be public. Particular thanks to the data teams at:

  • USASpending.gov (Treasury/OMB)
  • FEC bulk data portal
  • SBA FOIA and disclosure office
  • IRS Tax Exempt Organizations Search
  • CMS Open Payments
  • All 50 state comptrollers and secretaries of state
  • Open knowledge groups: OpenTheBooks, OpenSecrets methodology, ProPublica Nonprofit Explorer

