A unified, public-domain archive of US government spending, campaign finance, lobbying, and federal employment data — aggregated from public records into a single queryable corpus.
60 tables · ~696M rows · ~100 GB compressed Parquet · CC0
| Resource | Details |
|---|---|
| Browse / download data | Hugging Face Datasets |
| Permanent mirror | Internet Archive (planned) |
| Reproducibility | scripts/export.py, the exporter used to build this snapshot |
| License | CC0 1.0 (public domain, no attribution required) |
A single coherent view of where US public money goes. Most of this data is publicly available in pieces — scattered across USASpending.gov, FEC bulk downloads, 50 separate state checkbook portals, SBA FOIA releases, CMS Sunshine Act dumps, the IRS Business Master File, and dozens of other sources. The work in this archive is the aggregation, cleaning, and cross-linking into one corpus you can query with one tool.
| Category | Tables | Rows | Notes |
|---|---|---|---|
| Federal spending | 11 | ~107M | USASpending awards/contracts/grants + SBA + FEMA |
| State / local spending | 8 | ~280M | State checkbooks 2003–2026 ($26.9T tracked) |
| Political money | 8 | ~370M | FEC + multi-state campaign finance |
| Lobbying | 6 | ~8M | Senate LDA + California CAL-ACCESS |
| Salaries | 2 | ~39M | Federal (full) + state (GA + MN as of snapshot) |
| Healthcare | 6 | ~50M | CMS Open Payments, Medicare, NPI, childcare |
| Entity graph | 9 | ~30M | Cross-source organization registry + edges |
| Immigration / H-1B | 4 | ~2.6M | DOL LCAs, USCIS aggregates |
| Congress | 3 | ~2.4M | Cosponsors, votes, Federal Register |
| Misc | 3 | ~870K | SNAP retailers, NYC childcare |
Full per-table documentation: docs/tables/
Everything in this archive is Apache Parquet, the de facto columnar data format. The fastest way to use it is DuckDB — install it once, then query the dataset directly from Hugging Face without downloading anything:
```sql
-- DuckDB CLI: query 192K EIDL loans without a single file download
SELECT borrower_name, borrower_state, loan_amount, action_date
FROM 'hf://datasets/somaliscan/spending-archive/eidl_loans.parquet'
WHERE loan_amount > 1000000
ORDER BY loan_amount DESC
LIMIT 20;
```
Or pull the whole thing locally with one command:
```bash
# ~100 GB; works once we've published the HF dataset
huggingface-cli download somaliscan/spending-archive \
  --repo-type=dataset --local-dir=./somaliscan-data
```
More examples in docs/cookbook.md.
SomaliScan started as a working transparency platform — a database backend, a search interface, an AI query layer, investigative work product. It ran for several years and accumulated a corpus that, in aggregate, is much more useful than the sum of its public sources.
Maintaining a live service is expensive. Open-sourcing the corpus is not. This archive is the terminal snapshot: every row of public-record spending data SomaliScan ever ingested, dumped into Parquet, published under CC0, and mirrored to permanent infrastructure. The live site is shutting down; the data is becoming free, forever.
This is a frozen archive, not an updating project. The last data refresh
ran between 2026-01 and 2026-04, depending on the table; see
docs/known-issues.md for per-table currency. If
someone wants to backfill historical FEC data or fill state checkbook gaps,
the scripts/export.py exporter is included for
reproducibility, and the upstream sources are all still public.
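Per-table currency can also be checked against the data itself. The following is a minimal sketch, not part of the archive's tooling: it assumes the `duckdb` Python package is installed with `hf://` support, that the dataset is already published, and that every table lives in a single `<table>.parquet` file at the repository root, as `eidl_loans.parquet` does in the query example. The helper names `table_url`, `currency_query`, and `check_currency` are hypothetical.

```python
import re

# Repository path taken from the DuckDB example in this README.
REPO = "hf://datasets/somaliscan/spending-archive"

# Accept only plain identifiers, since names are interpolated into SQL text.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def table_url(table: str) -> str:
    """Build the hf:// URL for a table (assumes one <table>.parquet per table)."""
    if not _IDENT.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return f"{REPO}/{table}.parquet"


def currency_query(table: str, date_column: str) -> str:
    """Return SQL giving the earliest date, latest date, and row count."""
    if not _IDENT.fullmatch(date_column):
        raise ValueError(f"unexpected column name: {date_column!r}")
    return (
        f"SELECT min({date_column}) AS earliest, "
        f"max({date_column}) AS latest, "
        f"count(*) AS n_rows "
        f"FROM '{table_url(table)}'"
    )


def check_currency(table: str, date_column: str):
    """Run the query remotely; needs network access and the duckdb package."""
    import duckdb  # third-party, imported lazily: pip install duckdb

    return duckdb.sql(currency_query(table, date_column)).fetchone()
```

For example, `check_currency("eidl_loans", "action_date")` would return one `(earliest, latest, n_rows)` tuple for the EIDL table. Date column names for other tables are not listed here, so check docs/tables/ before reusing `action_date` elsewhere.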
We didn't write the data — we organized it. The cracks in the originals are inherited:
- FEC historical gap. `fec_contributions` is currently 2024-focused (241M rows). FEC bulk data 2010–2022 was on the roadmap but never finished; that's ~500M more rows for someone to add.
- State checkbook coverage varies wildly. Texas, California, NY, and 40 other states have 2010–2026 coverage. Florida is 2017–2025 (missing 2010–2016). New Mexico is 2025 only. See docs/known-issues.md.
- State salaries are partial. Only Georgia and Minnesota at snapshot time. The other 48 states publish this data but ingestion was never run.
- Entity linking is partial. The `organizations` table has 22M rows but only ~4% have EINs filled in, so cross-table joins via UEI, EIN, or normalized name will miss rows.
- Healthcare data is recent. CMS Open Payments 2023–2024 only; 2013–2022 historical is bulk-downloadable but wasn't ingested.
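Because identifiers are sparse, one way to work with the entity-linking gap is to try keys in order of reliability and record which one matched. The sketch below illustrates that idea on toy rows. The field names (`org_id`, `name`, `ein`, `uei`), the suffix list, and the sample values are all assumptions made for illustration, not the archive's actual schema or normalization rules; see docs/tables/ for the real columns.

```python
import re

# Illustrative suffix list; the archive's own normalization is not documented here.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "corp", "corporation", "co", "ltd", "lp"}


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def build_indexes(orgs):
    """Index organizations by EIN, UEI, and normalized name (first row wins)."""
    by_ein, by_uei, by_name = {}, {}, {}
    for org in orgs:
        if org.get("ein"):
            by_ein.setdefault(org["ein"], org)
        if org.get("uei"):
            by_uei.setdefault(org["uei"], org)
        key = normalize_name(org["name"])
        if key:
            by_name.setdefault(key, org)
    return by_ein, by_uei, by_name


def match_org(record, indexes):
    """Return (org, method), trying EIN first, then UEI, then normalized name."""
    by_ein, by_uei, by_name = indexes
    ein, uei = record.get("ein"), record.get("uei")
    if ein and ein in by_ein:
        return by_ein[ein], "ein"
    if uei and uei in by_uei:
        return by_uei[uei], "uei"
    key = normalize_name(record.get("name", ""))
    org = by_name.get(key) if key else None
    return (org, "name") if org else (None, "unmatched")


# Toy rows with made-up identifiers, not real archive data.
orgs = [
    {"org_id": 1, "name": "Acme Widgets, Inc.", "ein": "12-3456789", "uei": None},
    {"org_id": 2, "name": "Northside Clinic LLC", "ein": None, "uei": "ABC123DEF456"},
    {"org_id": 3, "name": "River County", "ein": None, "uei": None},
]
indexes = build_indexes(orgs)
```

Keeping the match method alongside each result makes it easy to count how many joins relied on name matching, which is the least reliable of the three keys and the most likely to produce false positives.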
None of these are bugs — they're real coverage gaps in a multi-year ingestion project that stopped at the snapshot point. The archive is honest about what's here.
Some content that existed in SomaliScan's internal database is intentionally not republished:
- Investigation work product (`agents/`, `investigations/`, `bounty-*`)
- Fraud-detection heuristics and flag tables
- Compiled person-of-interest lists
- Materialized views (derivable from base tables)
- Internal staging tables and work queues
If you want the raw data, it's all here. If you want SomaliScan's investigative angle on it — that part is not being published, and you're free to develop your own.
```bibtex
@dataset{somaliscan_spending_2026,
  title = {SomaliScan: US Government Spending Archive 2003--2026},
  author = {SomaliScan Project},
  year = {2026},
  publisher = {Hugging Face Datasets},
  version = {1.0.0},
  url = {https://huggingface.co/datasets/somaliscan/spending-archive},
  license = {CC0-1.0}
}
```
See CITATION.cff for additional formats.
The underlying data is the work product of decades of federal and state disclosure law — every row in this archive exists because someone passed a statute requiring it to be public. Particular thanks to the data teams at:
- USASpending.gov (Treasury/OMB)
- FEC bulk data portal
- SBA FOIA and disclosure office
- IRS Tax Exempt Organizations Search
- CMS Open Payments
- All 50 state comptrollers and secretaries of state
- Open knowledge groups: OpenTheBooks, OpenSecrets methodology, ProPublica Nonprofit Explorer