Ten working DuckDB queries with real output from the SomaliScan archive. Every example here has been run against this dataset; the numbers you see are the actual results.
The whole archive is Apache Parquet — use DuckDB:
# macOS
brew install duckdb
# Linux / Windows
# see https://duckdb.org/docs/installation/Two ways to access the data:
-- (a) Query from Hugging Face without downloading anything
SELECT * FROM 'hf://datasets/somaliscan/spending-archive/eidl_loans.parquet' LIMIT 10;
-- (b) After local download: query the file path directly
SELECT * FROM '/path/to/somaliscan-data/eidl_loans.parquet' LIMIT 10;The examples below use the local-path form. To run them against the
Hugging Face mirror, replace each 'output/...' path with
'hf://datasets/somaliscan/spending-archive/...'.
SELECT borrower_name,
current_approval_amount::INT AS approved,
date_approved
FROM 'output/ppp_loans/_year=2020/*.parquet'
WHERE borrower_state = 'CA'
ORDER BY current_approval_amount DESC
LIMIT 5;| borrower_name | approved | date_approved |
|-----------------------------------|---------:|---------------|
| RUBIO'S RESTAURANTS, INC. | 10000000 | 2020-04-09 |
| PRODUCERS DAIRY FOODS INC. | 10000000 | 2020-04-10 |
| DRILL TECH DRILLING & SHORING INC | 10000000 | 2020-04-03 |
| UNICAL AVIATION INC. | 10000000 | 2020-04-03 |
| KING'S SEAFOOD COMPANY, LLC | 10000000 | 2020-04-14 |
Notice the path pattern: _year=2020 is a Hive-style partition
directory. DuckDB only reads that year's files — no need to scan all
of ppp_loans.
SELECT recipient_name,
ROUND(SUM(federal_action_obligation)::DOUBLE / 1e9, 1) AS billions
FROM 'output/federal_contracts_v2/**/*.parquet'
WHERE recipient_name IS NOT NULL
GROUP BY recipient_name
ORDER BY billions DESC
LIMIT 5;| recipient_name | billions |
|--------------------------------------|---------:|
| LOCKHEED MARTIN CORPORATION | 39.5 |
| THE BOEING COMPANY | 26.5 |
| OPTUM PUBLIC SECTOR SOLUTIONS, INC. | 20.1 |
| RAYTHEON COMPANY | 17.7 |
| NORTHROP GRUMMAN SYSTEMS CORPORATION | 14.8 |
**/*.parquet is a recursive glob — reads every yearly partition under
federal_contracts_v2/. This is how you aggregate across all years
of a partitioned table.
SELECT borrower_state,
ROUND(SUM(current_approval_amount)::DOUBLE / 1e9, 2) AS billions,
COUNT(*) AS loans
FROM 'output/ppp_loans/**/*.parquet'
WHERE borrower_state IS NOT NULL
GROUP BY borrower_state
ORDER BY billions DESC
LIMIT 8;| borrower_state | billions | loans |
|----------------|---------:|---------:|
| CA | 103.16 | 1270587 |
| TX | 62.75 | 938551 |
| NY | 60.73 | 734797 |
| FL | 50.06 | 983317 |
| IL | 37.30 | 620021 |
| PA | 30.47 | 342420 |
| OH | 27.19 | 351984 |
| NJ | 25.59 | 301929 |
11.5M PPP loans, $793B total, in roughly a second.
SELECT country_name,
ROUND(SUM(constant_amount)::DOUBLE / 1e9, 2) AS billions_usd
FROM 'output/foreign_aid.parquet'
WHERE country_name IS NOT NULL
AND constant_amount IS NOT NULL
GROUP BY country_name
ORDER BY billions_usd DESC
LIMIT 8;| country_name | billions_usd |
|------------------------|-------------:|
| World | 831.73 |
| Israel | 389.86 |
| Afghanistan | 312.38 |
| Egypt | 224.24 |
| Iraq | 192.46 |
| Vietnam (former South) | 150.24 |
| Pakistan | 104.03 |
| Ukraine | 101.00 |
World is the aggregated multilateral/global category from the State
Department's accounting. Bilateral country aid follows.
SELECT recipient_first_name || ' ' || recipient_last_name AS doctor,
recipient_state,
ROUND(SUM(total_amount)::DOUBLE, 0) AS total_usd,
COUNT(*) AS n_payments
FROM 'output/open_payments_general/_year=2024/*.parquet'
WHERE recipient_last_name IS NOT NULL
GROUP BY doctor, recipient_state
ORDER BY total_usd DESC
LIMIT 5;| doctor | recipient_state | total_usd | n_payments |
|------------------|-----------------|-----------:|-----------:|
| CHARLES GOODIS | FL | 91,082,706 | 1 |
| ROBERT MEDOFF | HI | 26,746,052 | 3 |
| NITIN GOYAL | VA | 25,000,342 | 7 |
| IVAN OSORIO | KS | 17,456,053 | 5 |
| STEPHEN BURKHART | TX | 17,332,176 | 3 |
These are CMS Sunshine Act records of industry payments to physicians — typically large numbers represent royalty payments or device-license buyouts, not single doctor visits.
SELECT filing_year,
COUNT(*) AS filings,
ROUND(SUM(income)::DOUBLE / 1e6, 0) AS income_millions,
ROUND(SUM(expenses)::DOUBLE / 1e6, 0) AS expenses_millions
FROM 'output/lobbying_filings.parquet'
WHERE filing_year BETWEEN 2020 AND 2024
GROUP BY filing_year
ORDER BY filing_year;| filing_year | filings | income_millions | expenses_millions |
|------------:|--------:|----------------:|------------------:|
| 2020 | 84081 | 1808 | 2760 |
| 2021 | 90755 | 2013 | 2838 |
| 2022 | 93475 | 2175 | 3234 |
| 2023 | 95226 | 2240 | 3196 |
| 2024 | 96603 | 2336 | 3265 |
Federal lobbying expenses cross $3B/year for the first time in 2022 and keep growing. Filing volume up ~15% over five years.
SELECT issue_code,
COUNT(*) AS activities
FROM 'output/lobbying_activities.parquet'
GROUP BY issue_code
ORDER BY activities DESC
LIMIT 8;| issue_code | activities |
|------------|-----------:|
| BUD | 201288 | -- Budget/Appropriations
| HCR | 143979 | -- Health Care Reform
| TAX | 121123 | -- Taxation
| DEF | 90904 | -- Defense
| TRA | 77648 | -- Transportation
| MMM | 66599 | -- Medicare/Medicaid
| ENG | 65870 | -- Energy
| TRD | 62607 | -- Trade/Tariffs
Issue codes follow the Senate LDA scheme — the upstream source has the full key.
WITH ppp AS (
SELECT borrower_name, SUM(current_approval_amount) AS ppp_total
FROM 'output/ppp_loans/**/*.parquet'
GROUP BY borrower_name
),
contracts AS (
SELECT recipient_name, SUM(federal_action_obligation) AS contract_total
FROM 'output/federal_contracts_v2/**/*.parquet'
GROUP BY recipient_name
)
SELECT ppp.borrower_name AS company,
ROUND(ppp_total::DOUBLE / 1e6, 1) AS ppp_millions,
ROUND(contract_total::DOUBLE / 1e9, 1) AS contract_billions
FROM ppp JOIN contracts
ON UPPER(ppp.borrower_name) = UPPER(contracts.recipient_name)
WHERE ppp_total > 1000000
AND contract_total > 1e9
ORDER BY contract_total DESC
LIMIT 5;| company | ppp_millions | contract_billions |
|------------------------------|-------------:|------------------:|
| ATLANTIC DIVING SUPPLY, INC. | 5.0 | 4.9 |
A federal contractor with billions in DoD work who also drew PPP — the
kind of cross-source pattern that's invisible from any single dataset.
This is a basic UPPER(name) join; for production work, join via
organizations.organization_id for cleaner entity matching.
SELECT name,
state,
ROUND(asset_amount::DOUBLE / 1e9, 2) AS assets_billions
FROM 'output/nonprofits.parquet'
WHERE asset_amount > 0
ORDER BY asset_amount DESC
LIMIT 5;| name | state | assets_billions |
|-------------------------------------------|-------|----------------:|
| LILLY ENDOWMENT INC | IN | 79.91 |
| GATES FOUNDATION | WA | 78.67 |
| GATES FOUNDATION TRUST | WA | 75.53 |
| PRESIDENT AND FELLOWS OF HARVARD COLLEGE | MA | 74.37 |
| FIDELITY INVESTMENTS CHARITABLE GIFT FUND | MA | 66.79 |
Asset values from IRS Form 990 filings via the Business Master File.
-- Quick column list + types
DESCRIBE SELECT * FROM 'output/eidl_loans.parquet' LIMIT 0;
-- Row count + min/max date range
SELECT COUNT(*) AS rows,
MIN(action_date) AS earliest,
MAX(action_date) AS latest
FROM 'output/eidl_loans.parquet';| rows | earliest | latest |
|-------:|------------|------------|
| 192385 | 2020-04-03 | 2021-09-15 |
For any partitioned table, look at the directory structure to see available years:
ls output/ppp_loans/ # _year=2020/ _year=2021/
ls output/state_checkbook/ # _year=2003/ ... _year=2026/- Reading multiple tables at once:
FROM 'output/{ppp_loans,eidl_loans}/**/*.parquet'(DuckDB's brace expansion) - Streaming a huge result to a new Parquet:
COPY (SELECT ...) TO 'my_subset.parquet' (FORMAT 'parquet') - Use from Python/pandas:
import duckdb; df = duckdb.sql("SELECT * FROM 'eidl_loans.parquet' LIMIT 1000").df() - Use from R:
library(duckdb); con <- dbConnect(duckdb()); dbGetQuery(con, "SELECT * FROM 'eidl_loans.parquet' LIMIT 100") - Use from polars:
import polars as pl; df = pl.read_parquet('eidl_loans.parquet')
For larger cross-table investigations, see the examples/
directory.