Skip to content

Latest commit

 

History

History
340 lines (275 loc) · 10.1 KB

File metadata and controls

340 lines (275 loc) · 10.1 KB

Query cookbook

Ten working DuckDB queries with real output from the SomaliScan archive. Every example here has been run against this dataset; the numbers you see are the actual results.

Setup

The whole archive is Apache Parquet — use DuckDB:

# macOS
brew install duckdb

# Linux / Windows
# see https://duckdb.org/docs/installation/

Two ways to access the data:

-- (a) Query from Hugging Face without downloading anything
SELECT * FROM 'hf://datasets/somaliscan/spending-archive/eidl_loans.parquet' LIMIT 10;

-- (b) After local download: query the file path directly
SELECT * FROM '/path/to/somaliscan-data/eidl_loans.parquet' LIMIT 10;

The examples below use the local-path form. To run them against the Hugging Face mirror, replace each 'output/...' path with 'hf://datasets/somaliscan/spending-archive/...'.


1. Top PPP loans in California, 2020

SELECT borrower_name,
       current_approval_amount::INT AS approved,
       date_approved
FROM 'output/ppp_loans/_year=2020/*.parquet'
WHERE borrower_state = 'CA'
ORDER BY current_approval_amount DESC
LIMIT 5;
|           borrower_name           | approved | date_approved |
|-----------------------------------|---------:|---------------|
| RUBIO'S RESTAURANTS, INC.         | 10000000 | 2020-04-09    |
| PRODUCERS DAIRY FOODS INC.        | 10000000 | 2020-04-10    |
| DRILL TECH DRILLING & SHORING INC | 10000000 | 2020-04-03    |
| UNICAL AVIATION INC.              | 10000000 | 2020-04-03    |
| KING'S SEAFOOD COMPANY, LLC       | 10000000 | 2020-04-14    |

Notice the path pattern: _year=2020 is a Hive-style partition directory. DuckDB only reads that year's files — no need to scan all of ppp_loans.


2. Largest federal contractors (all-time)

SELECT recipient_name,
       ROUND(SUM(federal_action_obligation)::DOUBLE / 1e9, 1) AS billions
FROM 'output/federal_contracts_v2/**/*.parquet'
WHERE recipient_name IS NOT NULL
GROUP BY recipient_name
ORDER BY billions DESC
LIMIT 5;
|            recipient_name            | billions |
|--------------------------------------|---------:|
| LOCKHEED MARTIN CORPORATION          |    39.5  |
| THE BOEING COMPANY                   |    26.5  |
| OPTUM PUBLIC SECTOR SOLUTIONS, INC.  |    20.1  |
| RAYTHEON COMPANY                     |    17.7  |
| NORTHROP GRUMMAN SYSTEMS CORPORATION |    14.8  |

**/*.parquet is a recursive glob — reads every yearly partition under federal_contracts_v2/. This is how you aggregate across all years of a partitioned table.


3. PPP loans by state

SELECT borrower_state,
       ROUND(SUM(current_approval_amount)::DOUBLE / 1e9, 2) AS billions,
       COUNT(*) AS loans
FROM 'output/ppp_loans/**/*.parquet'
WHERE borrower_state IS NOT NULL
GROUP BY borrower_state
ORDER BY billions DESC
LIMIT 8;
| borrower_state | billions |  loans   |
|----------------|---------:|---------:|
| CA             |   103.16 | 1270587  |
| TX             |    62.75 |  938551  |
| NY             |    60.73 |  734797  |
| FL             |    50.06 |  983317  |
| IL             |    37.30 |  620021  |
| PA             |    30.47 |  342420  |
| OH             |    27.19 |  351984  |
| NJ             |    25.59 |  301929  |

11.5M PPP loans, $793B total, in roughly a second.


4. Foreign aid by country (constant dollars)

SELECT country_name,
       ROUND(SUM(constant_amount)::DOUBLE / 1e9, 2) AS billions_usd
FROM 'output/foreign_aid.parquet'
WHERE country_name IS NOT NULL
  AND constant_amount IS NOT NULL
GROUP BY country_name
ORDER BY billions_usd DESC
LIMIT 8;
|      country_name      | billions_usd |
|------------------------|-------------:|
| World                  |       831.73 |
| Israel                 |       389.86 |
| Afghanistan            |       312.38 |
| Egypt                  |       224.24 |
| Iraq                   |       192.46 |
| Vietnam (former South) |       150.24 |
| Pakistan               |       104.03 |
| Ukraine                |       101.00 |

World is the aggregated multilateral/global category from the State Department's accounting. Bilateral country aid follows.


5. Top pharma payments to physicians (2024)

SELECT recipient_first_name || ' ' || recipient_last_name AS doctor,
       recipient_state,
       ROUND(SUM(total_amount)::DOUBLE, 0) AS total_usd,
       COUNT(*) AS n_payments
FROM 'output/open_payments_general/_year=2024/*.parquet'
WHERE recipient_last_name IS NOT NULL
GROUP BY doctor, recipient_state
ORDER BY total_usd DESC
LIMIT 5;
|      doctor      | recipient_state | total_usd  | n_payments |
|------------------|-----------------|-----------:|-----------:|
| CHARLES GOODIS   | FL              | 91,082,706 |          1 |
| ROBERT MEDOFF    | HI              | 26,746,052 |          3 |
| NITIN GOYAL      | VA              | 25,000,342 |          7 |
| IVAN OSORIO      | KS              | 17,456,053 |          5 |
| STEPHEN BURKHART | TX              | 17,332,176 |          3 |

These are CMS Sunshine Act records of industry payments to physicians — typically large numbers represent royalty payments or device-license buyouts, not single doctor visits.


6. Lobbying spending growth, federal LDA

SELECT filing_year,
       COUNT(*) AS filings,
       ROUND(SUM(income)::DOUBLE   / 1e6, 0) AS income_millions,
       ROUND(SUM(expenses)::DOUBLE / 1e6, 0) AS expenses_millions
FROM 'output/lobbying_filings.parquet'
WHERE filing_year BETWEEN 2020 AND 2024
GROUP BY filing_year
ORDER BY filing_year;
| filing_year | filings | income_millions | expenses_millions |
|------------:|--------:|----------------:|------------------:|
|        2020 |   84081 |            1808 |              2760 |
|        2021 |   90755 |            2013 |              2838 |
|        2022 |   93475 |            2175 |              3234 |
|        2023 |   95226 |            2240 |              3196 |
|        2024 |   96603 |            2336 |              3265 |

Federal lobbying expenses cross $3B/year for the first time in 2022 and keep growing. Filing volume up ~15% over five years.


7. What lobbyists lobby about most

SELECT issue_code,
       COUNT(*) AS activities
FROM 'output/lobbying_activities.parquet'
GROUP BY issue_code
ORDER BY activities DESC
LIMIT 8;
| issue_code | activities |
|------------|-----------:|
| BUD        |     201288 |  -- Budget/Appropriations
| HCR        |     143979 |  -- Health Care Reform
| TAX        |     121123 |  -- Taxation
| DEF        |      90904 |  -- Defense
| TRA        |      77648 |  -- Transportation
| MMM        |      66599 |  -- Medicare/Medicaid
| ENG        |      65870 |  -- Energy
| TRD        |      62607 |  -- Trade/Tariffs

Issue codes follow the Senate LDA scheme — the upstream source has the full key.


8. Cross-table: companies that got PPP loans AND federal contracts

WITH ppp AS (
  SELECT borrower_name, SUM(current_approval_amount) AS ppp_total
  FROM 'output/ppp_loans/**/*.parquet'
  GROUP BY borrower_name
),
contracts AS (
  SELECT recipient_name, SUM(federal_action_obligation) AS contract_total
  FROM 'output/federal_contracts_v2/**/*.parquet'
  GROUP BY recipient_name
)
SELECT ppp.borrower_name AS company,
       ROUND(ppp_total::DOUBLE      / 1e6, 1) AS ppp_millions,
       ROUND(contract_total::DOUBLE / 1e9, 1) AS contract_billions
FROM ppp JOIN contracts
  ON UPPER(ppp.borrower_name) = UPPER(contracts.recipient_name)
WHERE ppp_total > 1000000
  AND contract_total > 1e9
ORDER BY contract_total DESC
LIMIT 5;
|           company            | ppp_millions | contract_billions |
|------------------------------|-------------:|------------------:|
| ATLANTIC DIVING SUPPLY, INC. |          5.0 |               4.9 |

A federal contractor with billions in DoD work who also drew PPP — the kind of cross-source pattern that's invisible from any single dataset. This is a basic UPPER(name) join; for production work, join via organizations.organization_id for cleaner entity matching.


9. Largest US nonprofits by assets

SELECT name,
       state,
       ROUND(asset_amount::DOUBLE / 1e9, 2) AS assets_billions
FROM 'output/nonprofits.parquet'
WHERE asset_amount > 0
ORDER BY asset_amount DESC
LIMIT 5;
|                   name                    | state | assets_billions |
|-------------------------------------------|-------|----------------:|
| LILLY ENDOWMENT INC                       | IN    |           79.91 |
| GATES FOUNDATION                          | WA    |           78.67 |
| GATES FOUNDATION TRUST                    | WA    |           75.53 |
| PRESIDENT AND FELLOWS OF HARVARD COLLEGE  | MA    |           74.37 |
| FIDELITY INVESTMENTS CHARITABLE GIFT FUND | MA    |           66.79 |

Asset values from IRS Form 990 filings via the Business Master File.


10. Schema inspection — discovering what's in a table

-- Quick column list + types
DESCRIBE SELECT * FROM 'output/eidl_loans.parquet' LIMIT 0;

-- Row count + min/max date range
SELECT COUNT(*) AS rows,
       MIN(action_date) AS earliest,
       MAX(action_date) AS latest
FROM 'output/eidl_loans.parquet';
| rows   | earliest   | latest     |
|-------:|------------|------------|
| 192385 | 2020-04-03 | 2021-09-15 |

For any partitioned table, look at the directory structure to see available years:

ls output/ppp_loans/         # _year=2020/  _year=2021/
ls output/state_checkbook/   # _year=2003/ ... _year=2026/

More patterns

  • Reading multiple tables at once: FROM 'output/{ppp_loans,eidl_loans}/**/*.parquet' (DuckDB's brace expansion)
  • Streaming a huge result to a new Parquet: COPY (SELECT ...) TO 'my_subset.parquet' (FORMAT 'parquet')
  • Use from Python/pandas: import duckdb; df = duckdb.sql("SELECT * FROM 'eidl_loans.parquet' LIMIT 1000").df()
  • Use from R: library(duckdb); con <- dbConnect(duckdb()); dbGetQuery(con, "SELECT * FROM 'eidl_loans.parquet' LIMIT 100")
  • Use from polars: import polars as pl; df = pl.read_parquet('eidl_loans.parquet')

For larger cross-table investigations, see the examples/ directory.