GitHub - Builder106/CapitolAlpha: CapitolAlpha — a Wesleyan QAC 420 project that pipelines 16,203 disclosed Congressional stock trades (2020–2024) and finds a +2.58% Jensen's alpha vs the S&P 500, statistically significant at p < 0.05.

Members of Congress disclosed 16,203 stock trades between 2020 and 2024. Across the 220 purchases that could be benchmarked, returns beat the S&P 500 by an average of 2.58 percentage points over the following 90 days, statistically significant at p < 0.05. This repo is the end-to-end pipeline behind that finding.

Findings page: capitolalpha.vercel.app — the headline stat, the four charts, and the PDFs in one place. The Python pipeline that produced them lives in this repo.

What this is

A semester project for QAC 420 — Data for Good at Wesleyan University. The course's frame is using public data to answer questions of civic accountability; this project asks one of the oldest of them: do federal legislators have an investing edge they shouldn't?

The answer is a reproducible Python pipeline that:

Scrapes Senate Periodic Transaction Reports (PTRs) directly from efdsearch.senate.gov (Playwright + form automation).
Scrapes House PTR PDFs from disclosures-clerk.house.gov (Playwright + pdfplumber table extraction).
Normalizes both chambers into a unified legislative_trades.csv (18 columns, 16,203 rows for 2020–2024).
Pulls market data via yfinance and computes Jensen's alpha, rolling 30/90/180-day returns, and a one-sample t-test against the S&P 500 benchmark.
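The return step above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `forward_return`, the calendar-day window, and the synthetic price series are all assumptions, and real prices would come from yfinance rather than a formula.

```python
import numpy as np
import pandas as pd


def forward_return(prices: pd.Series, entry_date: str, horizon_days: int = 90) -> float:
    """Buy-and-hold return from the first close on or after entry_date to the
    first close on or after entry_date + horizon_days (calendar days).

    Hypothetical helper; returns NaN when the price history is too short.
    """
    prices = prices.sort_index()
    entry = pd.Timestamp(entry_date)
    exit_ = entry + pd.Timedelta(days=horizon_days)
    # searchsorted gives the position of the first trading day >= the target date.
    i = prices.index.searchsorted(entry)
    j = prices.index.searchsorted(exit_)
    if i >= len(prices) or j >= len(prices):
        return float("nan")
    return float(prices.iloc[j] / prices.iloc[i] - 1.0)


# Synthetic daily closes on business days (stand-ins for a ticker and SPY).
days = pd.bdate_range("2021-01-04", periods=200)
stock = pd.Series(100.0 * 1.001 ** np.arange(200), index=days)
bench = pd.Series(100.0 * 1.0005 ** np.arange(200), index=days)

# Excess return of one purchase over the benchmark for the same window.
excess = forward_return(stock, "2021-01-04") - forward_return(bench, "2021-01-04")
print(f"90-day excess return: {excess:.4f}")
```

Measuring the trade and the benchmark over the identical window is what makes the per-trade excess returns comparable before they are averaged and tested.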

Full deliverables:

Abstract — 1-page academic abstract.
Final Reflection — 5-page essay on method, ethics, and future work.
Statistics writeup — the formal statistical workup behind the Jensen's-alpha number.

Key findings

Metric	Congressional purchases	S&P 500 (benchmark)	Effect
Mean 90-day buy-and-hold ROI	13.74%	~11.16%	+2.58 pp
Jensen's alpha (90-day)	—	—	+2.58% (p < 0.05)
Pre-crash sell concentration	Top 5% of sellers timed COVID exits before Feb 20, 2020	—	suggestive, not causal
Trades analyzed	220 benchmarkable purchases out of 16,203 raw rows	—	—
Window	2020-01-01 – 2024-12-31	—	covers COVID + recovery

The full statistical workup — t-statistic, confidence interval, Sharpe ratio, and the KDE of returns — is in notebooks/Financial_Analysis.ipynb.
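For readers who want the shape of that workup without opening the notebook, here is a hedged sketch on synthetic data. The excess returns below are randomly generated, not the project's 220 trades, and the ratio at the end is a simple unannualized mean-over-standard-deviation figure, which may differ from how the notebook defines its Sharpe ratio.

```python
import numpy as np
from scipy import stats

# Synthetic per-trade 90-day excess returns (trade return minus benchmark return).
# The sample size mirrors the 220 benchmarked trades, but the values are made up.
rng = np.random.default_rng(0)
excess = rng.normal(loc=0.03, scale=0.15, size=220)

# One-sample t-test of H0: mean excess return equals zero.
t_stat, p_value = stats.ttest_1samp(excess, popmean=0.0)

# 95% confidence interval for the mean, from the t distribution.
n = excess.size
mean = excess.mean()
sem = stats.sem(excess)  # standard error of the mean (ddof=1)
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

# Sharpe-style ratio of excess returns (assumption: per-trade, not annualized).
sharpe_like = mean / excess.std(ddof=1)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"95% CI for mean excess return: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"mean / std of excess returns: {sharpe_like:.3f}")
```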

Pipeline

flowchart LR
  senate[efdsearch.senate.gov<br/>Senate PTRs] --> sf[senate_fetcher.py<br/>Playwright + HTML parse]
  house[disclosures-clerk.house.gov<br/>House PTR PDFs] --> hf[house_fetcher.py<br/>Playwright + pdfplumber]
  sf --> merge[merge_to_csv.py<br/>normalize + dedupe]
  hf --> merge
  merge --> csv[(data/legislative_trades.csv<br/>16,203 rows × 18 cols)]
  csv --> eda[notebooks/EDA.ipynb<br/>descriptive stats]
  csv --> fin[notebooks/Financial_Analysis.ipynb]
  yf[yfinance<br/>SPY + ticker prices] --> fin
  fin --> alpha[Jensen's alpha<br/>1-sample t-test<br/>Sharpe + KDE]
  alpha --> deliverables[abstract.pdf<br/>Final_Reflection.pdf<br/>Presentation deck]

run_pipeline.py is the orchestration entry point; the per-step modules live in pipeline/. Unit tests for the fetchers and merge step are in tests/.
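As a rough picture of what a normalize-and-dedupe merge step involves, consider the sketch below. It is not the contents of merge_to_csv.py: the six column names, the `normalize` and `merge_trades` helpers, and the sample rows are hypothetical, and the real CSV has 18 columns.

```python
import pandas as pd

# Hypothetical subset of the unified schema (the real file has 18 columns).
COLUMNS = ["chamber", "member", "ticker", "transaction_date", "type", "amount_range"]


def normalize(df: pd.DataFrame, chamber: str) -> pd.DataFrame:
    """Coerce one chamber's raw rows into the shared schema."""
    out = df.copy()
    out["chamber"] = chamber
    out["ticker"] = out["ticker"].str.strip().str.upper()
    out["type"] = out["type"].str.strip().str.lower()
    # Unparseable dates become NaT instead of raising.
    out["transaction_date"] = pd.to_datetime(out["transaction_date"], errors="coerce")
    return out.reindex(columns=COLUMNS)


def merge_trades(senate: pd.DataFrame, house: pd.DataFrame) -> pd.DataFrame:
    """Stack both chambers and drop rows that are identical after normalization."""
    merged = pd.concat(
        [normalize(senate, "senate"), normalize(house, "house")],
        ignore_index=True,
    )
    return merged.drop_duplicates(subset=COLUMNS).reset_index(drop=True)


# Made-up rows: the two Senate rows differ only in casing and whitespace.
senate = pd.DataFrame({
    "member": ["A. Senator", "A. Senator"],
    "ticker": [" aapl", "AAPL"],
    "transaction_date": ["01/15/2021", "01/15/2021"],
    "type": ["Purchase", "purchase "],
    "amount_range": ["$1,001 - $15,000", "$1,001 - $15,000"],
})
house = pd.DataFrame({
    "member": ["B. Rep"],
    "ticker": ["msft"],
    "transaction_date": ["2021-02-03"],
    "type": ["Sale"],
    "amount_range": ["$15,001 - $50,000"],
})

trades = merge_trades(senate, house)
print(trades)
```

Normalizing before deduplicating matters: the two Senate rows only collapse into one because casing and whitespace are cleaned first.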

Reproducing the analysis

# 1. Environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium

# 2a. Fetch data, option A: scrape the official sites (slow, authoritative)
python -m pipeline.run_pipeline --use-official

# 2b. Fetch data, option B: pre-aggregated Senate JSON (fast, fallback)
python -m pipeline.run_pipeline

# 3. Run tests
pytest

# 4. Open the analysis notebooks
jupyter lab notebooks/Financial_Analysis.ipynb

Pipeline-level options (--fresh, --senate-only, --house-only) and source-fallback behavior are documented in pipeline/README_PIPELINE.md.

Caveats

This is an undergraduate course project, not a quantitative-research paper. The findings are best read as a starting point that justifies a deeper study, not a final claim:

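Disclosure lag is large. PTRs can be filed up to 45 days after a trade (the STOCK Act deadline is 30 days from when the filer is notified of the transaction, and no later than 45 days after it), so a 90-day alpha measured from the transaction date is not a 90-day alpha available to a public follower in real time. The signal is retrospective.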
Selection bias on the benchmarked subset. Only 220 trades out of 16,203 had a clean ticker, sufficient price history in yfinance, and a clean entry/exit window. Heavy-tail trades (private placements, fund-of-fund holdings, options) are excluded.
No multiple-testing correction. The 1-sample t-test against SPY at the 90-day horizon was the pre-registered question, but other horizons (30, 180 days) and per-legislator slices were also explored; results in those slices should be treated as exploratory.
Jensen's alpha assumes CAPM, which is a strong assumption on a 220-trade sample over a window that includes both the COVID crash and the post-2020 recovery.
Ticker resolution is noisy. PTR asset descriptions are often free-text ("Common Stock - Apple Inc.") rather than CUSIPs; the matcher resolves the obvious ones and drops the ambiguous ones.
Causality is unidentified. Outperformance is consistent with non-public information, but it is also consistent with sector concentration, age/wealth effects, or skilled advisors. The data here cannot distinguish them.
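To make the CAPM caveat concrete: in its textbook form, Jensen's alpha is the intercept of a regression of portfolio excess returns on market excess returns, R_p - R_f = alpha + beta * (R_m - R_f) + error. The sketch below estimates it with `scipy.stats.linregress` on synthetic daily returns. It illustrates the general technique only and is not necessarily how the notebook computes its alpha; the risk-free rate, the true alpha and beta, and the noise levels are all made-up assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_days = 250

rf = 0.0001                       # assumed constant daily risk-free rate
true_alpha, true_beta = 0.0002, 1.2  # assumed values used to generate the data

# Synthetic market returns and a portfolio that follows CAPM plus noise.
market = rng.normal(0.0005, 0.01, n_days)
noise = rng.normal(0.0, 0.002, n_days)
portfolio = rf + true_alpha + true_beta * (market - rf) + noise

# Regress portfolio excess return on market excess return.
fit = stats.linregress(market - rf, portfolio - rf)
alpha_hat = fit.intercept  # Jensen's alpha (per day, in this setup)
beta_hat = fit.slope       # market beta

# t-statistic for the intercept, using the standard error SciPy reports.
alpha_t = alpha_hat / fit.intercept_stderr

print(f"alpha = {alpha_hat:.6f} per day, beta = {beta_hat:.3f}, t(alpha) = {alpha_t:.2f}")
```

If beta is far from 1, this intercept and the raw mean excess return over the benchmark can diverge, which is one reason the CAPM assumption deserves scrutiny on a small sample.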

The Final Reflection PDF expands on all of these.

Tech stack

Python 3.11+ with pandas, scipy, matplotlib, seaborn
Playwright for browser automation on JavaScript-rendered disclosure sites
pdfplumber for House PTR PDF table extraction
yfinance for SPY and per-ticker price history
pytest for unit tests on the fetchers and merge step
Jupyter for the EDA and financial-analysis notebooks
Flourish for the presentation visualization

Acknowledgments

The Wesleyan QAC for the Data for Good course frame and the methodological feedback through the semester.
timothycarambat/senate-stock-watcher-data as the JSON fallback when official scraping fails.
The U.S. Senate Office of Public Records and the Office of the Clerk of the House for publishing the disclosure data that makes this analysis possible at all.

License

Code released under the MIT License. The underlying disclosure filings are public records in the public domain; the dataset assembled here (data/legislative_trades.csv) is released under the same terms.

