Add README scraper + initial README corpus#12

Merged: AriESQ merged 3 commits into master from scrape-readmes on May 5, 2026
@AriESQ (Owner) commented May 5, 2026

Summary

  • Adds scripts/scrape_readmes.py: fetches READMEs for all starred repos via GitHub REST API, stores at readmes/<owner>+<repo>+<filename>, persists ETag/sha/size/status metadata in github_stars.json.
  • Adds tests/test_scrape_readmes.py: 13 unit tests (mocked HTTP, tmp_path sandbox). All passing.
  • Adds .github/workflows/scrape-readmes.yml: reusable workflow, called by sync-stars.yml as pipeline step 3.
  • Updates sync-stars.yml to chain scrape-readmes after update-lists, before deploy.
  • Adds .github/secret_scanning.yml to exclude readmes/** from push protection (READMEs are mirrored public content; example/demo secrets in them are not ours to rotate).
  • Includes the initial README corpus (6,640 files) and updated github_stars.json with ETags for all repos.
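The storage layout and metadata fields from the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/scrape_readmes.py: the helper names `readme_path` and `readme_metadata`, the dictionary key names, and the choice to drop any directory part of the README filename are all assumptions.

```python
from pathlib import Path


def readme_path(root: Path, owner: str, repo: str, filename: str) -> Path:
    """Build readmes/<owner>+<repo>+<filename> (hypothetical helper)."""
    # Assumption: only the final path component of the README is kept,
    # so "docs/README.md" is stored as "...+README.md".
    name = Path(filename).name
    return root / f"{owner}+{repo}+{name}"


def readme_metadata(etag: str, sha: str, size: int, status: str) -> dict:
    """Collect the four fields the PR says are persisted per README.

    The key names are assumed; the PR only lists ETag/sha/size/status.
    """
    return {"etag": etag, "sha": sha, "size": size, "status": status}


if __name__ == "__main__":
    print(readme_path(Path("readmes"), "octocat", "hello-world", "README.md"))
```

A flat directory with a `+` separator keeps every README in one place while still making the owner and repo recoverable from the filename.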

Why the corpus is in the PR

The initial scrape took a long time, and GitHub Actions (GHA) runs should not have to repeat it. Committing the corpus together with the ETags means every subsequent GHA run issues conditional GETs and receives 304 Not Modified for unchanged READMEs, so no content is re-downloaded and there is no cold start.
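A conditional GET of this kind might look like the sketch below. It uses only the standard library and the GitHub REST endpoint for a repository's README; the function names, the returned dictionary shape, and the injectable `opener` parameter are assumptions made for illustration, and authentication and rate-limit handling are omitted. The `not_modified` label mirrors the status named in the smoke test plan.

```python
import urllib.error
import urllib.request

# GitHub REST endpoint that returns a repository's README.
API = "https://api.github.com/repos/{owner}/{repo}/readme"


def conditional_headers(etag=None):
    """Send If-None-Match only when a stored ETag is available."""
    headers = {"Accept": "application/vnd.github+json"}
    if etag:
        headers["If-None-Match"] = etag
    return headers


def fetch_readme(owner, repo, etag=None, opener=urllib.request.urlopen):
    """Fetch a README, or report not_modified when the ETag still matches.

    `opener` is injectable so the logic can be exercised without network
    access (an assumption of this sketch, not the PR's design).
    """
    req = urllib.request.Request(
        API.format(owner=owner, repo=repo),
        headers=conditional_headers(etag),
    )
    try:
        with opener(req) as resp:
            return {
                "status": "fetched",
                "etag": resp.headers.get("ETag"),
                "body": resp.read(),
            }
    except urllib.error.HTTPError as err:
        # urllib surfaces 304 Not Modified as an HTTPError.
        if err.code == 304:
            return {"status": "not_modified", "etag": etag, "body": None}
        raise
```

On a 304 the stored ETag is carried forward unchanged, so the next run can send the same validator again.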

GHA metadata preservation

All three scrapers load-merge-save github_stars.json; the readme block survives scrape_stars.py and update_star_lists.py write cycles intact (verified by code review).
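The load-merge-save pattern can be illustrated with a small sketch. The actual layout of github_stars.json is not shown in this PR, so the example assumes a top-level object keyed by "owner/repo" with one named block per scraper; `update_block` is a hypothetical helper, not a function from the repository.

```python
import json
from pathlib import Path


def update_block(path: Path, repo_key: str, block_name: str, block) -> dict:
    """Load the JSON file, replace one named block for one repo, save it back.

    Blocks written by other scrapers under the same repo entry are left
    untouched, which is the property the PR relies on.
    """
    # Load: start from the existing file so other writers' data is kept.
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Merge: touch only this scraper's block.
    data.setdefault(repo_key, {})[block_name] = block
    # Save: write to a temp file, then rename over the original.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(data, indent=2, sort_keys=True), encoding="utf-8")
    tmp.replace(path)
    return data
```

Writing through a temporary file and renaming is a design choice of this sketch; it avoids leaving a half-written file behind if a run is interrupted.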

Smoke test plan (before merging)

  • Trigger scrape-readmes workflow manually on this branch
  • Confirm logs show not_modified for essentially all repos
  • Confirm the workflow's commit step exits cleanly with no changes to push

🤖 Generated with Claude Code

AriESQ and others added 3 commits May 4, 2026 21:33
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6,640 README files fetched locally; ETags stored in github_stars.json so
subsequent GHA runs get 304 Not Modified and never repeat the cold start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
READMEs are public content mirrored from public repos. Secret-looking
strings in them (example webhooks, demo OAuth credentials, scanner
test patterns) are not real secrets we own or can rotate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@AriESQ AriESQ merged commit bddd2fa into master May 5, 2026
1 of 2 checks passed
@AriESQ AriESQ deleted the scrape-readmes branch May 5, 2026 01:48

1 participant