What
A scheduled crawler that ingests new SKUs from canonical upstream sources, normalizes them into TechAPI's JSON schema, and opens a PR against the TechAPI repo. Curators review before merge: the goal is to reduce toil, not to automate end to end.
Builds on the coverage detector (#1) — coverage tells us what is missing; ingest tries to propose the data.
Why
Even with #1 producing a missing-SKU list, manually authoring each JSON record from scratch is the bottleneck. Most fields (cores, clocks, TDP, socket, process node) are available verbatim on vendor product pages and Wikipedia infoboxes; a crawler that drafts an 80%-complete record cuts curation time per SKU from minutes of authoring to seconds of review.
Architecture
- app/ingest/ package with one fetcher per source family
- Shared app/ingest/normalize.py: maps source-specific fields → TechAPI schema (units, slug shape, segment enum, manufacturer brand slug lookup)
- app/ingest/pipeline.py: orchestrates the run (pull coverage gaps → for each gap, fetch detail page → normalize → write JSON to a worktree of TechAPI → open PR)
- Authentication: PR open via a fine-grained PAT scoped to GetTechAPI/TechAPI content+PR write, stored as TECHAPI_PR_TOKEN secret on TechEngine
- New workflow .github/workflows/weekly-ingest.yml: Mondays, after the coverage report runs
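As a rough illustration of the normalize step, the sketch below maps one source's field labels onto schema keys, converts units, and derives a slug. The source labels, the schema key names (cores, boost_clock_mhz, tdp_w, socket) and the slug rule are all assumptions made for the example; the real TechAPI schema is not shown in this issue.

```python
from __future__ import annotations

import re

# Hypothetical source labels mapped to hypothetical schema keys.
# Neither side is taken from the real TechAPI schema.
FIELD_MAP = {
    "Total Cores": "cores",
    "Max Turbo Frequency": "boost_clock_mhz",
    "TDP": "tdp_w",
    "Socket": "socket",
}


def slugify(name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def parse_clock_mhz(text: str) -> int | None:
    """Convert '5.8 GHz' or '3600 MHz' to integer MHz; None if unparseable."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ghz|mhz)\s*", text, flags=re.IGNORECASE)
    if m is None:
        return None
    value = float(m.group(1))
    return round(value * 1000) if m.group(2).lower() == "ghz" else round(value)


def parse_int(text: str) -> int | None:
    """Take the leading integer from strings like '125 W' or '24'."""
    m = re.match(r"\s*(\d+)", text)
    return int(m.group(1)) if m else None


PARSERS = {
    "cores": parse_int,
    "boost_clock_mhz": parse_clock_mhz,
    "tdp_w": parse_int,
    "socket": lambda text: text.strip() or None,
}


def normalize(name: str, raw: dict[str, str]) -> tuple[dict, list[str]]:
    """Return (record, unparsed): the drafted record plus the keys that failed to parse."""
    record = {"slug": slugify(name), "name": name}
    unparsed = []
    for label, key in FIELD_MAP.items():
        value = PARSERS[key](raw[label]) if label in raw else None
        if value is None:
            unparsed.append(key)
        else:
            record[key] = value
    return record, unparsed
```

Returning the list of unparsed keys alongside the record lets the pipeline decide later whether the PR should be opened as a draft, rather than having the normalizer guess at missing values.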
Output contract
Each opened PR:
- Modifies only data/<category>/<...>/<slug>.json
- One PR per category per run (so reviews are bounded)
- PR body lists every added SKU with a one-line provenance string and links the source URL
- Marked as draft if any field required by the schema is missing or could not be confidently parsed; otherwise ready-for-review
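One way the contract above could translate into code is sketched here: a PR body with one provenance line per SKU, and a draft flag that is set as soon as any candidate has an unparsed required field. The Candidate type, its field names, and the body layout are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical container for one drafted record."""
    slug: str
    source_url: str
    provenance: str  # one-line description of where the data came from
    unparsed: list[str] = field(default_factory=list)  # required keys not parsed


def is_draft(candidates: list[Candidate]) -> bool:
    """The PR is a draft if any candidate is missing a required field."""
    return any(c.unparsed for c in candidates)


def render_pr_body(category: str, candidates: list[Candidate]) -> str:
    """List every added SKU with its provenance string and source URL."""
    lines = [f"Adds {len(candidates)} {category} SKU(s)", ""]
    for c in candidates:
        line = f"- {c.slug}: {c.provenance} ({c.source_url})"
        if c.unparsed:
            line += f" [unparsed: {', '.join(c.unparsed)}]"
        lines.append(line)
    return "\n".join(lines)
```

Flagging the unparsed fields inline in the body is an addition beyond the stated contract; it is meant to point reviewers at the records that caused the draft status.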
Safety rails
- Never modify existing files in the same PR (additions only)
- Slug conflicts: skip + log
- Each record carries verified: false and the source URL in source_urls; humans flip verified on review
- Rate-limit per host with a polite User-Agent
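The rails above could be enforced roughly as follows. This is a minimal sketch, not the project's implementation: HostRateLimiter and write_new_record are hypothetical names, the 2-second default interval is an arbitrary placeholder, and the caller is assumed to supply the relative path, since the layout under data/<category>/ is elided in this issue.

```python
from __future__ import annotations

import json
import logging
import time
from pathlib import Path
from urllib.parse import urlsplit

log = logging.getLogger("ingest")


class HostRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval_s=2.0, clock=time.monotonic, sleep=time.sleep):
        self._min = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url: str) -> float:
        """Sleep if the host was hit too recently; return the delay applied."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0 if last is None else max(0.0, self._min - (now - last))
        if delay:
            self._sleep(delay)
        self._last[host] = now + delay
        return delay


def write_new_record(worktree: Path, rel_path: str, record: dict, source_url: str) -> bool:
    """Write a record only if no file exists at rel_path; return True if written."""
    target = worktree / rel_path
    payload = {**record, "verified": False, "source_urls": [source_url]}
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" raises FileExistsError instead of overwriting,
        # which keeps the PR additions-only.
        with target.open("x", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2, sort_keys=True)
            fh.write("\n")
    except FileExistsError:
        log.warning("slug conflict, skipping: %s", rel_path)
        return False
    return True
```

Opening the file in exclusive-create mode makes the additions-only rule a property of the write itself, so no separate existence check can race with it. The clock and sleep parameters are injectable only so the limiter can be tested without real delays.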
Out of scope
- Updates/corrections to existing records (separate workflow later)
- Sources without stable structured data (e.g. forum-only specs)
Acceptance
- python -m app.ingest --category cpu --limit 5 drafts 5 candidate JSON files and prints the diff
- Weekly workflow opens at least one PR per run against TechAPI
- All proposed records pass python -m app.validate
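The command line in the first acceptance item could be parsed with something like the sketch below. Only the --category and --limit flags come from this issue; making --category required, leaving --limit unbounded by default, and the build_parser name are assumptions.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for a hypothetical app/ingest/__main__.py."""
    parser = argparse.ArgumentParser(
        prog="python -m app.ingest",
        description="Draft candidate JSON records for missing SKUs.",
    )
    # Assumed to be required; the issue only shows it being passed.
    parser.add_argument("--category", required=True, help="category to ingest, e.g. cpu")
    # Assumed default: no limit when the flag is omitted.
    parser.add_argument("--limit", type=int, default=None, help="maximum number of SKUs to draft")
    return parser
```

Calling build_parser().parse_args(["--category", "cpu", "--limit", "5"]) yields a namespace with category set to "cpu" and limit set to the integer 5.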