What
Detect SKUs that exist in upstream catalogs but are missing from the curated TechAPI dataset, and surface them as an actionable list (issue body or PR comment) so curators can fill the gaps.
Why
The curated dataset is a subset by design, but without manual auditing it is hard to know what is missing. As of the initial scaffold, CPU coverage alone is ~925 records: credible, but visibly sparse for older Intel/AMD lineups and mid-tier Xeon/EPYC. We need a continuous signal of "what does upstream have that we don't?" so that curation effort can be targeted.
Sources (per category, kebab-case slugs reconciled)
- CPU: Intel ARK (vendor product index), AMD product pages, Wikipedia List_of_Intel_*_microprocessors, TechPowerUp CPU DB
- GPU: NVIDIA / AMD / Intel product pages, TechPowerUp GPU DB, Wikipedia List_of_*_graphics_processing_units
- SoC / Smartphone / Brand: vendor product pages + Wikipedia infobox tables
Per source: produce a canonical set of slugs, then compute set(upstream) - set(curated).
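The per-source step above can be sketched as follows. This is only a minimal illustration: the names `to_slug` and `missing_slugs` and the exact normalization rules are assumptions, not something the issue specifies, and the real reconciliation of kebab-case slugs may need vendor-specific handling.

```python
import re
import unicodedata


def to_slug(name: str) -> str:
    """Reduce a product name to a kebab-case slug (hypothetical rule set)."""
    # Split accented letters into base letter + combining mark, then drop
    # everything that is not plain ASCII (accents, (R) and (TM) glyphs).
    ascii_name = (
        unicodedata.normalize("NFD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def missing_slugs(upstream_names, curated_slugs):
    """Return upstream slugs absent from the curated set, sorted for stable output."""
    upstream = {to_slug(n) for n in upstream_names}
    return sorted(upstream - set(curated_slugs))


gaps = missing_slugs(
    ["AMD Ryzen 7 5800X", "Intel® Core™ i7-9700K"],
    {"intel-core-i7-9700k"},
)
print(gaps)
```

Sorting the difference keeps the output stable from run to run, which should make week-over-week diffs of the report easier to read.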
Deliverables
- app/coverage/ package
  - One module per source (intel_ark.py, wikipedia_cpu.py, …) that fetches & yields a normalized list of (category, slug, name, url) tuples
  - app/coverage/report.py: aggregates and writes a Markdown report (top-N missing per category, with source link for each)
- New workflow .github/workflows/coverage-report.yml: weekly cron, runs the aggregator, opens or updates a single sticky issue on TechAPI titled "Coverage gaps (auto-generated)" with the latest report
- Tests covering at least the normalization layer (input HTML/JSON → slug set), with vendored fixtures
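One possible shape for the aggregation step in the deliverables above is sketched here. The `Record` type mirrors the (category, slug, name, url) tuples from the list; `render_report`, its parameters, and the alphabetical ordering used to pick the "top N" are assumptions made for illustration, since the issue does not say how missing entries should be ranked.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set, Tuple


class Record(NamedTuple):
    category: str
    slug: str
    name: str
    url: str


def render_report(
    upstream: Iterable[Record],
    curated: Set[Tuple[str, str]],
    top_n: int = 20,
) -> str:
    """Render a Markdown report of upstream records missing from the curated set."""
    missing = defaultdict(dict)
    for rec in upstream:
        if (rec.category, rec.slug) not in curated:
            # Keep the first record seen per slug so duplicates across sources collapse.
            missing[rec.category].setdefault(rec.slug, rec)

    lines = ["# Coverage gaps (auto-generated)", ""]
    for category in sorted(missing):
        # Assumed ranking: alphabetical by slug, truncated to top_n entries.
        records = sorted(missing[category].values(), key=lambda r: r.slug)
        lines.append(f"## {category} ({len(records)} missing)")
        for rec in records[:top_n]:
            lines.append(f"- [{rec.name}]({rec.url}) (`{rec.slug}`)")
        lines.append("")
    return "\n".join(lines)
```

Because the function takes plain records and returns a string, it can be tested against vendored fixtures without any network access.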
Out of scope
- Actually adding the missing records (that's #2)
- Quality scoring of existing records (deferred)
Acceptance
- python -m app.coverage exits 0 and writes coverage-report.md
- Workflow runs weekly and updates the sticky issue on TechAPI
- At least 3 sources wired up (1 for CPU, 1 for GPU, 1 for smartphone/SoC) end-to-end
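The first acceptance criterion could be met by an entry point along these lines. This is a sketch under assumptions: `build_report` is a hypothetical placeholder for the real aggregator, and the idea that `app/coverage/__main__.py` would call `main()` is a guess at the package layout, not something the issue confirms.

```python
from pathlib import Path

REPORT_PATH = Path("coverage-report.md")


def build_report() -> str:
    # Placeholder for the real aggregator; returns a fixed Markdown body here.
    return "# Coverage gaps (auto-generated)\n\nNo sources wired up yet.\n"


def main(report_path: Path = REPORT_PATH) -> int:
    """Write the report and return the process exit code."""
    report_path.write_text(build_report(), encoding="utf-8")
    return 0  # zero exit code is what the acceptance check looks for


# In app/coverage/__main__.py (assumed layout), the module would end with:
#     raise SystemExit(main())
```

Returning the exit code from `main` rather than calling `sys.exit` inside it keeps the function easy to call from tests with a temporary output path.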