feat(ingest): weekly crawler that opens PRs to TechAPI with new SKUs #2

@Seungpyo1007

What

A scheduled crawler that ingests new SKUs from canonical upstream sources, normalizes them into TechAPI's JSON schema, and opens a PR against the TechAPI repo. Curators review before merge — fully automated end-to-end is not the goal; reducing toil is.

Builds on the coverage detector (#1) — coverage tells us what is missing; ingest tries to propose the data.

Why

Even with #1 producing a missing-SKU list, manually authoring each JSON record from scratch is the bottleneck. Most fields (cores, clocks, TDP, socket, process node) are available verbatim on vendor product pages and Wikipedia infoboxes; a crawler that drafts an 80%-complete record cuts curation per SKU from minutes of authoring to seconds of review.

Architecture

  • app/ingest/ package with one fetcher per source family
  • Shared app/ingest/normalize.py — maps source-specific fields → TechAPI schema (units, slug shape, segment enum, manufacturer brand slug lookup)
  • app/ingest/pipeline.py — orchestrates: pull coverage gaps → for each gap, fetch detail page → normalize → write JSON to a worktree of TechAPI → open PR
  • Authentication: PRs are opened with a fine-grained PAT scoped to GetTechAPI/TechAPI with contents and pull-request write access, stored as the TECHAPI_PR_TOKEN secret on TechEngine
  • New workflow .github/workflows/weekly-ingest.yml — Mondays after the coverage report runs
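
A rough sketch of what the shared normalize step could look like, to make the "source-specific fields → TechAPI schema" mapping concrete. Everything here is an assumption for illustration: the raw field names (`name`, `manufacturer`, `base_clock`), the output keys other than `verified` and `source_urls`, the `BRAND_SLUGS` table, and the helper names are hypothetical and not taken from the TechAPI schema.

```python
import re
from typing import Optional

# Hypothetical brand lookup; the real manufacturer slugs live in TechAPI.
BRAND_SLUGS = {"advanced micro devices": "amd", "amd": "amd", "intel": "intel"}


def slugify(name: str) -> str:
    """Lowercase the name and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def parse_clock_mhz(text: str) -> Optional[int]:
    """Parse strings like '3.4 GHz' or '3400 MHz' into integer MHz; None if unparseable."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(GHz|MHz)", text, re.IGNORECASE)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2).lower()
    return round(value * 1000) if unit == "ghz" else round(value)


def normalize_cpu(raw: dict, source_url: str) -> dict:
    """Map one fetched record (hypothetical field names) to a draft TechAPI-style record."""
    brand_key = raw.get("manufacturer", "").strip().lower()
    return {
        "slug": slugify(raw["name"]),
        "manufacturer": BRAND_SLUGS.get(brand_key),  # None when the brand is unknown
        "base_clock_mhz": parse_clock_mhz(raw.get("base_clock", "")),
        "verified": False,  # humans flip this on review
        "source_urls": [source_url],
    }
```

Fields that cannot be parsed come back as `None` rather than a guess, so a later step can decide whether the PR should be opened as a draft.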

Output contract

Each opened PR:

  • Modifies only data/<category>/<...>/<slug>.json
  • One PR per category per run (so reviews are bounded)
  • PR body lists every added SKU with a one-line provenance string and links the source URL
  • Marked as draft if any field required by the schema is missing or could not be confidently parsed; otherwise ready-for-review
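
One way the output contract above might translate into code: a minimal sketch that builds the PR body and decides draft vs. ready-for-review. The `REQUIRED_FIELDS` list, the `provenance` key, the title format, and the function names are assumptions for illustration; the real required fields come from the TechAPI schema.

```python
# Hypothetical subset of schema-required fields, for illustration only.
REQUIRED_FIELDS = ["slug", "manufacturer", "cores"]


def missing_required(record: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required field names that are absent or empty in the record."""
    return [name for name in required if record.get(name) in (None, "")]


def build_pr(category: str, records: list) -> dict:
    """Build title, body, and draft flag for one per-category PR."""
    lines = [f"Adds {len(records)} {category} SKU(s):", ""]
    draft = False
    for rec in records:
        gaps = missing_required(rec)
        draft = draft or bool(gaps)  # any gap in any record makes the PR a draft
        note = f" (missing: {', '.join(gaps)})" if gaps else ""
        # One line per SKU: slug, provenance string, source URL.
        lines.append(f"- {rec['slug']}: {rec['provenance']} <{rec['source_urls'][0]}>{note}")
    return {
        "title": f"ingest({category}): {len(records)} new SKU(s)",
        "body": "\n".join(lines),
        "draft": draft,
    }
```

The returned dict is shaped so it could be passed more or less directly to whatever client opens the PR; that client is not shown here.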

Safety rails

  • Never modify existing files in the same PR (additions only)
  • Slug conflicts: skip + log
  • Each record carries verified: false and the source URL in source_urls; humans flip verified on review
  • Rate-limit per host with a polite User-Agent
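
Two of the rails above, sketched under assumptions: a per-host rate limiter and an additions-only writer that skips and logs on slug conflicts. The class and function names, the 2-second default interval, and the `USER_AGENT` string are hypothetical. The clock and sleep functions are injectable so the limiter can be tested without real waiting.

```python
import json
import time
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical User-Agent; a real one should identify the project and a contact.
USER_AGENT = "TechEngine-ingest (contact: see repository)"


class HostRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent permitted request

    def wait(self, url: str) -> float:
        """Sleep if the host was hit too recently; return the delay applied."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0 if last is None else max(0.0, self.min_interval - (now - last))
        if delay:
            self._sleep(delay)
        self._last[host] = now + delay
        return delay


def write_new_record(root: Path, relpath: str, record: dict) -> bool:
    """Write a record only if the file does not exist yet; skip and log otherwise."""
    target = root / relpath
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" fails if the file exists, so existing records are never modified.
        with target.open("x", encoding="utf-8") as fh:
            json.dump(record, fh, indent=2)
            fh.write("\n")
    except FileExistsError:
        print(f"skip: {relpath} already exists (slug conflict)")
        return False
    return True
```

Opening with mode `"x"` makes the additions-only rule a property of the write itself rather than a separate existence check that could race.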

Out of scope

  • Updates/corrections to existing records (separate workflow later)
  • Sources without stable structured data (e.g. forum-only specs)

Acceptance

  • python -m app.ingest --category cpu --limit 5 drafts 5 candidate JSON files and prints the diff
  • Weekly workflow opens at least one PR per run against TechAPI
  • All proposed records pass python -m app.validate
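
For the first acceptance item, a possible shape for `app/ingest/__main__.py`. Only the `--category` and `--limit` flags come from this issue; the default limit, the `draft_candidates` callable, and the diff format are assumptions, and the stub default returns nothing so the sketch stays self-contained.

```python
import argparse
import difflib
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m app.ingest")
    parser.add_argument("--category", required=True, help="e.g. cpu")
    parser.add_argument("--limit", type=int, default=5, help="max SKUs to draft")
    return parser


def new_file_diff(path: str, record: dict) -> str:
    """Render a unified diff that shows the record as a newly added file."""
    text = json.dumps(record, indent=2, sort_keys=True) + "\n"
    diff = difflib.unified_diff(
        [], text.splitlines(keepends=True), fromfile="/dev/null", tofile=path
    )
    return "".join(diff)


def main(argv=None, draft_candidates=lambda category, limit: []) -> int:
    """draft_candidates is a hypothetical hook yielding (path, record) pairs."""
    args = build_parser().parse_args(argv)
    for path, record in draft_candidates(args.category, args.limit):
        print(new_file_diff(path, record), end="")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

In the real package, `draft_candidates` would be replaced by the pipeline call that pulls coverage gaps and normalizes each fetched record.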
