feat: weekly TechAPI refresh pipeline + benchmark enrichment #4
Merged
Add a variant-safe enrichment runner (app/ingest/enrich.py) that fills null benchmark columns on existing TechAPI CPU/GPU records without ever overwriting, writing only on exact heading matches. Backed by per-source scrapers (PassMark, technical.city, cgdirector, notebookcheck, SPEC CPU2006, topcpu.net, Blender, videocardbenchmark) registered in a SOURCES table. Extend the CPU/GPU models with legacy + cross-aggregator benchmark fields, add network-free unit tests for the source parsers, and wire a cpu-only enrich step into weekly-ingest.
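The fill-only rule described above can be sketched as follows. This is a minimal illustration, not the code in app/ingest/enrich.py: the record shape, the field names (`passmark_multi`, `passmark_single`), the `register` decorator, and the canned scraper are all assumptions made for the example. Only the `SOURCES` name and the two rules (exact heading match, never overwrite a non-null value) come from the commit message.

```python
from typing import Callable, Dict, List, Optional

# Assumed shape: each scraper returns {heading: {field: value}}.
Scraped = Dict[str, Dict[str, float]]

# Registry of scrapers keyed by source name (the commit calls this SOURCES;
# the structure shown here is a guess).
SOURCES: Dict[str, Callable[[], Scraped]] = {}


def register(name: str):
    """Hypothetical helper that adds a scraper to SOURCES."""
    def deco(fn: Callable[[], Scraped]) -> Callable[[], Scraped]:
        SOURCES[name] = fn
        return fn
    return deco


@register("example_source")
def scrape_example() -> Scraped:
    # Stand-in for a live scraper; canned data keeps the sketch offline.
    return {
        "Ryzen 7 5800X": {"passmark_multi": 28000.0, "passmark_single": 3450.0},
        "Ryzen 7 5800X3D": {"passmark_multi": 28500.0},
    }


def enrich(records: List[dict],
           sources: Optional[Dict[str, Callable[[], Scraped]]] = None) -> int:
    """Fill None-valued fields in place; return the number of fields written."""
    sources = SOURCES if sources is None else sources
    written = 0
    for scraper in sources.values():
        scraped = scraper()
        for rec in records:
            # Exact heading match only: no fuzzy or prefix matching, so a
            # variant such as "5800X3D" never receives "5800X" numbers.
            fields = scraped.get(rec["heading"])
            if fields is None:
                continue
            for key, value in fields.items():
                # Write only into known columns that are currently null.
                if key in rec and rec[key] is None:
                    rec[key] = value
                    written += 1
    return written
```

With a record for "Ryzen 7 5800X" that already has a single-thread score, only the missing multi-thread score is written. A record headed "Ryzen 7 5800" is left untouched because no scraped heading matches it exactly.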
Add .github/workflows/weekly-refresh.yml: a Monday cron (and manual dispatch) that live-scrapes every CPU/GPU benchmark source into a TechAPI checkout, gates the full dataset on app.validate plus a strict integrity_check, regenerates the static v1 dump and openapi.json into site/public, and opens a dated refresh/<date> PR via peter-evans/create-pull-request. The cross-repo PR step is guarded by secrets.TECHAPI_TOKEN; without it the job still collects, validates, dumps, and uploads artifacts. Add a --strict mode to integrity_check.py that exits non-zero on hard anomalies (duplicate slugs, slug/file mismatch, single>multi) while keeping statistical outliers advisory.
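As a rough sketch of how a `--strict` gate of this kind could behave: hard anomalies are always reported, but they change the exit code only in strict mode. The record keys (`slug`, `file_stem`, `single`, `multi`) and both function names are hypothetical; the actual integrity_check.py may be organized quite differently, and the advisory statistical-outlier checks are omitted here.

```python
import sys
from collections import Counter
from typing import List


def hard_anomalies(records: List[dict]) -> List[str]:
    """Return messages for anomalies that should block a refresh."""
    problems = []
    counts = Counter(r["slug"] for r in records)
    for slug, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"duplicate slug: {slug} ({n} records)")
    for r in records:
        # Assumed layout: each record lives in a file named <slug>.<ext>.
        if r["file_stem"] != r["slug"]:
            problems.append(f"slug/file mismatch: {r['slug']} in {r['file_stem']}")
        single, multi = r.get("single"), r.get("multi")
        # A single-thread score above the multi-thread score is treated as bad data.
        if single is not None and multi is not None and single > multi:
            problems.append(f"single>multi: {r['slug']} ({single} > {multi})")
    return problems


def exit_code(records: List[dict], strict: bool) -> int:
    """Report hard anomalies; return non-zero only when strict is set."""
    problems = hard_anomalies(records)
    for p in problems:
        print(p, file=sys.stderr)
    return 1 if (strict and problems) else 0
```

A CI step would pass the result to `sys.exit`, so a non-zero code fails the job before the dump and PR steps run, while a non-strict run stays advisory.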
Pin the public TechAPI repo as a submodule tracking main, mirroring TechAPI's link back to TechEngine. Browsing/link only — the weekly-refresh workflow uses a separate token-authenticated checkout for writes.
Sort the import block and wrap an over-long assert in test_gpu_sources.py so 'ruff check app tests' passes in CI.
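Overview
A pipeline that, once a week, automatically collects live benchmarks, validates the integrity of the full dataset, generates the static dump, and opens a dated branch plus an automatic PR against the public TechAPI repository.
Included commits
feat(ingest): multi-source benchmark enrichment runner plus 9 scrapers (PassMark, technical.city, cgdirector, notebookcheck, SPEC CPU2006, topcpu.net, Blender, videocardbenchmark). Fills only null columns and never overwrites. Extends the CPU/GPU models with benchmark fields and adds network-free unit tests.
feat(ci): .github/workflows/weekly-refresh.yml: Monday 06:00 UTC cron plus manual dispatch. Collects from 12 sources, gates on app.validate and integrity_check.py --strict, runs app.dump, then opens a refresh/<date> PR via peter-evans/create-pull-request. Adds a --strict mode to integrity_check.py that blocks only on hard anomalies (duplicate slugs, slug ≠ filename, single>multi).
chore: link TechAPI as a submodule (gitlink, tracking main), for browsing/linking only.
Token behavior
The cross-repo PR step is guarded by secrets.TECHAPI_TOKEN. Without the token, the job still runs collection, validation, and the dump, then uploads artifacts; only the PR is skipped.
Verification
Confirmed that integrity_check.py --strict passes on the current data.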