Add GitHub Actions workflow to sync skills from product repos #8
Implements the automated sync pipeline (Step 5 of onboarding) that sparse-checkouts the skills directory from each registered product repo and mirrors them into this catalog. Runs twice daily on a cron schedule and supports manual dispatch.
Registered repos: cuOpt, TensorRT-LLM, nemotron-voice-agent, NeMo Gym.
Signed-off-by: Sayali Kandarkar <skandarkar@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mosheabr left a comment
Good start on the sync workflow, Sayali — the sparse-checkout approach and idempotent commit logic are solid. A few things to address before this is ready to merge:
Critical
- Cross-repo auth will fail for private repos: `actions/checkout@v4` uses the default `GITHUB_TOKEN`, which only has access to `NVIDIA/skills`. If any of the product repos (`NVIDIA/cuopt`, `NVIDIA/TensorRT-LLM`, etc.) are private, the checkout steps will 403. You'll need a PAT or GitHub App token:

      with:
        token: ${{ secrets.SKILLS_SYNC_PAT }}

- Data loss risk if a checkout fails: Each product block does `rm -rf skills/<product>` before rsync. If the checkout step fails (repo moved, branch renamed, transient error), you've deleted the existing catalog copy with nothing to replace it. Fix: guard the `rm -rf` so it only runs when `.tmp/<product>/skills/` actually exists and is non-empty, or move the delete into a conditional.
- Missing NeMo Evaluator: The catalog currently lists 5 products (cuOpt, TensorRT-LLM, Nemotron Voice Agent, NeMo Gym, NeMo Evaluator). The workflow only syncs 4, so NeMo Evaluator needs a block added.
Important
- Direct push to `main` bypasses branch protection: Consider using `peter-evans/create-pull-request@v6` to open a PR instead of pushing directly, so changes can be reviewed before landing.
- No fault isolation: If one product checkout fails, the entire job fails and no other products get synced. Consider `continue-on-error: true` on each checkout step, or a matrix strategy per product.
- No concurrency control: If a manual dispatch overlaps with a cron run, two pushes could race. Add:

      concurrency:
        group: sync-skills
        cancel-in-progress: true
Minor
- `rm -rf` + `rsync --delete` is redundant: `rsync --delete` already handles file removals from source. The `rm -rf` + `mkdir -p` before it is unnecessary.
- Static commit message: `"chore: sync skills from product repos"` doesn't indicate which products changed. Would be helpful to include a summary.
- No failure notification: If the cron sync silently fails, nobody knows. Consider adding a Slack or email notification step on failure.
Critical fixes:
- Use SKILLS_SYNC_PAT secret for all product repo checkouts (default GITHUB_TOKEN will 403 on private repos)
- Guard rm -rf behind existence + non-empty checks so a failed checkout preserves the existing catalog copy instead of deleting it
- Add missing products from upstream README: Model-Optimizer, Megatron-Core, Megatron-Bridge, NeMo Evaluator (Launcher + Evaluator synced into separate catalog directories to avoid conflicts)
Signed-off-by: Sayali Kandarkar <skandarkar@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rsync --delete already removes destination files not present in the source. The rm -rf + mkdir -p before each rsync was unnecessary — mkdir -p alone handles the first-ever run. Signed-off-by: Sayali Kandarkar <skandarkar@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add continue-on-error: true to each checkout step so a single repo failure (transient 503, repo renamed, branch deleted) does not block the remaining products from syncing. The existing non-empty guard on each copy step already handles the case where a checkout produced nothing. Signed-off-by: Sayali Kandarkar <skandarkar@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
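The non-empty guard mentioned above can be sketched as a small shell helper. This is an illustration, not the workflow's actual code: the function names `has_content` and `sync_product` and the default log path argument are hypothetical, while the `-d` / `ls -A` test and the `rsync -a --delete` call follow what the review and commits describe.

```shell
#!/usr/bin/env bash
# Sketch only: has_content and sync_product are hypothetical helper names.

# Succeeds only when the directory exists and holds at least one entry.
has_content() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Mirror src into dest, but leave dest untouched when src is empty or missing
# (for example after a checkout step that failed under continue-on-error).
sync_product() {
  local name="$1" src="$2" dest="$3" log="${4:-/tmp/synced-products.txt}"
  if has_content "$src"; then
    mkdir -p "$dest"
    # --delete removes files in dest that no longer exist in src.
    rsync -a --delete "$src/" "$dest/"
    echo "- $name" >> "$log"
  else
    echo "$name: checkout empty or missing, keeping existing catalog copy" >&2
  fi
}
```

A copy step could then call something like `sync_product cuOpt .tmp/cuopt/skills skills/cuOpt` (paths here are assumed for illustration). Because the skip branch returns success, a failed checkout leaves the previous catalog copy in place and does not fail the job.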
If a manual workflow_dispatch overlaps with a scheduled cron run, two jobs could race and produce conflicting pushes. The concurrency group ensures only one sync runs at a time, cancelling the in-progress run if a new one is triggered. Signed-off-by: Sayali Kandarkar <skandarkar@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the direct commit-and-push to main with peter-evans/create-pull-request@v6. Changes now land on an automated/sync-skills branch, and the action opens a PR for review, respecting branch protection rules. The action handles idempotency: if no files changed, no PR is created. The branch is auto-deleted after merge. Signed-off-by: Sayali Kandarkar <skandarkar@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each copy step now logs which products were synced. The PR title includes the product names (e.g. "chore: sync skills (cuOpt, TensorRT-LLM)") and the body lists them with the trigger source. Replaces the static "chore: sync skills from product repos" message. Signed-off-by: Sayali Kandarkar <skandarkar@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the workflow fails, a GitHub issue is automatically created with a link to the failed run, the trigger type, and a sync-failure label. This ensures silent cron failures get noticed instead of drifting undetected. Signed-off-by: Sayali Kandarkar <skandarkar@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
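One way the issue text described above could be assembled is sketched below. The PR conversation does not show how the issue is actually created, so the helper name `failure_issue_body`, the issue title, and the use of the `gh` CLI are all assumptions. The environment variables are the default ones GitHub Actions sets for every run.

```shell
#!/usr/bin/env bash
# Sketch only: failure_issue_body is a hypothetical helper name.

# Build the issue text from GitHub Actions' default environment variables.
failure_issue_body() {
  local run_url="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}"
  printf 'The skills sync workflow failed.\n\nRun: %s\nTrigger: %s\n' \
    "$run_url" "$GITHUB_EVENT_NAME"
}

# In a step guarded by `if: failure()`, with GH_TOKEN set, this could be filed as
# (the title is illustrative):
#   gh issue create --title "Skills sync failed" \
#     --body "$(failure_issue_body)" --label sync-failure
```

Keeping the text in a function makes it easy to check locally by exporting the four variables and printing the result before wiring it into the workflow.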
The sync log was initialized with echo "" which wrote a blank line, causing a leading comma in the product list. Use truncate -s 0 to create a truly empty file instead. Signed-off-by: Sayali Kandarkar <skandarkar@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
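The leading-comma bug is easy to reproduce in isolation. The sketch below assumes the log holds one `- Product` line per synced product (as in the copy steps) and that the list is built by joining those lines. The helper `product_list` is hypothetical, since the workflow's real joining logic is not shown in this conversation.

```shell
#!/usr/bin/env bash
# Sketch only: product_list is a hypothetical helper name.

# Turn a log of "- Product" lines into a comma-separated list.
product_list() {
  sed 's/^- //' "$1" | paste -sd, - | sed 's/,/, /g'
}

log="$(mktemp)"

# Buggy initialisation: echo "" writes a newline, so the log starts with a blank line.
echo "" > "$log"
echo "- cuOpt" >> "$log"
echo "- TensorRT-LLM" >> "$log"
buggy="$(product_list "$log")"

# Fixed initialisation: truncate -s 0 leaves a zero-byte file.
truncate -s 0 "$log"
echo "- cuOpt" >> "$log"
echo "- TensorRT-LLM" >> "$log"
fixed="$(product_list "$log")"

echo "buggy: $buggy"                       # prints: buggy: , cuOpt, TensorRT-LLM
echo "fixed: $fixed"                       # prints: fixed: cuOpt, TensorRT-LLM
echo "title: chore: sync skills ($fixed)"  # prints: title: chore: sync skills (cuOpt, TensorRT-LLM)
rm -f "$log"
```

The blank first line becomes an empty first field when the lines are joined, which is where the stray comma comes from. On systems without `truncate`, `: > "$log"` also produces an empty file.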
mosheabr left a comment
Great revision, Sayali. All 9 items from the first review are addressed: PAT auth, data loss guards, fault isolation, concurrency, PR-based commits, dynamic commit messages, and failure notifications. This is solid.
One thing to add before merging: CUDA-Q was just merged into the catalog (#7). The sync workflow needs a block for it:
    # -- CUDA-Q --
    - name: Checkout CUDA-Q
      continue-on-error: true
      uses: actions/checkout@v4
      with:
        repository: NVIDIA/cuda-quantum
        ref: main
        path: .tmp/cuda-quantum
        token: ${{ secrets.SKILLS_SYNC_PAT }}
        sparse-checkout: |
          .claude/skills/
    - name: Copy CUDA-Q skills into catalog
      run: |
        if [ -d ".tmp/cuda-quantum/.claude/skills" ] && [ -n "$(ls -A .tmp/cuda-quantum/.claude/skills)" ]; then
          mkdir -p skills/CUDA-Q
          rsync -a --delete .tmp/cuda-quantum/.claude/skills/ skills/CUDA-Q/
          echo "- CUDA-Q" >> /tmp/synced-products.txt
        else
          echo "⚠ CUDA-Q checkout empty or missing — skipping to preserve existing catalog"
        fi

Once that's added, this is ready to go.
CUDA-Q was merged into the catalog (NVIDIA#7). Add checkout + copy block for NVIDIA/cuda-quantum → skills/CUDA-Q. Signed-off-by: Sayali Kandarkar <skandarkar@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA-Q block looks good. All 10 products covered, all review items addressed. This is ready to merge whenever you mark it ready for review.
Thank you @mosheabr