ci: Add merge queue retry if CI_TIMEOUT #1111

Merged
chtruong814 merged 7 commits into main from chtruong/queue-retry
Sep 18, 2025

Conversation

@chtruong814
Contributor

@chtruong814 chtruong814 commented Sep 10, 2025

What does this PR do?

Add merge queue retry if CI_TIMEOUT

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Adds an active Merge Queue Auto-Retry that automatically requeues pull requests removed for CI timeout, with up to 3 auto-retry attempts.
    • Posts clear PR comments for each auto-retry, when maximum retries are reached, and if the workflow fails.
  • Chores

    • Added repository workflow to manage automated requeue attempts and notifications.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Sep 10, 2025

Walkthrough

Adds a new GitHub Actions workflow .github/workflows/merge-queue-retry.yml that triggers on pull_request dequeued, obtains a GitHub App token, inspects dequeue reason and prior retry comments, conditionally posts retry/max-retries comments, and requeues the PR via a GraphQL enqueuePullRequest mutation.

Changes

| Cohort / File(s) | Summary of changes |
| --- | --- |
| Merge Queue Auto-Retry Workflow: .github/workflows/merge-queue-retry.yml | New workflow file added. Trigger: pull_request dequeued. Job requeue-pr obtains a GitHub App token (actions/create-github-app-token@v1), checks dequeue reason (CI_TIMEOUT) and counts prior "Auto-retry attempt" comments to compute retry policy (MAX_RETRIES=3), emits should_retry/retry_count, posts an auto-retry comment and calls GraphQL enqueuePullRequest(input: {pullRequestId: PR_NODE_ID}) using the app token when retrying, posts a max-retries comment when not retrying, and posts a failure comment on workflow errors. Dynamic PR identifiers are derived from the event payload. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Dequeue as "Event: pull_request (dequeued)"
    participant WF as "Workflow: Merge Queue Auto-Retry"
    participant App as "GitHub App (create-github-app-token)"
    participant REST as "GitHub REST API (comments)"
    participant GraphQL as "GitHub GraphQL API"

    Dequeue->>WF: workflow triggered (pull_request dequeued)
    WF->>WF: extract PR_NUMBER, PR_NODE_ID, reason
    alt reason == "CI_TIMEOUT"
        WF->>App: request app token (vars.BOT_ID, secrets.BOT_KEY)
        App-->>WF: installation token
        WF->>REST: list PR comments -> count "Auto-retry attempt" => RETRY_COUNT
        WF->>WF: compare RETRY_COUNT < MAX_RETRIES (3)
        alt should_retry == true
            WF->>REST: post comment "🔄 Auto-retry attempt N..."
            WF->>GraphQL: enqueuePullRequest(pullRequestId: PR_NODE_ID) (with token)
            GraphQL-->>WF: data.enqueuePullRequest (success) / error
        else should_retry == false
            WF->>REST: post comment "⚠️ Maximum auto-retry attempts reached..."
        end
    else reason != "CI_TIMEOUT"
        WF->>REST: no retry (no-op)
    end
    par on workflow failure
        WF->>REST: post comment "❌ Auto-retry failed due to an error..."
    end
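
The decision path in the diagram can be sketched as a small shell function. This is an illustrative sketch only, not the workflow's actual step: the function name `decide_retry` and its argument order are hypothetical, and it assumes the retry count has already been derived from the PR comments. It prints `key=value` lines in the same shape the workflow appends to `$GITHUB_OUTPUT`.

```shell
# Hypothetical helper (not part of the PR): mirrors the retry gate in the diagram.
# Usage: decide_retry REASON RETRY_COUNT [MAX_RETRIES]
decide_retry() {
  local reason="$1" retry_count="$2" max_retries="${3:-3}"

  # Only CI timeouts are retried. For any other reason the real step is
  # skipped, so no outputs are produced at all.
  if [ "$reason" != "CI_TIMEOUT" ]; then
    return 0
  fi

  if [ "$retry_count" -lt "$max_retries" ]; then
    echo "should_retry=true"
    echo "retry_count=$((retry_count + 1))"
  else
    echo "should_retry=false"
  fi
}

# One prior attempt, limit of 3: prints should_retry=true and retry_count=2.
decide_retry CI_TIMEOUT 1 3
```

Because a non-timeout reason produces no output at all, a later step gated on `should_retry == 'false'` does not fire for it, which matches the no-op branch in the diagram.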

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I twitch my whiskers, check the queue,
A little hop to try anew—
I count my tries, then nudge once more,
If three's the limit, I won't soar—
I nibble logs and hum for sure. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title "ci: Add merge queue retry if CI_TIMEOUT" is concise, uses the conventional "ci:" scope, and directly describes the primary change of adding automatic merge-queue retry behavior for CI timeout events. It is specific and clear enough for a reviewer scanning history to understand the main intent of the PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added the CI Relating to CI label Sep 10, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
.github/workflows/merge-queue-retry.yml (3)

22-25: Add concurrency to prevent duplicate retries racing.

 jobs:
   requeue-pr:
     runs-on: ubuntu-latest
+    concurrency:
+      group: ${{ github.workflow }}-${{ github.event.pull_request.node_id }}
+      cancel-in-progress: true

68-79: Include the dequeue reason in the retry comment for traceability.

-          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-            -f body="🔄 Auto-retry attempt ${RETRY_COUNT}: PR was removed from merge queue, automatically requeuing..."
+          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+            -f body="🔄 Auto-retry attempt ${RETRY_COUNT}: PR was removed from merge queue (reason: ${{ github.event.reason }}). Automatically requeuing…"

100-109: Also include the reason in the max-retries comment.

-          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-            -f body="⚠️ Maximum auto-retry attempts reached. PR was removed from merge queue multiple times. Please investigate the issue and manually requeue if needed."
+          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+            -f body="⚠️ Maximum auto-retry attempts reached (last reason: ${{ github.event.reason }}). Please investigate and requeue manually if needed."
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b060d1d and facdc66.

📒 Files selected for processing (1)
  • .github/workflows/merge-queue-retry.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Lint check
🔇 Additional comments (3)
.github/workflows/merge-queue-retry.yml (3)

1-14: License header LGTM.


26-32: Verify GitHub App scopes match needs.

The App must be installed on the repo with at least: Issues: write (comments), Pull requests: write, and permission to enqueue via GraphQL (Merge Queue capability). Please confirm the App has these. (docs.github.com)

Do you want me to add a README snippet listing the exact App permissions to grant?


110-119: Failure notification step LGTM.

terrykong
terrykong previously approved these changes Sep 10, 2025
Collaborator

@terrykong terrykong left a comment

amazing @chtruong814. this has plagued us for a while. i'll approve, but feel free to assign another reviewer if you'd like another set of eyes

@terrykong terrykong assigned chtruong814 and unassigned terrykong Sep 10, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
.github/workflows/merge-queue-retry.yml (3)

17-21: Use pull_request_target and set minimal permissions (safer for forks and needed for secrets).

Switch to pull_request_target so the workflow can read secrets and write comments, and add a minimal permissions block. Do not check out or execute PR code under pull_request_target.

-on:
-  pull_request:
-    types:
-      - dequeued
+on:
+  pull_request_target:
+    types:
+      - dequeued
+
+permissions:
+  contents: read
+  pull-requests: write
+  issues: write

For GitHub Actions, does the pull_request event have a 'dequeued' action with a 'reason' field, and is pull_request_target appropriate for accessing secrets on forked PRs?

33-63: Harden reason check, enable strict bash, reduce noisy logs, and quote GITHUB_OUTPUT.

Current step is brittle on reason string, dumps all comments, and doesn’t fail fast.

       - name: Check dequeue reason and retry count
         id: check_retry
-        if: github.event.reason == 'CI_TIMEOUT'
+        if: contains(fromJSON('["CI_TIMEOUT","CHECKS_TIMEOUT","timed_out","TIMEOUT","timeout"]'), github.event.reason)
         env:
           GH_TOKEN: ${{ steps.generate_token.outputs.token }}
         run: |
-          PR_NUMBER=${{ github.event.pull_request.number }}
-
-          # Debug: Show all comments first
-          echo "=== All PR Comments ==="
-          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-            --jq '.[] | {id: .id, created_at: .created_at, body: .body[:100]}'
+          set -euo pipefail
+          PR_NUMBER=${{ github.event.pull_request.number }}
+          echo "Dequeued reason: '${{ github.event.reason }}'"
 
           echo "=== Filtering for retry comments ==="
 
           # Get the current number of retry attempts from PR comments
           RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
             --jq '[.[] | select(.body | contains("Auto-retry attempt")) | .body] | length')
 
           echo "Current retry count: $RETRY_COUNT"
 
           MAX_RETRIES=3
 
           if [ "$RETRY_COUNT" -lt "$MAX_RETRIES" ]; then
-            echo "should_retry=true" >> $GITHUB_OUTPUT
-            echo "retry_count=$((RETRY_COUNT + 1))" >> $GITHUB_OUTPUT
+            echo "should_retry=true" >> "$GITHUB_OUTPUT"
+            echo "retry_count=$((RETRY_COUNT + 1))" >> "$GITHUB_OUTPUT"
             echo "✅ Will retry (attempt $((RETRY_COUNT + 1))/$MAX_RETRIES)"
           else
-            echo "should_retry=false" >> $GITHUB_OUTPUT
+            echo "should_retry=false" >> "$GITHUB_OUTPUT"
             echo "❌ Max retries ($MAX_RETRIES) reached for PR #${PR_NUMBER}"
           fi
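
The retry count in this step comes from counting marker comments, so the marker text effectively acts as the workflow's only state. A minimal offline sketch of that counting idea follows. It is an assumption-laden illustration: `count_retry_markers` is a hypothetical name, and it reads one comment body per line from stdin instead of calling the GitHub API with `--jq` as the step does.

```shell
# Hypothetical helper (not part of the PR): counts comment bodies that contain
# the retry marker. Assumes one comment body per input line.
count_retry_markers() {
  # grep -c prints the number of matching lines. It exits with status 1 when
  # that number is 0, so the status is masked to keep `set -e` scripts alive.
  grep -c -F "Auto-retry attempt" || true
}

# Two of the three sample comments carry the marker, so this prints 2.
printf '%s\n' \
  "LGTM" \
  "Auto-retry attempt 1: PR was removed from merge queue" \
  "Auto-retry attempt 2: PR was removed from merge queue" \
  | count_retry_markers
```

One caveat worth keeping in mind: the REST endpoint for listing issue comments returns 30 comments per page by default, so on a long thread a single unpaginated call may not see every marker. This sketch does not model pagination.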

76-96: Use enqueuePullRequest with expectedHeadOid, detect failures, and exit non-zero.

Current curl call lacks expectedHeadOid, ignores GraphQL errors, and never fails the step.

       - name: Requeue Pull Request
         if: steps.check_retry.outputs.should_retry == 'true'
         env:
           GH_TOKEN: ${{ steps.generate_token.outputs.token }}
         run: |
-          PR_NUMBER=${{ github.event.pull_request.number }}
-          PR_NODE_ID="${{ github.event.pull_request.node_id }}"
-
-          echo "Requeuing PR #${PR_NUMBER}..."
-
-          # First, try using GraphQL API to enqueue the PR directly
-          GRAPHQL_RESPONSE=$(curl -s -X POST \
-            -H "Authorization: Bearer ${{ steps.generate_token.outputs.token }}" \
-            -H "Content-Type: application/json" \
-            -d "{\"query\": \"mutation { enqueuePullRequest(input: {pullRequestId: \\\"${PR_NODE_ID}\\\"}) { clientMutationId } }\"}" \
-            https://api.github.com/graphql)
-
-          if echo "$GRAPHQL_RESPONSE" | jq -e '.data.enqueuePullRequest' > /dev/null; then
-            echo "PR #${PR_NUMBER} has been successfully requeued"
-          fi
+          set -euo pipefail
+          PR_NUMBER=${{ github.event.pull_request.number }}
+          PR_NODE_ID="${{ github.event.pull_request.node_id }}"
+          HEAD_SHA="${{ github.event.pull_request.head.sha }}"
+          echo "Requeuing PR #${PR_NUMBER}..."
+          RESP=$(gh api graphql -f query='
+            mutation($id:ID!, $oid:GitObjectID) {
+              enqueuePullRequest(input:{pullRequestId:$id, expectedHeadOid:$oid}) {
+                mergeQueueEntry { id }
+              }
+            }' -f id="$PR_NODE_ID" -f oid="$HEAD_SHA")
+          if echo "$RESP" | jq -e '.data.enqueuePullRequest.mergeQueueEntry.id' >/dev/null; then
+            echo "✅ PR #${PR_NUMBER} successfully requeued"
+          else
+            echo "GraphQL response: $RESP"
+            gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+              -f body="❌ Auto-retry attempted but requeue GraphQL call failed. Please requeue manually."
+            exit 1
+          fi
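
The "never fails the step" point follows from how the shell treats an `if` with no `else`: when the condition is false and there is no other branch, the `if` statement itself exits with status 0. The sketch below isolates that behavior. The function names are hypothetical, and the `ok`/`error` argument stands in for the result of the GraphQL response check.

```shell
# Hypothetical stand-ins for the requeue check; "$1" simulates its outcome.

# Original shape: success is reported, failure falls through silently.
report_legacy() {
  if [ "$1" = "ok" ]; then
    echo "requeued"
  fi
}

# Suggested shape: the failure branch is explicit and returns non-zero.
report_strict() {
  if [ "$1" = "ok" ]; then
    echo "requeued"
  else
    echo "requeue failed" >&2
    return 1
  fi
}

report_legacy error
echo "legacy exit status: $?"                               # prints 0
report_strict error || echo "strict exit status: $?"        # prints 1
```

In a workflow step, the non-zero status from the strict variant is what lets a later `if: failure()` notification step run.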
🧹 Nitpick comments (4)
.github/workflows/merge-queue-retry.yml (4)

23-25: Add job-level concurrency to avoid duplicate retries when multiple dequeues fire.

Prevents races posting multiple comments and enqueuing twice.

   requeue-pr:
     runs-on: ubuntu-latest
+    concurrency:
+      group: requeue-pr-${{ github.event.pull_request.number }}
+      cancel-in-progress: false

65-76: Minor: ensure shell interpolation occurs as intended and consider consistent wording.

The interpolation is fine, but consider a consistent, grep-able prefix.

-          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-            -f body="🔄 Auto-retry attempt ${RETRY_COUNT}: PR was removed from merge queue, automatically requeuing..."
+          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+            -f body="🔄 Auto-retry attempt ${RETRY_COUNT}: PR dequeued (reason: ${{ github.event.reason }}). Automatically requeuing..."

97-116: Tighten notifications; include reason for context.

Make messages more actionable for on-call.

-          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-            -f body="⚠️ Maximum auto-retry attempts reached. PR was removed from merge queue multiple times. Please investigate the issue and manually requeue if needed."
+          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+            -f body="⚠️ Maximum auto-retry attempts reached (reason: ${{ github.event.reason }}). Please investigate and manually requeue if needed."
-          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-            -f body="❌ Auto-retry failed due to an error in the workflow. Please manually requeue the PR."
+          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+            -f body="❌ Auto-retry workflow error. Requeue did not complete. Please manually requeue the PR."

28-31: Security hardening: pin actions to commit SHAs.

Prevents supply-chain surprises from mutable tags.

-        uses: actions/create-github-app-token@v1
+        uses: actions/create-github-app-token@a38b9d0c6e530c4b1f1a7f7e23a2d0bb76b0a52e # v1
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between facdc66 and f610bf4.

📒 Files selected for processing (1)
  • .github/workflows/merge-queue-retry.yml (1 hunks)

Contributor

@ko3n1g ko3n1g left a comment

Do we know why we're running into this issue in the first place? In the settings, we've configured a timeout of 6 hrs, and if I'm reading this correctly, an average merge item takes 3.5 hrs to complete? So with 2 items in the queue, it's expected that the 2nd item will time out?

I'm new to merge-queues so I'm pretty sure there's a mistake in my assumption.

So the two questions I have are:

  1. Why 360min and not something larger?
  2. Is it true that a merge item takes 3.5 hrs to complete? That seems really heavy for GitHub CI. If this is the case, I would recommend offloading some testing to main and dealing with it via reverts.

@chtruong814
Contributor Author

@ko3n1g Regarding the 360-minute limit: 360 is the upper bound. If you attempt to set it higher, the UI prevents you. So, if many items are in the queue and CI takes a while, the queue will remove PRs that have exceeded that overall time.

It's a fair callout on the overall GitHub CI time and a good suggestion for us to keep in mind. It's a known issue but not something we'll be addressing in this PR. It's hard to say what's optimal at the moment given the current CI infra limitations.
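
The arithmetic behind the question above can be written out as a back-of-the-envelope sketch. It rests on a simplifying assumption that is not confirmed anywhere in this thread: that queue entries are processed strictly one after another, each taking the quoted 3.5 hrs (210 min), against the 360-minute limit. The function names are hypothetical, and GitHub's actual merge-queue scheduling may differ.

```shell
# Hypothetical model (assumption: strictly sequential processing).
# finish_minute POSITION PER_ITEM_MIN -> minutes until the entry at the given
# 1-based queue position finishes.
finish_minute() {
  echo $(( $1 * $2 ))
}

# would_time_out POSITION PER_ITEM_MIN LIMIT_MIN -> status 0 if the entry
# would exceed the limit under this model.
would_time_out() {
  [ "$(finish_minute "$1" "$2")" -gt "$3" ]
}

for position in 1 2; do
  if would_time_out "$position" 210 360; then
    echo "position $position: $(finish_minute "$position" 210) min, exceeds 360"
  else
    echo "position $position: $(finish_minute "$position" 210) min, within 360"
  fi
done
```

Under this model the first entry finishes at 210 min and the second at 420 min, which is why a fixed 360-minute ceiling makes timeouts likely whenever more than one slow entry is queued.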

ko3n1g
ko3n1g previously approved these changes Sep 16, 2025
Contributor

@ko3n1g ko3n1g left a comment

Thanks for educating me on this. This upper limit is very unfortunate. Would probably be good to have this workflow in the FW-templates toolbox eventually.

@chtruong814 chtruong814 added this pull request to the merge queue Sep 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 17, 2025
@nemo-automation-bot

🔄 Auto-retry attempt 2: PR was removed from merge queue, automatically requeuing...

@nemo-automation-bot

🔄 Auto-retry attempt 3: PR was removed from merge queue, automatically requeuing...

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@github-actions

ℹ️ File Consistency Check

Check based on commit: 795591a (PR #1111 from chtruong/queue-retry)

This is a test comment


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@github-actions

ℹ️ File Consistency Check

Check based on commit: 7a66f30 (PR #1111 from chtruong/queue-retry)

This is a test comment


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

terrykong
terrykong previously approved these changes Sep 17, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
.github/workflows/merge-queue-retry.yml (3)

34-65: Uncomment and harden the retry gate; match timeout reasons robustly; fail fast.

Re-enable this with strict bash, tolerant reason matching, and clear outputs. This prevents infinite retries and noisy logs.

Apply:

-      # - name: Check dequeue reason and retry count
-      #   id: check_retry
-      #   if: github.event.reason == 'CI_TIMEOUT'
-      #   env:
-      #     GH_TOKEN: ${{ steps.generate_token.outputs.token }}
-      #   run: |
-      #     PR_NUMBER=${{ github.event.pull_request.number }}
-      #     # Debug: Show all comments first
-      #     echo "=== All PR Comments ==="
-      #     gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-      #       --jq '.[] | {id: .id, created_at: .created_at, body: .body[:100]}'
-      #     echo "=== Filtering for retry comments ==="
-      #     # Get the current number of retry attempts from PR comments
-      #     RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-      #       --jq '[.[] | select(.body | contains("Auto-retry attempt")) | .body] | length')
-      #     echo "Current retry count: $RETRY_COUNT"
-      #     MAX_RETRIES=3
-      #     if [ "$RETRY_COUNT" -lt "$MAX_RETRIES" ]; then
-      #       echo "should_retry=true" >> $GITHUB_OUTPUT
-      #       echo "retry_count=$((RETRY_COUNT + 1))" >> $GITHUB_OUTPUT
-      #       echo "✅ Will retry (attempt $((RETRY_COUNT + 1))/$MAX_RETRIES)"
-      #     else
-      #       echo "should_retry=false" >> $GITHUB_OUTPUT
-      #       echo "❌ Max retries ($MAX_RETRIES) reached for PR #${PR_NUMBER}"
-      #     fi
+      - name: Check dequeue reason and retry count
+        id: check_retry
+        if: contains(fromJSON('["CI_TIMEOUT","CHECKS_TIMEOUT","TIMEOUT","timed_out","timeout"]'), github.event.reason)
+        env:
+          GH_TOKEN: ${{ steps.generate_token.outputs.token }}
+        run: |
+          set -euo pipefail
+          PR_NUMBER='${{ github.event.pull_request.number }}'
+          MAX_RETRIES="${MAX_RETRIES:-3}"
+          echo "Dequeued reason: '${{ github.event.reason }}'"
+          RETRY_COUNT=$(
+            gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+              --jq '[ .[] | select(.body | contains("Auto-retry attempt")) ] | length'
+          )
+          echo "Current retry count: $RETRY_COUNT"
+          if [ "$RETRY_COUNT" -lt "$MAX_RETRIES" ]; then
+            {
+              echo "should_retry=true"
+              echo "retry_count=$((RETRY_COUNT + 1))"
+            } >> "$GITHUB_OUTPUT"
+            echo "✅ Will retry (attempt $((RETRY_COUNT + 1))/$MAX_RETRIES)"
+          else
+            echo "should_retry=false" >> "$GITHUB_OUTPUT"
+            echo "❌ Max retries ($MAX_RETRIES) reached for PR #${PR_NUMBER}"
+          fi

17-22: Trigger mismatch: switch to pull_request_target: dequeued (and add workflow_dispatch).

Current push trigger can’t access github.event.pull_request.* and won’t fire on merge-queue dequeues. Use pull_request_target with types: [dequeued] to receive event.reason and PR context; add workflow_dispatch for manual tests.

Apply:

 name: "Merge Queue Auto-Retry"

-on:
-  push:
-  # pull_request:
-  #   types:
-  #     - dequeued
+on:
+  pull_request_target:
+    types: [dequeued]
+  workflow_dispatch:
+
+# Minimal base token perms (App token is used for writes).
+permissions:
+  contents: read
+  pull-requests: write
+  issues: write

77-97: Hard-coded PR ids; missing expectedHeadOid; no error handling; step won’t fail on GraphQL errors.

This will always requeue PR 1111, ignores the dequeued PR, and may silently “succeed.” Pull values from the event, include expectedHeadOid, enable bash safety, and fail on error. Gate on retry decision.

Apply:

-      - name: Requeue Pull Request
-        # if: steps.check_retry.outputs.should_retry == 'true'
+      - name: Requeue Pull Request
+        if: steps.check_retry.outputs.should_retry == 'true'
         env:
           GH_TOKEN: ${{ steps.generate_token.outputs.token }}
         run: |
-          PR_NUMBER="1111"
-          PR_NODE_ID="PR_kwDOOJjv8s6nvYsV"
-
-          echo "Requeuing PR #${PR_NUMBER}..."
-
-          # First, try using GraphQL API to enqueue the PR directly
-          GRAPHQL_RESPONSE=$(curl -s -X POST \
-            -H "Authorization: Bearer ${{ steps.generate_token.outputs.token }}" \
-            -H "Content-Type: application/json" \
-            -d "{\"query\": \"mutation { enqueuePullRequest(input: {pullRequestId: \\\"${PR_NODE_ID}\\\"}) { clientMutationId } }\"}" \
-            https://api.github.com/graphql)
-
-          echo "GRAPHQL_RESPONSE: $GRAPHQL_RESPONSE"
-          if echo "$GRAPHQL_RESPONSE" | jq -e '.data.enqueuePullRequest' > /dev/null; then
-            echo "PR #${PR_NUMBER} has been successfully requeued"
-          fi
+          set -euo pipefail
+          PR_NUMBER="${{ github.event.pull_request.number }}"
+          PR_NODE_ID="${{ github.event.pull_request.node_id }}"
+          HEAD_SHA="${{ github.event.pull_request.head.sha }}"
+          echo "Requeuing PR #${PR_NUMBER}..."
+          RESP=$(gh api graphql -f query='
+            mutation($id:ID!, $oid:GitObjectID){
+              enqueuePullRequest(input:{pullRequestId:$id, expectedHeadOid:$oid}) {
+                mergeQueueEntry { id }
+              }
+            }' -f id="$PR_NODE_ID" -f oid="$HEAD_SHA")
+          if echo "$RESP" | jq -e '.data.enqueuePullRequest.mergeQueueEntry.id' >/dev/null; then
+            echo "✅ PR #${PR_NUMBER} successfully requeued"
+          else
+            echo "GraphQL response: $RESP"
+            exit 1
+          fi
🧹 Nitpick comments (5)
.github/workflows/merge-queue-retry.yml (5)

66-76: Post a retry marker comment to track attempts.

Re-enable this so the count logic has a durable marker.

Apply:

-      # - name: Add retry comment
-      #   if: steps.check_retry.outputs.should_retry == 'true'
-      #   env:
-      #     GH_TOKEN: ${{ steps.generate_token.outputs.token }}
-      #   run: |
-      #     PR_NUMBER=${{ github.event.pull_request.number }}
-      #     RETRY_COUNT=${{ steps.check_retry.outputs.retry_count }}
-      #     gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-      #       -f body="🔄 Auto-retry attempt ${RETRY_COUNT}: PR was removed from merge queue, automatically requeuing..."
+      - name: Add retry comment
+        if: steps.check_retry.outputs.should_retry == 'true'
+        env:
+          GH_TOKEN: ${{ steps.generate_token.outputs.token }}
+        run: |
+          PR_NUMBER='${{ github.event.pull_request.number }}'
+          RETRY_COUNT='${{ steps.check_retry.outputs.retry_count }}'
+          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+            -f body="🔄 Auto-retry attempt ${RETRY_COUNT}: PR was removed from merge queue due to '${{ github.event.reason }}'. Requeuing…"

99-108: Surface max-retries reached to the PR.

Notify the author when auto-retries stop.

Apply:

-      # - name: Max retries reached comment
-      #   if: steps.check_retry.outputs.should_retry == 'false'
-      #   env:
-      #     GH_TOKEN: ${{ steps.generate_token.outputs.token }}
-      #   run: |
-      #     PR_NUMBER=${{ github.event.pull_request.number }}
-      #     gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-      #       -f body="⚠️ Maximum auto-retry attempts reached. PR was removed from merge queue multiple times. Please investigate the issue and manually requeue if needed."
+      - name: Max retries reached comment
+        if: steps.check_retry.outputs.should_retry == 'false'
+        env:
+          GH_TOKEN: ${{ steps.generate_token.outputs.token }}
+        run: |
+          PR_NUMBER='${{ github.event.pull_request.number }}'
+          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+            -f body="⚠️ Maximum auto-retry attempts reached. Please investigate the flake/timeouts and requeue manually if needed."

23-26: Add a concurrency group to avoid duplicate requeues.

Prevents two dequeues from racing and enqueuing twice.

Apply:

 jobs:
   requeue-pr:
     runs-on: ubuntu-latest
+    concurrency:
+      group: merge-queue-retry-${{ github.event.pull_request.number || github.run_id }}
+      cancel-in-progress: true

27-33: Validate App token creation inputs.

If vars.BOT_ID/secrets.BOT_KEY are missing, fail early with a clearer message.

Apply:

       - name: Generate GitHub App token
         id: generate_token
         uses: actions/create-github-app-token@v1
         with:
           app-id: ${{ vars.BOT_ID }}
           private-key: ${{ secrets.BOT_KEY }}
+      - name: Verify App token
+        run: |
+          test -n "${{ steps.generate_token.outputs.token }}" || { echo "Missing App token"; exit 1; }

7-7: Use HTTPS for the license URL.

Minor polish.

Apply:

-#     http://www.apache.org/licenses/LICENSE-2.0
+#     https://www.apache.org/licenses/LICENSE-2.0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f610bf4 and 7a66f30.

📒 Files selected for processing (1)
  • .github/workflows/merge-queue-retry.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Post submodule check comment / Comment on PR

@nemo-automation-bot nemo-automation-bot bot added this pull request to the merge queue Sep 17, 2025
@chtruong814 chtruong814 removed this pull request from the merge queue due to a manual request Sep 17, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@github-actions

ℹ️ File Consistency Check

Check based on commit: c3ec29a (PR #1111 from chtruong/queue-retry)

This is a test comment


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@github-actions

ℹ️ File Consistency Check

Check based on commit: c6a6fb6 (PR #1111 from chtruong/queue-retry)

This is a test comment


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (5)
.github/workflows/merge-queue-retry.yml (5)

33-36: Reason check is brittle; tolerate variants of timeout.

-        if: github.event.reason == 'CI_TIMEOUT'
+        if: contains(fromJSON('["CI_TIMEOUT","CHECKS_TIMEOUT","timed_out","TIMEOUT","timeout"]'), github.event.reason)
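
For comparison, the same tolerant matching can be sketched in shell with a case-insensitive `case` statement. Note the assumptions: `is_timeout_reason` is a hypothetical helper, and the variant list is copied from the suggestion above. Only `CI_TIMEOUT` is the value this PR actually keys on; the other spellings are unverified guesses at what the event might send.

```shell
# Hypothetical helper (not part of the PR): returns 0 when the dequeue reason
# looks like a timeout, ignoring letter case.
is_timeout_reason() {
  local normalized
  # Upper-case the input so "timed_out" and "TIMED_OUT" match the same pattern.
  normalized=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$normalized" in
    CI_TIMEOUT|CHECKS_TIMEOUT|TIMED_OUT|TIMEOUT) return 0 ;;
    *) return 1 ;;
  esac
}

for reason in CI_TIMEOUT timed_out MERGE_CONFLICT; do
  if is_timeout_reason "$reason"; then
    echo "$reason: retry candidate"
  else
    echo "$reason: ignored"
  fi
done
```

The trade-off is the usual one for permissive matching: a wider list tolerates naming drift, but it also makes it easier to retry on a reason that was never meant to be retried, so an exact match on the documented value is the more conservative choice.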

38-45: Harden shell, cut noisy logs, and quote outputs.

-        run: |
-          PR_NUMBER=${{ github.event.pull_request.number }}
-
-          # Debug: Show all comments first
-          echo "=== All PR Comments ==="
-          gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-            --jq '.[] | {id: .id, created_at: .created_at, body: .body[:100]}'
+        run: |
+          set -euo pipefail
+          PR_NUMBER='${{ github.event.pull_request.number }}'
+          echo "Dequeued reason: '${{ github.event.reason }}'"
@@
-          RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+          RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
             --jq '[.[] | select(.body | contains("Auto-retry attempt")) | .body] | length')
@@
-            echo "should_retry=true" >> $GITHUB_OUTPUT
-            echo "retry_count=$((RETRY_COUNT + 1))" >> $GITHUB_OUTPUT
+            echo "should_retry=true" >> "$GITHUB_OUTPUT"
+            echo "retry_count=$((RETRY_COUNT + 1))" >> "$GITHUB_OUTPUT"
             echo "✅ Will retry (attempt $((RETRY_COUNT + 1))/$MAX_RETRIES)"
           else
-            echo "should_retry=false" >> $GITHUB_OUTPUT
+            echo "should_retry=false" >> "$GITHUB_OUTPUT"
             echo "❌ Max retries ($MAX_RETRIES) reached for PR #${PR_NUMBER}"
           fi

Also applies to: 48-63


111-119: Failure notifier: shell safety and token fallback.

-        env:
-          GH_TOKEN: ${{ steps.generate_token.outputs.token }}
-        run: |
-          PR_NUMBER=${{ github.event.pull_request.number }}
+        env:
+          GH_TOKEN: ${{ steps.generate_token.outputs.token || github.token }}
+        run: |
+          set -euo pipefail
+          PR_NUMBER='${{ github.event.pull_request.number }}'
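
The fallback in the GH_TOKEN expression has a direct shell analogue, which may help when reasoning about it. A minimal sketch with a hypothetical helper pick_token; the token strings are placeholders, not real values:

```shell
# Hypothetical helper: return the App token if present, else the default token.
# In the workflow these would come from steps.generate_token.outputs.token
# and github.token respectively.
pick_token() {
  app_token="${1:-}"
  default_token="${2:-}"
  # ${var:-fallback} falls back when var is unset OR empty, which mirrors how
  # `a || b` treats an empty string as falsy in GitHub Actions expressions.
  printf '%s\n' "${app_token:-$default_token}"
}

# Token generation failed, so its output is empty: the default token is used.
pick_token "" "default-token"

# Token generation succeeded: the App token wins.
pick_token "app-token" "default-token"
```

With the fallback in place, the failure notifier can still comment when the token-generation step itself is what failed.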

17-21: Blocker: use pull_request_target and declare minimal permissions (fork safety + secrets).

The pull_request event does not expose secrets (BOT_KEY) to runs triggered from forks, so this workflow will fail for fork PRs. Switch to pull_request_target and declare minimal permissions.

 on:
-  pull_request:
+  pull_request_target:
     types:
       - dequeued
+
+# Minimal base token perms; App token is used for writes.
+permissions:
+  contents: read
+  pull-requests: write
+  issues: write

76-99: Use gh graphql with expectedHeadOid; fail and notify on GraphQL errors.

-        run: |
-          PR_NUMBER=${{ github.event.pull_request.number }}
-          PR_NODE_ID="${{ github.event.pull_request.node_id }}"
-
-          echo "Requeuing PR #${PR_NUMBER}..."
-
-          # First, try using GraphQL API to enqueue the PR directly
-          GRAPHQL_RESPONSE=$(curl -s -X POST \
-            -H "Authorization: Bearer ${{ steps.generate_token.outputs.token }}" \
-            -H "Content-Type: application/json" \
-            -d "{\"query\": \"mutation { enqueuePullRequest(input: {pullRequestId: \\\"${PR_NODE_ID}\\\"}) { clientMutationId } }\"}" \
-            https://api.github.com/graphql)
-
-          if echo "$GRAPHQL_RESPONSE" | jq -e '.data.enqueuePullRequest' > /dev/null; then
-            echo "PR #${PR_NUMBER} has been successfully requeued"
-          else
-            echo "❌ Failed to enqueue PR #${PR_NUMBER}. GraphQL response for debugging:"
-            echo "$GRAPHQL_RESPONSE"
-            exit 1
-          fi
+        run: |
+          set -euo pipefail
+          PR_NUMBER='${{ github.event.pull_request.number }}'
+          PR_NODE_ID='${{ github.event.pull_request.node_id }}'
+          HEAD_SHA='${{ github.event.pull_request.head.sha }}'
+
+          echo "Requeuing PR #${PR_NUMBER}..."
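+          # '|| true' keeps 'set -e' from aborting here, so the else branch below can still post its comment.
+          RESP=$(gh api graphql -f query='
+            mutation($id:ID!, $oid:GitObjectID) {
+              enqueuePullRequest(input:{pullRequestId:$id, expectedHeadOid:$oid}) {
+                mergeQueueEntry { id }
+              }
+            }' -f id="$PR_NODE_ID" -f oid="$HEAD_SHA") || true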
+
+          if echo "$RESP" | jq -e '.data.enqueuePullRequest.mergeQueueEntry.id' >/dev/null; then
+            echo "✅ PR #${PR_NUMBER} successfully requeued"
+          else
+            echo "GraphQL response: $RESP"
+            gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+              -f body="❌ Auto-retry attempted but requeue GraphQL call failed. Please requeue manually."
+            exit 1
+          fi
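
One interaction worth keeping in mind: under set -e, a failing command inside RESP=$(...) aborts the script at the assignment, before any fallback branch can run, unless its exit status is caught. A minimal sketch of capturing both the response body and the status, where fake_graphql_call is a hypothetical stand-in for the real API call:

```shell
set -eu

# Hypothetical stand-in for the GraphQL call: prints an error body and fails.
fake_graphql_call() {
  echo '{"errors":[{"message":"simulated failure"}]}'
  return 1
}

# Catch the exit status in an || list so `set -e` does not abort the script.
status=0
RESP=$(fake_graphql_call) || status=$?

if [ "$status" -ne 0 ]; then
  echo "requeue failed (exit $status): $RESP"
  # A real step would post the PR comment here and then exit 1.
fi
```

Recording the status separately keeps the response body available for logging while still letting the step fail explicitly afterwards.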
🧹 Nitpick comments (4)
.github/workflows/merge-queue-retry.yml (4)

22-25: Optional: avoid duplicate runs per PR with concurrency.

 jobs:
   requeue-pr:
     runs-on: ubuntu-latest
+    concurrency:
+      group: auto-retry-${{ github.event.pull_request.number }}
+      cancel-in-progress: true

49-51: Count only bot-authored retry comments to avoid false positives.

-          RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
-            --jq '[.[] | select(.body | contains("Auto-retry attempt")) | .body] | length')
+          RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUMBER}/comments" \
+            --jq '[.[] | select((.body | contains("Auto-retry attempt")) and (.user.type=="Bot" or (.user.login|test("bot$")))) | .body] | length')

65-75: Quote interpolations and harden shell in comment step.

-        run: |
-          PR_NUMBER=${{ github.event.pull_request.number }}
-          RETRY_COUNT=${{ steps.check_retry.outputs.retry_count }}
+        run: |
+          set -euo pipefail
+          PR_NUMBER='${{ github.event.pull_request.number }}'
+          RETRY_COUNT='${{ steps.check_retry.outputs.retry_count }}'

101-110: Shell safety for “max retries reached” step.

-        run: |
-          PR_NUMBER=${{ github.event.pull_request.number }}
+        run: |
+          set -euo pipefail
+          PR_NUMBER='${{ github.event.pull_request.number }}'
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a66f30 and c6a6fb6.

📒 Files selected for processing (1)
  • .github/workflows/merge-queue-retry.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR

@chtruong814 merged commit ee8f5aa into main Sep 18, 2025
26 checks passed
@chtruong814 deleted the chtruong/queue-retry branch September 18, 2025 03:02
yfw pushed a commit that referenced this pull request Sep 23, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@coderabbitai (bot) mentioned this pull request Oct 9, 2025
4 tasks
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Labels

CI Relating to CI

3 participants