Skip to content

[luv-307] fix: bump translate-docs retries 2→5 and concurrency 2→4 post-LiteLLM-scale#306

Merged
NiveditJain merged 2 commits into
mainfrom
luv-307
May 8, 2026
Merged

[luv-307] fix: bump translate-docs retries 2→5 and concurrency 2→4 post-LiteLLM-scale#306
NiveditJain merged 2 commits into
mainfrom
luv-307

Conversation

@NiveditJain
Copy link
Copy Markdown
Member

@NiveditJain NiveditJain commented May 8, 2026

Summary

  • scripts/translate-docs/translator.ts — pass maxRetries (default 5, up from the SDK default of 2) to new Anthropic({ maxRetries }). Override via TRANSLATE_MAX_RETRIES (integer ≥ 0; 0 disables retries). The SDK already retries APIConnectionError / APIConnectionTimeoutError / 408/409/429/5xx with exponential backoff + jitter; we just need it to retry more so the LB has more chances to route a retry to a healthy LiteLLM replica.
  • scripts/translate-docs/cli.ts — raise per-job MAX_CONCURRENT from 2 to 4 (env-overridable via TRANSLATE_MAX_CONCURRENT, integer ≥ 1) and rewrite the inline comment that cited the pre-scale gateway-cliff state. With CI matrix max-parallel: 4 unchanged, global ceiling moves from 4×2=8 to 4×4=16 in flight — still half the failure-mode threshold of ~28 from [luv-306] fix: cap translate-docs matrix at max-parallel: 4 #305 even before the LiteLLM scale-out.
  • CHANGELOG.md — Unreleased→Fixes entry documenting the change, prior context ([translate-docs] use claude-haiku-4-5 alias instead of dated snapshot #300, [luv-306] fix: cap translate-docs matrix at max-parallel: 4 #305), and motivation (LiteLLM was horizontally scaled per github.com/exospherehost/platform; previous cap leaves capacity on the floor; remaining transient Connection error. failures are per-request and need explicit retry budget).
  • Env parsing hardening (per CodeRabbit feedback): both env vars now use Number.parseInt + Number.isInteger range checks instead of Number(env) || N. The prior form treated "0" as unset and silently accepted negatives — for MAX_CONCURRENT, a 0 or negative override would spawn zero workers via Math.min(MAX_CONCURRENT, tasks.length) and hang the run.
  • .gitignore — drive-by chore: ignore local .monitor-*.sh / .monitor-*.log scratch files used for ad-hoc CI watch loops.

Empirically observed before this change: bun scripts/translate-docs/cli.ts --languages zh --force returned 2 successes + 2 Connection error. lines on the 5 uncached Tier-1 (Sonnet) MDX pages. Post-fix the SDK's 5-retry budget should absorb those transients; concurrency bump takes back the throughput the prior cap forfeited.

Test plan

  • bunx tsc --noEmit clean
  • bun run lint scripts/translate-docs/translator.ts scripts/translate-docs/cli.ts clean (only a pre-existing unrelated <img> warning in app/)
  • bun scripts/translate-docs/cli.ts --languages zh --dry-run runs end-to-end with no errors
  • CI green (quality / test / build / test-e2e) on the latest commit
  • Live Translate Docs workflow run on main goes green across the full 14-language matrix (currently being monitored on run 25535770708)
  • Local re-run with default settings: bun scripts/translate-docs/cli.ts --languages zh --force finishes with no Connection error. lines (live API)
  • Override path with custom limits via TRANSLATE_MAX_CONCURRENT and TRANSLATE_MAX_RETRIES succeeds (live API)

🤖 Generated with Claude Code

…st-LiteLLM-scale

Make both env-overridable via TRANSLATE_MAX_RETRIES and TRANSLATE_MAX_CONCURRENT
so future scaling adjustments need no code change. The LiteLLM proxy behind
ANTHROPIC_BASE_URL has been horizontally scaled, so the prior cap of 2 leaves
capacity on the floor; remaining transient `Connection error.` failures are
per-request (replica flapping, LB hashing) and need explicit retry budget so
the LB has more chances to route a retry to a healthy replica.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 8, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d2e37b4e-6121-4050-9337-9673433da6f7

📥 Commits

Reviewing files that changed from the base of the PR and between eeb58e1 and 92788f5.

📒 Files selected for processing (2)
  • scripts/translate-docs/cli.ts
  • scripts/translate-docs/translator.ts

📝 Walkthrough

Walkthrough

This PR makes translate-docs retry and per-job concurrency configurable via environment variables (TRANSLATE_MAX_RETRIES default 5, TRANSLATE_MAX_CONCURRENT default 4), updates the Anthropic client to use the retry setting, and documents capping the workflow matrix at max-parallel: 4.

Changes

Translate-docs Concurrency and Retry Configuration

Layer / File(s) Summary
SDK Client Retry Configuration
scripts/translate-docs/translator.ts
getClient() reads TRANSLATE_MAX_RETRIES from environment (default 5) and passes it as maxRetries to new Anthropic({ maxRetries }).
CLI Concurrency Limiter Configuration
scripts/translate-docs/cli.ts
MAX_CONCURRENT is computed from process.env.TRANSLATE_MAX_CONCURRENT (integer ≥1, default 4) and used by runWithConcurrency to limit concurrent translation jobs.
Changelog Documentation
CHANGELOG.md
Unreleased notes updated to record SDK retry tuning, CLI concurrency increase, env var overrides, and workflow matrix cap at max-parallel: 4.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 I nibble env vars, gentle and spry,

Retries set to five so failures pass by,
Four hops at a time, concurrency neat,
Docs note the change — the pipeline's heartbeat,
A rabbit-approved tune for a steadier sky.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main changes: increasing retry policy to 5 and concurrency to 4 for the translate-docs script, with context about the LiteLLM scaling that motivated these changes.
Description check ✅ Passed The description comprehensively covers the required template sections with detailed explanations of changes, motivation, env-var hardening, observed vs. expected behavior, and a test plan with specific verification steps.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/translate-docs/cli.ts (1)

189-199: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate TRANSLATE_MAX_CONCURRENT before using it as worker count.

Line 189 accepts negative values and Line 199 can then create zero workers, causing queued translation tasks to never execute.

Suggested fix
-  const MAX_CONCURRENT = Number(process.env.TRANSLATE_MAX_CONCURRENT) || 4;
+  const parsedConcurrent = Number.parseInt(process.env.TRANSLATE_MAX_CONCURRENT ?? "", 10);
+  const MAX_CONCURRENT =
+    Number.isInteger(parsedConcurrent) && parsedConcurrent > 0
+      ? parsedConcurrent
+      : 4;
@@
-    await Promise.all(Array.from({ length: Math.min(MAX_CONCURRENT, tasks.length) }, () => next()));
+    const workers = Math.min(MAX_CONCURRENT, tasks.length);
+    await Promise.all(Array.from({ length: workers }, () => next()));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/translate-docs/cli.ts` around lines 189 - 199, The MAX_CONCURRENT
value can be negative or zero causing Promise.all to spawn 0 workers and hang;
validate and normalize it when constructing MAX_CONCURRENT used by
runWithConcurrency: parse the env value as an integer (e.g., parseInt or
Number), clamp it to a sensible range (>=1 and optionally an upper cap), and
fallback to the default 4 only if parsing fails; update the constant
MAX_CONCURRENT and ensure the Promise.all Array.from length uses this validated
positive integer so next() workers are always created. Reference symbols:
MAX_CONCURRENT, runWithConcurrency, and the Promise.all(Array.from(..., () =>
next())) worker creation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/translate-docs/translator.ts`:
- Around line 12-13: The env parsing for TRANSLATE_MAX_RETRIES is brittle
(treats "0" as falsy and allows negatives); update the logic that sets
maxRetries so it explicitly parses TRANSLATE_MAX_RETRIES (use parseInt or Number
and Number.isInteger), verify the parsed value is a finite integer >= 0, and
only fall back to the default 5 when the env is absent or invalid; then pass
that validated maxRetries into the Anthropic client constructor (the variable
maxRetries and the new Anthropic({ maxRetries }) call).

---

Outside diff comments:
In `@scripts/translate-docs/cli.ts`:
- Around line 189-199: The MAX_CONCURRENT value can be negative or zero causing
Promise.all to spawn 0 workers and hang; validate and normalize it when
constructing MAX_CONCURRENT used by runWithConcurrency: parse the env value as
an integer (e.g., parseInt or Number), clamp it to a sensible range (>=1 and
optionally an upper cap), and fallback to the default 4 only if parsing fails;
update the constant MAX_CONCURRENT and ensure the Promise.all Array.from length
uses this validated positive integer so next() workers are always created.
Reference symbols: MAX_CONCURRENT, runWithConcurrency, and the
Promise.all(Array.from(..., () => next())) worker creation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4e156f5a-95b9-4da6-ab05-20de4c885f82

📥 Commits

Reviewing files that changed from the base of the PR and between 8492575 and eeb58e1.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • scripts/translate-docs/cli.ts
  • scripts/translate-docs/translator.ts

Comment thread scripts/translate-docs/translator.ts Outdated
Prior `Number(env) || N` treated "0" as unset and silently accepted
negatives. For MAX_CONCURRENT a 0/negative override would spawn zero
workers via Math.min(MAX_CONCURRENT, tasks.length) and hang. Switch to
explicit Number.parseInt + Number.isInteger range checks. Address
CodeRabbit findings on PR #306.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@NiveditJain NiveditJain merged commit 290847e into main May 8, 2026
9 checks passed
NiveditJain added a commit that referenced this pull request May 8, 2026
…00s InvokeModel wall (#307)

Two consecutive translate-docs matrix runs post-platform#345 (which
lifted LiteLLM request_timeout 300s -> 600s) kept failing the 4
largest Tier-1 (Sonnet) pages with `Connection error.` at ~317s and
~367s. Both timestamps are below the new 600s ceiling, so the wall
isn't ours. The actual ceiling is AWS Bedrock's synchronous
InvokeModel API, which severs at ~300s — and Bedrock is one of the
two upstream deployments behind models.aikin.club's claude-sonnet-4-6
route (weighted 1:1 with anthropic-direct, simple-shuffle).

Switching to messages.stream(...).finalMessage() routes Bedrock via
InvokeModelWithResponseStream (no 300s wall) and Anthropic-direct via
streaming (no 5-min cap). The Message return shape is identical so
no other call sites need to change.

Refs: failproofai #306, FailproofAI/platform#345
Failed runs: 25540656053, 25541614351

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
NiveditJain added a commit that referenced this pull request May 8, 2026
Used during ad-hoc CI watch loops; not part of the project source.
Same chore that was committed on the abandoned luv-307 (post-squash-merge
of #306) but never made it into main.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
NiveditJain added a commit that referenced this pull request May 8, 2026
Promote the six entries accumulated under `## Unreleased` to a versioned
heading `## 0.0.10-beta.6 — 2026-05-08`. Add a fresh `## Unreleased`
heading at the top for the next development cycle.

package.json was already at 0.0.10-beta.6 (pre-bumped); no version edit
needed here. The CHANGELOG cut completes the release-prep handshake.

Entries promoted:
- Cursor Stop hook enforcement fix (this PR)
- 5 scripts/translate-docs fixes from #305, #306, #307, #312, #313

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
NiveditJain added a commit that referenced this pull request May 8, 2026
…a.6 (#318)

* [luv-319] fix: enforce Stop hook on Cursor Agent CLI (followup_message + SubagentStop parity)

Cursor's `stop` hook ignores the flat `{permission: "deny"}` shape — that's
honored on tool events only. The only force-retry channel for Stop is
`{followup_message}` on stdout (exit 0), per https://cursor.com/docs/hooks.
The instruct branch already used this shape correctly since #245; the deny
path needed the same treatment, mirroring Copilot's #299 fix.

Without this, the 5 require-*-before-stop builtins were observation-only on
Cursor — the deny was logged but the agent stopped cleanly. User repro:
session 1b510ad4-906c-4f30-9467-ff2e6c581cce at /home/nivedit/dev-purge.

Also subscribes to `subagentStop` (CURSOR_HOOK_EVENT_TYPES + CURSOR_EVENT_MAP)
and widens both deny and instruct branches to match it, for parity with the
Copilot SubagentStop widening from #299.

Cloud Agents caveat: Cursor Cloud Agent VMs do NOT run stop/subagentStop
hooks at all, so this fix only covers local Cursor sessions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: cut 0.0.10-beta.6 release in CHANGELOG

Promote the six entries accumulated under `## Unreleased` to a versioned
heading `## 0.0.10-beta.6 — 2026-05-08`. Add a fresh `## Unreleased`
heading at the top for the next development cycle.

package.json was already at 0.0.10-beta.6 (pre-bumped); no version edit
needed here. The CHANGELOG cut completes the release-prep handshake.

Entries promoted:
- Cursor Stop hook enforcement fix (this PR)
- 5 scripts/translate-docs fixes from #305, #306, #307, #312, #313

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant