[luv-307] fix: bump translate-docs retries 2→5 and concurrency 2→4 post-LiteLLM-scale#306
Conversation
…st-LiteLLM-scale Make both env-overridable via TRANSLATE_MAX_RETRIES and TRANSLATE_MAX_CONCURRENT so future scaling adjustments need no code change. The LiteLLM proxy behind ANTHROPIC_BASE_URL has been horizontally scaled, so the prior cap of 2 leaves capacity on the floor; remaining transient `Connection error.` failures are per-request (replica flapping, LB hashing) and need explicit retry budget so the LB has more chances to route a retry to a healthy replica. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR makes translate-docs retry and per-job concurrency configurable via environment variables ( ChangesTranslate-docs Concurrency and Retry Configuration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/translate-docs/cli.ts (1)
189-199:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate
TRANSLATE_MAX_CONCURRENTbefore using it as worker count.Line 189 accepts negative values and Line 199 can then create zero workers, causing queued translation tasks to never execute.
Suggested fix
- const MAX_CONCURRENT = Number(process.env.TRANSLATE_MAX_CONCURRENT) || 4; + const parsedConcurrent = Number.parseInt(process.env.TRANSLATE_MAX_CONCURRENT ?? "", 10); + const MAX_CONCURRENT = + Number.isInteger(parsedConcurrent) && parsedConcurrent > 0 + ? parsedConcurrent + : 4; @@ - await Promise.all(Array.from({ length: Math.min(MAX_CONCURRENT, tasks.length) }, () => next())); + const workers = Math.min(MAX_CONCURRENT, tasks.length); + await Promise.all(Array.from({ length: workers }, () => next()));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/translate-docs/cli.ts` around lines 189 - 199, The MAX_CONCURRENT value can be negative or zero causing Promise.all to spawn 0 workers and hang; validate and normalize it when constructing MAX_CONCURRENT used by runWithConcurrency: parse the env value as an integer (e.g., parseInt or Number), clamp it to a sensible range (>=1 and optionally an upper cap), and fallback to the default 4 only if parsing fails; update the constant MAX_CONCURRENT and ensure the Promise.all Array.from length uses this validated positive integer so next() workers are always created. Reference symbols: MAX_CONCURRENT, runWithConcurrency, and the Promise.all(Array.from(..., () => next())) worker creation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/translate-docs/translator.ts`:
- Around line 12-13: The env parsing for TRANSLATE_MAX_RETRIES is brittle
(treats "0" as falsy and allows negatives); update the logic that sets
maxRetries so it explicitly parses TRANSLATE_MAX_RETRIES (use parseInt or Number
and Number.isInteger), verify the parsed value is a finite integer >= 0, and
only fall back to the default 5 when the env is absent or invalid; then pass
that validated maxRetries into the Anthropic client constructor (the variable
maxRetries and the new Anthropic({ maxRetries }) call).
---
Outside diff comments:
In `@scripts/translate-docs/cli.ts`:
- Around line 189-199: The MAX_CONCURRENT value can be negative or zero causing
Promise.all to spawn 0 workers and hang; validate and normalize it when
constructing MAX_CONCURRENT used by runWithConcurrency: parse the env value as
an integer (e.g., parseInt or Number), clamp it to a sensible range (>=1 and
optionally an upper cap), and fallback to the default 4 only if parsing fails;
update the constant MAX_CONCURRENT and ensure the Promise.all Array.from length
uses this validated positive integer so next() workers are always created.
Reference symbols: MAX_CONCURRENT, runWithConcurrency, and the
Promise.all(Array.from(..., () => next())) worker creation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4e156f5a-95b9-4da6-ab05-20de4c885f82
📒 Files selected for processing (3)
CHANGELOG.mdscripts/translate-docs/cli.tsscripts/translate-docs/translator.ts
Prior `Number(env) || N` treated "0" as unset and silently accepted negatives. For MAX_CONCURRENT a 0/negative override would spawn zero workers via Math.min(MAX_CONCURRENT, tasks.length) and hang. Switch to explicit Number.parseInt + Number.isInteger range checks. Address CodeRabbit findings on PR #306. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…00s InvokeModel wall (#307) Two consecutive translate-docs matrix runs post-platform#345 (which lifted LiteLLM request_timeout 300s -> 600s) kept failing the 4 largest Tier-1 (Sonnet) pages with `Connection error.` at ~317s and ~367s. Both timestamps are below the new 600s ceiling, so the wall isn't ours. The actual ceiling is AWS Bedrock's synchronous InvokeModel API, which severs at ~300s — and Bedrock is one of the two upstream deployments behind models.aikin.club's claude-sonnet-4-6 route (weighted 1:1 with anthropic-direct, simple-shuffle). Switching to messages.stream(...).finalMessage() routes Bedrock via InvokeModelWithResponseStream (no 300s wall) and Anthropic-direct via streaming (no 5-min cap). The Message return shape is identical so no other call sites need to change. Refs: failproofai #306, FailproofAI/platform#345 Failed runs: 25540656053, 25541614351 Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Used during ad-hoc CI watch loops; not part of the project source. Same chore that was committed on the abandoned luv-307 (post-squash-merge of #306) but never made it into main. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Promote the six entries accumulated under `## Unreleased` to a versioned heading `## 0.0.10-beta.6 — 2026-05-08`. Add a fresh `## Unreleased` heading at the top for the next development cycle. package.json was already at 0.0.10-beta.6 (pre-bumped); no version edit needed here. The CHANGELOG cut completes the release-prep handshake. Entries promoted: - Cursor Stop hook enforcement fix (this PR) - 5 scripts/translate-docs fixes from #305, #306, #307, #312, #313 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…a.6 (#318) * [luv-319] fix: enforce Stop hook on Cursor Agent CLI (followup_message + SubagentStop parity) Cursor's `stop` hook ignores the flat `{permission: "deny"}` shape — that's honored on tool events only. The only force-retry channel for Stop is `{followup_message}` on stdout (exit 0), per https://cursor.com/docs/hooks. The instruct branch already used this shape correctly since #245; the deny path needed the same treatment, mirroring Copilot's #299 fix. Without this, the 5 require-*-before-stop builtins were observation-only on Cursor — the deny was logged but the agent stopped cleanly. User repro: session 1b510ad4-906c-4f30-9467-ff2e6c581cce at /home/nivedit/dev-purge. Also subscribes to `subagentStop` (CURSOR_HOOK_EVENT_TYPES + CURSOR_EVENT_MAP) and widens both deny and instruct branches to match it, for parity with the Copilot SubagentStop widening from #299. Cloud Agents caveat: Cursor Cloud Agent VMs do NOT run stop/subagentStop hooks at all, so this fix only covers local Cursor sessions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: cut 0.0.10-beta.6 release in CHANGELOG Promote the six entries accumulated under `## Unreleased` to a versioned heading `## 0.0.10-beta.6 — 2026-05-08`. Add a fresh `## Unreleased` heading at the top for the next development cycle. package.json was already at 0.0.10-beta.6 (pre-bumped); no version edit needed here. The CHANGELOG cut completes the release-prep handshake. Entries promoted: - Cursor Stop hook enforcement fix (this PR) - 5 scripts/translate-docs fixes from #305, #306, #307, #312, #313 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
scripts/translate-docs/translator.ts— passmaxRetries(default 5, up from the SDK default of 2) tonew Anthropic({ maxRetries }). Override viaTRANSLATE_MAX_RETRIES(integer ≥ 0; 0 disables retries). The SDK already retriesAPIConnectionError/APIConnectionTimeoutError/ 408/409/429/5xx with exponential backoff + jitter; we just need it to retry more so the LB has more chances to route a retry to a healthy LiteLLM replica.scripts/translate-docs/cli.ts— raise per-jobMAX_CONCURRENTfrom 2 to 4 (env-overridable viaTRANSLATE_MAX_CONCURRENT, integer ≥ 1) and rewrite the inline comment that cited the pre-scale gateway-cliff state. With CI matrixmax-parallel: 4unchanged, global ceiling moves from4×2=8to4×4=16in flight — still half the failure-mode threshold of ~28 from [luv-306] fix: cap translate-docs matrix at max-parallel: 4 #305 even before the LiteLLM scale-out.CHANGELOG.md— Unreleased→Fixes entry documenting the change, prior context ([translate-docs] use claude-haiku-4-5 alias instead of dated snapshot #300, [luv-306] fix: cap translate-docs matrix at max-parallel: 4 #305), and motivation (LiteLLM was horizontally scaled per github.com/exospherehost/platform; previous cap leaves capacity on the floor; remaining transientConnection error.failures are per-request and need explicit retry budget).Number.parseInt+Number.isIntegerrange checks instead ofNumber(env) || N. The prior form treated"0"as unset and silently accepted negatives — forMAX_CONCURRENT, a0or negative override would spawn zero workers viaMath.min(MAX_CONCURRENT, tasks.length)and hang the run..gitignore— drive-by chore: ignore local.monitor-*.sh/.monitor-*.logscratch files used for ad-hoc CI watch loops.Empirically observed before this change:
bun scripts/translate-docs/cli.ts --languages zh --forcereturned 2 successes + 2Connection error.lines on the 5 uncached Tier-1 (Sonnet) MDX pages. Post-fix the SDK's 5-retry budget should absorb those transients; concurrency bump takes back the throughput the prior cap forfeited.Test plan
bunx tsc --noEmitcleanbun run lint scripts/translate-docs/translator.ts scripts/translate-docs/cli.tsclean (only a pre-existing unrelated<img>warning inapp/)bun scripts/translate-docs/cli.ts --languages zh --dry-runruns end-to-end with no errorsTranslate Docsworkflow run on main goes green across the full 14-language matrix (currently being monitored on run 25535770708)bun scripts/translate-docs/cli.ts --languages zh --forcefinishes with noConnection error.lines (live API)🤖 Generated with Claude Code