[luv-307] fix: bump translate-docs retries 2→5 and concurrency 2→4 post-LiteLLM-scale by NiveditJain · Pull Request #306 · FailproofAI/failproofai

NiveditJain · 2026-05-08T03:46:00Z

Summary

scripts/translate-docs/translator.ts — pass maxRetries (default 5, up from the SDK default of 2) to new Anthropic({ maxRetries }). Override via TRANSLATE_MAX_RETRIES (integer ≥ 0; 0 disables retries). The SDK already retries APIConnectionError / APIConnectionTimeoutError / 408/409/429/5xx with exponential backoff + jitter; we just need it to retry more so the LB has more chances to route a retry to a healthy LiteLLM replica.
scripts/translate-docs/cli.ts — raise per-job MAX_CONCURRENT from 2 to 4 (env-overridable via TRANSLATE_MAX_CONCURRENT, integer ≥ 1) and rewrite the inline comment that cited the pre-scale gateway-cliff state. With CI matrix max-parallel: 4 unchanged, global ceiling moves from 4×2=8 to 4×4=16 in flight — still half the failure-mode threshold of ~28 from [luv-306] fix: cap translate-docs matrix at max-parallel: 4 #305 even before the LiteLLM scale-out.
CHANGELOG.md — Unreleased→Fixes entry documenting the change, prior context ([translate-docs] use claude-haiku-4-5 alias instead of dated snapshot #300, [luv-306] fix: cap translate-docs matrix at max-parallel: 4 #305), and motivation (LiteLLM was horizontally scaled per github.com/exospherehost/platform; previous cap leaves capacity on the floor; remaining transient Connection error. failures are per-request and need explicit retry budget).
Env parsing hardening (per CodeRabbit feedback): both env vars now use Number.parseInt + Number.isInteger range checks instead of Number(env) || N. The prior form treated "0" as unset and silently accepted negatives — for MAX_CONCURRENT, a 0 or negative override would spawn zero workers via Math.min(MAX_CONCURRENT, tasks.length) and hang the run.
.gitignore — drive-by chore: ignore local .monitor-*.sh / .monitor-*.log scratch files used for ad-hoc CI watch loops.

Empirically observed before this change: bun scripts/translate-docs/cli.ts --languages zh --force returned 2 successes + 2 Connection error. lines on the 5 uncached Tier-1 (Sonnet) MDX pages. Post-fix the SDK's 5-retry budget should absorb those transients; concurrency bump takes back the throughput the prior cap forfeited.

Test plan

bunx tsc --noEmit clean
bun run lint scripts/translate-docs/translator.ts scripts/translate-docs/cli.ts clean (only a pre-existing unrelated <img> warning in app/)
bun scripts/translate-docs/cli.ts --languages zh --dry-run runs end-to-end with no errors
CI green (quality / test / build / test-e2e) on the latest commit
Live Translate Docs workflow run on main goes green across the full 14-language matrix (currently being monitored on run 25535770708)
Local re-run with default settings: bun scripts/translate-docs/cli.ts --languages zh --force finishes with no Connection error. lines (live API)
Override path with custom limits via TRANSLATE_MAX_CONCURRENT and TRANSLATE_MAX_RETRIES succeeds (live API)

🤖 Generated with Claude Code

…st-LiteLLM-scale Make both env-overridable via TRANSLATE_MAX_RETRIES and TRANSLATE_MAX_CONCURRENT so future scaling adjustments need no code change. The LiteLLM proxy behind ANTHROPIC_BASE_URL has been horizontally scaled, so the prior cap of 2 leaves capacity on the floor; remaining transient `Connection error.` failures are per-request (replica flapping, LB hashing) and need explicit retry budget so the LB has more chances to route a retry to a healthy replica. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai · 2026-05-08T03:46:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d2e37b4e-6121-4050-9337-9673433da6f7

📥 Commits

Reviewing files that changed from the base of the PR and between eeb58e1 and 92788f5.

📒 Files selected for processing (2)

scripts/translate-docs/cli.ts
scripts/translate-docs/translator.ts

📝 Walkthrough

Walkthrough

This PR makes translate-docs retry and per-job concurrency configurable via environment variables (TRANSLATE_MAX_RETRIES default 5, TRANSLATE_MAX_CONCURRENT default 4), updates the Anthropic client to use the retry setting, and documents capping the workflow matrix at max-parallel: 4.

Changes

Translate-docs Concurrency and Retry Configuration

Layer / File(s)	Summary
SDK Client Retry Configuration `scripts/translate-docs/translator.ts`	`getClient()` reads `TRANSLATE_MAX_RETRIES` from environment (default 5) and passes it as `maxRetries` to `new Anthropic({ maxRetries })`.
CLI Concurrency Limiter Configuration `scripts/translate-docs/cli.ts`	`MAX_CONCURRENT` is computed from `process.env.TRANSLATE_MAX_CONCURRENT` (integer ≥1, default 4) and used by `runWithConcurrency` to limit concurrent translation jobs.
Changelog Documentation `CHANGELOG.md`	Unreleased notes updated to record SDK retry tuning, CLI concurrency increase, env var overrides, and workflow matrix cap at `max-parallel: 4`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

exospherehost/failproofai#305: Caps translate-docs workflow matrix at max-parallel: 4 to reduce cross-job in-flight requests.
exospherehost/failproofai#300: Modifies MAX_CONCURRENT in scripts/translate-docs/cli.ts concurrency configuration.

Poem

🐰 I nibble env vars, gentle and spry,

Retries set to five so failures pass by,
Four hops at a time, concurrency neat,
Docs note the change — the pipeline's heartbeat,
A rabbit-approved tune for a steadier sky.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main changes: increasing retry policy to 5 and concurrency to 4 for the translate-docs script, with context about the LiteLLM scaling that motivated these changes.
Description check	✅ Passed	The description comprehensively covers the required template sections with detailed explanations of changes, motivation, env-var hardening, observed vs. expected behavior, and a test plan with specific verification steps.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

scripts/translate-docs/cli.ts (1)

189-199: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate TRANSLATE_MAX_CONCURRENT before using it as worker count.

Line 189 accepts negative values and Line 199 can then create zero workers, causing queued translation tasks to never execute.

Suggested fix

-  const MAX_CONCURRENT = Number(process.env.TRANSLATE_MAX_CONCURRENT) || 4;
+  const parsedConcurrent = Number.parseInt(process.env.TRANSLATE_MAX_CONCURRENT ?? "", 10);
+  const MAX_CONCURRENT =
+    Number.isInteger(parsedConcurrent) && parsedConcurrent > 0
+      ? parsedConcurrent
+      : 4;
@@
-    await Promise.all(Array.from({ length: Math.min(MAX_CONCURRENT, tasks.length) }, () => next()));
+    const workers = Math.min(MAX_CONCURRENT, tasks.length);
+    await Promise.all(Array.from({ length: workers }, () => next()));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/translate-docs/cli.ts` around lines 189 - 199, The MAX_CONCURRENT
value can be negative or zero causing Promise.all to spawn 0 workers and hang;
validate and normalize it when constructing MAX_CONCURRENT used by
runWithConcurrency: parse the env value as an integer (e.g., parseInt or
Number), clamp it to a sensible range (>=1 and optionally an upper cap), and
fallback to the default 4 only if parsing fails; update the constant
MAX_CONCURRENT and ensure the Promise.all Array.from length uses this validated
positive integer so next() workers are always created. Reference symbols:
MAX_CONCURRENT, runWithConcurrency, and the Promise.all(Array.from(..., () =>
next())) worker creation.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/translate-docs/translator.ts`:
- Around line 12-13: The env parsing for TRANSLATE_MAX_RETRIES is brittle
(treats "0" as falsy and allows negatives); update the logic that sets
maxRetries so it explicitly parses TRANSLATE_MAX_RETRIES (use parseInt or Number
and Number.isInteger), verify the parsed value is a finite integer >= 0, and
only fall back to the default 5 when the env is absent or invalid; then pass
that validated maxRetries into the Anthropic client constructor (the variable
maxRetries and the new Anthropic({ maxRetries }) call).

---

Outside diff comments:
In `@scripts/translate-docs/cli.ts`:
- Around line 189-199: The MAX_CONCURRENT value can be negative or zero causing
Promise.all to spawn 0 workers and hang; validate and normalize it when
constructing MAX_CONCURRENT used by runWithConcurrency: parse the env value as
an integer (e.g., parseInt or Number), clamp it to a sensible range (>=1 and
optionally an upper cap), and fallback to the default 4 only if parsing fails;
update the constant MAX_CONCURRENT and ensure the Promise.all Array.from length
uses this validated positive integer so next() workers are always created.
Reference symbols: MAX_CONCURRENT, runWithConcurrency, and the
Promise.all(Array.from(..., () => next())) worker creation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4e156f5a-95b9-4da6-ab05-20de4c885f82

📥 Commits

Reviewing files that changed from the base of the PR and between 8492575 and eeb58e1.

📒 Files selected for processing (3)

CHANGELOG.md
scripts/translate-docs/cli.ts
scripts/translate-docs/translator.ts

Prior `Number(env) || N` treated "0" as unset and silently accepted negatives. For MAX_CONCURRENT a 0/negative override would spawn zero workers via Math.min(MAX_CONCURRENT, tasks.length) and hang. Switch to explicit Number.parseInt + Number.isInteger range checks. Address CodeRabbit findings on PR #306. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…00s InvokeModel wall (#307) Two consecutive translate-docs matrix runs post-platform#345 (which lifted LiteLLM request_timeout 300s -> 600s) kept failing the 4 largest Tier-1 (Sonnet) pages with `Connection error.` at ~317s and ~367s. Both timestamps are below the new 600s ceiling, so the wall isn't ours. The actual ceiling is AWS Bedrock's synchronous InvokeModel API, which severs at ~300s — and Bedrock is one of the two upstream deployments behind models.aikin.club's claude-sonnet-4-6 route (weighted 1:1 with anthropic-direct, simple-shuffle). Switching to messages.stream(...).finalMessage() routes Bedrock via InvokeModelWithResponseStream (no 300s wall) and Anthropic-direct via streaming (no 5-min cap). The Message return shape is identical so no other call sites need to change. Refs: failproofai #306, FailproofAI/platform#345 Failed runs: 25540656053, 25541614351 Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Used during ad-hoc CI watch loops; not part of the project source. Same chore that was committed on the abandoned luv-307 (post-squash-merge of #306) but never made it into main. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Promote the six entries accumulated under `## Unreleased` to a versioned heading `## 0.0.10-beta.6 — 2026-05-08`. Add a fresh `## Unreleased` heading at the top for the next development cycle. package.json was already at 0.0.10-beta.6 (pre-bumped); no version edit needed here. The CHANGELOG cut completes the release-prep handshake. Entries promoted: - Cursor Stop hook enforcement fix (this PR) - 5 scripts/translate-docs fixes from #305, #306, #307, #312, #313 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…a.6 (#318) * [luv-319] fix: enforce Stop hook on Cursor Agent CLI (followup_message + SubagentStop parity) Cursor's `stop` hook ignores the flat `{permission: "deny"}` shape — that's honored on tool events only. The only force-retry channel for Stop is `{followup_message}` on stdout (exit 0), per https://cursor.com/docs/hooks. The instruct branch already used this shape correctly since #245; the deny path needed the same treatment, mirroring Copilot's #299 fix. Without this, the 5 require-*-before-stop builtins were observation-only on Cursor — the deny was logged but the agent stopped cleanly. User repro: session 1b510ad4-906c-4f30-9467-ff2e6c581cce at /home/nivedit/dev-purge. Also subscribes to `subagentStop` (CURSOR_HOOK_EVENT_TYPES + CURSOR_EVENT_MAP) and widens both deny and instruct branches to match it, for parity with the Copilot SubagentStop widening from #299. Cloud Agents caveat: Cursor Cloud Agent VMs do NOT run stop/subagentStop hooks at all, so this fix only covers local Cursor sessions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: cut 0.0.10-beta.6 release in CHANGELOG Promote the six entries accumulated under `## Unreleased` to a versioned heading `## 0.0.10-beta.6 — 2026-05-08`. Add a fresh `## Unreleased` heading at the top for the next development cycle. package.json was already at 0.0.10-beta.6 (pre-bumped); no version edit needed here. The CHANGELOG cut completes the release-prep handshake. Entries promoted: - Cursor Stop hook enforcement fix (this PR) - 5 scripts/translate-docs fixes from #305, #306, #307, #312, #313 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

Comment thread scripts/translate-docs/translator.ts Outdated

NiveditJain merged commit 290847e into main May 8, 2026
9 checks passed

coderabbitai Bot mentioned this pull request May 8, 2026

[luv-308] fix: stream messages so large Sonnet docs dodge Bedrock's 300s InvokeModel wall #307

Merged

6 tasks

NiveditJain mentioned this pull request May 8, 2026

[luv-309] chore: gitignore local .monitor-*.sh/.log scratch files #308

Merged

2 tasks

This was referenced May 8, 2026

[luv-310] fix: strip stray trailing ``` fences from translated MDX (Mintlify deploy unblock) #311

Closed

[luv-313] fix: strip stray trailing ``` fences from translated MDX (Mintlify deploy unblock) #313

Merged

NiveditJain mentioned this pull request May 8, 2026

[luv-319] fix: enforce Stop hook on Cursor Agent CLI + cut 0.0.10-beta.6 #318

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[luv-307] fix: bump translate-docs retries 2→5 and concurrency 2→4 post-LiteLLM-scale#306

[luv-307] fix: bump translate-docs retries 2→5 and concurrency 2→4 post-LiteLLM-scale#306
NiveditJain merged 2 commits into
mainfrom
luv-307

NiveditJain commented May 8, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 8, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

NiveditJain commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

coderabbitai Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NiveditJain commented May 8, 2026 •

edited

Loading

coderabbitai Bot commented May 8, 2026 •

edited

Loading