bugfix(CubeProxy): solve stale routing by removing refresh-on-hit and implementing proactive cache invalidation #133

Closed

novahe wants to merge 1 commit into TencentCloud:master from novahe:fix/stale-routing-cache

Conversation

@novahe (Contributor) commented May 3, 2026

1. The Problem

CubeProxy suffered from stale routing issues when sandboxes were destroyed or migrated. The root cause was a "refresh-on-hit" pattern in the local routing cache:

  • Every successful cache hit extended the entry's TTL.
  • For active sandboxes, routing entries never naturally expired.
  • When CubeMaster updated Redis with a new backend IP/Port, the Proxy would continue using the old cached data until a manual restart or an extremely long timeout.
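
The buggy pattern can be sketched in OpenResty terms (a minimal illustrative sketch, not the actual CubeProxy source; the shared-dict name, the `sandbox_id` key, and the variable names are assumptions):

```lua
-- rewrite_phase.lua (illustrative sketch, not the actual CubeProxy code)
local cache = ngx.shared.routing_cache   -- assumed lua_shared_dict
local key = ngx.var.sandbox_id           -- hypothetical routing-cache key

local backend = cache:get(key)
if backend then
    -- The "refresh-on-hit" bug: every hit re-sets the entry with a fresh
    -- random TTL, so an active sandbox's entry never expires. The fix is
    -- to delete this set call and let the original 60-300s TTL run out.
    cache:set(key, backend, math.random(60, 300))
    ngx.var.upstream_target = backend
    return
end
-- cache miss: fall through to the Redis lookup (omitted)
```

Removing the `cache:set` on the hit path is what restores natural expiry.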

2. The Solution

This PR implements two major changes:

  • Natural Expiry: Removed the cache:set call on hits in rewrite_phase.lua. Routing entries now follow their original random TTL (60-300s).
  • Proactive Invalidation: Implemented a precise identification mechanism in header_filter_phase.lua to detect connection failures and clear the local cache immediately.

3. Key Logic: Bytes vs. Status

To avoid "false positives" (app-level 5xx), we use $upstream_bytes_received as the primary signal:

  • Network Failure: upstream_bytes_received == 0. Action: Clear Cache.
  • App-level Error: upstream_bytes_received > 0. Action: Keep Cache.

The handler was also refactored with guard clauses for performance, and it logs upstream_status and upstream_bytes_received for observability.
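
Putting the bytes-vs-status rule and the guard clauses together, the filter might look like this (a sketch under assumptions: the shared-dict and key names are illustrative, and the 5xx guard is an assumed precondition for the check):

```lua
-- header_filter_phase.lua (illustrative sketch, not the actual CubeProxy code)
local cache = ngx.shared.routing_cache   -- assumed lua_shared_dict
local key = ngx.var.sandbox_id           -- hypothetical routing-cache key

-- Guard clause: only 5xx responses are candidates for invalidation.
if ngx.status < 500 then
    return
end

-- With retries, $upstream_bytes_received is a comma/colon separated list
-- (e.g. "0, 0"), so sum the entries to get bytes across all attempts.
local total = 0
for entry in (ngx.var.upstream_bytes_received or ""):gmatch("[^,:]+") do
    total = total + (tonumber(entry) or 0)
end

-- > 0 bytes: the backend was reachable and sent headers/body back,
-- i.e. an app-level error. Guard clause: keep the cache entry.
if total > 0 then
    return
end

-- 0 bytes across all attempts: network-level failure (e.g. connection
-- refused). Clear the local entry so the next request re-reads Redis.
cache:delete(key)
ngx.log(ngx.WARN, "routing cache invalidated: key=", key,
        " upstream_status=", tostring(ngx.var.upstream_status),
        " upstream_bytes_received=", tostring(ngx.var.upstream_bytes_received))
```

The early returns keep the hot path (non-5xx responses) to a single status comparison.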

4. Verification Results (Manual E2E)

  • Network Failures: Confirmed connection failures trigger immediate cache invalidation.
  • App-level 5xx: Confirmed business 5xx responses preserve the cache.
  • TTL Behavior: Verified TTL no longer refreshes on hit.
  • Convergence Speed: Verified backend switching completes in a single request after Redis update.

local_cache refresh-on-hit pins stale routing forever.
For any active sandbox, the cached routing entry never expires because
every hit extends the TTL. This leads to stale routing when sandboxes
are destroyed or migrated.

Fix:
1. Remove refresh-on-hit logic in rewrite_phase.lua to allow natural expiry (60-300s).
2. Implement active invalidation in header_filter_phase.lua.
3. Distinguish network failures (e.g. connection refused) from application-level
   5xx responses by analyzing upstream_bytes_received:
   - Network failure: 0 bytes received from upstream across all attempts.
   - App-level error: > 0 bytes received (headers/body).
4. Refactor header_filter_phase.lua with guard clauses and early-break
   optimizations for better performance and readability.
5. Log upstream_status and upstream_bytes_received for enhanced observability.

Signed-off-by: novahe <heqianfly@gmail.com>
@novahe novahe changed the title fix(CubeProxy): solve stale routing by removing refresh-on-hit and implementing proactive cache invalidation bugfix(CubeProxy): solve stale routing by removing refresh-on-hit and implementing proactive cache invalidation May 3, 2026
@staryxchen (Collaborator) commented
Thanks for the contribution — the root cause analysis is spot-on. Cube doesn't support live migration yet (planned). Without it, the backend IP for a given sandbox ID won't change during its lifetime, meaning:

  • Removing refresh-on-hit adds Redis round-trips without benefit.
  • Proactive cache invalidation clears entries that Redis will repopulate with the same (dead) value.

I’d suggest we defer this until live migration support is underway, so the two can ship and be validated together. CC @chenhengqi
