bugfix(CubeProxy): solve stale routing by removing refresh-on-hit and implementing proactive cache invalidation #133

Closed

novahe wants to merge 1 commit into TencentCloud:master from novahe:fix/stale-routing-cache

Conversation

@novahe (Contributor) commented May 3, 2026

1. The Problem

CubeProxy suffered from stale routing issues when sandboxes were destroyed or migrated. The root cause was a "refresh-on-hit" pattern in the local routing cache:

  • Every successful cache hit extended the entry's TTL.
  • For active sandboxes, routing entries never naturally expired.
  • When CubeMaster updated Redis with a new backend IP/Port, the Proxy would continue using the old cached data until a manual restart or an extremely long timeout.
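
The buggy pattern can be sketched in OpenResty terms (a minimal illustrative sketch, not the actual CubeProxy source; the shared-dict name, the `sandbox_id` key, and the variable names are assumptions):

```lua
-- rewrite_phase.lua (illustrative sketch, not the actual CubeProxy code)
local cache = ngx.shared.routing_cache   -- assumed lua_shared_dict
local key = ngx.var.sandbox_id           -- hypothetical routing-cache key

local backend = cache:get(key)
if backend then
    -- The "refresh-on-hit" bug: every hit re-sets the entry with a fresh
    -- random TTL, so an active sandbox's entry never expires. The fix is
    -- to delete this set call and let the original 60-300s TTL run out.
    cache:set(key, backend, math.random(60, 300))
    ngx.var.upstream_target = backend
    return
end
-- cache miss: fall through to the Redis lookup (omitted)
```

Removing the `cache:set` on the hit path is what restores natural expiry.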

2. The Solution

This PR implements two major changes:

  • Natural Expiry: Removed the cache:set call on hits in rewrite_phase.lua. Routing entries now follow their original random TTL (60-300s).
  • Proactive Invalidation: Implemented a precise identification mechanism in header_filter_phase.lua to detect connection failures and clear the local cache immediately.

3. Key Logic: Bytes vs. Status

To avoid "false positives" (app-level 5xx), we use $upstream_bytes_received as the primary signal:

  • Network Failure: upstream_bytes_received == 0. Action: Clear Cache.
  • App-level Error: upstream_bytes_received > 0. Action: Keep Cache.

The handler was also refactored with guard clauses for performance, and it logs upstream_status and upstream_bytes_received for observability.
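
Putting the bytes-vs-status rule and the guard clauses together, the filter might look like this (a sketch under assumptions: the shared-dict and key names are illustrative, and the 5xx guard is an assumed precondition for the check):

```lua
-- header_filter_phase.lua (illustrative sketch, not the actual CubeProxy code)
local cache = ngx.shared.routing_cache   -- assumed lua_shared_dict
local key = ngx.var.sandbox_id           -- hypothetical routing-cache key

-- Guard clause: only 5xx responses are candidates for invalidation.
if ngx.status < 500 then
    return
end

-- With retries, $upstream_bytes_received is a comma/colon separated list
-- (e.g. "0, 0"), so sum the entries to get bytes across all attempts.
local total = 0
for entry in (ngx.var.upstream_bytes_received or ""):gmatch("[^,:]+") do
    total = total + (tonumber(entry) or 0)
end

-- > 0 bytes: the backend was reachable and sent headers/body back,
-- i.e. an app-level error. Guard clause: keep the cache entry.
if total > 0 then
    return
end

-- 0 bytes across all attempts: network-level failure (e.g. connection
-- refused). Clear the local entry so the next request re-reads Redis.
cache:delete(key)
ngx.log(ngx.WARN, "routing cache invalidated: key=", key,
        " upstream_status=", tostring(ngx.var.upstream_status),
        " upstream_bytes_received=", tostring(ngx.var.upstream_bytes_received))
```

The early returns keep the hot path (non-5xx responses) to a single status comparison.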

4. Verification Results (Manual E2E)

  • Network Failures: Confirmed connection failures trigger immediate cache invalidation.
  • App-level 5xx: Confirmed business 5xx responses preserve the cache.
  • TTL Behavior: Verified TTL no longer refreshes on hit.
  • Convergence Speed: Verified backend switching completes in a single request after Redis update.

local_cache refresh-on-hit pins stale routing forever.
For any active sandbox, the cached routing entry never expires because
every hit extends the TTL. This leads to stale routing when sandboxes
are destroyed or migrated.

Fix:
1. Remove refresh-on-hit logic in rewrite_phase.lua to allow natural expiry (60-300s).
2. Implement active invalidation in header_filter_phase.lua.
3. Distinguish network failures (e.g. connection refused) from application-level
   5xx responses by analyzing upstream_bytes_received:
   - Network failure: 0 bytes received from upstream across all attempts.
   - App-level error: > 0 bytes received (headers/body).
4. Refactor header_filter_phase.lua with guard clauses and early-break
   optimizations for better performance and readability.
5. Log upstream_status and upstream_bytes_received for enhanced observability.

Signed-off-by: novahe <heqianfly@gmail.com>
@novahe novahe changed the title fix(CubeProxy): solve stale routing by removing refresh-on-hit and implementing proactive cache invalidation bugfix(CubeProxy): solve stale routing by removing refresh-on-hit and implementing proactive cache invalidation May 3, 2026
@staryxchen (Collaborator) commented
Thanks for the contribution — the root cause analysis is spot-on. Cube doesn't support live migration yet (planned). Without it, the backend IP for a given sandbox ID won't change during its lifetime, meaning:

  • Removing refresh-on-hit adds Redis round-trips without benefit.
  • Proactive cache invalidation clears entries that Redis will repopulate with the same (dead) value.

I’d suggest we defer this until live migration support is underway, so the two can ship and be validated together. CC @chenhengqi
