bugfix(CubeProxy): solve stale routing by removing refresh-on-hit and implementing proactive cache invalidation #133
Closed
novahe wants to merge 1 commit into
local_cache refresh-on-hit pins stale routing forever. For any active sandbox, the cached routing entry never expires because every hit extends the TTL. This leads to stale routing when sandboxes are destroyed or migrated.

Fix:
1. Remove refresh-on-hit logic in rewrite_phase.lua to allow natural expiry (60-300s).
2. Implement active invalidation in header_filter_phase.lua.
3. Distinguish network failures (e.g. connection refused) from application-level 5xx responses by analyzing upstream_bytes_received:
   - Network failure: 0 bytes received from upstream across all attempts.
   - App-level error: > 0 bytes received (headers/body).
4. Refactor header_filter_phase.lua with guard clauses and early-break optimizations for better performance and readability.
5. Log upstream_status and upstream_bytes_received for enhanced observability.

Signed-off-by: novahe <heqianfly@gmail.com>
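The pinning effect described above can be reproduced with a small model. This is a minimal Python sketch, not the Lua code from rewrite_phase.lua; the `RoutingCache` class, the `seconds_until_miss` helper, the sandbox key, and the backend address are all hypothetical names chosen for illustration. It assumes a steadily polled sandbox and a fixed 60s TTL.

```python
class RoutingCache:
    """Toy TTL cache; time is passed in explicitly so the model is deterministic."""

    def __init__(self, refresh_on_hit):
        self.refresh_on_hit = refresh_on_hit
        self._entries = {}  # key -> (value, expires_at, ttl)

    def set(self, key, value, ttl, now):
        self._entries[key] = (value, now + ttl, ttl)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at, ttl = entry
        if now >= expires_at:
            # Natural expiry: drop the entry so the caller re-reads the source of truth.
            del self._entries[key]
            return None
        if self.refresh_on_hit:
            # The problematic pattern: every hit pushes the expiry forward again.
            self._entries[key] = (value, now + ttl, ttl)
        return value


def seconds_until_miss(cache, key, hit_interval, horizon):
    """Request `key` every `hit_interval` seconds; return the time of the first miss,
    or None if the entry is still being served at the end of `horizon`."""
    now = 0
    while now <= horizon:
        if cache.get(key, now) is None:
            return now
        now += hit_interval
    return None


pinned = RoutingCache(refresh_on_hit=True)
pinned.set("sandbox-1", "10.0.0.5:8080", ttl=60, now=0)
print(seconds_until_miss(pinned, "sandbox-1", hit_interval=10, horizon=3600))   # None

natural = RoutingCache(refresh_on_hit=False)
natural.set("sandbox-1", "10.0.0.5:8080", ttl=60, now=0)
print(seconds_until_miss(natural, "sandbox-1", hit_interval=10, horizon=3600))  # 60
```

With refresh-on-hit, any sandbox that receives traffic more often than its TTL never misses, so a changed backend is never picked up. Without it, the entry is re-read after at most one TTL.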
Collaborator
Thanks for the contribution — the root cause analysis is spot-on. Cube doesn't support live migration yet (planned). Without it, the backend IP for a given sandbox ID won't change during its lifetime, meaning:
I’d suggest we defer this until live migration support is underway, so the two can ship and be validated together. CC @chenhengqi
1. The Problem
CubeProxy suffered from stale routing issues when sandboxes were destroyed or migrated. The root cause was a "refresh-on-hit" pattern in the local routing cache:
- For any active sandbox, the cached routing entry never expired because every hit extended the TTL.
- Even if CubeMaster updated Redis with a new backend IP/Port, the Proxy would continue using the old cached data until a manual restart or an extremely long timeout.
2. The Solution
This PR implements two major changes:
- Removed the cache:set call on hits in rewrite_phase.lua. Routing entries now follow their original random TTL (60-300s).
- Added active invalidation in header_filter_phase.lua to detect connection failures and clear the local cache immediately.
3. Key Logic: Bytes vs. Status
To avoid "false positives" (app-level 5xx), we use $upstream_bytes_received as the primary signal:
- Network failure: upstream_bytes_received == 0 across all attempts. Action: Clear Cache.
- App-level error: upstream_bytes_received > 0. Action: Keep Cache.
header_filter_phase.lua was also refactored with guard clauses for performance and now logs upstream_status and upstream_bytes_received for observability.
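The bytes-based decision can be sketched as a pure function. This is an illustrative Python model, not the code in header_filter_phase.lua; `parse_bytes_received` and `should_invalidate` are hypothetical names. It relies on the documented nginx behavior that $upstream_bytes_received lists one value per upstream attempt, separated by commas and colons. Keeping the cache when no numeric value is present is an assumption made here, not something the PR states.

```python
import re


def parse_bytes_received(raw):
    """Split a multi-attempt value such as "0, 0 : 512" into a list of ints.
    Non-numeric tokens (empty string, "-") are skipped."""
    tokens = (tok.strip() for tok in re.split(r"[,:]", raw or ""))
    return [int(tok) for tok in tokens if tok.isdigit()]


def should_invalidate(raw_bytes_received):
    """Return True only when every upstream attempt received 0 bytes."""
    attempts = parse_bytes_received(raw_bytes_received)
    # Guard clause: no recorded attempt, so there is no evidence of a dead backend.
    # (Assumption: keep the cache in this case.)
    if not attempts:
        return False
    for received in attempts:
        if received > 0:
            # Early break: the upstream answered, so this is an app-level error.
            return False
    return True


print(should_invalidate("0"))       # True  -> network failure, clear cache
print(should_invalidate("0, 0"))    # True  -> all retries failed to connect
print(should_invalidate("0, 317"))  # False -> upstream replied on retry, keep cache
print(should_invalidate(""))        # False -> nothing recorded, keep cache
```

An application that returns its own 5xx still sends headers, so its byte count is non-zero and the routing entry stays cached.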
4. Verification Results (Manual E2E):