fix(rate-limiter): production hardening — Redis connection path#78
Without a time-bound on ``Client::get_multiplexed_tokio_connection().await``,
the rate-limiter blocks indefinitely against any Redis endpoint where the
TCP handshake completes but the application layer never speaks
(e.g. plain ``redis://`` against a TLS-required server, network ACL drops,
firewalled cluster, etc.). Every request through the plugin then pays
the framework's outer 30-second plugin timeout, and ``fail_mode`` cannot
engage because the connection-acquisition future never returns to surface
an error.
This commit wraps the connection acquisition in
``tokio::time::timeout(Duration::from_secs(2), …)`` and maps the elapsed
error into a ``redis::ErrorKind::IoError``-shaped ``RedisError`` so the
existing ``fail_mode`` path routes it the same way as any other
connection-side failure.
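The same shape can be sketched on the Python side with ``asyncio.wait_for``. This is an analog of the Rust change under stated assumptions, not the plugin's code: ``acquire_with_deadline`` and ``hanging_acquire`` are hypothetical names, and ``ConnectionError`` stands in for the ``IoError``-shaped ``RedisError``.

```python
import asyncio

CONNECT_TIMEOUT = 2.0  # seconds, mirroring the 2-second bound described above


async def acquire_with_deadline(acquire, timeout=CONNECT_TIMEOUT):
    """Await `acquire()` but give up after `timeout` seconds.

    `acquire` is any zero-argument coroutine function that opens a
    connection. Hypothetical helper for illustration only.
    """
    try:
        return await asyncio.wait_for(acquire(), timeout)
    except asyncio.TimeoutError as exc:
        # Re-raise in the same error family as an ordinary connection
        # failure, so one handler (the fail-mode path) covers both cases.
        raise ConnectionError(
            f"connection not acquired within {timeout}s"
        ) from exc


async def demo():
    async def hanging_acquire():
        # Never completes, like an endpoint that accepts TCP but stays silent.
        await asyncio.Event().wait()

    try:
        await acquire_with_deadline(hanging_acquire, timeout=0.1)
    except ConnectionError as err:
        return f"failed fast: {err}"


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The key design point is the error mapping: callers only ever see a connection-style error, so they need no separate branch for "timed out".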
Test coverage added at both layers:
* Rust unit test (``redis_backend::tests::connection_async_fails_fast_against_hanging_redis``)
— binds a TCP listener that accepts but never reads/writes; asserts
``connection_async`` returns within ~3s with an ``IoError``-shaped
error.
* Python integration test
(``TestRedisFailModeAndViolationContext::test_hanging_redis_fails_fast_via_connect_timeout``)
— same setup pattern at the public-API layer; asserts
``tool_pre_invoke`` completes within ~5s and the default
``fail_mode=open`` allows the request through.
Both tests fail-by-hang against the prior implementation and pass against
this commit; an outer ``tokio::time::timeout`` / ``asyncio.wait_for``
guards each so a regression doesn't hang the test run.
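A minimal sketch of that test pattern, assuming only the standard library: a listener that accepts but never reads or writes, plus an outer runaway guard around the call under test. ``hanging_listener`` and ``run_within`` are hypothetical helper names, not the fixtures used in this PR.

```python
import asyncio
import contextlib
import time


@contextlib.asynccontextmanager
async def hanging_listener():
    """Yield the port of a local TCP server that accepts but never speaks."""

    async def never_speak(reader, writer):
        await asyncio.Event().wait()  # keep the socket open, send nothing

    server = await asyncio.start_server(never_speak, "127.0.0.1", 0)
    try:
        yield server.sockets[0].getsockname()[1]
    finally:
        # close() stops listening; wait_closed() is skipped because the
        # handlers above never finish on their own.
        server.close()


async def run_within(make_call, budget, guard):
    """Await `make_call()`; return its result or the exception it raised.

    Fails the test if the call outlives `budget` seconds. `guard` is the
    outer runaway bound that stops a regression from hanging the suite.
    """

    async def capture():
        try:
            return await make_call()
        except Exception as exc:  # hand the error back for inspection
            return exc

    start = time.monotonic()
    try:
        outcome = await asyncio.wait_for(capture(), guard)
    except asyncio.TimeoutError:
        raise AssertionError(
            f"call still pending after the {guard}s guard"
        ) from None
    elapsed = time.monotonic() - start
    assert elapsed <= budget, f"took {elapsed:.2f}s, budget was {budget}s"
    return outcome
```

Capturing the inner exception before applying the guard keeps the two timeouts distinguishable: a timeout raised by the code under test comes back as a value, while only a genuine hang trips the guard.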
The 2-second value is hardcoded as ``CONNECT_TIMEOUT`` for now to keep
the change small. Promoting it to a ``redis_connect_timeout_ms`` config
key (extending the existing config-key list and warning machinery) is a
trivial follow-up if operators in slow-network deployments need a longer
budget.
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Companion bump to the connection-acquisition timeout fix on this branch. 0.0.6 was the TLS-support release (cpex-plugins#74); 0.0.7 ships the timeout fix on top of it. No behavioural change in this commit on its own; it is a version bump only. Cargo.lock regenerated via ``cargo update -p rate_limiter``.
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
0cb67dd to
0e9bbeb
Rechecked current head. Still missing from the previous review:
I did not rerun the test suite; this is a diff re-check only.
…osed hanging-Redis test
Addresses two review comments on PR #78:
1. Rust unit test ``connection_async_fails_fast_against_hanging_redis`` previously accepted both ``IoError`` and ``ResponseError`` for the timeout error. The implementation explicitly maps the elapsed-error into ``redis::ErrorKind::IoError``, so the test now pins exactly that variant (``assert_eq!(err.kind(), IoError)``). Anything else would mean the timeout is being routed through a different code path than the rest of the fail-mode logic.
2. Add ``test_hanging_redis_with_fail_mode_closed_blocks_with_backend_unavailable`` as a sibling to the existing ``fail_mode=open`` test. Same hanging-socket setup; flips ``fail_mode`` to ``closed`` and asserts the documented BACKEND_UNAVAILABLE response envelope: ``continue_processing=False``, ``violation.code='BACKEND_UNAVAILABLE'``, ``http_status_code=503``, and ``Retry-After`` present in ``violation.http_headers``. The pair now pin both halves of the operator's policy contract under the hanging-Redis failure shape.
Out of scope of this commit: promoting the hardcoded ``CONNECT_TIMEOUT`` to a config key, which is addressed in a separate follow-up commit on this branch.
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
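The two halves of that policy contract can be sketched as a small decision function. This is an illustration only: ``Violation``, ``HookResult``, and ``on_backend_error`` are hypothetical stand-ins for the plugin's real types, placing ``http_status_code`` on the violation is an assumption, and the ``Retry-After`` value of 1 second is invented for the example (the commit only states that the header is present).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Violation:
    code: str
    http_status_code: int
    http_headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class HookResult:
    continue_processing: bool
    violation: Optional[Violation] = None


def on_backend_error(fail_mode: str, retry_after_s: int = 1) -> HookResult:
    """Decide what to do once the Redis backend has reported a failure."""
    if fail_mode == "open":
        # Fail open: the request is allowed through without a violation.
        return HookResult(continue_processing=True)
    if fail_mode == "closed":
        # Fail closed: block with the BACKEND_UNAVAILABLE envelope.
        return HookResult(
            continue_processing=False,
            violation=Violation(
                code="BACKEND_UNAVAILABLE",
                http_status_code=503,
                http_headers={"Retry-After": str(retry_after_s)},
            ),
        )
    raise ValueError(f"unknown fail_mode: {fail_mode!r}")
```

Because a connection timeout is surfaced as an ordinary backend error, it reaches this decision the same way a refused connection would.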
Adds a second comment block alongside the existing "why bound this at all" rationale, explaining why the value is a hardcoded constant rather than a config key: it keeps the plugin's config surface small, 2 s covers typical production paths, and promoting to a knob is a trivial follow-up if a slow-network deployment ever surfaces. Captures the rationale in code so it's discoverable on next review of this constant, rather than only in PR-description history.
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Thanks for the re-check. On (2): kept hardcoded, with rationale captured inline at
The trade-off honestly: 2s is on the aggressive side compared to Lettuce/ioredis (10s) or .NET StackExchange (5s), and you're right that it's operator policy in principle. The reason we'd like to leave it as a constant for this PR: the plugin's config surface is small by design, and adding a knob that the vast majority of operators won't tune expands the schema for everyone else. 2 seconds covers typical production paths comfortably: intra-VPC and cross-AZ Redis connection establishment is well under 100ms, and managed Redis with TLS handshake adds only ~100-300ms on top, so the "slow managed/TLS Redis flip" requires a genuinely unusual ~1.5+ second handshake to bite. If a deployment with deliberately slow networks surfaces, promoting this to
Let me know if you'd rather we add the knob now anyway; it's defensible either way and your call.
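The headroom arithmetic behind that estimate, as a quick sketch using the upper ends of the ranges quoted above (the variable names are illustrative only):

```python
CONNECT_TIMEOUT_MS = 2000  # the hardcoded bound
TCP_CONNECT_MS = 100       # upper end quoted for intra-VPC / cross-AZ connects
TLS_HANDSHAKE_MS = 300     # upper end quoted for managed Redis with TLS

# Slowest "typical" connection setup implied by the quoted figures.
typical_worst_case_ms = TCP_CONNECT_MS + TLS_HANDSHAKE_MS

# Extra delay a deployment would need before the timeout starts to fire.
headroom_ms = CONNECT_TIMEOUT_MS - typical_worst_case_ms

print(typical_worst_case_ms, headroom_ms)  # 400 1600
```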
On (1) and (3): both addressed in
Summary
Wraps ``Client::get_multiplexed_tokio_connection().await`` in a 2-second ``tokio::time::timeout`` so the rate-limiter fails fast when the configured Redis endpoint accepts TCP but never speaks at the application layer (plain ``redis://`` against a TLS-required server, network ACL drops, firewalled cluster, etc.). Without the bound, the call blocks indefinitely; the framework's outer 30-second plugin timeout eventually kills the hook, and ``fail_mode`` cannot engage because the connection-acquisition future never returns to surface an error.
The timeout error is mapped into a ``redis::ErrorKind::IoError``-shaped ``RedisError`` so the existing ``fail_mode`` path routes it the same way as any other connection-side failure.
Test coverage
Two layers, both of which fail by hanging against the prior implementation and pass against this commit:
* Rust unit test (``redis_backend::tests::connection_async_fails_fast_against_hanging_redis``): binds a TCP listener that accepts but never reads/writes; asserts ``connection_async`` returns within ~3s with an ``IoError``-shaped error.
* Python integration test (``TestRedisFailModeAndViolationContext::test_hanging_redis_fails_fast_via_connect_timeout``): same setup pattern at the public-API layer; asserts ``tool_pre_invoke`` completes within ~5s and the default ``fail_mode=open`` allows the request through.
Each test is wrapped with an outer runaway-guard (``tokio::time::timeout(5s)`` / ``asyncio.wait_for(..., 10.0)``) so a regression doesn't hang the test run.
Full local gate
Configurability
The 2-second ``CONNECT_TIMEOUT`` is a hardcoded constant for now to keep the change small. Promoting it to a ``redis_connect_timeout_ms`` config key (extending the existing config-key list and warning machinery) is a trivial follow-up if operators in slow-network deployments need a longer budget.
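A rough sketch of what such a follow-up might look like, written in Python for brevity. Only the ``redis_connect_timeout_ms`` key name comes from this PR; ``resolve_connect_timeout``, the logger name, and the fall-back-with-warning behaviour are assumptions for illustration, not the plugin's existing warning machinery.

```python
import logging

DEFAULT_CONNECT_TIMEOUT_MS = 2000  # today's hardcoded value
log = logging.getLogger("rate_limiter.sketch")  # hypothetical logger name


def resolve_connect_timeout(config: dict) -> float:
    """Return the connect timeout in seconds, defaulting to 2 s.

    Missing keys use the default silently; values that are not positive
    integers are rejected with a warning and also fall back to the default.
    """
    raw = config.get("redis_connect_timeout_ms", DEFAULT_CONNECT_TIMEOUT_MS)
    try:
        timeout_ms = int(raw)
    except (TypeError, ValueError):
        timeout_ms = 0  # force the fallback branch below
    if timeout_ms <= 0:
        log.warning(
            "invalid redis_connect_timeout_ms=%r; using %d ms",
            raw,
            DEFAULT_CONNECT_TIMEOUT_MS,
        )
        timeout_ms = DEFAULT_CONNECT_TIMEOUT_MS
    return timeout_ms / 1000.0
```

Keeping the default equal to the current constant means existing deployments would see no behaviour change unless they set the key.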