feat: add runtime resilience for file deletion survival #32
Conversation
Implement graceful degradation patterns that allow the proxy to continue running even if core files are deleted during runtime. Changes only take effect on restart, enabling safe development while the proxy is serving.

## Changes by Component

### Usage Manager (usage_manager.py)
- Wrap `_save_usage()` in try/except with directory auto-recreation
- Enhanced `_load_usage()` with explicit error handling
- In-memory state continues working if file operations fail

### Failure Logger (failure_logger.py)
- Add module-level `_file_handler` and `_fallback_mode` state
- Create `_create_file_handler()` with directory auto-recreation
- Create `_ensure_handler_valid()` for handler recovery
- Use NullHandler as fallback when file logging fails

### Detailed Logger (detailed_logger.py)
- Add class-level `_disk_available` and `_console_fallback_warned` flags
- Add instance-level `_in_memory_logs` list for fallback storage
- Skip disk writes gracefully when filesystem unavailable

### Google OAuth Base (google_oauth_base.py)
- Update memory cache FIRST before disk write (memory-first pattern)
- Use cached tokens as fallback when refresh/save fails
- Log warnings but don't crash on persistence failures

### Provider Cache (provider_cache.py)
- Add `_disk_available` health flag and `disk_errors` counter
- Track disk health status in `get_stats()`
- Gracefully degrade to memory-only caching on disk failures

### Documentation (DOCUMENTATION.md)
- Add Section 5: Runtime Resilience with resilience hierarchy
- Document "Develop While Running" workflow
- Explain graceful degradation and data loss scenarios
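As a rough illustration of the usage-manager pattern described above (save wrapped in try/except, directory auto-recreation, in-memory state staying authoritative), here is a minimal sketch. The class and file names are hypothetical, not the actual `usage_manager.py` code:

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

class UsageTracker:
    """Illustrative only: in-memory usage state that survives disk failures."""

    def __init__(self, path="key_usage.json"):
        self.path = Path(path)
        self.usage = {}  # authoritative in-memory state

    def record(self, key):
        self.usage[key] = self.usage.get(key, 0) + 1
        self._save_usage()

    def _save_usage(self):
        try:
            # Recreate the directory if it was deleted while the proxy runs.
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps(self.usage, indent=2))
        except OSError as exc:
            # Fail silently, log loudly: memory remains the source of truth.
            logger.warning("Usage save failed (%s); continuing in memory", exc)
```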
Starting my review of the runtime resilience implementation. I'll be examining how the fallback patterns work across the usage manager, failure logger, detailed logger, OAuth base, and provider cache modules. Diving into the disk-to-memory degradation logic now.
Overall Assessment
This is a well-designed implementation of runtime resilience patterns. The "memory-first, fail silently, log loudly" philosophy is consistently applied across all modified modules, and the documentation accurately reflects the implementation. The PR achieves its goal of allowing the proxy to survive filesystem disruptions during runtime.
Architectural Feedback
The resilience hierarchy is sound:
- `google_oauth_base.py`: The memory-first caching in `_save_credentials` and the multi-level fallback in `get_auth_header` are particularly well implemented (a sketch of the pattern follows this list)
- `provider_cache.py`: Excellent observability with `disk_errors` tracking and `disk_available` health flag exposed via `get_stats()`
- `usage_manager.py`: Good race condition handling in `_load_usage` with specific exception handlers for different failure modes
- `failure_logger.py`: The handler recreation pattern with `_ensure_handler_valid()` is robust
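For reference, the memory-first pattern praised above boils down to something like the following sketch (illustrative class and method bodies; the real `_save_credentials`/`get_auth_header` are more involved):

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

class OAuthCredentialStore:
    """Illustrative memory-first credential handling, not the actual google_oauth_base.py code."""

    def __init__(self, path="oauth_creds.json"):
        self.path = Path(path)
        self._cached = None

    def save_credentials(self, creds):
        # Memory first: the cache is updated before any disk I/O, so a later
        # disk failure cannot leave the process without valid tokens.
        self._cached = creds
        try:
            self.path.write_text(json.dumps(creds))
        except OSError as exc:
            logger.warning("Credential persistence failed (%s); using memory cache", exc)

    def get_auth_header(self, refresh):
        try:
            creds = refresh()       # may hit the network, then call save_credentials()
        except Exception:
            if self._cached is None:
                raise
            logger.warning("Refresh failed; falling back to cached token")
            creds = self._cached    # multi-level fallback: last known-good token
        return {"Authorization": f"Bearer {creds['access_token']}"}
```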
Key Suggestions
Two minor improvements suggested for detailed_logger.py:
- State consistency: The `_disk_available` flag should be updated in `_write_json` on both success (recovery) and failure, matching the pattern used in `provider_cache.py` (see the sketch after this list)
- Design documentation: Consider adding a comment explaining why stream chunks don't use in-memory fallback (likely intentional for memory management)
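The suggested flag handling could look roughly like this (hypothetical class and paths; only the success/failure symmetry matters):

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

class DiskCircuitBreaker:
    """Illustrative sketch of the suggested flag handling, not the actual detailed_logger.py code."""

    _disk_available = True  # class-level health flag, as in the PR

    def _write_json(self, path: Path, payload: dict) -> None:
        try:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(payload))
            # Success: re-enable disk writes if a previous failure tripped the flag.
            if not DiskCircuitBreaker._disk_available:
                DiskCircuitBreaker._disk_available = True
                logger.info("Disk writes recovered")
        except OSError as exc:
            # Failure: trip the breaker so later calls can skip disk entirely.
            DiskCircuitBreaker._disk_available = False
            logger.warning("Disk write failed (%s); keeping logs in memory only", exc)
```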
Nitpicks and Minor Points
- The documentation in `DOCUMENTATION.md` is clear and helpful, accurately describing the resilience levels
- The use of `[RUNTIME RESILIENCE]` markers in docstrings is a nice touch for code navigation
Questions for the Author
- Is the asymmetry between `_write_json` (stores in memory on failure) and `log_stream_chunk` (drops on failure) intentional? If so, a brief comment would help future maintainers understand this design choice.
This review was generated by an AI assistant.
There's still an issue with this PR: somehow there's a case where the proxy doesn't receive the API request from the client. The client then gets stuck waiting on a request the proxy never saw, so both sides hang.
Yeah, it seems the changes are quite destructive for a lot of flows. Since there is a problem, I'm converting this to a draft. Make it ready for review once it's problem-free.
I just reset to main, and somehow I'm still experiencing the issue. Is it related to caching or memory? Using antigravity gemini 3 and opus 4.5.
I've got no cases of the client sending a request that the proxy never received. Like, ever. @MasuRii
Address bot review feedback on PR Mirrowel#32:
- Add _disk_available flag update in _write_json exception handler
- Add _disk_available flag update in log_stream_chunk (critical for streams)
- Document intentional no-memory-fallback design for streams
- Add _fallback_mode update in failure_logger exception handler
- Add complete circuit breaker pattern to usage_manager
Follow-up Review
Thank you for addressing all the feedback from the initial review.
Assessment of New Changes:
All previously raised concerns have been properly addressed:
✅ detailed_logger.py:
- Circuit breaker flag (`_disk_available = False`) now set in `_write_json()` exception handler for state consistency
- Circuit breaker flag also added to `log_stream_chunk()` exception handler - this is particularly valuable for streaming responses where hundreds of chunks could trigger repeated failures
- Added clear documentation explaining the intentional design decision to skip memory fallback for streams (OOM prevention)
✅ failure_logger.py:
- `_fallback_mode = True` now correctly set in the `log_failure()` exception handler to trigger handler recreation on subsequent calls
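A minimal sketch of that handler-recreation loop, using the module-level names mentioned in the PR (`_file_handler`, `_fallback_mode`, `_create_file_handler`) but with assumed bodies:

```python
import logging
import os

_file_handler = None
_fallback_mode = False
_logger = logging.getLogger("failure_log")

def _create_file_handler():
    """(Re)create the file handler, recreating logs/ if needed; NullHandler on failure."""
    global _file_handler, _fallback_mode
    if _file_handler is not None:
        _logger.removeHandler(_file_handler)
    try:
        os.makedirs("logs", exist_ok=True)
        _file_handler = logging.FileHandler(os.path.join("logs", "failures.log"))
        _fallback_mode = False
    except OSError:
        _file_handler = logging.NullHandler()  # keeps logging calls harmless
        _fallback_mode = True
    _logger.addHandler(_file_handler)

def log_failure(message):
    """Log a failure; a broken handler triggers recreation on the next call."""
    global _fallback_mode
    if _fallback_mode or _file_handler is None:
        _create_file_handler()
    try:
        _logger.error(message)
    except OSError:
        _fallback_mode = True  # mark the handler as broken for the next call
```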
✅ usage_manager.py:
- Instance-level `_disk_available` circuit breaker properly implemented
- Early return check added to `_save_usage()` to skip disk writes when unavailable
- Circuit breaker reset logic in `_load_usage()` re-enables disk writes after successful file load
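Condensed into a hypothetical snippet, the three usage-manager behaviours above fit together roughly like this (illustrative names and bodies, not the actual code):

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

class UsageManagerSketch:
    """Hypothetical condensation of the circuit-breaker flow described above."""

    def __init__(self, path="key_usage.json"):
        self.path = Path(path)
        self._disk_available = True  # instance-level circuit breaker
        self.usage = {}

    def _save_usage(self):
        if not self._disk_available:
            return  # early return: skip disk writes while the breaker is tripped
        try:
            self.path.write_text(json.dumps(self.usage))
        except OSError:
            self._disk_available = False

    def _load_usage(self):
        try:
            self.usage = json.loads(self.path.read_text())
            self._disk_available = True  # reset: a successful load re-enables writes
        except (FileNotFoundError, json.JSONDecodeError, OSError) as exc:
            logger.warning("Could not load usage file (%s); starting empty", exc)
            self.usage = {}
```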
Overall Status:
The circuit breaker patterns are now consistently applied across all three components. The implementation follows the "fail silently, log loudly" philosophy and ensures graceful degradation without log flooding or performance issues.
The PR is ready for merge from a code quality perspective.
This review was generated by an AI assistant.
…usage_manager.py by integrating runtime resilience with human-readable timestamps
Enhances application reliability by introducing a comprehensive I/O abstraction layer that eliminates crashes from filesystem issues. The system distinguishes between critical state files (credentials, usage data) that require memory buffering with retry logic, and disposable logs that can be safely dropped on failure.

Key improvements:
- New `ResilientStateWriter` class maintains in-memory state for critical files with background retry mechanism on disk failure
- Introduced `safe_write_json`, `safe_log_write`, and `safe_mkdir` utility functions for one-shot operations with graceful degradation
- Logging subsystems (`DetailedLogger`, `failure_logger`) now drop data on disk failure to prevent memory exhaustion during streaming
- Authentication providers (`GoogleOAuthBase`, `IFlowAuthBase`, `QwenAuthBase`) preserve credentials in memory when filesystem becomes unavailable
- `UsageManager` delegates persistence to `ResilientStateWriter` for automatic recovery from transient failures
- `ProviderCache` disk operations now fail silently while maintaining in-memory functionality
- Replaced scattered tempfile/atomic write patterns with centralized implementation featuring consistent error handling
- All directory creation operations now proceed gracefully if parent paths are inaccessible
- Thread-safe writer implementation supports concurrent usage from async contexts

BREAKING CHANGE: `ProviderCache._save_to_disk()` no longer raises exceptions on filesystem errors. Consumers relying on exception handling for disk write failures must now check the `disk_available` field in the `get_stats()` return value for monitoring disk health.
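The commit names `ResilientStateWriter`, `safe_write_json`, `safe_log_write`, and `safe_mkdir` but doesn't show their bodies; the following is only a guess at what the one-shot `safe_write_json` helper could look like (atomic temp-file write, boolean return instead of raising):

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)

def safe_write_json(path, payload):
    """One-shot atomic JSON write that degrades gracefully (illustrative shape only).

    Returns True on success, False if the filesystem is unavailable, instead of
    raising - callers keep their in-memory copy either way.
    """
    target = Path(path)
    try:
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then replace atomically so a
        # crash mid-write never leaves a truncated state file behind.
        with tempfile.NamedTemporaryFile(
            "w", dir=target.parent, suffix=".tmp", delete=False
        ) as tmp:
            json.dump(payload, tmp)
            tmp_path = Path(tmp.name)
        tmp_path.replace(target)
        return True
    except OSError as exc:
        logger.warning("safe_write_json(%s) failed: %s", path, exc)
        return False
```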
Okay, did a huge rework. It's the same idea, but more centralized, and applied to all writes, not just some.
This commit introduces a global buffered write registry with automatic shutdown flush, ensuring critical data (auth tokens, usage stats) is saved even when disk writes fail temporarily.

- Add `BufferedWriteRegistry` singleton for centralized buffered write management
- Implement periodic retry (30s interval) and atexit shutdown flush for pending writes
- Enable `buffer_on_failure` parameter in `safe_write_json()` for credential files
- Integrate buffering with `ResilientStateWriter` for automatic registry registration
- Update OAuth providers (Google, Qwen, iFlow) to use buffered credential writes
- Change provider cache `_save_to_disk()` to return success status for better tracking
- Reduce log noise by changing missing thoughtSignature warnings to debug level
- Export `BufferedWriteRegistry` from utils module for monitoring access

The new architecture ensures data is never lost on graceful shutdown (Ctrl+C), with console output showing flush progress and results. All buffered writes are retried in a background thread and guaranteed a final save attempt on application exit.
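As a rough sketch of how such a registry could work (the real `BufferedWriteRegistry` API is not shown in this thread, so the names and signatures below are assumptions):

```python
import atexit
import threading
import time

class BufferedWriteRegistry:
    """Hypothetical shape: retry failed writes periodically and flush them at exit."""

    def __init__(self, retry_interval=30.0):
        self._pending = {}           # path -> zero-arg callable returning True on success
        self._lock = threading.Lock()
        atexit.register(self.flush)  # guaranteed final save attempt on shutdown
        worker = threading.Thread(target=self._retry_loop, args=(retry_interval,), daemon=True)
        worker.start()

    def buffer(self, path, write_fn):
        """Remember a failed write so it can be retried later."""
        with self._lock:
            self._pending[path] = write_fn

    def flush(self):
        """Attempt every pending write once; keep the ones that still fail."""
        with self._lock:
            pending, self._pending = self._pending, {}
        still_failing = {p: fn for p, fn in pending.items() if not fn()}
        with self._lock:
            self._pending.update(still_failing)

    def _retry_loop(self, interval):
        while True:
            time.sleep(interval)
            self.flush()
```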
…t variables

Change credential loading strategy across all auth providers to prefer file-based credentials when an explicit file path is provided, falling back to legacy environment variables only when the file is not found.

- Modified `_load_credentials()` in GoogleOAuthBase, IFlowAuthBase, and QwenAuthBase to attempt file loading first
- Environment variable fallback now only triggers on FileNotFoundError, improving error clarity
- Removed redundant exception handling in GoogleOAuthBase (duplicate catch blocks)
- Fixed potential deadlock in credential refresh queue by removing nested lock acquisition
- `_refresh_token()` already handles its own locking, so removed outer lock to prevent deadlock
- Improved logging to indicate when fallback to environment variables occurs
- Maintains backwards compatibility for existing deployments using environment variables

This change addresses two issues:
1. Ensures explicit file paths are respected as the primary credential source
2. Prevents deadlock scenario where refresh queue would acquire lock before calling `_refresh_token()`, which also acquires the same lock
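A minimal sketch of the file-first strategy described above, assuming a hypothetical `load_credentials(file_path, env_var)` helper rather than the providers' actual `_load_credentials()` signatures:

```python
import json
import logging
import os

logger = logging.getLogger(__name__)

def load_credentials(file_path, env_var):
    """Illustrative file-first loading with a legacy env-var fallback."""
    try:
        with open(file_path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        # Only a missing file triggers the legacy path; a malformed file still raises,
        # which keeps error messages pointing at the real problem.
        logger.info("Credential file %s not found; falling back to %s", file_path, env_var)
        raw = os.environ.get(env_var)
        if raw is None:
            raise
        return json.loads(raw)
```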
This might need some usage testing. I had some weird bugs with credentials becoming unavailable, leaving one or even zero. It might be fixed already, but just to be sure.
Gonna merge into Dev branch instead of Main - usage testing will go there. |
- Change antigravity cache logging from info to debug level to reduce noise
- Replace Gemini CLI's delegated error parsing with native implementation
- Add comprehensive duration parsing for multiple time formats (2s, 156h14m36s, 515092.73s)
- Extract retry timing from human-readable error messages instead of relying on structured metadata
- Improve error body extraction with multiple fallback strategies

The Gemini CLI provider now handles its own quota error parsing rather than delegating to Antigravity, since the two providers use fundamentally different error formats: Gemini embeds reset times in human-readable messages while Antigravity uses structured RetryInfo/quotaResetDelay metadata.
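For illustration, a small parser handling the three duration formats quoted in this commit message could look like the following (hypothetical helper, not the provider's actual code):

```python
import re

_DURATION_RE = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?")

def parse_duration(text):
    """Parse durations like '2s', '156h14m36s', or '515092.73s' into seconds."""
    match = _DURATION_RE.fullmatch(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"Unrecognized duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return (int(hours or 0) * 3600
            + int(minutes or 0) * 60
            + float(seconds or 0))

# Examples taken from the commit message above:
assert parse_duration("2s") == 2.0
assert parse_duration("156h14m36s") == 156 * 3600 + 14 * 60 + 36
assert parse_duration("515092.73s") == 515092.73
```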
feat: add runtime resilience for file deletion survival
Summary
This PR implements Runtime Resilience - a feature that ensures the LLM API Key Proxy continues functioning even if core files are deleted during runtime. The proxy will only reflect changes on restart, enabling developers to safely modify or delete code while the proxy is actively serving requests.
Motivation
When developing or maintaining the proxy, users may want to modify or delete files while the proxy is actively serving requests.

Previously, deleting critical files (like `logs/` or `key_usage.json`) could cause the running proxy to fail. This PR ensures graceful degradation in all such scenarios.

Changes
Core Resilience Patterns Applied
- `usage_manager.py`: try/except around file operations with directory auto-recreation
- `failure_logger.py`: `_fallback_mode` state update
- `detailed_logger.py`: `_disk_available` flag + in-memory fallback + flag update in exception handlers
- `google_oauth_base.py`: memory-first caching with cached-token fallback
- `provider_cache.py`: `_disk_available` health flag + error tracking

Follow-up Fixes (Commit 5e42536)

Based on code review feedback, the following improvements were added:
- `detailed_logger.py`:
  - `_disk_available = False` in `_write_json()` exception handler for state consistency
  - `_disk_available = False` in `log_stream_chunk()` exception handler (critical for streaming - prevents hundreds of repeated failures per stream)
- `failure_logger.py`:
  - `_fallback_mode = True` in `log_failure()` exception handler to trigger handler recreation
- `usage_manager.py`:
  - `_disk_available` instance variable as circuit breaker

Documentation

Added Section 5: Runtime Resilience to `DOCUMENTATION.md`, covering the resilience hierarchy, the "Develop While Running" workflow, and graceful degradation / data loss scenarios.

Technical Details
Resilience Hierarchy
Loaded code stays in `sys.modules`, so deleting `.py` files won't crash active requests. Logging falls back to `NullHandler` or the console if disk writes fail.

Key Design Decisions
Testing
- All modified files compile successfully (verified with `py_compile`)

Manual Testing Suggested
Files Changed
Commits
- `31c3d36` - feat: add runtime resilience for file deletion survival
- `5e42536` - fix(resilience): complete circuit breaker patterns per PR review

Checklist
Important
Introduces runtime resilience to ensure proxy functionality during file deletions by implementing in-memory fallbacks and directory auto-recreation across components.
- `usage_manager.py`: Try/except for file operations, directory auto-recreation, usage tracking continues in memory.
- `failure_logger.py`: NullHandler fallback, handler recreation if file logging fails.
- `detailed_logger.py`: `_disk_available` flag, in-memory logging fallback.
- `google_oauth_base.py`: Memory-first caching, cached token fallback if file operations fail.
- `provider_cache.py`: `_disk_available` health flag, error tracking, caching degrades to memory-only.
- Documentation: Added a section to `DOCUMENTATION.md` detailing resilience hierarchy and scenarios.

This description was created by
for 5e42536.