
Conversation

@MasuRii (Contributor) commented Dec 6, 2025

Summary

This PR implements Runtime Resilience - a feature that ensures the LLM API Key Proxy continues functioning even if core files are deleted during runtime. The proxy will only reflect changes on restart, enabling developers to safely modify or delete code while the proxy is actively serving requests.

Motivation

When developing or maintaining the proxy, users may want to:

  • Modify source code while the proxy is running
  • Clean up log files without restarting
  • Reset usage statistics without interrupting service

Previously, deleting critical files (like logs/ or key_usage.json) could cause the running proxy to fail. This PR ensures graceful degradation in all such scenarios.

Changes

Core Resilience Patterns Applied

| Component | Pattern | Impact |
| --- | --- | --- |
| usage_manager.py | Try/except + directory auto-recreation + circuit breaker | Usage tracking continues in memory; prevents repeated failed writes |
| failure_logger.py | NullHandler fallback + handler recreation + _fallback_mode state update | Logging degrades gracefully; triggers handler recreation on next call |
| detailed_logger.py | _disk_available flag + in-memory fallback + flag update in exception handlers | Request logging continues in memory; circuit breaker trips on any write failure |
| google_oauth_base.py | Memory-first caching + cached token fallback | OAuth tokens survive file deletion |
| provider_cache.py | _disk_available health flag + error tracking | Caching degrades to memory-only |
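
As a rough illustration of the shared pattern in the table above (try/except around file writes, directory auto-recreation, and a circuit-breaker flag), here is a minimal sketch. The class and method names are hypothetical and simplified; this is not the actual usage_manager.py code.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

class ResilientUsageStore:
    """Hypothetical example: stats live in memory, disk persistence is best-effort."""

    def __init__(self, path: str = "key_usage.json") -> None:
        self.path = Path(path)
        self.usage: dict[str, int] = {}   # in-memory source of truth
        self._disk_available = True       # circuit breaker: skip disk after first failure

    def record(self, key: str) -> None:
        self.usage[key] = self.usage.get(key, 0) + 1
        self._save()

    def _save(self) -> None:
        if not self._disk_available:
            return  # breaker open: avoid hammering a broken filesystem
        try:
            self.path.parent.mkdir(parents=True, exist_ok=True)  # auto-recreate directory
            self.path.write_text(json.dumps(self.usage))
        except OSError as exc:
            self._disk_available = False  # trip the breaker; tracking continues in memory
            logger.warning("Usage persistence disabled after write failure: %s", exc)
```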

Follow-up Fixes (Commit 5e42536)

Based on code review feedback, the following improvements were added:

  • detailed_logger.py:

    • Added _disk_available = False in _write_json() exception handler for state consistency
    • Added _disk_available = False in log_stream_chunk() exception handler (critical for streaming - prevents hundreds of repeated failures per stream)
    • Added documentation comment explaining intentional no-memory-fallback design for streams (OOM prevention)
  • failure_logger.py:

    • Added _fallback_mode = True in log_failure() exception handler to trigger handler recreation
  • usage_manager.py:

    • Added _disk_available instance variable as circuit breaker
    • Added early return check to skip disk writes when unavailable
    • Added circuit breaker reset on successful file load
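
As a rough sketch of the log_stream_chunk fix described above (names simplified; not the actual detailed_logger.py code), the circuit breaker keeps a failing stream from retrying the write on every chunk, and chunks are deliberately dropped rather than buffered in memory:

```python
import logging

logger = logging.getLogger(__name__)

class StreamChunkLogger:
    """Hypothetical example mirroring the log_stream_chunk circuit breaker."""

    def __init__(self, log_file: str) -> None:
        self.log_file = log_file
        self._disk_available = True

    def log_stream_chunk(self, chunk: str) -> None:
        if not self._disk_available:
            return  # breaker already tripped: don't retry for every chunk in the stream
        try:
            with open(self.log_file, "a", encoding="utf-8") as fh:
                fh.write(chunk + "\n")
        except OSError as exc:
            # Intentionally no in-memory fallback here: buffering an unbounded stream
            # could exhaust memory (OOM prevention), so chunks are simply dropped.
            self._disk_available = False
            logger.warning("Stream chunk logging disabled for this request: %s", exc)
```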

Documentation

Added Section 5: Runtime Resilience to DOCUMENTATION.md covering:

  • Resilience hierarchy (4 levels from Core API to Logging)
  • "Develop While Running" workflow
  • Graceful degradation and data loss scenarios

Technical Details

Resilience Hierarchy

  1. Level 1 - Core API Handling: Python keeps imported modules in memory (sys.modules), so deleting .py files won't crash active requests.
  2. Level 2 - Credential Management: OAuth tokens cached in memory first.
  3. Level 3 - Usage Tracking: Stats maintained in memory; files auto-recreated on next save.
  4. Level 4 - Logging: Falls back to NullHandler or console if disk writes fail.
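
A minimal sketch of the Level 2 memory-first idea, assuming a simple token store; the real google_oauth_base.py logic is more involved:

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

class TokenStore:
    """Hypothetical example of memory-first credential caching."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self._cached_token: dict | None = None  # memory copy survives file deletion

    def save(self, token: dict) -> None:
        self._cached_token = token  # update memory first, then attempt persistence
        try:
            self.path.write_text(json.dumps(token))
        except OSError as exc:
            logger.warning("Token not persisted; keeping in-memory copy only: %s", exc)

    def load(self) -> dict | None:
        try:
            return json.loads(self.path.read_text())
        except (OSError, json.JSONDecodeError):
            return self._cached_token  # fall back to the cached token
```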

Key Design Decisions

  • Memory-First: All state updates go to memory first, then attempt disk persistence
  • Directory Auto-Recreation: Missing directories recreated before write operations
  • Fail Silently, Log Loudly: File errors logged as warnings but never crash the proxy
  • Circuit Breaker Pattern: After first disk failure, skip further disk attempts to prevent log flooding and performance degradation (especially critical for streaming responses with hundreds of chunks)
  • No API Changes: All modifications are internal; external interface unchanged
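
For the Level 4 logging fallback, a rough sketch of the NullHandler degradation and handler-recreation idea (module and function names are illustrative, not the actual failure_logger.py):

```python
import logging
from pathlib import Path

_logger = logging.getLogger("failure_log_sketch")
_logger.propagate = False
_fallback_mode = True  # start degraded so the first call builds the real handler

def _ensure_handler_valid(path: str = "logs/failures.log") -> None:
    """(Re)attach a file handler, degrading to NullHandler if the disk is unusable."""
    global _fallback_mode
    try:
        Path(path).parent.mkdir(parents=True, exist_ok=True)  # auto-recreate logs/
        handler: logging.Handler = logging.FileHandler(path)
        _fallback_mode = False
    except OSError:
        handler = logging.NullHandler()  # failure logging becomes a silent no-op
        _fallback_mode = True
    _logger.handlers = [handler]

def log_failure(message: str) -> None:
    if _fallback_mode or not _logger.handlers:
        _ensure_handler_valid()  # attempt recovery on every call while degraded
    _logger.error(message)
```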

Testing

  • ✅ All modified modules pass Python syntax validation (py_compile)
  • ✅ All imports verified working
  • ✅ Full package imports successful
  • ✅ Circuit breaker flags confirmed in all exception handlers

Manual Testing Suggested

```
# Start the proxy
python -m src.proxy_app.main

# While running, delete the logs folder (Windows)
rmdir /s /q logs

# Make an API request - should still work
# Logs folder will be auto-recreated

# Delete key_usage.json
del key_usage.json

# Make another request - usage tracking continues in memory
```

Files Changed

```
 DOCUMENTATION.md                                   |  31 +++++
 src/proxy_app/detailed_logger.py                   |  35 ++++-
 src/rotator_library/failure_logger.py              | 103 +++++++++++----
 src/rotator_library/providers/google_oauth_base.py | 141 +++++++++++++--------
 src/rotator_library/providers/provider_cache.py    |  37 +++++-
 src/rotator_library/usage_manager.py               |  56 ++++++-
 6 files changed, 311 insertions(+), 92 deletions(-)
```

Commits

  1. 31c3d36 - feat: add runtime resilience for file deletion survival

    • Initial implementation of resilience patterns across all components
  2. 5e42536 - fix(resilience): complete circuit breaker patterns per PR review

    • Address review feedback: complete circuit breaker implementation in all exception handlers
    • Add documentation for intentional design decisions
    • Ensure state consistency across all failure scenarios

Checklist

  • Code follows project conventions
  • All imports verified working
  • Documentation updated
  • No breaking changes
  • Backwards compatible
  • Circuit breaker patterns complete in all exception handlers
  • All review feedback addressed

Important

Introduces runtime resilience to ensure proxy functionality during file deletions by implementing in-memory fallbacks and directory auto-recreation across components.

  • Behavior:
    • Implements runtime resilience to ensure proxy continues functioning if critical files are deleted during runtime.
    • In-memory fallbacks and directory auto-recreation applied across components.
  • Core Resilience Patterns:
    • usage_manager.py: Try/except for file operations, directory auto-recreation, usage tracking continues in memory.
    • failure_logger.py: NullHandler fallback, handler recreation if file logging fails.
    • detailed_logger.py: _disk_available flag, in-memory logging fallback.
    • google_oauth_base.py: Memory-first caching, cached token fallback if file operations fail.
    • provider_cache.py: _disk_available health flag, error tracking, caching degrades to memory-only.
  • Documentation:
    • Adds "Runtime Resilience" section to DOCUMENTATION.md detailing resilience hierarchy and scenarios.
  • Misc:
    • No changes to external API; all modifications are internal.

This description was created by Ellipsis for 5e42536.

Implement graceful degradation patterns that allow the proxy to continue
running even if core files are deleted during runtime. Changes only take
effect on restart, enabling safe development while the proxy is serving.

## Changes by Component

### Usage Manager (usage_manager.py)
- Wrap `_save_usage()` in try/except with directory auto-recreation
- Enhanced `_load_usage()` with explicit error handling
- In-memory state continues working if file operations fail

### Failure Logger (failure_logger.py)
- Add module-level `_file_handler` and `_fallback_mode` state
- Create `_create_file_handler()` with directory auto-recreation
- Create `_ensure_handler_valid()` for handler recovery
- Use NullHandler as fallback when file logging fails

### Detailed Logger (detailed_logger.py)
- Add class-level `_disk_available` and `_console_fallback_warned` flags
- Add instance-level `_in_memory_logs` list for fallback storage
- Skip disk writes gracefully when filesystem unavailable

### Google OAuth Base (google_oauth_base.py)
- Update memory cache FIRST before disk write (memory-first pattern)
- Use cached tokens as fallback when refresh/save fails
- Log warnings but don't crash on persistence failures

### Provider Cache (provider_cache.py)
- Add `_disk_available` health flag and `disk_errors` counter
- Track disk health status in get_stats()
- Gracefully degrade to memory-only caching on disk failures

### Documentation (DOCUMENTATION.md)
- Add Section 5: Runtime Resilience with resilience hierarchy
- Document "Develop While Running" workflow
- Explain graceful degradation and data loss scenarios

@mirrobot-agent (Contributor) bot commented Dec 6, 2025

Starting my review of the runtime resilience implementation. I'll be examining how the fallback patterns work across the usage manager, failure logger, detailed logger, OAuth base, and provider cache modules. Diving into the disk-to-memory degradation logic now.

@mirrobot-agent (Contributor) bot left a comment

Overall Assessment

This is a well-designed implementation of runtime resilience patterns. The "memory-first, fail silently, log loudly" philosophy is consistently applied across all modified modules, and the documentation accurately reflects the implementation. The PR achieves its goal of allowing the proxy to survive filesystem disruptions during runtime.

Architectural Feedback

The resilience hierarchy is sound:

  • google_oauth_base.py: The memory-first caching in _save_credentials and the multi-level fallback in get_auth_header are particularly well-implemented
  • provider_cache.py: Excellent observability with disk_errors tracking and disk_available health flag exposed via get_stats()
  • usage_manager.py: Good race condition handling in _load_usage with specific exception handlers for different failure modes
  • failure_logger.py: The handler recreation pattern with _ensure_handler_valid() is robust

Key Suggestions

Two minor improvements suggested for detailed_logger.py:

  1. State consistency: The _disk_available flag should be updated in _write_json on both success (recovery) and failure, matching the pattern used in provider_cache.py
  2. Design documentation: Consider adding a comment explaining why stream chunks don't use in-memory fallback (likely intentional for memory management)

Nitpicks and Minor Points

  • The documentation in DOCUMENTATION.md is clear and helpful, accurately describing the resilience levels
  • The use of [RUNTIME RESILIENCE] markers in docstrings is a nice touch for code navigation

Questions for the Author

  • Is the asymmetry between _write_json (stores in memory on failure) and log_stream_chunk (drops on failure) intentional? If so, a brief comment would help future maintainers understand this design choice.

This review was generated by an AI assistant.

@MasuRii (Contributor, Author) commented Dec 6, 2025

There's still an issue with this PR: somehow there's a case where the proxy doesn't receive the API request from the client. This leaves the client stuck waiting while the proxy never got the request, so both are stuck.

@Mirrowel (Owner) commented Dec 6, 2025

> There's still an issue with this PR: somehow there's a case where the proxy doesn't receive the API request from the client. This leaves the client stuck waiting while the proxy never got the request, so both are stuck.

Yeah, it seems the changes are quite destructive for a lot of flows.

If there is a problem, I'm converting this to a draft. Mark it ready for review when it is problem-free.

Mirrowel marked this pull request as draft on December 6, 2025 19:01

@MasuRii (Contributor, Author) commented Dec 6, 2025

> > There's still an issue with this PR: somehow there's a case where the proxy doesn't receive the API request from the client. This leaves the client stuck waiting while the proxy never got the request, so both are stuck.
>
> Yeah, it seems the changes are quite destructive for a lot of flows.
>
> If there is a problem, I'm converting this to a draft. Mark it ready for review when it is problem-free.

I just reset to main, and somehow I'm still experiencing the issue. Is it related to caching or memory? Using Antigravity Gemini 3 and Opus 4.5.

@Mirrowel (Owner) commented Dec 6, 2025

I've had no cases of the client sending a request that the proxy never received. Like, ever. @MasuRii
The bot doing the review above is running on the latest main commit.

Address bot review feedback on PR Mirrowel#32:

- Add _disk_available flag update in _write_json exception handler
- Add _disk_available flag update in log_stream_chunk (critical for streams)
- Document intentional no-memory-fallback design for streams
- Add _fallback_mode update in failure_logger exception handler
- Add complete circuit breaker pattern to usage_manager

@mirrobot-agent (Contributor) bot left a comment

Follow-up Review

Thank you for addressing all the feedback from the initial review.

Assessment of New Changes:

All previously raised concerns have been properly addressed:

detailed_logger.py:

  • Circuit breaker flag (_disk_available = False) now set in _write_json() exception handler for state consistency
  • Circuit breaker flag also added to log_stream_chunk() exception handler - this is particularly valuable for streaming responses where hundreds of chunks could trigger repeated failures
  • Added clear documentation explaining the intentional design decision to skip memory fallback for streams (OOM prevention)

failure_logger.py:

  • _fallback_mode = True now correctly set in the log_failure() exception handler to trigger handler recreation on subsequent calls

usage_manager.py:

  • Instance-level _disk_available circuit breaker properly implemented
  • Early return check added to _save_usage() to skip disk writes when unavailable
  • Circuit breaker reset logic in _load_usage() re-enables disk writes after successful file load

Overall Status:

The circuit breaker patterns are now consistently applied across all three components. The implementation follows the "fail silently, log loudly" philosophy and ensures graceful degradation without log flooding or performance issues.

The PR is ready for merge from a code quality perspective.

This review was generated by an AI assistant.

…usage_manager.py by integrating runtime resilience with human-readable timestamps
Enhances application reliability by introducing a comprehensive I/O abstraction layer that eliminates crashes from filesystem issues. The system distinguishes between critical state files (credentials, usage data) that require memory buffering with retry logic, and disposable logs that can be safely dropped on failure.

Key improvements:

- New `ResilientStateWriter` class maintains in-memory state for critical files with background retry mechanism on disk failure
- Introduced `safe_write_json`, `safe_log_write`, and `safe_mkdir` utility functions for one-shot operations with graceful degradation
- Logging subsystems (`DetailedLogger`, `failure_logger`) now drop data on disk failure to prevent memory exhaustion during streaming
- Authentication providers (`GoogleOAuthBase`, `IFlowAuthBase`, `QwenAuthBase`) preserve credentials in memory when filesystem becomes unavailable
- `UsageManager` delegates persistence to `ResilientStateWriter` for automatic recovery from transient failures
- `ProviderCache` disk operations now fail silently while maintaining in-memory functionality
- Replaced scattered tempfile/atomic write patterns with centralized implementation featuring consistent error handling
- All directory creation operations now proceed gracefully if parent paths are inaccessible
- Thread-safe writer implementation supports concurrent usage from async contexts

BREAKING CHANGE: `ProviderCache._save_to_disk()` no longer raises exceptions on filesystem errors. Consumers relying on exception handling for disk write failures must now check the `disk_available` field in `get_stats()` return value for monitoring disk health.
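
For illustration only, a minimal sketch of what a writer with the behavior described above might look like (in-memory state plus background retry); it is not the repository's actual ResilientStateWriter:

```python
import json
import logging
import threading
from pathlib import Path

logger = logging.getLogger(__name__)

class ResilientStateWriter:
    """Hypothetical example: the latest state always lives in memory and the
    disk write is retried in the background until it succeeds."""

    def __init__(self, path: str, retry_interval: float = 30.0) -> None:
        self._path = Path(path)
        self._retry_interval = retry_interval
        self._pending = None
        self._retry_scheduled = False
        self._lock = threading.Lock()  # safe to call from multiple threads or async tasks

    def write(self, state: dict) -> None:
        with self._lock:
            self._pending = state  # memory copy is the source of truth
        self._try_flush()

    def _try_flush(self) -> None:
        if self.flush():
            return
        with self._lock:
            if self._retry_scheduled:
                return  # a retry is already pending; don't stack timers
            self._retry_scheduled = True
        timer = threading.Timer(self._retry_interval, self._retry)
        timer.daemon = True
        timer.start()

    def _retry(self) -> None:
        with self._lock:
            self._retry_scheduled = False
        self._try_flush()

    def flush(self) -> bool:
        with self._lock:
            if self._pending is None:
                return True
            try:
                self._path.parent.mkdir(parents=True, exist_ok=True)
                tmp = self._path.with_suffix(".tmp")
                tmp.write_text(json.dumps(self._pending))
                tmp.replace(self._path)  # write to a temp file, then atomically replace
                self._pending = None
                return True
            except OSError as exc:
                logger.warning("State write failed, will retry in background: %s", exc)
                return False
```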

@Mirrowel (Owner) commented Dec 8, 2025

Okay, did a huge rework. It's the same idea, but more centralized, and applied to all writes, not just some.

This commit introduces a global buffered write registry with automatic shutdown flush, ensuring critical data (auth tokens, usage stats) is saved even when disk writes fail temporarily.

- Add `BufferedWriteRegistry` singleton for centralized buffered write management
- Implement periodic retry (30s interval) and atexit shutdown flush for pending writes
- Enable `buffer_on_failure` parameter in `safe_write_json()` for credential files
- Integrate buffering with `ResilientStateWriter` for automatic registry registration
- Update OAuth providers (Google, Qwen, iFlow) to use buffered credential writes
- Change provider cache `_save_to_disk()` to return success status for better tracking
- Reduce log noise by changing missing thoughtSignature warnings to debug level
- Export `BufferedWriteRegistry` from utils module for monitoring access

The new architecture ensures data is never lost on graceful shutdown (Ctrl+C), with console output showing flush progress and results. All buffered writes are retried in a background thread and guaranteed a final save attempt on application exit.
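
A rough sketch of the registry idea described above, assuming each registered writer exposes a flush() method; this is illustrative only and not the actual BufferedWriteRegistry implementation:

```python
import atexit
import threading
import time

class BufferedWriteRegistry:
    """Hypothetical singleton that retries pending writes and flushes them on shutdown."""

    _instance = None
    _guard = threading.Lock()

    def __new__(cls):
        with cls._guard:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst._writers = []
                inst._lock = threading.Lock()
                atexit.register(inst.flush_all)  # guaranteed final save attempt on exit
                cls._instance = inst
        return cls._instance

    def register(self, writer) -> None:
        """Register any object exposing flush(), e.g. a buffered credential writer."""
        with self._lock:
            self._writers.append(writer)

    def start_retry_loop(self, interval: float = 30.0) -> None:
        """Retry pending writes periodically in a background daemon thread."""
        def _loop() -> None:
            while True:
                time.sleep(interval)
                self.flush_all()
        threading.Thread(target=_loop, daemon=True).start()

    def flush_all(self) -> None:
        with self._lock:
            writers = list(self._writers)
        for writer in writers:
            try:
                writer.flush()
            except Exception:
                pass  # best effort: the shutdown flush must never raise
```
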
…t variables

Change credential loading strategy across all auth providers to prefer file-based credentials when an explicit file path is provided, falling back to legacy environment variables only when the file is not found.

- Modified `_load_credentials()` in GoogleOAuthBase, IFlowAuthBase, and QwenAuthBase to attempt file loading first
- Environment variable fallback now only triggers on FileNotFoundError, improving error clarity
- Removed redundant exception handling in GoogleOAuthBase (duplicate catch blocks)
- Fixed potential deadlock in credential refresh queue by removing nested lock acquisition
- _refresh_token() already handles its own locking, so removed outer lock to prevent deadlock
- Improved logging to indicate when fallback to environment variables occurs
- Maintains backwards compatibility for existing deployments using environment variables

This change addresses two issues:
1. Ensures explicit file paths are respected as the primary credential source
2. Prevents deadlock scenario where refresh queue would acquire lock before calling _refresh_token() which also acquires the same lock
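
A minimal sketch of the described loading order (file first, environment variable only on FileNotFoundError); function and parameter names are hypothetical:

```python
import json
import logging
import os

logger = logging.getLogger(__name__)

def load_credentials(file_path: str | None, env_var: str) -> dict:
    """Prefer the explicit credential file; fall back to the legacy environment
    variable only when the file does not exist."""
    if file_path:
        try:
            with open(file_path, encoding="utf-8") as fh:
                return json.load(fh)
        except FileNotFoundError:
            logger.info("Credential file %s not found, falling back to %s", file_path, env_var)
    raw = os.environ.get(env_var)
    if raw is None:
        raise RuntimeError("No credentials available from file or environment variable")
    return json.loads(raw)
```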

@Mirrowel (Owner) commented Dec 8, 2025

This might need some usage testing. I had some weird bugs with credentials becoming unavailable, leaving one or even zero. It might be fixed already, but just to be sure...

@Mirrowel (Owner) commented Dec 8, 2025

Gonna merge into Dev branch instead of Main - usage testing will go there.

- Change antigravity cache logging from info to debug level to reduce noise
- Replace Gemini CLI's delegated error parsing with native implementation
- Add comprehensive duration parsing for multiple time formats (2s, 156h14m36s, 515092.73s)
- Extract retry timing from human-readable error messages instead of relying on structured metadata
- Improve error body extraction with multiple fallback strategies

The Gemini CLI provider now handles its own quota error parsing rather than delegating to Antigravity, since the two providers use fundamentally different error formats: Gemini embeds reset times in human-readable messages while Antigravity uses structured RetryInfo/quotaResetDelay metadata.
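
A small, hypothetical helper showing how durations in the formats mentioned above (2s, 156h14m36s, 515092.73s) could be parsed; this is not the provider's actual parsing code:

```python
import re

_DURATION_RE = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?")

def parse_duration_seconds(text: str) -> float | None:
    """Parse '2s', '156h14m36s', or '515092.73s' into seconds; None if unrecognized."""
    match = _DURATION_RE.fullmatch(text.strip())
    if not match or not any(match.groups()):
        return None
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)
```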

Mirrowel changed the base branch from main to dev on December 8, 2025 09:29
Mirrowel merged commit 772abcf into Mirrowel:dev on Dec 8, 2025 (2 of 3 checks passed)
MasuRii deleted the feat/runtime-resilience branch on January 14, 2026 13:50
b3nw pushed a commit to b3nw/LLM-API-Key-Proxy that referenced this pull request Jan 21, 2026
Address bot review feedback on PR Mirrowel#32:

- Add _disk_available flag update in _write_json exception handler
- Add _disk_available flag update in log_stream_chunk (critical for streams)
- Document intentional no-memory-fallback design for streams
- Add _fallback_mode update in failure_logger exception handler
- Add complete circuit breaker pattern to usage_manager
b3nw pushed a commit to b3nw/LLM-API-Key-Proxy that referenced this pull request Jan 21, 2026
feat: add runtime resilience for file deletion survival