
Conversation

@MasuRii (Contributor) commented Dec 6, 2025

Summary

This PR implements Runtime Resilience - a feature that ensures the LLM API Key Proxy continues functioning even if core files are deleted during runtime. The proxy will only reflect changes on restart, enabling developers to safely modify or delete code while the proxy is actively serving requests.

Motivation

When developing or maintaining the proxy, users may want to:

  • Modify source code while the proxy is running
  • Clean up log files without restarting
  • Reset usage statistics without interrupting service

Previously, deleting critical files (like logs/ or key_usage.json) could cause the running proxy to fail. This PR ensures graceful degradation in all such scenarios.

Changes

Core Resilience Patterns Applied

| Component | Pattern | Impact |
| --- | --- | --- |
| usage_manager.py | Try/except + directory auto-recreation + circuit breaker | Usage tracking continues in memory; prevents repeated failed writes |
| failure_logger.py | NullHandler fallback + handler recreation + _fallback_mode state update | Logging degrades gracefully; triggers handler recreation on next call |
| detailed_logger.py | _disk_available flag + in-memory fallback + flag update in exception handlers | Request logging continues in memory; circuit breaker trips on any write failure |
| google_oauth_base.py | Memory-first caching + cached token fallback | OAuth tokens survive file deletion |
| provider_cache.py | _disk_available health flag + error tracking | Caching degrades to memory-only |
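
As a rough illustration of the shared pattern in the table above (try/except around file writes, directory auto-recreation, and a circuit-breaker flag), here is a minimal sketch. The class and method names are hypothetical and simplified; this is not the actual usage_manager.py code.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

class ResilientUsageStore:
    """Hypothetical example: stats live in memory, disk persistence is best-effort."""

    def __init__(self, path: str = "key_usage.json") -> None:
        self.path = Path(path)
        self.usage: dict[str, int] = {}   # in-memory source of truth
        self._disk_available = True       # circuit breaker: skip disk after first failure

    def record(self, key: str) -> None:
        self.usage[key] = self.usage.get(key, 0) + 1
        self._save()

    def _save(self) -> None:
        if not self._disk_available:
            return  # breaker open: avoid hammering a broken filesystem
        try:
            self.path.parent.mkdir(parents=True, exist_ok=True)  # auto-recreate directory
            self.path.write_text(json.dumps(self.usage))
        except OSError as exc:
            self._disk_available = False  # trip the breaker; tracking continues in memory
            logger.warning("Usage persistence disabled after write failure: %s", exc)
```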

Follow-up Fixes (Commit 5e42536)

Based on code review feedback, the following improvements were added:

  • detailed_logger.py:

    • Added _disk_available = False in _write_json() exception handler for state consistency
    • Added _disk_available = False in log_stream_chunk() exception handler (critical for streaming - prevents hundreds of repeated failures per stream)
    • Added documentation comment explaining intentional no-memory-fallback design for streams (OOM prevention)
  • failure_logger.py:

    • Added _fallback_mode = True in log_failure() exception handler to trigger handler recreation
  • usage_manager.py:

    • Added _disk_available instance variable as circuit breaker
    • Added early return check to skip disk writes when unavailable
    • Added circuit breaker reset on successful file load
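
As a rough sketch of the log_stream_chunk fix described above (names simplified; not the actual detailed_logger.py code), the circuit breaker keeps a failing stream from retrying the write on every chunk, and chunks are deliberately dropped rather than buffered in memory:

```python
import logging

logger = logging.getLogger(__name__)

class StreamChunkLogger:
    """Hypothetical example mirroring the log_stream_chunk circuit breaker."""

    def __init__(self, log_file: str) -> None:
        self.log_file = log_file
        self._disk_available = True

    def log_stream_chunk(self, chunk: str) -> None:
        if not self._disk_available:
            return  # breaker already tripped: don't retry for every chunk in the stream
        try:
            with open(self.log_file, "a", encoding="utf-8") as fh:
                fh.write(chunk + "\n")
        except OSError as exc:
            # Intentionally no in-memory fallback here: buffering an unbounded stream
            # could exhaust memory (OOM prevention), so chunks are simply dropped.
            self._disk_available = False
            logger.warning("Stream chunk logging disabled for this request: %s", exc)
```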

Documentation

Added Section 5: Runtime Resilience to DOCUMENTATION.md covering:

  • Resilience hierarchy (4 levels from Core API to Logging)
  • "Develop While Running" workflow
  • Graceful degradation and data loss scenarios

Technical Details

Resilience Hierarchy

  1. Level 1 - Core API Handling: Python keeps imported modules in memory (sys.modules), so deleting .py files won't crash active requests.
  2. Level 2 - Credential Management: OAuth tokens cached in memory first.
  3. Level 3 - Usage Tracking: Stats maintained in memory; files auto-recreated on next save.
  4. Level 4 - Logging: Falls back to NullHandler or console if disk writes fail.
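
A minimal sketch of the Level 2 memory-first idea, assuming a simple token store; the real google_oauth_base.py logic is more involved:

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

class TokenStore:
    """Hypothetical example of memory-first credential caching."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self._cached_token: dict | None = None  # memory copy survives file deletion

    def save(self, token: dict) -> None:
        self._cached_token = token  # update memory first, then attempt persistence
        try:
            self.path.write_text(json.dumps(token))
        except OSError as exc:
            logger.warning("Token not persisted; keeping in-memory copy only: %s", exc)

    def load(self) -> dict | None:
        try:
            return json.loads(self.path.read_text())
        except (OSError, json.JSONDecodeError):
            return self._cached_token  # fall back to the cached token
```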

Key Design Decisions

  • Memory-First: All state updates go to memory first, then attempt disk persistence
  • Directory Auto-Recreation: Missing directories recreated before write operations
  • Fail Silently, Log Loudly: File errors logged as warnings but never crash the proxy
  • Circuit Breaker Pattern: After first disk failure, skip further disk attempts to prevent log flooding and performance degradation (especially critical for streaming responses with hundreds of chunks)
  • No API Changes: All modifications are internal; external interface unchanged
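
For the Level 4 logging fallback, a rough sketch of the NullHandler degradation and handler-recreation idea (module and function names are illustrative, not the actual failure_logger.py):

```python
import logging
from pathlib import Path

_logger = logging.getLogger("failure_log_sketch")
_logger.propagate = False
_fallback_mode = True  # start degraded so the first call builds the real handler

def _ensure_handler_valid(path: str = "logs/failures.log") -> None:
    """(Re)attach a file handler, degrading to NullHandler if the disk is unusable."""
    global _fallback_mode
    try:
        Path(path).parent.mkdir(parents=True, exist_ok=True)  # auto-recreate logs/
        handler: logging.Handler = logging.FileHandler(path)
        _fallback_mode = False
    except OSError:
        handler = logging.NullHandler()  # failure logging becomes a silent no-op
        _fallback_mode = True
    _logger.handlers = [handler]

def log_failure(message: str) -> None:
    if _fallback_mode or not _logger.handlers:
        _ensure_handler_valid()  # attempt recovery on every call while degraded
    _logger.error(message)
```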

Testing

  • ✅ All modified modules pass Python syntax validation (py_compile)
  • ✅ All imports verified working
  • ✅ Full package imports successful
  • ✅ Circuit breaker flags confirmed in all exception handlers

Manual Testing Suggested

```
# Start the proxy
python -m src.proxy_app.main

# While running, delete the logs folder (Windows)
rmdir /s /q logs

# Make an API request - should still work
# Logs folder will be auto-recreated

# Delete key_usage.json
del key_usage.json

# Make another request - usage tracking continues in memory
```

Files Changed

```
 DOCUMENTATION.md                                   |  31 +++++
 src/proxy_app/detailed_logger.py                   |  35 ++++-
 src/rotator_library/failure_logger.py              | 103 +++++++++++----
 src/rotator_library/providers/google_oauth_base.py | 141 +++++++++++++--------
 src/rotator_library/providers/provider_cache.py    |  37 +++++-
 src/rotator_library/usage_manager.py               |  56 ++++++-
 6 files changed, 311 insertions(+), 92 deletions(-)
```

Commits

  1. 31c3d36 - feat: add runtime resilience for file deletion survival

    • Initial implementation of resilience patterns across all components
  2. 5e42536 - fix(resilience): complete circuit breaker patterns per PR review

    • Address review feedback: complete circuit breaker implementation in all exception handlers
    • Add documentation for intentional design decisions
    • Ensure state consistency across all failure scenarios

Checklist

  • Code follows project conventions
  • All imports verified working
  • Documentation updated
  • No breaking changes
  • Backwards compatible
  • Circuit breaker patterns complete in all exception handlers
  • All review feedback addressed

Important

Introduces runtime resilience to ensure proxy functionality during file deletions by implementing in-memory fallbacks and directory auto-recreation across components.

  • Behavior:
    • Implements runtime resilience to ensure proxy continues functioning if critical files are deleted during runtime.
    • In-memory fallbacks and directory auto-recreation applied across components.
  • Core Resilience Patterns:
    • usage_manager.py: Try/except for file operations, directory auto-recreation, usage tracking continues in memory.
    • failure_logger.py: NullHandler fallback, handler recreation if file logging fails.
    • detailed_logger.py: _disk_available flag, in-memory logging fallback.
    • google_oauth_base.py: Memory-first caching, cached token fallback if file operations fail.
    • provider_cache.py: _disk_available health flag, error tracking, caching degrades to memory-only.
  • Documentation:
    • Adds "Runtime Resilience" section to DOCUMENTATION.md detailing resilience hierarchy and scenarios.
  • Misc:
    • No changes to external API; all modifications are internal.

This description was created by Ellipsis for 5e42536.

Implement graceful degradation patterns that allow the proxy to continue
running even if core files are deleted during runtime. Changes only take
effect on restart, enabling safe development while the proxy is serving.

## Changes by Component

### Usage Manager (usage_manager.py)
- Wrap `_save_usage()` in try/except with directory auto-recreation
- Enhanced `_load_usage()` with explicit error handling
- In-memory state continues working if file operations fail

### Failure Logger (failure_logger.py)
- Add module-level `_file_handler` and `_fallback_mode` state
- Create `_create_file_handler()` with directory auto-recreation
- Create `_ensure_handler_valid()` for handler recovery
- Use NullHandler as fallback when file logging fails

### Detailed Logger (detailed_logger.py)
- Add class-level `_disk_available` and `_console_fallback_warned` flags
- Add instance-level `_in_memory_logs` list for fallback storage
- Skip disk writes gracefully when filesystem unavailable

### Google OAuth Base (google_oauth_base.py)
- Update memory cache FIRST before disk write (memory-first pattern)
- Use cached tokens as fallback when refresh/save fails
- Log warnings but don't crash on persistence failures

### Provider Cache (provider_cache.py)
- Add `_disk_available` health flag and `disk_errors` counter
- Track disk health status in get_stats()
- Gracefully degrade to memory-only caching on disk failures

### Documentation (DOCUMENTATION.md)
- Add Section 5: Runtime Resilience with resilience hierarchy
- Document "Develop While Running" workflow
- Explain graceful degradation and data loss scenarios

@mirrobot-agent (Contributor) bot commented Dec 6, 2025

Starting my review of the runtime resilience implementation. I'll be examining how the fallback patterns work across the usage manager, failure logger, detailed logger, OAuth base, and provider cache modules. Diving into the disk-to-memory degradation logic now.

@mirrobot-agent (Contributor) bot left a comment

Overall Assessment

This is a well-designed implementation of runtime resilience patterns. The "memory-first, fail silently, log loudly" philosophy is consistently applied across all modified modules, and the documentation accurately reflects the implementation. The PR achieves its goal of allowing the proxy to survive filesystem disruptions during runtime.

Architectural Feedback

The resilience hierarchy is sound:

  • google_oauth_base.py: The memory-first caching in _save_credentials and the multi-level fallback in get_auth_header are particularly well-implemented
  • provider_cache.py: Excellent observability with disk_errors tracking and disk_available health flag exposed via get_stats()
  • usage_manager.py: Good race condition handling in _load_usage with specific exception handlers for different failure modes
  • failure_logger.py: The handler recreation pattern with _ensure_handler_valid() is robust

Key Suggestions

Two minor improvements suggested for detailed_logger.py:

  1. State consistency: The _disk_available flag should be updated in _write_json on both success (recovery) and failure, matching the pattern used in provider_cache.py
  2. Design documentation: Consider adding a comment explaining why stream chunks don't use in-memory fallback (likely intentional for memory management)

Nitpicks and Minor Points

  • The documentation in DOCUMENTATION.md is clear and helpful, accurately describing the resilience levels
  • The use of [RUNTIME RESILIENCE] markers in docstrings is a nice touch for code navigation

Questions for the Author

  • Is the asymmetry between _write_json (stores in memory on failure) and log_stream_chunk (drops on failure) intentional? If so, a brief comment would help future maintainers understand this design choice.

This review was generated by an AI assistant.

@MasuRii (Contributor, Author) commented Dec 6, 2025

There's still an issue with this PR: somehow there's a case where the proxy doesn't receive the API request from the client. This leaves the client stuck waiting while the proxy never got the request, so both are stuck.

@Mirrowel (Owner) commented Dec 6, 2025

> There's still an issue with this PR: somehow there's a case where the proxy doesn't receive the API request from the client. This leaves the client stuck waiting while the proxy never got the request, so both are stuck.

Yeah, it seems the changes are quite destructive for a lot of flows.

If there is a problem, I'm converting this to a draft. Mark it ready for review when it is problem-free.

Mirrowel marked this pull request as draft on December 6, 2025 19:01

@MasuRii (Contributor, Author) commented Dec 6, 2025

> > There's still an issue with this PR: somehow there's a case where the proxy doesn't receive the API request from the client. This leaves the client stuck waiting while the proxy never got the request, so both are stuck.
>
> Yeah, it seems the changes are quite destructive for a lot of flows.
>
> If there is a problem, I'm converting this to a draft. Mark it ready for review when it is problem-free.

I just reset to main, and somehow I'm still experiencing the issue. Is it related to caching or memory? Using Antigravity Gemini 3 and Opus 4.5.

@Mirrowel (Owner) commented Dec 6, 2025

I've had no cases of the client sending a request that the proxy never received. Like, ever. @MasuRii
The bot doing the review above is running on the latest main commit.

Address bot review feedback on PR Mirrowel#32:

- Add _disk_available flag update in _write_json exception handler
- Add _disk_available flag update in log_stream_chunk (critical for streams)
- Document intentional no-memory-fallback design for streams
- Add _fallback_mode update in failure_logger exception handler
- Add complete circuit breaker pattern to usage_manager

@mirrobot-agent (Contributor) bot left a comment

Follow-up Review

Thank you for addressing all the feedback from the initial review.

Assessment of New Changes:

All previously raised concerns have been properly addressed:

detailed_logger.py:

  • Circuit breaker flag (_disk_available = False) now set in _write_json() exception handler for state consistency
  • Circuit breaker flag also added to log_stream_chunk() exception handler - this is particularly valuable for streaming responses where hundreds of chunks could trigger repeated failures
  • Added clear documentation explaining the intentional design decision to skip memory fallback for streams (OOM prevention)

failure_logger.py:

  • _fallback_mode = True now correctly set in the log_failure() exception handler to trigger handler recreation on subsequent calls

usage_manager.py:

  • Instance-level _disk_available circuit breaker properly implemented
  • Early return check added to _save_usage() to skip disk writes when unavailable
  • Circuit breaker reset logic in _load_usage() re-enables disk writes after successful file load

Overall Status:

The circuit breaker patterns are now consistently applied across all three components. The implementation follows the "fail silently, log loudly" philosophy and ensures graceful degradation without log flooding or performance issues.

The PR is ready for merge from a code quality perspective.

This review was generated by an AI assistant.

…usage_manager.py by integrating runtime resilience with human-readable timestamps
Enhances application reliability by introducing a comprehensive I/O abstraction layer that eliminates crashes from filesystem issues. The system distinguishes between critical state files (credentials, usage data) that require memory buffering with retry logic, and disposable logs that can be safely dropped on failure.

Key improvements:

- New `ResilientStateWriter` class maintains in-memory state for critical files with background retry mechanism on disk failure
- Introduced `safe_write_json`, `safe_log_write`, and `safe_mkdir` utility functions for one-shot operations with graceful degradation
- Logging subsystems (`DetailedLogger`, `failure_logger`) now drop data on disk failure to prevent memory exhaustion during streaming
- Authentication providers (`GoogleOAuthBase`, `IFlowAuthBase`, `QwenAuthBase`) preserve credentials in memory when filesystem becomes unavailable
- `UsageManager` delegates persistence to `ResilientStateWriter` for automatic recovery from transient failures
- `ProviderCache` disk operations now fail silently while maintaining in-memory functionality
- Replaced scattered tempfile/atomic write patterns with centralized implementation featuring consistent error handling
- All directory creation operations now proceed gracefully if parent paths are inaccessible
- Thread-safe writer implementation supports concurrent usage from async contexts

BREAKING CHANGE: `ProviderCache._save_to_disk()` no longer raises exceptions on filesystem errors. Consumers relying on exception handling for disk write failures must now check the `disk_available` field in `get_stats()` return value for monitoring disk health.
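
For illustration only, a minimal sketch of what a writer with the behavior described above might look like (in-memory state plus background retry); it is not the repository's actual ResilientStateWriter:

```python
import json
import logging
import threading
from pathlib import Path

logger = logging.getLogger(__name__)

class ResilientStateWriter:
    """Hypothetical example: the latest state always lives in memory and the
    disk write is retried in the background until it succeeds."""

    def __init__(self, path: str, retry_interval: float = 30.0) -> None:
        self._path = Path(path)
        self._retry_interval = retry_interval
        self._pending = None
        self._retry_scheduled = False
        self._lock = threading.Lock()  # safe to call from multiple threads or async tasks

    def write(self, state: dict) -> None:
        with self._lock:
            self._pending = state  # memory copy is the source of truth
        self._try_flush()

    def _try_flush(self) -> None:
        if self.flush():
            return
        with self._lock:
            if self._retry_scheduled:
                return  # a retry is already pending; don't stack timers
            self._retry_scheduled = True
        timer = threading.Timer(self._retry_interval, self._retry)
        timer.daemon = True
        timer.start()

    def _retry(self) -> None:
        with self._lock:
            self._retry_scheduled = False
        self._try_flush()

    def flush(self) -> bool:
        with self._lock:
            if self._pending is None:
                return True
            try:
                self._path.parent.mkdir(parents=True, exist_ok=True)
                tmp = self._path.with_suffix(".tmp")
                tmp.write_text(json.dumps(self._pending))
                tmp.replace(self._path)  # write to a temp file, then atomically replace
                self._pending = None
                return True
            except OSError as exc:
                logger.warning("State write failed, will retry in background: %s", exc)
                return False
```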

@Mirrowel (Owner) commented Dec 8, 2025

Okay, did a huge rework. It's the same idea, but more centralized, and applied to all writes, not just some.

This commit introduces a global buffered write registry with automatic shutdown flush, ensuring critical data (auth tokens, usage stats) is saved even when disk writes fail temporarily.

- Add `BufferedWriteRegistry` singleton for centralized buffered write management
- Implement periodic retry (30s interval) and atexit shutdown flush for pending writes
- Enable `buffer_on_failure` parameter in `safe_write_json()` for credential files
- Integrate buffering with `ResilientStateWriter` for automatic registry registration
- Update OAuth providers (Google, Qwen, iFlow) to use buffered credential writes
- Change provider cache `_save_to_disk()` to return success status for better tracking
- Reduce log noise by changing missing thoughtSignature warnings to debug level
- Export `BufferedWriteRegistry` from utils module for monitoring access

The new architecture ensures data is never lost on graceful shutdown (Ctrl+C), with console output showing flush progress and results. All buffered writes are retried in a background thread and guaranteed a final save attempt on application exit.
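
A rough sketch of the registry idea described above, assuming each registered writer exposes a flush() method; this is illustrative only and not the actual BufferedWriteRegistry implementation:

```python
import atexit
import threading
import time

class BufferedWriteRegistry:
    """Hypothetical singleton that retries pending writes and flushes them on shutdown."""

    _instance = None
    _guard = threading.Lock()

    def __new__(cls):
        with cls._guard:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst._writers = []
                inst._lock = threading.Lock()
                atexit.register(inst.flush_all)  # guaranteed final save attempt on exit
                cls._instance = inst
        return cls._instance

    def register(self, writer) -> None:
        """Register any object exposing flush(), e.g. a buffered credential writer."""
        with self._lock:
            self._writers.append(writer)

    def start_retry_loop(self, interval: float = 30.0) -> None:
        """Retry pending writes periodically in a background daemon thread."""
        def _loop() -> None:
            while True:
                time.sleep(interval)
                self.flush_all()
        threading.Thread(target=_loop, daemon=True).start()

    def flush_all(self) -> None:
        with self._lock:
            writers = list(self._writers)
        for writer in writers:
            try:
                writer.flush()
            except Exception:
                pass  # best effort: the shutdown flush must never raise
```
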
…t variables

Change credential loading strategy across all auth providers to prefer file-based credentials when an explicit file path is provided, falling back to legacy environment variables only when the file is not found.

- Modified `_load_credentials()` in GoogleOAuthBase, IFlowAuthBase, and QwenAuthBase to attempt file loading first
- Environment variable fallback now only triggers on FileNotFoundError, improving error clarity
- Removed redundant exception handling in GoogleOAuthBase (duplicate catch blocks)
- Fixed potential deadlock in credential refresh queue by removing nested lock acquisition
- _refresh_token() already handles its own locking, so removed outer lock to prevent deadlock
- Improved logging to indicate when fallback to environment variables occurs
- Maintains backwards compatibility for existing deployments using environment variables

This change addresses two issues:
1. Ensures explicit file paths are respected as the primary credential source
2. Prevents deadlock scenario where refresh queue would acquire lock before calling _refresh_token() which also acquires the same lock
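
A minimal sketch of the described loading order (file first, environment variable only on FileNotFoundError); function and parameter names are hypothetical:

```python
import json
import logging
import os

logger = logging.getLogger(__name__)

def load_credentials(file_path: str | None, env_var: str) -> dict:
    """Prefer the explicit credential file; fall back to the legacy environment
    variable only when the file does not exist."""
    if file_path:
        try:
            with open(file_path, encoding="utf-8") as fh:
                return json.load(fh)
        except FileNotFoundError:
            logger.info("Credential file %s not found, falling back to %s", file_path, env_var)
    raw = os.environ.get(env_var)
    if raw is None:
        raise RuntimeError("No credentials available from file or environment variable")
    return json.loads(raw)
```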

@Mirrowel (Owner) commented Dec 8, 2025

This might need some usage testing. I had some weird bugs with credentials becoming unavailable, leaving one or even zero. It might be fixed already, but just to be sure...

@Mirrowel (Owner) commented Dec 8, 2025

Gonna merge into Dev branch instead of Main - usage testing will go there.

- Change antigravity cache logging from info to debug level to reduce noise
- Replace Gemini CLI's delegated error parsing with native implementation
- Add comprehensive duration parsing for multiple time formats (2s, 156h14m36s, 515092.73s)
- Extract retry timing from human-readable error messages instead of relying on structured metadata
- Improve error body extraction with multiple fallback strategies

The Gemini CLI provider now handles its own quota error parsing rather than delegating to Antigravity, since the two providers use fundamentally different error formats: Gemini embeds reset times in human-readable messages while Antigravity uses structured RetryInfo/quotaResetDelay metadata.
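
A small, hypothetical helper showing how durations in the formats mentioned above (2s, 156h14m36s, 515092.73s) could be parsed; this is not the provider's actual parsing code:

```python
import re

_DURATION_RE = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?")

def parse_duration_seconds(text: str) -> float | None:
    """Parse '2s', '156h14m36s', or '515092.73s' into seconds; None if unrecognized."""
    match = _DURATION_RE.fullmatch(text.strip())
    if not match or not any(match.groups()):
        return None
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)
```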

Mirrowel changed the base branch from main to dev on December 8, 2025 09:29
Mirrowel merged commit 772abcf into Mirrowel:dev on Dec 8, 2025 (2 of 3 checks passed)
MasuRii deleted the feat/runtime-resilience branch on January 14, 2026 13:50
b3nw pushed a commit to b3nw/LLM-API-Key-Proxy that referenced this pull request Jan 21, 2026
Address bot review feedback on PR Mirrowel#32:

- Add _disk_available flag update in _write_json exception handler
- Add _disk_available flag update in log_stream_chunk (critical for streams)
- Document intentional no-memory-fallback design for streams
- Add _fallback_mode update in failure_logger exception handler
- Add complete circuit breaker pattern to usage_manager
b3nw pushed a commit to b3nw/LLM-API-Key-Proxy that referenced this pull request Jan 21, 2026
feat: add runtime resilience for file deletion survival