Add artifact persistence and recovery system to kernel runtime #299
Merged
Edwardvaneechoud merged 9 commits into feauture/kernel-implementation on Feb 3, 2026
Artifacts (trained models, encoders, computed objects) are now automatically
persisted to disk using cloudpickle when published via flowfile.publish_artifact().
On kernel restart, artifacts can be recovered transparently — either lazily
(loaded on first access, the default) or eagerly (all pre-loaded at startup).
Key changes:
Kernel Runtime:
- Add ArtifactPersistence layer using cloudpickle with SHA-256 checksums
- Extend ArtifactStore with persistence integration (auto-save on publish,
  lazy-load on get, auto-delete on clear)
- Add /recover, /recovery-status, /cleanup, /persistence API endpoints
- Switch from deprecated on_event to lifespan context manager
- Configure persistence via PERSISTENCE_ENABLED, PERSISTENCE_PATH,
  KERNEL_ID, and RECOVERY_MODE environment variables
- Add cloudpickle>=3.0.0 dependency
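For readers skimming the thread, the checksum idea in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's ArtifactPersistence class: the function names `save_artifact` and `load_artifact`, the `.pkl`/`.json` file layout, and the fallback to the standard-library `pickle` when cloudpickle is not installed are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path

try:
    import cloudpickle as pickler  # the serializer named in this PR
except ImportError:  # assumption: fall back so the sketch runs without it
    import pickle as pickler


def save_artifact(base_dir: Path, name: str, obj: object) -> str:
    """Write obj to <base_dir>/<name>.pkl and store its SHA-256 next to it."""
    base_dir.mkdir(parents=True, exist_ok=True)
    data = pickler.dumps(obj)
    checksum = hashlib.sha256(data).hexdigest()
    (base_dir / f"{name}.pkl").write_bytes(data)
    meta = {"name": name, "sha256": checksum}
    (base_dir / f"{name}.json").write_text(json.dumps(meta))
    return checksum


def load_artifact(base_dir: Path, name: str) -> object:
    """Read the artifact back, refusing to unpickle if the checksum differs."""
    meta = json.loads((base_dir / f"{name}.json").read_text())
    data = (base_dir / f"{name}.pkl").read_bytes()
    if hashlib.sha256(data).hexdigest() != meta["sha256"]:
        raise ValueError(f"checksum mismatch for artifact {name!r}")
    return pickler.loads(data)


if __name__ == "__main__":
    demo_dir = Path(tempfile.mkdtemp())
    save_artifact(demo_dir, "encoder", {"classes": ["a", "b"]})
    print(load_artifact(demo_dir, "encoder"))
```

Verifying the digest before unpickling means a truncated or corrupted file is rejected rather than loaded.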
FlowFile Core:
- Add RecoveryMode, RecoveryStatus, CleanupRequest, CleanupResult,
  ArtifactPersistenceInfo models
- Add KernelManager proxy methods for recover, recovery-status,
  cleanup, and persistence endpoints
- Add core API routes: POST /{id}/recover, GET /{id}/recovery-status,
  POST /{id}/cleanup, GET /{id}/persistence
- Pass persistence env vars (KERNEL_ID, PERSISTENCE_ENABLED, etc.)
  when starting kernel Docker containers
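The env var hand-off in the last bullet might look something like the sketch below. The helper name `build_persistence_env`, the `"true"`/`"false"` string encoding, and the default path `/shared/artifacts` are hypothetical; only the four variable names and the lazy default come from this PR.

```python
from typing import Dict


def build_persistence_env(
    kernel_id: str,
    persistence_enabled: bool = True,
    recovery_mode: str = "lazy",
    persistence_path: str = "/shared/artifacts",  # assumed default, not from the PR
) -> Dict[str, str]:
    """Build the environment mapping handed to a kernel container."""
    return {
        "KERNEL_ID": kernel_id,
        # Environment variables are strings, so the flag is encoded as text.
        "PERSISTENCE_ENABLED": "true" if persistence_enabled else "false",
        "PERSISTENCE_PATH": persistence_path,
        "RECOVERY_MODE": recovery_mode,
    }


if __name__ == "__main__":
    print(build_persistence_env("kernel-1", recovery_mode="eager"))
```

A mapping like this could then be passed as the `environment` argument when the container is started through the Docker SDK for Python.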
Tests: 138 passing (27 new persistence tests + 111 existing)
https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
- Unit-level tests (20 tests): model validation, KernelManager proxy methods with mocked HTTP, and Docker env var injection verification
- Docker-based integration tests (17 tests, @pytest.mark.kernel): persistence basics, health/recovery endpoints, manual recovery, artifact cleanup, manager proxy methods, and re-execution durability
https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
…istence-recovery-5sl5m
Resolved conflict in artifact_store.py by keeping both documentation sections:
- Persistence backend integration description
- Tech debt note about memory pressure for large artifacts
https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
The python_code.svg icon was added but the node definition was still referencing polars_code.png. https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
- Move env var reading from module import time to _setup_persistence() call time, allowing tests to set PERSISTENCE_PATH before TestClient triggers the lifespan handler
- Update conftest.py client fixture to set PERSISTENCE_PATH to tmp_path
- Reset all persistence-related module globals between tests
- Tests now actually write to disk (temp dir) rather than being skipped
https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
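The import-time versus call-time distinction in this commit can be illustrated with a small sketch. `PersistenceSettings`, `read_persistence_settings`, and every default except the lazy recovery mode are assumptions for the example; the real `_setup_persistence()` may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PersistenceSettings:
    enabled: bool
    path: str
    kernel_id: str
    recovery_mode: str


def read_persistence_settings(
    environ: Optional[Mapping[str, str]] = None,
) -> PersistenceSettings:
    """Read settings when called, so values set after import are still seen."""
    env = os.environ if environ is None else environ
    return PersistenceSettings(
        enabled=env.get("PERSISTENCE_ENABLED", "true").lower() in {"1", "true", "yes"},
        path=env.get("PERSISTENCE_PATH", "/shared/artifacts"),  # assumed default
        kernel_id=env.get("KERNEL_ID", "default"),  # assumed default
        recovery_mode=env.get("RECOVERY_MODE", "lazy").lower(),
    )


if __name__ == "__main__":
    # A test can set the variable after import and still have it picked up.
    os.environ["PERSISTENCE_PATH"] = "/tmp/kernel-artifacts"
    print(read_persistence_settings())
```

Had the values been captured in module-level constants, a fixture setting PERSISTENCE_PATH after import would have had no effect, which is the situation this commit addresses.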
Force-pushed from 73f3307 to a8db73a
…rsistence-recovery-5sl5m
Critical fixes:
- Fix race condition in publish(): track persist_pending state and update persisted flag only after successful disk write
- Fix lazy load lock contention: use per-key locks and two-phase loading to avoid holding global lock during slow I/O operations
- Initialize to_remove/lazy_remove before lock block in clear_by_node_ids()
Design improvements:
- Explicit metadata field selection in persistence.save() using whitelist approach (_PERSISTABLE_FIELDS) instead of filtering out 'object'
- Add ArtifactIdentifier model for type-safe cleanup requests
- Rename RECOVERY_MODE=none to CLEAR with deprecation warning for 'none'
- Strip leading dots in _safe_dirname() to prevent hidden directories
https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
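The per-key locking and two-phase loading mentioned under the critical fixes can be sketched as below. This is a generic version of the pattern under stated assumptions: the class name `LazyStore`, its `loader` callback, and its attributes are hypothetical and do not mirror the actual ArtifactStore code.

```python
import threading
from typing import Any, Callable, Dict


class LazyStore:
    """Loads values on first access without holding the global lock during I/O."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader            # slow function, e.g. reads from disk
        self._lock = threading.Lock()    # guards the two dicts only
        self._loaded: Dict[str, Any] = {}
        self._key_locks: Dict[str, threading.Lock] = {}

    def get(self, name: str) -> Any:
        # Phase 1: short critical section to check the cache and pick a key lock.
        with self._lock:
            if name in self._loaded:
                return self._loaded[name]
            key_lock = self._key_locks.setdefault(name, threading.Lock())

        # Phase 2: slow load under the per-key lock, so other keys are not blocked.
        with key_lock:
            with self._lock:
                if name in self._loaded:  # another thread loaded it meanwhile
                    return self._loaded[name]
            obj = self._loader(name)
            with self._lock:
                self._loaded[name] = obj
            return obj


if __name__ == "__main__":
    store = LazyStore(lambda name: f"loaded:{name}")
    print(store.get("model"))
```

Concurrent callers asking for the same key wait on that key's lock and then reuse the cached object, while callers asking for other keys proceed independently.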
- Add persistence_enabled and recovery_mode fields to KernelConfig and KernelInfo
- Move RecoveryMode enum before KernelConfig to allow reference
- Update manager.py to use kernel's persistence settings (not hardcoded)
- Fix sync version of start_kernel which was missing persistence env vars
- Update create_kernel to copy persistence settings from config to info
- Fix misleading error message in get() when lazy load fails (now says "exists on disk but failed to load" instead of "not found")
- Add test for per-kernel persistence configuration
https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
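Since RecoveryMode now feeds KernelConfig, and the earlier commit renamed `none` to CLEAR with a deprecation warning, a parsing helper along these lines is one way to picture it. The lowercase string values, the `parse_recovery_mode` name, and the warning text are assumptions; only the member names and the deprecation of `none` come from this thread.

```python
import warnings
from enum import Enum


class RecoveryMode(str, Enum):
    LAZY = "lazy"
    EAGER = "eager"
    CLEAR = "clear"  # assumed value string for the renamed mode


def parse_recovery_mode(raw: str) -> RecoveryMode:
    """Map a RECOVERY_MODE string to the enum, accepting deprecated 'none'."""
    value = raw.strip().lower()
    if value == "none":
        warnings.warn(
            "RECOVERY_MODE=none is deprecated; use 'clear' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return RecoveryMode.CLEAR
    return RecoveryMode(value)  # raises ValueError for unknown modes


if __name__ == "__main__":
    print(parse_recovery_mode("eager"))
```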
Artifacts older than 24 hours are automatically removed when the kernel container starts. This ensures persistence serves as a recovery mechanism rather than permanent storage. Configurable via PERSISTENCE_CLEANUP_HOURS env var (default: 24, 0 to disable). https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
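A rough sketch of the startup cleanup described above, assuming artifact age is judged by file modification time and that artifacts are stored as `*.pkl` files in one directory. Both of those, and the function names, are assumptions for illustration; only PERSISTENCE_CLEANUP_HOURS, its default of 24, and the meaning of 0 come from the commit message.

```python
import os
import time
from pathlib import Path
from typing import List, Mapping, Optional


def cleanup_hours_from_env(environ: Optional[Mapping[str, str]] = None) -> float:
    """Read PERSISTENCE_CLEANUP_HOURS (default 24; 0 disables cleanup)."""
    env = os.environ if environ is None else environ
    return float(env.get("PERSISTENCE_CLEANUP_HOURS", "24"))


def cleanup_old_artifacts(
    base_dir: Path, max_age_hours: float, now: Optional[float] = None
) -> List[str]:
    """Delete *.pkl files older than max_age_hours and return their names."""
    if max_age_hours <= 0:  # 0 means cleanup is disabled
        return []
    current = time.time() if now is None else now
    cutoff = current - max_age_hours * 3600
    removed = []
    for path in sorted(base_dir.glob("*.pkl")):
        if path.stat().st_mtime < cutoff:  # assumption: age = file mtime
            path.unlink()
            removed.append(path.name)
    return removed


if __name__ == "__main__":
    import tempfile

    demo_dir = Path(tempfile.mkdtemp())
    (demo_dir / "fresh.pkl").write_bytes(b"x")
    print(cleanup_old_artifacts(demo_dir, cleanup_hours_from_env()))
```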
Summary
This PR adds an artifact persistence and recovery system to the kernel runtime: published artifacts are stored on disk automatically and can be recovered lazily, eagerly, or manually after a kernel restart.
Key Changes
Core Models (kernel/models.py)
- RecoveryMode enum (lazy, eager, none) to control recovery behavior
- RecoveryStatus model to track recovery progress and state
- CleanupRequest and CleanupResult models for artifact cleanup operations
- ArtifactPersistenceInfo model to expose persistence configuration and statistics
KernelManager Enhancements (kernel/manager.py)
- start_kernel() to inject persistence configuration into Docker containers:
  - KERNEL_ID: Kernel identifier
  - PERSISTENCE_ENABLED: Enable/disable persistence
  - PERSISTENCE_PATH: Disk location for persisted artifacts
  - RECOVERY_MODE: Recovery strategy (lazy by default)
- recover_artifacts(): Trigger manual recovery from persisted storage
- get_recovery_status(): Query current recovery state
- cleanup_artifacts(): Remove old persisted artifacts by age or name
- get_persistence_info(): Retrieve persistence stats and configuration
API Routes (kernel/routes.py)
- POST /{kernel_id}/recover: Manually trigger artifact recovery
- GET /{kernel_id}/recovery-status: Check recovery progress
- POST /{kernel_id}/cleanup: Clean up old artifacts
- GET /{kernel_id}/persistence: Get persistence info and stats
Module Exports (kernel/__init__.py)
Docker Image (kernel_runtime/Dockerfile)
- cloudpickle>=3.0.0 dependency for artifact serialization
Testing
test_artifact_persistence_integration.py):
test_kernel_persistence_integration.py):
Implementation Details
Testing Coverage
@pytest.mark.kernel)