
Add artifact persistence and recovery system to kernel runtime #299

Merged
Edwardvaneechoud merged 9 commits into feauture/kernel-implementation from claude/artifact-persistence-recovery-5sl5m
Feb 3, 2026


@Edwardvaneechoud Edwardvaneechoud commented Feb 3, 2026

Summary

This PR adds an artifact persistence and recovery system to the kernel runtime: published artifacts are stored on disk automatically and can be recovered either manually or lazily.

Key Changes

Core Models (kernel/models.py)

  • Added RecoveryMode enum (lazy, eager, none) to control recovery behavior
  • Added RecoveryStatus model to track recovery progress and state
  • Added CleanupRequest and CleanupResult models for artifact cleanup operations
  • Added ArtifactPersistenceInfo model to expose persistence configuration and statistics
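
As a rough illustration of the models listed above, the sketch below shows one way a recovery mode enum and a status record could be shaped. It uses standard-library dataclasses so it runs anywhere; the actual classes in kernel/models.py are not shown in this PR description, and every field name on `RecoveryStatus` here is a hypothetical placeholder.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum


class RecoveryMode(str, Enum):
    """The three modes named in the PR description."""
    LAZY = "lazy"
    EAGER = "eager"
    NONE = "none"


@dataclass
class RecoveryStatus:
    # All field names are hypothetical, chosen only for illustration.
    mode: RecoveryMode = RecoveryMode.LAZY
    in_progress: bool = False
    recovered: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


status = RecoveryStatus(recovered=["encoder"])
payload = asdict(status)  # plain dict, ready for JSON-style serialization
```

Subclassing `str` keeps the enum members comparable to the raw strings that arrive through environment variables or JSON bodies.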

KernelManager Enhancements (kernel/manager.py)

  • Persistence environment variables: Modified start_kernel() to inject persistence configuration into Docker containers:
    • KERNEL_ID: Kernel identifier
    • PERSISTENCE_ENABLED: Enable/disable persistence
    • PERSISTENCE_PATH: Disk location for persisted artifacts
    • RECOVERY_MODE: Recovery strategy (lazy by default)
  • New proxy methods:
    • recover_artifacts(): Trigger manual recovery from persisted storage
    • get_recovery_status(): Query current recovery state
    • cleanup_artifacts(): Remove old persisted artifacts by age or name
    • get_persistence_info(): Retrieve persistence stats and configuration
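
The four environment variables listed above could be assembled along these lines before being handed to the container. This is a minimal sketch, not the code in kernel/manager.py: the helper name `build_persistence_env`, the "true"/"false" encoding, and the default path are assumptions.

```python
def build_persistence_env(kernel_id, enabled=True, path="/persistence", recovery_mode="lazy"):
    """Return the persistence settings as a string-to-string mapping.

    Hypothetical helper: the value encodings and the default path are
    illustrative assumptions, only the variable names come from the PR.
    """
    return {
        "KERNEL_ID": kernel_id,
        "PERSISTENCE_ENABLED": "true" if enabled else "false",
        "PERSISTENCE_PATH": path,
        "RECOVERY_MODE": recovery_mode,
    }


env = build_persistence_env("kernel-1")
```

Container environment values are always strings, so the boolean flag has to be encoded explicitly rather than passed through as `True`.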

API Routes (kernel/routes.py)

  • Added four new endpoints:
    • POST /{kernel_id}/recover: Manually trigger artifact recovery
    • GET /{kernel_id}/recovery-status: Check recovery progress
    • POST /{kernel_id}/cleanup: Clean up old artifacts
    • GET /{kernel_id}/persistence: Get persistence info and stats
  • All endpoints include authorization checks and proper error handling

Module Exports (kernel/__init__.py)

  • Exported new models and types for public API consumption

Docker Image (kernel_runtime/Dockerfile)

  • Added cloudpickle>=3.0.0 dependency for artifact serialization

Testing

  • Unit tests (test_artifact_persistence_integration.py):
    • Model validation and serialization
    • KernelManager proxy methods with mocked HTTP responses
    • Docker environment variable injection verification
  • Integration tests (test_kernel_persistence_integration.py):
    • End-to-end persistence lifecycle (publish → persist → recover)
    • Artifact accessibility across node executions
    • Cleanup operations by age and name
    • Recovery status tracking
    • Persistence metadata in API responses

Implementation Details

  • Lazy recovery: Artifacts are persisted automatically on publish; recovery loads them on-demand
  • State validation: All persistence operations verify that the kernel is in the IDLE or EXECUTING state
  • HTTP timeouts: Set per operation (120s for recovery, 60s for cleanup, 30s for status)
  • Error handling: RuntimeError for invalid kernel state, HTTPException for API errors
  • Backward compatible: Persistence is opt-in via environment variables; existing code unaffected
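
The state check and per-operation timeouts described above can be sketched as a single guard that runs before any proxied request. The names `KernelState`, `TIMEOUTS`, and `prepare_call` are hypothetical, as is the `STOPPED` state; only IDLE, EXECUTING, the three timeout values, and the use of RuntimeError come from the description.

```python
from enum import Enum


class KernelState(Enum):
    STOPPED = "stopped"      # hypothetical extra state, for illustration
    IDLE = "idle"
    EXECUTING = "executing"


# Timeouts in seconds, as listed in the PR description.
TIMEOUTS = {"recover": 120.0, "cleanup": 60.0, "recovery-status": 30.0}


def prepare_call(state, operation):
    """Validate kernel state and return the timeout for the operation."""
    if state not in (KernelState.IDLE, KernelState.EXECUTING):
        raise RuntimeError(f"kernel is {state.value}; expected idle or executing")
    return TIMEOUTS[operation]


timeout = prepare_call(KernelState.IDLE, "recover")
```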

Testing Coverage

  • Model instantiation and serialization
  • KernelManager proxy methods with mocked responses
  • Docker environment variable injection
  • Full Docker-based integration tests (marked with @pytest.mark.kernel)
  • Artifact persistence through node re-execution cycles

Artifacts (trained models, encoders, computed objects) are now automatically
persisted to disk using cloudpickle when published via flowfile.publish_artifact().
On kernel restart, artifacts can be recovered transparently — either lazily
(loaded on first access, the default) or eagerly (all pre-loaded at startup).
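
To make the lazy versus eager distinction concrete, here is a small self-contained sketch. It is not the ArtifactStore from this PR: the class name `RecoverableStore` and its file layout are hypothetical, and the standard-library `pickle` module stands in for cloudpickle so the example runs without extra dependencies.

```python
import pickle
import tempfile
from pathlib import Path


class RecoverableStore:
    def __init__(self, root, mode="lazy"):
        self.root = Path(root)
        self.loaded = {}
        # Names found on disk that have not been deserialized yet.
        self.pending = {p.stem for p in self.root.glob("*.pkl")}
        if mode == "eager":
            for name in list(self.pending):
                self._load(name)

    def publish(self, name, obj):
        # Keep the object in memory and write it to disk right away.
        self.loaded[name] = obj
        (self.root / f"{name}.pkl").write_bytes(pickle.dumps(obj))

    def _load(self, name):
        self.loaded[name] = pickle.loads((self.root / f"{name}.pkl").read_bytes())
        self.pending.discard(name)

    def get(self, name):
        # Lazy path: deserialize only when the artifact is first requested.
        if name not in self.loaded and name in self.pending:
            self._load(name)
        return self.loaded[name]


with tempfile.TemporaryDirectory() as tmp:
    RecoverableStore(tmp).publish("scaler", {"mean": 0.5})
    restarted = RecoverableStore(tmp, mode="lazy")  # simulates a kernel restart
    pending_before = set(restarted.pending)
    value = restarted.get("scaler")
    pending_after = set(restarted.pending)
```

In lazy mode a restart only costs a directory listing; the deserialization cost is paid per artifact on first access. Eager mode pays it all up front.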

Key changes:

Kernel Runtime:
- Add ArtifactPersistence layer using cloudpickle with SHA-256 checksums
- Extend ArtifactStore with persistence integration (auto-save on publish,
  lazy-load on get, auto-delete on clear)
- Add /recover, /recovery-status, /cleanup, /persistence API endpoints
- Switch from deprecated on_event to lifespan context manager
- Configure persistence via PERSISTENCE_ENABLED, PERSISTENCE_PATH,
  KERNEL_ID, and RECOVERY_MODE environment variables
- Add cloudpickle>=3.0.0 dependency
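
A checksum-verified save and load, in the spirit of the persistence layer described above, might look like the following. This is only a sketch: the function names and the sidecar JSON layout are assumptions, and `pickle` is used in place of cloudpickle to keep the example dependency-free.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def save_artifact(directory, name, obj):
    """Serialize obj and record its SHA-256 digest in a sidecar file."""
    data = pickle.dumps(obj)
    (directory / f"{name}.pkl").write_bytes(data)
    meta = {"name": name, "sha256": hashlib.sha256(data).hexdigest()}
    (directory / f"{name}.json").write_text(json.dumps(meta))
    return meta


def load_artifact(directory, name):
    """Load an artifact, refusing to deserialize if the digest differs."""
    data = (directory / f"{name}.pkl").read_bytes()
    meta = json.loads((directory / f"{name}.json").read_text())
    if hashlib.sha256(data).hexdigest() != meta["sha256"]:
        raise ValueError(f"checksum mismatch for artifact {name!r}")
    return pickle.loads(data)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    save_artifact(root, "encoder", {"classes": ["a", "b"]})
    restored = load_artifact(root, "encoder")
```

Checking the digest before unpickling means a truncated or corrupted file is reported as such instead of surfacing as an obscure deserialization error.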

FlowFile Core:
- Add RecoveryMode, RecoveryStatus, CleanupRequest, CleanupResult,
  ArtifactPersistenceInfo models
- Add KernelManager proxy methods for recover, recovery-status,
  cleanup, and persistence endpoints
- Add core API routes: POST /{id}/recover, GET /{id}/recovery-status,
  POST /{id}/cleanup, GET /{id}/persistence
- Pass persistence env vars (KERNEL_ID, PERSISTENCE_ENABLED, etc.)
  when starting kernel Docker containers

Tests: 138 passing (27 new persistence tests + 111 existing)

https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
- Unit-level tests (20 tests): model validation, KernelManager proxy
  methods with mocked HTTP, and Docker env var injection verification
- Docker-based integration tests (17 tests, @pytest.mark.kernel):
  persistence basics, health/recovery endpoints, manual recovery,
  artifact cleanup, manager proxy methods, and re-execution durability

https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
…istence-recovery-5sl5m

Resolved conflict in artifact_store.py by keeping both documentation sections:
- Persistence backend integration description
- Tech debt note about memory pressure for large artifacts

https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
The python_code.svg icon was added but the node definition was still
referencing polars_code.png.

https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
- Move env var reading from module import time to _setup_persistence()
  call time, allowing tests to set PERSISTENCE_PATH before TestClient
  triggers the lifespan handler
- Update conftest.py client fixture to set PERSISTENCE_PATH to tmp_path
- Reset all persistence-related module globals between tests
- Tests now actually write to disk (temp dir) rather than being skipped
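
The difference between reading configuration at import time and at call time can be sketched as below. The PR names the real function `_setup_persistence()`; the body shown here, the accepted truthy strings, and the default path are illustrative assumptions.

```python
import os


def setup_persistence(environ=None):
    """Read persistence settings when called, not when the module is imported."""
    env = os.environ if environ is None else environ
    # Accepted truthy spellings are an assumption for this sketch.
    enabled = env.get("PERSISTENCE_ENABLED", "false").lower() in ("1", "true", "yes")
    path = env.get("PERSISTENCE_PATH", "/tmp/artifacts")  # hypothetical default
    return {"enabled": enabled, "path": path}


# A test can change the environment after import and still be picked up.
os.environ["PERSISTENCE_PATH"] = "/tmp/test-artifacts"
os.environ["PERSISTENCE_ENABLED"] = "true"
config = setup_persistence()
```

Had the values been captured in module-level constants, the assignments to `os.environ` above would have come too late to have any effect.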

https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
@Edwardvaneechoud Edwardvaneechoud force-pushed the claude/artifact-persistence-recovery-5sl5m branch from 73f3307 to a8db73a February 3, 2026 17:07
@codecov-commenter

Codecov Report

❌ Patch coverage is 60.55046% with 43 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| flowfile_core/flowfile_core/kernel/routes.py | 16.66% | 40 Missing ⚠️ |
| flowfile_core/flowfile_core/kernel/manager.py | 91.66% | 3 Missing ⚠️ |


Edwardvaneechoud and others added 4 commits February 3, 2026 19:45
Critical fixes:
- Fix race condition in publish(): track persist_pending state and update
  persisted flag only after successful disk write
- Fix lazy load lock contention: use per-key locks and two-phase loading
  to avoid holding global lock during slow I/O operations
- Initialize to_remove/lazy_remove before lock block in clear_by_node_ids()
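
The per-key locking and two-phase loading mentioned above can be illustrated with a small cache. This is a sketch of the general pattern, not the code in artifact_store.py; `LazyCache`, `slow_loader`, and `load_calls` are hypothetical names.

```python
import threading
import time


class LazyCache:
    def __init__(self, loader):
        self._loader = loader
        self._global_lock = threading.Lock()
        self._key_locks = {}
        self._values = {}
        self.load_calls = 0

    def get(self, key):
        # Phase 1: brief critical section under the global lock.
        with self._global_lock:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        # Phase 2: slow I/O runs while holding only the per-key lock.
        with key_lock:
            with self._global_lock:
                if key in self._values:  # another thread finished first
                    return self._values[key]
            value = self._loader(key)
            with self._global_lock:
                self._values[key] = value
                self.load_calls += 1
            return value


def slow_loader(key):
    time.sleep(0.05)  # stands in for reading a large artifact from disk
    return key.upper()


cache = LazyCache(slow_loader)
threads = [threading.Thread(target=cache.get, args=("model",)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The global lock is never held during the load itself, so requests for other keys are not blocked by one slow artifact, while the per-key lock ensures each artifact is loaded only once.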

Design improvements:
- Explicit metadata field selection in persistence.save() using whitelist
  approach (_PERSISTABLE_FIELDS) instead of filtering out 'object'
- Add ArtifactIdentifier model for type-safe cleanup requests
- Rename RECOVERY_MODE=none to CLEAR with deprecation warning for 'none'
- Strip leading dots in _safe_dirname() to prevent hidden directories
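
Two of the design improvements above lend themselves to a short sketch: accepting the old `none` value with a deprecation warning, and stripping leading dots from directory names. The function bodies, the lowercase spelling `"clear"`, and the character whitelist are assumptions; the PR only names `_safe_dirname()` and the rename itself.

```python
import re
import warnings


def parse_recovery_mode(raw):
    """Normalize RECOVERY_MODE, mapping the deprecated 'none' to 'clear'."""
    value = raw.strip().lower()
    if value == "none":
        warnings.warn(
            "RECOVERY_MODE=none is deprecated; use 'clear' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return "clear"
    if value not in ("lazy", "eager", "clear"):
        raise ValueError(f"unknown recovery mode: {raw!r}")
    return value


def safe_dirname(name):
    """Replace unsafe characters and strip leading dots (no hidden dirs)."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    return cleaned.lstrip(".") or "_"


mode = parse_recovery_mode("eager")
dirname = safe_dirname(".hidden")
```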

https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
- Add persistence_enabled and recovery_mode fields to KernelConfig and KernelInfo
- Move RecoveryMode enum before KernelConfig to allow reference
- Update manager.py to use kernel's persistence settings (not hardcoded)
- Fix sync version of start_kernel which was missing persistence env vars
- Update create_kernel to copy persistence settings from config to info
- Fix misleading error message in get() when lazy load fails (now says
  "exists on disk but failed to load" instead of "not found")
- Add test for per-kernel persistence configuration

https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
Artifacts older than 24 hours are automatically removed when the kernel
container starts. This ensures persistence serves as a recovery mechanism
rather than permanent storage.

Configurable via PERSISTENCE_CLEANUP_HOURS env var (default: 24, 0 to disable).
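
A startup cleanup of this kind could be sketched as follows. The function name `cleanup_old_artifacts`, the `*.pkl` file pattern, and the use of file modification time as the artifact age are assumptions; the env var name, the default of 24, and the meaning of 0 come from the commit message.

```python
import os
import tempfile
import time
from pathlib import Path


def cleanup_old_artifacts(root, max_age_hours, now=None):
    """Delete persisted files older than max_age_hours; 0 disables cleanup."""
    if max_age_hours <= 0:
        return []
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    removed = []
    for path in Path(root).glob("*.pkl"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


# Read the limit at startup, falling back to the documented default of 24.
hours = float(os.environ.get("PERSISTENCE_CLEANUP_HOURS", "24"))

with tempfile.TemporaryDirectory() as tmp:
    stale_file = Path(tmp) / "old.pkl"
    stale_file.write_bytes(b"x")
    two_days_ago = time.time() - 48 * 3600
    os.utime(stale_file, (two_days_ago, two_days_ago))
    removed_names = cleanup_old_artifacts(tmp, 24)
```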

https://claude.ai/code/session_01SSw4fAnguDojqFRwAfZBjA
@Edwardvaneechoud Edwardvaneechoud merged commit 79d76a0 into feauture/kernel-implementation Feb 3, 2026
@Edwardvaneechoud Edwardvaneechoud deleted the claude/artifact-persistence-recovery-5sl5m branch February 3, 2026 20:22
3 participants