
Conversation

@echobt (Contributor) commented Feb 3, 2026

Summary

This PR adds comprehensive infrastructure for integrating challenge crates into Platform v2, with a focus on enabling zero-downtime updates.

Changes

Challenge Infrastructure

  • Add challenges/ directory structure for hosting challenge crates
  • Create platform-challenge-registry crate for challenge lifecycle management
    • Challenge discovery and registration
    • Version management (semver-based)
    • Lifecycle state machine (registered/starting/running/stopping/stopped)
    • Health monitoring with configurable checks
    • State persistence and hot-reload support
    • Migration planning for version upgrades
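A minimal sketch of the lifecycle state machine listed above, using the `LifecycleState` and `ChallengeLifecycle::is_valid_transition` names that appear later in this PR; the exact transition edges shown here are illustrative, not the crate's actual rules:

```rust
/// Sketch only: states come from the PR description; the real transition
/// table in `lifecycle.rs` may allow additional (e.g. failure/restart) edges.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LifecycleState {
    Registered,
    Starting,
    Running,
    Stopping,
    Stopped,
}

pub struct ChallengeLifecycle;

impl ChallengeLifecycle {
    /// Returns true if moving from `from` to `to` is an allowed transition.
    pub fn is_valid_transition(&self, from: &LifecycleState, to: &LifecycleState) -> bool {
        use LifecycleState::*;
        matches!(
            (from, to),
            (Registered, Starting)
                | (Starting, Running)
                | (Running, Stopping)
                | (Stopping, Stopped)
                | (Stopped, Starting) // restart after a clean stop
        )
    }
}
```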

Checkpoint System

  • Add checkpoint system in platform-core for state persistence
  • Implement restoration manager for checkpoint recovery
  • Support automatic state recovery on restart
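On restart, the recovery path boils down to: load the newest checkpoint, let the manager verify it, and hand the state back to the caller. A rough sketch using names that appear later in this PR (`CheckpointManager::load_latest`, `CheckpointHeader`, `CheckpointData`); the exact signatures and error handling are assumptions:

```rust
// Sketch of the startup recovery flow; signatures are assumed, not the
// crate's verified API. `load_latest` is expected to verify the stored
// magic/hash itself and return None when no checkpoint exists.
fn restore_on_startup(manager: &CheckpointManager) -> Option<CheckpointData> {
    match manager.load_latest() {
        Ok(Some((header, data))) => {
            tracing::info!(sequence = header.sequence, "restoring from checkpoint");
            Some(data)
        }
        Ok(None) => {
            tracing::info!("no checkpoint found, starting fresh");
            None
        }
        Err(err) => {
            tracing::warn!(error = %err, "checkpoint load failed, starting fresh");
            None
        }
    }
}
```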

Rolling Updates

  • Add health check endpoints in RPC server for rolling updates
  • Implement graceful shutdown with checkpoint persistence
  • Create periodic checkpoints (every 5 minutes) for resilience

Documentation

  • Add challenge integration guide

Testing

  • Add comprehensive integration tests for checkpoint/restoration system
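For context, the checkpoint round trip these tests cover looks roughly like this. This is a sketch against the `CheckpointManager` API referenced elsewhere in this PR; the constructor and the `Default` usage on `CheckpointData` are guesses:

```rust
// Sketch of a checkpoint round-trip test; tempfile is listed as a dev-dependency.
#[test]
fn checkpoint_round_trip() {
    let dir = tempfile::tempdir().expect("temp dir");
    let mut manager = CheckpointManager::new(dir.path());

    // Hypothetical construction; the real CheckpointData may not implement Default.
    let data = CheckpointData { epoch: 42, ..Default::default() };
    manager.create_checkpoint(&data).expect("create");

    let (header, restored) = manager
        .load_latest()
        .expect("load")
        .expect("checkpoint present");
    assert_eq!(header.sequence, 1);
    assert_eq!(restored.epoch, 42);
}
```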

Benefits

  • Validators can update without losing evaluation progress
  • Challenges can be hot-reloaded without service interruption
  • Automatic recovery from unexpected shutdowns
  • Foundation for adding challenge crates (e.g., term-challenge)

Testing

  • All existing tests pass
  • New checkpoint tests validate persistence and recovery
  • Code compiles with `cargo check --workspace`

Summary by CodeRabbit

Release Notes

  • New Features

    • Added challenge registry system for discovering, registering, and managing multiple challenges with version support
    • Implemented state persistence and recovery system via checkpoints for resilient operation
    • Added health monitoring with configurable checks for system components
    • Introduced challenge lifecycle management with automatic restarts and migration support
  • Documentation

    • Added comprehensive challenge integration guide and platform documentation
  • Infrastructure

    • Enhanced validator with graceful shutdown and state checkpoint capabilities

Create new platform-challenge-registry crate with:
- Challenge discovery and registration
- Version management (semver-based)
- Lifecycle state machine (registered/starting/running/stopping/stopped)
- Health monitoring with configurable checks
- State persistence and hot-reload support
- Migration planning for version upgrades

Modules:
- registry: Main registry with CRUD operations
- lifecycle: State machine for challenge states
- health: Health monitoring and status tracking
- state: State snapshots for hot-reload
- discovery: Challenge discovery from various sources
- migration: Version migration planning
- version: Semantic versioning support
- error: Registry-specific error types
- Add ShutdownHandler struct for checkpoint management
- Create periodic checkpoints every 5 minutes
- Save final checkpoint on graceful shutdown (Ctrl+C)
- Persist evaluation state for hot-reload recovery

This enables validators to update without losing evaluation progress.
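A minimal sketch of what that runtime loop can look like with tokio, assuming a `ShutdownHandler::create_checkpoint` method along the lines described above; the names and error type are illustrative, not the exact code in `bins/validator-node/src/main.rs`:

```rust
use std::time::Duration;

// Illustrative only: `ShutdownHandler` and `create_checkpoint` stand in for
// the types added in this PR; the real API may differ.
async fn run_with_checkpoints(handler: ShutdownHandler) -> anyhow::Result<()> {
    let mut interval = tokio::time::interval(Duration::from_secs(5 * 60));
    loop {
        tokio::select! {
            _ = interval.tick() => {
                // Periodic checkpoint so a crash loses at most ~5 minutes of progress.
                handler.create_checkpoint().await?;
            }
            _ = tokio::signal::ctrl_c() => {
                // Graceful shutdown: persist one final checkpoint, then exit.
                handler.create_checkpoint().await?;
                break;
            }
        }
    }
    Ok(())
}
```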
@coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Introduces a comprehensive challenge registry system with state persistence, checkpoint/restoration capabilities, and health monitoring infrastructure. Adds a new crates/challenge-registry crate with discovery, lifecycle, migration, versioning, and state management modules. Implements checkpoint-based state persistence in core for graceful shutdown and recovery, plus a health check system for the RPC server.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Workspace Configuration**<br>`Cargo.toml` | Added `crates/challenge-registry` workspace member and `[workspace.metadata.challenge-features]` with dynamic-loading flag. |
| **Checkpoint & Restoration System**<br>`crates/core/src/checkpoint.rs`, `crates/core/src/restoration.rs`, `crates/core/src/error.rs`, `crates/core/src/lib.rs` | Implemented checkpoint creation, loading, and atomic persistence with integrity verification; added restoration manager with filtering, validation, and recovery flows; introduced Validation error variant and module exports. |
| **Challenge Registry Core**<br>`crates/challenge-registry/src/error.rs`, `crates/challenge-registry/src/registry.rs`, `crates/challenge-registry/src/lib.rs` | Created centralized challenge registry with storage, lifecycle integration, and health monitoring; defined error types with conversions; established public module structure and re-exports. |
| **Challenge Discovery & Version Management**<br>`crates/challenge-registry/src/discovery.rs`, `crates/challenge-registry/src/version.rs` | Implemented multi-source challenge discovery with configurable scanning; added version parsing, comparison, and constraint resolution (Exact, AtLeast, Range, Compatible, Any). |
| **Challenge Lifecycle & State**<br>`crates/challenge-registry/src/lifecycle.rs`, `crates/challenge-registry/src/state.rs`, `crates/challenge-registry/src/migration.rs` | Added lifecycle state machine with valid transitions; implemented thread-safe state store with snapshot-based recovery; introduced migration framework with plan generation based on version deltas and rollback support. |
| **Health Monitoring**<br>`crates/challenge-registry/src/health.rs`, `crates/rpc-server/src/health.rs`, `crates/rpc-server/src/lib.rs` | Created health status tracking with configurable checks and response time averaging; added RPC server health endpoint with component-level status, readiness state, and uptime tracking. |
| **Validator Node Integration**<br>`bins/validator-node/src/main.rs` | Integrated ShutdownHandler to create final checkpoints on graceful shutdown; added periodic checkpoint cycles during runtime with configurable intervals. |
| **Challenge Registry Manifest**<br>`crates/challenge-registry/Cargo.toml`, `crates/core/src/checkpoint.rs` dev-dependencies | Defined challenge registry dependencies (async-trait, parking_lot, serde, tracing, semver, uuid); added tempfile for checkpoint tests. |
| **Documentation & Tests**<br>`challenges/README.md`, `challenges/.gitkeep`, `docs/challenge-integration.md`, `tests/Cargo.toml`, `tests/checkpoint_tests.rs` | Added challenge crate guidelines and integration guide; included directory placeholder; created comprehensive checkpoint round-trip and restoration validation tests. |
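The checkpoint cohort above mentions "atomic persistence with integrity verification". Stripped of the crate-specific header format, the core pattern is: serialize, hash, write to a temporary file, sync, then rename. A generic sketch of that pattern (the `blake3` hash here is just an example choice, not necessarily what `checkpoint.rs` uses):

```rust
use std::{fs, io::Write, path::Path};

/// Generic write-temp-then-rename sketch; the real checkpoint file adds magic
/// bytes, a length-prefixed bincode header, and a versioned layout on top.
fn write_atomically(dir: &Path, name: &str, payload: &[u8]) -> std::io::Result<()> {
    let tmp = dir.join(format!("{name}.tmp"));
    let final_path = dir.join(name);

    // Store a content hash alongside the payload so a reader can detect corruption.
    let digest = blake3::hash(payload);

    let mut file = fs::File::create(&tmp)?;
    file.write_all(digest.as_bytes())?;
    file.write_all(payload)?;
    file.sync_all()?; // flush to disk before the rename makes the file visible

    // Rename on the same filesystem is atomic, so readers never see a partial checkpoint.
    fs::rename(&tmp, &final_path)
}
```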

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Validator as Validator Node
    participant Registry as Challenge Registry
    participant Checkpoint as Checkpoint Manager
    participant Disk as Disk Storage
    participant Restorable as Restoration Manager

    Validator->>Registry: Initialize challenge registry
    Registry-->>Validator: Ready

    loop During Runtime
        Validator->>Registry: Update challenge state<br/>(evaluations, health)
        Registry-->>Validator: Acknowledged
        Validator->>Checkpoint: Periodic create_checkpoint()
        Checkpoint->>Checkpoint: Serialize state + compute hash
        Checkpoint->>Disk: Atomic write (temp → final)
        Disk-->>Checkpoint: Checkpoint persisted
    end

    Note over Validator: Ctrl+C Signal

    Validator->>Checkpoint: Final create_checkpoint()
    Checkpoint->>Disk: Write final state
    Validator-->>Validator: Shutdown

    Note over Restorable: Later: Recovery

    Restorable->>Checkpoint: load_latest()
    Checkpoint->>Disk: Read checkpoint + verify hash
    Disk-->>Checkpoint: CheckpointData
    Checkpoint-->>Restorable: (CheckpointHeader, CheckpointData)
    Restorable->>Restorable: Validate & filter state
    Restorable-->>Validator: Restored state ready
```
```mermaid
sequenceDiagram
    participant Client as Client
    participant Registry as Challenge Registry
    participant Discovery as Discovery Engine
    participant Health as Health Monitor
    participant Lifecycle as Lifecycle Manager

    Client->>Discovery: discover_from_local(path)
    Discovery->>Discovery: Scan filesystem<br/>Check challenge.toml
    Discovery-->>Client: Vec<DiscoveredChallenge>

    Client->>Registry: register(ChallengeEntry)
    Registry->>Lifecycle: Track new lifecycle
    Registry->>Health: Initialize health state
    Registry-->>Client: ChallengeId

    loop Health Checks
        Health->>Health: Check endpoint<br/>Record metrics
        Health-->>Registry: Update HealthStatus
    end

    Client->>Registry: update_version(id, new_version)
    Registry->>Registry: Validate compatibility
    Registry-->>Client: Old version
```
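Translated into code, the discovery-and-registration half of this diagram is roughly the following. This is a sketch under assumed names (`DiscoveryEngine`, `ChallengeEntry::from`); only `discover_from_local`, `register`, and `RegistryResult` appear verbatim elsewhere in this review:

```rust
// Sketch of the client-side flow from the diagram; signatures are assumptions.
fn bring_up_local_challenges(
    registry: &ChallengeRegistry,
    discovery: &DiscoveryEngine,
) -> RegistryResult<()> {
    let root = std::path::PathBuf::from("challenges/");
    for discovered in discovery.discover_from_local(&root)? {
        // Hypothetical conversion; the real entry type may be built differently.
        let id = registry.register(ChallengeEntry::from(discovered))?;
        tracing::info!(challenge_id = %id, "challenge registered");
    }
    Ok(())
}
```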

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 The registry hops with vibrant grace,
Challenges discovered, tracked in place,
Checkpoints whisper state across the night,
Health beats steady, keeping all things right,
Version migrations flow like gentle streams,
A fresh foundation built on solid dreams!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'feat: add challenge integration infrastructure with checkpoint persistence' directly and specifically summarizes the main changes: introducing challenge integration infrastructure with checkpoint persistence as a core feature. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.59% which is sufficient. The required threshold is 80.00%. |


@coderabbitai bot left a comment

Actionable comments posted: 9

🤖 Fix all issues with AI agents
In `@challenges/README.md`:
- Around line 7-12: The fenced code block in README.md that displays the
directory tree (the block starting with the triple backticks followed by the
tree: "challenges/ ├── README.md ...") is missing a language tag; update that
fenced block to include the language tag "text" (i.e., change the opening ``` to
```text) so Markdownlint MD040 is satisfied and the tree is treated as plain
text.

In `@crates/challenge-registry/src/health.rs`:
- Around line 91-100: The method record_failure currently uses a hardcoded
threshold (>= 3) to flip status to Unhealthy; update it to use the configured
threshold instead by reading HealthConfig.failure_threshold (or accept a
threshold parameter) so changes to HealthConfig take effect — either change
record_failure on the Health struct to accept a threshold argument and use that
value, or have HealthMonitor::record_failure call into the existing
record_failure and pass self.config.failure_threshold; make sure to replace the
literal 3 with the referenced failure_threshold and keep updating
consecutive_failures, last_check_at, and HealthStatus logic unchanged.

In `@crates/challenge-registry/src/migration.rs`:
- Around line 345-357: The current finalize_migration removes the plan from
active_plans before checking its status, which can drop non-terminal plans and
prevents failed plans from being archived; update finalize_migration to first
look up (read/clone or inspect) the MigrationPlan in self.active_plans without
removing it, verify that its status is a terminal state (allow Completed,
RolledBack, and Failed as terminal), and only then call remove(challenge_id) to
take it out and return the plan; reference the finalize_migration method,
active_plans map, MigrationPlan status checks (e.g., is_complete or status enum
variants) and ensure compatibility with fail_migration which sets Failed so
failed plans become removable/archivable.

In `@crates/challenge-registry/src/registry.rs`:
- Around line 177-202: The update_state function allows arbitrary lifecycle
changes but must validate transitions using
ChallengeLifecycle::is_valid_transition; modify update_state (in registry.rs) to
check ChallengeLifecycle::is_valid_transition(&old_state, &new_state) before
applying changes and if invalid return a new
RegistryError::InvalidStateTransition containing old_state and new_state (add
that variant to RegistryError), only update
registered.entry.lifecycle_state/updated_at and emit
LifecycleEvent::StateChanged when the transition is valid; ensure debug logging
remains and that the function returns Err(...) for invalid transitions instead
of proceeding.

In `@crates/challenge-registry/src/version.rs`:
- Around line 69-79: The current Ord impl for ChallengeVersion (fn cmp) ignores
the prerelease field; update cmp in the impl for ChallengeVersion to enforce
semver precedence: after comparing major, minor, patch, if equal compare
prerelease such that a missing prerelease (release) has higher precedence than
any prerelease (i.e., None > Some), and when both have prerelease strings
compare their dot-separated identifiers in order using semver rules (numeric
identifiers compare numerically and have lower precedence than non-numeric
identifiers; non-numeric compare lexicographically; if all equal the shorter
identifier list has lower precedence). Use the existing fields
ChallengeVersion.prerelease and the cmp function to implement this logic in the
Ord::cmp for ChallengeVersion.

In `@crates/core/src/checkpoint.rs`:
- Around line 231-235: Do not increment self.current_sequence before persisting;
instead compute let next_sequence = self.current_sequence + 1, use
checkpoint_filename(next_sequence) to write and rename the temp file, and only
after the atomic rename succeeds assign self.current_sequence = next_sequence so
failures don't advance the in-memory sequence. Apply the same pattern to the
other similar routine (the block at 279-283) so current_sequence is updated only
on successful persistence; keep all filename construction and rename logic using
checkpoint_filename(next_sequence) and leave load_latest behavior unchanged.
- Around line 321-357: The code reads untrusted sizes into header_len and
header.data_size and allocates vectors directly, which can OOM on corrupt files;
before allocating check bounds: define reasonable maxs (e.g. MAX_HEADER_SIZE and
MAX_DATA_SIZE), validate header_len is <= MAX_HEADER_SIZE and less than the
remaining reader/file length, then allocate header_bytes; after deserializing
CheckpointHeader, validate header.data_size <= MAX_DATA_SIZE and that
header.data_size does not exceed remaining reader/file length before allocating
data_bytes; on violations return a MiniChainError::Storage with a clear message.
Use the existing symbols header_len, CheckpointHeader, header.data_size, reader,
and CHECKPOINT_VERSION to locate insertion points.

In `@crates/rpc-server/src/health.rs`:
- Around line 196-221: get_overall_status currently ignores the challenges field
on ComponentStatus even though set_component_status and ComponentStatus track
it; update get_overall_status to include components.challenges in the unhealthy
and degraded checks (treat challenges the same way as p2p/storage/consensus),
and ensure the final special-case check (the bittensor check) still runs as
intended — modify the get_overall_status function to reference
components.challenges alongside components.p2p, components.storage, and
components.consensus when computing overall HealthStatus.

In `@docs/challenge-integration.md`:
- Around line 42-51: The Markdown code fence showing the project tree lacks a
language tag; update the block in docs/challenge-integration.md by changing the
opening ``` to ```text so the project-structure block (the my-challenge/ tree
with Cargo.toml, src/lib.rs, evaluation.rs, scoring.rs, Dockerfile, README.md)
is fenced as ```text which will silence MD040 warnings and improve
rendering/readability.
🧹 Nitpick comments (13)
challenges/.gitkeep (1)

1-1: Consider making the .gitkeep file truly empty or add a comment.

By convention, .gitkeep files are typically either completely empty (0 bytes) or contain a brief comment explaining their purpose. The current single newline works but is slightly unconventional.

📝 Alternative approaches

Option 1: Remove the empty line entirely to make it a true 0-byte file.

Option 2: Add a descriptive comment:

```diff
-
+# Placeholder directory for challenge crate modules
```
crates/challenge-registry/src/error.rs (1)

45-49: Consider preserving more context for I/O errors.

Mapping all std::io::Error to Internal loses context about whether the error originated from state persistence, file operations, or network I/O. This could make debugging harder.

💡 Optional improvement

Consider mapping I/O errors contextually where they occur, or including the error kind:

```diff
 impl From<std::io::Error> for RegistryError {
     fn from(err: std::io::Error) -> Self {
-        RegistryError::Internal(err.to_string())
+        RegistryError::Internal(format!("IO error ({}): {}", err.kind(), err))
     }
 }
```
tests/checkpoint_tests.rs (1)

139-146: Clarify the distinction between data sequence and checkpoint sequence.

The test expects latest.sequence (from CheckpointData) to be 9, while header.sequence is 10. This works because the loop index i goes 0-9, but it might confuse readers. Consider adding a comment explaining the distinction.

📝 Optional clarification
```diff
     // Latest should be sequence 10
+    // Note: header.sequence is the checkpoint file sequence (1-indexed)
+    // data.sequence is the application sequence from the loop (0-indexed)
     let (header, latest) = manager
         .load_latest()
         .expect("Failed to load")
         .expect("No checkpoint");
     assert_eq!(latest.sequence, 9);
     assert_eq!(header.sequence, 10);
```
crates/challenge-registry/src/lifecycle.rs (1)

56-78: Auto-restart configuration defined but restart tracking not implemented.

The auto_restart and max_restart_attempts fields are configured but there's no state or method to track actual restart attempts per challenge. Consider whether restart count tracking should be managed here or delegated to the registry.

crates/challenge-registry/src/health.rs (1)

111-114: recovery_threshold is defined but never used.

The HealthConfig.recovery_threshold field (line 113) is documented as "Number of successes to recover from unhealthy" but record_success unconditionally sets status to Healthy. Consider implementing recovery logic or removing the unused field.

crates/challenge-registry/src/discovery.rs (2)

158-194: Code duplication between challenge.toml and Cargo.toml handling.

The logic for extracting the challenge name and creating DiscoveredChallenge is nearly identical between the two branches. Consider extracting a helper function.

Proposed refactor to reduce duplication
```diff
+    fn create_discovered_from_path(&self, path: &PathBuf) -> DiscoveredChallenge {
+        let name = path
+            .file_name()
+            .and_then(|n| n.to_str())
+            .unwrap_or("unknown")
+            .to_string();
+
+        DiscoveredChallenge {
+            name,
+            version: ChallengeVersion::default(),
+            docker_image: None,
+            local_path: Some(path.clone()),
+            health_endpoint: None,
+            evaluation_endpoint: None,
+            metadata: ChallengeMetadata::default(),
+            source: DiscoverySource::LocalFilesystem(path.clone()),
+        }
+    }
+
     pub fn discover_from_local(&self, path: &PathBuf) -> RegistryResult<Vec<DiscoveredChallenge>> {
         // ... validation ...
         
         if path.is_dir() {
             let challenge_toml = path.join("challenge.toml");
             let cargo_toml = path.join("Cargo.toml");
 
             if challenge_toml.exists() || cargo_toml.exists() {
-                // ... duplicated code ...
+                challenges.push(self.create_discovered_from_path(path));
             }
         }
         Ok(challenges)
     }
```

121-140: discover_all only implements local discovery; Docker and P2P are placeholders.

The method iterates only over local_paths while docker_registries and enable_p2p config fields are unused. This appears intentional for the initial infrastructure, but consider adding TODO comments or returning a warning when these sources are configured but not yet implemented.

crates/challenge-registry/src/state.rs (1)

206-209: Inefficient snapshot trimming with remove(0) in loop.

Using remove(0) on a Vec in a loop is O(n²) because each removal shifts all subsequent elements. With default max_snapshots=5, this is negligible, but consider using drain for cleaner O(n) removal if this could be configured higher.

Proposed fix using drain
```diff
         // Trim old snapshots
-        while snapshots.len() > self.max_snapshots {
-            snapshots.remove(0);
-        }
+        if snapshots.len() > self.max_snapshots {
+            let excess = snapshots.len() - self.max_snapshots;
+            snapshots.drain(0..excess);
+        }
```
crates/core/src/restoration.rs (4)

256-275: Stale evaluation filtering is a no-op placeholder.

The skip_stale_evaluations option is respected, but the actual filter logic always returns true, keeping all evaluations. The comment acknowledges this: "For now, keep all pending (they don't have epoch info)".

Consider either:

  1. Implementing the logic if PendingEvaluationState has epoch information available
  2. Removing the skip_stale_evaluations option until it's implemented
  3. Adding a TODO comment that's more visible

286-292: Magic number for epoch validation should be a named constant.

The 1_000_000 epoch limit is arbitrary. Consider defining it as a constant with documentation explaining the rationale.

Proposed fix
```diff
+/// Maximum reasonable epoch value for validation.
+/// This is a sanity check to detect corrupted checkpoints.
+const MAX_REASONABLE_EPOCH: u64 = 1_000_000;
+
     fn validate_data(&self, data: &CheckpointData) -> Result<()> {
         // Validate epoch is reasonable
-        if data.epoch > 1_000_000 {
+        if data.epoch > MAX_REASONABLE_EPOCH {
             return Err(MiniChainError::Validation(
                 "Checkpoint epoch seems unreasonably high".into(),
             ));
         }
```

336-351: get_checkpoint_info loads full checkpoint data for metadata extraction.

This method loads the entire checkpoint (including all evaluations) just to extract summary information. If CheckpointManager supported header-only loading, this could be more efficient for listing many checkpoints.


372-379: Add documentation explaining expected implementors of the Restorable trait.

The trait is exported in the public API but has no implementations in the codebase and lacks documentation about intended use. Consider adding a doc comment clarifying whether external crates are expected to implement this trait or if it's reserved for future use within this crate.

crates/challenge-registry/src/registry.rs (1)

274-279: Event listeners executed synchronously under lock.

emit_event calls all listeners while holding the read lock on event_listeners. Long-running or blocking listeners could delay other registry operations. Consider:

  1. Cloning listeners before iterating (release lock sooner)
  2. Spawning listener calls asynchronously
  3. Documenting that listeners should be non-blocking
Option: Clone listeners to release lock faster
```diff
     fn emit_event(&self, event: LifecycleEvent) {
-        for listener in self.event_listeners.read().iter() {
+        let listeners: Vec<_> = self.event_listeners.read().iter().collect();
+        // Lock released here
+        for listener in listeners {
             listener(event.clone());
         }
     }
```

Note: This requires listeners to be Clone, which Box<dyn Fn> is not. Alternative: store Arc<dyn Fn> instead.

Comment on lines +7 to +12
```
challenges/
├── README.md # This file
├── example-challenge/ # Example challenge template (future)
└── [your-challenge]/ # Your custom challenge crate
```

⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced block.
Markdownlint MD040 flags fenced blocks without a language; use text here.

Proposed fix
````diff
-```
+```text
 challenges/
 ├── README.md           # This file
 ├── example-challenge/  # Example challenge template (future)
 └── [your-challenge]/   # Your custom challenge crate
````

Comment on lines +91 to +100
```rust
    pub fn record_failure(&mut self, reason: String) {
        self.consecutive_failures += 1;
        self.last_check_at = chrono::Utc::now().timestamp_millis();

        if self.consecutive_failures >= 3 {
            self.status = HealthStatus::Unhealthy(reason);
        } else {
            self.status = HealthStatus::Degraded(reason);
        }
    }
```

⚠️ Potential issue | 🟡 Minor

Hardcoded failure threshold ignores HealthConfig.failure_threshold.

The record_failure method hardcodes >= 3 for marking unhealthy, but HealthConfig has a configurable failure_threshold field (default 3). This creates an inconsistency where changing the config has no effect.

Proposed fix to use configurable threshold

Either pass the threshold to record_failure:

```diff
-    pub fn record_failure(&mut self, reason: String) {
+    pub fn record_failure(&mut self, reason: String, failure_threshold: u32) {
         self.consecutive_failures += 1;
         self.last_check_at = chrono::Utc::now().timestamp_millis();

-        if self.consecutive_failures >= 3 {
+        if self.consecutive_failures >= failure_threshold {
             self.status = HealthStatus::Unhealthy(reason);
         } else {
             self.status = HealthStatus::Degraded(reason);
         }
     }
```

Or have HealthMonitor::record_failure apply the threshold from its config.


Comment on lines +345 to +357
```rust
    /// Finalize and archive a completed migration
    pub fn finalize_migration(&self, challenge_id: &ChallengeId) -> RegistryResult<MigrationPlan> {
        let plan = self
            .active_plans
            .write()
            .remove(challenge_id)
            .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;

        if !plan.is_complete() {
            return Err(RegistryError::MigrationFailed(
                "Migration not complete".to_string(),
            ));
        }
```

⚠️ Potential issue | 🟠 Major

Finalize only after verifying a terminal status (and keep failed plans archivable).

Line 345 removes the plan before checking completion, so a premature call drops the active migration. Also, fail_migration sets Failed but finalize_migration rejects anything outside Completed/RolledBack, leaving failed plans stuck in active_plans. Validate status while the plan is still in the map and allow terminal states before removal.

🔧 Suggested fix
```diff
-        let plan = self
-            .active_plans
-            .write()
-            .remove(challenge_id)
-            .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;
-
-        if !plan.is_complete() {
-            return Err(RegistryError::MigrationFailed(
-                "Migration not complete".to_string(),
-            ));
-        }
+        let mut plans = self.active_plans.write();
+        let is_terminal = plans
+            .get(challenge_id)
+            .map(|p| {
+                matches!(
+                    p.status,
+                    MigrationStatus::Completed
+                        | MigrationStatus::RolledBack
+                        | MigrationStatus::Failed(_)
+                )
+            })
+            .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;
+
+        if !is_terminal {
+            return Err(RegistryError::MigrationFailed(
+                "Migration not complete".to_string(),
+            ));
+        }
+
+        let plan = plans.remove(challenge_id).expect("checked above");
```

Comment on lines +177 to +202
```rust
    /// Update challenge lifecycle state
    pub fn update_state(&self, id: &ChallengeId, new_state: LifecycleState) -> RegistryResult<()> {
        let mut challenges = self.challenges.write();
        let registered = challenges
            .get_mut(id)
            .ok_or_else(|| RegistryError::ChallengeNotFound(id.to_string()))?;

        let old_state = registered.entry.lifecycle_state.clone();
        registered.entry.lifecycle_state = new_state.clone();
        registered.entry.updated_at = chrono::Utc::now().timestamp_millis();

        debug!(
            challenge_id = %id,
            old_state = ?old_state,
            new_state = ?new_state,
            "Challenge state updated"
        );

        self.emit_event(LifecycleEvent::StateChanged {
            challenge_id: *id,
            old_state,
            new_state,
        });

        Ok(())
    }
```

⚠️ Potential issue | 🟠 Major

update_state doesn't validate transitions using ChallengeLifecycle.is_valid_transition.

The ChallengeLifecycle struct has is_valid_transition logic (defined in lifecycle.rs), but update_state doesn't use it. Invalid state transitions are allowed, which could lead to inconsistent states.

Proposed fix to validate transitions
```diff
     pub fn update_state(&self, id: &ChallengeId, new_state: LifecycleState) -> RegistryResult<()> {
         let mut challenges = self.challenges.write();
         let registered = challenges
             .get_mut(id)
             .ok_or_else(|| RegistryError::ChallengeNotFound(id.to_string()))?;
 
         let old_state = registered.entry.lifecycle_state.clone();
+        
+        if !self.lifecycle.is_valid_transition(&old_state, &new_state) {
+            return Err(RegistryError::InvalidStateTransition(
+                format!("{:?} -> {:?}", old_state, new_state)
+            ));
+        }
+        
         registered.entry.lifecycle_state = new_state.clone();
         registered.entry.updated_at = chrono::Utc::now().timestamp_millis();

Note: This requires adding an InvalidStateTransition variant to RegistryError.


Comment on lines +69 to +79
```rust
impl Ord for ChallengeVersion {
    fn cmp(&self, other: &Self) -> Ordering {
        match self.major.cmp(&other.major) {
            Ordering::Equal => match self.minor.cmp(&other.minor) {
                Ordering::Equal => self.patch.cmp(&other.patch),
                ord => ord,
            },
            ord => ord,
        }
    }
}
```

⚠️ Potential issue | 🟡 Minor

Prerelease versions are not considered in ordering.

The Ord implementation ignores the prerelease field, meaning 1.0.0-alpha and 1.0.0 are considered equal in comparisons. Per semver specification, prerelease versions should have lower precedence than the release version (i.e., 1.0.0-alpha < 1.0.0).

This could cause issues when comparing versions during migration planning or constraint satisfaction.

💡 Suggested fix for semver-compliant ordering
```diff
 impl Ord for ChallengeVersion {
     fn cmp(&self, other: &Self) -> Ordering {
         match self.major.cmp(&other.major) {
             Ordering::Equal => match self.minor.cmp(&other.minor) {
-                Ordering::Equal => self.patch.cmp(&other.patch),
+                Ordering::Equal => match self.patch.cmp(&other.patch) {
+                    Ordering::Equal => {
+                        // Prerelease versions have lower precedence than release
+                        match (&self.prerelease, &other.prerelease) {
+                            (None, None) => Ordering::Equal,
+                            (Some(_), None) => Ordering::Less,    // prerelease < release
+                            (None, Some(_)) => Ordering::Greater, // release > prerelease
+                            (Some(a), Some(b)) => a.cmp(b),       // compare prerelease strings
+                        }
+                    }
+                    ord => ord,
+                },
                 ord => ord,
             },
             ord => ord,
         }
     }
 }
```

Comment on lines +231 to +235
```rust
    pub fn create_checkpoint(&mut self, data: &CheckpointData) -> Result<PathBuf> {
        self.current_sequence += 1;
        let sequence = self.current_sequence;
        let filename = self.checkpoint_filename(sequence);
```

⚠️ Potential issue | 🟠 Major

Defer current_sequence updates until after the checkpoint is safely persisted.

Line 232 increments before the write/rename. If I/O fails, current_sequence points to a non-existent file, and load_latest returns None even though older checkpoints exist. Compute the next sequence locally and update only after the rename succeeds.

🔧 Suggested fix
```diff
-        self.current_sequence += 1;
-        let sequence = self.current_sequence;
+        let sequence = self.current_sequence + 1;
         let filename = self.checkpoint_filename(sequence);
@@
         fs::rename(&temp_filename, &filename).map_err(|e| {
             MiniChainError::Storage(format!("Failed to finalize checkpoint: {}", e))
         })?;
+
+        self.current_sequence = sequence;
```

Also applies to: 279-283


Comment on lines +321 to +357
```rust
        // Read header length
        let mut header_len_bytes = [0u8; 4];
        reader
            .read_exact(&mut header_len_bytes)
            .map_err(|e| MiniChainError::Storage(format!("Failed to read header length: {}", e)))?;
        let header_len = u32::from_le_bytes(header_len_bytes) as usize;

        // Read header
        let mut header_bytes = vec![0u8; header_len];
        reader
            .read_exact(&mut header_bytes)
            .map_err(|e| MiniChainError::Storage(format!("Failed to read header: {}", e)))?;

        let header: CheckpointHeader = bincode::deserialize(&header_bytes).map_err(|e| {
            MiniChainError::Serialization(format!("Failed to deserialize header: {}", e))
        })?;

        // Verify magic
        if !header.verify_magic() {
            return Err(MiniChainError::Storage(
                "Invalid checkpoint magic bytes".into(),
            ));
        }

        // Verify version compatibility
        if header.version > CHECKPOINT_VERSION {
            return Err(MiniChainError::Storage(format!(
                "Checkpoint version {} is newer than supported version {}",
                header.version, CHECKPOINT_VERSION
            )));
        }

        // Read data
        let mut data_bytes = vec![0u8; header.data_size as usize];
        reader
            .read_exact(&mut data_bytes)
            .map_err(|e| MiniChainError::Storage(format!("Failed to read data: {}", e)))?;
```

⚠️ Potential issue | 🟠 Major

Add size bounds before allocating header/data buffers.

Line 326 uses a header length sourced from disk, and Line 354 uses data_size from the header. A corrupt file can force large allocations and OOM. Add reasonable caps (or validate against file length) before allocating.

🔧 Suggested fix
```diff
         // Read header length
         let mut header_len_bytes = [0u8; 4];
         reader
             .read_exact(&mut header_len_bytes)
             .map_err(|e| MiniChainError::Storage(format!("Failed to read header length: {}", e)))?;
         let header_len = u32::from_le_bytes(header_len_bytes) as usize;
+        const MAX_HEADER_SIZE: usize = 16 * 1024;
+        const MAX_DATA_SIZE: u64 = 512 * 1024 * 1024;
+        if header_len == 0 || header_len > MAX_HEADER_SIZE {
+            return Err(MiniChainError::Storage("Checkpoint header too large".into()));
+        }
@@
         let header: CheckpointHeader = bincode::deserialize(&header_bytes).map_err(|e| {
             MiniChainError::Serialization(format!("Failed to deserialize header: {}", e))
         })?;
+
+        if header.data_size > MAX_DATA_SIZE {
+            return Err(MiniChainError::Storage("Checkpoint data too large".into()));
+        }
```

Comment on lines +196 to +221
```rust
    fn get_overall_status(&self) -> HealthStatus {
        let components = self.components.read();

        // If any component is unhealthy, overall is unhealthy
        if components.p2p == HealthStatus::Unhealthy
            || components.storage == HealthStatus::Unhealthy
            || components.consensus == HealthStatus::Unhealthy
        {
            return HealthStatus::Unhealthy;
        }

        // If any critical component is degraded, overall is degraded
        if components.p2p == HealthStatus::Degraded
            || components.storage == HealthStatus::Degraded
            || components.consensus == HealthStatus::Degraded
        {
            return HealthStatus::Degraded;
        }

        // If Bittensor is down but others are fine, degraded
        if components.bittensor == HealthStatus::Unhealthy {
            return HealthStatus::Degraded;
        }

        HealthStatus::Healthy
    }
```

⚠️ Potential issue | 🟡 Minor

The challenges component status is tracked but not used in health calculation.

The set_component_status method accepts "challenges" (line 190), and ComponentStatus includes a challenges field, but get_overall_status() doesn't consider it when determining overall health. If challenge container health should affect the overall status, it should be included in this logic.

💡 Suggested fix if challenges should affect health
```diff
     fn get_overall_status(&self) -> HealthStatus {
         let components = self.components.read();

         // If any component is unhealthy, overall is unhealthy
         if components.p2p == HealthStatus::Unhealthy
             || components.storage == HealthStatus::Unhealthy
             || components.consensus == HealthStatus::Unhealthy
         {
             return HealthStatus::Unhealthy;
         }

         // If any critical component is degraded, overall is degraded
         if components.p2p == HealthStatus::Degraded
             || components.storage == HealthStatus::Degraded
             || components.consensus == HealthStatus::Degraded
         {
             return HealthStatus::Degraded;
         }

-        // If Bittensor is down but others are fine, degraded
-        if components.bittensor == HealthStatus::Unhealthy {
+        // If Bittensor or challenges are down but core is fine, degraded
+        if components.bittensor == HealthStatus::Unhealthy
+            || components.challenges == HealthStatus::Unhealthy
+        {
             return HealthStatus::Degraded;
         }

         HealthStatus::Healthy
     }
```

Comment on lines +42 to +51
```
my-challenge/
├── Cargo.toml
├── src/
│ ├── lib.rs # Challenge implementation
│ ├── evaluation.rs # Evaluation logic
│ └── scoring.rs # Scoring algorithm
├── Dockerfile # Container build
└── README.md # Documentation
```

⚠️ Potential issue | 🟡 Minor

Add a language tag to the project-structure block.
This avoids MD040 warnings and improves readability.

Proposed fix
````diff
-```
+```text
 my-challenge/
 ├── Cargo.toml
 ├── src/
 │   ├── lib.rs           # Challenge implementation
 │   ├── evaluation.rs    # Evaluation logic
 │   └── scoring.rs       # Scoring algorithm
 ├── Dockerfile           # Container build
 └── README.md           # Documentation
````

@echobt merged commit b50ce52 into main on Feb 3, 2026 (7 checks passed)
@echobt deleted the feat/challenge-integration-1770118346 branch on February 3, 2026 at 11:43