
Conversation

@echobt (Contributor) commented Feb 3, 2026

Summary

This PR adds comprehensive infrastructure for integrating challenge crates into Platform v2, with a focus on enabling zero-downtime updates.

Changes

Challenge Infrastructure

  • Add challenges/ directory structure for hosting challenge crates
  • Create platform-challenge-registry crate for challenge lifecycle management
    • Challenge discovery and registration
    • Version management (semver-based)
    • Lifecycle state machine (registered/starting/running/stopping/stopped)
    • Health monitoring with configurable checks
    • State persistence and hot-reload support
    • Migration planning for version upgrades
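A minimal sketch of the lifecycle state machine listed above, using the `LifecycleState` and `ChallengeLifecycle::is_valid_transition` names that appear later in this PR; the exact transition edges shown here are illustrative, not the crate's actual rules:

```rust
/// Sketch only: states come from the PR description; the real transition
/// table in `lifecycle.rs` may allow additional (e.g. failure/restart) edges.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LifecycleState {
    Registered,
    Starting,
    Running,
    Stopping,
    Stopped,
}

pub struct ChallengeLifecycle;

impl ChallengeLifecycle {
    /// Returns true if moving from `from` to `to` is an allowed transition.
    pub fn is_valid_transition(&self, from: &LifecycleState, to: &LifecycleState) -> bool {
        use LifecycleState::*;
        matches!(
            (from, to),
            (Registered, Starting)
                | (Starting, Running)
                | (Running, Stopping)
                | (Stopping, Stopped)
                | (Stopped, Starting) // restart after a clean stop
        )
    }
}
```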

Checkpoint System

  • Add checkpoint system in platform-core for state persistence
  • Implement restoration manager for checkpoint recovery
  • Support automatic state recovery on restart
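On restart, the recovery path boils down to: load the newest checkpoint, let the manager verify it, and hand the state back to the caller. A rough sketch using names that appear later in this PR (`CheckpointManager::load_latest`, `CheckpointHeader`, `CheckpointData`); the exact signatures and error handling are assumptions:

```rust
// Sketch of the startup recovery flow; signatures are assumed, not the
// crate's verified API. `load_latest` is expected to verify the stored
// magic/hash itself and return None when no checkpoint exists.
fn restore_on_startup(manager: &CheckpointManager) -> Option<CheckpointData> {
    match manager.load_latest() {
        Ok(Some((header, data))) => {
            tracing::info!(sequence = header.sequence, "restoring from checkpoint");
            Some(data)
        }
        Ok(None) => {
            tracing::info!("no checkpoint found, starting fresh");
            None
        }
        Err(err) => {
            tracing::warn!(error = %err, "checkpoint load failed, starting fresh");
            None
        }
    }
}
```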

Rolling Updates

  • Add health check endpoints in RPC server for rolling updates
  • Implement graceful shutdown with checkpoint persistence
  • Create periodic checkpoints (every 5 minutes) for resilience

Documentation

  • Add challenge integration guide

Testing

  • Add comprehensive integration tests for checkpoint/restoration system
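For context, the checkpoint round trip these tests cover looks roughly like this. This is a sketch against the `CheckpointManager` API referenced elsewhere in this PR; the constructor and the `Default` usage on `CheckpointData` are guesses:

```rust
// Sketch of a checkpoint round-trip test; tempfile is listed as a dev-dependency.
#[test]
fn checkpoint_round_trip() {
    let dir = tempfile::tempdir().expect("temp dir");
    let mut manager = CheckpointManager::new(dir.path());

    // Hypothetical construction; the real CheckpointData may not implement Default.
    let data = CheckpointData { epoch: 42, ..Default::default() };
    manager.create_checkpoint(&data).expect("create");

    let (header, restored) = manager
        .load_latest()
        .expect("load")
        .expect("checkpoint present");
    assert_eq!(header.sequence, 1);
    assert_eq!(restored.epoch, 42);
}
```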

Benefits

  • Validators can update without losing evaluation progress
  • Challenges can be hot-reloaded without service interruption
  • Automatic recovery from unexpected shutdowns
  • Foundation for adding challenge crates (e.g., term-challenge)

Testing

  • All existing tests pass
  • New checkpoint tests validate persistence and recovery
  • Code compiles with `cargo check --workspace`

Summary by CodeRabbit

Release Notes

  • New Features

    • Added challenge registry system for discovering, registering, and managing multiple challenges with version support
    • Implemented state persistence and recovery system via checkpoints for resilient operation
    • Added health monitoring with configurable checks for system components
    • Introduced challenge lifecycle management with automatic restarts and migration support
  • Documentation

    • Added comprehensive challenge integration guide and platform documentation
  • Infrastructure

    • Enhanced validator with graceful shutdown and state checkpoint capabilities

Create new platform-challenge-registry crate with:
- Challenge discovery and registration
- Version management (semver-based)
- Lifecycle state machine (registered/starting/running/stopping/stopped)
- Health monitoring with configurable checks
- State persistence and hot-reload support
- Migration planning for version upgrades

Modules:
- registry: Main registry with CRUD operations
- lifecycle: State machine for challenge states
- health: Health monitoring and status tracking
- state: State snapshots for hot-reload
- discovery: Challenge discovery from various sources
- migration: Version migration planning
- version: Semantic versioning support
- error: Registry-specific error types
- Add ShutdownHandler struct for checkpoint management
- Create periodic checkpoints every 5 minutes
- Save final checkpoint on graceful shutdown (Ctrl+C)
- Persist evaluation state for hot-reload recovery

This enables validators to update without losing evaluation progress.
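A minimal sketch of what that runtime loop can look like with tokio, assuming a `ShutdownHandler::create_checkpoint` method along the lines described above; the names and error type are illustrative, not the exact code in `bins/validator-node/src/main.rs`:

```rust
use std::time::Duration;

// Illustrative only: `ShutdownHandler` and `create_checkpoint` stand in for
// the types added in this PR; the real API may differ.
async fn run_with_checkpoints(handler: ShutdownHandler) -> anyhow::Result<()> {
    let mut interval = tokio::time::interval(Duration::from_secs(5 * 60));
    loop {
        tokio::select! {
            _ = interval.tick() => {
                // Periodic checkpoint so a crash loses at most ~5 minutes of progress.
                handler.create_checkpoint().await?;
            }
            _ = tokio::signal::ctrl_c() => {
                // Graceful shutdown: persist one final checkpoint, then exit.
                handler.create_checkpoint().await?;
                break;
            }
        }
    }
    Ok(())
}
```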
@coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Introduces a comprehensive challenge registry system with state persistence, checkpoint/restoration capabilities, and health monitoring infrastructure. Adds a new crates/challenge-registry crate with discovery, lifecycle, migration, versioning, and state management modules. Implements checkpoint-based state persistence in core for graceful shutdown and recovery, plus a health check system for the RPC server.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Workspace Configuration**<br>`Cargo.toml` | Added `crates/challenge-registry` workspace member and `[workspace.metadata.challenge-features]` with dynamic-loading flag. |
| **Checkpoint & Restoration System**<br>`crates/core/src/checkpoint.rs`, `crates/core/src/restoration.rs`, `crates/core/src/error.rs`, `crates/core/src/lib.rs` | Implemented checkpoint creation, loading, and atomic persistence with integrity verification; added restoration manager with filtering, validation, and recovery flows; introduced Validation error variant and module exports. |
| **Challenge Registry Core**<br>`crates/challenge-registry/src/error.rs`, `crates/challenge-registry/src/registry.rs`, `crates/challenge-registry/src/lib.rs` | Created centralized challenge registry with storage, lifecycle integration, and health monitoring; defined error types with conversions; established public module structure and re-exports. |
| **Challenge Discovery & Version Management**<br>`crates/challenge-registry/src/discovery.rs`, `crates/challenge-registry/src/version.rs` | Implemented multi-source challenge discovery with configurable scanning; added version parsing, comparison, and constraint resolution (Exact, AtLeast, Range, Compatible, Any). |
| **Challenge Lifecycle & State**<br>`crates/challenge-registry/src/lifecycle.rs`, `crates/challenge-registry/src/state.rs`, `crates/challenge-registry/src/migration.rs` | Added lifecycle state machine with valid transitions; implemented thread-safe state store with snapshot-based recovery; introduced migration framework with plan generation based on version deltas and rollback support. |
| **Health Monitoring**<br>`crates/challenge-registry/src/health.rs`, `crates/rpc-server/src/health.rs`, `crates/rpc-server/src/lib.rs` | Created health status tracking with configurable checks and response time averaging; added RPC server health endpoint with component-level status, readiness state, and uptime tracking. |
| **Validator Node Integration**<br>`bins/validator-node/src/main.rs` | Integrated ShutdownHandler to create final checkpoints on graceful shutdown; added periodic checkpoint cycles during runtime with configurable intervals. |
| **Challenge Registry Manifest**<br>`crates/challenge-registry/Cargo.toml`, `crates/core/src/checkpoint.rs` dev-dependencies | Defined challenge registry dependencies (async-trait, parking_lot, serde, tracing, semver, uuid); added tempfile for checkpoint tests. |
| **Documentation & Tests**<br>`challenges/README.md`, `challenges/.gitkeep`, `docs/challenge-integration.md`, `tests/Cargo.toml`, `tests/checkpoint_tests.rs` | Added challenge crate guidelines and integration guide; included directory placeholder; created comprehensive checkpoint round-trip and restoration validation tests. |
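The checkpoint cohort above mentions "atomic persistence with integrity verification". Stripped of the crate-specific header format, the core pattern is: serialize, hash, write to a temporary file, sync, then rename. A generic sketch of that pattern (the `blake3` hash here is just an example choice, not necessarily what `checkpoint.rs` uses):

```rust
use std::{fs, io::Write, path::Path};

/// Generic write-temp-then-rename sketch; the real checkpoint file adds magic
/// bytes, a length-prefixed bincode header, and a versioned layout on top.
fn write_atomically(dir: &Path, name: &str, payload: &[u8]) -> std::io::Result<()> {
    let tmp = dir.join(format!("{name}.tmp"));
    let final_path = dir.join(name);

    // Store a content hash alongside the payload so a reader can detect corruption.
    let digest = blake3::hash(payload);

    let mut file = fs::File::create(&tmp)?;
    file.write_all(digest.as_bytes())?;
    file.write_all(payload)?;
    file.sync_all()?; // flush to disk before the rename makes the file visible

    // Rename on the same filesystem is atomic, so readers never see a partial checkpoint.
    fs::rename(&tmp, &final_path)
}
```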

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Validator as Validator Node
    participant Registry as Challenge Registry
    participant Checkpoint as Checkpoint Manager
    participant Disk as Disk Storage
    participant Restorable as Restoration Manager

    Validator->>Registry: Initialize challenge registry
    Registry-->>Validator: Ready

    loop During Runtime
        Validator->>Registry: Update challenge state<br/>(evaluations, health)
        Registry-->>Validator: Acknowledged
        Validator->>Checkpoint: Periodic create_checkpoint()
        Checkpoint->>Checkpoint: Serialize state + compute hash
        Checkpoint->>Disk: Atomic write (temp → final)
        Disk-->>Checkpoint: Checkpoint persisted
    end

    Note over Validator: Ctrl+C Signal

    Validator->>Checkpoint: Final create_checkpoint()
    Checkpoint->>Disk: Write final state
    Validator-->>Validator: Shutdown

    Note over Restorable: Later: Recovery

    Restorable->>Checkpoint: load_latest()
    Checkpoint->>Disk: Read checkpoint + verify hash
    Disk-->>Checkpoint: CheckpointData
    Checkpoint-->>Restorable: (CheckpointHeader, CheckpointData)
    Restorable->>Restorable: Validate & filter state
    Restorable-->>Validator: Restored state ready
```
```mermaid
sequenceDiagram
    participant Client as Client
    participant Registry as Challenge Registry
    participant Discovery as Discovery Engine
    participant Health as Health Monitor
    participant Lifecycle as Lifecycle Manager

    Client->>Discovery: discover_from_local(path)
    Discovery->>Discovery: Scan filesystem<br/>Check challenge.toml
    Discovery-->>Client: Vec<DiscoveredChallenge>

    Client->>Registry: register(ChallengeEntry)
    Registry->>Lifecycle: Track new lifecycle
    Registry->>Health: Initialize health state
    Registry-->>Client: ChallengeId

    loop Health Checks
        Health->>Health: Check endpoint<br/>Record metrics
        Health-->>Registry: Update HealthStatus
    end

    Client->>Registry: update_version(id, new_version)
    Registry->>Registry: Validate compatibility
    Registry-->>Client: Old version
```
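Translated into code, the discovery-and-registration half of this diagram is roughly the following. This is a sketch under assumed names (`DiscoveryEngine`, `ChallengeEntry::from`); only `discover_from_local`, `register`, and `RegistryResult` appear verbatim elsewhere in this review:

```rust
// Sketch of the client-side flow from the diagram; signatures are assumptions.
fn bring_up_local_challenges(
    registry: &ChallengeRegistry,
    discovery: &DiscoveryEngine,
) -> RegistryResult<()> {
    let root = std::path::PathBuf::from("challenges/");
    for discovered in discovery.discover_from_local(&root)? {
        // Hypothetical conversion; the real entry type may be built differently.
        let id = registry.register(ChallengeEntry::from(discovered))?;
        tracing::info!(challenge_id = %id, "challenge registered");
    }
    Ok(())
}
```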

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 The registry hops with vibrant grace,
Challenges discovered, tracked in place,
Checkpoints whisper state across the night,
Health beats steady, keeping all things right,
Version migrations flow like gentle streams,
A fresh foundation built on solid dreams!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'feat: add challenge integration infrastructure with checkpoint persistence' directly and specifically summarizes the main changes: introducing challenge integration infrastructure with checkpoint persistence as a core feature. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.59% which is sufficient. The required threshold is 80.00%. |


@coderabbitai bot left a comment

Actionable comments posted: 9

🤖 Fix all issues with AI agents
In `@challenges/README.md`:
- Around line 7-12: The fenced code block in README.md that displays the
directory tree (the block starting with the triple backticks followed by the
tree: "challenges/ ├── README.md ...") is missing a language tag; update that
fenced block to include the language tag "text" (i.e., change the opening ``` to
```text) so Markdownlint MD040 is satisfied and the tree is treated as plain
text.

In `@crates/challenge-registry/src/health.rs`:
- Around line 91-100: The method record_failure currently uses a hardcoded
threshold (>= 3) to flip status to Unhealthy; update it to use the configured
threshold instead by reading HealthConfig.failure_threshold (or accept a
threshold parameter) so changes to HealthConfig take effect — either change
record_failure on the Health struct to accept a threshold argument and use that
value, or have HealthMonitor::record_failure call into the existing
record_failure and pass self.config.failure_threshold; make sure to replace the
literal 3 with the referenced failure_threshold and keep updating
consecutive_failures, last_check_at, and HealthStatus logic unchanged.

In `@crates/challenge-registry/src/migration.rs`:
- Around line 345-357: The current finalize_migration removes the plan from
active_plans before checking its status, which can drop non-terminal plans and
prevents failed plans from being archived; update finalize_migration to first
look up (read/clone or inspect) the MigrationPlan in self.active_plans without
removing it, verify that its status is a terminal state (allow Completed,
RolledBack, and Failed as terminal), and only then call remove(challenge_id) to
take it out and return the plan; reference the finalize_migration method,
active_plans map, MigrationPlan status checks (e.g., is_complete or status enum
variants) and ensure compatibility with fail_migration which sets Failed so
failed plans become removable/archivable.

In `@crates/challenge-registry/src/registry.rs`:
- Around line 177-202: The update_state function allows arbitrary lifecycle
changes but must validate transitions using
ChallengeLifecycle::is_valid_transition; modify update_state (in registry.rs) to
check ChallengeLifecycle::is_valid_transition(&old_state, &new_state) before
applying changes and if invalid return a new
RegistryError::InvalidStateTransition containing old_state and new_state (add
that variant to RegistryError), only update
registered.entry.lifecycle_state/updated_at and emit
LifecycleEvent::StateChanged when the transition is valid; ensure debug logging
remains and that the function returns Err(...) for invalid transitions instead
of proceeding.

In `@crates/challenge-registry/src/version.rs`:
- Around line 69-79: The current Ord impl for ChallengeVersion (fn cmp) ignores
the prerelease field; update cmp in the impl for ChallengeVersion to enforce
semver precedence: after comparing major, minor, patch, if equal compare
prerelease such that a missing prerelease (release) has higher precedence than
any prerelease (i.e., None > Some), and when both have prerelease strings
compare their dot-separated identifiers in order using semver rules (numeric
identifiers compare numerically and have lower precedence than non-numeric
identifiers; non-numeric compare lexicographically; if all equal the shorter
identifier list has lower precedence). Use the existing fields
ChallengeVersion.prerelease and the cmp function to implement this logic in the
Ord::cmp for ChallengeVersion.

In `@crates/core/src/checkpoint.rs`:
- Around line 231-235: Do not increment self.current_sequence before persisting;
instead compute let next_sequence = self.current_sequence + 1, use
checkpoint_filename(next_sequence) to write and rename the temp file, and only
after the atomic rename succeeds assign self.current_sequence = next_sequence so
failures don't advance the in-memory sequence. Apply the same pattern to the
other similar routine (the block at 279-283) so current_sequence is updated only
on successful persistence; keep all filename construction and rename logic using
checkpoint_filename(next_sequence) and leave load_latest behavior unchanged.
- Around line 321-357: The code reads untrusted sizes into header_len and
header.data_size and allocates vectors directly, which can OOM on corrupt files;
before allocating check bounds: define reasonable maxs (e.g. MAX_HEADER_SIZE and
MAX_DATA_SIZE), validate header_len is <= MAX_HEADER_SIZE and less than the
remaining reader/file length, then allocate header_bytes; after deserializing
CheckpointHeader, validate header.data_size <= MAX_DATA_SIZE and that
header.data_size does not exceed remaining reader/file length before allocating
data_bytes; on violations return a MiniChainError::Storage with a clear message.
Use the existing symbols header_len, CheckpointHeader, header.data_size, reader,
and CHECKPOINT_VERSION to locate insertion points.

In `@crates/rpc-server/src/health.rs`:
- Around line 196-221: get_overall_status currently ignores the challenges field
on ComponentStatus even though set_component_status and ComponentStatus track
it; update get_overall_status to include components.challenges in the unhealthy
and degraded checks (treat challenges the same way as p2p/storage/consensus),
and ensure the final special-case check (the bittensor check) still runs as
intended — modify the get_overall_status function to reference
components.challenges alongside components.p2p, components.storage, and
components.consensus when computing overall HealthStatus.

In `@docs/challenge-integration.md`:
- Around line 42-51: The Markdown code fence showing the project tree lacks a
language tag; update the block in docs/challenge-integration.md by changing the
opening ``` to ```text so the project-structure block (the my-challenge/ tree
with Cargo.toml, src/lib.rs, evaluation.rs, scoring.rs, Dockerfile, README.md)
is fenced as ```text which will silence MD040 warnings and improve
rendering/readability.
🧹 Nitpick comments (13)
challenges/.gitkeep (1)

1-1: Consider making the .gitkeep file truly empty or add a comment.

By convention, .gitkeep files are typically either completely empty (0 bytes) or contain a brief comment explaining their purpose. The current single newline works but is slightly unconventional.

📝 Alternative approaches

Option 1: Remove the empty line entirely to make it a true 0-byte file.

Option 2: Add a descriptive comment:

```diff
-
+# Placeholder directory for challenge crate modules
```
crates/challenge-registry/src/error.rs (1)

45-49: Consider preserving more context for I/O errors.

Mapping all std::io::Error to Internal loses context about whether the error originated from state persistence, file operations, or network I/O. This could make debugging harder.

💡 Optional improvement

Consider mapping I/O errors contextually where they occur, or including the error kind:

```diff
 impl From<std::io::Error> for RegistryError {
     fn from(err: std::io::Error) -> Self {
-        RegistryError::Internal(err.to_string())
+        RegistryError::Internal(format!("IO error ({}): {}", err.kind(), err))
     }
 }
```
tests/checkpoint_tests.rs (1)

139-146: Clarify the distinction between data sequence and checkpoint sequence.

The test expects latest.sequence (from CheckpointData) to be 9, while header.sequence is 10. This works because the loop index i goes 0-9, but it might confuse readers. Consider adding a comment explaining the distinction.

📝 Optional clarification
```diff
     // Latest should be sequence 10
+    // Note: header.sequence is the checkpoint file sequence (1-indexed)
+    // data.sequence is the application sequence from the loop (0-indexed)
     let (header, latest) = manager
         .load_latest()
         .expect("Failed to load")
         .expect("No checkpoint");
     assert_eq!(latest.sequence, 9);
     assert_eq!(header.sequence, 10);
```
crates/challenge-registry/src/lifecycle.rs (1)

56-78: Auto-restart configuration defined but restart tracking not implemented.

The auto_restart and max_restart_attempts fields are configured but there's no state or method to track actual restart attempts per challenge. Consider whether restart count tracking should be managed here or delegated to the registry.

crates/challenge-registry/src/health.rs (1)

111-114: recovery_threshold is defined but never used.

The HealthConfig.recovery_threshold field (line 113) is documented as "Number of successes to recover from unhealthy" but record_success unconditionally sets status to Healthy. Consider implementing recovery logic or removing the unused field.

crates/challenge-registry/src/discovery.rs (2)

158-194: Code duplication between challenge.toml and Cargo.toml handling.

The logic for extracting the challenge name and creating DiscoveredChallenge is nearly identical between the two branches. Consider extracting a helper function.

Proposed refactor to reduce duplication
```diff
+    fn create_discovered_from_path(&self, path: &PathBuf) -> DiscoveredChallenge {
+        let name = path
+            .file_name()
+            .and_then(|n| n.to_str())
+            .unwrap_or("unknown")
+            .to_string();
+
+        DiscoveredChallenge {
+            name,
+            version: ChallengeVersion::default(),
+            docker_image: None,
+            local_path: Some(path.clone()),
+            health_endpoint: None,
+            evaluation_endpoint: None,
+            metadata: ChallengeMetadata::default(),
+            source: DiscoverySource::LocalFilesystem(path.clone()),
+        }
+    }
+
     pub fn discover_from_local(&self, path: &PathBuf) -> RegistryResult<Vec<DiscoveredChallenge>> {
         // ... validation ...
         
         if path.is_dir() {
             let challenge_toml = path.join("challenge.toml");
             let cargo_toml = path.join("Cargo.toml");
 
             if challenge_toml.exists() || cargo_toml.exists() {
-                // ... duplicated code ...
+                challenges.push(self.create_discovered_from_path(path));
             }
         }
         Ok(challenges)
     }
```

121-140: discover_all only implements local discovery; Docker and P2P are placeholders.

The method iterates only over local_paths while docker_registries and enable_p2p config fields are unused. This appears intentional for the initial infrastructure, but consider adding TODO comments or returning a warning when these sources are configured but not yet implemented.

crates/challenge-registry/src/state.rs (1)

206-209: Inefficient snapshot trimming with remove(0) in loop.

Using remove(0) on a Vec in a loop is O(n²) because each removal shifts all subsequent elements. With default max_snapshots=5, this is negligible, but consider using drain for cleaner O(n) removal if this could be configured higher.

Proposed fix using drain
```diff
         // Trim old snapshots
-        while snapshots.len() > self.max_snapshots {
-            snapshots.remove(0);
-        }
+        if snapshots.len() > self.max_snapshots {
+            let excess = snapshots.len() - self.max_snapshots;
+            snapshots.drain(0..excess);
+        }
```
crates/core/src/restoration.rs (4)

256-275: Stale evaluation filtering is a no-op placeholder.

The skip_stale_evaluations option is respected, but the actual filter logic always returns true, keeping all evaluations. The comment acknowledges this: "For now, keep all pending (they don't have epoch info)".

Consider either:

  1. Implementing the logic if PendingEvaluationState has epoch information available
  2. Removing the skip_stale_evaluations option until it's implemented
  3. Adding a TODO comment that's more visible

286-292: Magic number for epoch validation should be a named constant.

The 1_000_000 epoch limit is arbitrary. Consider defining it as a constant with documentation explaining the rationale.

Proposed fix
```diff
+/// Maximum reasonable epoch value for validation.
+/// This is a sanity check to detect corrupted checkpoints.
+const MAX_REASONABLE_EPOCH: u64 = 1_000_000;
+
     fn validate_data(&self, data: &CheckpointData) -> Result<()> {
         // Validate epoch is reasonable
-        if data.epoch > 1_000_000 {
+        if data.epoch > MAX_REASONABLE_EPOCH {
             return Err(MiniChainError::Validation(
                 "Checkpoint epoch seems unreasonably high".into(),
             ));
         }
```

336-351: get_checkpoint_info loads full checkpoint data for metadata extraction.

This method loads the entire checkpoint (including all evaluations) just to extract summary information. If CheckpointManager supported header-only loading, this could be more efficient for listing many checkpoints.


372-379: Add documentation explaining expected implementors of the Restorable trait.

The trait is exported in the public API but has no implementations in the codebase and lacks documentation about intended use. Consider adding a doc comment clarifying whether external crates are expected to implement this trait or if it's reserved for future use within this crate.

crates/challenge-registry/src/registry.rs (1)

274-279: Event listeners executed synchronously under lock.

emit_event calls all listeners while holding the read lock on event_listeners. Long-running or blocking listeners could delay other registry operations. Consider:

  1. Cloning listeners before iterating (release lock sooner)
  2. Spawning listener calls asynchronously
  3. Documenting that listeners should be non-blocking
Option: Clone listeners to release lock faster
```diff
     fn emit_event(&self, event: LifecycleEvent) {
-        for listener in self.event_listeners.read().iter() {
+        let listeners: Vec<_> = self.event_listeners.read().iter().collect();
+        // Lock released here
+        for listener in listeners {
             listener(event.clone());
         }
     }
```

Note: This requires listeners to be Clone, which Box<dyn Fn> is not. Alternative: store Arc<dyn Fn> instead.

Comment on lines +7 to +12
```
challenges/
├── README.md # This file
├── example-challenge/ # Example challenge template (future)
└── [your-challenge]/ # Your custom challenge crate
```

⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced block.
Markdownlint MD040 flags fenced blocks without a language; use text here.

Proposed fix
````diff
-```
+```text
 challenges/
 ├── README.md           # This file
 ├── example-challenge/  # Example challenge template (future)
 └── [your-challenge]/   # Your custom challenge crate
````

Comment on lines +91 to +100
```rust
    pub fn record_failure(&mut self, reason: String) {
        self.consecutive_failures += 1;
        self.last_check_at = chrono::Utc::now().timestamp_millis();

        if self.consecutive_failures >= 3 {
            self.status = HealthStatus::Unhealthy(reason);
        } else {
            self.status = HealthStatus::Degraded(reason);
        }
    }
```

⚠️ Potential issue | 🟡 Minor

Hardcoded failure threshold ignores HealthConfig.failure_threshold.

The record_failure method hardcodes >= 3 for marking unhealthy, but HealthConfig has a configurable failure_threshold field (default 3). This creates an inconsistency where changing the config has no effect.

Proposed fix to use configurable threshold

Either pass the threshold to record_failure:

```diff
-    pub fn record_failure(&mut self, reason: String) {
+    pub fn record_failure(&mut self, reason: String, failure_threshold: u32) {
         self.consecutive_failures += 1;
         self.last_check_at = chrono::Utc::now().timestamp_millis();

-        if self.consecutive_failures >= 3 {
+        if self.consecutive_failures >= failure_threshold {
             self.status = HealthStatus::Unhealthy(reason);
         } else {
             self.status = HealthStatus::Degraded(reason);
         }
     }
```

Or have HealthMonitor::record_failure apply the threshold from its config.


Comment on lines +345 to +357
```rust
    /// Finalize and archive a completed migration
    pub fn finalize_migration(&self, challenge_id: &ChallengeId) -> RegistryResult<MigrationPlan> {
        let plan = self
            .active_plans
            .write()
            .remove(challenge_id)
            .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;

        if !plan.is_complete() {
            return Err(RegistryError::MigrationFailed(
                "Migration not complete".to_string(),
            ));
        }
```

⚠️ Potential issue | 🟠 Major

Finalize only after verifying a terminal status (and keep failed plans archivable).

Line 345 removes the plan before checking completion, so a premature call drops the active migration. Also, fail_migration sets Failed but finalize_migration rejects anything outside Completed/RolledBack, leaving failed plans stuck in active_plans. Validate status while the plan is still in the map and allow terminal states before removal.

🔧 Suggested fix
```diff
-        let plan = self
-            .active_plans
-            .write()
-            .remove(challenge_id)
-            .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;
-
-        if !plan.is_complete() {
-            return Err(RegistryError::MigrationFailed(
-                "Migration not complete".to_string(),
-            ));
-        }
+        let mut plans = self.active_plans.write();
+        let is_terminal = plans
+            .get(challenge_id)
+            .map(|p| {
+                matches!(
+                    p.status,
+                    MigrationStatus::Completed
+                        | MigrationStatus::RolledBack
+                        | MigrationStatus::Failed(_)
+                )
+            })
+            .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;
+
+        if !is_terminal {
+            return Err(RegistryError::MigrationFailed(
+                "Migration not complete".to_string(),
+            ));
+        }
+
+        let plan = plans.remove(challenge_id).expect("checked above");
```

Comment on lines +177 to +202
```rust
    /// Update challenge lifecycle state
    pub fn update_state(&self, id: &ChallengeId, new_state: LifecycleState) -> RegistryResult<()> {
        let mut challenges = self.challenges.write();
        let registered = challenges
            .get_mut(id)
            .ok_or_else(|| RegistryError::ChallengeNotFound(id.to_string()))?;

        let old_state = registered.entry.lifecycle_state.clone();
        registered.entry.lifecycle_state = new_state.clone();
        registered.entry.updated_at = chrono::Utc::now().timestamp_millis();

        debug!(
            challenge_id = %id,
            old_state = ?old_state,
            new_state = ?new_state,
            "Challenge state updated"
        );

        self.emit_event(LifecycleEvent::StateChanged {
            challenge_id: *id,
            old_state,
            new_state,
        });

        Ok(())
    }
```

⚠️ Potential issue | 🟠 Major

update_state doesn't validate transitions using ChallengeLifecycle.is_valid_transition.

The ChallengeLifecycle struct has is_valid_transition logic (defined in lifecycle.rs), but update_state doesn't use it. Invalid state transitions are allowed, which could lead to inconsistent states.

Proposed fix to validate transitions
```diff
     pub fn update_state(&self, id: &ChallengeId, new_state: LifecycleState) -> RegistryResult<()> {
         let mut challenges = self.challenges.write();
         let registered = challenges
             .get_mut(id)
             .ok_or_else(|| RegistryError::ChallengeNotFound(id.to_string()))?;
 
         let old_state = registered.entry.lifecycle_state.clone();
+        
+        if !self.lifecycle.is_valid_transition(&old_state, &new_state) {
+            return Err(RegistryError::InvalidStateTransition(
+                format!("{:?} -> {:?}", old_state, new_state)
+            ));
+        }
+        
         registered.entry.lifecycle_state = new_state.clone();
         registered.entry.updated_at = chrono::Utc::now().timestamp_millis();

Note: This requires adding an InvalidStateTransition variant to RegistryError.


Comment on lines +69 to +79
```rust
impl Ord for ChallengeVersion {
    fn cmp(&self, other: &Self) -> Ordering {
        match self.major.cmp(&other.major) {
            Ordering::Equal => match self.minor.cmp(&other.minor) {
                Ordering::Equal => self.patch.cmp(&other.patch),
                ord => ord,
            },
            ord => ord,
        }
    }
}
```

⚠️ Potential issue | 🟡 Minor

Prerelease versions are not considered in ordering.

The Ord implementation ignores the prerelease field, meaning 1.0.0-alpha and 1.0.0 are considered equal in comparisons. Per semver specification, prerelease versions should have lower precedence than the release version (i.e., 1.0.0-alpha < 1.0.0).

This could cause issues when comparing versions during migration planning or constraint satisfaction.

💡 Suggested fix for semver-compliant ordering
```diff
 impl Ord for ChallengeVersion {
     fn cmp(&self, other: &Self) -> Ordering {
         match self.major.cmp(&other.major) {
             Ordering::Equal => match self.minor.cmp(&other.minor) {
-                Ordering::Equal => self.patch.cmp(&other.patch),
+                Ordering::Equal => match self.patch.cmp(&other.patch) {
+                    Ordering::Equal => {
+                        // Prerelease versions have lower precedence than release
+                        match (&self.prerelease, &other.prerelease) {
+                            (None, None) => Ordering::Equal,
+                            (Some(_), None) => Ordering::Less,    // prerelease < release
+                            (None, Some(_)) => Ordering::Greater, // release > prerelease
+                            (Some(a), Some(b)) => a.cmp(b),       // compare prerelease strings
+                        }
+                    }
+                    ord => ord,
+                },
                 ord => ord,
             },
             ord => ord,
         }
     }
 }
```

Comment on lines +231 to +235
```rust
    pub fn create_checkpoint(&mut self, data: &CheckpointData) -> Result<PathBuf> {
        self.current_sequence += 1;
        let sequence = self.current_sequence;
        let filename = self.checkpoint_filename(sequence);
```

⚠️ Potential issue | 🟠 Major

Defer current_sequence updates until after the checkpoint is safely persisted.

Line 232 increments before the write/rename. If I/O fails, current_sequence points to a non-existent file, and load_latest returns None even though older checkpoints exist. Compute the next sequence locally and update only after the rename succeeds.

🔧 Suggested fix
```diff
-        self.current_sequence += 1;
-        let sequence = self.current_sequence;
+        let sequence = self.current_sequence + 1;
         let filename = self.checkpoint_filename(sequence);
@@
         fs::rename(&temp_filename, &filename).map_err(|e| {
             MiniChainError::Storage(format!("Failed to finalize checkpoint: {}", e))
         })?;
+
+        self.current_sequence = sequence;
```

Also applies to: 279-283


Comment on lines +321 to +357
```rust
        // Read header length
        let mut header_len_bytes = [0u8; 4];
        reader
            .read_exact(&mut header_len_bytes)
            .map_err(|e| MiniChainError::Storage(format!("Failed to read header length: {}", e)))?;
        let header_len = u32::from_le_bytes(header_len_bytes) as usize;

        // Read header
        let mut header_bytes = vec![0u8; header_len];
        reader
            .read_exact(&mut header_bytes)
            .map_err(|e| MiniChainError::Storage(format!("Failed to read header: {}", e)))?;

        let header: CheckpointHeader = bincode::deserialize(&header_bytes).map_err(|e| {
            MiniChainError::Serialization(format!("Failed to deserialize header: {}", e))
        })?;

        // Verify magic
        if !header.verify_magic() {
            return Err(MiniChainError::Storage(
                "Invalid checkpoint magic bytes".into(),
            ));
        }

        // Verify version compatibility
        if header.version > CHECKPOINT_VERSION {
            return Err(MiniChainError::Storage(format!(
                "Checkpoint version {} is newer than supported version {}",
                header.version, CHECKPOINT_VERSION
            )));
        }

        // Read data
        let mut data_bytes = vec![0u8; header.data_size as usize];
        reader
            .read_exact(&mut data_bytes)
            .map_err(|e| MiniChainError::Storage(format!("Failed to read data: {}", e)))?;
```

⚠️ Potential issue | 🟠 Major

Add size bounds before allocating header/data buffers.

Line 326 uses a header length sourced from disk, and Line 354 uses data_size from the header. A corrupt file can force large allocations and OOM. Add reasonable caps (or validate against file length) before allocating.

🔧 Suggested fix
```diff
         // Read header length
         let mut header_len_bytes = [0u8; 4];
         reader
             .read_exact(&mut header_len_bytes)
             .map_err(|e| MiniChainError::Storage(format!("Failed to read header length: {}", e)))?;
         let header_len = u32::from_le_bytes(header_len_bytes) as usize;
+        const MAX_HEADER_SIZE: usize = 16 * 1024;
+        const MAX_DATA_SIZE: u64 = 512 * 1024 * 1024;
+        if header_len == 0 || header_len > MAX_HEADER_SIZE {
+            return Err(MiniChainError::Storage("Checkpoint header too large".into()));
+        }
@@
         let header: CheckpointHeader = bincode::deserialize(&header_bytes).map_err(|e| {
             MiniChainError::Serialization(format!("Failed to deserialize header: {}", e))
         })?;
+
+        if header.data_size > MAX_DATA_SIZE {
+            return Err(MiniChainError::Storage("Checkpoint data too large".into()));
+        }
```

Comment on lines +196 to +221
```rust
    fn get_overall_status(&self) -> HealthStatus {
        let components = self.components.read();

        // If any component is unhealthy, overall is unhealthy
        if components.p2p == HealthStatus::Unhealthy
            || components.storage == HealthStatus::Unhealthy
            || components.consensus == HealthStatus::Unhealthy
        {
            return HealthStatus::Unhealthy;
        }

        // If any critical component is degraded, overall is degraded
        if components.p2p == HealthStatus::Degraded
            || components.storage == HealthStatus::Degraded
            || components.consensus == HealthStatus::Degraded
        {
            return HealthStatus::Degraded;
        }

        // If Bittensor is down but others are fine, degraded
        if components.bittensor == HealthStatus::Unhealthy {
            return HealthStatus::Degraded;
        }

        HealthStatus::Healthy
    }
```

⚠️ Potential issue | 🟡 Minor

The challenges component status is tracked but not used in health calculation.

The set_component_status method accepts "challenges" (line 190), and ComponentStatus includes a challenges field, but get_overall_status() doesn't consider it when determining overall health. If challenge container health should affect the overall status, it should be included in this logic.

💡 Suggested fix if challenges should affect health
```diff
     fn get_overall_status(&self) -> HealthStatus {
         let components = self.components.read();

         // If any component is unhealthy, overall is unhealthy
         if components.p2p == HealthStatus::Unhealthy
             || components.storage == HealthStatus::Unhealthy
             || components.consensus == HealthStatus::Unhealthy
         {
             return HealthStatus::Unhealthy;
         }

         // If any critical component is degraded, overall is degraded
         if components.p2p == HealthStatus::Degraded
             || components.storage == HealthStatus::Degraded
             || components.consensus == HealthStatus::Degraded
         {
             return HealthStatus::Degraded;
         }

-        // If Bittensor is down but others are fine, degraded
-        if components.bittensor == HealthStatus::Unhealthy {
+        // If Bittensor or challenges are down but core is fine, degraded
+        if components.bittensor == HealthStatus::Unhealthy
+            || components.challenges == HealthStatus::Unhealthy
+        {
             return HealthStatus::Degraded;
         }

         HealthStatus::Healthy
     }
```

Comment on lines +42 to +51
```
my-challenge/
├── Cargo.toml
├── src/
│ ├── lib.rs # Challenge implementation
│ ├── evaluation.rs # Evaluation logic
│ └── scoring.rs # Scoring algorithm
├── Dockerfile # Container build
└── README.md # Documentation
```

⚠️ Potential issue | 🟡 Minor

Add a language tag to the project-structure block.
This avoids MD040 warnings and improves readability.

Proposed fix
````diff
-```
+```text
 my-challenge/
 ├── Cargo.toml
 ├── src/
 │   ├── lib.rs           # Challenge implementation
 │   ├── evaluation.rs    # Evaluation logic
 │   └── scoring.rs       # Scoring algorithm
 ├── Dockerfile           # Container build
 └── README.md           # Documentation
````

@echobt merged commit b50ce52 into main on Feb 3, 2026 (7 checks passed)
@echobt deleted the feat/challenge-integration-1770118346 branch on February 3, 2026 at 11:43