feat: add challenge integration infrastructure with checkpoint persistence #2
Conversation
Create new platform-challenge-registry crate with:

- Challenge discovery and registration
- Version management (semver-based)
- Lifecycle state machine (registered/starting/running/stopping/stopped; sketched below)
- Health monitoring with configurable checks
- State persistence and hot-reload support
- Migration planning for version upgrades

Modules:

- registry: Main registry with CRUD operations
- lifecycle: State machine for challenge states
- health: Health monitoring and status tracking
- state: State snapshots for hot-reload
- discovery: Challenge discovery from various sources
- migration: Version migration planning
- version: Semantic versioning support
- error: Registry-specific error types
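A minimal sketch of the lifecycle state machine named above. The enum variants come from the commit message; `is_valid_transition` and the exact transition set are assumptions for illustration, not the crate's actual definitions:

```rust
/// Hypothetical sketch of the lifecycle states named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LifecycleState {
    Registered,
    Starting,
    Running,
    Stopping,
    Stopped,
}

impl LifecycleState {
    /// Allow only forward transitions through the lifecycle,
    /// plus restart from Stopped back to Starting (assumed policy).
    fn is_valid_transition(self, next: LifecycleState) -> bool {
        use LifecycleState::*;
        matches!(
            (self, next),
            (Registered, Starting)
                | (Starting, Running)
                | (Running, Stopping)
                | (Stopping, Stopped)
                | (Stopped, Starting) // restart
        )
    }
}

fn main() {
    assert!(LifecycleState::Registered.is_valid_transition(LifecycleState::Starting));
    assert!(!LifecycleState::Registered.is_valid_transition(LifecycleState::Running));
}
```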
- Add ShutdownHandler struct for checkpoint management
- Create periodic checkpoints every 5 minutes
- Save final checkpoint on graceful shutdown (Ctrl+C)
- Persist evaluation state for hot-reload recovery

This enables validators to update without losing evaluation progress.
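A rough sketch of the checkpoint loop this describes, assuming a Tokio runtime; `save_checkpoint` is a hypothetical stand-in for the real `ShutdownHandler` API:

```rust
use std::time::Duration;

// Hypothetical stand-in for the real checkpoint call.
async fn save_checkpoint(label: &str) {
    println!("checkpoint saved: {label}");
}

#[tokio::main]
async fn main() {
    // Periodic checkpoints every 5 minutes, per the commit message.
    let mut ticker = tokio::time::interval(Duration::from_secs(5 * 60));
    ticker.tick().await; // the first tick completes immediately; skip it

    loop {
        tokio::select! {
            _ = ticker.tick() => {
                save_checkpoint("periodic").await;
            }
            _ = tokio::signal::ctrl_c() => {
                // Final checkpoint on graceful shutdown (Ctrl+C).
                save_checkpoint("final").await;
                break;
            }
        }
    }
}
```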
📝 Walkthrough

Introduces a comprehensive challenge registry system with state persistence, checkpoint/restoration capabilities, and health monitoring infrastructure.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Validator as Validator Node
participant Registry as Challenge Registry
participant Checkpoint as Checkpoint Manager
participant Disk as Disk Storage
participant Restorable as Restoration Manager
Validator->>Registry: Initialize challenge registry
Registry-->>Validator: Ready
loop During Runtime
Validator->>Registry: Update challenge state<br/>(evaluations, health)
Registry-->>Validator: Acknowledged
Validator->>Checkpoint: Periodic create_checkpoint()
Checkpoint->>Checkpoint: Serialize state + compute hash
Checkpoint->>Disk: Atomic write (temp → final)
Disk-->>Checkpoint: Checkpoint persisted
end
Note over Validator: Ctrl+C Signal
Validator->>Checkpoint: Final create_checkpoint()
Checkpoint->>Disk: Write final state
Validator-->>Validator: Shutdown
Note over Restorable: Later: Recovery
Restorable->>Checkpoint: load_latest()
Checkpoint->>Disk: Read checkpoint + verify hash
Disk-->>Checkpoint: CheckpointData
Checkpoint-->>Restorable: (CheckpointHeader, CheckpointData)
Restorable->>Restorable: Validate & filter state
Restorable-->>Validator: Restored state ready
```
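The atomic write step in the diagram (temp → final) is the key to crash safety. A small std-only sketch of the idea; the actual serialization format and hashing in `platform-core` are not shown here:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Write to a temp file, flush to disk, then rename into place.
// On POSIX filesystems the rename is atomic, so a reader never
// observes a half-written checkpoint.
fn atomic_write(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // ensure data reaches disk before the rename
    fs::rename(&tmp, path)
}

fn main() -> std::io::Result<()> {
    atomic_write(Path::new("checkpoint_000001.ckpt"), b"serialized state + hash")
}
```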
```mermaid
sequenceDiagram
participant Client as Client
participant Registry as Challenge Registry
participant Discovery as Discovery Engine
participant Health as Health Monitor
participant Lifecycle as Lifecycle Manager
Client->>Discovery: discover_from_local(path)
Discovery->>Discovery: Scan filesystem<br/>Check challenge.toml
Discovery-->>Client: Vec<DiscoveredChallenge>
Client->>Registry: register(ChallengeEntry)
Registry->>Lifecycle: Track new lifecycle
Registry->>Health: Initialize health state
Registry-->>Client: ChallengeId
loop Health Checks
Health->>Health: Check endpoint<br/>Record metrics
Health-->>Registry: Update HealthStatus
end
Client->>Registry: update_version(id, new_version)
Registry->>Registry: Validate compatibility
Registry-->>Client: Old version
```
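In code, this flow corresponds roughly to the calls below. The method names are taken from the diagram; the types and signatures are illustrative stubs, not the registry's real API:

```rust
// Stubs mirroring the diagram's participants; the real crate differs.
struct Registry;
struct DiscoveredChallenge {
    name: String,
}
type ChallengeId = u64;

impl Registry {
    fn discover_from_local(&self, path: &str) -> Vec<DiscoveredChallenge> {
        // Would scan the filesystem and check challenge.toml here.
        vec![DiscoveredChallenge { name: format!("challenge at {path}") }]
    }
    fn register(&self, c: DiscoveredChallenge) -> ChallengeId {
        // Lifecycle tracking and health state would be initialized here.
        println!("registered {}", c.name);
        1
    }
    fn update_version(&self, id: ChallengeId, new_version: &str) -> String {
        // Compatibility would be validated here; the old version is returned.
        println!("challenge {id} -> {new_version}");
        "1.0.0".into()
    }
}

fn main() {
    let registry = Registry;
    for c in registry.discover_from_local("challenges/example-challenge") {
        let id = registry.register(c);
        let old = registry.update_version(id, "1.1.0");
        println!("old version: {old}");
    }
}
```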
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In `@challenges/README.md`:
- Around line 7-12: The fenced code block in README.md that displays the
directory tree (the block starting with the triple backticks followed by the
tree: "challenges/ ├── README.md ...") is missing a language tag; update that
fenced block to include the language tag "text" (i.e., change the opening ``` to
```text) so Markdownlint MD040 is satisfied and the tree is treated as plain
text.
In `@crates/challenge-registry/src/health.rs`:
- Around line 91-100: The method record_failure currently uses a hardcoded
threshold (>= 3) to flip status to Unhealthy; update it to use the configured
threshold instead by reading HealthConfig.failure_threshold (or accept a
threshold parameter) so changes to HealthConfig take effect — either change
record_failure on the Health struct to accept a threshold argument and use that
value, or have HealthMonitor::record_failure call into the existing
record_failure and pass self.config.failure_threshold; make sure to replace the
literal 3 with the referenced failure_threshold and keep updating
consecutive_failures, last_check_at, and HealthStatus logic unchanged.
In `@crates/challenge-registry/src/migration.rs`:
- Around line 345-357: The current finalize_migration removes the plan from
active_plans before checking its status, which can drop non-terminal plans and
prevents failed plans from being archived; update finalize_migration to first
look up (read/clone or inspect) the MigrationPlan in self.active_plans without
removing it, verify that its status is a terminal state (allow Completed,
RolledBack, and Failed as terminal), and only then call remove(challenge_id) to
take it out and return the plan; reference the finalize_migration method,
active_plans map, MigrationPlan status checks (e.g., is_complete or status enum
variants) and ensure compatibility with fail_migration which sets Failed so
failed plans become removable/archivable.
In `@crates/challenge-registry/src/registry.rs`:
- Around line 177-202: The update_state function allows arbitrary lifecycle
changes but must validate transitions using
ChallengeLifecycle::is_valid_transition; modify update_state (in registry.rs) to
check ChallengeLifecycle::is_valid_transition(&old_state, &new_state) before
applying changes and if invalid return a new
RegistryError::InvalidStateTransition containing old_state and new_state (add
that variant to RegistryError), only update
registered.entry.lifecycle_state/updated_at and emit
LifecycleEvent::StateChanged when the transition is valid; ensure debug logging
remains and that the function returns Err(...) for invalid transitions instead
of proceeding.
In `@crates/challenge-registry/src/version.rs`:
- Around line 69-79: The current Ord impl for ChallengeVersion (fn cmp) ignores
the prerelease field; update cmp in the impl for ChallengeVersion to enforce
semver precedence: after comparing major, minor, patch, if equal compare
prerelease such that a missing prerelease (release) has higher precedence than
any prerelease (i.e., None > Some), and when both have prerelease strings
compare their dot-separated identifiers in order using semver rules (numeric
identifiers compare numerically and have lower precedence than non-numeric
identifiers; non-numeric compare lexicographically; if all equal the shorter
identifier list has lower precedence). Use the existing fields
ChallengeVersion.prerelease and the cmp function to implement this logic in the
Ord::cmp for ChallengeVersion.
In `@crates/core/src/checkpoint.rs`:
- Around line 231-235: Do not increment self.current_sequence before persisting;
instead compute let next_sequence = self.current_sequence + 1, use
checkpoint_filename(next_sequence) to write and rename the temp file, and only
after the atomic rename succeeds assign self.current_sequence = next_sequence so
failures don't advance the in-memory sequence. Apply the same pattern to the
other similar routine (the block at 279-283) so current_sequence is updated only
on successful persistence; keep all filename construction and rename logic using
checkpoint_filename(next_sequence) and leave load_latest behavior unchanged.
- Around line 321-357: The code reads untrusted sizes into header_len and
header.data_size and allocates vectors directly, which can OOM on corrupt files;
before allocating check bounds: define reasonable maxs (e.g. MAX_HEADER_SIZE and
MAX_DATA_SIZE), validate header_len is <= MAX_HEADER_SIZE and less than the
remaining reader/file length, then allocate header_bytes; after deserializing
CheckpointHeader, validate header.data_size <= MAX_DATA_SIZE and that
header.data_size does not exceed remaining reader/file length before allocating
data_bytes; on violations return a MiniChainError::Storage with a clear message.
Use the existing symbols header_len, CheckpointHeader, header.data_size, reader,
and CHECKPOINT_VERSION to locate insertion points.
In `@crates/rpc-server/src/health.rs`:
- Around line 196-221: get_overall_status currently ignores the challenges field
on ComponentStatus even though set_component_status and ComponentStatus track
it; update get_overall_status to include components.challenges in the unhealthy
and degraded checks (treat challenges the same way as p2p/storage/consensus),
and ensure the final special-case check (the bittensor check) still runs as
intended — modify the get_overall_status function to reference
components.challenges alongside components.p2p, components.storage, and
components.consensus when computing overall HealthStatus.
In `@docs/challenge-integration.md`:
- Around line 42-51: The Markdown code fence showing the project tree lacks a
language tag; update the block in docs/challenge-integration.md by changing the
opening ``` to ```text so the project-structure block (the my-challenge/ tree
with Cargo.toml, src/lib.rs, evaluation.rs, scoring.rs, Dockerfile, README.md)
is fenced as ```text which will silence MD040 warnings and improve
rendering/readability.
🧹 Nitpick comments (13)
challenges/.gitkeep (1)
1-1: Consider making the .gitkeep file truly empty or add a comment. By convention, `.gitkeep` files are typically either completely empty (0 bytes) or contain a brief comment explaining their purpose. The current single newline works but is slightly unconventional.

📝 Alternative approaches

Option 1: Remove the empty line entirely to make it a true 0-byte file.

Option 2: Add a descriptive comment:

```diff
-
+# Placeholder directory for challenge crate modules
```

crates/challenge-registry/src/error.rs (1)
45-49: Consider preserving more context for I/O errors. Mapping all `std::io::Error` to `Internal` loses context about whether the error originated from state persistence, file operations, or network I/O. This could make debugging harder.

💡 Optional improvement

Consider mapping I/O errors contextually where they occur, or including the error kind:

```diff
 impl From<std::io::Error> for RegistryError {
     fn from(err: std::io::Error) -> Self {
-        RegistryError::Internal(err.to_string())
+        RegistryError::Internal(format!("IO error ({}): {}", err.kind(), err))
     }
 }
```

tests/checkpoint_tests.rs (1)
139-146: Clarify the distinction between data sequence and checkpoint sequence. The test expects `latest.sequence` (from `CheckpointData`) to be 9, while `header.sequence` is 10. This works because the loop index `i` goes 0-9, but it might confuse readers. Consider adding a comment explaining the distinction.

📝 Optional clarification

```diff
 // Latest should be sequence 10
+// Note: header.sequence is the checkpoint file sequence (1-indexed);
+// data.sequence is the application sequence from the loop (0-indexed)
 let (header, latest) = manager
     .load_latest()
     .expect("Failed to load")
     .expect("No checkpoint");
 assert_eq!(latest.sequence, 9);
 assert_eq!(header.sequence, 10);
```

crates/challenge-registry/src/lifecycle.rs (1)
56-78: Auto-restart configuration defined but restart tracking not implemented. The `auto_restart` and `max_restart_attempts` fields are configured, but there is no state or method to track actual restart attempts per challenge. Consider whether restart count tracking should be managed here or delegated to the registry; a rough shape is sketched below.
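If restart tracking were added, a minimal shape could look like this; the struct and method names are invented for illustration:

```rust
use std::collections::HashMap;

// Illustrative restart bookkeeping; the crate currently has no such state.
struct RestartTracker {
    max_restart_attempts: u32,
    attempts: HashMap<String, u32>, // challenge id -> restart attempts so far
}

impl RestartTracker {
    fn should_restart(&mut self, challenge_id: &str) -> bool {
        let count = self.attempts.entry(challenge_id.to_string()).or_insert(0);
        *count += 1;
        *count <= self.max_restart_attempts
    }
}

fn main() {
    let mut tracker = RestartTracker {
        max_restart_attempts: 2,
        attempts: HashMap::new(),
    };
    assert!(tracker.should_restart("ch-1"));
    assert!(tracker.should_restart("ch-1"));
    assert!(!tracker.should_restart("ch-1")); // third attempt exceeds the cap
}
```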
crates/challenge-registry/src/health.rs (1)

111-114: `recovery_threshold` is defined but never used. The `HealthConfig.recovery_threshold` field (line 113) is documented as "Number of successes to recover from unhealthy", but `record_success` unconditionally sets the status to `Healthy`. Consider implementing recovery logic or removing the unused field; a sketch of such recovery logic follows.
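A sketch of the recovery logic the field implies. The `consecutive_successes` counter is an assumed field added for illustration; the real `Health` struct may track this differently:

```rust
#[derive(Debug, PartialEq)]
enum HealthStatus {
    Healthy,
    Unhealthy(String),
}

struct Health {
    status: HealthStatus,
    consecutive_successes: u32, // assumed field for this sketch
}

impl Health {
    fn record_success(&mut self, recovery_threshold: u32) {
        self.consecutive_successes += 1;
        // Only flip back to Healthy after enough consecutive successes.
        if matches!(self.status, HealthStatus::Unhealthy(_))
            && self.consecutive_successes >= recovery_threshold
        {
            self.status = HealthStatus::Healthy;
        }
    }
}

fn main() {
    let mut h = Health {
        status: HealthStatus::Unhealthy("down".into()),
        consecutive_successes: 0,
    };
    h.record_success(2);
    assert!(matches!(h.status, HealthStatus::Unhealthy(_)));
    h.record_success(2);
    assert_eq!(h.status, HealthStatus::Healthy);
}
```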
crates/challenge-registry/src/discovery.rs (2)

158-194: Code duplication between `challenge.toml` and `Cargo.toml` handling. The logic for extracting the challenge name and creating `DiscoveredChallenge` is nearly identical between the two branches. Consider extracting a helper function.

Proposed refactor to reduce duplication

```diff
+    fn create_discovered_from_path(&self, path: &PathBuf) -> DiscoveredChallenge {
+        let name = path
+            .file_name()
+            .and_then(|n| n.to_str())
+            .unwrap_or("unknown")
+            .to_string();
+
+        DiscoveredChallenge {
+            name,
+            version: ChallengeVersion::default(),
+            docker_image: None,
+            local_path: Some(path.clone()),
+            health_endpoint: None,
+            evaluation_endpoint: None,
+            metadata: ChallengeMetadata::default(),
+            source: DiscoverySource::LocalFilesystem(path.clone()),
+        }
+    }
+
 pub fn discover_from_local(&self, path: &PathBuf) -> RegistryResult<Vec<DiscoveredChallenge>> {
     // ... validation ...
     if path.is_dir() {
         let challenge_toml = path.join("challenge.toml");
         let cargo_toml = path.join("Cargo.toml");
         if challenge_toml.exists() || cargo_toml.exists() {
-            // ... duplicated code ...
+            challenges.push(self.create_discovered_from_path(path));
         }
     }
     Ok(challenges)
 }
```
121-140: `discover_all` only implements local discovery; Docker and P2P are placeholders. The method iterates only over `local_paths`, while the `docker_registries` and `enable_p2p` config fields are unused. This appears intentional for the initial infrastructure, but consider adding TODO comments or returning a warning when these sources are configured but not yet implemented.

crates/challenge-registry/src/state.rs (1)
206-209: Inefficient snapshot trimming with `remove(0)` in a loop. Using `remove(0)` on a `Vec` in a loop is O(n²) because each removal shifts all subsequent elements. With the default `max_snapshots = 5` this is negligible, but consider using `drain` for cleaner O(n) removal if this could be configured higher.

Proposed fix using drain

```diff
 // Trim old snapshots
-while snapshots.len() > self.max_snapshots {
-    snapshots.remove(0);
-}
+if snapshots.len() > self.max_snapshots {
+    let excess = snapshots.len() - self.max_snapshots;
+    snapshots.drain(0..excess);
+}
```

crates/core/src/restoration.rs (4)
256-275: Stale evaluation filtering is a no-op placeholder. The `skip_stale_evaluations` option is respected, but the actual filter logic always returns `true`, keeping all evaluations. The comment acknowledges this: "For now, keep all pending (they don't have epoch info)". Consider either:

- Implementing the logic if `PendingEvaluationState` has epoch information available
- Removing the `skip_stale_evaluations` option until it's implemented
- Adding a TODO comment that's more visible
286-292: Magic number for epoch validation should be a named constant. The `1_000_000` epoch limit is arbitrary. Consider defining it as a constant with documentation explaining the rationale.

Proposed fix

```diff
+/// Maximum reasonable epoch value for validation.
+/// This is a sanity check to detect corrupted checkpoints.
+const MAX_REASONABLE_EPOCH: u64 = 1_000_000;
+
 fn validate_data(&self, data: &CheckpointData) -> Result<()> {
     // Validate epoch is reasonable
-    if data.epoch > 1_000_000 {
+    if data.epoch > MAX_REASONABLE_EPOCH {
         return Err(MiniChainError::Validation(
             "Checkpoint epoch seems unreasonably high".into(),
         ));
     }
```
336-351: `get_checkpoint_info` loads full checkpoint data for metadata extraction. This method loads the entire checkpoint (including all evaluations) just to extract summary information. If `CheckpointManager` supported header-only loading, this could be more efficient for listing many checkpoints.
372-379: Add documentation explaining expected implementors of the `Restorable` trait. The trait is exported in the public API but has no implementations in the codebase and lacks documentation about intended use. Consider adding a doc comment clarifying whether external crates are expected to implement this trait or if it's reserved for future use within this crate; a hypothetical example follows.
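One possible shape for that documentation, with a hypothetical implementor. The single `restore` method is an assumption; the trait's real methods are not shown in this review:

```rust
/// Types that can rebuild their runtime state from a restored checkpoint.
///
/// Exposed in the public API so that downstream crates (e.g. validator
/// binaries) can opt their components into hot-reload recovery; no
/// in-tree implementations exist yet.
trait Restorable {
    /// Apply restored checkpoint state to `self` (assumed signature).
    fn restore(&mut self, state: &[u8]);
}

struct Evaluator {
    state: Vec<u8>,
}

impl Restorable for Evaluator {
    fn restore(&mut self, state: &[u8]) {
        self.state = state.to_vec();
    }
}

fn main() {
    let mut e = Evaluator { state: Vec::new() };
    e.restore(b"restored");
    assert_eq!(e.state, b"restored".to_vec());
}
```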
crates/challenge-registry/src/registry.rs (1)
274-279: Event listeners executed synchronously under lock. `emit_event` calls all listeners while holding the read lock on `event_listeners`. Long-running or blocking listeners could delay other registry operations. Consider:

- Cloning listeners before iterating (release the lock sooner)
- Spawning listener calls asynchronously
- Documenting that listeners should be non-blocking

Option: Clone listeners to release the lock faster

```diff
 fn emit_event(&self, event: LifecycleEvent) {
-    for listener in self.event_listeners.read().iter() {
+    let listeners: Vec<_> = self.event_listeners.read().iter().collect();
+    // Lock released here
+    for listener in listeners {
         listener(event.clone());
     }
 }
```

Note: this requires listeners to be `Clone`, which `Box<dyn Fn>` is not. Alternative: store `Arc<dyn Fn>` instead; a sketch of that alternative follows.
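A sketch of that `Arc<dyn Fn>` alternative, using `std::sync::RwLock` here for a self-contained example (the crate appears to use a `parking_lot`-style lock, which would drop the `.unwrap()` calls):

```rust
use std::sync::{Arc, RwLock};

#[derive(Clone, Debug)]
struct LifecycleEvent(String);

type Listener = Arc<dyn Fn(LifecycleEvent) + Send + Sync>;

struct Registry {
    event_listeners: RwLock<Vec<Listener>>,
}

impl Registry {
    fn emit_event(&self, event: LifecycleEvent) {
        // Clone the Arcs so the read lock is released before listeners run.
        let listeners: Vec<Listener> = self
            .event_listeners
            .read()
            .unwrap()
            .iter()
            .cloned()
            .collect();
        for listener in listeners {
            listener(event.clone());
        }
    }
}

fn main() {
    let registry = Registry {
        event_listeners: RwLock::new(Vec::new()),
    };
    registry
        .event_listeners
        .write()
        .unwrap()
        .push(Arc::new(|e: LifecycleEvent| println!("event: {e:?}")));
    registry.emit_event(LifecycleEvent("StateChanged".into()));
}
```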
In `challenges/README.md`:

````
```
challenges/
├── README.md              # This file
├── example-challenge/     # Example challenge template (future)
└── [your-challenge]/      # Your custom challenge crate
```
````
Add a language tag to the fenced block.

Markdownlint MD040 flags fenced blocks without a language; use `text` here.

Proposed fix

````diff
-```
+```text
 challenges/
 ├── README.md              # This file
 ├── example-challenge/     # Example challenge template (future)
 └── [your-challenge]/      # Your custom challenge crate
````
In `crates/challenge-registry/src/health.rs`:

```rust
pub fn record_failure(&mut self, reason: String) {
    self.consecutive_failures += 1;
    self.last_check_at = chrono::Utc::now().timestamp_millis();

    if self.consecutive_failures >= 3 {
        self.status = HealthStatus::Unhealthy(reason);
    } else {
        self.status = HealthStatus::Degraded(reason);
    }
}
```
Hardcoded failure threshold ignores `HealthConfig.failure_threshold`.

The `record_failure` method hardcodes `>= 3` for marking unhealthy, but `HealthConfig` has a configurable `failure_threshold` field (default 3). This creates an inconsistency where changing the config has no effect.

Proposed fix to use configurable threshold

Either pass the threshold to `record_failure`:

```diff
-pub fn record_failure(&mut self, reason: String) {
+pub fn record_failure(&mut self, reason: String, failure_threshold: u32) {
     self.consecutive_failures += 1;
     self.last_check_at = chrono::Utc::now().timestamp_millis();
-    if self.consecutive_failures >= 3 {
+    if self.consecutive_failures >= failure_threshold {
         self.status = HealthStatus::Unhealthy(reason);
     } else {
         self.status = HealthStatus::Degraded(reason);
     }
 }
```

Or have `HealthMonitor::record_failure` apply the threshold from its config, sketched below.
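The second option might look roughly like this; the `HealthMonitor` fields shown are assumptions based on the review text:

```rust
use std::collections::HashMap;

struct HealthConfig {
    failure_threshold: u32,
}

struct Health {
    consecutive_failures: u32,
}

impl Health {
    fn record_failure(&mut self, reason: String, failure_threshold: u32) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= failure_threshold {
            println!("unhealthy: {reason}");
        } else {
            println!("degraded: {reason}");
        }
    }
}

struct HealthMonitor {
    config: HealthConfig,
    entries: HashMap<String, Health>,
}

impl HealthMonitor {
    // Forward the configured threshold into the per-challenge Health entry.
    fn record_failure(&mut self, id: &str, reason: String) {
        if let Some(health) = self.entries.get_mut(id) {
            health.record_failure(reason, self.config.failure_threshold);
        }
    }
}

fn main() {
    let mut monitor = HealthMonitor {
        config: HealthConfig { failure_threshold: 3 },
        entries: HashMap::from([("ch-1".to_string(), Health { consecutive_failures: 0 })]),
    };
    for _ in 0..3 {
        monitor.record_failure("ch-1", "timeout".into());
    }
}
```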
In `crates/challenge-registry/src/migration.rs`:

```rust
/// Finalize and archive a completed migration
pub fn finalize_migration(&self, challenge_id: &ChallengeId) -> RegistryResult<MigrationPlan> {
    let plan = self
        .active_plans
        .write()
        .remove(challenge_id)
        .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;

    if !plan.is_complete() {
        return Err(RegistryError::MigrationFailed(
            "Migration not complete".to_string(),
        ));
    }
```
Finalize only after verifying a terminal status (and keep failed plans archivable).

Line 345 removes the plan before checking completion, so a premature call drops the active migration. Also, `fail_migration` sets `Failed`, but `finalize_migration` rejects anything outside `Completed`/`RolledBack`, leaving failed plans stuck in `active_plans`. Validate the status while the plan is still in the map and allow all terminal states before removal.

🔧 Suggested fix

```diff
-        let plan = self
-            .active_plans
-            .write()
-            .remove(challenge_id)
-            .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;
-
-        if !plan.is_complete() {
-            return Err(RegistryError::MigrationFailed(
-                "Migration not complete".to_string(),
-            ));
-        }
+        let mut plans = self.active_plans.write();
+        let is_terminal = plans
+            .get(challenge_id)
+            .map(|p| {
+                matches!(
+                    p.status,
+                    MigrationStatus::Completed
+                        | MigrationStatus::RolledBack
+                        | MigrationStatus::Failed(_)
+                )
+            })
+            .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;
+
+        if !is_terminal {
+            return Err(RegistryError::MigrationFailed(
+                "Migration not complete".to_string(),
+            ));
+        }
+
+        let plan = plans.remove(challenge_id).expect("checked above");
```
+ let plan = plans.remove(challenge_id).expect("checked above");🤖 Prompt for AI Agents
In `@crates/challenge-registry/src/migration.rs` around lines 345 - 357, The
current finalize_migration removes the plan from active_plans before checking
its status, which can drop non-terminal plans and prevents failed plans from
being archived; update finalize_migration to first look up (read/clone or
inspect) the MigrationPlan in self.active_plans without removing it, verify that
its status is a terminal state (allow Completed, RolledBack, and Failed as
terminal), and only then call remove(challenge_id) to take it out and return the
plan; reference the finalize_migration method, active_plans map, MigrationPlan
status checks (e.g., is_complete or status enum variants) and ensure
compatibility with fail_migration which sets Failed so failed plans become
removable/archivable.
In `crates/challenge-registry/src/registry.rs`:

```rust
/// Update challenge lifecycle state
pub fn update_state(&self, id: &ChallengeId, new_state: LifecycleState) -> RegistryResult<()> {
    let mut challenges = self.challenges.write();
    let registered = challenges
        .get_mut(id)
        .ok_or_else(|| RegistryError::ChallengeNotFound(id.to_string()))?;

    let old_state = registered.entry.lifecycle_state.clone();
    registered.entry.lifecycle_state = new_state.clone();
    registered.entry.updated_at = chrono::Utc::now().timestamp_millis();

    debug!(
        challenge_id = %id,
        old_state = ?old_state,
        new_state = ?new_state,
        "Challenge state updated"
    );

    self.emit_event(LifecycleEvent::StateChanged {
        challenge_id: *id,
        old_state,
        new_state,
    });

    Ok(())
}
```
`update_state` doesn't validate transitions using `ChallengeLifecycle::is_valid_transition`.

The `ChallengeLifecycle` struct has `is_valid_transition` logic (defined in lifecycle.rs), but `update_state` doesn't use it. Invalid state transitions are allowed, which could lead to inconsistent states.

Proposed fix to validate transitions

```diff
 pub fn update_state(&self, id: &ChallengeId, new_state: LifecycleState) -> RegistryResult<()> {
     let mut challenges = self.challenges.write();
     let registered = challenges
         .get_mut(id)
         .ok_or_else(|| RegistryError::ChallengeNotFound(id.to_string()))?;

     let old_state = registered.entry.lifecycle_state.clone();
+
+    if !self.lifecycle.is_valid_transition(&old_state, &new_state) {
+        return Err(RegistryError::InvalidStateTransition(
+            format!("{:?} -> {:?}", old_state, new_state)
+        ));
+    }
+
     registered.entry.lifecycle_state = new_state.clone();
     registered.entry.updated_at = chrono::Utc::now().timestamp_millis();
```

Note: this requires adding an `InvalidStateTransition` variant to `RegistryError`; a sketch follows.
In `crates/challenge-registry/src/version.rs`:

```rust
impl Ord for ChallengeVersion {
    fn cmp(&self, other: &Self) -> Ordering {
        match self.major.cmp(&other.major) {
            Ordering::Equal => match self.minor.cmp(&other.minor) {
                Ordering::Equal => self.patch.cmp(&other.patch),
                ord => ord,
            },
            ord => ord,
        }
    }
}
```
Prerelease versions are not considered in ordering.

The `Ord` implementation ignores the `prerelease` field, meaning `1.0.0-alpha` and `1.0.0` are considered equal in comparisons. Per the semver specification, prerelease versions have lower precedence than the release version (i.e., `1.0.0-alpha < 1.0.0`).

This could cause issues when comparing versions during migration planning or constraint satisfaction.

💡 Suggested fix for semver-compliant ordering

```diff
 impl Ord for ChallengeVersion {
     fn cmp(&self, other: &Self) -> Ordering {
         match self.major.cmp(&other.major) {
             Ordering::Equal => match self.minor.cmp(&other.minor) {
-                Ordering::Equal => self.patch.cmp(&other.patch),
+                Ordering::Equal => match self.patch.cmp(&other.patch) {
+                    Ordering::Equal => {
+                        // Prerelease versions have lower precedence than release
+                        match (&self.prerelease, &other.prerelease) {
+                            (None, None) => Ordering::Equal,
+                            (Some(_), None) => Ordering::Less,    // prerelease < release
+                            (None, Some(_)) => Ordering::Greater, // release > prerelease
+                            (Some(a), Some(b)) => a.cmp(b),       // compare prerelease strings
+                        }
+                    }
+                    ord => ord,
+                },
                 ord => ord,
             },
             ord => ord,
         }
     }
 }
```
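Note that `a.cmp(b)` in the last arm is a plain string comparison; full semver precedence compares dot-separated identifiers individually, as the AI-agent prompt above spells out. A standalone sketch of that rule:

```rust
use std::cmp::Ordering;

// Compare dot-separated prerelease identifiers per semver:
// numeric identifiers compare numerically and rank below alphanumeric
// ones; if all shared identifiers are equal, the shorter list ranks lower.
fn cmp_prerelease(a: &str, b: &str) -> Ordering {
    let (ids_a, ids_b) = (a.split('.'), b.split('.'));
    for (x, y) in ids_a.clone().zip(ids_b.clone()) {
        let ord = match (x.parse::<u64>(), y.parse::<u64>()) {
            (Ok(n), Ok(m)) => n.cmp(&m),
            (Ok(_), Err(_)) => Ordering::Less, // numeric < alphanumeric
            (Err(_), Ok(_)) => Ordering::Greater,
            (Err(_), Err(_)) => x.cmp(y),
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    ids_a.count().cmp(&ids_b.count())
}

fn main() {
    assert_eq!(cmp_prerelease("alpha", "alpha.1"), Ordering::Less);
    assert_eq!(cmp_prerelease("alpha.1", "alpha.beta"), Ordering::Less);
    assert_eq!(cmp_prerelease("beta.2", "beta.11"), Ordering::Less);
    assert_eq!(cmp_prerelease("rc.1", "rc.1"), Ordering::Equal);
}
```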
In `crates/core/src/checkpoint.rs`:

```rust
pub fn create_checkpoint(&mut self, data: &CheckpointData) -> Result<PathBuf> {
    self.current_sequence += 1;
    let sequence = self.current_sequence;
    let filename = self.checkpoint_filename(sequence);
```
Defer `current_sequence` updates until after the checkpoint is safely persisted.

Line 232 increments before the write/rename. If I/O fails, `current_sequence` points to a non-existent file, and `load_latest` returns `None` even though older checkpoints exist. Compute the next sequence locally and update only after the rename succeeds.

🔧 Suggested fix

```diff
-        self.current_sequence += 1;
-        let sequence = self.current_sequence;
+        let sequence = self.current_sequence + 1;
         let filename = self.checkpoint_filename(sequence);
@@
         fs::rename(&temp_filename, &filename).map_err(|e| {
             MiniChainError::Storage(format!("Failed to finalize checkpoint: {}", e))
         })?;
+
+        self.current_sequence = sequence;
```

Also applies to: 279-283
In `crates/core/src/checkpoint.rs`:

```rust
// Read header length
let mut header_len_bytes = [0u8; 4];
reader
    .read_exact(&mut header_len_bytes)
    .map_err(|e| MiniChainError::Storage(format!("Failed to read header length: {}", e)))?;
let header_len = u32::from_le_bytes(header_len_bytes) as usize;

// Read header
let mut header_bytes = vec![0u8; header_len];
reader
    .read_exact(&mut header_bytes)
    .map_err(|e| MiniChainError::Storage(format!("Failed to read header: {}", e)))?;

let header: CheckpointHeader = bincode::deserialize(&header_bytes).map_err(|e| {
    MiniChainError::Serialization(format!("Failed to deserialize header: {}", e))
})?;

// Verify magic
if !header.verify_magic() {
    return Err(MiniChainError::Storage(
        "Invalid checkpoint magic bytes".into(),
    ));
}

// Verify version compatibility
if header.version > CHECKPOINT_VERSION {
    return Err(MiniChainError::Storage(format!(
        "Checkpoint version {} is newer than supported version {}",
        header.version, CHECKPOINT_VERSION
    )));
}

// Read data
let mut data_bytes = vec![0u8; header.data_size as usize];
reader
    .read_exact(&mut data_bytes)
    .map_err(|e| MiniChainError::Storage(format!("Failed to read data: {}", e)))?;
```
Add size bounds before allocating header/data buffers.

Line 326 uses a header length sourced from disk, and line 354 uses `data_size` from the header. A corrupt file can force large allocations and OOM. Add reasonable caps (or validate against file length) before allocating.

🔧 Suggested fix

```diff
 // Read header length
 let mut header_len_bytes = [0u8; 4];
 reader
     .read_exact(&mut header_len_bytes)
     .map_err(|e| MiniChainError::Storage(format!("Failed to read header length: {}", e)))?;
 let header_len = u32::from_le_bytes(header_len_bytes) as usize;
+const MAX_HEADER_SIZE: usize = 16 * 1024;
+const MAX_DATA_SIZE: u64 = 512 * 1024 * 1024;
+if header_len == 0 || header_len > MAX_HEADER_SIZE {
+    return Err(MiniChainError::Storage("Checkpoint header too large".into()));
+}
@@
 let header: CheckpointHeader = bincode::deserialize(&header_bytes).map_err(|e| {
     MiniChainError::Serialization(format!("Failed to deserialize header: {}", e))
 })?;
+
+if header.data_size > MAX_DATA_SIZE {
+    return Err(MiniChainError::Storage("Checkpoint data too large".into()));
+}
```
In `crates/rpc-server/src/health.rs`:

```rust
fn get_overall_status(&self) -> HealthStatus {
    let components = self.components.read();

    // If any component is unhealthy, overall is unhealthy
    if components.p2p == HealthStatus::Unhealthy
        || components.storage == HealthStatus::Unhealthy
        || components.consensus == HealthStatus::Unhealthy
    {
        return HealthStatus::Unhealthy;
    }

    // If any critical component is degraded, overall is degraded
    if components.p2p == HealthStatus::Degraded
        || components.storage == HealthStatus::Degraded
        || components.consensus == HealthStatus::Degraded
    {
        return HealthStatus::Degraded;
    }

    // If Bittensor is down but others are fine, degraded
    if components.bittensor == HealthStatus::Unhealthy {
        return HealthStatus::Degraded;
    }

    HealthStatus::Healthy
}
```
The `challenges` component status is tracked but not used in the health calculation.

The `set_component_status` method accepts "challenges" (line 190), and `ComponentStatus` includes a `challenges` field, but `get_overall_status()` doesn't consider it when determining overall health. If challenge container health should affect the overall status, it should be included in this logic.

💡 Suggested fix if challenges should affect health

```diff
 fn get_overall_status(&self) -> HealthStatus {
     let components = self.components.read();

     // If any component is unhealthy, overall is unhealthy
     if components.p2p == HealthStatus::Unhealthy
         || components.storage == HealthStatus::Unhealthy
         || components.consensus == HealthStatus::Unhealthy
     {
         return HealthStatus::Unhealthy;
     }

     // If any critical component is degraded, overall is degraded
     if components.p2p == HealthStatus::Degraded
         || components.storage == HealthStatus::Degraded
         || components.consensus == HealthStatus::Degraded
     {
         return HealthStatus::Degraded;
     }

-    // If Bittensor is down but others are fine, degraded
-    if components.bittensor == HealthStatus::Unhealthy {
+    // If Bittensor or challenges are down but core is fine, degraded
+    if components.bittensor == HealthStatus::Unhealthy
+        || components.challenges == HealthStatus::Unhealthy
+    {
         return HealthStatus::Degraded;
     }

     HealthStatus::Healthy
 }
```
In `docs/challenge-integration.md`:

````
```
my-challenge/
├── Cargo.toml
├── src/
│   ├── lib.rs          # Challenge implementation
│   ├── evaluation.rs   # Evaluation logic
│   └── scoring.rs      # Scoring algorithm
├── Dockerfile          # Container build
└── README.md           # Documentation
```
````
Add a language tag to the project-structure block.

This avoids MD040 warnings and improves readability.

Proposed fix

````diff
-```
+```text
 my-challenge/
 ├── Cargo.toml
 ├── src/
 │   ├── lib.rs          # Challenge implementation
 │   ├── evaluation.rs   # Evaluation logic
 │   └── scoring.rs      # Scoring algorithm
 ├── Dockerfile          # Container build
 └── README.md           # Documentation
````
Summary
This PR adds comprehensive infrastructure for integrating challenge crates into Platform v2, with a focus on enabling zero-downtime updates.
Changes
Challenge Infrastructure
- `challenges/` directory structure for hosting challenge crates
- `platform-challenge-registry` crate for challenge lifecycle management

Checkpoint System

- Checkpoint manager in `platform-core` for state persistence

Rolling Updates
Documentation
Testing
Benefits
Testing
- `cargo check --workspace`

Summary by CodeRabbit
Release Notes
New Features
Documentation
Infrastructure