Skip to content

feat: add runners to pb protocol #2433

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: 05-27-chore_pegboard_add_exit_code_log_from_container_runner
Choose a base branch
from

Conversation

MasterPtato
Copy link
Contributor

Changes

Copy link
Contributor Author

MasterPtato commented May 9, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

Copy link

cloudflare-workers-and-pages bot commented May 9, 2025

Deploying rivet with  Cloudflare Pages  Cloudflare Pages

Latest commit: cdf9857
Status:🚫  Build failed.

View logs

@MasterPtato MasterPtato force-pushed the 05-08-feat_add_runners_to_pb_protocol branch 5 times, most recently from d3bc56e to 9e9302b Compare May 9, 2025 23:59
@MasterPtato MasterPtato force-pushed the 05-07-fix_swap_status_monitor_dns_resolver_with_native_resolver branch from 13d14bb to 35fc915 Compare May 9, 2025 23:59
@graphite-app graphite-app bot changed the base branch from 05-07-fix_swap_status_monitor_dns_resolver_with_native_resolver to graphite-base/2433 May 10, 2025 00:01
@MasterPtato MasterPtato force-pushed the 05-08-feat_add_runners_to_pb_protocol branch from 9e9302b to b4a61a7 Compare May 15, 2025 01:34
@MasterPtato MasterPtato force-pushed the graphite-base/2433 branch from 35fc915 to bb8ccd5 Compare May 15, 2025 01:34
@MasterPtato MasterPtato changed the base branch from graphite-base/2433 to main May 15, 2025 01:34
@MasterPtato MasterPtato force-pushed the 05-08-feat_add_runners_to_pb_protocol branch 3 times, most recently from faaaf17 to 542b3c5 Compare May 16, 2025 20:15
@MasterPtato MasterPtato force-pushed the 05-08-feat_add_runners_to_pb_protocol branch from 542b3c5 to 885534d Compare May 17, 2025 01:22
Copy link

cloudflare-workers-and-pages bot commented May 17, 2025

Deploying rivet-studio with  Cloudflare Pages  Cloudflare Pages

Latest commit: cdf9857
Status:🚫  Build failed.

View logs

Copy link

cloudflare-workers-and-pages bot commented May 17, 2025

Deploying rivet-hub with  Cloudflare Pages  Cloudflare Pages

Latest commit: cdf9857
Status:🚫  Build failed.

View logs

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

This PR introduces a major architectural change to the Rivet project by adding runners to the pegboard protocol and removing the isolate-v8-runner implementation. The changes consolidate runner management into a more generic protocol while maintaining backward compatibility.

  • Removes isolate-v8-runner package and related JavaScript runtime functionality, suggesting a shift away from V8-based isolation
  • Adds new actor2 workflow with support for multi-slot runners and improved state management in packages/edge/services/pegboard/src/workflows/actor2/
  • Introduces runner tracking by remaining slots with new key structures in packages/edge/services/pegboard/src/keys/runner.rs
  • Refactors container runner to use generic runner terminology and WebSocket-based communication in packages/edge/infra/client/container-runner/
  • Adds subnet configuration and IP management for runners in pegboard_configure.sh with new variables like SUBNET_IPV4 and SUBNET_IPV6

76 file(s) reviewed, 49 comment(s)
Edit PR Review Bot Settings | Greptile

@@ -47,10 +47,10 @@ RUN \
--mount=type=cache,target=/root/.cache,id=universal-user-cache \
--mount=type=cache,target=/root/.npm,id=universal-user-npm \
--mount=type=cache,target=/root/.yarn,id=universal-user-yarn \
RUSTFLAGS="--cfg tokio_unstable" RIVET_BUILD_HUB=0 cargo build --bin rivet-server --bin rivet-edge-server --bin rivet-guard --bin rivet-client --bin rivet-isolate-v8-runner --bin rivet-container-runner && \
RUSTFLAGS="--cfg tokio_unstable" RIVET_BUILD_HUB=0 cargo build --bin rivet-server --bin rivet-edge-server --bin rivet-guard --bin rivet-client --bin rivet-container-runner && \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The rivet-isolate-v8-runner binary has been removed from the build command. Verify that this functionality is either no longer needed or has been replaced elsewhere in the codebase.

Comment on lines +52 to +58
http2: {
protocol: "http",
internalPort: 8085,
routing: {
guard: {},
},
},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: HTTP2 port is configured but no tests are implemented to verify HTTP2 functionality

Comment on lines 47 to 53
// Copy dist directory to out_dir
let dist_path = hub_path.join("dist");
fs_extra::dir::copy(
dist_path.clone(),
out_dir,
&fs_extra::dir::CopyOptions::new().content_only(true),
)?;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: This will fail if the dist directory doesn't exist when RIVET_BUILD_HUB is false. The old code created an empty directory in this case, but now it tries to copy a potentially non-existent directory.

Suggested change
// Copy dist directory to out_dir
let dist_path = hub_path.join("dist");
fs_extra::dir::copy(
dist_path.clone(),
out_dir,
&fs_extra::dir::CopyOptions::new().content_only(true),
)?;
// Copy dist directory to out_dir if it exists
let dist_path = hub_path.join("dist");
if dist_path.exists() {
fs_extra::dir::copy(
dist_path.clone(),
out_dir,
&fs_extra::dir::CopyOptions::new().content_only(true),
)?;
} else {
// Create empty directory when dist doesn't exist
fs::create_dir_all(out_dir)?
}

Comment on lines 55 to 56
println!("cargo:rerun-if-changed={}", hub_path.display());
println!("cargo:rerun-if-env-changed=FONTAWESOME_PACKAGE_TOKEN");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: These rerun-if-changed triggers are now outside the build conditional, meaning they'll trigger rebuilds even when RIVET_BUILD_HUB is false.

let mut create_sub = ctx
.subscribe::<pegboard::workflows::actor::CreateComplete>(("actor_id", actor_id))

if let build::types::BuildAllocationType::None = build.allocation_type {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Consider extracting common workflow setup code to avoid duplication between allocation type branches

Comment on lines +267 to +295
pub struct RunnerConfig {
pub image: Image,
pub root_user_enabled: bool,
pub resources: Resources,
pub env: HashableMap<String, String>,
pub ports: HashableMap<String, Port>,
pub network_mode: NetworkMode,
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: RunnerConfig duplicates fields from deprecated ActorConfig. Consider reusing the existing types to maintain consistency.

Comment on lines +254 to +316
if let BuildAllocationType::Multi = input.build_allocation_type {
loop {
let Some(entry) = stream.try_next().await? else {
return Ok(None);
};

let old_runner_allocation_key = keys::subspace()
.unpack::<keys::datacenter::RunnersByRemainingSlotsKey>(entry.key())
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?;

let data = old_runner_allocation_key
.deserialize(entry.value())
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?;

// Add read conflict only for this key
tx.add_conflict_range(
entry.key(),
&end_of_key_range(entry.key()),
ConflictRangeType::Read,
)?;

// Clear old entry
tx.clear(entry.key());

let new_remaining_slots =
old_runner_allocation_key.remaining_slots.saturating_sub(1);

// Write new allocation key with 1 less slot
let new_allocation_key = keys::datacenter::RunnersByRemainingSlotsKey::new(
input.image_id,
new_remaining_slots,
old_runner_allocation_key.runner_id,
);
tx.set(&keys::subspace().pack(&new_allocation_key), entry.value());

// Update runner record
let remaining_slots_key =
keys::runner::RemainingSlotsKey::new(old_runner_allocation_key.runner_id);
tx.set(
&keys::subspace().pack(&remaining_slots_key),
&remaining_slots_key
.serialize(new_remaining_slots)
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);

// Insert actor index key
let client_actor_key =
keys::client::ActorKey::new(data.client_id, input.actor_id);
tx.set(
&keys::subspace().pack(&client_actor_key),
&client_actor_key
.serialize(input.generation)
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);

return Ok(Some(AllocateActorOutput {
runner_id: old_runner_allocation_key.runner_id,
new_runner: false,
client_id: data.client_id,
client_workflow_id: data.client_workflow_id,
}));
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The allocation loop for multi-type builds lacks a check for remaining_slots > 0, which could lead to allocating to runners with no capacity

Comment on lines +426 to +462
if let BuildAllocationType::Multi = input.build_allocation_type {
let remaining_slots = input.build_allocation_total_slots.saturating_sub(1);
let total_slots = input.build_allocation_total_slots;

// Insert runner records
let remaining_slots_key = keys::runner::RemainingSlotsKey::new(runner_id);
tx.set(
&keys::subspace().pack(&remaining_slots_key),
&remaining_slots_key
.serialize(remaining_slots)
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);

let total_slots_key = keys::runner::TotalSlotsKey::new(runner_id);
tx.set(
&keys::subspace().pack(&total_slots_key),
&total_slots_key
.serialize(total_slots)
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);

// Insert runner index key
let runner_idx_key = keys::datacenter::RunnersByRemainingSlotsKey::new(
input.image_id,
remaining_slots,
runner_id,
);
tx.set(
&keys::subspace().pack(&runner_idx_key),
&runner_idx_key
.serialize(keys::datacenter::RunnersByRemainingSlotsKeyData {
client_id: old_client_allocation_key.client_id,
client_workflow_id,
})
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Runner records are only created when build_allocation_type is Multi, but this creates an inconsistency since Single type builds should also have runner records

Suggested change
if let BuildAllocationType::Multi = input.build_allocation_type {
let remaining_slots = input.build_allocation_total_slots.saturating_sub(1);
let total_slots = input.build_allocation_total_slots;
// Insert runner records
let remaining_slots_key = keys::runner::RemainingSlotsKey::new(runner_id);
tx.set(
&keys::subspace().pack(&remaining_slots_key),
&remaining_slots_key
.serialize(remaining_slots)
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);
let total_slots_key = keys::runner::TotalSlotsKey::new(runner_id);
tx.set(
&keys::subspace().pack(&total_slots_key),
&total_slots_key
.serialize(total_slots)
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);
// Insert runner index key
let runner_idx_key = keys::datacenter::RunnersByRemainingSlotsKey::new(
input.image_id,
remaining_slots,
runner_id,
);
tx.set(
&keys::subspace().pack(&runner_idx_key),
&runner_idx_key
.serialize(keys::datacenter::RunnersByRemainingSlotsKeyData {
client_id: old_client_allocation_key.client_id,
client_workflow_id,
})
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);
}
// Insert runner records for all allocation types
let (remaining_slots, total_slots) = match input.build_allocation_type {
BuildAllocationType::Multi => {
(input.build_allocation_total_slots.saturating_sub(1), input.build_allocation_total_slots)
},
BuildAllocationType::Single => (0, 1),
};
// Insert runner records
let remaining_slots_key = keys::runner::RemainingSlotsKey::new(runner_id);
tx.set(
&keys::subspace().pack(&remaining_slots_key),
&remaining_slots_key
.serialize(remaining_slots)
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);
let total_slots_key = keys::runner::TotalSlotsKey::new(runner_id);
tx.set(
&keys::subspace().pack(&total_slots_key),
&total_slots_key
.serialize(total_slots)
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);
// Insert runner index key
let runner_idx_key = keys::datacenter::RunnersByRemainingSlotsKey::new(
input.image_id,
remaining_slots,
runner_id,
);
tx.set(
&keys::subspace().pack(&runner_idx_key),
&runner_idx_key
.serialize(keys::datacenter::RunnersByRemainingSlotsKeyData {
client_id: old_client_allocation_key.client_id,
client_workflow_id,
})
.map_err(|x| fdb::FdbBindingError::CustomError(x.into()))?,
);

Comment on lines +1028 to +1041
destroy::clear_ports_and_resources(
input.actor_id,
input.image_id,
Some(build.allocation_type),
ingress_ports,
Some(input.runner_id),
Some(input.client_id),
Some(input.client_workflow_id),
selected_resources_memory_mib,
selected_resources_cpu_millicores,
&tx,
)
.await
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: clear_ports_and_resources is called with build.allocation_type before checking if build exists, which could cause a panic if build is not found

Suggested change
destroy::clear_ports_and_resources(
input.actor_id,
input.image_id,
Some(build.allocation_type),
ingress_ports,
Some(input.runner_id),
Some(input.client_id),
Some(input.client_workflow_id),
selected_resources_memory_mib,
selected_resources_cpu_millicores,
&tx,
)
.await
}
destroy::clear_ports_and_resources(
input.actor_id,
input.image_id,
build.map(|b| b.allocation_type),
ingress_ports,
Some(input.runner_id),
Some(input.client_id),
Some(input.client_workflow_id),
selected_resources_memory_mib,
selected_resources_cpu_millicores,
&tx,
)
.await
}

@@ -1,2 +1,3 @@
pub mod actor;
pub mod actor2;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Consider using a more descriptive name like actor_v2 or similar instead of actor2 to better indicate this is a new version

@MasterPtato MasterPtato changed the base branch from main to graphite-base/2433 May 27, 2025 02:01
@MasterPtato MasterPtato force-pushed the 05-08-feat_add_runners_to_pb_protocol branch from 885534d to d0f985a Compare May 27, 2025 02:02
@MasterPtato MasterPtato changed the base branch from graphite-base/2433 to 05-27-chore_pegboard_add_exit_code_log_from_container_runner May 27, 2025 02:02
@MasterPtato MasterPtato force-pushed the 05-08-feat_add_runners_to_pb_protocol branch from 8fad6b0 to a832749 Compare July 3, 2025 01:31
@MasterPtato MasterPtato force-pushed the 05-27-chore_pegboard_add_exit_code_log_from_container_runner branch from f20b30f to 03f6f81 Compare July 3, 2025 01:31
Copy link

claude bot commented Jul 3, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add runners to pb protocol

I've completed a comprehensive review of this major architectural change. This PR introduces significant improvements to the Rivet platform by adding runner support to the pegboard protocol while removing the isolate-v8-runner implementation.

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze core architectural changes (actor2 workflow)
  • Review protocol and runner management changes
  • Examine security implications
  • Check error handling and resource management
  • Assess test coverage
  • Provide comprehensive feedback

🏗️ Architecture & Design

Strengths:

  • Unified Runner Protocol: The consolidation from separate isolate and container runners to a generic runner protocol is a solid architectural improvement
  • Multi-slot Support: The new slot-based allocation system (BuildAllocationType::Multi) enables better resource utilization
  • Actor2 Workflow: Clean separation between actor and actor2 workflows allows for incremental migration
  • Database Design: The new runner tracking in FDB with remaining/total slots provides efficient allocation lookups

Concerns:

  • Breaking Change Scope: Removing the entire isolate-v8-runner package without clear migration path could break existing JavaScript runtime functionality
  • Dual Workflow Complexity: Having both actor and actor2 workflows increases cognitive load and maintenance burden

🐛 Potential Bugs & Issues

Critical Issues

  1. Runner Allocation Logic Gap (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:255):

    // Missing remaining_slots > 0 check for multi-type builds
    if let BuildAllocationType::Multi = input.build_allocation_type {
        loop {
            // Should verify remaining_slots > 0 before allocation
  2. SQL Parameter Mismatch (packages/edge/infra/client/manager/src/runner/mod.rs:510):

    SET delete_ts = ?3  -- Should be ?2, missing bind parameter
    WHERE
  3. Race Condition in Runner Socket (packages/edge/infra/client/manager/src/ctx.rs:877):
    Runner socket connection could still occur between the connection check lines.

  4. Build Type Validation Error (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:1021):

    // build.allocation_type accessed before checking if build exists
    let build = unwrap_with!(build_res.builds.first(), BUILD_NOT_FOUND);
    // Later used without null check
    destroy::clear_ports_and_resources(..., build.allocation_type, ...)

Security Concerns

  1. Hardcoded UUIDs (packages/edge/infra/client/echo/src/main.rs:68):

    let payload = json!({
        "init": {
            "runner_id": Uuid::nil(), // Should be unique UUID
        },
    });
  2. Missing Input Validation (packages/edge/infra/client/echo/src/main.rs:107):
    No error handling for missing actor_id or generation fields in packets.

🚀 Performance Considerations

Improvements:

  • FDB Allocation Strategy: Using SNAPSHOT reads for allocation prevents conflicts while maintaining consistency
  • Bin-packing Algorithm: Reverse ordering in allocation queries optimizes resource utilization
  • Connection Pooling: WebSocket connection management looks efficient

Potential Issues:

  • Tokio "full" Feature: packages/edge/infra/client/echo/Cargo.toml enables all Tokio features unnecessarily
  • Allocation Loop Performance: The multi-slot allocation loop could benefit from early termination checks
  • Ping Thread Cleanup: Ping threads continue running even after main WebSocket connections close

🔒 Security Assessment

Positive:

  • Process Isolation: Container runner maintains proper process separation
  • Signal Handling: Appropriate use of Unix signals for process management
  • Resource Bounds: Memory and CPU limits properly enforced

Concerns:

  • Error Exposure: WebSocket error messages might leak internal system information
  • Resource Exhaustion: Missing rate limiting could allow unconstrained logging (removed throttle.rs)

🧪 Test Coverage

Missing Tests:

  • Runner Lifecycle: No comprehensive tests for runner creation, allocation, and cleanup
  • Error Scenarios: Limited testing of failure modes and edge cases
  • Protocol Validation: Missing tests for malformed protocol messages
  • Multi-slot Allocation: No tests for the new slot-based allocation system

Existing Tests:

  • Basic client state tests exist but assume generation 0
  • Integration tests are minimal given the scope of changes

📝 Code Quality Issues

Style & Maintainability

  1. Inconsistent Naming: actor2 should be actor_v2 for clarity
  2. TODO Overuse: Multiple unresolved TODOs in critical paths
  3. Deprecated Fields: Multiple deprecated protocol fields still marked as required
  4. Comment Accuracy: TotalSlotsKey comment says "MiB" but measures slots

Error Handling

  1. Incomplete Error Paths: Several unwrap() calls that should handle errors gracefully
  2. Missing Context: Error messages lack sufficient debugging information
  3. Inconsistent Returns: Some functions return Ok(()) where errors should be propagated

🔧 Recommendations

Immediate Fixes Required

  1. Fix SQL Parameter Binding: Correct the missing parameter in runner deletion query
  2. Add Remaining Slots Check: Validate slot availability before allocation attempts
  3. Improve Error Handling: Replace unwrap() calls with proper error handling
  4. Fix UUID Generation: Use Uuid::new_v4() instead of Uuid::nil() for runner IDs

Architecture Improvements

  1. Migration Strategy: Provide clear documentation for migrating from isolate-v8-runner
  2. Test Coverage: Add comprehensive integration tests for the new runner system
  3. Monitoring: Implement metrics for runner allocation success/failure rates
  4. Documentation: Update API documentation to reflect the new allocation types

Performance Optimizations

  1. Resource Cleanup: Implement proper cleanup for ping threads and WebSocket connections
  2. Feature Minimization: Use specific Tokio features instead of "full"
  3. Allocation Efficiency: Add early termination conditions in allocation loops

✅ Overall Assessment

This PR represents a significant and necessary architectural improvement that consolidates runner management and enables multi-slot allocation. While the core design is sound, several critical bugs and missing error handling need to be addressed before merging.

Verdict: Approve with required changes - The architectural direction is correct, but the implementation needs refinement in error handling, test coverage, and bug fixes identified above.

Priority Issues to Address:

  1. SQL parameter binding fix (High)
  2. Remaining slots validation (High)
  3. Error handling improvements (High)
  4. Test coverage expansion (Medium)
  5. Documentation updates (Medium)

Copy link

claude bot commented Jul 3, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add runners to pb protocol

I've completed a comprehensive review of this major architectural change. This PR introduces significant improvements to the Rivet platform by adding runner support to the pegboard protocol while removing the isolate-v8-runner implementation.

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze core architectural changes (actor2 workflow)
  • Review protocol and runner management changes
  • Examine security implications
  • Check error handling and resource management
  • Assess test coverage
  • Provide comprehensive feedback

🏗️ Architecture & Design

Strengths:

  • Unified Runner Protocol: The consolidation from separate isolate and container runners to a generic runner protocol is a solid architectural improvement
  • Multi-slot Support: The new slot-based allocation system (BuildAllocationType::Multi) enables better resource utilization
  • Actor2 Workflow: Clean separation between actor and actor2 workflows allows for incremental migration
  • Database Design: The new runner tracking in FDB with remaining/total slots provides efficient allocation lookups

Concerns:

  • Breaking Change Scope: Removing the entire isolate-v8-runner package without clear migration path could break existing JavaScript runtime functionality
  • Dual Workflow Complexity: Having both actor and actor2 workflows increases cognitive load and maintenance burden

🐛 Potential Bugs & Issues

Critical Issues

  1. Runner Allocation Logic Gap (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:255):

    // Missing remaining_slots > 0 check for multi-type builds
    if let BuildAllocationType::Multi = input.build_allocation_type {
        loop {
            // Should verify remaining_slots > 0 before allocation
  2. SQL Parameter Mismatch (packages/edge/infra/client/manager/src/runner/mod.rs:510):

    SET delete_ts = ?3  -- Should be ?2, missing bind parameter
    WHERE
  3. Race Condition in Runner Socket (packages/edge/infra/client/manager/src/ctx.rs:877):
    Runner socket connection could still occur between the connection check lines.

  4. Build Type Validation Error (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:1021):

    // build.allocation_type accessed before checking if build exists
    let build = unwrap_with!(build_res.builds.first(), BUILD_NOT_FOUND);
    // Later used without null check
    destroy::clear_ports_and_resources(..., build.allocation_type, ...)

Security Concerns

  1. Hardcoded UUIDs (packages/edge/infra/client/echo/src/main.rs:68):

    let payload = json!({
        "init": {
            "runner_id": Uuid::nil(), // Should be unique UUID
        },
    });
  2. Missing Input Validation (packages/edge/infra/client/echo/src/main.rs:107):
    No error handling for missing actor_id or generation fields in packets.

🚀 Performance Considerations

Improvements:

  • FDB Allocation Strategy: Using SNAPSHOT reads for allocation prevents conflicts while maintaining consistency
  • Bin-packing Algorithm: Reverse ordering in allocation queries optimizes resource utilization
  • Connection Pooling: WebSocket connection management looks efficient

Potential Issues:

  • Tokio "full" Feature: packages/edge/infra/client/echo/Cargo.toml enables all Tokio features unnecessarily
  • Allocation Loop Performance: The multi-slot allocation loop could benefit from early termination checks
  • Ping Thread Cleanup: Ping threads continue running even after main WebSocket connections close

🔒 Security Assessment

Positive:

  • Process Isolation: Container runner maintains proper process separation
  • Signal Handling: Appropriate use of Unix signals for process management
  • Resource Bounds: Memory and CPU limits properly enforced

Concerns:

  • Error Exposure: WebSocket error messages might leak internal system information
  • Resource Exhaustion: Missing rate limiting could allow unconstrained logging (removed throttle.rs)

🧪 Test Coverage

Missing Tests:

  • Runner Lifecycle: No comprehensive tests for runner creation, allocation, and cleanup
  • Error Scenarios: Limited testing of failure modes and edge cases
  • Protocol Validation: Missing tests for malformed protocol messages
  • Multi-slot Allocation: No tests for the new slot-based allocation system

Existing Tests:

  • Basic client state tests exist but assume generation 0
  • Integration tests are minimal given the scope of changes

📝 Code Quality Issues

Style & Maintainability

  1. Inconsistent Naming: actor2 should be actor_v2 for clarity
  2. TODO Overuse: Multiple unresolved TODOs in critical paths
  3. Deprecated Fields: Multiple deprecated protocol fields still marked as required
  4. Comment Accuracy: TotalSlotsKey comment says "MiB" but measures slots

Error Handling

  1. Incomplete Error Paths: Several unwrap() calls that should handle errors gracefully
  2. Missing Context: Error messages lack sufficient debugging information
  3. Inconsistent Returns: Some functions return Ok(()) where errors should be propagated

🔧 Recommendations

Immediate Fixes Required

  1. Fix SQL Parameter Binding: Correct the missing parameter in runner deletion query
  2. Add Remaining Slots Check: Validate slot availability before allocation attempts
  3. Improve Error Handling: Replace unwrap() calls with proper error handling
  4. Fix UUID Generation: Use Uuid::new_v4() instead of Uuid::nil() for runner IDs

Architecture Improvements

  1. Migration Strategy: Provide clear documentation for migrating from isolate-v8-runner
  2. Test Coverage: Add comprehensive integration tests for the new runner system
  3. Monitoring: Implement metrics for runner allocation success/failure rates
  4. Documentation: Update API documentation to reflect the new allocation types

Performance Optimizations

  1. Resource Cleanup: Implement proper cleanup for ping threads and WebSocket connections
  2. Feature Minimization: Use specific Tokio features instead of "full"
  3. Allocation Efficiency: Add early termination conditions in allocation loops

✅ Overall Assessment

This PR represents a significant and necessary architectural improvement that consolidates runner management and enables multi-slot allocation. While the core design is sound, several critical bugs and missing error handling need to be addressed before merging.

Verdict: Approve with required changes - The architectural direction is correct, but the implementation needs refinement in error handling, test coverage, and bug fixes identified above.

Priority Issues to Address:

  1. SQL parameter binding fix (High)
  2. Remaining slots validation (High)
  3. Error handling improvements (High)
  4. Test coverage expansion (Medium)
  5. Documentation updates (Medium)

@MasterPtato MasterPtato force-pushed the 05-27-chore_pegboard_add_exit_code_log_from_container_runner branch from 03f6f81 to c6943e6 Compare July 4, 2025 01:27
@MasterPtato MasterPtato force-pushed the 05-08-feat_add_runners_to_pb_protocol branch from a832749 to 0074232 Compare July 4, 2025 01:27
Copy link

coderabbitai bot commented Jul 4, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

claude bot commented Jul 4, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add runners to pb protocol

I've completed a comprehensive review of this major architectural change. This PR introduces significant improvements to the Rivet platform by adding runner support to the pegboard protocol while removing the isolate-v8-runner implementation.

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze overall architecture and design decisions
  • Review actor2 workflow implementation
  • Examine runner management and protocol changes
  • Check security implications
  • Assess error handling and edge cases
  • Evaluate performance considerations
  • Review test coverage
  • Provide structured feedback and recommendations

🏗️ Architecture & Design

Strengths:

  • Unified Runner Protocol: The consolidation from separate isolate and container runners to a generic runner protocol is a solid architectural improvement
  • Multi-slot Support: The new slot-based allocation system (BuildAllocationType::Multi) enables better resource utilization
  • Actor2 Workflow: Clean separation between actor and actor2 workflows allows for incremental migration
  • Database Design: The new runner tracking in FDB with remaining/total slots provides efficient allocation lookups

Concerns:

  • Breaking Change Scope: Removing the entire isolate-v8-runner package without clear migration path could break existing JavaScript runtime functionality
  • Dual Workflow Complexity: Having both actor and actor2 workflows increases cognitive load and maintenance burden

🐛 Potential Bugs & Issues

Critical Issues

  1. Runner Allocation Logic Gap (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:255):

    // Missing remaining_slots > 0 check for multi-type builds
    if let BuildAllocationType::Multi = input.build_allocation_type {
        loop {
            // Should verify remaining_slots > 0 before allocation
  2. SQL Parameter Mismatch (packages/edge/infra/client/manager/src/runner/mod.rs:510):

    SET delete_ts = ?3  -- Should be ?2, missing bind parameter
    WHERE
  3. Race Condition in Runner Socket (packages/edge/infra/client/manager/src/ctx.rs:877):
    Runner socket connection could still occur between the connection check lines.

  4. Build Type Validation Error (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:1021):

    // build.allocation_type accessed before checking if build exists
    let build = unwrap_with!(build_res.builds.first(), BUILD_NOT_FOUND);
    // Later used without null check
    destroy::clear_ports_and_resources(..., build.allocation_type, ...)

Security Concerns

  1. Hardcoded UUIDs (packages/edge/infra/client/echo/src/main.rs:68):

    let payload = json!({
        "init": {
            "runner_id": Uuid::nil(), // Should be unique UUID
        },
    });
  2. Missing Input Validation (packages/edge/infra/client/echo/src/main.rs:107):
    No error handling for missing actor_id or generation fields in packets.

🚀 Performance Considerations

Improvements:

  • FDB Allocation Strategy: Using SNAPSHOT reads for allocation prevents conflicts while maintaining consistency
  • Bin-packing Algorithm: Reverse ordering in allocation queries optimizes resource utilization
  • Connection Pooling: WebSocket connection management looks efficient

Potential Issues:

  • Tokio "full" Feature: packages/edge/infra/client/echo/Cargo.toml enables all Tokio features unnecessarily
  • Allocation Loop Performance: The multi-slot allocation loop could benefit from early termination checks
  • Ping Thread Cleanup: Ping threads continue running even after main WebSocket connections close

🔒 Security Assessment

Positive:

  • Process Isolation: Container runner maintains proper process separation
  • Signal Handling: Appropriate use of Unix signals for process management
  • Resource Bounds: Memory and CPU limits properly enforced

Concerns:

  • Error Exposure: WebSocket error messages might leak internal system information
  • Resource Exhaustion: Missing rate limiting could allow unconstrained logging (removed throttle.rs)

🧪 Test Coverage

Missing Tests:

  • Runner Lifecycle: No comprehensive tests for runner creation, allocation, and cleanup
  • Error Scenarios: Limited testing of failure modes and edge cases
  • Protocol Validation: Missing tests for malformed protocol messages
  • Multi-slot Allocation: No tests for the new slot-based allocation system

Existing Tests:

  • Basic client state tests exist but assume generation 0
  • Integration tests are minimal given the scope of changes

📝 Code Quality Issues

Style & Maintainability

  1. Inconsistent Naming: actor2 should be actor_v2 for clarity
  2. TODO Overuse: Multiple unresolved TODOs in critical paths
  3. Deprecated Fields: Multiple deprecated protocol fields still marked as required
  4. Comment Accuracy: TotalSlotsKey comment says "MiB" but measures slots

Error Handling

  1. Incomplete Error Paths: Several unwrap() calls that should handle errors gracefully
  2. Missing Context: Error messages lack sufficient debugging information
  3. Inconsistent Returns: Some functions return Ok(()) where errors should be propagated

🔧 Recommendations

Immediate Fixes Required

  1. Fix SQL Parameter Binding: Correct the missing parameter in runner deletion query
  2. Add Remaining Slots Check: Validate slot availability before allocation attempts
  3. Improve Error Handling: Replace unwrap() calls with proper error handling
  4. Fix UUID Generation: Use Uuid::new_v4() instead of Uuid::nil() for runner IDs

Architecture Improvements

  1. Migration Strategy: Provide clear documentation for migrating from isolate-v8-runner
  2. Test Coverage: Add comprehensive integration tests for the new runner system
  3. Monitoring: Implement metrics for runner allocation success/failure rates
  4. Documentation: Update API documentation to reflect the new allocation types

Performance Optimizations

  1. Resource Cleanup: Implement proper cleanup for ping threads and WebSocket connections
  2. Feature Minimization: Use specific Tokio features instead of "full"
  3. Allocation Efficiency: Add early termination conditions in allocation loops

✅ Overall Assessment

This PR represents a significant and necessary architectural improvement that consolidates runner management and enables multi-slot allocation. While the core design is sound, several critical bugs and missing error handling need to be addressed before merging.

Verdict: Approve with required changes - The architectural direction is correct, but the implementation needs refinement in error handling, test coverage, and bug fixes identified above.

Priority Issues to Address:

  1. SQL parameter binding fix (High)
  2. Remaining slots validation (High)
  3. Error handling improvements (High)
  4. Test coverage expansion (Medium)
  5. Documentation updates (Medium)

Copy link

claude bot commented Jul 4, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add runners to pb protocol

I've completed a comprehensive review of this major architectural change. This PR introduces significant improvements to the Rivet platform by adding runner support to the pegboard protocol while removing the isolate-v8-runner implementation.

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze overall architecture and design decisions
  • Review actor2 workflow implementation
  • Examine runner management and protocol changes
  • Check security implications
  • Assess error handling and edge cases
  • Evaluate performance considerations
  • Review test coverage
  • Provide structured feedback and recommendations

🏗️ Architecture & Design

Strengths:

  • Unified Runner Protocol: The consolidation from separate isolate and container runners to a generic runner protocol is a solid architectural improvement
  • Multi-slot Support: The new slot-based allocation system (BuildAllocationType::Multi) enables better resource utilization
  • Actor2 Workflow: Clean separation between actor and actor2 workflows allows for incremental migration
  • Database Design: The new runner tracking in FDB with remaining/total slots provides efficient allocation lookups

Concerns:

  • Breaking Change Scope: Removing the entire isolate-v8-runner package without clear migration path could break existing JavaScript runtime functionality
  • Dual Workflow Complexity: Having both actor and actor2 workflows increases cognitive load and maintenance burden

🐛 Potential Bugs & Issues

Critical Issues

  1. Runner Allocation Logic Gap (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:255):

    // Missing remaining_slots > 0 check for multi-type builds
    if let BuildAllocationType::Multi = input.build_allocation_type {
        loop {
            // Should verify remaining_slots > 0 before allocation
  2. SQL Parameter Mismatch (packages/edge/infra/client/manager/src/runner/mod.rs:510):

    SET delete_ts = ?3  -- Should be ?2, missing bind parameter
    WHERE
  3. Race Condition in Runner Socket (packages/edge/infra/client/manager/src/ctx.rs:877):
    Runner socket connection could still occur between the connection check lines.

  4. Build Type Validation Error (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:1021):

    // build.allocation_type accessed before checking if build exists
    let build = unwrap_with!(build_res.builds.first(), BUILD_NOT_FOUND);
    // Later used without null check
    destroy::clear_ports_and_resources(..., build.allocation_type, ...)

Security Concerns

  1. Hardcoded UUIDs (packages/edge/infra/client/echo/src/main.rs:68):

    let payload = json!({
        "init": {
            "runner_id": Uuid::nil(), // Should be unique UUID
        },
    });
  2. Missing Input Validation (packages/edge/infra/client/echo/src/main.rs:107):
    No error handling for missing actor_id or generation fields in packets.

🚀 Performance Considerations

Improvements:

  • FDB Allocation Strategy: Using SNAPSHOT reads for allocation prevents conflicts while maintaining consistency
  • Bin-packing Algorithm: Reverse ordering in allocation queries optimizes resource utilization
  • Connection Pooling: WebSocket connection management looks efficient

Potential Issues:

  • Tokio "full" Feature: packages/edge/infra/client/echo/Cargo.toml enables all Tokio features unnecessarily
  • Allocation Loop Performance: The multi-slot allocation loop could benefit from early termination checks
  • Ping Thread Cleanup: Ping threads continue running even after main WebSocket connections close

🔒 Security Assessment

Positive:

  • Process Isolation: Container runner maintains proper process separation
  • Signal Handling: Appropriate use of Unix signals for process management
  • Resource Bounds: Memory and CPU limits properly enforced

Concerns:

  • Error Exposure: WebSocket error messages might leak internal system information
  • Resource Exhaustion: Missing rate limiting could allow unconstrained logging (removed throttle.rs)

🧪 Test Coverage

Missing Tests:

  • Runner Lifecycle: No comprehensive tests for runner creation, allocation, and cleanup
  • Error Scenarios: Limited testing of failure modes and edge cases
  • Protocol Validation: Missing tests for malformed protocol messages
  • Multi-slot Allocation: No tests for the new slot-based allocation system

Existing Tests:

  • Basic client state tests exist but assume generation 0
  • Integration tests are minimal given the scope of changes

📝 Code Quality Issues

Style & Maintainability

  1. Inconsistent Naming: actor2 should be actor_v2 for clarity
  2. TODO Overuse: Multiple unresolved TODOs in critical paths
  3. Deprecated Fields: Multiple deprecated protocol fields still marked as required
  4. Comment Accuracy: TotalSlotsKey comment says "MiB" but measures slots

Error Handling

  1. Incomplete Error Paths: Several unwrap() calls that should handle errors gracefully
  2. Missing Context: Error messages lack sufficient debugging information
  3. Inconsistent Returns: Some functions return Ok(()) where errors should be propagated

🔧 Recommendations

Immediate Fixes Required

  1. Fix SQL Parameter Binding: Correct the missing parameter in runner deletion query
  2. Add Remaining Slots Check: Validate slot availability before allocation attempts
  3. Improve Error Handling: Replace unwrap() calls with proper error handling
  4. Fix UUID Generation: Use Uuid::new_v4() instead of Uuid::nil() for runner IDs

Architecture Improvements

  1. Migration Strategy: Provide clear documentation for migrating from isolate-v8-runner
  2. Test Coverage: Add comprehensive integration tests for the new runner system
  3. Monitoring: Implement metrics for runner allocation success/failure rates
  4. Documentation: Update API documentation to reflect the new allocation types

Performance Optimizations

  1. Resource Cleanup: Implement proper cleanup for ping threads and WebSocket connections
  2. Feature Minimization: Use specific Tokio features instead of "full"
  3. Allocation Efficiency: Add early termination conditions in allocation loops

✅ Overall Assessment

This PR represents a significant and necessary architectural improvement that consolidates runner management and enables multi-slot allocation. While the core design is sound, several critical bugs and missing error handling need to be addressed before merging.

Verdict: Approve with required changes - The architectural direction is correct, but the implementation needs refinement in error handling, test coverage, and bug fixes identified above.

Priority Issues to Address:

  1. SQL parameter binding fix (High)
  2. Remaining slots validation (High)
  3. Error handling improvements (High)
  4. Test coverage expansion (Medium)
  5. Documentation updates (Medium)

@MasterPtato MasterPtato force-pushed the 05-27-chore_pegboard_add_exit_code_log_from_container_runner branch from c6943e6 to d651a8c Compare July 11, 2025 22:23
@MasterPtato MasterPtato force-pushed the 05-08-feat_add_runners_to_pb_protocol branch from 0074232 to c518c13 Compare July 11, 2025 22:23
Copy link

claude bot commented Jul 11, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add runners to pb protocol

I've completed a comprehensive review of this major architectural change. This PR introduces significant improvements to the Rivet platform by adding runner support to the pegboard protocol while removing the isolate-v8-runner implementation.

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze overall architecture and design decisions
  • Review actor2 workflow implementation
  • Examine runner management and protocol changes
  • Check security implications
  • Assess error handling and edge cases
  • Evaluate performance considerations
  • Review test coverage
  • Provide structured feedback and recommendations

🏗️ Architecture & Design

Strengths:

  • Unified Runner Protocol: The consolidation from separate isolate and container runners to a generic runner protocol is a solid architectural improvement
  • Multi-slot Support: The new slot-based allocation system (BuildAllocationType::Multi) enables better resource utilization
  • Actor2 Workflow: Clean separation between actor and actor2 workflows allows for incremental migration
  • Database Design: The new runner tracking in FDB with remaining/total slots provides efficient allocation lookups

Concerns:

  • Breaking Change Scope: Removing the entire isolate-v8-runner package without clear migration path could break existing JavaScript runtime functionality
  • Dual Workflow Complexity: Having both actor and actor2 workflows increases cognitive load and maintenance burden

🐛 Potential Bugs & Issues

Critical Issues

  1. Runner Allocation Logic Gap (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:255):

    // Missing remaining_slots > 0 check for multi-type builds
    if let BuildAllocationType::Multi = input.build_allocation_type {
        loop {
            // Should verify remaining_slots > 0 before allocation
  2. SQL Parameter Mismatch (packages/edge/infra/client/manager/src/runner/mod.rs:510):

    SET delete_ts = ?3  -- Should be ?2, missing bind parameter
    WHERE
  3. Race Condition in Runner Socket (packages/edge/infra/client/manager/src/ctx.rs:877):
    Runner socket connection could still occur between the connection check lines.

  4. Build Type Validation Error (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:1021):

    // build.allocation_type accessed before checking if build exists
    let build = unwrap_with!(build_res.builds.first(), BUILD_NOT_FOUND);
    // Later used without null check
    destroy::clear_ports_and_resources(..., build.allocation_type, ...)

Security Concerns

  1. Hardcoded UUIDs (packages/edge/infra/client/echo/src/main.rs:68):

    let payload = json!({
        "init": {
            "runner_id": Uuid::nil(), // Should be unique UUID
        },
    });
  2. Missing Input Validation (packages/edge/infra/client/echo/src/main.rs:107):
    No error handling for missing actor_id or generation fields in packets.

🚀 Performance Considerations

Improvements:

  • FDB Allocation Strategy: Using SNAPSHOT reads for allocation prevents conflicts while maintaining consistency
  • Bin-packing Algorithm: Reverse ordering in allocation queries optimizes resource utilization
  • Connection Pooling: WebSocket connection management looks efficient

Potential Issues:

  • Tokio "full" Feature: packages/edge/infra/client/echo/Cargo.toml enables all Tokio features unnecessarily
  • Allocation Loop Performance: The multi-slot allocation loop could benefit from early termination checks
  • Ping Thread Cleanup: Ping threads continue running even after main WebSocket connections close

🔒 Security Assessment

Positive:

  • Process Isolation: Container runner maintains proper process separation
  • Signal Handling: Appropriate use of Unix signals for process management
  • Resource Bounds: Memory and CPU limits properly enforced

Concerns:

  • Error Exposure: WebSocket error messages might leak internal system information
  • Resource Exhaustion: Missing rate limiting could allow unconstrained logging (removed throttle.rs)

🧪 Test Coverage

Missing Tests:

  • Runner Lifecycle: No comprehensive tests for runner creation, allocation, and cleanup
  • Error Scenarios: Limited testing of failure modes and edge cases
  • Protocol Validation: Missing tests for malformed protocol messages
  • Multi-slot Allocation: No tests for the new slot-based allocation system

Existing Tests:

  • Basic client state tests exist but assume generation 0
  • Integration tests are minimal given the scope of changes

📝 Code Quality Issues

Style & Maintainability

  1. Inconsistent Naming: actor2 should be actor_v2 for clarity
  2. TODO Overuse: Multiple unresolved TODOs in critical paths
  3. Deprecated Fields: Multiple deprecated protocol fields still marked as required
  4. Comment Accuracy: TotalSlotsKey comment says "MiB" but measures slots

Error Handling

  1. Incomplete Error Paths: Several unwrap() calls that should handle errors gracefully
  2. Missing Context: Error messages lack sufficient debugging information
  3. Inconsistent Returns: Some functions return Ok(()) where errors should be propagated

🔧 Recommendations

Immediate Fixes Required

  1. Fix SQL Parameter Binding: Correct the missing parameter in runner deletion query
  2. Add Remaining Slots Check: Validate slot availability before allocation attempts
  3. Improve Error Handling: Replace unwrap() calls with proper error handling
  4. Fix UUID Generation: Use Uuid::new_v4() instead of Uuid::nil() for runner IDs

Architecture Improvements

  1. Migration Strategy: Provide clear documentation for migrating from isolate-v8-runner
  2. Test Coverage: Add comprehensive integration tests for the new runner system
  3. Monitoring: Implement metrics for runner allocation success/failure rates
  4. Documentation: Update API documentation to reflect the new allocation types

Performance Optimizations

  1. Resource Cleanup: Implement proper cleanup for ping threads and WebSocket connections
  2. Feature Minimization: Use specific Tokio features instead of "full"
  3. Allocation Efficiency: Add early termination conditions in allocation loops

✅ Overall Assessment

This PR represents a significant and necessary architectural improvement that consolidates runner management and enables multi-slot allocation. While the core design is sound, several critical bugs and missing error handling need to be addressed before merging.

Verdict: Approve with required changes - The architectural direction is correct, but the implementation needs refinement in error handling, test coverage, and bug fixes identified above.

Priority Issues to Address:

  1. SQL parameter binding fix (High)
  2. Remaining slots validation (High)
  3. Error handling improvements (High)
  4. Test coverage expansion (Medium)
  5. Documentation updates (Medium)

Copy link

claude bot commented Jul 11, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add runners to pb protocol

I've completed a comprehensive review of this major architectural change. This PR introduces significant improvements to the Rivet platform by adding runner support to the pegboard protocol while removing the isolate-v8-runner implementation.

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze overall architecture and design decisions
  • Review actor2 workflow implementation
  • Examine runner management and protocol changes
  • Check security implications
  • Assess error handling and edge cases
  • Evaluate performance considerations
  • Review test coverage
  • Provide structured feedback and recommendations

🏗️ Architecture & Design

Strengths:

  • Unified Runner Protocol: The consolidation from separate isolate and container runners to a generic runner protocol is a solid architectural improvement
  • Multi-slot Support: The new slot-based allocation system (BuildAllocationType::Multi) enables better resource utilization
  • Actor2 Workflow: Clean separation between actor and actor2 workflows allows for incremental migration
  • Database Design: The new runner tracking in FDB with remaining/total slots provides efficient allocation lookups

Concerns:

  • Breaking Change Scope: Removing the entire isolate-v8-runner package without clear migration path could break existing JavaScript runtime functionality
  • Dual Workflow Complexity: Having both actor and actor2 workflows increases cognitive load and maintenance burden

🐛 Potential Bugs & Issues

Critical Issues

  1. Runner Allocation Logic Gap (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:255):

    // Missing remaining_slots > 0 check for multi-type builds
    if let BuildAllocationType::Multi = input.build_allocation_type {
        loop {
            // Should verify remaining_slots > 0 before allocation
  2. SQL Parameter Mismatch (packages/edge/infra/client/manager/src/runner/mod.rs:510):

    SET delete_ts = ?3  -- Should be ?2, missing bind parameter
    WHERE
  3. Race Condition in Runner Socket (packages/edge/infra/client/manager/src/ctx.rs:877):
    Runner socket connection could still occur between the connection check lines.

  4. Build Type Validation Error (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:1021):

    // build.allocation_type accessed before checking if build exists
    let build = unwrap_with!(build_res.builds.first(), BUILD_NOT_FOUND);
    // Later used without null check
    destroy::clear_ports_and_resources(..., build.allocation_type, ...)

Security Concerns

  1. Hardcoded UUIDs (packages/edge/infra/client/echo/src/main.rs:68):

    let payload = json!({
        "init": {
            "runner_id": Uuid::nil(), // Should be unique UUID
        },
    });
  2. Missing Input Validation (packages/edge/infra/client/echo/src/main.rs:107):
    No error handling for missing actor_id or generation fields in packets.

🚀 Performance Considerations

Improvements:

  • FDB Allocation Strategy: Using SNAPSHOT reads for allocation prevents conflicts while maintaining consistency
  • Bin-packing Algorithm: Reverse ordering in allocation queries optimizes resource utilization
  • Connection Pooling: WebSocket connection management looks efficient

Potential Issues:

  • Tokio "full" Feature: packages/edge/infra/client/echo/Cargo.toml enables all Tokio features unnecessarily
  • Allocation Loop Performance: The multi-slot allocation loop could benefit from early termination checks
  • Ping Thread Cleanup: Ping threads continue running even after main WebSocket connections close

🔒 Security Assessment

Positive:

  • Process Isolation: Container runner maintains proper process separation
  • Signal Handling: Appropriate use of Unix signals for process management
  • Resource Bounds: Memory and CPU limits properly enforced

Concerns:

  • Error Exposure: WebSocket error messages might leak internal system information
  • Resource Exhaustion: Missing rate limiting could allow unconstrained logging (removed throttle.rs)

🧪 Test Coverage

Missing Tests:

  • Runner Lifecycle: No comprehensive tests for runner creation, allocation, and cleanup
  • Error Scenarios: Limited testing of failure modes and edge cases
  • Protocol Validation: Missing tests for malformed protocol messages
  • Multi-slot Allocation: No tests for the new slot-based allocation system

Existing Tests:

  • Basic client state tests exist but assume generation 0
  • Integration tests are minimal given the scope of changes

📝 Code Quality Issues

Style & Maintainability

  1. Inconsistent Naming: actor2 should be actor_v2 for clarity
  2. TODO Overuse: Multiple unresolved TODOs in critical paths
  3. Deprecated Fields: Multiple deprecated protocol fields still marked as required
  4. Comment Accuracy: TotalSlotsKey comment says "MiB" but measures slots

Error Handling

  1. Incomplete Error Paths: Several unwrap() calls that should handle errors gracefully
  2. Missing Context: Error messages lack sufficient debugging information
  3. Inconsistent Returns: Some functions return Ok(()) where errors should be propagated

🔧 Recommendations

Immediate Fixes Required

  1. Fix SQL Parameter Binding: Correct the missing parameter in runner deletion query
  2. Add Remaining Slots Check: Validate slot availability before allocation attempts
  3. Improve Error Handling: Replace unwrap() calls with proper error handling
  4. Fix UUID Generation: Use Uuid::new_v4() instead of Uuid::nil() for runner IDs

Architecture Improvements

  1. Migration Strategy: Provide clear documentation for migrating from isolate-v8-runner
  2. Test Coverage: Add comprehensive integration tests for the new runner system
  3. Monitoring: Implement metrics for runner allocation success/failure rates
  4. Documentation: Update API documentation to reflect the new allocation types

Performance Optimizations

  1. Resource Cleanup: Implement proper cleanup for ping threads and WebSocket connections
  2. Feature Minimization: Use specific Tokio features instead of "full"
  3. Allocation Efficiency: Add early termination conditions in allocation loops

✅ Overall Assessment

This PR represents a significant and necessary architectural improvement that consolidates runner management and enables multi-slot allocation. While the core design is sound, several critical bugs and missing error handling need to be addressed before merging.

Verdict: Approve with required changes - The architectural direction is correct, but the implementation needs refinement in error handling, test coverage, and bug fixes identified above.

Priority Issues to Address:

  1. SQL parameter binding fix (High)
  2. Remaining slots validation (High)
  3. Error handling improvements (High)
  4. Test coverage expansion (Medium)
  5. Documentation updates (Medium)

@MasterPtato MasterPtato force-pushed the 05-08-feat_add_runners_to_pb_protocol branch from c518c13 to 6c559a1 Compare July 12, 2025 00:40
@MasterPtato MasterPtato force-pushed the 05-27-chore_pegboard_add_exit_code_log_from_container_runner branch from d651a8c to 017878a Compare July 12, 2025 00:40
Copy link

claude bot commented Jul 12, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add runners to pb protocol

I've completed a comprehensive review of this major architectural change. This PR introduces significant improvements to the Rivet platform by adding runner support to the pegboard protocol while removing the isolate-v8-runner implementation.

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze overall architecture and design decisions
  • Review actor2 workflow implementation
  • Examine runner management and protocol changes
  • Check security implications
  • Assess error handling and edge cases
  • Evaluate performance considerations
  • Review test coverage
  • Provide structured feedback and recommendations

🏗️ Architecture & Design

Strengths:

  • Unified Runner Protocol: The consolidation from separate isolate and container runners to a generic runner protocol is a solid architectural improvement
  • Multi-slot Support: The new slot-based allocation system (BuildAllocationType::Multi) enables better resource utilization
  • Actor2 Workflow: Clean separation between actor and actor2 workflows allows for incremental migration
  • Database Design: The new runner tracking in FDB with remaining/total slots provides efficient allocation lookups

Concerns:

  • Breaking Change Scope: Removing the entire isolate-v8-runner package without clear migration path could break existing JavaScript runtime functionality
  • Dual Workflow Complexity: Having both actor and actor2 workflows increases cognitive load and maintenance burden

🐛 Potential Bugs & Issues

Critical Issues

  1. Runner Allocation Logic Gap (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:255):

    // Missing remaining_slots > 0 check for multi-type builds
    if let BuildAllocationType::Multi = input.build_allocation_type {
        loop {
            // Should verify remaining_slots > 0 before allocation
  2. SQL Parameter Mismatch (packages/edge/infra/client/manager/src/runner/mod.rs:510):

    SET delete_ts = ?3  -- Should be ?2, missing bind parameter
    WHERE
  3. Race Condition in Runner Socket (packages/edge/infra/client/manager/src/ctx.rs:877):
    Runner socket connection could still occur between the connection check lines.

  4. Build Type Validation Error (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:1021):

    // build.allocation_type accessed before checking if build exists
    destroy::clear_ports_and_resources(
        input.actor_id,
        input.image_id,
        Some(build.allocation_type), // Potential panic

Security Concerns

  1. Hardcoded UUIDs (packages/edge/infra/client/echo/src/main.rs:68):

    let payload = json!({
        "init": {
            "runner_id": Uuid::nil(), // Should be unique UUID
        },
    });
  2. Missing Input Validation (packages/edge/infra/client/echo/src/main.rs:107):
    No error handling for missing actor_id or generation fields in packets.

🚀 Performance Considerations

Improvements:

  • FDB Allocation Strategy: Using SNAPSHOT reads for allocation prevents conflicts while maintaining consistency
  • Bin-packing Algorithm: Reverse ordering in allocation queries optimizes resource utilization
  • Connection Pooling: WebSocket connection management looks efficient

Potential Issues:

  • Tokio "full" Feature: packages/edge/infra/client/echo/Cargo.toml enables all Tokio features unnecessarily
  • Allocation Loop Performance: The multi-slot allocation loop could benefit from early termination checks
  • Ping Thread Cleanup: Ping threads continue running even after main WebSocket connections close

🔒 Security Assessment

Positive:

  • Process Isolation: Container runner maintains proper process separation
  • Signal Handling: Appropriate use of Unix signals for process management
  • Resource Bounds: Memory and CPU limits properly enforced

Concerns:

  • Error Exposure: WebSocket error messages might leak internal system information
  • Resource Exhaustion: Missing rate limiting could allow unconstrained logging (removed throttle.rs)

🧪 Test Coverage

Missing Tests:

  • Runner Lifecycle: No comprehensive tests for runner creation, allocation, and cleanup
  • Error Scenarios: Limited testing of failure modes and edge cases
  • Protocol Validation: Missing tests for malformed protocol messages
  • Multi-slot Allocation: No tests for the new slot-based allocation system

Existing Tests:

  • Basic client state tests exist but assume generation 0
  • Integration tests are minimal given the scope of changes

📝 Code Quality Issues

Style & Maintainability

  1. Inconsistent Naming: actor2 should be actor_v2 for clarity
  2. TODO Overuse: Multiple unresolved TODOs in critical paths
  3. Deprecated Fields: Multiple deprecated protocol fields still marked as required
  4. Comment Accuracy: TotalSlotsKey comment says "MiB" but measures slots

Error Handling

  1. Incomplete Error Paths: Several unwrap() calls that should handle errors gracefully
  2. Missing Context: Error messages lack sufficient debugging information
  3. Inconsistent Returns: Some functions return Ok(()) where errors should be propagated

🔧 Recommendations

Immediate Fixes Required

  1. Fix SQL Parameter Binding: Correct the missing parameter in runner deletion query
  2. Add Remaining Slots Check: Validate slot availability before allocation attempts
  3. Improve Error Handling: Replace unwrap() calls with proper error handling
  4. Fix UUID Generation: Use Uuid::new_v4() instead of Uuid::nil() for runner IDs

Architecture Improvements

  1. Migration Strategy: Provide clear documentation for migrating from isolate-v8-runner
  2. Test Coverage: Add comprehensive integration tests for the new runner system
  3. Monitoring: Implement metrics for runner allocation success/failure rates
  4. Documentation: Update API documentation to reflect the new allocation types

Performance Optimizations

  1. Resource Cleanup: Implement proper cleanup for ping threads and WebSocket connections
  2. Feature Minimization: Use specific Tokio features instead of "full"
  3. Allocation Efficiency: Add early termination conditions in allocation loops

✅ Overall Assessment

This PR represents a significant and necessary architectural improvement that consolidates runner management and enables multi-slot allocation. While the core design is sound, several critical bugs and missing error handling need to be addressed before merging.

Verdict: Approve with required changes - The architectural direction is correct, but the implementation needs refinement in error handling, test coverage, and bug fixes identified above.

Priority Issues to Address:

  1. SQL parameter binding fix (High)
  2. Remaining slots validation (High)
  3. Error handling improvements (High)
  4. Test coverage expansion (Medium)
  5. Documentation updates (Medium)

Copy link

claude bot commented Jul 12, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add runners to pb protocol

I've completed a comprehensive review of this major architectural change. This PR introduces significant improvements to the Rivet platform by adding runner support to the pegboard protocol while removing the isolate-v8-runner implementation.

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze overall architecture and design decisions
  • Review actor2 workflow implementation
  • Examine runner management and protocol changes
  • Check security implications
  • Assess error handling and edge cases
  • Evaluate performance considerations
  • Review test coverage
  • Provide structured feedback and recommendations

🏗️ Architecture & Design

Strengths:

  • Unified Runner Protocol: The consolidation from separate isolate and container runners to a generic runner protocol is a solid architectural improvement
  • Multi-slot Support: The new slot-based allocation system (BuildAllocationType::Multi) enables better resource utilization
  • Actor2 Workflow: Clean separation between actor and actor2 workflows allows for incremental migration
  • Database Design: The new runner tracking in FDB with remaining/total slots provides efficient allocation lookups

Concerns:

  • Breaking Change Scope: Removing the entire isolate-v8-runner package without clear migration path could break existing JavaScript runtime functionality
  • Dual Workflow Complexity: Having both actor and actor2 workflows increases cognitive load and maintenance burden

🐛 Potential Bugs & Issues

Critical Issues

  1. Runner Allocation Logic Gap (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:255):

    // Missing remaining_slots > 0 check for multi-type builds
    if let BuildAllocationType::Multi = input.build_allocation_type {
        loop {
            // Should verify remaining_slots > 0 before allocation
  2. SQL Parameter Mismatch (packages/edge/infra/client/manager/src/runner/mod.rs:510):

    SET delete_ts = ?3  -- Should be ?2, missing bind parameter
    WHERE
  3. Race Condition in Runner Socket (packages/edge/infra/client/manager/src/ctx.rs:877):
    Runner socket connection could still occur between the connection check lines.

  4. Build Type Validation Error (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:1021):

    // build.allocation_type accessed before checking if build exists
    let build = unwrap_with!(build_res.builds.first(), BUILD_NOT_FOUND);
    // Later used without null check
    destroy::clear_ports_and_resources(..., build.allocation_type, ...)

Security Concerns

  1. Hardcoded UUIDs (packages/edge/infra/client/echo/src/main.rs:68):

    let payload = json!({
        "init": {
            "runner_id": Uuid::nil(), // Should be unique UUID
        },
    });
  2. Missing Input Validation (packages/edge/infra/client/echo/src/main.rs:107):
    No error handling for missing actor_id or generation fields in packets.

🚀 Performance Considerations

Improvements:

  • FDB Allocation Strategy: Using SNAPSHOT reads for allocation prevents conflicts while maintaining consistency
  • Bin-packing Algorithm: Reverse ordering in allocation queries optimizes resource utilization
  • Connection Pooling: WebSocket connection management looks efficient

Potential Issues:

  • Tokio "full" Feature: packages/edge/infra/client/echo/Cargo.toml enables all Tokio features unnecessarily
  • Allocation Loop Performance: The multi-slot allocation loop could benefit from early termination checks
  • Ping Thread Cleanup: Ping threads continue running even after main WebSocket connections close

🔒 Security Assessment

Positive:

  • Process Isolation: Container runner maintains proper process separation
  • Signal Handling: Appropriate use of Unix signals for process management
  • Resource Bounds: Memory and CPU limits properly enforced

Concerns:

  • Error Exposure: WebSocket error messages might leak internal system information
  • Resource Exhaustion: Missing rate limiting could allow unconstrained logging (removed throttle.rs)

🧪 Test Coverage

Missing Tests:

  • Runner Lifecycle: No comprehensive tests for runner creation, allocation, and cleanup
  • Error Scenarios: Limited testing of failure modes and edge cases
  • Protocol Validation: Missing tests for malformed protocol messages
  • Multi-slot Allocation: No tests for the new slot-based allocation system

Existing Tests:

  • Basic client state tests exist but assume generation 0
  • Integration tests are minimal given the scope of changes

📝 Code Quality Issues

Style & Maintainability

  1. Inconsistent Naming: actor2 should be actor_v2 for clarity
  2. TODO Overuse: Multiple unresolved TODOs in critical paths
  3. Deprecated Fields: Multiple deprecated protocol fields still marked as required
  4. Comment Accuracy: TotalSlotsKey comment says "MiB" but measures slots

Error Handling

  1. Incomplete Error Paths: Several unwrap() calls that should handle errors gracefully
  2. Missing Context: Error messages lack sufficient debugging information
  3. Inconsistent Returns: Some functions return Ok(()) where errors should be propagated

🔧 Recommendations

Immediate Fixes Required

  1. Fix SQL Parameter Binding: Correct the missing parameter in runner deletion query
  2. Add Remaining Slots Check: Validate slot availability before allocation attempts
  3. Improve Error Handling: Replace unwrap() calls with proper error handling
  4. Fix UUID Generation: Use Uuid::new_v4() instead of Uuid::nil() for runner IDs

Architecture Improvements

  1. Migration Strategy: Provide clear documentation for migrating from isolate-v8-runner
  2. Test Coverage: Add comprehensive integration tests for the new runner system
  3. Monitoring: Implement metrics for runner allocation success/failure rates
  4. Documentation: Update API documentation to reflect the new allocation types

Performance Optimizations

  1. Resource Cleanup: Implement proper cleanup for ping threads and WebSocket connections
  2. Feature Minimization: Use specific Tokio features instead of "full"
  3. Allocation Efficiency: Add early termination conditions in allocation loops

✅ Overall Assessment

This PR represents a significant and necessary architectural improvement that consolidates runner management and enables multi-slot allocation. While the core design is sound, several critical bugs and missing error handling need to be addressed before merging.

Verdict: Approve with required changes - The architectural direction is correct, but the implementation needs refinement in error handling, test coverage, and bug fixes identified above.

Priority Issues to Address:

  1. SQL parameter binding fix (High)
  2. Remaining slots validation (High)
  3. Error handling improvements (High)
  4. Test coverage expansion (Medium)
  5. Documentation updates (Medium)

@MasterPtato MasterPtato force-pushed the 05-27-chore_pegboard_add_exit_code_log_from_container_runner branch from 017878a to d6072b5 Compare July 18, 2025 18:51
@MasterPtato MasterPtato force-pushed the 05-08-feat_add_runners_to_pb_protocol branch from 6c559a1 to cdf9857 Compare July 18, 2025 18:51
Copy link

vercel bot commented Jul 18, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
rivet-site ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 18, 2025 7:04pm

Copy link

claude bot commented Jul 18, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add runners to pb protocol

I've completed a comprehensive review of this major architectural change. This PR introduces significant improvements to the Rivet platform by adding runner support to the pegboard protocol while removing the isolate-v8-runner implementation.

Review Tasks:

  • Read repository guidelines (CLAUDE.md)
  • Analyze overall architecture and design decisions
  • Review specific code changes and implementation details
  • Assess security implications and potential vulnerabilities
  • Evaluate performance considerations
  • Check test coverage and quality assurance
  • Provide structured feedback and recommendations

🏗️ Architecture & Design

Strengths:

  • Unified Runner Protocol: The consolidation from separate isolate and container runners to a generic runner protocol is a solid architectural improvement
  • Multi-slot Support: The new slot-based allocation system (BuildAllocationType::Multi) enables better resource utilization
  • Actor2 Workflow: Clean separation between actor and actor2 workflows allows for incremental migration
  • Database Design: The new runner tracking in FDB with remaining/total slots provides efficient allocation lookups

Concerns:

  • Breaking Change Scope: Removing the entire isolate-v8-runner package without clear migration path could break existing JavaScript runtime functionality
  • Dual Workflow Complexity: Having both actor and actor2 workflows increases cognitive load and maintenance burden

🐛 Potential Bugs & Issues

Critical Issues

  1. Runner Allocation Logic Gap (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:255):

    // Missing remaining_slots > 0 check for multi-type builds
    if let BuildAllocationType::Multi = input.build_allocation_type {
        loop {
            // Should verify remaining_slots > 0 before allocation
  2. SQL Parameter Mismatch (packages/edge/infra/client/manager/src/runner/mod.rs:510):

    SET delete_ts = ?3  -- Should be ?2, missing bind parameter
    WHERE
  3. Race Condition in Runner Socket (packages/edge/infra/client/manager/src/ctx.rs:877):
    Runner socket connection could still occur between the connection check lines.

  4. Build Type Validation Error (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:1021):

    // build.allocation_type accessed before checking if build exists
    let build = unwrap_with!(build_res.builds.first(), BUILD_NOT_FOUND);
    // Later used without null check
    destroy::clear_ports_and_resources(..., build.allocation_type, ...)

Security Concerns

  1. Hardcoded UUIDs (packages/edge/infra/client/echo/src/main.rs:68):

    let payload = json!({
        "init": {
            "runner_id": Uuid::nil(), // Should be unique UUID
        },
    });
  2. Missing Input Validation (packages/edge/infra/client/echo/src/main.rs:107):
    No error handling for missing actor_id or generation fields in packets.

🚀 Performance Considerations

Improvements:

  • FDB Allocation Strategy: Using SNAPSHOT reads for allocation prevents conflicts while maintaining consistency
  • Bin-packing Algorithm: Reverse ordering in allocation queries optimizes resource utilization
  • Connection Pooling: WebSocket connection management looks efficient

Potential Issues:

  • Tokio "full" Feature: packages/edge/infra/client/echo/Cargo.toml enables all Tokio features unnecessarily
  • Allocation Loop Performance: The multi-slot allocation loop could benefit from early termination checks
  • Ping Thread Cleanup: Ping threads continue running even after main WebSocket connections close

🔒 Security Assessment

Positive:

  • Process Isolation: Container runner maintains proper process separation
  • Signal Handling: Appropriate use of Unix signals for process management
  • Resource Bounds: Memory and CPU limits properly enforced

Concerns:

  • Error Exposure: WebSocket error messages might leak internal system information
  • Resource Exhaustion: Missing rate limiting could allow unconstrained logging (removed throttle.rs)

🧪 Test Coverage

Missing Tests:

  • Runner Lifecycle: No comprehensive tests for runner creation, allocation, and cleanup
  • Error Scenarios: Limited testing of failure modes and edge cases
  • Protocol Validation: Missing tests for malformed protocol messages
  • Multi-slot Allocation: No tests for the new slot-based allocation system

Existing Tests:

  • Basic client state tests exist but assume generation 0
  • Integration tests are minimal given the scope of changes

📝 Code Quality Issues

Style & Maintainability

  1. Inconsistent Naming: actor2 should be actor_v2 for clarity
  2. TODO Overuse: Multiple unresolved TODOs in critical paths
  3. Deprecated Fields: Multiple deprecated protocol fields still marked as required
  4. Comment Accuracy: TotalSlotsKey comment says "MiB" but measures slots

Error Handling

  1. Incomplete Error Paths: Several unwrap() calls that should handle errors gracefully
  2. Missing Context: Error messages lack sufficient debugging information
  3. Inconsistent Returns: Some functions return Ok(()) where errors should be propagated

🔧 Recommendations

Immediate Fixes Required

  1. Fix SQL Parameter Binding: Correct the missing parameter in runner deletion query
  2. Add Remaining Slots Check: Validate slot availability before allocation attempts
  3. Improve Error Handling: Replace unwrap() calls with proper error handling
  4. Fix UUID Generation: Use Uuid::new_v4() instead of Uuid::nil() for runner IDs

Architecture Improvements

  1. Migration Strategy: Provide clear documentation for migrating from isolate-v8-runner
  2. Test Coverage: Add comprehensive integration tests for the new runner system
  3. Monitoring: Implement metrics for runner allocation success/failure rates
  4. Documentation: Update API documentation to reflect the new allocation types

Performance Optimizations

  1. Resource Cleanup: Implement proper cleanup for ping threads and WebSocket connections
  2. Feature Minimization: Use specific Tokio features instead of "full"
  3. Allocation Efficiency: Add early termination conditions in allocation loops

✅ Overall Assessment

This PR represents a significant and necessary architectural improvement that consolidates runner management and enables multi-slot allocation. While the core design is sound, several critical bugs and missing error handling need to be addressed before merging.

Verdict: Approve with required changes - The architectural direction is correct, but the implementation needs refinement in error handling, test coverage, and bug fixes identified above.

Priority Issues to Address:

  1. SQL parameter binding fix (High)
  2. Remaining slots validation (High)
  3. Error handling improvements (High)
  4. Test coverage expansion (Medium)
  5. Documentation updates (Medium)

Copy link

claude bot commented Jul 18, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add runners to pb protocol

I've completed a comprehensive review of this major architectural change. This PR introduces significant improvements to the Rivet platform by adding runner support to the pegboard protocol while removing the isolate-v8-runner implementation.

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze overall architecture and design decisions
  • Review actor2 workflow implementation
  • Examine runner management and protocol changes
  • Check security implications
  • Assess error handling and edge cases
  • Evaluate performance considerations
  • Review test coverage
  • Provide structured feedback and recommendations

🏗️ Architecture & Design

Strengths:

  • Unified Runner Protocol: The consolidation from separate isolate and container runners to a generic runner protocol is a solid architectural improvement
  • Multi-slot Support: The new slot-based allocation system (BuildAllocationType::Multi) enables better resource utilization
  • Actor2 Workflow: Clean separation between actor and actor2 workflows allows for incremental migration
  • Database Design: The new runner tracking in FDB with remaining/total slots provides efficient allocation lookups

Concerns:

  • Breaking Change Scope: Removing the entire isolate-v8-runner package without clear migration path could break existing JavaScript runtime functionality
  • Dual Workflow Complexity: Having both actor and actor2 workflows increases cognitive load and maintenance burden

🐛 Potential Bugs & Issues

Critical Issues

  1. Runner Allocation Logic Gap (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:255):

    // Missing remaining_slots > 0 check for multi-type builds
    if let BuildAllocationType::Multi = input.build_allocation_type {
        loop {
            // Should verify remaining_slots > 0 before allocation
  2. SQL Parameter Mismatch (packages/edge/infra/client/manager/src/runner/mod.rs:510):

    SET delete_ts = ?3  -- Should be ?2, missing bind parameter
    WHERE
  3. Race Condition in Runner Socket (packages/edge/infra/client/manager/src/ctx.rs:877):
    Runner socket connection could still occur between the connection check lines.

  4. Build Type Validation Error (packages/edge/services/pegboard/src/workflows/actor2/runtime.rs:1021):

    // build.allocation_type accessed before checking if build exists
    let build = unwrap_with!(build_res.builds.first(), BUILD_NOT_FOUND);
    // Later used without null check
    destroy::clear_ports_and_resources(..., build.allocation_type, ...)

Security Concerns

  1. Hardcoded UUIDs (packages/edge/infra/client/echo/src/main.rs:68):

    let payload = json!({
        "init": {
            "runner_id": Uuid::nil(), // Should be unique UUID
        },
    });
  2. Missing Input Validation (packages/edge/infra/client/echo/src/main.rs:107):
    No error handling for missing actor_id or generation fields in packets.

🚀 Performance Considerations

Improvements:

  • FDB Allocation Strategy: Using SNAPSHOT reads for allocation prevents conflicts while maintaining consistency
  • Bin-packing Algorithm: Reverse ordering in allocation queries optimizes resource utilization
  • Connection Pooling: WebSocket connection management looks efficient

Potential Issues:

  • Tokio "full" Feature: packages/edge/infra/client/echo/Cargo.toml enables all Tokio features unnecessarily
  • Allocation Loop Performance: The multi-slot allocation loop could benefit from early termination checks
  • Ping Thread Cleanup: Ping threads continue running even after main WebSocket connections close

🔒 Security Assessment

Positive:

  • Process Isolation: Container runner maintains proper process separation
  • Signal Handling: Appropriate use of Unix signals for process management
  • Resource Bounds: Memory and CPU limits properly enforced

Concerns:

  • Error Exposure: WebSocket error messages might leak internal system information
  • Resource Exhaustion: Missing rate limiting could allow unconstrained logging (removed throttle.rs)

🧪 Test Coverage

Missing Tests:

  • Runner Lifecycle: No comprehensive tests for runner creation, allocation, and cleanup
  • Error Scenarios: Limited testing of failure modes and edge cases
  • Protocol Validation: Missing tests for malformed protocol messages
  • Multi-slot Allocation: No tests for the new slot-based allocation system

Existing Tests:

  • Basic client state tests exist but assume generation 0
  • Integration tests are minimal given the scope of changes

📝 Code Quality Issues

Style & Maintainability

  1. Inconsistent Naming: actor2 should be actor_v2 for clarity
  2. TODO Overuse: Multiple unresolved TODOs in critical paths
  3. Deprecated Fields: Multiple deprecated protocol fields still marked as required
  4. Comment Accuracy: TotalSlotsKey comment says "MiB" but measures slots

Error Handling

  1. Incomplete Error Paths: Several unwrap() calls that should handle errors gracefully
  2. Missing Context: Error messages lack sufficient debugging information
  3. Inconsistent Returns: Some functions return Ok(()) where errors should be propagated

🔧 Recommendations

Immediate Fixes Required

  1. Fix SQL Parameter Binding: Correct the missing parameter in runner deletion query
  2. Add Remaining Slots Check: Validate slot availability before allocation attempts
  3. Improve Error Handling: Replace unwrap() calls with proper error handling
  4. Fix UUID Generation: Use Uuid::new_v4() instead of Uuid::nil() for runner IDs

Architecture Improvements

  1. Migration Strategy: Provide clear documentation for migrating from isolate-v8-runner
  2. Test Coverage: Add comprehensive integration tests for the new runner system
  3. Monitoring: Implement metrics for runner allocation success/failure rates
  4. Documentation: Update API documentation to reflect the new allocation types

Performance Optimizations

  1. Resource Cleanup: Implement proper cleanup for ping threads and WebSocket connections
  2. Feature Minimization: Use specific Tokio features instead of "full"
  3. Allocation Efficiency: Add early termination conditions in allocation loops

✅ Overall Assessment

This PR represents a significant and necessary architectural improvement that consolidates runner management and enables multi-slot allocation. While the core design is sound, several critical bugs and missing error handling need to be addressed before merging.

Verdict: Approve with required changes - The architectural direction is correct, but the implementation needs refinement in error handling, test coverage, and bug fixes identified above.

Priority Issues to Address:

  1. SQL parameter binding fix (High)
  2. Remaining slots validation (High)
  3. Error handling improvements (High)
  4. Test coverage expansion (Medium)
  5. Documentation updates (Medium)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants