📌 Context & Rationale for Shifting Architecture
Currently, the CoReason Rust backend has drifted from its original distributed architecture. The `engine` crate has implemented its own custom "OpenShell" WebAssembly sandboxing layer using `extism`.
This shift violates our core design principles and introduces several severe issues:
- Violation of Zero Waste & OSS Preference (`AGENTS.md` Rule 7): We are reinventing sandboxing, memory capping, and fuel metering inside `capability_allocator.rs`. Instead of building custom proprietary infrastructure, we should leverage established OSS alternatives (specifically, the official NVIDIA OpenShell gateway).
- Engine Bloat & Slow Compilations: Importing heavy WebAssembly execution runtimes into the `engine` crate bloats compilation and drastically increases our `cargo build` times (currently 10+ minutes). The execution engine should be purely stateless and lightweight.
- Developer Velocity Drop: Compiling capabilities into `.wasm` binaries using system-level languages slows down our Data Science teams. Reverting to the old architecture allows teams to author tools rapidly using standard Python functions and the simple `@mcp.tool()` FastMCP decorator.
- Loss of Distributed Scalability: The current WASM engine traps execution locally within the API gateway's memory. Reverting to MCP over NATS federation restores a truly horizontally scalable, decentralized worker mesh.
🗑️ Deprecation Tasks (Code to Remove)
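To clean up the heavy WASM dependencies and remove the custom sandbox logic, the following must be deleted:
- `crates/engine/Cargo.toml`: Remove the `extism = "1.21.0"` dependency to lighten the compiler footprint.
- `crates/engine/src/capability_allocator.rs`: Remove entirely (contains custom Extism Plugin instantiation and WASM attestation checks).
- `crates/engine/src/wasm_dispatcher.rs`: Remove entirely.
- `crates/server/src/runtime_routes.rs`: Delete the entire `match tool.as_str()` block inside `execute_capability` that natively fakes the execution of dummy tools.
- `crates/test-suite/`: Remove obsolete tests (`crates/test-suite/tests/e2e_swarm/openshell_wasm_cpu_sandboxing.rs` and Use Case 6 in `real_world_integration_tests.rs`).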
🏗️ Implementation Tasks (Code to Add)
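To restore the NATS Federation and MCP routing, add the following Rust infrastructure:
- `crates/server/src/runtime_routes.rs`: Convert the incoming HTTP `CapabilityExecutePayload` into a standard JSON-RPC request formatted as `{"method": "tools/call"}` and publish it asynchronously to the NATS broker using `async-nats`.
- `crates/server/src/mcp_gateway.rs`: Add a native Rust gateway service to handle dynamic NATS subject resolution (mapping incoming Tool URNs to `coreason.tool.<urn>.invoke`).
- `crates/engine/src/nvidia_openshell_client.rs`: Implement an asynchronous HTTP client to forward heavily sandboxed execution payloads completely outside the Docker mesh to the official NVIDIA OpenShell daemon natively running on the host machine.
- `crates/server/src/openshell_translator.rs`: Implement a bridge service to securely translate communications between the isolated NVIDIA OpenShell host daemon and the internal NATS message broker.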
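To make the MCP-over-NATS routing concrete, here is a minimal, standard-library-only sketch of two pieces: mapping a Tool URN to a `coreason.tool.<urn>.invoke` subject, and wrapping a call in a JSON-RPC 2.0 `tools/call` request. The function names, the sample URN shape, and the sanitization rule (replacing characters outside `[A-Za-z0-9_-]` with `_`, so the URN stays a single subject token) are assumptions for illustration, not the RFC's actual implementation. A real service would more likely build the JSON with `serde_json`.

```rust
/// Hypothetical helper: turn a tool URN into a single NATS subject token.
/// NATS treats `.` as a token separator and `*` / `>` as wildcards, and
/// subjects may not contain whitespace, so every character outside
/// [A-Za-z0-9_-] is replaced with `_` (an assumed convention).
fn sanitize_urn(urn: &str) -> String {
    urn.chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' || c == '_' { c } else { '_' })
        .collect()
}

/// Build the `coreason.tool.<urn>.invoke` subject for a tool URN.
fn tool_subject(urn: &str) -> String {
    format!("coreason.tool.{}.invoke", sanitize_urn(urn))
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Wrap a tool call in a JSON-RPC 2.0 `tools/call` request.
/// `arguments_json` must already be a serialized JSON object.
fn tools_call_request(id: u64, tool_name: &str, arguments_json: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"tools/call","params":{{"name":"{}","arguments":{}}}}}"#,
        id,
        json_escape(tool_name),
        arguments_json
    )
}
```

Under these assumptions, `tool_subject("urn:coreason:tool:web.search")` yields `coreason.tool.urn_coreason_tool_web_search.invoke`, and the string returned by `tools_call_request` would be the payload published to that subject through `async-nats`.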
🐳 Infrastructure Updates (Docker Compose)
The multi-container E2E test mesh must be updated to support the new distributed topology. Update crates/test-suite/tests/e2e_swarm/docker-compose.e2e.yaml with the following changes:
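```yaml
  openshell-manager:
    image: nvcr.io/nvidia/openshell:latest
    ports:
      - "8080:8080"
    environment:
      - OPENSHELL_ENV=local
    privileged: true # Required to instantiate secure kernel namespaces
```

Ensure both the `gateway` and `runtime` services have the target URL injected into their environments so they can discover the sandbox:

```yaml
    environment:
      - OPENSHELL_MANAGER_URL=http://openshell-manager:8080
```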
🧪 Verification & Integration Tests
To ensure the new multi-container architecture spins up correctly without getting trapped in a crash loop, we need a dedicated liveness test for the network topology.
`crates/test-suite/tests/e2e_swarm/docker_orchestration_liveness.rs`:

```rust
use reqwest::Client;
use std::time::Duration;

#[tokio::test]
async fn test_docker_mesh_liveness() {
    // Shared HTTP client; the 5-second timeout keeps a dead service from hanging the test.
    let client = Client::builder().timeout(Duration::from_secs(5)).build().unwrap();

    // 1. Verify Rust API Gateway Liveness
    let gateway_url = std::env::var("COREASON_GATEWAY_URL")
        .unwrap_or_else(|_| "http://localhost:8080".to_string());
    let res = client.get(format!("{}/health", gateway_url)).send().await;
    assert!(res.is_ok() && res.unwrap().status().is_success(), "Rust API Gateway unreachable.");

    // 2. Verify NATS Broker Connectivity (outer Result: timeout, inner Result: connection)
    let nats_url = std::env::var("NATS_URL")
        .unwrap_or_else(|_| "127.0.0.1:4222".to_string());
    let nats_client = tokio::time::timeout(Duration::from_secs(3), async_nats::connect(&nats_url)).await;
    assert!(nats_client.is_ok() && nats_client.unwrap().is_ok(), "NATS Broker unreachable.");

    // 3. Verify Python Sovereign LLM Proxy (Sidecar)
    let sidecar_url = std::env::var("PYTHON_SIDECAR_URL")
        .unwrap_or_else(|_| "http://localhost:8000".to_string());
    let res = client.get(format!("{}/api/v1/auth/status", sidecar_url)).send().await;
    assert!(res.is_ok() && res.unwrap().status().is_success(), "Python Sidecar unreachable.");

    // 4. Verify NVIDIA OpenShell Host Daemon
    // NOTE: this default shares port 8080 with the gateway default in step 1; set
    // OPENSHELL_MANAGER_URL or COREASON_GATEWAY_URL explicitly when both are exposed on one host.
    let openshell_url = std::env::var("OPENSHELL_MANAGER_URL")
        .unwrap_or_else(|_| "http://localhost:8080".to_string());
    let res = client.get(format!("{}/health", openshell_url)).send().await;
    assert!(res.is_ok() && res.unwrap().status().is_success(), "NVIDIA OpenShell Daemon unreachable.");
}
```
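Containers rarely become healthy the instant `up -d` returns, so a liveness check like the one above may benefit from polling each endpoint rather than probing it once. The following is a minimal sketch of such a retry loop with exponential backoff. `wait_until_ready` is a hypothetical helper that is not part of the RFC, and it is written synchronously with `std::thread::sleep` to stay dependency-free; inside a `#[tokio::test]` an async variant using `tokio::time::sleep` would be the better fit.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: call `probe` until it reports ready or the attempt
/// budget is exhausted, doubling the delay after each failed attempt.
/// Returns the 1-based attempt that succeeded, or `None` if none did.
fn wait_until_ready<F>(max_attempts: u32, initial_delay: Duration, mut probe: F) -> Option<u32>
where
    F: FnMut() -> bool,
{
    let mut delay = initial_delay;
    for attempt in 1..=max_attempts {
        if probe() {
            return Some(attempt);
        }
        // Sleep only when another attempt remains.
        if attempt < max_attempts {
            sleep(delay);
            delay = delay.saturating_mul(2);
        }
    }
    None
}
```

Each of the four checks could then be wrapped in a probe closure, with the assertion made on the returned `Option` instead of on a single request.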
🚀 Standard Developer Launch Procedure
Once this RFC is merged, the standard operating procedure for launching the local CoReason Swarm will be:
1. Build the network images:

   ```shell
   ./build_images.sh # Or .\build_images.ps1 on Windows
   ```

2. Launch the E2E Orchestration Mesh:

   ```shell
   docker-compose -f crates/test-suite/tests/e2e_swarm/docker-compose.e2e.yaml up -d
   ```

3. Verify the Mesh Liveness:

   ```shell
   cargo test -p test-suite test_docker_mesh_liveness -- --nocapture
   ```