[RFC/Refactor] Architectural Rollback: Re-adopt Distributed MCP over NATS & Official NVIDIA OpenShell (Deprecate Local WASM Engine) #16

@dk-uppi-aks

📌 Context & Rationale for Shifting Architecture

Currently, the CoReason Rust backend has drifted from its original distributed architecture. The engine crate has implemented its own custom "OpenShell" WebAssembly sandboxing layer using extism.

This shift violates our core design principles and introduces several severe issues:

  1. Violation of Zero Waste & OSS Preference (AGENTS.md Rule 7): We are reinventing sandboxing, memory capping, and fuel metering inside capability_allocator.rs. Instead of building custom proprietary infrastructure, we should be leveraging established OSS alternatives (specifically, the official NVIDIA OpenShell gateway).
  2. Engine Bloat & Slow Compilations: Pulling a heavy WebAssembly execution runtime into the engine crate inflates its dependency graph and drives up cargo build times (currently 10+ minutes). The execution engine should be stateless and lightweight.
  3. Developer Velocity Drop: Compiling capabilities into .wasm binaries using system-level languages slows down our Data Science teams. Reverting to the old architecture allows teams to author tools rapidly using standard Python functions and the simple @mcp.tool() FastMCP decorator.
  4. Loss of Distributed Scalability: The current WASM engine traps execution locally within the API gateway's memory. Reverting to MCP over NATS federation restores a truly horizontally scalable, decentralized worker mesh.

🗑️ Deprecation Tasks (Code to Remove)

To clean up the heavy WASM dependencies and remove the custom sandbox logic, the following must be deleted:

  • crates/engine/Cargo.toml: Remove the extism = "1.21.0" dependency to lighten the compiler footprint.
  • crates/engine/src/capability_allocator.rs: Remove entirely (contains custom Extism Plugin instantiation and WASM attestation checks).
  • crates/engine/src/wasm_dispatcher.rs: Remove entirely.
  • crates/server/src/runtime_routes.rs: Delete the entire match tool.as_str() block inside execute_capability, which hardcodes fake in-process execution of dummy tools.
  • crates/test-suite/: Remove obsolete tests (crates/test-suite/tests/e2e_swarm/openshell_wasm_cpu_sandboxing.rs and Use Case 6 in real_world_integration_tests.rs).

🏗️ Implementation Tasks (Code to Add)

To restore the NATS Federation and MCP routing, add the following Rust infrastructure:

  • Refactor Gateway Routing (crates/server/src/runtime_routes.rs): Convert the incoming HTTP CapabilityExecutePayload into a standard JSON-RPC 2.0 request with "method": "tools/call" (the MCP envelope has the shape {"jsonrpc": "2.0", "id": ..., "method": "tools/call", "params": {"name": ..., "arguments": {...}}}) and publish it asynchronously to the NATS broker using async-nats.
  • Create crates/server/src/mcp_gateway.rs: Add a native Rust gateway service to handle dynamic NATS subject resolution (mapping incoming Tool URNs to coreason.tool.<urn>.invoke).
  • Create crates/engine/src/nvidia_openshell_client.rs: Implement an asynchronous HTTP client that forwards execution payloads requiring heavy sandboxing outside the Docker mesh, to the official NVIDIA OpenShell daemon running natively on the host machine.
  • Create crates/server/src/openshell_translator.rs: Implement a bridge service to securely translate communications between the isolated NVIDIA OpenShell host daemon and the internal NATS message broker.
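
The subject resolution described for mcp_gateway.rs could look roughly like the sketch below. It is not the proposed implementation: the function name tool_subject, the SubjectError type, and the example URN are all hypothetical, and the RFC does not say how a URN containing dots should be handled. The sketch assumes the URN must fit in a single NATS subject token, so it rewrites "." to "_" and rejects whitespace and the NATS wildcard characters "*" and ">".

```rust
/// Hypothetical error type: why a tool URN cannot be embedded in a subject.
#[derive(Debug, PartialEq)]
enum SubjectError {
    Empty,
    IllegalChar(char),
}

/// Maps a tool URN to `coreason.tool.<urn>.invoke`.
///
/// Assumption (not stated in the RFC): the URN occupies exactly one subject
/// token, so any `.` inside it is rewritten to `_`.
fn tool_subject(urn: &str) -> Result<String, SubjectError> {
    if urn.is_empty() {
        return Err(SubjectError::Empty);
    }
    // NATS subjects may not contain whitespace; `*` and `>` are wildcards
    // and must never appear in a subject we publish to.
    if let Some(c) = urn
        .chars()
        .find(|&c| c.is_whitespace() || matches!(c, '*' | '>'))
    {
        return Err(SubjectError::IllegalChar(c));
    }
    let token = urn.replace('.', "_");
    Ok(format!("coreason.tool.{token}.invoke"))
}

fn main() {
    // The URN format here is illustrative only.
    println!("{:?}", tool_subject("urn:coreason:tool:csv.parse"));
    println!("{:?}", tool_subject("bad urn"));
}
```

Validating before publishing keeps a malformed or hostile URN from fanning a request out to unintended subscribers. The gateway would then publish the JSON-RPC body to the returned subject.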

🐳 Infrastructure Updates (Docker Compose)

The multi-container E2E test mesh must be updated to support the new distributed topology. Update crates/test-suite/tests/e2e_swarm/docker-compose.e2e.yaml with the following changes:

  • Add the OpenShell Daemon Service:
  openshell-manager:
    image: nvcr.io/nvidia/openshell:latest
    ports:
      - "8080:8080"
    environment:
      - OPENSHELL_ENV=local
    privileged: true # Required to instantiate secure kernel namespaces
  • Inject OpenShell URL into Gateway/Runtime:
    Ensure both the gateway and runtime services have the target URL injected into their environments so they can discover the sandbox:
    environment:
      - OPENSHELL_MANAGER_URL=http://openshell-manager:8080

🧪 Verification & Integration Tests

To ensure the new multi-container architecture spins up correctly without getting trapped in a crash loop, we need a dedicated liveness test for the network topology.

  • Create crates/test-suite/tests/e2e_swarm/docker_orchestration_liveness.rs:
use reqwest::Client;
use std::time::Duration;

/// Liveness probe for the E2E mesh; each assertion names the service that failed.
#[tokio::test]
async fn test_docker_mesh_liveness() {
    let client = Client::builder()
        .timeout(Duration::from_secs(5))
        .build()
        .expect("failed to build HTTP client");

    // 1. Verify Rust API Gateway Liveness
    // NOTE: this fallback is identical to the OpenShell fallback in step 4 (localhost:8080).
    // Set COREASON_GATEWAY_URL and OPENSHELL_MANAGER_URL explicitly so the two steps
    // probe different services.
    let gateway_url = std::env::var("COREASON_GATEWAY_URL").unwrap_or_else(|_| "http://localhost:8080".to_string());
    let res = client.get(format!("{gateway_url}/health")).send().await;
    assert!(res.is_ok_and(|r| r.status().is_success()), "Rust API Gateway unreachable.");

    // 2. Verify NATS Broker Connectivity (outer Result: timeout, inner Result: connect)
    let nats_url = std::env::var("NATS_URL").unwrap_or_else(|_| "127.0.0.1:4222".to_string());
    let nats_client = tokio::time::timeout(Duration::from_secs(3), async_nats::connect(&nats_url)).await;
    assert!(matches!(nats_client, Ok(Ok(_))), "NATS Broker unreachable.");

    // 3. Verify Python Sovereign LLM Proxy (Sidecar)
    let sidecar_url = std::env::var("PYTHON_SIDECAR_URL").unwrap_or_else(|_| "http://localhost:8000".to_string());
    let res = client.get(format!("{sidecar_url}/api/v1/auth/status")).send().await;
    assert!(res.is_ok_and(|r| r.status().is_success()), "Python Sidecar unreachable.");

    // 4. Verify NVIDIA OpenShell Host Daemon
    let openshell_url = std::env::var("OPENSHELL_MANAGER_URL").unwrap_or_else(|_| "http://localhost:8080".to_string());
    let res = client.get(format!("{openshell_url}/health")).send().await;
    assert!(res.is_ok_and(|r| r.status().is_success()), "NVIDIA OpenShell Daemon unreachable.");
}
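
Containers started with up -d may need a few seconds before their health endpoints respond, so a single-shot probe can fail on a mesh that is merely still booting. One possible mitigation, not part of the RFC, is a bounded retry with exponential backoff. The sketch below uses only the standard library and a blocking sleep. The name wait_until_ready and its parameters are hypothetical, and inside the tokio test above the sleep would need to become tokio::time::sleep.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `probe` until it succeeds or `deadline` has elapsed, doubling the
/// pause between attempts up to `max_pause`. On timeout the last error is
/// returned so the caller can report which service never came up.
fn wait_until_ready<T, E>(
    deadline: Duration,
    max_pause: Duration,
    mut probe: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let mut pause = Duration::from_millis(10);
    loop {
        match probe() {
            Ok(value) => return Ok(value),
            // Give up only after the deadline; a bounded wait cannot turn
            // into an endless loop against a crash-looping service.
            Err(err) if start.elapsed() >= deadline => return Err(err),
            Err(_) => {
                sleep(pause);
                pause = (pause * 2).min(max_pause);
            }
        }
    }
}

fn main() {
    // Hypothetical probe that fails twice before succeeding.
    let mut attempts = 0u32;
    let result: Result<u32, &str> = wait_until_ready(
        Duration::from_secs(1),
        Duration::from_millis(50),
        || {
            attempts += 1;
            if attempts < 3 { Err("not ready") } else { Ok(attempts) }
        },
    );
    println!("{result:?} after {attempts} attempts");
}
```

Capping both the total wait and the per-attempt pause keeps a genuinely broken service failing the test quickly while tolerating slow startups.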

🚀 Standard Developer Launch Procedure

Once this RFC is merged, the standard operating procedure for launching the local CoReason Swarm will be:

1. Build the network images:

./build_images.sh  # Or .\build_images.ps1 on Windows

2. Launch the E2E Orchestration Mesh:

docker-compose -f crates/test-suite/tests/e2e_swarm/docker-compose.e2e.yaml up -d

3. Verify the Mesh Liveness:

cargo test -p test-suite test_docker_mesh_liveness -- --nocapture
