This repo has been consolidated into the OperationsPAI/aegis monorepo.
- Code now lives under `AegisLab/` in the monorepo (original directory names preserved).
- Pre-migration git history remains viewable here.
- For all new PRs and issues, go to OperationsPAI/aegis.
- This repository is archived as of 2026-04-19.
RCABench is a comprehensive benchmarking platform designed for evaluating root cause analysis (RCA) algorithms in microservices environments. It provides automated fault injection, algorithm execution, and evaluation capabilities for distributed systems research.
RCABench enables researchers and practitioners to:
- Inject faults into microservices using chaos engineering principles
- Execute RCA algorithms on collected observability data
- Evaluate and compare different root cause analysis approaches
- Benchmark performance across various microservice architectures
- Manage datasets of fault scenarios and observability traces
The current backend architecture is a single repository with a single go.mod, but it supports both:
- local monolith-style development modes for speed
- split-service runtime modes for service-boundary validation
The main service boundaries are:
- `api-gateway`: external HTTP/OpenAPI entrypoint
- `iam-service`: auth, user, RBAC, team, api-key
- `resource-service`: project, label, container, dataset, evaluation metadata/query
- `orchestrator-service`: submit, task, trace, retry, dead-letter, workflow control-plane
- `runtime-worker-service`: Redis async consumption, K8s/BuildKit/Helm/Chaos runtime execution
- `system-service`: config, audit, monitor, health, metrics
Key implementation rules:
- External APIs are HTTP/OpenAPI.
- Internal synchronous calls are gRPC via `src/internalclient/*`.
- Long-running execution stays asynchronous on Redis; it is not converted into synchronous execution RPC.
- Module-owned DB access lives in `src/module/*/repository.go`.
- Infra connectivity and low-level operations live in `src/infra/*`.
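The asynchronous rule above can be illustrated with a minimal sketch. It uses an in-memory `queue.Queue` as a stand-in for Redis, and the names `submit`, `work_once`, and `task_state` are hypothetical, not taken from the repository. The point is only the shape: the submit path returns a task ID immediately, and a separate consumer advances the state later.

```python
import queue
import uuid

# In-memory stand-in for the Redis-backed queue (assumption for illustration).
task_queue: "queue.Queue[dict]" = queue.Queue()
task_state: dict = {}


def submit(payload: dict) -> str:
    """Enqueue work and return a task ID right away, without waiting for execution."""
    task_id = uuid.uuid4().hex
    task_state[task_id] = "pending"
    task_queue.put({"id": task_id, "payload": payload})
    return task_id


def work_once() -> bool:
    """Consume one queued task if present; return False when the queue is empty."""
    try:
        task = task_queue.get_nowait()
    except queue.Empty:
        return False
    task_state[task["id"]] = "running"
    # Long-running execution (build, chaos injection, ...) would happen here.
    task_state[task["id"]] = "completed"
    return True


if __name__ == "__main__":
    tid = submit({"kind": "demo"})
    print(task_state[tid])  # state right after submit
    work_once()
    print(task_state[tid])  # state after the worker has run
```

The caller polls or queries task state instead of holding a synchronous RPC open for the duration of the run.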
The backend now has two categories of startup modes:
- local integrated modes: `producer`, `consumer`, `both`
- dedicated service modes: `api-gateway`, `iam-service`, `resource-service`, `orchestrator-service`, `runtime-worker-service`, `system-service`
both is not the six-service topology.
It starts:
- the local HTTP stack
- the local worker/consumer stack
It is the fastest option for local end-to-end debugging such as:
- submit -> queue -> worker -> state update
- task/trace/log flow
- API + async worker integration
| Mode / Service | What Starts | Local Owner Implementations Injected | Internal Clients Required | Best For |
|---|---|---|---|---|
| `producer` | HTTP server only | Yes, local HTTP-facing modules | No | API, handler/service, Swagger, frontend integration |
| `consumer` | worker/controller/receiver side only | Yes, local runtime-side owners | Optional depending on config | queue/runtime/worker-only debugging |
| `both` | HTTP + worker/controller/receiver | Yes, local owners for integrated debugging | Optional depending on config | full local async loop |
| `api-gateway` | external HTTP gateway | No cross-owner local fallback as main path; service-specific remote wiring is expected | Yes | gateway boundary and remote-first debugging |
| `iam-service` | IAM gRPC service | Yes, IAM-local owners only | Only if a specific cross-service read path needs it | auth/user/rbac/team/api-key |
| `resource-service` | Resource gRPC service | Yes, resource-local owners only | Yes for orchestrator-backed queries like some statistics/evaluation views | project/container/dataset/label/evaluation |
| `orchestrator-service` | Orchestrator gRPC service | Yes, orchestrator-local owners only | Optional runtime/resource dependencies as needed | submit/task/trace/workflow |
| `runtime-worker-service` | runtime worker + runtime gRPC | Yes, runtime-side execution infrastructure only | Yes, especially orchestrator target | Redis consumer, K8s/build/helm runtime |
| `system-service` | system gRPC service | Yes, system-local owners only | Yes, especially runtime target | config/audit/monitor/metrics |
- Use `producer` for normal API development.
- Use `both` when you need the local async loop.
- Use the six dedicated services when you need to verify service boundaries, internal gRPC, or remote-first behavior.
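The guidance above can be written down as a small lookup, shown here as a sketch. The mode and service names come from this README; the function names `classify_mode` and `recommend_modes` are hypothetical helpers, not part of the codebase.

```python
LOCAL_MODES = {"producer", "consumer", "both"}
DEDICATED_SERVICES = {
    "api-gateway",
    "iam-service",
    "resource-service",
    "orchestrator-service",
    "runtime-worker-service",
    "system-service",
}


def classify_mode(name: str) -> str:
    """Return which startup category a mode name belongs to."""
    if name in LOCAL_MODES:
        return "local-integrated"
    if name in DEDICATED_SERVICES:
        return "dedicated-service"
    raise ValueError(f"unknown startup mode: {name!r}")


def recommend_modes(needs_async_loop: bool, needs_service_boundaries: bool) -> list:
    """Pick what to start, following the three rules of thumb listed above."""
    if needs_service_boundaries:
        return sorted(DEDICATED_SERVICES)
    return ["both"] if needs_async_loop else ["producer"]


if __name__ == "__main__":
    print(classify_mode("both"))
    print(recommend_modes(needs_async_loop=True, needs_service_boundaries=False))
```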
- Docker (>= 20.10)
- Kubernetes (>= 1.25) or kind/minikube for local development
- kubectl (compatible with your cluster version)
- Go (>= 1.23) for development
- Python (>= 3.10) for SDK usage
- CPU: 4+ cores recommended
- Memory: 8GB+ RAM
- Storage: 20GB+ available disk space
- Network: Stable internet connection for image pulls
# Clone the repository
git clone https://github.com/OperationsPAI/AegisLab.git
cd AegisLab
# Start core dependencies
docker compose up -d redis mysql etcd jaeger buildkitd loki prometheus grafana
cd src && go run . producer -conf ./config.dev.toml -port 8082
# HTTP: http://localhost:8082
# Health: http://localhost:8082/system/health
# Docs: http://localhost:8082/docs/doc.json
cd src && go run . both -conf ./config.dev.toml -port 8082
Use this mode when you need:
- HTTP + worker in one local process set
- submit -> queue -> consumer -> query loop
- task / trace / logs integration
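After starting `producer` or `both`, it can be handy to wait for the health endpoint listed above before sending requests. The following is a minimal sketch using only the Python standard library. It assumes the endpoint answers with HTTP 200 when the server is ready, which this README does not state explicitly; the `opener` and `sleep` parameters exist only to make the helper easy to test.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url, attempts=10, delay=1.0,
                    opener=urllib.request.urlopen, sleep=time.sleep) -> bool:
    """Poll `url` until it returns HTTP 200 or the attempts are exhausted."""
    for attempt in range(attempts):
        try:
            with opener(url, timeout=2) as resp:
                if resp.status == 200:  # assumption: 200 means healthy
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or connection refused
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    ok = wait_for_health("http://localhost:8082/system/health", attempts=5)
    print("healthy" if ok else "not reachable")
```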
# terminal 1
cd src && go run ./cmd/iam-service -conf ./config.dev.toml
# terminal 2
cd src && go run ./cmd/orchestrator-service -conf ./config.dev.toml
# terminal 3
cd src && go run ./cmd/resource-service -conf ./config.dev.toml
# terminal 4
cd src && go run ./cmd/runtime-worker-service -conf ./config.dev.toml
# terminal 5
cd src && go run ./cmd/system-service -conf ./config.dev.toml
# terminal 6
cd src && go run ./cmd/api-gateway -conf ./config.dev.toml -port 8082
# Check prerequisites
just check-prerequisites
# Deploy to Kubernetes cluster
just run
If you use `scripts/start.sh` directly, the external install URLs can now be overridden with env vars such as:
- `CERT_MANAGER_MANIFEST_URL`
- `CHAOS_MESH_REPO_URL`
- `CLICKSTACK_REPO_URL`
- `OPEN_TELEMETRY_REPO_URL`
- `OTEL_DEMO_REPO_URL`
- `JUICEFS_REPO_URL`
- `TEST_HTTP_PROXY`
- `TEST_HTTPS_PROXY`
- `TEST_NO_PROXY`
- Report Index: Consolidated backend refactor, runtime, governance, SDK/auth, and validation notes
- Refactor TODO: Source-of-truth task list and final acceptance checklist
- API Key Auth TODO: Key ID / Key Secret auth execution checklist and signing contract
- Package Rename TODO: Go package naming cleanup record for `interface/module/infra/app`
- Frontend Redesign: Frontend redesign plan and IA notes
- Frontend UI Guidelines: Frontend visual/system guidelines
Copy and modify the configuration file:
cp src/config.dev.toml src/config.toml
Key configuration sections:
[database]
mysql_host = "localhost"
mysql_port = "3306"
mysql_user = "root"
mysql_password = "yourpassword"
mysql_db = "rcabench"
[redis]
host = "localhost:6379"
[k8s]
namespace = "default"
[clients.iam]
target = "127.0.0.1:9091"
[clients.resource]
target = "127.0.0.1:9093"
[clients.orchestrator]
target = "127.0.0.1:9092"
[clients.runtime]
target = "127.0.0.1:9094"
[clients.system]
target = "127.0.0.1:9095"
[iam.grpc]
addr = ":9091"
[resource.grpc]
addr = ":9093"
[orchestrator.grpc]
addr = ":9092"
[runtime_worker.grpc]
addr = ":9094"
[system.grpc]
addr = ":9095"
[injection]
benchmark = ["workload-name"]
target_label_key = "app"
Important config rules:
- `producer` and `both` can use local owner implementations for fast debugging.
- Dedicated services should use the appropriate `clients.*.target` values when a remote dependency is required.
- `api-gateway` validates `clients.iam.target`, `clients.resource.target`, `clients.orchestrator.target`, and `clients.system.target`.
- `runtime-worker-service` validates `clients.orchestrator.target`.
- `system-service` validates `clients.runtime.target`.
- `resource-service` validates `clients.orchestrator.target` for remote-backed query paths.
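The validation rules above can be checked before starting a dedicated service. The sketch below operates on an already-parsed config dictionary (for example, the result of loading `src/config.toml` with a TOML parser). The `missing_targets` helper and the table it uses are illustrative assumptions derived from the rules listed here, not the backend's actual validation code.

```python
# Required clients.<name>.target entries per dedicated service, per the rules above.
# resource-service only needs the orchestrator target for remote-backed query paths.
REQUIRED_CLIENT_TARGETS = {
    "api-gateway": ["iam", "resource", "orchestrator", "system"],
    "runtime-worker-service": ["orchestrator"],
    "system-service": ["runtime"],
    "resource-service": ["orchestrator"],
}


def missing_targets(mode: str, config: dict) -> list:
    """Return the config keys that `mode` needs but that are absent or empty."""
    clients = config.get("clients", {})
    missing = []
    for name in REQUIRED_CLIENT_TARGETS.get(mode, []):
        target = clients.get(name, {}).get("target", "")
        if not str(target).strip():
            missing.append(f"clients.{name}.target")
    return missing


if __name__ == "__main__":
    cfg = {"clients": {"iam": {"target": "127.0.0.1:9091"},
                       "resource": {"target": "127.0.0.1:9093"}}}
    print(missing_targets("api-gateway", cfg))
    print(missing_targets("producer", cfg))
```

Modes without an entry in the table, such as `producer` and `both`, report nothing missing because they can run on local owner implementations.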
For production deployment, configure persistent volumes:
# Create persistent volumes (adjust paths as needed)
kubectl apply -f scripts/k8s/pv.yaml
from rcabench import RCABenchSDK
# Initialize the SDK
sdk = RCABenchSDK("http://localhost:8082")
# List available algorithms
algorithms = sdk.algorithm.list()
print(f"Available algorithms: {algorithms}")
# Submit a fault injection
injection_request = [{
    "duration": 300,  # 5 minutes
    "faultType": 5,  # CPU stress
    "injectNamespace": "default",
    "injectPod": "my-service",
    "spec": {"CPULoad": 80, "CPUWorker": 2},
    "benchmark": "my-workload"
}]
response = sdk.injection.execute(injection_request)
# Execute an RCA algorithm
algorithm_request = [{
    "benchmark": "my-workload",
    "algorithm": "rca-algorithm-name",
    "dataset": "fault-scenario-dataset"
}]
result = sdk.algorithm.execute(algorithm_request)
# Get algorithm list
curl -X GET http://localhost:8082/api/v1/algorithms
# Submit fault injection
curl -X POST http://localhost:8082/api/v1/injection \
  -H "Content-Type: application/json" \
  -d '[{
    "duration": 300,
    "faultType": 5,
    "injectNamespace": "default",
    "injectPod": "my-service",
    "spec": {"CPULoad": 80}
  }]'
RCABench supports various chaos engineering patterns:
- Network Chaos: Latency, packet loss, bandwidth limitation
- Pod Chaos: Pod failure, pod kill
- Stress Chaos: CPU stress, memory stress
- Time Chaos: Clock skew
- DNS Chaos: DNS resolution failures
- HTTP Chaos: HTTP request/response manipulation
- JVM Chaos: JVM-specific faults (GC pressure, etc.)
The platform provides comprehensive evaluation metrics:
- Accuracy: Precision, recall, F1-score for root cause identification
- Latency: Time to detection and diagnosis
- Scalability: Performance across different system sizes
- Robustness: Performance under various fault scenarios
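As an illustration of the accuracy metrics named above, the following sketch computes precision, recall, and F1-score from a set of predicted root causes and a set of ground-truth root causes. This README does not specify how RCABench computes these values (for example, whether ranking or top-k cutoffs are involved), so this is the standard set-based definition and the function name is hypothetical.

```python
def precision_recall_f1(predicted: set, actual: set) -> tuple:
    """Set-based precision, recall, and F1 for root cause identification."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical service names: the algorithm blamed two services, one correctly.
    predicted = {"cart-service", "payment-service"}
    actual = {"cart-service", "db-proxy", "checkout-service"}
    p, r, f1 = precision_recall_f1(predicted, actual)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```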
RCABench integrates with:
- Jaeger: Distributed tracing
- Prometheus: Metrics collection
- Grafana: Visualization dashboards
- ClickHouse: Analytics and data warehouse
Access monitoring:
- Jaeger UI: http://localhost:16686
- API Metrics: http://localhost:8082/metrics
Choose the mode first:
- API-only debugging -> `producer`
- local async loop debugging -> `both`
- service-boundary / gRPC debugging -> six dedicated services
Start here:
- `src/router/*`
- `src/module/*/handler.go`
- `src/module/*/service.go`
- `src/module/*/repository.go`
If the problem only appears in split-service mode, then also check:
- `src/app/gateway/*`
- `src/internalclient/*`
Start here:
- `src/internalclient/*`
- `src/interface/grpc/*`
- `src/app/{gateway,iam,resource,orchestrator,runtime,system}/*`
Start here:
- `src/service/consumer/*`
- `src/interface/worker/*`
- `src/interface/controller/*`
- `src/infra/k8s/*`
- `src/infra/buildkit/*`
- `src/infra/helm/*`
- `src/infra/chaos/*`
Check:
- `src/module/auth/*`
- `src/module/user/*`
- `src/module/rbac/*`
- `src/module/team/*`
Split-service path:
- `src/app/gateway/{auth,user,rbac,team}_services.go`
- `src/internalclient/iamclient/*`
- `src/interface/grpc/iam/*`
Check:
- `src/module/project/*`
- `src/module/label/*`
- `src/module/container/*`
- `src/module/dataset/*`
Split-service path:
- `src/app/gateway/resource_services.go`
- `src/internalclient/resourceclient/*`
- `src/interface/grpc/resource/*`
Check:
- `src/module/injection/*`
- `src/module/execution/*`
- `src/module/task/*`
- `src/module/trace/*`
- `src/module/group/*`
- `src/module/notification/*`
Split-service path:
- `src/app/gateway/orchestrator_services.go`
- `src/internalclient/orchestratorclient/*`
- `src/interface/grpc/orchestrator/*`
- `src/service/consumer/*`
Check:
- `src/module/system/*`
- `src/module/systemmetric/*`
Split-service path:
- `src/app/gateway/system_services.go`
- `src/internalclient/systemclient/*`
- `src/internalclient/runtimeclient/*`
- `src/interface/grpc/system/*`
- `src/interface/grpc/runtime/*`
Check:
- `src/service/consumer/*`
- `src/interface/worker/*`
- `src/interface/controller/*`
- `src/infra/k8s/*`
- `src/infra/buildkit/*`
- `src/infra/helm/*`
- `src/infra/chaos/*`
- `src/infra/redis/*`
# Build the main application
cd src
go build -o rcabench .
# Regenerate OpenAPI / Swagger artifacts
cd ..
just swagger-init 1.2.3
# Generate SDK packages
just generate-portal 1.2.3
just generate-admin 1.2.3
just generate-python-sdk 1.2.3
# Run tests
cd src
go test ./...
cd sdk/python
# Install in development mode
pip install -e .
# Run tests
python -m pytest tests/
just --list # Show all available commands
just run # Deploy to the configured Kubernetes target
just local-deploy # Boot local infra dependencies with Docker Compose
just local-debug # Start local producer+consumer debug process
just swagger-init 1.2.3 # Regenerate OpenAPI / Swagger artifacts
just generate-portal 1.2.3 # Generate portal TypeScript SDK
just generate-admin 1.2.3 # Generate admin TypeScript SDK
just generate-python-sdk 1.2.3 # Generate Python SDK
just release-portal 1.2.3 # Generate release-ready portal TypeScript SDK
just release-admin 1.2.3 # Generate release-ready admin TypeScript SDK
just release-python-sdk 1.2.3 # Generate release-ready Python SDK
just test-regression # Run the Python SDK regression workflow
- Database Connection Failed
  # Check database status
  kubectl get pods | grep mysql
  # Re-run the local debug stack after fixing config/env
  just local-debug
- Pod Scheduling Issues
  # Check node resources
  kubectl describe nodes
  # Check pod status
  kubectl describe pod <pod-name>
- Permission Errors
  # Check RBAC permissions
  kubectl auth can-i create pods --namespace=default
- A Request Works In `producer` But Fails In Split-Service Mode
  Check in this order:
  - are the dedicated services actually running?
  - are the required `clients.*.target` values configured?
  - is the request going through `src/internalclient/*` as expected?
  - is the destination gRPC service registered and listening?
- Submit Works But Task State Does Not Move
  Check in this order:
  - Redis queue health
  - `src/service/consumer/*`
  - runtime infra (`src/infra/k8s/*`, `src/infra/buildkit/*`, `src/infra/helm/*`)
  - orchestrator owner write-back path
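For the split-service checklist above, a quick way to answer "is the destination listening?" is a plain TCP connect against each configured `clients.*.target`. The sketch below uses only the standard library; `target_reachable` is a hypothetical helper. A successful connect shows only that something accepts connections on that port. It does not prove the expected gRPC service is registered there.

```python
import socket


def target_reachable(target: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to a 'host:port' target succeeds."""
    host, _, port = target.rpartition(":")
    try:
        with socket.create_connection((host or "127.0.0.1", int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError: refused, timed out, or unresolvable; ValueError: malformed port.
        return False


if __name__ == "__main__":
    # Targets copied from the example configuration in this README.
    targets = {
        "clients.iam.target": "127.0.0.1:9091",
        "clients.orchestrator.target": "127.0.0.1:9092",
        "clients.resource.target": "127.0.0.1:9093",
        "clients.runtime.target": "127.0.0.1:9094",
        "clients.system.target": "127.0.0.1:9095",
    }
    for key, target in targets.items():
        status = "listening" if target_reachable(target) else "unreachable"
        print(f"{key} = {target}: {status}")
```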
cd src && go test ./...
cd src && go test ./app -run 'TestProducerOptionsValidate|TestProducerOptionsStartStopSmoke|TestProducerOptionsHTTPIntegrationSmoke'
cd src && go test ./app -run 'TestConsumerOptions|TestBothOptions'
cd src && go test ./router ./docs ./interface/http
Real-cluster K8s validation:
cd src && RUN_K8S_INTEGRATION=1 go test ./infra/k8s -run TestK8sGatewayJobLifecycleIntegration
- Review the consolidated notes in `docs/report-index.md`
- Run `just --list` to inspect the supported local workflows
- Verify configuration in `src/config.dev.toml`
For optimal performance:
- Resource Allocation: Ensure adequate CPU/memory for workloads
- Storage: Use SSD storage for databases
- Network: Stable network connectivity for distributed components
- Scaling: Horizontal scaling supported via Kubernetes deployments
- Default credentials should be changed in production
- API endpoints should be secured with proper authentication
- Network policies recommended for production deployments
- Regular security updates for container images
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions! Please see our contributing guidelines for details on:
- Code style and standards
- Pull request process
- Issue reporting
- Documentation improvements
For questions, issues, or contributions, please use the project's issue tracker or discussion forums.