R-Pingmesh is still under heavy development. Please do not use it in production.
A service-aware RoCE network monitoring and diagnostic system based on end-to-end active probing.
R-Pingmesh is a monitoring system for RDMA over Converged Ethernet (RoCE) networks. Based on research published at SIGCOMM 2024, it provides visibility into RoCE network performance and supports rapid detection and precise localization of network problems that can severely impact distributed services.
Modern data centers rely heavily on RoCE networks for high-performance computing workloads like distributed machine learning and storage systems. As these networks scale to tens of thousands of RNICs, traditional monitoring approaches fall short:
- Single-point failures can devastate entire training clusters
- Performance bottlenecks masquerade as network issues
- Troubleshooting becomes time-consuming and error-prone
- Service impact assessment remains largely guesswork
R-Pingmesh solves these challenges with active probing, precise measurements, and service-aware monitoring.
- Accurate RTT measurement using commodity RDMA NICs
- End-host processing delay separation from network latency
- Sub-microsecond precision with CQE timestamps
%%{init: {'theme':'base', 'themeVariables': {'primaryTextColor':'#333333', 'fontSize':'14px'}}}%%
sequenceDiagram
participant P as Prober
participant PN as Prober RNIC
participant N as RoCE Network
participant RN as Responder RNIC
participant R as Responder
Note over P,R: RTT Measurement Process
P->>P: T1: Application post send
P->>PN: Post probe packet
PN->>PN: T2: CQE send completion (HW timestamp)
PN->>N: Probe packet transmission
N->>RN: Network delivery
RN->>RN: T3: CQE receive (HW timestamp)
RN->>R: Deliver to application
R->>R: Process probe packet
R->>RN: Post ACK packet
RN->>RN: T4: CQE ACK send (HW timestamp)
RN->>N: ACK transmission
N->>PN: Network delivery
PN->>PN: T5: CQE ACK receive (HW timestamp)
PN->>P: Completion notification
P->>P: T6: Application poll complete
Note over P,R: Calculations
Note over P: Network RTT = (T5-T2) - (T4-T3)
Note over P: Prober Delay = (T6-T1) - (T5-T2)
Note over R: Responder Delay = T4-T3
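The calculations in the diagram can be written out as a short Go sketch. The `ProbeTimestamps` type and its method names are hypothetical and are not taken from the R-Pingmesh code base. The sketch also makes one property of the formulas visible: each subtraction uses two timestamps from the same clock (prober host, prober RNIC, or responder RNIC), so the three clocks do not need to be synchronized with each other.

```go
package main

import (
	"fmt"
	"time"
)

// ProbeTimestamps holds the six timestamps from the diagram as offsets
// from the epoch of the clock that produced them (hypothetical type).
//   T1, T6: prober host clock (application post send / poll complete)
//   T2, T5: prober RNIC clock (CQE hardware timestamps)
//   T3, T4: responder RNIC clock (CQE hardware timestamps)
type ProbeTimestamps struct {
	T1, T2, T3, T4, T5, T6 time.Duration
}

// ResponderDelay is the time the responder spent between receiving the
// probe and sending the ACK: T4 - T3.
func (p ProbeTimestamps) ResponderDelay() time.Duration {
	return p.T4 - p.T3
}

// NetworkRTT removes the responder delay from the RNIC-to-RNIC round trip:
// (T5 - T2) - (T4 - T3).
func (p ProbeTimestamps) NetworkRTT() time.Duration {
	return (p.T5 - p.T2) - (p.T4 - p.T3)
}

// ProberDelay is the prober-side host overhead around the RNIC round trip:
// (T6 - T1) - (T5 - T2).
func (p ProbeTimestamps) ProberDelay() time.Duration {
	return (p.T6 - p.T1) - (p.T5 - p.T2)
}

func main() {
	// Made-up values; each clock deliberately has a different epoch.
	p := ProbeTimestamps{
		T1: 100 * time.Microsecond, T6: 120 * time.Microsecond, // host clock
		T2: 5000 * time.Microsecond, T5: 5012 * time.Microsecond, // prober RNIC
		T3: 900 * time.Microsecond, T4: 904 * time.Microsecond, // responder RNIC
	}
	fmt.Println("network RTT:    ", p.NetworkRTT())
	fmt.Println("prober delay:   ", p.ProberDelay())
	fmt.Println("responder delay:", p.ResponderDelay())
}
```

With these sample values the RNIC-to-RNIC round trip is 12 µs, of which 4 µs is responder processing, leaving 8 µs attributed to the network.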
- RNIC vs. network failure distinction through ToR-mesh probing
- Real-time anomaly detection with minimal false positives
- Service impact assessment to prioritize critical issues
- Automatic service flow discovery using eBPF tracing
- Path-specific probing following actual service traffic
- 5-tuple aware measurements for ECMP environments
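To illustrate why 5-tuple awareness matters under ECMP, here is a minimal sketch of hash-based path selection. It is an illustration only: real switches use vendor-specific hash functions, and the `FiveTuple` type, the `PickPath` function, and the FNV hash are assumptions chosen for the example. The point is that a probe follows the same path as a service flow only if it carries the same 5-tuple.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// FiveTuple identifies a flow for ECMP hashing (hypothetical type).
type FiveTuple struct {
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	Proto            uint8
}

// PickPath maps a flow to one of numPaths equal-cost paths by hashing
// its 5-tuple. FNV-1a stands in for the switch's real hash function.
func PickPath(t FiveTuple, numPaths int) int {
	if numPaths <= 0 {
		return 0
	}
	h := fnv.New32a()
	fmt.Fprintf(h, "%s|%s|%d|%d|%d", t.SrcIP, t.DstIP, t.SrcPort, t.DstPort, t.Proto)
	return int(h.Sum32() % uint32(numPaths))
}

func main() {
	// RoCEv2 runs over UDP (protocol 17) with destination port 4791.
	service := FiveTuple{"10.0.0.1", "10.0.1.1", 49152, 4791, 17}
	sameTuple := service                                            // probe reusing the service flow's 5-tuple
	otherPort := FiveTuple{"10.0.0.1", "10.0.1.1", 50000, 4791, 17} // probe with a different source port

	fmt.Println("service flow path:    ", PickPath(service, 4))
	fmt.Println("probe, same 5-tuple:  ", PickPath(sameTuple, 4))
	fmt.Println("probe, different port:", PickPath(otherPort, 4))
}
```

The first two lines always print the same path index; the third may or may not match, which is why a probe with an arbitrary 5-tuple cannot be relied on to exercise the path a service actually uses.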
%%{init: {'theme':'base', 'themeVariables': {'primaryTextColor':'#333333', 'fontSize':'14px'}}}%%
flowchart TD
subgraph "Service Discovery Process"
A[Application creates RDMA connection] --> B[eBPF program hooks the modify_qp call]
B --> C{QP State = RTR?}
C -->|Yes| D[Extract 5-tuple:<br/>Src/Dst GID, Src/Dst QPN]
C -->|No| E[Ignore event]
D --> F[Send event to userspace via ring buffer]
F --> G[Agent receives connection event]
G --> H[Query Controller for target RNIC info]
H --> I[Start service-specific probing]
I --> J[Monitor actual service path]
end
subgraph "Monitoring Modes Comparison"
direction LR
K[Cluster Monitoring<br/>• Always-on<br/>• ToR-mesh coverage<br/>• Network health]
L[Service Tracing<br/>• Dynamic<br/>• Follows real traffic<br/>• Service-aware]
end
style A fill:#E8F5E8
style D fill:#FFF3E0
style I fill:#E3F2FD
style J fill:#F3E5F5
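The userspace side of the discovery flow above might look like the following sketch. The `ModifyQPEvent` layout, the `FlowKey` type, and `FlowFromEvent` are hypothetical and do not reflect the project's actual event format. The numeric QP state values follow the `ibv_qp_state` enum in libibverbs. Filtering on RTR is useful because the remote QPN and destination GID are supplied when a QP is moved to RTR.

```go
package main

import "fmt"

// QP states, numbered as in the libibverbs ibv_qp_state enum.
const (
	QPStateReset uint32 = 0
	QPStateInit  uint32 = 1
	QPStateRTR   uint32 = 2 // Ready To Receive: remote endpoint is now known
	QPStateRTS   uint32 = 3
)

// ModifyQPEvent is a hypothetical event emitted by the eBPF program
// for every modify_qp call and read from the ring buffer.
type ModifyQPEvent struct {
	State  uint32
	SrcGID [16]byte
	DstGID [16]byte
	SrcQPN uint32
	DstQPN uint32
}

// FlowKey identifies one service connection to trace (hypothetical type).
type FlowKey struct {
	SrcGID, DstGID [16]byte
	SrcQPN, DstQPN uint32
}

// FlowFromEvent returns the flow to trace if the event is a transition
// to RTR, and false for every other state so the caller can ignore it.
func FlowFromEvent(ev ModifyQPEvent) (FlowKey, bool) {
	if ev.State != QPStateRTR {
		return FlowKey{}, false
	}
	return FlowKey{
		SrcGID: ev.SrcGID, DstGID: ev.DstGID,
		SrcQPN: ev.SrcQPN, DstQPN: ev.DstQPN,
	}, true
}

func main() {
	// The state sequence a single connection typically goes through.
	events := []ModifyQPEvent{
		{State: QPStateInit, SrcQPN: 17},
		{State: QPStateRTR, SrcQPN: 17, DstQPN: 42},
		{State: QPStateRTS, SrcQPN: 17, DstQPN: 42},
	}
	active := map[FlowKey]struct{}{}
	for _, ev := range events {
		if key, ok := FlowFromEvent(ev); ok {
			active[key] = struct{}{}
			fmt.Printf("start service probing: local QPN %d -> remote QPN %d\n", key.SrcQPN, key.DstQPN)
		}
	}
	fmt.Println("tracked flows:", len(active))
}
```

In a real Agent the next step would be the Controller lookup shown in the diagram; that call is omitted here.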
R-Pingmesh consists of three core components.
%%{init: {'theme':'base', 'themeVariables': {'primaryTextColor':'#333333', 'lineColor':'#666666', 'fontSize':'16px'}}}%%
flowchart TD
%% Agent Layer
A1["Agent (Host 1)<br/>────────────────<br/>RDMA Manager<br/>eBPF Service Tracer<br/>Active Probing Engine<br/>Path Tracer<br/>Controller Client<br/>Upload Client"]
A1_HW["Hardware Layer<br/>────────────────<br/>RDMA Hardware<br/>UD Queue Pairs<br/>CQE Timestamps"]
A1_KERNEL["Kernel Layer<br/>────────────────<br/>eBPF Programs<br/>modify_qp/destroy_qp<br/>Ring Buffer Events"]
AN["Agent (Host N)<br/>────────────────<br/>Core Modules<br/>RDMA Hardware<br/>eBPF Programs"]
%% Network Infrastructure
NET["RoCE Network<br/>────────────────<br/>RoCE Fabric<br/>ToR Switches<br/>Spine Switches<br/>Active Probing Paths"]
%% Controller
C["Controller<br/>────────────────<br/>RNIC Registry<br/>Pinglist Generator<br/>gRPC Server<br/>Configuration Manager"]
C_DB["Controller Storage<br/>────────────────<br/>RNIC Database<br/>GID → RNIC Info<br/>ToR ID → RNIC List"]
%% Analyzer
AZ["Analyzer<br/>────────────────<br/>Data Ingestion API<br/>Anomaly Detection<br/>Root Cause Analysis<br/>SLA Tracker"]
%% Monitoring Capabilities
MONITORING["Monitoring Modes<br/>────────────────<br/>• Cluster Monitoring<br/> (ToR-mesh, Inter-ToR)<br/>• Service Tracing<br/> (eBPF Flow Discovery)<br/>• Path Tracing<br/> (Network Topology)<br/>• Anomaly Detection<br/> (RNIC vs Network)"]
%% OpenTelemetry Integration
OTLP["OpenTelemetry (OTLP)<br/>────────────────<br/>RTT Metrics Export"]
%% Vertical Flow
A1 --> A1_HW
A1 --> A1_KERNEL
A1_HW --> NET
A1_KERNEL --> NET
AN --> NET
NET --> C
C --> C_DB
C --> AZ
AZ --> MONITORING
A1 -.-> MONITORING
AN -.-> MONITORING
%% OpenTelemetry Integration
A1 -->|"OTLP Export<br/>RTT Metrics"| OTLP
AN -->|"OTLP Export"| OTLP
OTLP --> MONITORING
%% Communication Labels
A1 -.->|"Active Probing<br/>RTT Measurement"| NET
AN -.->|"Active Probing"| NET
A1 <-.->|"gRPC Registration<br/>Pinglists"| C
AN <-.->|"gRPC"| C
A1 -->|"gRPC Upload<br/>Probe Results"| AZ
AN -->|"Data Upload"| AZ
%% Styling
classDef agentClass fill:#4CAF50,stroke:#2E7D32,stroke-width:3px,color:#fff,font-weight:bold
classDef controllerClass fill:#2196F3,stroke:#1565C0,stroke-width:3px,color:#fff,font-weight:bold
classDef analyzerClass fill:#FF9800,stroke:#E65100,stroke-width:3px,color:#fff,font-weight:bold
classDef networkClass fill:#E3F2FD,stroke:#1976D2,stroke-width:3px,color:#1976D2,font-weight:bold
classDef storageClass fill:#9E9E9E,stroke:#424242,stroke-width:3px,color:#fff,font-weight:bold
classDef monitoringClass fill:#F3E5F5,stroke:#7B1FA2,stroke-width:3px,color:#7B1FA2,font-weight:bold
classDef otlpClass fill:#E8F5E8,stroke:#4CAF50,stroke-width:3px,color:#2E7D32,font-weight:bold
class A1,AN agentClass
class C controllerClass
class AZ analyzerClass
class NET networkClass
class A1_HW,A1_KERNEL,C_DB storageClass
class MONITORING monitoringClass
class OTLP otlpClass
Deployed on every RoCE host, the Agent performs:
- Active probing using UD Queue Pairs
- Service flow monitoring via eBPF programs
- Path tracing for network topology discovery
- Real-time measurements with hardware timestamps
Centralized coordination service providing:
- RNIC registry management
- Pinglist generation (ToR-mesh and Inter-ToR)
- Target resolution for service tracing
- Configuration distribution
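As a rough illustration of pinglist generation, the sketch below builds both list types from a ToR-indexed registry. The `RNIC` type, the function names, and the `perTor` parameter are hypothetical, and the inter-ToR selection policy (the first few RNICs of every other ToR) is an assumption made for the example, not the Controller's actual algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

// RNIC is a hypothetical registry entry.
type RNIC struct {
	GID   string
	QPN   uint32
	TorID string
}

// TorMeshPinglist returns every other RNIC under the requester's ToR,
// so that all RNICs under one ToR probe each other.
func TorMeshPinglist(registry map[string][]RNIC, requester RNIC) []RNIC {
	var targets []RNIC
	for _, r := range registry[requester.TorID] {
		if r.GID != requester.GID {
			targets = append(targets, r)
		}
	}
	return targets
}

// InterTorPinglist returns up to perTor RNICs from each other ToR.
// Picking the first entries is an assumption made to keep the sketch
// deterministic; a real policy would likely sample or rotate targets.
func InterTorPinglist(registry map[string][]RNIC, requester RNIC, perTor int) []RNIC {
	torIDs := make([]string, 0, len(registry))
	for id := range registry {
		if id != requester.TorID {
			torIDs = append(torIDs, id)
		}
	}
	sort.Strings(torIDs) // map iteration order is random; sort for stable output

	var targets []RNIC
	for _, id := range torIDs {
		rnics := registry[id]
		n := perTor
		if n > len(rnics) {
			n = len(rnics)
		}
		if n < 0 {
			n = 0
		}
		targets = append(targets, rnics[:n]...)
	}
	return targets
}

func main() {
	registry := map[string][]RNIC{
		"tor1": {{GID: "g1", QPN: 100, TorID: "tor1"}, {GID: "g2", QPN: 101, TorID: "tor1"}, {GID: "g3", QPN: 102, TorID: "tor1"}},
		"tor2": {{GID: "g4", QPN: 200, TorID: "tor2"}, {GID: "g5", QPN: 201, TorID: "tor2"}},
	}
	me := registry["tor1"][0]
	fmt.Println("ToR-mesh targets: ", TorMeshPinglist(registry, me))
	fmt.Println("Inter-ToR targets:", InterTorPinglist(registry, me, 1))
}
```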
Advanced analytics engine delivering:
- Anomaly detection and root cause analysis
- SLA tracking and performance trending
- Service impact assessment
- Alert generation and escalation
- Agent → Controller (gRPC):
  - Agent registers RNICs with Controller on startup
  - Agent requests Pinglists for Cluster Monitoring (ToR-mesh, Inter-ToR)
  - Agent requests target RNIC information for Service Tracing
- Agent → Analyzer (gRPC):
  - Agent uploads probe results (RTT, delays, timeouts)
  - Agent uploads path trace information
  - Agent uploads aggregated local statistics
- Controller → Agent (gRPC responses):
  - Controller provides Pinglists and RNIC information based on Agent requests
- Go: with Cgo for RDMA integration
- RDMA Verbs: libibverbs Cgo wrapper for low-level RDMA operations
- gRPC: for communication between components
- RQLite: database for the Controller (https://rqlite.io/)
- OpenTelemetry: for probe metrics instrumentation
- eBPF: cilium/ebpf library for service flow monitoring
- Linux kernel 5.8+ with eBPF support
- RDMA-capable network interfaces
- Docker (recommended) or native Go 1.24+ environment
- Root privileges or appropriate capabilities
TBD
Continuous network health assessment across the entire RoCE cluster:
- ToR-mesh probing: Detects faulty RNICs and local issues
- Inter-ToR probing: Monitors switch and link health
- Always-on operation: Independent of running services
- Comprehensive SLA tracking: RTT, packet loss, and processing delays
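One way to picture how the two probe types combine to separate RNIC faults from network faults is the sketch below. It is a simplified heuristic under stated assumptions, not the Analyzer's actual logic: the `ProbeResult` type, the function names, and the timeout-ratio threshold are all hypothetical, and the sketch ignores the case where the prober itself is faulty during ToR-mesh probing.

```go
package main

import "fmt"

// ProbeResult is a hypothetical record of one probe outcome.
type ProbeResult struct {
	Prober   string // GID of the probing RNIC
	Target   string // GID of the probed RNIC
	InterTor bool   // false: ToR-mesh probe, true: inter-ToR probe
	TimedOut bool
}

// SuspectRNICs flags targets whose ToR-mesh probes time out at or above
// the given ratio. ToR-mesh probes cross a single switch, so a target
// that fails for most of its neighbors is more likely faulty itself.
func SuspectRNICs(results []ProbeResult, threshold float64) map[string]bool {
	total := map[string]int{}
	failed := map[string]int{}
	for _, r := range results {
		if r.InterTor {
			continue
		}
		total[r.Target]++
		if r.TimedOut {
			failed[r.Target]++
		}
	}
	suspects := map[string]bool{}
	for target, n := range total {
		if float64(failed[target])/float64(n) >= threshold {
			suspects[target] = true
		}
	}
	return suspects
}

// NetworkSuspects returns inter-ToR timeouts that cannot be explained
// by a suspect RNIC at either end, leaving switches and links as the
// remaining candidates.
func NetworkSuspects(results []ProbeResult, suspects map[string]bool) []ProbeResult {
	var out []ProbeResult
	for _, r := range results {
		if r.InterTor && r.TimedOut && !suspects[r.Prober] && !suspects[r.Target] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	results := []ProbeResult{
		{Prober: "a", Target: "b", TimedOut: true},
		{Prober: "c", Target: "b", TimedOut: true},
		{Prober: "a", Target: "c"},
		{Prober: "x", Target: "b", InterTor: true, TimedOut: true},
		{Prober: "x", Target: "c", InterTor: true, TimedOut: true},
	}
	suspects := SuspectRNICs(results, 0.5) // 0.5 is an arbitrary example threshold
	fmt.Println("suspect RNICs:        ", suspects)
	fmt.Println("network-side timeouts:", NetworkSuspects(results, suspects))
}
```

In this example RNIC `b` fails for both ToR-mesh neighbors and is flagged, so the inter-ToR timeout toward `b` is attributed to the RNIC, while the timeout toward `c` remains a network-side candidate.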
Dynamic monitoring of active service communications:
- Automatic flow discovery: eBPF-based connection tracking
- Path-specific measurements: Follows actual service traffic
- Service impact correlation: Links network issues to service performance
- Real-time adaptation: Adjusts to changing service patterns
# agent.yaml
controller:
  address: "controller.example.com:8080"
analyzer:
  address: "analyzer.example.com:8081"
probing:
  interval: "1s"
  timeout: "5s"
ebpf:
  enabled: true
  buffer_size: 1024
# controller.yaml
server:
  address: ":8080"
database:
  type: "sqlite"
  path: "/data/controller.db"
pinglist:
  tor_mesh_size: 10
  inter_tor_coverage: 0.1
R-Pingmesh is designed to run with minimal overhead:
- CPU Usage: <1% per RNIC under normal load
- Memory Footprint: ~50MB per Agent instance
- Network Overhead: <0.1% of link capacity
- Measurement Accuracy: Sub-microsecond precision
- Scalability: Tested with 10,000+ RNICs
R-Pingmesh is based on the research paper:
Kefei Liu, Zhuo Jiang, Jiao Zhang, Shixian Guo, Xuan Zhang, Yangyang Bai, Yongbin Dong, Feng Luo, Zhang Zhang, Lei Wang, Xiang Shi, Haohan Xu, Yang Bai, Dongyang Song, Haoran Wei, Bo Li, Yongchen Pan, Tian Pan, Tao Huang, "R-Pingmesh: A Service-Aware RoCE Network Monitoring and Diagnostic System", the 38th annual conference of the ACM Special Interest Group on Data Communication (SIGCOMM), 2024.
Key innovations include:
- Novel timestamp-based RTT measurement using CQE events
- ToR-mesh probing for RNIC anomaly detection
- eBPF-based service flow discovery with minimal overhead
- Service-aware impact assessment methodology
We welcome contributions! Please see our Contributing Guide for details.
# Clone repository
git clone https://github.com/yuuki/rpingmesh.git
cd rpingmesh
# Run tests
make test
# Build and test locally
make build-local
make test-local
- Software Design Document - Comprehensive technical design
- Architecture Diagrams - Visual system overview
- Architecture Overview
- Deployment Guide
- Configuration Reference
- Troubleshooting
- API Documentation
This project is licensed under the MIT License - see the LICENSE file for details.
The eBPF programs in internal/ebpf/bpf/ are dual-licensed under MIT and GPLv2.
- The original R-Pingmesh research team
- The Go, RDMA, eBPF, and Linux communities