ModelEngine-Group · ygwpz · Nov 28, 2025 · Nov 22, 2025 · Nov 24, 2025 · Nov 24, 2025
@@ -0,0 +1,34 @@
+name: cpp-linter
+
+on:
+  push:
+    branches: [ "*" ]
+  pull_request:
+    branches: [ "dev*", "main", "*release" ]
+
+
+jobs:
+  cpp-linter:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3  # v6.0.0
+        with:
+          persist-credentials: false
+      - uses: cpp-linter/cpp-linter-action@main
+        id: linter
+        continue-on-error: true
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        with:
+          style: file
+          tidy-checks: '-*'
+          files-changed-only: true
+          lines-changed-only: diff
+          format-review: true
+          thread-comments: ${{ github.event_name == 'pull_request' && 'update' }}
+
+      - name: Fail fast?!
+        if: steps.linter.outputs.checks-failed != 0
+        run: |
+          echo "some linter checks failed. ${{ steps.linter.outputs.checks-failed }}"
+          exit 1
@@ -57,6 +57,7 @@ getting-started/installation_npu
 user-guide/prefix-cache/index
 user-guide/sparse-attention/index
 user-guide/pd-disaggregation/index
+user-guide/metrics/metrics
 :::
 
 :::{toctree}

@@ -0,0 +1,193 @@
+# Observability
+
+UCM (Unified Cache Management) exposes Prometheus metrics that describe cache performance and behavior. This document describes how to enable and configure observability through the `/metrics` API endpoint of the vLLM server that embeds UCM.
+
+---
+
+## Quick Start Guide
+
+### 1) On UCM Side
+
+First, set the `PROMETHEUS_MULTIPROC_DIR` environment variable.
+
+```bash
+export PROMETHEUS_MULTIPROC_DIR=/vllm-workspace
+```
+
+Then, start the UCM service.
+
+```bash
+export CUDA_VISIBLE_DEVICES=0
+vllm serve /home/models/Qwen2.5-14B-Instruct  \
+    --max-model-len 5000 \
+    --tensor-parallel-size 1 \
+    --gpu_memory_utilization 0.87 \
+    --trust-remote-code \
+    --disable-log-requests \
+    --no-enable-prefix-caching \
+    --enforce-eager \
+    --max-num-batched-tokens 40000 \
+    --max-num-seqs 10 \
+    --host 0.0.0.0 \
+    --port 8000 \
+    --kv-transfer-config \
+    '{
+        "kv_connector": "UCMConnector",
+        "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
+        "kv_role": "kv_both",
+        "kv_connector_extra_config": {
+            "UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config.yaml"
+        }
+    }'
+```
+
+**Note**: See the example `ucm_config.yaml` under https://github.com/ModelEngine-Group/unified-cache-management/tree/develop/examples for how to set the `metrics_config_path` parameter.
+
+You can use the `vllm bench serve` command to run benchmarks:
+
+```bash
+vllm bench serve \
+    --backend vllm \
+    --model /home/models/Qwen2.5-14B-Instruct \
+    --host 127.0.0.1 \
+    --port 8000 \
+    --dataset-name random \
+    --num-prompts 20 \
+    --random-input-len 200 \
+    --random-output-len 10 \
+    --request-rate 1 \
+    --ignore-eos
+```
+
+Once the HTTP server is running, you can access the UCM metrics at the `/metrics` endpoint.
+
+```bash
+curl http://<vllm-worker-ip>:8000/metrics | grep ucm:
+```
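The same check can be scripted. The following is a minimal sketch, not part of UCM: `filter_ucm_samples` and `fetch_ucm_samples` are hypothetical helper names, and the base URL passed in is an assumption that should mirror the `--host` and `--port` values used above. Unlike `grep ucm:`, it keeps only sample lines and drops the `# HELP` and `# TYPE` comment lines.

```python
from urllib.request import urlopen


def filter_ucm_samples(metrics_text: str, prefix: str = "ucm:") -> list[str]:
    """Return the sample lines whose metric name starts with `prefix`."""
    return [
        line
        for line in metrics_text.splitlines()
        # Comment lines start with '#', so they never match the prefix.
        if line.startswith(prefix)
    ]


def fetch_ucm_samples(base_url: str, timeout: float = 5.0) -> list[str]:
    """Fetch `<base_url>/metrics` and keep only the UCM samples."""
    with urlopen(f"{base_url}/metrics", timeout=timeout) as response:
        return filter_ucm_samples(response.read().decode("utf-8"))
```

For example, `fetch_ucm_samples("http://127.0.0.1:8000")` would return the UCM sample lines from a server started with the command above, assuming it is reachable on that address.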
+
+You will also find `.db` files in the `$PROMETHEUS_MULTIPROC_DIR` directory. These are temporary files that the Prometheus Python client writes in multiprocess mode to share metric values across processes.
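A quick way to confirm that the directory is being populated is to list those files. This is a small sketch under the assumption that the variable is set in the current environment; `list_multiproc_db_files` is a hypothetical helper, not a UCM or Prometheus API.

```python
import os
from pathlib import Path


def list_multiproc_db_files(directory=None) -> list[str]:
    """Return the sorted names of `.db` files in the multiprocess directory."""
    # Fall back to the environment variable when no directory is passed.
    root = directory or os.environ.get("PROMETHEUS_MULTIPROC_DIR")
    if not root:
        raise ValueError("PROMETHEUS_MULTIPROC_DIR is not set")
    return sorted(path.name for path in Path(root).glob("*.db"))
```

An empty result after the server has handled requests suggests that the variable was not exported in the shell that launched `vllm serve`.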
+
+### 2) Start Prometheus and Grafana with Docker Compose
+
+#### Create Docker Compose Configuration Files
+
+First, create the `docker-compose.yaml` file:
+
+```yaml
+# docker-compose.yaml
+version: "3"
+
+services:
+  prometheus:
+    image: prom/prometheus:latest
+    extra_hosts:
+      - "host.docker.internal:host-gateway"
+    ports:
+      - "9090:9090"
+    volumes:
+      - ${PWD}/prometheus.yaml:/etc/prometheus/prometheus.yml
+
+  grafana:
+    image: grafana/grafana:latest
+    depends_on:
+      - prometheus
+    ports:
+      - "3000:3000"
+```
+
+Then, create the `prometheus.yaml` configuration file:
+
+```yaml
+# prometheus.yaml
+global:
+  scrape_interval: 5s
+  evaluation_interval: 30s
+
+scrape_configs:
+  - job_name: vllm
+    static_configs:
+      - targets:
+          - 'host.docker.internal:8000'
+```
+
+**Note**: Make sure the port number in `prometheus.yaml` matches the port number used when starting the vLLM service.
+
+#### Start Services
+
+Run the following command in the directory containing `docker-compose.yaml` and `prometheus.yaml`:
+
+```bash
+docker compose up
+```
+
+This starts the Prometheus and Grafana services.
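To confirm that Prometheus is actually scraping the vLLM endpoint, you can query its HTTP API at `/api/v1/targets`. The sketch below is illustrative only: `target_health` and `fetch_target_health` are hypothetical helpers, `http://localhost:9090` assumes the port mapping from `docker-compose.yaml`, and the job name `vllm` comes from `prometheus.yaml`.

```python
import json
from urllib.request import urlopen


def target_health(targets_payload: dict, job: str = "vllm") -> dict[str, str]:
    """Map each scrape URL of `job` to the health Prometheus reports for it."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return {
        target["scrapeUrl"]: target["health"]
        for target in active
        if target.get("labels", {}).get("job") == job
    }


def fetch_target_health(prometheus_url: str = "http://localhost:9090") -> dict[str, str]:
    """Query the Prometheus targets API and summarize the `vllm` job."""
    with urlopen(f"{prometheus_url}/api/v1/targets", timeout=5.0) as response:
        return target_health(json.load(response))
```

A health value other than `up` usually means the target host or port in `prometheus.yaml` does not match the running vLLM service.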
+
+### 3) Configure Grafana Dashboard
+
+#### Access Grafana
+
+Navigate to `http://<your-host>:3000`. Log in with the default username (`admin`) and password (`admin`). You will be prompted to change the password on first login.
+
+#### Add Prometheus Data Source
+
+1. Navigate to `http://<your-host>:3000/connections/datasources/new` and select **Prometheus**.
+
+2. On the Prometheus configuration page, enter the Prometheus server URL in the **Connection** section. In this Docker Compose setup, Grafana and Prometheus run in separate containers, but Compose lets each service resolve the other by its service name, so you can use `http://prometheus:9090` directly.
+
+3. Click **Save & Test**. You should see a green checkmark showing "Successfully queried the Prometheus API."
+
+#### Import Dashboard
+
+1. Navigate to `http://<your-host>:3000/dashboard/import`.
+
+2. Click **Upload JSON file**, then upload the `unified-cache-management/examples/metrics/grafana.json` file.
+
+3. Select the Prometheus data source configured earlier.
+
+4. Click **Import** to complete the import.
+
+You should now be able to see the UCM monitoring dashboard with real-time visualization of all 9 metrics.
+
+## Available Metrics
+
+UCM exposes various metrics to monitor its performance. The following table lists all available metrics organized by category:
+
+| Metric Name | Type | Description |
+|------------|------|-------------|
+| **Load Operation Metrics** | | |
+| `ucm:load_requests_num` | Histogram | Number of requests loaded per `start_load_kv` call |
+| `ucm:load_blocks_num` | Histogram | Number of blocks loaded per `start_load_kv` call |
+| `ucm:load_duration` | Histogram | Time to load KV cache from UCM (milliseconds) |
+| `ucm:load_speed` | Histogram | Speed of loading from UCM (GB/s) |
+| **Save Operation Metrics** | | |
+| `ucm:save_requests_num` | Histogram | Number of requests saved per `wait_for_save` call |
+| `ucm:save_blocks_num` | Histogram | Number of blocks saved per `wait_for_save` call |
+| `ucm:save_duration` | Histogram | Time to save to UCM (milliseconds) |
+| `ucm:save_speed` | Histogram | Speed of saving to UCM (GB/s) |
+| **Lookup Hit Rate Metrics** | | |
+| `ucm:interval_lookup_hit_rates` | Histogram | Hit rate of UCM lookup requests |
+
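Because every metric in the table is a histogram, each one is exported as `_bucket`, `_sum`, and `_count` series, and a mean can be derived as sum divided by count. The following is a minimal sketch, assuming the standard Prometheus text exposition format without timestamps; `histogram_mean` is a hypothetical helper, not a UCM API.

```python
def histogram_mean(metrics_text: str, metric: str):
    """Return sum/count for `metric`, aggregated over all label sets, or None."""
    totals = {"_sum": 0.0, "_count": 0.0}
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        parts = line.rsplit(None, 1)  # the value is the last token on the line
        if len(parts) != 2:
            continue
        series, value = parts
        name = series.split("{", 1)[0].strip()  # drop the label set, if any
        for suffix in totals:
            if name == metric + suffix:
                totals[suffix] += float(value)
    if totals["_count"] == 0:
        return None  # metric absent or never observed
    return totals["_sum"] / totals["_count"]
```

For instance, `histogram_mean(text, "ucm:load_duration")` gives the mean load time in milliseconds since the process started. For a mean over a recent window, compare two scrapes or use the Grafana dashboard instead.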
+## Prometheus Configuration
+
+Metrics configuration is defined in the `unified-cache-management/examples/metrics/metrics_configs.yaml` file:
+
+```yaml
+log_interval: 5  # Interval in seconds for logging metrics
+
+prometheus:
+  multiproc_dir: "/vllm-workspace"  # Prometheus directory
+  metric_prefix: "ucm:"  # Metric name prefix
+
+  enabled_metrics:
+    counters: true
+    gauges: true
+    histograms: true
+
+  histograms:
+    - name: "load_requests_num"
+      documentation: "Number of requests loaded from ucm"
+      buckets: [1, 5, 10, 20, 50, 100, 200, 500, 1000]
+    # ... other metric configurations
+```
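The `buckets` list sets the upper bounds of the histogram. Prometheus buckets are cumulative: each bucket counts every observation less than or equal to its bound, and an implicit `+Inf` bucket counts all of them. The sketch below illustrates that rule in plain Python; `cumulative_bucket_counts` is a hypothetical helper written for this explanation, not how UCM or the Prometheus client implements histograms.

```python
from bisect import bisect_right


def cumulative_bucket_counts(observations, buckets):
    """Return {upper_bound: number of observations <= upper_bound}, plus +Inf."""
    ordered = sorted(observations)
    # bisect_right gives the count of values <= bound in a sorted list.
    counts = {float(bound): bisect_right(ordered, bound) for bound in sorted(buckets)}
    counts[float("inf")] = len(ordered)  # the implicit +Inf bucket holds everything
    return counts
```

With the `load_requests_num` buckets shown above, a call that loads 7 requests is counted in the bucket with bound 10 and in every larger bucket, but not in the buckets with bounds 1 and 5.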
+
+---
+
@@ -79,6 +79,5 @@ performance.
 
 :::{toctree}
 :maxdepth: 1
-dram_store
 nfs_store
 :::
@@ -135,7 +135,7 @@ def _get_package_data_with_so():
 
 setup(
     name="uc-manager",
-    version="0.1.0rc3",
+    version="0.1.0rc4",
     description="Unified Cache Management",
     author="Unified Cache Team",
     packages=find_packages(),

@@ -1,3 +1,4 @@
+import functools
 from typing import Any, Dict, List
 
 from common.db_utils import write_to_db
@@ -44,6 +45,7 @@ def post_process(table_name: str, **kwargs) -> List[Dict[str, Any]]:
 
 # ---------------- decorator ----------------
 def export_vars(func):
+    @functools.wraps(func)
     def wrapper(*args, **kwargs):
         result = func(*args, **kwargs)
         # If the function returns a dict containing '_data' or 'data', post-process it