Resources ‐ Resources Utilization

Description

The Resources Utilization rule collects and aggregates cluster resource utilization data across all nodes. This is an informational rule that provides comprehensive visibility into how cluster resources are allocated and consumed.

Collected Metrics:

Capacity: Total resource capacity per node
Allocatable: Resources available for pod scheduling (capacity minus system reservations)
Requests: Sum of all pod resource requests on the node
Limits: Sum of all pod resource limits on the node
Utilization Levels: Categorization of resource usage (low < 50%, medium 50-74%, high ≥ 75%)
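
The thresholds above can be sketched as a small shell function. This is only an illustration, not the rule's actual implementation: the name `utilization_level` is hypothetical, and it assumes the percentage has already been rounded to an integer (how the rule treats fractional values between 74% and 75% is not stated here).

```shell
# Hypothetical helper: map an integer percentage to a utilization level
# using the thresholds listed above (low < 50, medium 50-74, high >= 75).
utilization_level() {
  pct=$1
  if [ "$pct" -ge 75 ]; then
    echo "high"
  elif [ "$pct" -ge 50 ]; then
    echo "medium"
  else
    echo "low"
  fi
}

utilization_level 92   # high
utilization_level 50   # medium
utilization_level 35   # low
```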

Resource Categories:

Core Resources: cpu, memory, ephemeral-storage, hugepages-*
Extended Resources: Custom resources (GPUs, SR-IOV virtual functions, FPGAs, etc.)
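
One way to picture the split between the two categories is a pattern match on the resource name. The sketch below is an assumption about how such a classification could work, with a hypothetical function name. Any name outside the core list (including built-in entries such as `pods`) falls through to "extended" here, which may differ from what the rule does.

```shell
# Hypothetical helper: classify a resource name as "core" or "extended"
# based on the core resource names listed above.
resource_category() {
  case "$1" in
    cpu|memory|ephemeral-storage|hugepages-*) echo "core" ;;
    *) echo "extended" ;;
  esac
}

resource_category cpu              # core
resource_category hugepages-2Mi    # core
resource_category nvidia.com/gpu   # extended
```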

This rule runs at the orchestrator level and aggregates data from all cluster nodes to provide a cluster-wide resource utilization view.

Value

Understanding resource utilization is critical for:

Capacity Planning: Identify when the cluster needs scaling (nodes near capacity)
Cost Optimization: Detect over-provisioned resources (low utilization)
Performance Troubleshooting: Find resource contention or throttling
Scheduling Issues: Understand why pods are pending (insufficient allocatable resources)
Resource Quotas: Validate that workload resource requests align with actual capacity
Hardware Validation: Ensure nodes expose expected resources (GPUs, SR-IOV devices, hugepages)

Diagnostics

Manually check resource utilization using these commands:

# List all nodes with capacity and allocatable resources
# (node roles are shown in the ROLES column of plain `oc get nodes`)
oc get nodes -o custom-columns=NAME:.metadata.name,CPU-CAPACITY:.status.capacity.cpu,CPU-ALLOCATABLE:.status.allocatable.cpu,MEMORY-CAPACITY:.status.capacity.memory,MEMORY-ALLOCATABLE:.status.allocatable.memory

# Get detailed resource breakdown for a specific node
oc describe node <node-name>

# Check allocated resources percentage (from "Allocated resources" section)
oc describe node <node-name> | grep -A 10 "Allocated resources:"

# List capacity and allocatable for every node, including extended resources (GPUs, SR-IOV, etc.)
oc get nodes -o json | jq '.items[] | {name: .metadata.name, capacity: .status.capacity, allocatable: .status.allocatable}'

# Check per-pod resource requests and limits on a node (from "Non-terminated Pods" section)
oc describe node <node-name> | sed -n '/Non-terminated Pods:/,/Allocated resources:/p'

Interpreting Results

The rule returns resource data with utilization levels for each node:

low (< 50%): Healthy utilization, room for growth
medium (50-74%): Moderate utilization, monitor for growth
high (≥ 75%): Near capacity, consider scaling or rebalancing workloads
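
As a rough illustration of where a percentage comes from, the sketch below divides CPU requests by allocatable CPU after converting both to millicores. It assumes the denominator is allocatable (as in the "Allocated resources" section of `oc describe node`) and that the result is rounded to the nearest integer. The function names and input values are hypothetical.

```shell
# Hypothetical helper: convert a CPU quantity ("4", "0.5", "1500m") to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
  esac
}

# Hypothetical helper: requests as an integer percentage of allocatable,
# rounded to the nearest whole number (assumed rounding).
request_percentage() {
  req=$(to_millicores "$1")
  alloc=$(to_millicores "$2")
  awk -v r="$req" -v a="$alloc" 'BEGIN {
    if (a == 0) { print 0; exit }          # avoid division by zero
    printf "%d\n", (r * 100 / a) + 0.5
  }'
}

request_percentage 3000m 4      # 75
request_percentage 500m 2000m   # 25
```

A result of 75 would fall in the high band, and 25 in the low band.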

Example Output Structure:

{
  "nodes": [
    {
      "name": "worker-0",
      "roles": ["worker"],
      "schedulable": true,
      "core_resources": {
        "cpu": {
          "capacity": "8 cores",
          "allocatable": "7500m",
          "requests": {
            "allocated": "6933m",
            "percentage": "92%",
            "utilization_level": "high"
          },
          "limits": {
            "allocated": "2660m",
            "percentage": "35%",
            "utilization_level": "low"
          }
        },
        "memory": {
          "capacity": "32Gi",
          "allocatable": "31000Mi",
          "requests": {
            "allocated": "25724Mi",
            "percentage": "83%",
            "utilization_level": "high"
          },
          "limits": {
            "allocated": "30212Mi",
            "percentage": "97%",
            "utilization_level": "high"
          }
        },
        "ephemeral-storage": {
          "capacity": "191655242229B",
          "allocatable": "176303616Ki",
          "requests": {
            "allocated": "0",
            "percentage": "0%",
            "utilization_level": "low"
          },
          "limits": {
            "allocated": "0",
            "percentage": "0%",
            "utilization_level": "low"
          }
        }
      },
      "extended_resources": {
        "nvidia.com/gpu": {
          "capacity": "2",
          "allocatable": "2",
          "requests": {
            "allocated": "1",
            "percentage": "50%",
            "utilization_level": "medium"
          }
        }
      }
    }
  ]
}
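
The values in the example use mixed units (B, Ki, Mi, Gi), so comparing them by hand means normalizing to a common unit first. The helper below is a minimal sketch that converts such strings to bytes. The function name is hypothetical, and it handles only the binary suffixes that appear in this example plus Ti.

```shell
# Hypothetical helper: convert a quantity such as "64Gi", "176303616Ki",
# or "191655242229B" to a plain byte count.
to_bytes() {
  awk -v q="$1" 'BEGIN {
    n = q; sub(/[A-Za-z]+$/, "", n)    # numeric part
    u = q; sub(/^[0-9.]+/, "", u)      # unit suffix (may be empty)
    m[""] = 1; m["B"] = 1
    m["Ki"] = 1024; m["Mi"] = 1024^2; m["Gi"] = 1024^3; m["Ti"] = 1024^4
    if (!(u in m)) { exit 1 }          # unsupported suffix
    printf "%.0f\n", n * m[u]
  }'
}

to_bytes 64Gi            # 68719476736
to_bytes 176303616Ki     # 180534902784
to_bytes 191655242229B   # 191655242229
```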
