
<fix>[prometheus]: add label in gpu#508

Closed
zstack-robot-2 wants to merge 1 commit into feature-5.1.8-GPU-monitor from
sync/qiuyu.zhang/zstac-65807@@2

Conversation

@zstack-robot-2

Resolves: ZSTAC-65807

Change-Id: I696675636b66726a6d6d6c68796e7a756f6a6972

sync from gitlab !4770

@coderabbitai

coderabbitai bot commented May 17, 2024

Walkthrough

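This change mainly modifies the GPU monitoring metrics in the Prometheus plugin. All GPU-related metrics now carry a gpu_serial label so that individual GPUs can be identified more precisely. In addition, the gpu_state metric gains a gpuState label, and its description has been improved to reflect the GPU's state more accurately.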

Changes

File path: kvmagent/kvmagent/plugins/prometheus.py
Change summary: Updated several GaugeMetricFamily instances to add the gpu_serial label, and adjusted the description and labels of the gpu_state metric.
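
To make the effect of the extra labels concrete, here is a minimal sketch of how one gpu_state sample might look once it is exposed in the Prometheus text format. It renders the line by hand with the standard library only and does not use the plugin's GaugeMetricFamily objects. The label names, the PCI address, and the serial number are illustrative assumptions; only the order of the label values comes from the diff.

```python
def render_sample(name, labels, value):
    """Render one gauge sample in the Prometheus text exposition format.

    `labels` is a list of (label_name, label_value) pairs. The label names
    used below are assumptions for illustration, not taken from the plugin.
    """
    def escape(label_value):
        # Label values escape backslash, double quote and newline.
        return (str(label_value)
                .replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))

    pairs = ",".join('%s="%s"' % (key, escape(val)) for key, val in labels)
    return "%s{%s} %s" % (name, pairs, float(value))


# Hypothetical PCI address and serial number, for illustration only.
line = render_sample(
    "gpu_state",
    [("pci_device_address", "0000:3b:00.0"),
     ("gpuState", "nominal"),
     ("gpu_serial", "SN-EXAMPLE-0001")],
    1,
)
print(line)
```

With gpu_serial present on every series, two cards that swap PCI slots can still be told apart in queries, which appears to be the motivation for the change.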

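🐰
Code changes arrive like a spring breeze,
GPU metrics grow more precise.
Labels gain a serial number,
State descriptions become clearer.
A fresh look for the Prometheus plugin,
A wider world of monitoring.
🌸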




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Out of diff range and nitpick comments (2)
kvmagent/kvmagent/plugins/prometheus.py (2)

Line range hint 447-447: Avoid writing multiple statements on a single line.

-    if ALARM_CONFIG is None: return
+    if ALARM_CONFIG is None:
+        return

Line range hint 1617-1619: Avoid writing multiple statements on a single line.

-    if output == "" or output is None: PAGE_SIZE = 4096
+    if output == "" or output is None:
+        PAGE_SIZE = 4096
Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between c29d5c8 and 7e37c08.
Files selected for processing (1)
  • kvmagent/kvmagent/plugins/prometheus.py (4 hunks)
Additional Context Used
Ruff (4)
kvmagent/kvmagent/plugins/prometheus.py (4)

447-447: SyntaxError: Simple statements must be separated by newlines or semicolons


1617-1617: Multiple statements on one line (colon)


1618-1618: Multiple statements on one line (colon)


1619-1619: Multiple statements on one line (colon)

Additional comments not posted (4)
kvmagent/kvmagent/plugins/prometheus.py (4)

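1274-1281: Added the new labels gpu_serial and gpuState, and updated the description of gpu_state.


1322-1323: Added the new gpu_serial label to the gpu_rxpci_in_bytes and gpu_txpci_in_bytes metrics.


1340-1347: Added the new labels gpu_serial and gpuState to the AMD GPU metrics.


1363-1370: Make sure the convert_pci_state_to_int function correctly handles all possible PCI states.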

Comment on lines +1302 to +1308
metrics['gpu_power_draw'].add_metric([pci_device_address, gpu_serial], float(info[0].replace('W', '').strip()))
metrics['gpu_temperature'].add_metric([pci_device_address, gpu_serial], float(info[1].strip()))
metrics['gpu_fan_speed'].add_metric([pci_device_address, gpu_serial], float(info[2].replace('%', '').strip()))
metrics['gpu_utilization'].add_metric([pci_device_address, gpu_serial], float(info[3].replace('%', '').strip()))
metrics['gpu_memory_utilization'].add_metric([pci_device_address, gpu_serial], float(info[4].replace('%', '').strip()))
gpuState, gpu_state_int_value = convert_pci_state_to_int(pci_device_address)
metrics['gpu_state'].add_metric([pci_device_address, gpuState, gpu_serial], gpu_state_int_value)
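
The lines above call float() directly on each field after stripping a unit character, which raises ValueError if a field is not numeric (for instance a placeholder such as N/A). A minimal sketch of a more defensive parser follows; parse_reading is a hypothetical helper, not part of the plugin, and the sample strings are invented for illustration.

```python
def parse_reading(raw, unit=""):
    """Convert a raw reading such as ' 35.20 W' or '40 %' to a float.

    Returns None instead of raising when the field is not numeric.
    `parse_reading` is a hypothetical helper for illustration only.
    """
    text = raw.strip()
    if unit and text.endswith(unit):
        # Drop the trailing unit, then any whitespace that preceded it.
        text = text[:-len(unit)].strip()
    try:
        return float(text)
    except ValueError:
        return None


power = parse_reading(" 35.20 W", "W")
fan = parse_reading("40 %", "%")
missing = parse_reading("[N/A]", "%")
```

A caller could then skip add_metric when the result is None rather than letting one unreadable field abort the whole collection pass.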

Tip

Codebase Verification

def convert_pci_state_to_int(pci_address):
    r, pci_status = bash_ro("lspci -s %s| grep -i 'ref ff'" % pci_address)
    if r == 0 and len(pci_status.strip()) != 0:
        return "critical", 0

    return "nominal", 1

The convert_pci_state_to_int function uses the lspci command to check the PCI status. If the status contains 'ref ff', it returns "critical" with a value of 0. Otherwise, it returns "nominal" with a value of 1. This function appears to handle two states: "critical" and "nominal".

Based on this implementation, it seems the function is designed to handle these specific states. However, it may not cover all possible PCI states comprehensively.
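
One way to make the two-state mapping easier to check is to separate the decision from the shell call. The sketch below mirrors the logic quoted above as a pure function that takes the return code and output of the lspci pipeline as arguments; pci_state_from_lspci is a hypothetical name, and the sample output string is invented.

```python
def pci_state_from_lspci(returncode, output):
    """Map the result of the `lspci ... | grep -i 'ref ff'` pipeline to a state.

    Mirrors the quoted logic: a successful grep with non-empty output means
    "critical" (0); anything else means "nominal" (1).
    Hypothetical helper, shown only to illustrate the mapping.
    """
    if returncode == 0 and output.strip():
        return "critical", 0
    return "nominal", 1


# Invented sample output for illustration.
critical = pci_state_from_lspci(0, "3b:00.0 VGA compatible controller (ref ff)\n")
nominal = pci_state_from_lspci(1, "")
blank = pci_state_from_lspci(0, "   \n")
```

Because the function does not run any command itself, each state can be exercised in a unit test without a GPU present.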

Final Response

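Make sure the convert_pci_state_to_int function correctly handles all possible PCI states. At present the function handles only two states, "critical" and "nominal", and may need to be extended to cover all possible PCI states.

  • File: kvmagent/kvmagent/plugins/prometheus.py
  • Function: convert_pci_state_to_int
Analysis chain

Make sure the convert_pci_state_to_int function correctly handles all possible PCI states.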

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Check the implementation of `convert_pci_state_to_int` to make sure it handles all PCI states correctly.
ast-grep --lang python --pattern $'def convert_pci_state_to_int($_, $_) { $$$ }'

Length of output: 548



Script:

#!/bin/bash
# Correcting the pattern to search for the `convert_pci_state_to_int` function implementation.
ast-grep --lang python --pattern 'def convert_pci_state_to_int($_) { $$$ }'

Length of output: 540



Script:

#!/bin/bash
# Correcting the pattern to search for the `convert_pci_state_to_int` function implementation.
ast-grep --lang python --pattern 'def convert_pci_state_to_int($_):'

Length of output: 70



Script:

#!/bin/bash
# Using `rg` to search for the `convert_pci_state_to_int` function in the codebase.
rg 'def convert_pci_state_to_int'

Length of output: 118



Script:

#!/bin/bash
# Extracting the implementation of the `convert_pci_state_to_int` function from the specified file.
ast-grep --lang python --pattern 'def convert_pci_state_to_int($_): { $$$ }' kvmagent/kvmagent/plugins/prometheus.py

Length of output: 118



Script:

#!/bin/bash
# Displaying the contents of the file to locate the `convert_pci_state_to_int` function.
cat kvmagent/kvmagent/plugins/prometheus.py

Length of output: 78715

Comment on lines +1378 to +1380
     return "critical", 0

-    return 1
+    return "nominal", 1

The convert_pci_state_to_int function should check the contents of pci_status before returning, to ensure the state conversion is accurate.

-    if r == 0 and len(pci_status.strip()) != 0:
+    if r == 0 and 'ref ff' in pci_status:
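
One caveat with this suggestion: the shell pipeline filters with grep -i, which is case-insensitive, while Python's in operator is case-sensitive. A minimal sketch of a check that keeps the case-insensitive behaviour is shown here; has_marker is a hypothetical helper and the sample string is invented.

```python
import re


def has_marker(pci_status, marker="ref ff"):
    """Return True if `marker` occurs in `pci_status`, ignoring case.

    Hypothetical helper; matches the case-insensitive behaviour of `grep -i`.
    """
    return re.search(re.escape(marker), pci_status, re.IGNORECASE) is not None


sample = "3b:00.0 VGA compatible controller (REF FF)"  # invented sample
case_sensitive = "ref ff" in sample
case_insensitive = has_marker(sample)
```

Here the plain substring test misses the upper-case match while has_marker finds it, so the two conditions are not interchangeable.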

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
return "critical", 0
return 1
return "nominal", 1
return "critical", 0
if r == 0 and 'ref ff' in pci_status:
return "nominal", 1
