Kylerkang new dashboards #23427

Merged
zoedt merged 6 commits into master from kylerkang-new-dashboards
Apr 30, 2026

Conversation

@kylerkang
Contributor

What does this PR do?

Motivation

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

Updates the 'Network Device Monitoring' OOTB dashboard (hover over 'Dashboards' in the global bar).
Updates the 'Interface Performance' OOTB dashboard found in the 'Dashboards' dropdown.
@github-actions
Contributor

⚠️ Recommendation: Add qa/skip-qa label

This PR does not modify any files shipped with the agent.

To help streamline the release process, please consider adding the qa/skip-qa label if these changes do not require QA testing.

@kylerkang kylerkang requested review from Copilot and removed request for Pierre-L42 April 22, 2026 17:46
@dd-octo-sts
Contributor

dd-octo-sts Bot commented Apr 22, 2026

Validation Report

All 20 validations passed.

| Validation | Description | Status |
| --- | --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file | passed |
| ci | Validate CI configuration and Codecov settings | passed |
| codeowners | Validate every integration has a CODEOWNERS entry | passed |
| config | Validate default configuration files against spec.yaml | passed |
| dep | Verify dependency pins are consistent and Agent-compatible | passed |
| http | Validate integrations use the HTTP wrapper correctly | passed |
| imports | Validate check imports do not use deprecated modules | passed |
| integration-style | Validate check code style conventions | passed |
| jmx-metrics | Validate JMX metrics definition files and config | passed |
| labeler | Validate PR labeler config matches integration directories | passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | passed |
| license-headers | Validate Python files have proper license headers | passed |
| licenses | Validate third-party license attribution list | passed |
| metadata | Validate metadata.csv metric definitions | passed |
| models | Validate configuration data models match spec.yaml | passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | passed |
| package | Validate Python package metadata and naming | passed |
| readmes | Validate README files have required sections | passed |
| saved-views | Validate saved view JSON file structure and fields | passed |
| version | Validate version consistency between package and changelog | passed |

Contributor

Copilot AI left a comment

Pull request overview

This PR refreshes the SNMP dashboard JSON assets, expanding the existing Interface Performance and BGP/OSPF Overview dashboards with new sections, widgets, and updated descriptions to provide more actionable visibility into interface behavior and routing health.

Changes:

  • Reworked Interface Performance into multiple grouped sections (overview, interface tables, throughput/utilization, errors/discards, status, packet mix) and added a new NetFlow section.
  • Reorganized BGP/OSPF Overview into clearer “device context”, “BGP session health”, and “OSPF IGP health” groups with additional snapshot and trend widgets.
  • Updated template variable blocks and added pause_auto_refresh.

Reviewed changes

Copilot reviewed 1 out of 5 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| snmp/assets/dashboards/interface_performance.json | Major dashboard restructure and new widgets for interface KPIs; adds a NetFlow section. |
| snmp/assets/dashboards/bgp_ospf_overview.json | Reorganized routing overview into grouped sections; adds uptime/context widgets and BGP/OSPF trend views. |

Comment on lines 1432 to 1459
@@ -100,31 +1455,49 @@
],
"yaxis": {
"scale": "linear",
- "label": "",
+ "label": "discards/s",
"include_zero": true,
Copilot AI Apr 22, 2026

This timeseries labels the y-axis as discards/s, but the query uses the raw counter snmp.ifInDiscards without converting it to a per-second rate (e.g., via .as_count() / rate). Update the query so the data matches the displayed units.
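
To make the unit mismatch concrete, here is a minimal Python sketch of how a monotonically increasing counter relates to a per-second rate. The `counter_to_rate` helper and the sample readings are hypothetical illustrations, not code from this PR or from Datadog, and the sketch assumes the counter never resets or wraps.

```python
from typing import List, Tuple


def counter_to_rate(samples: List[Tuple[float, int]]) -> List[float]:
    """Convert (timestamp_seconds, counter_value) samples to per-second rates.

    Assumes the counter only increases between samples (no reset or wrap handling).
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((v1 - v0) / (t1 - t0))
    return rates


# Hypothetical discard-counter readings taken every 15 seconds.
samples = [(0.0, 1000), (15.0, 1030), (30.0, 1030), (45.0, 1105)]
rates = counter_to_rate(samples)
print(rates)  # [2.0, 0.0, 5.0]
```

Plotting the raw values would show a line climbing from 1000 to 1105, while the converted series shows 2, 0, and 5 discards per second, which is what a `discards/s` axis label promises.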

Comment on lines 1487 to 1514
@@ -137,85 +1510,272 @@
],
"yaxis": {
"scale": "linear",
- "label": "",
+ "label": "discards/s",
"include_zero": true,
Copilot AI Apr 22, 2026

This timeseries labels the y-axis as discards/s, but the query uses the raw counter snmp.ifOutDiscards without converting it to a per-second rate (e.g., via .as_count() / rate). Update the query so the data matches the displayed units.

Comment on lines +1895 to +1905
"query": "avg:snmp.ifHCInUcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}"
},
{
"data_source": "metrics",
"name": "query2",
"query": "avg:snmp.ifHCInMulticastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}"
},
{
"data_source": "metrics",
"name": "query3",
"query": "avg:snmp.ifHCInBroadcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}"
Copilot AI Apr 22, 2026

The packet counters (snmp.ifHC*Pkts) are monotonic counters, but these charts label the unit as pkts/s without converting the counters to rates (e.g., via .as_count() / rate). As written, this will graph cumulative packet counts rather than packets per second. Convert these queries to rates or update the labels to reflect counts.

Suggested change
- "query": "avg:snmp.ifHCInUcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}"
- },
- {
- "data_source": "metrics",
- "name": "query2",
- "query": "avg:snmp.ifHCInMulticastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}"
- },
- {
- "data_source": "metrics",
- "name": "query3",
- "query": "avg:snmp.ifHCInBroadcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}"
+ "query": "avg:snmp.ifHCInUcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}.as_rate()"
+ },
+ {
+ "data_source": "metrics",
+ "name": "query2",
+ "query": "avg:snmp.ifHCInMulticastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}.as_rate()"
+ },
+ {
+ "data_source": "metrics",
+ "name": "query3",
+ "query": "avg:snmp.ifHCInBroadcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}.as_rate()"

{
"data_source": "metrics",
"name": "query1",
"query": "avg:snmp.ifBandwidthInUsage.rate{*} by {snmp_host,interface,interface_alias}",
Copilot AI Apr 22, 2026

Several toplists in this section hard-code {*} instead of using the dashboard template variables, so they ignore $snmp_host, $snmp_device, and $interface even though the dashboard description says those filters scope the board. Update these toplist queries to use the template variables so filtering behaves consistently.

Suggested change
- "query": "avg:snmp.ifBandwidthInUsage.rate{*} by {snmp_host,interface,interface_alias}",
+ "query": "avg:snmp.ifBandwidthInUsage.rate{$snmp_host,$snmp_device,$interface} by {snmp_host,interface,interface_alias}",
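
One way to catch queries that bypass the template variables is to walk the dashboard JSON and flag every query scoped to `{*}`. The following is a rough sketch only: the `iter_queries` and `unscoped_queries` helpers are hypothetical, and the tiny `dashboard` dict is a stand-in for the real asset file, which is much larger.

```python
import re
from typing import Any, Iterator, List

# Matches a wildcard scope such as "{*}" or "{ * }".
WILDCARD_SCOPE = re.compile(r"\{\s*\*\s*\}")


def iter_queries(node: Any) -> Iterator[str]:
    """Yield every string stored under a "query" key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "query" and isinstance(value, str):
                yield value
            else:
                yield from iter_queries(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_queries(item)


def unscoped_queries(dashboard: dict) -> List[str]:
    """Return the queries whose scope is the wildcard instead of template variables."""
    return [q for q in iter_queries(dashboard) if WILDCARD_SCOPE.search(q)]


# Hypothetical, heavily trimmed dashboard structure for illustration.
dashboard = {
    "widgets": [
        {"definition": {"requests": [{"queries": [
            {"name": "query1",
             "query": "avg:snmp.ifBandwidthInUsage.rate{*} by {snmp_host,interface,interface_alias}"},
            {"name": "query2",
             "query": "avg:snmp.ifHCInUcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}"},
        ]}]}}
    ]
}
flagged = unscoped_queries(dashboard)
```

Here only the first query is flagged. Some widgets may use `{*}` on purpose, so the output is a list to review rather than a list of bugs.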

Comment on lines +584 to +640
"id": 8222666460674068,
"definition": {
"title": "Top 10 Interface Throughput (outbound bps)",
"title_size": "16",
"title_align": "left",
"type": "toplist",
"requests": [
{
"response_format": "scalar",
"queries": [
{
"data_source": "metrics",
"name": "query1",
"query": "avg:snmp.ifHCOutOctets.rate{*} by {snmp_host,interface,interface_alias}",
"aggregator": "avg"
}
],
"formulas": [
{
"alias": "bps",
"formula": "query1 * 8"
}
],
"sort": {
"count": 10,
"order_by": [
{
"type": "formula",
"index": 0,
"order": "desc"
}
]
}
}
],
"custom_links": [
{
"link": "/screen/integration/Interface%20Performance?live=true&tpl_var_snmp_host={{snmp_host.value}}&tpl_var_interface={{interface.value}}",
"label": "Interface Performance"
}
],
"style": {}
},
"layout": {
"x": 0,
"y": 7,
"width": 6,
"height": 3
}
},
{
"id": 1210504657783681,
"definition": {
"title": "Top 10 Interface Throughput (inbound bps)",
"title_size": "16",
"title_align": "left",
"type": "toplist",
Copilot AI Apr 22, 2026

This throughput section contains duplicate toplist widgets with the exact same title, query, and link (the second pair at y=7 duplicates the pair at y=4). Remove the duplicates or change the queries/titles so each widget adds distinct information.
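
Duplicates like these can be found mechanically by comparing widgets on everything except their `id` and `layout`. The sketch below is an assumption about how such a check might look; `widget_signature`, `find_duplicates`, the widget IDs, and the inbound query string are all hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List


def widget_signature(widget: dict) -> str:
    """Serialize the parts of a widget that determine what it displays."""
    definition = widget.get("definition", {})
    key = {
        "title": definition.get("title"),
        "type": definition.get("type"),
        "requests": definition.get("requests"),
        "custom_links": definition.get("custom_links"),
    }
    return json.dumps(key, sort_keys=True)


def find_duplicates(widgets: List[dict]) -> List[list]:
    """Group widget ids that share a signature; return groups with more than one id."""
    groups: Dict[str, list] = defaultdict(list)
    for widget in widgets:
        groups[widget_signature(widget)].append(widget.get("id"))
    return [ids for ids in groups.values() if len(ids) > 1]


def toplist(widget_id: int, title: str, query: str, y: int) -> dict:
    """Build a trimmed, hypothetical toplist widget for the example."""
    return {
        "id": widget_id,
        "definition": {
            "title": title,
            "type": "toplist",
            "requests": [{"queries": [{"name": "query1", "query": query}]}],
        },
        "layout": {"x": 0, "y": y, "width": 6, "height": 3},
    }


out_q = "avg:snmp.ifHCOutOctets.rate{*} by {snmp_host,interface,interface_alias}"
in_q = "avg:snmp.ifHCInOctets.rate{*} by {snmp_host,interface,interface_alias}"  # hypothetical
widgets = [
    toplist(1, "Top 10 Interface Throughput (outbound bps)", out_q, y=4),
    toplist(2, "Top 10 Interface Throughput (inbound bps)", in_q, y=4),
    toplist(3, "Top 10 Interface Throughput (outbound bps)", out_q, y=7),
]
duplicates = find_duplicates(widgets)
```

Widgets 1 and 3 differ only in layout, so they land in the same group, while widget 2 is left alone.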

"id": 3763492863117901,
"definition": {
"type": "note",
"content": "**Tip:** To fully understand the current state of an interface, we compare two different metrics, `snmp.AdminStatus` and `snmp.OperStatus`.\n\n**Admin Status** is defined by the owner of the device. This is the state a port is supposed to be at configuration time.\n\nMeanwhile, the actual current state of the interface is known as the **Operational Status**.\n\nWe can infer a healthy interface when the interface is operationally the same as its administratively defined states. ",
Copilot AI Apr 22, 2026

This note references snmp.AdminStatus and snmp.OperStatus, but the dashboard queries use snmp.ifAdminStatus and snmp.ifOperStatus. Update the note to match the actual metric names to avoid confusing users.

Suggested change
- "content": "**Tip:** To fully understand the current state of an interface, we compare two different metrics, `snmp.AdminStatus` and `snmp.OperStatus`.\n\n**Admin Status** is defined by the owner of the device. This is the state a port is supposed to be at configuration time.\n\nMeanwhile, the actual current state of the interface is known as the **Operational Status**.\n\nWe can infer a healthy interface when the interface is operationally the same as its administratively defined states. ",
+ "content": "**Tip:** To fully understand the current state of an interface, we compare two different metrics, `snmp.ifAdminStatus` and `snmp.ifOperStatus`.\n\n**Admin Status** is defined by the owner of the device. This is the state a port is supposed to be at configuration time.\n\nMeanwhile, the actual current state of the interface is known as the **Operational Status**.\n\nWe can infer a healthy interface when the interface is operationally the same as its administratively defined states. ",

"query": "max:snmp.sysUpTimeInstance{$profile,$snmp_host,$snmp_device} by {snmp_host}",
"data_source": "metrics",
"name": "query1",
"aggregator": "min"
Copilot AI Apr 22, 2026

The uptime distribution query uses max:snmp.sysUpTimeInstance... but the query-level aggregator is set to min, which will skew results toward the smallest uptime in the time window rather than the current/max value per device. Set the query aggregator to match the intent (typically last or max) so the distribution reflects current device uptimes.

Suggested change
- "aggregator": "min"
+ "aggregator": "max"
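
The effect of the aggregator choice is easy to see on a small example. This sketch reduces two hypothetical uptime series (arbitrary units, invented values) with `min`, `max`, and `last`; it only models the reduction step and is not how Datadog evaluates the query.

```python
from typing import Callable, Dict, List

# Reducers that collapse a window of samples into one scalar per device.
AGGREGATORS: Dict[str, Callable[[List[int]], int]] = {
    "min": min,
    "max": max,
    "last": lambda values: values[-1],
}


def reduce_all(samples: List[int]) -> Dict[str, int]:
    """Apply every aggregator to the same window of samples."""
    return {name: fn(samples) for name, fn in AGGREGATORS.items()}


# Hypothetical uptime samples for a device that stayed up for the whole window.
steady = reduce_all([86_400, 86_700, 87_000, 87_300])

# Hypothetical samples for a device that rebooted partway through the window.
rebooted = reduce_all([87_000, 87_300, 120, 420])
```

For the steady device, `min` reports the uptime at the start of the window (86400) while `max` and `last` both report the current value (87300). For the rebooted device the three disagree (`min` 120, `max` 87300, `last` 420), and only `last` reflects the uptime right now, which is one reason to weigh `last` against `max` for this widget.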

"response_format": "scalar",
"queries": [
{
"query": "avg:snmp.sysUpTimeInstance{$profile,$snmp_host,$snmp_device,$interface} by {snmp_host}",
Copilot AI Apr 22, 2026

snmp.sysUpTimeInstance is a device-level metric and typically does not carry an interface tag. Including $interface in this query will cause the widget to go empty whenever the user filters to a specific interface. Remove $interface from the metric scope for this uptime toplist.

Suggested change
- "query": "avg:snmp.sysUpTimeInstance{$profile,$snmp_host,$snmp_device,$interface} by {snmp_host}",
+ "query": "avg:snmp.sysUpTimeInstance{$profile,$snmp_host,$snmp_device} by {snmp_host}",
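
The reason the widget goes empty can be shown with a simplified model of tag scoping: a series is returned only if it carries every tag in the scope. The `matches_scope` helper and the tag values below are hypothetical, and real query evaluation is more involved than this.

```python
from typing import Dict


def matches_scope(series_tags: Dict[str, str], scope: Dict[str, str]) -> bool:
    """Return True only if the series carries every tag key/value in the scope."""
    return all(series_tags.get(key) == value for key, value in scope.items())


# Hypothetical device-level uptime series: it has no "interface" tag.
uptime_series = {"snmp_host": "router-1", "snmp_device": "10.0.0.1"}

# Scope with a specific interface selected, as $interface would add.
with_interface = {"snmp_host": "router-1", "interface": "eth0"}
# Scope after dropping $interface from the query.
without_interface = {"snmp_host": "router-1"}

hidden = not matches_scope(uptime_series, with_interface)
shown = matches_scope(uptime_series, without_interface)
```

With an interface selected the device-level series no longer matches, so the toplist has nothing to show. Without the interface filter the same series matches again.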

Comment on lines 1322 to +1348
@@ -32,27 +1345,51 @@
],
"yaxis": {
"scale": "linear",
- "label": "",
+ "label": "errors/s",
Copilot AI Apr 22, 2026

This timeseries labels the y-axis as errors/s, but the query uses the raw counter snmp.ifInErrors without converting it to a per-second rate (e.g., via .as_count() / rate). As written, the chart will plot the cumulative counter value rather than errors per second. Update the query (and similar ones in this section) so the units match the data.

Comment on lines 1377 to 1404
@@ -63,31 +1400,49 @@
],
"yaxis": {
"scale": "linear",
- "label": "",
+ "label": "errors/s",
"include_zero": true,
Copilot AI Apr 22, 2026

This timeseries labels the y-axis as errors/s, but the query uses the raw counter snmp.ifOutErrors without converting it to a per-second rate (e.g., via .as_count() / rate). Update the query so the data matches the displayed units.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b74eff0a18

}
]
"type": "note",
"content": "**Scope here:** Agent health, SNMP check cost, autodiscovery, traps, and NetFlow **pipeline** metrics (receive, flush, index). **Per-interface** utilization, errors, packet mix, and detailed state history live on [Interface Performance](/dash/integration/Interface%20Performance?live=true&tpl_var_snmp_host={{snmp_host.value}}&tpl_var_snmp_device={{snmp_device.value}}&tpl_var_interface={{interface.value}})—set **snmp_device** + **interface** there to match the device you care about here (`$SNMP_Device` is the same device tag as **snmp_device** on that dashboard).\n\n**How the panels fit together:** Agent availability, SNMP reachability and poll duration, trap flow, then NetFlow received/flushed/indexed; version-specific sequence/drop widgets sit with the pipeline section. **At the bottom:** SNMP integration memory and ICMP latency/unreachable help separate agent or path issues from the device’s SNMP behavior.\n\n**Filters:** `$Agent_Hostname`, `$SNMP_Device`, `$Netflow_Exporter`, `$Custom_Tag`. [NDM docs](https://docs.datadoghq.com/network_monitoring/devices/).",

[P2] Use existing template vars in Interface Performance deep link

The note’s Interface Performance URL interpolates {{snmp_host.value}}, {{snmp_device.value}}, and {{interface.value}}, but this dashboard only defines Agent_Hostname, SNMP_Device, Netflow_Exporter, and Custom_Tag template variables. In a static note widget those undefined placeholders won’t resolve, so the cross-dashboard jump loses the intended device/interface context and opens an unscoped target view.
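
A check for this class of problem could extract every `{{name.value}}` placeholder from a note and compare it with the dashboard's declared template variables. The sketch below is illustrative only: `undefined_placeholders` is a hypothetical helper, the `note` string is a shortened version of the link quoted above, and the comparison is assumed to be case-sensitive.

```python
import re
from typing import List

# Captures the variable name in placeholders such as {{snmp_host.value}}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+?)\.value\s*\}\}")


def undefined_placeholders(note_content: str, template_variables: List[dict]) -> List[str]:
    """Return placeholder names used in the note but not declared on the dashboard."""
    defined = {variable["name"] for variable in template_variables}
    used = PLACEHOLDER.findall(note_content)
    return sorted({name for name in used if name not in defined})


# Shortened copy of the deep link from the note under review.
note = (
    "[Interface Performance](/dash/integration/Interface%20Performance?live=true"
    "&tpl_var_snmp_host={{snmp_host.value}}"
    "&tpl_var_snmp_device={{snmp_device.value}}"
    "&tpl_var_interface={{interface.value}})"
)

# Template variables this dashboard defines, per the review comment.
template_variables = [
    {"name": "Agent_Hostname"},
    {"name": "SNMP_Device"},
    {"name": "Netflow_Exporter"},
    {"name": "Custom_Tag"},
]
missing = undefined_placeholders(note, template_variables)
```

All three placeholders come back as missing. Note that `snmp_device` is reported even though `SNMP_Device` exists, because the sketch assumes names must match exactly.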

"id": 9400000000000012,
"definition": {
"type": "note",
"content": "**At a glance:** Aggregate volume and top lists (ASN, protocol, port/service) frame what is on the wire. **Talkers and conversations** tables (filter by exporter, device, interface) drill into sources and destinations. **SNMP side:** [Interface Performance](/dash/integration/Interface%20Performance?live=true&tpl_var_snmp_host={{snmp_host.value}}&tpl_var_snmp_device={{snmp_device.value}}&tpl_var_interface={{interface.value}}) with the same template variables sits next to this view logically; a single chart that overlays SNMP utilization and flow volume usually means a [Screenboard](https://docs.datadoghq.com/dashboards/guide/screenboard/) or notebook. **Trends:** widen to 24h/7d and use compare-to-prior where available. **Handy for tickets:** copy IP, port, and volume from tables—unexpected external traffic often routes to security; sustained growth to capacity planning; a dominant flow on a saturated interface to QoS or change control.",

[P2] Replace undefined SNMP placeholders in note link

This link also uses {{snmp_host.value}}, {{snmp_device.value}}, and {{interface.value}}, but the NetFlow dashboard’s template variables are device.name, device.ip, exporter.ip, source.ip, destination.ip, and interface/device facets with dotted names. Because these placeholders are undefined here, the Interface Performance deep link cannot carry over a selected scope, so users are sent to a broad view instead of the correlated interface context.

@cit-pr-commenter-54b7da

Monitor Template Quality Assessment

52 monitors analyzed across 9 integrations.

  • 30 monitors have missing sections
  • Most common missing sections: WHY, IMPACT, HOW_TO_TROUBLESHOOT
Monitors with missing sections
| Integration | Monitor | Missing Sections | Suggested Links |
| --- | --- | --- | --- |
| kubernetes | [Kubernetes] Monitor Kubernetes Deployments Replica Pods | IMPACT, WHY, HOW_TO_TROUBLESHOOT, RELATED_LINKS | Logs |

@zoedt zoedt added this pull request to the merge queue Apr 30, 2026
Merged via the queue into master with commit 4f8c1f4 Apr 30, 2026
70 of 76 checks passed
@zoedt zoedt deleted the kylerkang-new-dashboards branch April 30, 2026 18:55
@dd-octo-sts dd-octo-sts Bot added this to the 7.79.0 milestone Apr 30, 2026