Conversation
Updating our 'Network Device Monitoring' OOTB dashboard (hover over 'Dashboards' in the global bar)
Update the 'Interface performance' OOTB dashboard found in the 'Dashboards' dropdown
This PR does not modify any files shipped with the agent. To help streamline the release process, please consider adding the
Validation Report: All 20 validations passed.
Pull request overview
This PR refreshes the SNMP dashboard JSON assets, expanding the existing Interface Performance and BGP/OSPF Overview dashboards with new sections, widgets, and updated descriptions to provide more actionable visibility into interface behavior and routing health.
Changes:
- Reworked Interface Performance into multiple grouped sections (overview, interface tables, throughput/utilization, errors/discards, status, packet mix) and added a new NetFlow section.
- Reorganized BGP/OSPF Overview into clearer “device context”, “BGP session health”, and “OSPF IGP health” groups with additional snapshot and trend widgets.
- Updated template variable blocks and added `pause_auto_refresh`.
Reviewed changes
Copilot reviewed 1 out of 5 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| snmp/assets/dashboards/interface_performance.json | Major dashboard restructure and new widgets for interface KPIs; adds a NetFlow section. |
| snmp/assets/dashboards/bgp_ospf_overview.json | Reorganized routing overview into grouped sections; adds uptime/context widgets and BGP/OSPF trend views. |
```diff
@@ -100,31 +1455,49 @@
   ],
   "yaxis": {
     "scale": "linear",
-    "label": "",
+    "label": "discards/s",
     "include_zero": true,
```
This timeseries labels the y-axis as discards/s, but the query uses the raw counter snmp.ifInDiscards without converting it to a per-second rate (e.g., via .as_count() / rate). Update the query so the data matches the displayed units.
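A minimal sketch of the suggested fix, assuming the widget's tag scope and grouping mirror the neighboring interface widgets in this diff (the exact scope in the file may differ):

```json
{
  "data_source": "metrics",
  "name": "query1",
  "query": "avg:snmp.ifInDiscards{$snmp_host,$snmp_device,$interface}.as_rate() by {snmp_device,snmp_host,interface}"
}
```

With `.as_rate()`, the plotted values are per-second deltas of the counter, which matches the discards/s axis label.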
```diff
@@ -137,85 +1510,272 @@
   ],
   "yaxis": {
     "scale": "linear",
-    "label": "",
+    "label": "discards/s",
     "include_zero": true,
```
This timeseries labels the y-axis as discards/s, but the query uses the raw counter snmp.ifOutDiscards without converting it to a per-second rate (e.g., via .as_count() / rate). Update the query so the data matches the displayed units.
| "query": "avg:snmp.ifHCInUcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}" | ||
| }, | ||
| { | ||
| "data_source": "metrics", | ||
| "name": "query2", | ||
| "query": "avg:snmp.ifHCInMulticastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}" | ||
| }, | ||
| { | ||
| "data_source": "metrics", | ||
| "name": "query3", | ||
| "query": "avg:snmp.ifHCInBroadcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}" |
The packet counters (snmp.ifHC*Pkts) are monotonic counters, but these charts label the unit as pkts/s without converting the counters to rates (e.g., via .as_count() / rate). As written, this will graph cumulative packet counts rather than packets per second. Convert these queries to rates or update the labels to reflect counts.
| "query": "avg:snmp.ifHCInUcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}" | |
| }, | |
| { | |
| "data_source": "metrics", | |
| "name": "query2", | |
| "query": "avg:snmp.ifHCInMulticastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}" | |
| }, | |
| { | |
| "data_source": "metrics", | |
| "name": "query3", | |
| "query": "avg:snmp.ifHCInBroadcastPkts{$snmp_host,$snmp_device,$interface} by {snmp_device,snmp_host,interface}" | |
| "query": "avg:snmp.ifHCInUcastPkts{$snmp_host,$snmp_device,$interface}.as_rate() by {snmp_device,snmp_host,interface}" | |
| }, | |
| { | |
| "data_source": "metrics", | |
| "name": "query2", | |
| "query": "avg:snmp.ifHCInMulticastPkts{$snmp_host,$snmp_device,$interface}.as_rate() by {snmp_device,snmp_host,interface}" | |
| }, | |
| { | |
| "data_source": "metrics", | |
| "name": "query3", | |
| "query": "avg:snmp.ifHCInBroadcastPkts{$snmp_host,$snmp_device,$interface}.as_rate() by {snmp_device,snmp_host,interface}" |
```json
{
  "data_source": "metrics",
  "name": "query1",
  "query": "avg:snmp.ifBandwidthInUsage.rate{*} by {snmp_host,interface,interface_alias}",
```
Several toplists in this section hard-code {*} instead of using the dashboard template variables, so they ignore $snmp_host, $snmp_device, and $interface even though the dashboard description says those filters scope the board. Update these toplist queries to use the template variables so filtering behaves consistently.
| "query": "avg:snmp.ifBandwidthInUsage.rate{*} by {snmp_host,interface,interface_alias}", | |
| "query": "avg:snmp.ifBandwidthInUsage.rate{snmp_host:$snmp_host,snmp_device:$snmp_device,interface:$interface} by {snmp_host,interface,interface_alias}", |
| "id": 8222666460674068, | ||
| "definition": { | ||
| "title": "Top 10 Interface Throughput (outbound bps)", | ||
| "title_size": "16", | ||
| "title_align": "left", | ||
| "type": "toplist", | ||
| "requests": [ | ||
| { | ||
| "response_format": "scalar", | ||
| "queries": [ | ||
| { | ||
| "data_source": "metrics", | ||
| "name": "query1", | ||
| "query": "avg:snmp.ifHCOutOctets.rate{*} by {snmp_host,interface,interface_alias}", | ||
| "aggregator": "avg" | ||
| } | ||
| ], | ||
| "formulas": [ | ||
| { | ||
| "alias": "bps", | ||
| "formula": "query1 * 8" | ||
| } | ||
| ], | ||
| "sort": { | ||
| "count": 10, | ||
| "order_by": [ | ||
| { | ||
| "type": "formula", | ||
| "index": 0, | ||
| "order": "desc" | ||
| } | ||
| ] | ||
| } | ||
| } | ||
| ], | ||
| "custom_links": [ | ||
| { | ||
| "link": "/screen/integration/Interface%20Performance?live=true&tpl_var_snmp_host={{snmp_host.value}}&tpl_var_interface={{interface.value}}", | ||
| "label": "Interface Performance" | ||
| } | ||
| ], | ||
| "style": {} | ||
| }, | ||
| "layout": { | ||
| "x": 0, | ||
| "y": 7, | ||
| "width": 6, | ||
| "height": 3 | ||
| } | ||
| }, | ||
| { | ||
| "id": 1210504657783681, | ||
| "definition": { | ||
| "title": "Top 10 Interface Throughput (inbound bps)", | ||
| "title_size": "16", | ||
| "title_align": "left", | ||
| "type": "toplist", |
This throughput section contains duplicate toplist widgets with the exact same title, query, and link (the second pair at y=7 duplicates the pair at y=4). Remove the duplicates or change the queries/titles so each widget adds distinct information.
| "id": 3763492863117901, | ||
| "definition": { | ||
| "type": "note", | ||
| "content": "**Tip:** To fully understand the current state of an interface, we compare two different metrics, `snmp.AdminStatus` and `snmp.OperStatus`.\n\n**Admin Status** is defined by the owner of the device. This is the state a port is supposed to be at configuration time.\n\nMeanwhile, the actual current state of the interface is known as the **Operational Status**.\n\nWe can infer a healthy interface when the interface is operationally the same as its administratively defined states. ", |
This note references snmp.AdminStatus and snmp.OperStatus, but the dashboard queries use snmp.ifAdminStatus and snmp.ifOperStatus. Update the note to match the actual metric names to avoid confusing users.
| "content": "**Tip:** To fully understand the current state of an interface, we compare two different metrics, `snmp.AdminStatus` and `snmp.OperStatus`.\n\n**Admin Status** is defined by the owner of the device. This is the state a port is supposed to be at configuration time.\n\nMeanwhile, the actual current state of the interface is known as the **Operational Status**.\n\nWe can infer a healthy interface when the interface is operationally the same as its administratively defined states. ", | |
| "content": "**Tip:** To fully understand the current state of an interface, we compare two different metrics, `snmp.ifAdminStatus` and `snmp.ifOperStatus`.\n\n**Admin Status** is defined by the owner of the device. This is the state a port is supposed to be at configuration time.\n\nMeanwhile, the actual current state of the interface is known as the **Operational Status**.\n\nWe can infer a healthy interface when the interface is operationally the same as its administratively defined states. ", |
| "query": "max:snmp.sysUpTimeInstance{$profile,$snmp_host,$snmp_device} by {snmp_host}", | ||
| "data_source": "metrics", | ||
| "name": "query1", | ||
| "aggregator": "min" |
The uptime distribution query uses max:snmp.sysUpTimeInstance... but the query-level aggregator is set to min, which will skew results toward the smallest uptime in the time window rather than the current/max value per device. Set the query aggregator to match the intent (typically last or max) so the distribution reflects current device uptimes.
| "aggregator": "min" | |
| "aggregator": "max" |
| "response_format": "scalar", | ||
| "queries": [ | ||
| { | ||
| "query": "avg:snmp.sysUpTimeInstance{$profile,$snmp_host,$snmp_device,$interface} by {snmp_host}", |
snmp.sysUpTimeInstance is a device-level metric and typically does not carry an interface tag. Including $interface in this query will cause the widget to go empty whenever the user filters to a specific interface. Remove $interface from the metric scope for this uptime toplist.
| "query": "avg:snmp.sysUpTimeInstance{$profile,$snmp_host,$snmp_device,$interface} by {snmp_host}", | |
| "query": "avg:snmp.sysUpTimeInstance{$profile,$snmp_host,$snmp_device} by {snmp_host}", |
```diff
@@ -32,27 +1345,51 @@
   ],
   "yaxis": {
     "scale": "linear",
-    "label": "",
+    "label": "errors/s",
```
This timeseries labels the y-axis as errors/s, but the query uses the raw counter snmp.ifInErrors without converting it to a per-second rate (e.g., via .as_count() / rate). As written, the chart will plot the cumulative counter value rather than errors per second. Update the query (and similar ones in this section) so the units match the data.
```diff
@@ -63,31 +1400,49 @@
   ],
   "yaxis": {
     "scale": "linear",
-    "label": "",
+    "label": "errors/s",
     "include_zero": true,
```
This timeseries labels the y-axis as errors/s, but the query uses the raw counter snmp.ifOutErrors without converting it to a per-second rate (e.g., via .as_count() / rate). Update the query so the data matches the displayed units.
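A hedged sketch of a rate-converted error query for this section (the tag scope and grouping are assumed to mirror the neighboring interface widgets in this diff; only the `.as_rate()` call is the substantive change):

```json
{
  "data_source": "metrics",
  "name": "query1",
  "query": "avg:snmp.ifOutErrors{$snmp_host,$snmp_device,$interface}.as_rate() by {snmp_device,snmp_host,interface}"
}
```

The same transformation applies to the `snmp.ifInErrors` widget above so that both charts plot errors per second rather than cumulative counts.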
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b74eff0a18
| } | ||
| ] | ||
| "type": "note", | ||
| "content": "**Scope here:** Agent health, SNMP check cost, autodiscovery, traps, and NetFlow **pipeline** metrics (receive, flush, index). **Per-interface** utilization, errors, packet mix, and detailed state history live on [Interface Performance](/dash/integration/Interface%20Performance?live=true&tpl_var_snmp_host={{snmp_host.value}}&tpl_var_snmp_device={{snmp_device.value}}&tpl_var_interface={{interface.value}})—set **snmp_device** + **interface** there to match the device you care about here (`$SNMP_Device` is the same device tag as **snmp_device** on that dashboard).\n\n**How the panels fit together:** Agent availability, SNMP reachability and poll duration, trap flow, then NetFlow received/flushed/indexed; version-specific sequence/drop widgets sit with the pipeline section. **At the bottom:** SNMP integration memory and ICMP latency/unreachable help separate agent or path issues from the device’s SNMP behavior.\n\n**Filters:** `$Agent_Hostname`, `$SNMP_Device`, `$Netflow_Exporter`, `$Custom_Tag`. [NDM docs](https://docs.datadoghq.com/network_monitoring/devices/).", |
Use existing template vars in Interface Performance deep link
The note’s Interface Performance URL interpolates {{snmp_host.value}}, {{snmp_device.value}}, and {{interface.value}}, but this dashboard only defines Agent_Hostname, SNMP_Device, Netflow_Exporter, and Custom_Tag template variables. In a static note widget those undefined placeholders won’t resolve, so the cross-dashboard jump loses the intended device/interface context and opens an unscoped target view.
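A minimal sketch of the deep link rewritten against a template variable this dashboard actually defines (assuming the `{{SNMP_Device.value}}` placeholder resolves to the variable named in the note, and that `tpl_var_snmp_device` is the correct parameter on the target dashboard, as in the existing link):

```json
"content": "... live on [Interface Performance](/dash/integration/Interface%20Performance?live=true&tpl_var_snmp_device={{SNMP_Device.value}}) — set **interface** there to drill into a specific port ..."
```

Placeholders for variables the dashboard does not define (`snmp_host`, `interface`) are dropped rather than guessed, so the link at least carries the device scope over.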
| "id": 9400000000000012, | ||
| "definition": { | ||
| "type": "note", | ||
| "content": "**At a glance:** Aggregate volume and top lists (ASN, protocol, port/service) frame what is on the wire. **Talkers and conversations** tables (filter by exporter, device, interface) drill into sources and destinations. **SNMP side:** [Interface Performance](/dash/integration/Interface%20Performance?live=true&tpl_var_snmp_host={{snmp_host.value}}&tpl_var_snmp_device={{snmp_device.value}}&tpl_var_interface={{interface.value}}) with the same template variables sits next to this view logically; a single chart that overlays SNMP utilization and flow volume usually means a [Screenboard](https://docs.datadoghq.com/dashboards/guide/screenboard/) or notebook. **Trends:** widen to 24h/7d and use compare-to-prior where available. **Handy for tickets:** copy IP, port, and volume from tables—unexpected external traffic often routes to security; sustained growth to capacity planning; a dominant flow on a saturated interface to QoS or change control.", |
Replace undefined SNMP placeholders in note link
This link also uses {{snmp_host.value}}, {{snmp_device.value}}, and {{interface.value}}, but the NetFlow dashboard’s template variables are device.name, device.ip, exporter.ip, source.ip, destination.ip, and interface/device facets with dotted names. Because these placeholders are undefined here, the Interface Performance deep link cannot carry over a selected scope, so users are sent to a broad view instead of the correlated interface context.
Monitor Template Quality Assessment: 52 monitors analyzed across 9 integrations.
Monitors with missing sections
What does this PR do?
Motivation
Review checklist (to be filled by reviewers)
- Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
- Add the `backport/<branch-name>` label to the PR and it will automatically open a backport PR once this one is merged.