Networking dashboard and discovery tool refactor #1020

Merged
merged 98 commits into from
Dec 18, 2022
ba2af1d
wip
ludoo Nov 15, 2022
3e3d440
wip
ludoo Nov 15, 2022
279c529
wip
ludoo Nov 15, 2022
473cd99
wip
ludoo Nov 15, 2022
9b8c60b
wip
ludoo Nov 16, 2022
bd77b52
discovery
ludoo Nov 16, 2022
16c26c0
single discovery
ludoo Nov 16, 2022
f2ae291
page token
ludoo Nov 16, 2022
1a33a1b
batch requests
ludoo Nov 17, 2022
4d5e936
remove plugin name
ludoo Nov 17, 2022
5f9c406
streamline
ludoo Nov 17, 2022
d1c9a8a
streamline
ludoo Nov 17, 2022
10a4b1f
dynamic routes
ludoo Nov 17, 2022
1274780
dynamic routes
ludoo Nov 17, 2022
9c48e44
forwarding rules and addresses
ludoo Nov 17, 2022
5321f78
batch requests
ludoo Nov 17, 2022
ad90fa9
Merge remote-tracking branch 'origin/master' into ludo/net-dash
ludoo Nov 18, 2022
fb2efab
metrics
ludoo Nov 19, 2022
5cd622b
notes
ludoo Nov 19, 2022
dea4a13
notes
ludoo Nov 20, 2022
23c48ba
streamline
ludoo Nov 20, 2022
5e82dd4
fixes, dump
ludoo Nov 20, 2022
260fbe5
streamline
ludoo Nov 21, 2022
320ffec
remove globals
ludoo Nov 21, 2022
f596946
wip metrics
ludoo Nov 21, 2022
bf340e2
subnet time series
ludoo Nov 21, 2022
37f74de
networks per project plugin
ludoo Nov 21, 2022
2d9422a
firewall rules timeseries
ludoo Nov 21, 2022
b5ffc13
use names in metric labels
ludoo Nov 21, 2022
42168fd
firewall policies timeseries
ludoo Nov 21, 2022
0a6483c
Merge remote-tracking branch 'origin/master' into ludo/net-dash
ludoo Nov 21, 2022
d177ecc
wip
ludoo Nov 22, 2022
e6d8571
instances per network timeseries
ludoo Nov 22, 2022
d0cc32b
routes timeseries
ludoo Nov 22, 2022
061a16e
custom quota
ludoo Nov 22, 2022
b38b1f7
Merge remote-tracking branch 'origin/master' into ludo/net-dash
ludoo Nov 22, 2022
1530080
simpler quota, network peering timeseries
ludoo Nov 23, 2022
25577aa
peering timeseries
ludoo Nov 23, 2022
2a0abf7
Merge remote-tracking branch 'origin/master' into ludo/net-dash
ludoo Nov 23, 2022
3ca8d97
timeseries names
ludoo Nov 24, 2022
678fdf0
wip descriptors
ludoo Nov 24, 2022
7e49eda
metric descriptors
ludoo Nov 24, 2022
b9bfb47
fixes
ludoo Nov 24, 2022
bf82c85
wip
ludoo Nov 24, 2022
1ba7ac8
Use partial for all cf init functions
juliocc Nov 24, 2022
a9b4806
Add requirements.txt
juliocc Nov 24, 2022
045112d
fix org key mismatch
ludoo Nov 24, 2022
cfe21fb
Merge pull request #1008 from GoogleCloudPlatform/jccb/net-dash-partial
juliocc Nov 24, 2022
a04bbf5
Fix folder short cli name
juliocc Nov 24, 2022
897e873
Fix instance_networks when iterable is empty
juliocc Nov 24, 2022
e5e1c66
more readability and fixing some strings
juliocc Nov 24, 2022
e64c3bb
replace() -> removeprefix and remove unneeded quoting
juliocc Nov 24, 2022
7b70ee5
setdefault in init()s
juliocc Nov 24, 2022
1854216
Fix next hop type
juliocc Nov 24, 2022
0d2ce94
Remove unneeded fstring
juliocc Nov 24, 2022
f7a4b65
create descriptors
ludoo Nov 25, 2022
eaaa55a
create descriptors log
ludoo Nov 25, 2022
9b5102f
rename descriptor requests function
ludoo Nov 25, 2022
0511700
non-working metrics implementation (duplicate timeseries batched)
ludoo Nov 25, 2022
0b57905
Merge remote-tracking branch 'origin/master' into ludo/net-dash
ludoo Nov 25, 2022
8528d31
Merge remote-tracking branch 'origin/master' into ludo/net-dash
ludoo Nov 25, 2022
19cb604
timeseries
ludoo Nov 25, 2022
6b85b87
fixes
ludoo Nov 25, 2022
0dadd4a
write timseries
ludoo Nov 26, 2022
4628bd4
fix timeseries plugins
ludoo Nov 26, 2022
651a893
start documenting code
ludoo Nov 26, 2022
67665b8
docstrings and comments
ludoo Nov 26, 2022
97c4c31
docstrings comments and small fixes
ludoo Nov 27, 2022
5d7f585
Merge remote-tracking branch 'origin/master' into ludo/net-dash
ludoo Nov 27, 2022
a721ab9
rename cf to src
ludoo Nov 28, 2022
a64f754
discover nodes instead of just projects
ludoo Nov 28, 2022
a0b085a
discovery node can be a folder or org
ludoo Nov 28, 2022
1dcbdb0
cf entrypoint and fixes
ludoo Nov 28, 2022
4cfd810
cf deployment
ludoo Nov 28, 2022
6d6dabc
remove old paths
ludoo Nov 28, 2022
b31a22e
cloud function deploy readme
ludoo Nov 28, 2022
08d77f0
diagrams
ludoo Nov 28, 2022
ffa6e6f
resource ids in example
ludoo Nov 28, 2022
314cf62
discovery tool readme
ludoo Nov 28, 2022
2fd3e94
top-level README
ludoo Nov 28, 2022
582fa72
Merge branch 'master' into ludo/net-dash
ludoo Nov 28, 2022
6f2a68d
Merge branch 'master' into ludo/net-dash
ludoo Nov 28, 2022
d856057
Some documentation fixes
juliocc Nov 29, 2022
139adbe
Merge branch 'master' into ludo/net-dash
juliocc Nov 29, 2022
50b583c
Add secondary ranges
juliocc Nov 29, 2022
a8e9ba5
Merge branch 'master' into ludo/net-dash
ludoo Nov 29, 2022
19191c1
Update README.md
aurelienlegrand Dec 6, 2022
83eb3c1
Merge remote-tracking branch 'origin/master' into ludo/net-dash
ludoo Dec 6, 2022
7d0376c
add legend to scope diagram
ludoo Dec 6, 2022
cc16e34
improve description of discovery configuration variable
ludoo Dec 6, 2022
b7fae9e
add comment in example for custom quotas file
ludoo Dec 6, 2022
290671f
Merge remote-tracking branch 'origin/master' into ludo/net-dash
ludoo Dec 9, 2022
395d42e
rename op_project to monitoring_project
ludoo Dec 9, 2022
417cbf5
dashboard metric rename wip
ludoo Dec 12, 2022
d25f0b3
Update discover-cai-compute.py
ludoo Dec 17, 2022
3a25b14
Merge branch 'master' into ludo/net-dash
ludoo Dec 17, 2022
f146568
Merge branch 'ludo/net-dash' of github.com:GoogleCloudPlatform/cloud-…
ludoo Dec 18, 2022
d85b0bb
deploy sample dashboard
ludoo Dec 18, 2022
178 changes: 82 additions & 96 deletions blueprints/cloud-operations/network-dashboard/README.md
@@ -1,103 +1,89 @@
# Networking Dashboard
# Network Dashboard and Discovery Tool

This repository provides an end-to-end solution to gather some GCP Networking quotas and limits (that cannot be seen in the GCP console today) and display them in a dashboard.
The goal is to allow for better visibility of these limits, facilitating capacity planning and avoiding hitting these limits.
This repository provides an end-to-end solution to gather some GCP networking quotas, limits, and their corresponding usage, and store them in Cloud Operations timeseries which can be displayed in one or more dashboards or wired to alerts.

Here is an example of a dashboard you can get with this solution:
The goal is to allow for better visibility of these limits, some of which cannot be seen in the GCP console today, facilitating capacity planning and being notified when actual usage approaches them.

The tool tracks several distinct usage types across a variety of resources: projects, policies, networks, subnetworks, peering groups, etc. For each usage type three distinct metrics are created tracking usage count, limit and utilization ratio.
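
As a rough illustration of the three metrics per usage type, the sketch below derives them from a usage count and a limit. The function name `usage_metrics` and the dictionary layout are hypothetical and not taken from the tool's code. The sketch also assumes that the `_available` suffix used in the metric names carries the limit.

```python
def usage_metrics(name: str, used: int, limit: int) -> dict[str, float]:
    """Return the three values tracked for one usage type (hypothetical helper)."""
    # Guard against a zero limit so the ratio never divides by zero.
    ratio = used / limit if limit else 0.0
    return {
        f"{name}_used": used,            # usage count
        f"{name}_available": limit,      # limit (assumed meaning of the suffix)
        f"{name}_used_ratio": ratio,     # utilization ratio
    }


metrics = usage_metrics("network/instances", used=1500, limit=15000)
print(metrics)
```

An alert on the `_used_ratio` value can then use one threshold for every network, whatever its individual limit.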

The screenshot below is an example of a simple dashboard provided with this blueprint, showing utilization for a specific metric (number of instances per VPC) for multiple VPCs and projects:

<img src="metric.png" width="640px">

Here you see utilization (usage compared to the limit) for a specific metric (number of instances per VPC) for multiple VPCs and projects.

Three metric descriptors are created for each monitored resource: usage, limit and utilization. You can follow each of these and create alerting policies if a threshold is reached.

## Usage

Clone this repository, then go through the following steps to create resources:
- Create a terraform.tfvars file with the following content:
```tfvars
organization_id = "<YOUR-ORG-ID>"
billing_account = "<YOUR-BILLING-ACCOUNT>"
monitoring_project_id = "<YOUR-MONITORING-PROJECT>"
# Monitoring project where the dashboard will be created and the solution deployed, a project named "mon-network-dashboard" will be created if left blank
monitored_projects_list = ["project-1", "project2"]
# Projects to be monitored by the solution
monitored_folders_list = ["folder_id"]
# Folders to be monitored by the solution
prefix = "<YOUR-PREFIX>"
# Monitoring project name prefix, monitoring project name is <YOUR-PREFIX>-network-dashboard, ignored if monitoring_project_id variable is provided
cf_version = "V1"
# Set to "V2" to use the V2 Cloud Functions environment
```
- `terraform init`
- `terraform apply`

Note: Org level viewing permission is required for some metrics such as firewall policies.

Once the resources are deployed, open the dashboards page at https://console.cloud.google.com/monitoring/dashboards?project=<YOUR-MONITORING-PROJECT>. A dashboard called "quotas-utilization" should have been created.

The Cloud Function runs every 10 minutes by default, so you should start getting some data points after a few minutes.
You can change this frequency by modifying the "schedule_cron" variable in variables.tf.
You can use the metric explorer to view the data points for the different custom metrics created: https://console.cloud.google.com/monitoring/metrics-explorer?project=<YOUR-MONITORING-PROJECT>.

Note that some charts in the dashboard align values over 1h so you might need to wait 1h to see charts on the dashboard views.

Once done testing, you can clean up resources by running `terraform destroy`.

## Supported limits and quotas
The Cloud Function currently tracks usage, limit and utilization of:
- active VPC peerings per VPC
- VPC peerings per VPC
- instances per VPC
- instances per VPC peering group
- Subnet IP ranges per VPC peering group
- internal forwarding rules for internal L4 load balancers per VPC
- internal forwarding rules for internal L7 load balancers per VPC
- internal forwarding rules for internal L4 load balancers per VPC peering group
- internal forwarding rules for internal L7 load balancers per VPC peering group
- Dynamic routes per VPC
- Dynamic routes per VPC peering group
- Static routes per project (VPC drill down is available for usage)
- Static routes per VPC peering group
- IP utilization per subnet (% of IP addresses used in a subnet)
- VPC firewall rules per project (VPC drill down is available for usage)
- Tuples per Firewall Policy

It writes these values to custom metrics in Cloud Monitoring and creates a dashboard to visualize the current utilization of these metrics in Cloud Monitoring.

Note that metrics are created in the cloud-function/metrics.yaml file. You can also edit default limits for a specific network in that file. See the example for `vpc_peering_per_network`.
One other example is the IP utilization information per subnet, allowing you to monitor the percentage of used IP addresses in your GCP subnets.
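
A minimal sketch of how a per-subnet utilization percentage could be computed is shown below. The function name `subnet_utilization` is hypothetical, and subtracting the four addresses that Google Cloud reserves in each primary IPv4 subnet range is an assumption of this sketch, not a description of how the tool counts addresses.

```python
import ipaddress

# Google Cloud reserves four addresses in every primary IPv4 subnet range.
# Whether the tool excludes them from the total is an assumption of this sketch.
RESERVED_PER_SUBNET = 4


def subnet_utilization(cidr: str, used_addresses: int) -> float:
    """Return the fraction of usable addresses currently in use."""
    network = ipaddress.ip_network(cidr)
    usable = network.num_addresses - RESERVED_PER_SUBNET
    return used_addresses / usable if usable > 0 else 0.0


print(f"{subnet_utilization('10.0.0.0/24', 63):.2%}")
```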

More complex scenarios are possible by leveraging and combining the 50 different timeseries created by this tool, and connecting them to Cloud Operations dashboards and alerts.

Refer to the [Cloud Function deployment instructions](./deploy-cloud-function/) for a high level overview and an end-to-end deployment example, and to the [discovery tool documentation](./src/) to try it as a standalone program or to package it in alternative ways.

## Metrics created

- `firewall_policy/tuples_available`
- `firewall_policy/tuples_used`
- `firewall_policy/tuples_used_ratio`
- `network/firewall_rules_used`
- `network/forwarding_rules_l4_available`
- `network/forwarding_rules_l4_used`
- `network/forwarding_rules_l4_used_ratio`
- `network/forwarding_rules_l7_available`
- `network/forwarding_rules_l7_used`
- `network/forwarding_rules_l7_used_ratio`
- `network/instances_available`
- `network/instances_used`
- `network/instances_used_ratio`
- `network/peerings_active_available`
- `network/peerings_active_used`
- `network/peerings_active_used_ratio`
- `network/peerings_total_available`
- `network/peerings_total_used`
- `network/peerings_total_used_ratio`
- `network/routes_dynamic_available`
- `network/routes_dynamic_used`
- `network/routes_dynamic_used_ratio`
- `network/routes_static_used`
- `network/subnets_available`
- `network/subnets_used`
- `network/subnets_used_ratio`
- `peering_group/forwarding_rules_l4_available`
- `peering_group/forwarding_rules_l4_used`
- `peering_group/forwarding_rules_l4_used_ratio`
- `peering_group/forwarding_rules_l7_available`
- `peering_group/forwarding_rules_l7_used`
- `peering_group/forwarding_rules_l7_used_ratio`
- `peering_group/instances_available`
- `peering_group/instances_used`
- `peering_group/instances_used_ratio`
- `peering_group/routes_dynamic_available`
- `peering_group/routes_dynamic_used`
- `peering_group/routes_dynamic_used_ratio`
- `peering_group/routes_static_available`
- `peering_group/routes_static_used`
- `peering_group/routes_static_used_ratio`
- `project/firewall_rules_available`
- `project/firewall_rules_used`
- `project/firewall_rules_used_ratio`
- `project/routes_static_available`
- `project/routes_static_used`
- `project/routes_static_used_ratio`
- `subnetwork/addresses_available`
- `subnetwork/addresses_used`
- `subnetwork/addresses_used_ratio`
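
The names above follow a `<resource>/<usage type>_<kind>` pattern, and a few usage types only publish a `_used` series. The sketch below finds those from the names alone; it is illustrative only and uses hypothetical helpers (`split_metric`, `usage_only`) that are not part of the tool. It requires Python 3.9 or later for `removesuffix`.

```python
from collections import defaultdict

# A few names copied from the list above.
METRICS = [
    "network/firewall_rules_used",
    "network/instances_available",
    "network/instances_used",
    "network/instances_used_ratio",
    "project/firewall_rules_available",
    "project/firewall_rules_used",
    "project/firewall_rules_used_ratio",
]
SUFFIXES = ("_used_ratio", "_available", "_used")


def split_metric(name):
    """Split a metric name into (usage type, kind)."""
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name.removesuffix(suffix), suffix.removeprefix("_")
    raise ValueError(f"unexpected metric name: {name}")


def usage_only(names):
    """Return usage types that have a usage series but no limit or ratio."""
    kinds = defaultdict(set)
    for name in names:
        usage_type, kind = split_metric(name)
        kinds[usage_type].add(kind)
    return sorted(t for t, k in kinds.items() if k == {"used"})


print(usage_only(METRICS))
```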

## Assumptions and limitations
- The CF assumes that all VPCs in peering groups are within the same organization, except for PSA peerings
- The CF will only fetch subnet utilization data from the PSA peerings (not the VMs, ILB or routes usage)
- The CF assumes global routing is ON; this impacts dynamic routes usage calculation
- The CF assumes custom routes importing/exporting is ON; this impacts static and dynamic routes usage calculation
- The CF assumes all networks in peering groups have the same global routing and custom routes sharing configuration

## Next steps and ideas
In a future release, we could support:
- Google managed VPCs that are peered with PSA (such as Cloud SQL or Memorystore)
- Dynamic routes calculation for VPCs/PPGs with "global routing" set to OFF
- Static routes calculation for projects/PPGs with "custom routes importing/exporting" set to OFF
- Calculations for cross Organization peering groups
- Support different scopes (reduced and fine-grained)

If you are interested in this and/or would like to contribute, please contact legranda@google.com.
<!-- BEGIN TFDOC -->

## Variables

| name | description | type | required | default |
|---|---|:---:|:---:|:---:|
| [billing_account](variables.tf#L17) | The ID of the billing account to associate this project with. | <code></code> | ✓ | |
| [monitored_projects_list](variables.tf#L36) | ID of the projects to be monitored (where limits and quotas data will be pulled). | <code>list&#40;string&#41;</code> | ✓ | |
| [organization_id](variables.tf#L46) | The organization id for the associated services. | <code></code> | ✓ | |
| [prefix](variables.tf#L50) | Prefix used for resource names. | <code>string</code> | ✓ | |
| [cf_version](variables.tf#L21) | Cloud Function version 2nd Gen or 1st Gen. Possible options: 'V1' or 'V2'. Use CFv2 if your Cloud Function times out after 9 minutes. By default it is using CFv1. | <code></code> | | <code>V1</code> |
| [monitored_folders_list](variables.tf#L30) | ID of the folders to be monitored (where limits and quotas data will be pulled). | <code>list&#40;string&#41;</code> | | <code>&#91;&#93;</code> |
| [monitoring_project_id](variables.tf#L41) | Monitoring project where the dashboard will be created and the solution deployed; a project will be created if set to empty string. | <code></code> | | |
| [project_monitoring_services](variables.tf#L59) | Service APIs enabled in the monitoring project if it will be created. | <code></code> | | <code title="&#91;&#10; &#34;artifactregistry.googleapis.com&#34;,&#10; &#34;cloudasset.googleapis.com&#34;,&#10; &#34;cloudbilling.googleapis.com&#34;,&#10; &#34;cloudbuild.googleapis.com&#34;,&#10; &#34;cloudfunctions.googleapis.com&#34;,&#10; &#34;cloudresourcemanager.googleapis.com&#34;,&#10; &#34;cloudscheduler.googleapis.com&#34;,&#10; &#34;compute.googleapis.com&#34;,&#10; &#34;iam.googleapis.com&#34;,&#10; &#34;iamcredentials.googleapis.com&#34;,&#10; &#34;logging.googleapis.com&#34;,&#10; &#34;monitoring.googleapis.com&#34;,&#10; &#34;pubsub.googleapis.com&#34;,&#10; &#34;run.googleapis.com&#34;,&#10; &#34;servicenetworking.googleapis.com&#34;,&#10; &#34;serviceusage.googleapis.com&#34;,&#10; &#34;storage-component.googleapis.com&#34;&#10;&#93;">&#91;&#8230;&#93;</code> |
| [region](variables.tf#L81) | Region used to deploy the cloud functions and scheduler. | <code></code> | | <code>europe-west1</code> |
| [schedule_cron](variables.tf#L86) | Cron format schedule to run the Cloud Function. Default is every 10 minutes. | <code></code> | | <code>&#42;&#47;10 &#42; &#42; &#42; &#42;</code> |

<!-- END TFDOC -->

- The tool assumes all VPCs in peering groups are within the same organization, except for PSA peerings.
- The tool will only fetch subnet utilization data from the PSA peerings (not the VMs, ILB or routes usage).
- The tool assumes global routing is ON; this impacts dynamic routes usage calculation.
- The tool assumes custom routes importing/exporting is ON; this impacts static and dynamic routes usage calculation.
- The tool assumes all networks in peering groups have the same global routing and custom routes sharing configuration.
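
To make the peering group assumptions more concrete, here is a simplified sketch of a per-peering-group usage count. It treats a peering group as a network plus its directly peered networks, and it assumes usage for every peer is known, which is why all VPCs need to be in the same organization. The function name and data structures are hypothetical and do not reflect the tool's implementation.

```python
def peering_group_usage(network, peers, usage):
    """Sum usage over a network and its directly peered networks."""
    members = {network} | set(peers.get(network, []))
    # Networks with no recorded usage count as zero.
    return sum(usage.get(member, 0) for member in members)


# Hypothetical hub and spoke topology with per-network instance counts.
peers = {"hub": ["spoke-a", "spoke-b"], "spoke-a": ["hub"], "spoke-b": ["hub"]}
instances = {"hub": 10, "spoke-a": 250, "spoke-b": 40}

print(peering_group_usage("hub", peers, instances))
print(peering_group_usage("spoke-a", peers, instances))
```

In this example the hub's peering group includes both spokes, while each spoke's group only includes the hub, so the same network can contribute to several peering group totals.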

## TODO

These are some of our ideas for additional features:

- support PSA-peered Google VPCs (Cloud SQL, Memorystore, etc.)
- dynamic routes for VPCs/peering groups with "global routing" turned off
- static routes calculation for projects/peering groups with custom routes import/export turned off
- cross-organization peering groups

If you are interested in this and/or would like to contribute, please open an issue in this repository or send us a PR.