Support workload identity #315

Open
matthias-froomle opened this issue Feb 17, 2020 · 39 comments

Comments

@matthias-froomle

When deploying the Stackdriver custom metrics adapter inside a GKE cluster with Workload Identity enabled, the adapter (v0.10.2) fails to start.

Steps taken:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

Adapter deployment log:

"unable to construct client config: unable to construct lister client config to initialize provider: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory" 
source: "adapter.go:55" 
@pdecat

pdecat commented Mar 10, 2020

Hi,

I too had trouble with CMSA and GKE Workload Identity on GKE v1.15.7-gke.23.

The error messages at startup differed though:

I0310 16:01:28.490640       1 serving.go:312] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0310 16:01:30.116329       1 secure_serving.go:116] Serving securely on [::]:443
E0310 16:01:32.987569       1 provider.go:241] Failed request to stackdriver api: Get https://monitoring.googleapis.com/v3/projects/myproject-preprod/metricDescriptors?alt=json&filter=resource.labels.project_id+%3D+%22myproject-preprod%22+AND+resource.labels.cluster_name+%3D+%22myproject-preprod-europe-west1-gke1%22+AND+resource.labels.location+%3D+%22europe-west1%22+AND+resource.type+%3D+one_of%28%22k8s_pod%22%2C%22k8s_node%22%29&prettyPrint=false: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fmonitoring.read: net/http: timeout awaiting response headers

The log would then be spammed by:

E0310 16:01:35.432807       1 provider.go:241] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

I've managed to make it work with GKE Workload Identity by adding hostNetwork: true to the deployment's spec.

It works because of the following documented limitation: Pods on the host network bypass the GKE metadata server and fall back to the node's service account, so Workload Identity simply doesn't apply to them:

Workload Identity can't be used with Pods running in the host network.

See https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#limitations
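
For anyone wanting to apply the same workaround, a minimal sketch, assuming the adapter Deployment keeps the name and namespace used by the upstream manifest (custom-metrics-stackdriver-adapter in custom-metrics):

# Put the adapter Pod on the host network so token requests go to the GCE
# metadata server instead of the GKE (Workload Identity) metadata server.
kubectl -n custom-metrics patch deployment custom-metrics-stackdriver-adapter \
  --type merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'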

@serathius
Contributor

/cc @kawych

@davidxia

davidxia commented Apr 3, 2020

@pdecat, in your case, running CMSA on a node with Workload Identity (WI) enabled probably broke because the Google Service Account (GSA) that CMSA was running as via WI didn't have the roles/monitoring.viewer role on the GCP projects that hold the metrics CMSA is trying to query.

When you changed CMSA to run in host network mode, it probably worked because CMSA then used the GKE node's default GSA, which is different from the WI-related GSA. That node default GSA probably has the roles/monitoring.viewer role, or at least the permissions needed to query the metrics.
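
For reference, granting that role would look roughly like the following; the project ID and GSA email are placeholders:

# Grant the Workload Identity GSA read access to Cloud Monitoring (Stackdriver) metrics.
gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.viewer"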

@davidxia

davidxia commented Apr 4, 2020

Workload Identity with CMSA 0.10.2 seems to work for me. I'm seeing these logs, which are the same as the ones from when it wasn't using WI.

I0404 01:50:34.583939       1 serving.go:312] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0404 01:50:37.188933       1 secure_serving.go:116] Serving securely on [::]:443
E0404 01:50:41.273390       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0404 01:50:41.273463       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed

@pdecat

pdecat commented Apr 4, 2020

@davidxia do you have Horizontal Pod Autoscalers based on external Stackdriver metrics?

@davidxia

davidxia commented Apr 4, 2020 via email

@varungbt

varungbt commented Apr 9, 2020

Seeing the same issue.

@JacobSMoller

Seems to work for me as well using workload identity.

Getting a never-ending stream of these logs, though:

1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0430 15:12:25.531660 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}

@davidxia

Same, would be great if these could be silenced or moved to a lower logging level.

@cromaniuc

@davidxia, @JacobSMoller What role are you using for the Google Service Account associated with the Workload Identity that CMSA is running as? I'm using roles/monitoring.admin and it fails with 403. When I use hostNetwork: true, it works.
Thanks!

@davidxia

davidxia commented May 9, 2020

roles/monitoring.viewer

@JacobSMoller

roles/monitoring.admin

@LouisTrezzini

We're facing the same issue.
Workload Identity works fine for us in every other deployment, so we're guessing there's something specific going on here.

@aubm

aubm commented Jul 15, 2020

I managed to make it work with WI using the following approach:

gcloud iam service-accounts create custom-metrics-sd-adapter --project "$GCP_PROJECT_ID"

gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.editor"

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  "iam.gke.io/gcp-service-account=custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace custom-metrics

@viniciusccarvalho

I have the same issue. @aubm's steps do not work either. With WI enabled, the adapter fails with errors like:

2020-08-03 17:26:32.524 EDT Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

The annotated service account does point to the right GSA, but this simply will not work as expected.

@aubm

aubm commented Aug 3, 2020

@viniciusccarvalho did you try kubectl delete pods --all -n custom-metrics after running my previous commands?

@viniciusccarvalho

Yes, I deleted everything, even the namespace, and it still won't work. Running 1.16.11-gke.5 on my cluster. Still no luck.

@apurvc

apurvc commented Sep 18, 2020

I managed to make it work with WI using the following approach:

gcloud iam service-accounts create custom-metrics-sd-adapter --project "$GCP_PROJECT_ID"

gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.editor"

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  "iam.gke.io/gcp-service-account=custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace custom-metrics

This worked for me. I am using 1.15.12-gke.2.

@stevenarvar

stevenarvar commented Feb 9, 2021

Running 1.17.14-gke.1600 and ran into this issue. I followed the steps described in the README:
https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/custom-metrics-stackdriver-adapter/README.md

The instructions are the same as @aubm's. The first time I configured it, CMSA worked with WI. Then I needed to replace the GSA, so I re-annotated the K8s service account. My new GSA has the same roles as the original working GSA, and I don't see any misconfiguration, but the new config is outputting these errors:

E0209 19:19:11.124054       1 provider.go:270] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden
E0209 19:19:11.220395       1 provider.go:270] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden
E0209 19:19:11.316925       1 provider.go:270] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

@stevenarvar

After waiting a while, I do see my CMSA and WI working fine. Not sure why GSA/WI/CMSA does not work right away; maybe it took some time for the IAM changes to propagate on GCP's side.

@jharshman
Contributor

There appear to be a few issues here.

One is that the GKE metadata service does not support all of the endpoints that the GCE metadata service does. So if you run this workload without host networking in a cluster with Workload Identity enabled, it fails immediately and gets thrown into a crash loop with the following error:
Failed to get GCE config: error while getting instance (node) name: metadata: GCE metadata "instance/name" not defined

This makes sense, since there is no instance/name endpoint on the GKE metadata service (a quick check from inside a pod is sketched after the endpoint list below).
Supported endpoints are:

attributes/
hostname
id
service-accounts/
zone
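
A quick way to see the difference from inside a pod (just a sketch; metadata.google.internal resolves to the metadata server in both setups, and the Metadata-Flavor header is required):

# With the GCE metadata server (host network, or Workload Identity disabled) this
# returns the node name; behind the GKE Workload Identity metadata server the
# endpoint is not defined, matching the adapter error above.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/name"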

The second issue is that when not using Workload Identity directly, and instead setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to point at a service account JSON key mounted into the pod, authentication starts to fail:

W0210 00:09:53.360350       1 stackdriver.go:91] Error while fetching metric descriptors for kube-state-metrics: Get https://monitoring.googleapis.com/v3/projects/REDACTED/metricDescriptors?alt=json&filter=metric.type+%3D+starts_with%28%22custom.googleapis.com%2Fkube-state-metrics%22%29&prettyPrint=false: oauth2: cannot fetch token: 400 Bad Request
Response: {"error":"invalid_scope","error_description":"Invalid OAuth scope or ID token audience provided."}

The prometheus-to-sd daemonset that comes with GKE as part of the core tooling appears to use host networking to bypass the GKE metadata server and use the GCE metadata server instead.

If there are known issues with this project when Workload Identity is enabled or host networking is turned off, perhaps some documentation would help.

@AnthonMS

AnthonMS commented Jul 7, 2022

gcloud iam service-accounts create custom-metrics-sd-adapter --project "$GCP_PROJECT_ID"

gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.editor"

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  "iam.gke.io/gcp-service-account=custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace custom-metrics

I have finally gotten past the stage where it says permission denied, by cleaning up all the services and other resources that the other adapter YAML config creates. I ran these commands to create the service account, bind the correct roles, and create the services and deployment.

It does, however, look like I am getting a new error; I will post the logs below.
Ideally I would like to scale based on fpm metrics like in the example here, but I got the permission denied errors in the logs after applying that adapter.yaml, and I was also getting a NaN error in the prometheus-to-sd container. But that's for another day.

I thought that if I applied this adapter to my Google project/GKE cluster, I would be able to scale based on request_per_second for the pods, something like the first example for this custom metrics adapter (a rough sketch of the kind of HPA I mean is at the end of this comment).

E0707 08:13:38.145404       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="c85dd773-5209-4482-b579-453fb45609cc"
E0707 08:13:38.145544       1 timeout.go:135] post-timeout activity - time-elapsed: 4.57µs, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:13:38.145943       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:13:38.146055       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="1f67ea8f-7156-4051-a592-eb2e1d6f784a"
E0707 08:13:38.146170       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="22365708-84de-47d9-83c4-f3a65e1ec541"
E0707 08:13:38.146237       1 timeout.go:135] post-timeout activity - time-elapsed: 3.409µs, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:13:38.147755       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:13:38.151491       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:13:38.157277       1 timeout.go:135] post-timeout activity - time-elapsed: 105.423502ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:13:38.158473       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:13:38.159764       1 timeout.go:135] post-timeout activity - time-elapsed: 13.63787ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:07.944406       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.944539       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:07.944614       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="5b892969-1535-4b4c-9bcf-098b41715a3f"
E0707 08:14:07.950119       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:07.950496       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.951429       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.951550       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="81a6e59c-c9fd-4928-9970-3139892733df"
E0707 08:14:07.951884       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.955355       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="51b15586-d512-46ab-8043-96f15ba83db0"
E0707 08:14:07.955692       1 timeout.go:135] post-timeout activity - time-elapsed: 10.996794ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:07.956311       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="b3f23a0f-7a84-4fe0-93ed-11c74277106e"
E0707 08:14:07.956580       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:07.957598       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:08.046439       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:08.052644       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.053480       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.054590       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.057312       1 timeout.go:135] post-timeout activity - time-elapsed: 101.863564ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:08.144599       1 timeout.go:135] post-timeout activity - time-elapsed: 192.981022ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:08.145462       1 timeout.go:135] post-timeout activity - time-elapsed: 189.04884ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:08.146271       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:08.146469       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="33c48cb9-9f1d-4ed3-99f7-bb6b56f21a33"
E0707 08:14:08.150336       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:08.152789       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.154572       1 timeout.go:135] post-timeout activity - time-elapsed: 7.988662ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:37.747766       1 writers.go:111] apiserver was unable to close cleanly the response writer: http2: stream closed
E0707 08:14:37.852388       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="4ea88186-56b4-47a4-b73d-6a680fa8ce2f"
E0707 08:14:37.853284       1 timeout.go:135] post-timeout activity - time-elapsed: 13.497µs, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:38.046613       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:38.047320       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:38.047827       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="d12d9ec7-92c8-459d-ab49-f9953abda738"
E0707 08:14:38.054379       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:38.054542       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="9d28f636-0bb0-495a-9526-c81ebb4d0c98"
E0707 08:14:38.145771       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:38.147100       1 timeout.go:135] post-timeout activity - time-elapsed: 98.88436ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:38.152042       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:38.155348       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:38.156855       1 timeout.go:135] post-timeout activity - time-elapsed: 102.231415ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:38.158736       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="3bf9eb85-5fb1-4d18-8ec8-988ce1eca65b"
E0707 08:14:38.158967       1 writers.go:111] apiserver was unable to close cleanly the response writer: http: Handler timeout
E0707 08:14:38.161342       1 timeout.go:135] post-timeout activity - time-elapsed: 2.336539ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
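
For reference, the kind of Pods-type HPA I'm aiming for would look roughly like the sketch below; the Deployment name, the metric name (request_per_second), and the target value are placeholders rather than values taken from the adapter's docs.

# Hypothetical HPA driven by a per-pod custom metric served by the adapter.
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: request_per_second
      target:
        type: AverageValue
        averageValue: "100"
EOF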

@msathe-tech

I tried all the known tricks in the book: created the namespace ahead of time, created the needed K8s SA (custom-metrics-stackdriver-adapter), annotated the K8s SA with the GCP SA (which already has the Monitoring Editor role), and then used https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

I still get the following error:
E0822 20:17:27.104881       1 provider.go:271] Failed request to stackdriver api: Get "https://monitoring.googleapis.com/v3/projects/<project>/metricDescriptors?alt=json&filter=resource.labels.project_id+%3D+%22prj-gke-mt-spike%22+AND+resource.labels.cluster_name+%3D+%22<cluster>%22+AND+resource.labels.location+%3D+%22<c cluster-zone>%22+AND+resource.type+%3D+one_of%28%22k8s_pod%22%2C%22k8s_node%22%2C%22k8s_container%22%29&prettyPrint=false": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: The caller does not have permission
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
        https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

@msathe-tech

My GKE version is 1.24.3-gke.200

Without this feature, the HPA simply doesn't work with custom metrics. You need to rely on downloading an SA key, which goes against security best practices.

@eahrend

eahrend commented Sep 27, 2022

Hey, I'm getting this on 1.22.12-gke.300 as well, trying to use WIF.

I'm also getting this error:

E0927 18:27:46.752161       1 timeout.go:135] post-timeout activity - time-elapsed: 22.306873ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0927 18:27:46.755417       1 timeout.go:135] post-timeout activity - time-elapsed: 25.534497ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>

However, when I run kubectl proxy --port=8080 and go to http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta2 and http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta1, the response is not nil and arrives almost instantaneously.

@red8888

red8888 commented Oct 31, 2022

Can you confirm what service account is used by default by the metrics adapter? Is it the node's GSA?

I assign my own GSA to the node pools:

resource "google_container_node_pool" "mypool" {
  name       = "sdfsdfsdf"
  cluster    = google_container_cluster.cluster.name
  .....
  node_config {
    machine_type = "e2-highmem-4"
    // Assign a service account
    service_account = google_service_account.node-pool.email

Can you confirm that this GSA will need access? I don't need the adapter deployment to use Workload Identity itself; I'm OK with giving the node pool account this access.

I granted my node pool GSA the Monitoring Viewer role, but I'm still seeing this error in the deployment:
Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

@pdecat, in your case, running CMSA on a node with Workload Identity (WI) enabled broke it probably because the Google Service Account (GSA) associated with WI that CMSA was running as didn't have the roles/monitoring.viewer role on the relevant GCP projects that hold the metrics CMSA is trying to query.

When you changed CMSA to run on host network mode, it probably worked because now CMSA is using the GKE node's default GSA which is different than the WI-related GSA. This GKE node default GSA probably has roles/monitoring.viewer role or at least those permissions to query the metrics.

@red8888

red8888 commented Nov 1, 2022

I tried all the known tricks in the book: created the namespace ahead of time, created the needed K8s SA (custom-metrics-stackdriver-adapter), annotated the K8s SA with the GCP SA (which already has the Monitoring Editor role), and then used https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

I still get the following error:
E0822 20:17:27.104881       1 provider.go:271] Failed request to stackdriver api: Get "https://monitoring.googleapis.com/v3/projects/<project>/metricDescriptors?alt=json&filter=resource.labels.project_id+%3D+%22prj-gke-mt-spike%22+AND+resource.labels.cluster_name+%3D+%22<cluster>%22+AND+resource.labels.location+%3D+%22<c cluster-zone>%22+AND+resource.type+%3D+one_of%28%22k8s_pod%22%2C%22k8s_node%22%2C%22k8s_container%22%29&prettyPrint=false": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: The caller does not have permission
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
        https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

Sounds like you're missing this piece:

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

@iamhritik

I'm also getting the same error in the custom-metrics-stackdriver-adapter pod.
Any new updates?

E0214 16:25:51.059563       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="de9aad6d-13d9-4f88-86fc-ef73c7eb568f"
E0214 16:25:51.059907       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="9ac55f68-c490-4953-b32e-d775adfb056d"
E0214 16:25:51.060088       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="5f791624-f01f-45c7-8d8f-76e43bf56a9c"

However, when I run kubectl proxy --port=8080 and go to http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta2 and http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta1, the response is not nil and arrives almost instantaneously.

@nguyen-viet-hung

I have the same error as above:

E0221 03:19:29.283344       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0221 03:19:29.283406       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="927bf416-71ba-4845-880a-8ba80c0b044d"
E0221 03:19:29.283529       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="898079ad-6b51-45e9-b3ec-b723312ba8ff"

When I do kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq the response is not nil and happens almost instantaneously.

Does anybody have solutions?
My cluster version is: v1.24.9-gke.2000

@perrornet

I have the same error as above:

E0221 03:19:29.283344       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0221 03:19:29.283406       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="927bf416-71ba-4845-880a-8ba80c0b044d"
E0221 03:19:29.283529       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="898079ad-6b51-45e9-b3ec-b723312ba8ff"

When I do kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq the response is not nil and happens almost instantaneously.

Does anybody have solutions? My cluster version is: v1.24.9-gke.2000

I have also encountered this situation.

E0227 08:50:42.495351       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="b63c2246-27c1-4ece-a779-e552782f1dcd"
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E0227 08:50:42.523800       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="67d82f5f-5360-404b-b688-9639d9a89a88"
E0227 08:50:42.523807       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0227 08:50:42.526275       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0227 08:50:42.527535       1 timeout.go:135] post-timeout activity - time-elapsed: 32.12511ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

My cluster version is: v1.25.5-gke.2000

@Wazbat

Wazbat commented Jul 7, 2023

Shame this adapter isn't included by default. Struggling to resolve this on my end.

I get permission errors with Workload Identity, and even with hostNetwork: true I start to get these errors:

post-timeout activity - time-elapsed: 12.553843ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

@maxpain

maxpain commented Aug 30, 2023

Any updates?

@MikSFG

MikSFG commented Sep 7, 2023

I have the same error as above:

E0221 03:19:29.283344       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0221 03:19:29.283406       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="927bf416-71ba-4845-880a-8ba80c0b044d"
E0221 03:19:29.283529       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="898079ad-6b51-45e9-b3ec-b723312ba8ff"

When I do kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq the response is not nil and happens almost instantaneously.
Does anybody have solutions? My cluster version is: v1.24.9-gke.2000

I have also encountered this situation.

E0227 08:50:42.495351       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="b63c2246-27c1-4ece-a779-e552782f1dcd"
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E0227 08:50:42.523800       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="67d82f5f-5360-404b-b688-9639d9a89a88"
E0227 08:50:42.523807       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0227 08:50:42.526275       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0227 08:50:42.527535       1 timeout.go:135] post-timeout activity - time-elapsed: 32.12511ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

My cluster version is: v1.25.5-gke.2000

Happens to me as well; kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq works fine and gives normal output.

@IanKnighton

We're trying to set up the custom metrics adapter so we can scale pods off of messages in a Pub/Sub queue.

Currently we can't get past this error in the logs of the custom-metrics-stackdriver-adapter pod:

E0912 19:48:53.983876       1 provider.go:320] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden
E0912 19:48:53.984071       1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E0912 19:48:53.984139       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0912 19:48:53.985476       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0912 19:48:53.986869       1 timeout.go:135] post-timeout activity - time-elapsed: 9m10.432424127s, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>

I've tried every combination of service account I can think of, and as far as I can tell all of our nodes and pods have at least the monitoring.metricDescriptors.list permission.

It's kind of wild to me how inconsistent this appears to be. It's also kind of annoying that I just followed the documentation and still ended up here.
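
For reference, the HPA we're trying to drive looks roughly like the sketch below; the subscription ID, Deployment name, and target value are placeholders, and the metric/selector names follow the public GKE external-metrics examples rather than anything verified in this thread.

# Hypothetical HPA scaling a worker Deployment on Pub/Sub backlog via the
# external metrics API exposed by the adapter.
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pubsub-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pubsub-worker
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: pubsub.googleapis.com|subscription|num_undelivered_messages
        selector:
          matchLabels:
            resource.labels.subscription_id: my-subscription
      target:
        type: AverageValue
        averageValue: "30"
EOF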

@markhc

markhc commented Oct 11, 2023

Same issue on our end. Followed every step in the GCP README and the alternative methods here as well; still getting 403 errors.

I can get the 403 errors resolved by using hostNetwork: true for the deployment, but then other issues pop up and the pod enters a crash loop every couple of seconds with these GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil> errors.

GKE Cluster Version 1.24.15-gke.1700


EDIT: Finally managed to get it working. The crash loop I mentioned above was the pod getting OOMKilled, so I had to increase the resources for the custom-metrics Deployment.

Final working steps:

  1. Follow @aubm's steps and install the adapter.yaml resources, NOT adapter_new_resource_model.yaml. The new resource model adapter entered a different crash loop for me that I was not able to solve.
  2. Modify the adapter Deployment, adding hostNetwork: true to the spec. This solves the "403 Forbidden" errors.
  3. Increase the request/limit resources. See what works for you, but my adapter frequently reaches ~350MiB, and it only had a 200MiB limit originally. (Steps 2 and 3 are sketched as commands below.)
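
A rough sketch of steps 2 and 3 as commands, assuming the Deployment name and namespace from the upstream manifest; the memory values are just what happened to work for me:

# Step 2: run the adapter Pod on the host network (this is what resolved the 403s for me).
kubectl -n custom-metrics patch deployment custom-metrics-stackdriver-adapter \
  --type merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'

# Step 3: raise the memory request/limit; without -c this applies to every
# container in the Deployment, and 256Mi/512Mi are only example values.
kubectl -n custom-metrics set resources deployment custom-metrics-stackdriver-adapter \
  --requests=memory=256Mi --limits=memory=512Mi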

@ltieman

ltieman commented Oct 19, 2023

I was still getting 403 exceptions using @aubm's instructions until I added this:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-metrics-permissions
rules:
- apiGroups: [""]
  resources: ["subjectaccessreviews"]
  verbs: ["create"]

---

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-metrics-binding
subjects:
- kind: ServiceAccount
  name: custom-metrics-stackdriver-adapter
  namespace: custom-metrics  # Replace with the appropriate namespace
roleRef:
  kind: ClusterRole
  name: custom-metrics-permissions
  apiGroup: rbac.authorization.k8s.io

Once the IAM permissions propagated, it started working.

@PaulRudin

Just to be clear, is monitoring.editor necessary? You'd have thought monitoring.viewer would be enough, and that's what the README says.

Although it's somewhat academic in my case as I'm getting:

E1219 16:02:17.832184       1 provider.go:320] Failed request to stackdriver api: Get "https://monitoring.go
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
        https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

either way.

@PaulRudin

PaulRudin commented Dec 20, 2023

After applying the suggestion here, the permissions issue is fixed, but I still get loads of this sort of thing in the logs:

1220 07:24:32.869016       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="9c852465-5ffa-4d77-a441-f801886e29e3"
1220 07:24:32.869108       1 writers.go:111] apiserver was unable to close cleanly the response writer: http2: stream closed
E1220 07:24:32.871209       1 timeout.go:135] post-timeout activity - time-elapsed: 102.731383ms, GET "/api/custom.metrics.k8s.io/v1beta1" result:  <nil>                         
E1220 07:24:32.873362       1 timeout.go:135] post-timeout activity - time-elapsed: 103.207074ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result:  <nil>
E1220 07:24:32.874493       1 timeout.go:135] post-timeout activity - time-elapsed: 109.99919ms, GET "/api/custom.metrics.k8s.io/v1beta2" result: <nil>
E1220 07:24:32.876604       1 timeout.go:135] post-timeout activity - time-elapsed: 7.572859ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

@PaulRudin

... and either I was mistaken, or the issue has resurfaced. I still see this sort of thing:

apiserver received an error that is not an metav1.Status: &googleapi.Error{Code:403, Message:"Permission monitoring.timeSeries.list denied (or the resource may not exist).", Details:[]interface {}(nil), Body:"{\n  \"error\": {\n    \"code\": 403,\n    \"message\": \"Permission monitoring.timeSeries.list denied (or the resource may not exist).\",\n    \"errors\": [\n      {\n        \"message\": \"Permission monitoring.timeSeries.list denied (or the resource may not exist).\",\n        \"domain\": \"global\",\n        \"reason\": \"forbidden\"\n      }\n    ],\n    \"status\": \"PERMISSION_DENIED\"\n  }\n}\n", Header:http.Header(nil), Errors:[]googleapi.ErrorItem{googleapi.ErrorItem{Reason:"forbidden", Message:"Permission monitoring.timeSeries.list denied (or the resource may not exist)."}}}: googleapi: Error 403: Permission monitoring.timeSeries.list denied (or the resource may not exist)., forbidden

even though the relevant service account has the monitoring.viewer role.
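
For what it's worth, one way to double-check both sides of the Workload Identity setup from the CLI; the project ID and GSA email below are placeholders:

# Roles the GSA holds on the project (should include roles/monitoring.viewer).
gcloud projects get-iam-policy "$GCP_PROJECT_ID" \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

# Who may impersonate the GSA: the member
# serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]
# should appear with roles/iam.workloadIdentityUser.
gcloud iam service-accounts get-iam-policy \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"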
