Support workload identity #315

Open
matthias-froomle opened this issue Feb 17, 2020 · 39 comments

Comments

@matthias-froomle

When deploying the Stackdriver custom metrics adapter inside a GKE cluster with Workload Identity enabled, the adapter (v0.10.2) fails to start.

Steps taken:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

Adapter deployment log:

"unable to construct client config: unable to construct lister client config to initialize provider: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory" 
source: "adapter.go:55" 
@pdecat

pdecat commented Mar 10, 2020

Hi,

I too had trouble with CMSA and GKE Workload Identity on GKE v1.15.7-gke.23.

The error messages at startup differed though:

I0310 16:01:28.490640       1 serving.go:312] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0310 16:01:30.116329       1 secure_serving.go:116] Serving securely on [::]:443
E0310 16:01:32.987569       1 provider.go:241] Failed request to stackdriver api: Get https://monitoring.googleapis.com/v3/projects/myproject-preprod/metricDescriptors?alt=json&filter=resource.labels.project_id+%3D+%22myproject-preprod%22+AND+resource.labels.cluster_name+%3D+%22myproject-preprod-europe-west1-gke1%22+AND+resource.labels.location+%3D+%22europe-west1%22+AND+resource.type+%3D+one_of%28%22k8s_pod%22%2C%22k8s_node%22%29&prettyPrint=false: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fmonitoring.read: net/http: timeout awaiting response headers

The log would then be spammed by:

E0310 16:01:35.432807       1 provider.go:241] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

I've managed to make it work with GKE Workload Identity by adding hostNetwork: true to the deployment's spec.

It works because of the following documented limitation: Pods on the host network bypass the GKE metadata server and fall back to the node's service account, so Workload Identity simply doesn't apply to them:

Workload Identity can't be used with Pods running in the host network.

See https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#limitations
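
For anyone wanting to apply the same workaround, a minimal sketch, assuming the adapter Deployment keeps the name and namespace used by the upstream manifest (custom-metrics-stackdriver-adapter in custom-metrics):

# Put the adapter Pod on the host network so token requests go to the GCE
# metadata server instead of the GKE (Workload Identity) metadata server.
kubectl -n custom-metrics patch deployment custom-metrics-stackdriver-adapter \
  --type merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'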

@serathius
Contributor

/cc @kawych

@davidxia

davidxia commented Apr 3, 2020

@pdecat, in your case, running CMSA on a node with Workload Identity (WI) enabled probably broke because the Google Service Account (GSA) that CMSA was running as via WI didn't have the roles/monitoring.viewer role on the GCP projects that hold the metrics CMSA is trying to query.

When you changed CMSA to run in host network mode, it probably worked because CMSA then used the GKE node's default GSA, which is different from the WI-related GSA. That node default GSA probably has the roles/monitoring.viewer role, or at least the permissions needed to query the metrics.
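
For reference, granting that role would look roughly like the following; the project ID and GSA email are placeholders:

# Grant the Workload Identity GSA read access to Cloud Monitoring (Stackdriver) metrics.
gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.viewer"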

@davidxia

davidxia commented Apr 4, 2020

Workload Identity with CMSA 0.10.2 seems to work for me. I'm seeing these logs, which are the same as the ones from when it wasn't using WI.

I0404 01:50:34.583939       1 serving.go:312] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0404 01:50:37.188933       1 secure_serving.go:116] Serving securely on [::]:443
E0404 01:50:41.273390       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0404 01:50:41.273463       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed

@pdecat

pdecat commented Apr 4, 2020

@davidxia do you have Horizontal Pod Autoscalers based on external Stackdriver metrics?

@davidxia

davidxia commented Apr 4, 2020 via email

@varungbt

varungbt commented Apr 9, 2020

Seeing the same issue.

@JacobSMoller

Seems to work for me as well using workload identity.

Getting a never-ending stream of these logs, though:

1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0430 15:12:25.531660 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}

@davidxia

Same, would be great if these could be silenced or moved to a lower logging level.

@cromaniuc

@davidxia, @JacobSMoller What role are you using for the Google Service Account associated with the Workload Identity that CMSA is running as? I'm using roles/monitoring.admin and it fails with 403. When I use hostNetwork: true, it works.
Thanks!

@davidxia

davidxia commented May 9, 2020

roles/monitoring.viewer

@JacobSMoller

roles/monitoring.admin

@LouisTrezzini

We're facing the same issue.
Workload Identity works fine for us in every other deployment, so we're guessing there's something specific going on here.

@aubm

aubm commented Jul 15, 2020

I managed to make it work with WI using the following approach:

gcloud iam service-accounts create custom-metrics-sd-adapter --project "$GCP_PROJECT_ID"

gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.editor"

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  "iam.gke.io/gcp-service-account=custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace custom-metrics

@viniciusccarvalho

I have the same issue. @aubm's steps do not work either. With WI enabled, the adapter fails with errors like:

2020-08-03 17:26:32.524 EDT Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

The annotated service account does point to the right GSA, but this simply will not work as expected.

@aubm

aubm commented Aug 3, 2020

@viniciusccarvalho did you try kubectl delete pods --all -n custom-metrics after running my previous commands?

@viniciusccarvalho

Yes, I deleted everything, even the namespace, and it still won't work. Running 1.16.11-gke.5 on my cluster. Still no luck.

@apurvc

apurvc commented Sep 18, 2020

I managed to make it work with WI using the following approach:

gcloud iam service-accounts create custom-metrics-sd-adapter --project "$GCP_PROJECT_ID"

gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.editor"

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  "iam.gke.io/gcp-service-account=custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace custom-metrics

This worked for me. I am using 1.15.12-gke.2.

@stevenarvar

stevenarvar commented Feb 9, 2021

Running 1.17.14-gke.1600 and ran into this issue. I followed the steps described in the README:
https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/custom-metrics-stackdriver-adapter/README.md

The instructions are the same as @aubm's. The first time I configured it, CMSA worked with WI. Then I needed to replace the GSA, so I re-annotated the K8s service account. My new GSA has the same roles as the original working GSA, and I don't see any misconfiguration, but the new config is outputting these errors:

E0209 19:19:11.124054       1 provider.go:270] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden
E0209 19:19:11.220395       1 provider.go:270] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden
E0209 19:19:11.316925       1 provider.go:270] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

@stevenarvar

After waiting a while, I do see my CMSA and WI working fine. Not sure why GSA/WI/CMSA does not work right away; maybe it took some time for the IAM changes to propagate on GCP's side.

@jharshman
Contributor

There appear to be a few issues here.

One is that the GKE metadata service does not support all of the endpoints that the GCE metadata service does. So if you run this workload without host networking in a cluster with Workload Identity enabled, it fails immediately and gets thrown into a crash loop with the following error:
Failed to get GCE config: error while getting instance (node) name: metadata: GCE metadata "instance/name" not defined

This makes sense, since there is no instance/name endpoint on the GKE metadata service (a quick check from inside a pod is sketched after the endpoint list below).
Supported endpoints are:

attributes/
hostname
id
service-accounts/
zone
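
A quick way to see the difference from inside a pod (just a sketch; metadata.google.internal resolves to the metadata server in both setups, and the Metadata-Flavor header is required):

# With the GCE metadata server (host network, or Workload Identity disabled) this
# returns the node name; behind the GKE Workload Identity metadata server the
# endpoint is not defined, matching the adapter error above.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/name"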

The second issue is that when not using Workload Identity directly, and instead setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to point at a service account JSON key mounted into the pod, authentication starts to fail:

W0210 00:09:53.360350       1 stackdriver.go:91] Error while fetching metric descriptors for kube-state-metrics: Get https://monitoring.googleapis.com/v3/projects/REDACTED/metricDescriptors?alt=json&filter=metric.type+%3D+starts_with%28%22custom.googleapis.com%2Fkube-state-metrics%22%29&prettyPrint=false: oauth2: cannot fetch token: 400 Bad Request
Response: {"error":"invalid_scope","error_description":"Invalid OAuth scope or ID token audience provided."}

The prometheus-to-sd daemonset that comes with GKE as part of the core tooling appears to use host networking to bypass the GKE metadata server and use the GCE metadata server instead.

If there are known issues with this project when Workload Identity is enabled or host networking is turned off, perhaps some documentation would help.

@AnthonMS

AnthonMS commented Jul 7, 2022

gcloud iam service-accounts create custom-metrics-sd-adapter --project "$GCP_PROJECT_ID"

gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.editor"

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  "iam.gke.io/gcp-service-account=custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace custom-metrics

I have finally gotten past the stage where it says permission denied, by cleaning up all the services and other resources that the other adapter YAML config creates. I ran these commands to create the service account, bind the correct roles, and create the services and deployment.

It does, however, look like I am getting a new error; I will post the logs below.
Ideally I would like to scale based on fpm metrics like in the example here, but I got the permission denied errors in the logs after applying that adapter.yaml, and I was also getting a NaN error in the prometheus-to-sd container. But that's for another day.

I thought that if I applied this adapter to my Google project/GKE cluster, I would be able to scale based on request_per_second for the pods, something like the first example for this custom metrics adapter (a rough sketch of the kind of HPA I mean is at the end of this comment).

E0707 08:13:38.145404       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="c85dd773-5209-4482-b579-453fb45609cc"
E0707 08:13:38.145544       1 timeout.go:135] post-timeout activity - time-elapsed: 4.57µs, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:13:38.145943       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:13:38.146055       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="1f67ea8f-7156-4051-a592-eb2e1d6f784a"
E0707 08:13:38.146170       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="22365708-84de-47d9-83c4-f3a65e1ec541"
E0707 08:13:38.146237       1 timeout.go:135] post-timeout activity - time-elapsed: 3.409µs, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:13:38.147755       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:13:38.151491       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:13:38.157277       1 timeout.go:135] post-timeout activity - time-elapsed: 105.423502ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:13:38.158473       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:13:38.159764       1 timeout.go:135] post-timeout activity - time-elapsed: 13.63787ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:07.944406       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.944539       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:07.944614       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="5b892969-1535-4b4c-9bcf-098b41715a3f"
E0707 08:14:07.950119       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:07.950496       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.951429       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.951550       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="81a6e59c-c9fd-4928-9970-3139892733df"
E0707 08:14:07.951884       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.955355       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="51b15586-d512-46ab-8043-96f15ba83db0"
E0707 08:14:07.955692       1 timeout.go:135] post-timeout activity - time-elapsed: 10.996794ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:07.956311       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="b3f23a0f-7a84-4fe0-93ed-11c74277106e"
E0707 08:14:07.956580       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:07.957598       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:08.046439       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:08.052644       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.053480       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.054590       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.057312       1 timeout.go:135] post-timeout activity - time-elapsed: 101.863564ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:08.144599       1 timeout.go:135] post-timeout activity - time-elapsed: 192.981022ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:08.145462       1 timeout.go:135] post-timeout activity - time-elapsed: 189.04884ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:08.146271       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:08.146469       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="33c48cb9-9f1d-4ed3-99f7-bb6b56f21a33"
E0707 08:14:08.150336       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:08.152789       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.154572       1 timeout.go:135] post-timeout activity - time-elapsed: 7.988662ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:37.747766       1 writers.go:111] apiserver was unable to close cleanly the response writer: http2: stream closed
E0707 08:14:37.852388       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="4ea88186-56b4-47a4-b73d-6a680fa8ce2f"
E0707 08:14:37.853284       1 timeout.go:135] post-timeout activity - time-elapsed: 13.497µs, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:38.046613       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:38.047320       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:38.047827       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="d12d9ec7-92c8-459d-ab49-f9953abda738"
E0707 08:14:38.054379       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:38.054542       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="9d28f636-0bb0-495a-9526-c81ebb4d0c98"
E0707 08:14:38.145771       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:38.147100       1 timeout.go:135] post-timeout activity - time-elapsed: 98.88436ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:38.152042       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:38.155348       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:38.156855       1 timeout.go:135] post-timeout activity - time-elapsed: 102.231415ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:38.158736       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="3bf9eb85-5fb1-4d18-8ec8-988ce1eca65b"
E0707 08:14:38.158967       1 writers.go:111] apiserver was unable to close cleanly the response writer: http: Handler timeout
E0707 08:14:38.161342       1 timeout.go:135] post-timeout activity - time-elapsed: 2.336539ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
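
For reference, the kind of Pods-type HPA I'm aiming for would look roughly like the sketch below; the Deployment name, the metric name (request_per_second), and the target value are placeholders rather than values taken from the adapter's docs.

# Hypothetical HPA driven by a per-pod custom metric served by the adapter.
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: request_per_second
      target:
        type: AverageValue
        averageValue: "100"
EOF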

@msathe-tech

I tried all the known tricks in the book: created the namespace ahead of time, created the needed K8s SA (custom-metrics-stackdriver-adapter), annotated the K8s SA with the GCP SA (which already has the Monitoring Editor role), and then used https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

I still get the following error:
E0822 20:17:27.104881       1 provider.go:271] Failed request to stackdriver api: Get "https://monitoring.googleapis.com/v3/projects/<project>/metricDescriptors?alt=json&filter=resource.labels.project_id+%3D+%22prj-gke-mt-spike%22+AND+resource.labels.cluster_name+%3D+%22<cluster>%22+AND+resource.labels.location+%3D+%22<c cluster-zone>%22+AND+resource.type+%3D+one_of%28%22k8s_pod%22%2C%22k8s_node%22%2C%22k8s_container%22%29&prettyPrint=false": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: The caller does not have permission
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
        https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

@msathe-tech

My GKE version is 1.24.3-gke.200

Without this feature, the HPA simply doesn't work with custom metrics. You need to rely on downloading an SA key, which goes against security best practices.

@eahrend

eahrend commented Sep 27, 2022

Hey, I'm getting this on 1.22.12-gke.300 as well, trying to use WIF.

I'm also getting this error:

E0927 18:27:46.752161       1 timeout.go:135] post-timeout activity - time-elapsed: 22.306873ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0927 18:27:46.755417       1 timeout.go:135] post-timeout activity - time-elapsed: 25.534497ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>

However, when I run kubectl proxy --port=8080 and go to http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta2 and http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta1, the response is not nil and arrives almost instantaneously.

@red8888

red8888 commented Oct 31, 2022

Can you confirm what service account is used by default by the metrics adapter? Is it the node's GSA?

I assign my own GSA to the node pools:

resource "google_container_node_pool" "mypool" {
  name       = "sdfsdfsdf"
  cluster    = google_container_cluster.cluster.name
  .....
  node_config {
    machine_type = "e2-highmem-4"
    // Assign a service account
    service_account = google_service_account.node-pool.email

Can you confirm that this GSA will need access? I don't need the adapter deployment to use Workload Identity itself; I'm OK with giving the node pool account this access.

I granted my node pool GSA the Monitoring Viewer role, but I'm still seeing this error in the deployment:
Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

@pdecat, in your case, running CMSA on a node with Workload Identity (WI) enabled broke it probably because the Google Service Account (GSA) associated with WI that CMSA was running as didn't have the roles/monitoring.viewer role on the relevant GCP projects that hold the metrics CMSA is trying to query.

When you changed CMSA to run on host network mode, it probably worked because now CMSA is using the GKE node's default GSA which is different than the WI-related GSA. This GKE node default GSA probably has roles/monitoring.viewer role or at least those permissions to query the metrics.

@red8888

red8888 commented Nov 1, 2022

I tried all the known tricks in the book: created the namespace ahead of time, created the needed K8s SA (custom-metrics-stackdriver-adapter), annotated the K8s SA with the GCP SA (which already has the Monitoring Editor role), and then used https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

I still get the following error:
E0822 20:17:27.104881       1 provider.go:271] Failed request to stackdriver api: Get "https://monitoring.googleapis.com/v3/projects/<project>/metricDescriptors?alt=json&filter=resource.labels.project_id+%3D+%22prj-gke-mt-spike%22+AND+resource.labels.cluster_name+%3D+%22<cluster>%22+AND+resource.labels.location+%3D+%22<c cluster-zone>%22+AND+resource.type+%3D+one_of%28%22k8s_pod%22%2C%22k8s_node%22%2C%22k8s_container%22%29&prettyPrint=false": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: The caller does not have permission
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
        https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

Sounds like you're missing this piece:

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

@iamhritik

I'm also getting the same error in the custom-metrics-stackdriver-adapter pod.
Any new updates?

E0214 16:25:51.059563       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="de9aad6d-13d9-4f88-86fc-ef73c7eb568f"
E0214 16:25:51.059907       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="9ac55f68-c490-4953-b32e-d775adfb056d"
E0214 16:25:51.060088       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="5f791624-f01f-45c7-8d8f-76e43bf56a9c"

However, when I run kubectl proxy --port=8080 and go to http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta2 and http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta1, the response is not nil and arrives almost instantaneously.

@nguyen-viet-hung

I have the same error as above:

E0221 03:19:29.283344       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0221 03:19:29.283406       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="927bf416-71ba-4845-880a-8ba80c0b044d"
E0221 03:19:29.283529       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="898079ad-6b51-45e9-b3ec-b723312ba8ff"

When I do kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq the response is not nil and happens almost instantaneously.

Does anybody have solutions?
My cluster version is: v1.24.9-gke.2000

@perrornet

I have the same error as above:

E0221 03:19:29.283344       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0221 03:19:29.283406       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="927bf416-71ba-4845-880a-8ba80c0b044d"
E0221 03:19:29.283529       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="898079ad-6b51-45e9-b3ec-b723312ba8ff"

When I do kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq the response is not nil and happens almost instantaneously.

Does anybody have solutions? My cluster version is: v1.24.9-gke.2000

I have also encountered this situation.

E0227 08:50:42.495351       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="b63c2246-27c1-4ece-a779-e552782f1dcd"
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E0227 08:50:42.523800       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="67d82f5f-5360-404b-b688-9639d9a89a88"
E0227 08:50:42.523807       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0227 08:50:42.526275       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0227 08:50:42.527535       1 timeout.go:135] post-timeout activity - time-elapsed: 32.12511ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

My cluster version is: v1.25.5-gke.2000

@Wazbat

Wazbat commented Jul 7, 2023

Shame this adapter isn't included by default. Struggling to resolve this on my end.

I get permission errors with Workload Identity, and even with hostNetwork: true I start to get these errors:

post-timeout activity - time-elapsed: 12.553843ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

@maxpain

maxpain commented Aug 30, 2023

Any updates?

@MikSFG

MikSFG commented Sep 7, 2023

I have the same error as above:

E0221 03:19:29.283344       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0221 03:19:29.283406       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="927bf416-71ba-4845-880a-8ba80c0b044d"
E0221 03:19:29.283529       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="898079ad-6b51-45e9-b3ec-b723312ba8ff"

When I do kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq the response is not nil and happens almost instantaneously.
Does anybody have solutions? My cluster version is: v1.24.9-gke.2000

I have also encountered this situation.

E0227 08:50:42.495351       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="b63c2246-27c1-4ece-a779-e552782f1dcd"
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E0227 08:50:42.523800       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="67d82f5f-5360-404b-b688-9639d9a89a88"
E0227 08:50:42.523807       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0227 08:50:42.526275       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0227 08:50:42.527535       1 timeout.go:135] post-timeout activity - time-elapsed: 32.12511ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

My cluster version is: v1.25.5-gke.2000

Happens to me as well; kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq works fine and gives normal output.

@IanKnighton

We're trying to set up the custom metrics adapter so we can scale pods off of messages in a Pub/Sub queue.

Currently we can't get past this error in the logs of the custom-metrics-stackdriver-adapter pod:

E0912 19:48:53.983876       1 provider.go:320] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden
E0912 19:48:53.984071       1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E0912 19:48:53.984139       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0912 19:48:53.985476       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0912 19:48:53.986869       1 timeout.go:135] post-timeout activity - time-elapsed: 9m10.432424127s, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>

I've tried every combination of service account I can think of, and as far as I can tell all of our nodes and pods have at least the monitoring.metricDescriptors.list permission.

It's kind of wild to me how inconsistent this appears to be. It's also kind of annoying that I just followed the documentation and still ended up here.
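
For reference, the HPA we're trying to drive looks roughly like the sketch below; the subscription ID, Deployment name, and target value are placeholders, and the metric/selector names follow the public GKE external-metrics examples rather than anything verified in this thread.

# Hypothetical HPA scaling a worker Deployment on Pub/Sub backlog via the
# external metrics API exposed by the adapter.
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pubsub-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pubsub-worker
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: pubsub.googleapis.com|subscription|num_undelivered_messages
        selector:
          matchLabels:
            resource.labels.subscription_id: my-subscription
      target:
        type: AverageValue
        averageValue: "30"
EOF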

@markhc

markhc commented Oct 11, 2023

Same issue on our end. Followed every step in the GCP README and the alternative methods here as well; still getting 403 errors.

I can get the 403 errors resolved by using hostNetwork: true for the deployment, but then other issues pop up and the pod enters a crash loop every couple of seconds with these GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil> errors.

GKE Cluster Version 1.24.15-gke.1700


EDIT: Finally managed to get it working. The crash loop I mentioned above was the pod getting OOMKilled, so I had to increase the resources for the custom-metrics Deployment.

Final working steps:

  1. Follow @aubm's steps and install the adapter.yaml resources, NOT adapter_new_resource_model.yaml. The new resource model adapter entered a different crash loop for me that I was not able to solve.
  2. Modify the adapter Deployment, adding hostNetwork: true to the spec. This solves the "403 Forbidden" errors.
  3. Increase the request/limit resources. See what works for you, but my adapter frequently reaches ~350MiB, and it only had a 200MiB limit originally. (Steps 2 and 3 are sketched as commands below.)
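
A rough sketch of steps 2 and 3 as commands, assuming the Deployment name and namespace from the upstream manifest; the memory values are just what happened to work for me:

# Step 2: run the adapter Pod on the host network (this is what resolved the 403s for me).
kubectl -n custom-metrics patch deployment custom-metrics-stackdriver-adapter \
  --type merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'

# Step 3: raise the memory request/limit; without -c this applies to every
# container in the Deployment, and 256Mi/512Mi are only example values.
kubectl -n custom-metrics set resources deployment custom-metrics-stackdriver-adapter \
  --requests=memory=256Mi --limits=memory=512Mi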

@ltieman

ltieman commented Oct 19, 2023

I was still getting 403 exceptions using @aubm's instructions until I added this:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-metrics-permissions
rules:
- apiGroups: [""]
  resources: ["subjectaccessreviews"]
  verbs: ["create"]

---

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-metrics-binding
subjects:
- kind: ServiceAccount
  name: custom-metrics-stackdriver-adapter
  namespace: custom-metrics  # Replace with the appropriate namespace
roleRef:
  kind: ClusterRole
  name: custom-metrics-permissions
  apiGroup: rbac.authorization.k8s.io

Once the IAM permissions propagated, it started working.

@PaulRudin

Just to be clear, is monitoring.editor necessary? You'd have thought monitoring.viewer would be enough, and that's what the README says.

Although it's somewhat academic in my case as I'm getting:

E1219 16:02:17.832184       1 provider.go:320] Failed request to stackdriver api: Get "https://monitoring.go
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
        https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

either way.

@PaulRudin

PaulRudin commented Dec 20, 2023

After applying the suggestion here, the permissions issue is fixed, but I still get loads of this sort of thing in the logs:

1220 07:24:32.869016       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="9c852465-5ffa-4d77-a441-f801886e29e3"
1220 07:24:32.869108       1 writers.go:111] apiserver was unable to close cleanly the response writer: http2: stream closed
E1220 07:24:32.871209       1 timeout.go:135] post-timeout activity - time-elapsed: 102.731383ms, GET "/api/custom.metrics.k8s.io/v1beta1" result:  <nil>                         
E1220 07:24:32.873362       1 timeout.go:135] post-timeout activity - time-elapsed: 103.207074ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result:  <nil>
E1220 07:24:32.874493       1 timeout.go:135] post-timeout activity - time-elapsed: 109.99919ms, GET "/api/custom.metrics.k8s.io/v1beta2" result: <nil>
E1220 07:24:32.876604       1 timeout.go:135] post-timeout activity - time-elapsed: 7.572859ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

@PaulRudin

... and either I was mistaken, or the issue has resurfaced. I still see this sort of thing:

apiserver received an error that is not an metav1.Status: &googleapi.Error{Code:403, Message:"Permission monitoring.timeSeries.list denied (or the resource may not exist).", Details:[]interface {}(nil), Body:"{\n  \"error\": {\n    \"code\": 403,\n    \"message\": \"Permission monitoring.timeSeries.list denied (or the resource may not exist).\",\n    \"errors\": [\n      {\n        \"message\": \"Permission monitoring.timeSeries.list denied (or the resource may not exist).\",\n        \"domain\": \"global\",\n        \"reason\": \"forbidden\"\n      }\n    ],\n    \"status\": \"PERMISSION_DENIED\"\n  }\n}\n", Header:http.Header(nil), Errors:[]googleapi.ErrorItem{googleapi.ErrorItem{Reason:"forbidden", Message:"Permission monitoring.timeSeries.list denied (or the resource may not exist)."}}}: googleapi: Error 403: Permission monitoring.timeSeries.list denied (or the resource may not exist)., forbidden

even though the relevant service account has the monitoring.viewer role.
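
For what it's worth, one way to double-check both sides of the Workload Identity setup from the CLI; the project ID and GSA email below are placeholders:

# Roles the GSA holds on the project (should include roles/monitoring.viewer).
gcloud projects get-iam-policy "$GCP_PROJECT_ID" \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

# Who may impersonate the GSA: the member
# serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]
# should appear with roles/iam.workloadIdentityUser.
gcloud iam service-accounts get-iam-policy \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"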
