Container Cluster stuck in non-ready state because channel update rejected #194

Closed
jlewi opened this issue Jun 1, 2020 · 2 comments
Labels
bug Something isn't working

Comments

jlewi commented Jun 1, 2020

Describe the bug

ConfigConnector Version
Run the following command to get the current ConfigConnector version

kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}' 

cnrm.cloud.google.com/version: 1.9.1

To Reproduce
Steps to reproduce the behavior:

  1. Create a ContainerCluster CNRM resource setting the release channel for the cluster to get
    the status

  2. Apply the resource to create the cluster

  3. Reapply the resource

  4. The ContainerCluster reports the following status condition:

      - lastTransitionTime: "2020-05-29T00:15:38Z"
        message: 'Update call failed: the desired mutation for the following field(s)
          is invalid: [releaseChannel.0.Channel]'
        reason: UpdateFailed
        status: "False"
        type: Ready
    

The release channel shouldn't be changing.

I suspect this is an issue in the update logic, since there are some restrictions on mutations to
the release channel:
https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels

YAML snippets:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  clusterName: gcp-private-dev/us-central1/jl-0601
  labels:
    kf-name: gcp-private-0527
    mesh_id: gcp-private-dev_us-central1_jl-0601
  name: jl-0601
  namespace: gcp-private-dev
spec:
  clusterAutoscaling:
    autoProvisioningDefaults:
      oauthScopes:
      - https://www.googleapis.com/auth/logging.write
      - https://www.googleapis.com/auth/monitoring
      - https://www.googleapis.com/auth/devstorage.read_only
      serviceAccountRef:
        name: jl-0601-vm
    enabled: true
    resourceLimits:
    - maximum: 128
      resourceType: cpu
    - maximum: 2000
      resourceType: memory
    - maximum: 16
      resourceType: nvidia-tesla-k80
  initialNodeCount: 2
  ipAllocationPolicy:
    clusterSecondaryRangeName: pods
    createSubnetwork: false
    servicesSecondaryRangeName: services
    useIpAliases: true
  location: us-central1
  loggingService: logging.googleapis.com/kubernetes
  monitoringService: monitoring.googleapis.com/kubernetes
  networkRef:
    name: jl-0601
  nodeConfig:
    machineType: n1-standard-8
    metadata:
      disable-legacy-endpoints: "true"
    oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/devstorage.read_only
    serviceAccountRef:
      name: jl-0601-vm
    workloadMetadataConfig:
      nodeMetadata: GKE_METADATA_SERVER
  privateClusterConfig:
    enablePrivateEndpoint: false
    enablePrivateNodes: true
    masterIpv4CidrBlock: 172.16.0.32/28
  releaseChannel:
    channel: stable
  subnetworkRef:
    name: jl-0601
  workloadIdentityConfig:
    identityNamespace: gcp-private-dev.svc.id.goog

kibbles-n-bytes (Contributor) commented Jun 3, 2020

Hey @jlewi, the underlying API's canonical form for the release channel is all uppercase: https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/projects.locations.clusters#channel

Curiously, the underlying API accepts the lowercase form as well, but further GET requests return the uppercase form, which we currently don't suppress a diff for. If you create your cluster with channel: STABLE instead, you should see this issue go away.
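
A minimal sketch of that change, showing only the identifying fields from the manifest above plus the corrected channel value. The rest of the spec is assumed to stay exactly as it was in the original resource:

```yaml
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: jl-0601
  namespace: gcp-private-dev
spec:
  # Only the release channel is shown here; the remaining spec fields
  # from the original manifest are assumed unchanged.
  releaseChannel:
    # Uppercase canonical form, matching what the API returns on GET,
    # so a reapply should no longer see a difference on this field.
    channel: STABLE
```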

jlewi (Author) commented Jun 3, 2020

Thanks, that's super helpful. I will fix that on our end.

Feel free to close this issue, unless you want to leave it open to track handling the diff more gracefully.

caieo closed this as completed Jun 3, 2020
jlewi pushed a commit to jlewi/manifests that referenced this issue Jun 4, 2020
* Tracking issue GoogleCloudPlatform/kubeflow-distribution#33

* Fix the setters on firewall rules. They should be partial setters so
  we don't lose the suffixes.

* Add a firewall rule to allow cert-manager webhooks; this is necessary
  to work with private GKE

  ref https://docs.cert-manager.io/en/release-0.11/getting-started/webhook.html#running-on-private-gke-clusters

* Add kpt/kustomize function to configure the transform to replace
  images with the mirror'd image versions.

* Update image mirroring configs

  * Instead of using "*" to match all images we list out image prefixes
    to match so we are a bit more intentional.

  * We want to include gcr.io images in order to support working with
    VPC-SC. For VPC-SC gcr.io images need to be mirror'd as
    well because they are unlikely to be within the perimeter

  * Use the locations gcr.io/${PROJECT}/mirror
    It looks like the mirror'ing pipeline includes the registry name

* Change the release channel on the cluster to be upper case

  * Per GoogleCloudPlatform/k8s-config-connector#194
    we need release channels to be upper case otherwise updates fail.

* centraldashboard v3 kustomization.yaml needs an image stanza
  * Without this we end up deploying using tag "latest" which isn't
    what we want.

* Use CNRM to enable services GoogleCloudPlatform/kubeflow-distribution#31

* Remove cert-manager ACME challenge from excluded paths for JWT
  validation
  * We no longer use cert-manager so we no longer need to allow that
    path.

* We need to add a default network route in order to give cloudnat
  outbound internet access
  * Need to access jwks

* Give routes and nat resources unique names based on the KF name.

* Route to public internet should be higher priority so google apis take precedence.
k8s-ci-robot pushed a commit to kubeflow/manifests that referenced this issue Jun 5, 2020
* Fix a bunch of issues with GCP blueprints for private GKE.

* Tracking issue GoogleCloudPlatform/kubeflow-distribution#33

* Fix the setters on firewall rules. They should be partial setters so
  we don't lose the suffixes.

* Add a firewall rule to allow cert-manager webhooks; this is necessary
  to work with private GKE

  ref https://docs.cert-manager.io/en/release-0.11/getting-started/webhook.html#running-on-private-gke-clusters

* Add kpt/kustomize function to configure the transform to replace
  images with the mirror'd image versions.

* Update image mirroring configs

  * Instead of using "*" to match all images we list out image prefixes
    to match so we are a bit more intentional.

  * We want to include gcr.io images in order to support working with
    VPC-SC. For VPC-SC gcr.io images need to be mirror'd as
    well because they are unlikely to be within the perimeter

  * Use the locations gcr.io/${PROJECT}/mirror
    It looks like the mirror'ing pipeline includes the registry name

* Change the release channel on the cluster to be upper case

  * Per GoogleCloudPlatform/k8s-config-connector#194
    we need release channels to be upper case otherwise updates fail.

* centraldashboard v3 kustomization.yaml needs an image stanza
  * Without this we end up deploying using tag "latest" which isn't
    what we want.

* Use CNRM to enable services GoogleCloudPlatform/kubeflow-distribution#31

* Remove cert-manager ACME challenge from excluded paths for JWT
  validation
  * We no longer use cert-manager so we no longer need to allow that
    path.

* We need to add a default network route in order to give cloudnat
  outbound internet access
  * Need to access jwks

* Give routes and nat resources unique names based on the KF name.

* Route to public internet should be higher priority so google apis take precedence.

* Regenerate tests.