A basic and minimal ComputeInstanceTemplate doesn't materialize. #310

Closed
bshimc opened this issue Nov 20, 2020 · 9 comments

@bshimc

bshimc commented Nov 20, 2020

Describe the bug
A basic and minimal ComputeInstanceTemplate doesn't materialize. The resource is created without any feedback saying there's a problem. When I run kubectl describe on the resource, there are no events.

ConfigConnector Version
Run the following command to get the current ConfigConnector version

kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}' 
1.27.1

To Reproduce
Apply the resource (below). Observe that gcloud compute instance-templates list doesn't show anything. There are no logs for resource.type="gce_instance_template". The DEBUG/INFO log volume for the KCC-related controllers is so high (4000/minute) that it's extremely difficult to extract any useful information from that source.

YAML snippets:

apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeInstanceTemplate
metadata:
  name: my-template
spec:
  machineType: e2-highcpu-16
  disk:
  - sourceImageRef:
      external: projects/cos-cloud/global/images/family/cos-stable
    autoDelete: true
    boot: true
    diskType: pd-ssd
    diskSizeGb: 48
    type: PERSISTENT
@bshimc bshimc added the bug Something isn't working label Nov 20, 2020
@toumorokoshi toumorokoshi self-assigned this Nov 23, 2020
@toumorokoshi
Contributor

Hey @bshimc. Sorry to hear you're having issues.

I tried this locally and I was able to get an error message by querying the events on the resource itself:

command:

$ kubectl describe -n gh-310 ComputeInstanceTemplate

output:

Name:         my-template
Namespace:    gh-310
Labels:       <none>
Annotations:  cnrm.cloud.google.com/management-conflict-prevention-policy: none
              cnrm.cloud.google.com/project-id: {REDACTED}
              kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"compute.cnrm.cloud.google.com/v1beta1","kind":"ComputeInstanceTemplate","metadata":{"annotations":{},"name":"my-template","...
API Version:  compute.cnrm.cloud.google.com/v1beta1
Kind:         ComputeInstanceTemplate
Metadata:
  Creation Timestamp:  2020-11-23T17:14:09Z
  Finalizers:
    cnrm.cloud.google.com/finalizer
    cnrm.cloud.google.com/deletion-defender
  Generation:        33
  Resource Version:  8846634
  Self Link:         /apis/compute.cnrm.cloud.google.com/v1beta1/namespaces/gh-310/computeinstancetemplates/my-template
  UID:               6a3e59f1-8f02-4a37-adf9-f37f7c8e93f8
Spec:
  Disk:
    Auto Delete:   true
    Boot:          true
    Disk Size Gb:  48
    Disk Type:     pd-ssd
    Source Image Ref:
      External:  projects/cos-cloud/global/images/family/cos-stable
    Type:        PERSISTENT
  Machine Type:  e2-highcpu-16
Status:
  Conditions:
    Last Transition Time:  2020-11-23T17:14:09Z
    Message:               Update call failed: error applying desired state: Error creating instance template: googleapi: Error 400: Invalid value for field 'resource.properties.networkInterfaces[0]': ''. Instance properties must provide at least one NetworkInterface definition., invalid
    Reason:                UpdateFailed
    Status:                False
    Type:                  Ready
Events:
  Type     Reason        Age                     From                                Message
  ----     ------        ----                    ----                                -------
  Warning  UpdateFailed  4m36s (x12 over 4m53s)  computeinstancetemplate-controller  Update call failed: error applying desired state: Error creating instance template: googleapi: Error 400: Invalid value for field 'resource.properties.networkInterfaces[0]': ''. Instance properties must provide at least one NetworkInterface definition., invalid
  Normal   Updating      4m25s (x13 over 4m53s)  computeinstancetemplate-controller  Update in progress
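
The describe output above is long; the failure itself lives in the Ready condition and in the events. A minimal sketch for pulling out just those two pieces follows. The function name `kcc_status` is hypothetical (it is not part of kubectl or Config Connector), and the sketch assumes kubectl is already pointed at the right cluster.

```shell
# Hypothetical helper: print the Ready condition and the events for one
# ComputeInstanceTemplate. Usage: kcc_status NAME [NAMESPACE]
kcc_status() {
  name="$1"
  ns="${2:-default}"

  # The JSONPath filter selects only the condition whose type is "Ready".
  kubectl get computeinstancetemplate "$name" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}{"\n"}{.status.conditions[?(@.type=="Ready")].message}{"\n"}'

  # Events are separate objects; list the ones that reference this resource.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=ComputeInstanceTemplate,involvedObject.name=$name" \
    --sort-by=.metadata.creationTimestamp
}
```

For the resource in this comment the call would be `kcc_status my-template gh-310`. If both commands print nothing, the resource has no Ready condition and no events at all.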

I was also able to find this in Cloud Logging, using the following query:

resource.type="k8s_container"
resource.labels.project_id="{REDACTED}"
resource.labels.location="us-central1-c"
resource.labels.cluster_name="{REDACTED}"
resource.labels.namespace_name="cnrm-system"
resource.labels.pod_name:"cnrm-controller-manager-"
jsonPayload.controller="computeinstancetemplate-controller"

I agree the logs are fairly verbose, though, and I would only use them if querying the resource fails to provide any information.
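
One way to cut through the verbosity is to run the same filter from the command line with a severity clause added. This is a sketch under assumptions: the function name `kcc_error_logs` is hypothetical, the project and cluster names are placeholders (the thread redacts the real ones), and it assumes the controller records failures at WARNING severity or higher, which the thread does not confirm.

```shell
# Hypothetical helper: read recent warning/error entries from the
# ComputeInstanceTemplate controller. Usage: kcc_error_logs PROJECT CLUSTER
kcc_error_logs() {
  project="$1"
  cluster="$2"

  # Same clauses as the query above, joined with explicit ANDs.
  filter='resource.type="k8s_container"'
  filter="$filter AND resource.labels.cluster_name=\"$cluster\""
  filter="$filter AND resource.labels.namespace_name=\"cnrm-system\""
  filter="$filter AND resource.labels.pod_name:\"cnrm-controller-manager-\""
  filter="$filter AND jsonPayload.controller=\"computeinstancetemplate-controller\""
  # Assumption: failures are logged at WARNING or above.
  filter="$filter AND severity>=WARNING"

  gcloud logging read "$filter" --project="$project" \
    --freshness=1h --limit=20 --format=json
}
```

Dropping the severity clause gives back the original, unfiltered query.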

@toumorokoshi
Contributor

Would you mind pasting the described state, similar to how I did above? Or let me know if those troubleshooting steps help.

The error I ran into is the lack of a NetworkInterface definition. Can you clarify how you came to the conclusion that the spec you posted was a minimal working set? I'm looking at our documentation for the resource, and although it does not explicitly state that at least one NetworkInterface is required, our example does include one.
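
For illustration, the original spec with a network interface added might look like the sketch below. Everything under `networkInterface` is an assumption and not taken from the thread: it assumes the project has a VPC network named `default` and that `networkRef.external` accepts a bare network name, so the exact value format should be checked against the resource reference. The thread does not confirm that this spec creates successfully.

```shell
# Write the manifest to a file; nothing is applied to a cluster here.
cat > my-template.yaml <<'EOF'
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeInstanceTemplate
metadata:
  name: my-template
spec:
  machineType: e2-highcpu-16
  disk:
  - sourceImageRef:
      external: projects/cos-cloud/global/images/family/cos-stable
    autoDelete: true
    boot: true
    diskType: pd-ssd
    diskSizeGb: 48
    type: PERSISTENT
  # Assumption: attach the project's "default" network.
  networkInterface:
  - networkRef:
      external: default
EOF

# To try it against a cluster:
#   kubectl apply -f my-template.yaml
```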

@bshimc
Author

bshimc commented Nov 23, 2020

@toumorokoshi thanks for following up! Here's what I'm seeing:

$ kubectl describe computeinstancetemplate.compute.cnrm.cloud.google.com/salta-template | sed 's/XXXXXX/redacted/'
Name:         redacted-template
Namespace:    default
Labels:       <none>
Annotations:  cnrm.cloud.google.com/management-conflict-prevention-policy: none
              cnrm.cloud.google.com/project-id: default
API Version:  compute.cnrm.cloud.google.com/v1beta1
Kind:         ComputeInstanceTemplate
Metadata:
  Creation Timestamp:  2020-11-20T22:01:26Z
  Generation:          1
  Managed Fields:
    API Version:  compute.cnrm.cloud.google.com/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:disk:
        f:machineType:
    Manager:         downloaded
    Operation:       Update
    Time:            2020-11-20T22:01:26Z
  Resource Version:  79298865
  Self Link:         /apis/compute.cnrm.cloud.google.com/v1beta1/namespaces/default/computeinstancetemplates/redacted-template
  UID:               322e1f00-b5cb-4f57-864c-9d790fa89e30
Spec:
  Disk:
    Auto Delete:   true
    Boot:          true
    Disk Size Gb:  48
    Disk Type:     pd-ssd
    Source Image Ref:
      External:  projects/cos-cloud/global/images/family/cos-stable
    Type:        PERSISTENT
  Machine Type:  e2-highcpu-16
Events:          <none>

Notably: Events: <none>

> Can you clarify how you came to the conclusion that the spec you posted was a minimal working set?

Great question; I had no idea whether it was the minimal working set. I typically rely on trial and error (via describe and inspecting logs), but in this case I had trouble tracking down any errors. I will check again to see if I can find them using your helpful query.
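
Compared with the earlier describe output in this thread, the one above has no Finalizers and no Status section in addition to having no events. A rough sketch for checking that from a script follows. The function name `kcc_reconciled` is hypothetical, and reading "no finalizers and no conditions" as "the controller has not processed the resource" is an assumption, not something the thread establishes.

```shell
# Hypothetical helper: report whether a ComputeInstanceTemplate shows any sign
# of having been processed. Usage: kcc_reconciled NAME [NAMESPACE]
kcc_reconciled() {
  name="$1"
  ns="${2:-default}"

  # Missing fields produce empty output with kubectl's default JSONPath settings.
  finalizers="$(kubectl get computeinstancetemplate "$name" -n "$ns" \
    -o jsonpath='{.metadata.finalizers}')"
  conditions="$(kubectl get computeinstancetemplate "$name" -n "$ns" \
    -o jsonpath='{.status.conditions}')"

  if [ -z "$finalizers" ] && [ -z "$conditions" ]; then
    echo "no finalizers or status conditions: the controller may not have seen $name"
    return 1
  fi
  echo "finalizers or status conditions present on $name"
}
```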

@toumorokoshi
Contributor

Thanks! The peculiar part to me is that you don't have events. I have a couple of follow-up questions:

  • Does the logging query I posted above yield anything?
  • What version of Kubernetes are you running? I tested on 1.16, and your output shows server-side apply managed fields, so yours must be newer than that.

I might give it a try with a newer version of Kubernetes to see what happens.

@bshimc
Author

bshimc commented Nov 24, 2020

I just started digging into the logs and I can't (yet) quite find something like the error entries you posted.

$ gcloud container clusters describe kcc --zone us-central1-c
addonsConfig:
  configConnectorConfig:
    enabled: true
  horizontalPodAutoscaling: {}
  httpLoadBalancing:
    disabled: true
  kubernetesDashboard:
    disabled: true
  networkPolicyConfig:
    disabled: true
clusterIpv4Cidr: 10.0.0.0/14
createTime: '2020-10-21T04:17:55+00:00'
currentMasterVersion: 1.18.10-gke.2101
currentNodeCount: 1
currentNodeVersion: 1.18.10-gke.2101
...
initialClusterVersion: 1.18.9-gke.801

(This cluster is on the rapid release channel.)

@toumorokoshi
Contributor

Hi! I tried to reproduce this using the rapid channel of GKE, and the output was the same:

  • The error points at the lack of a NetworkInterface.
  • The event populated correctly.
  • The log entry existed in Cloud Logging.

I used:

  • GKE Config Connector addon
  • rapid release channel (v1.18.10-gke.2101)
  • single SA mode (not namespaced)

Can you see if this happens with a brand new cluster? And are you using namespaced mode?

Also, if you have any logs at all for the pods in the cnrm-system namespace, could you post them? At this point I'm also wondering whether Config Connector is running correctly.

Could you also post the output of this command, to see which pods are running:

kubectl get pods -n cnrm-system
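
The two requests above (pod status and pod logs) can be gathered in one pass. This is a sketch only: the function name `kcc_health` is hypothetical, and the 60 second timeout and 20 line tail are arbitrary choices.

```shell
# Hypothetical helper: check that the Config Connector pods are Ready and
# print the tail of their logs.
kcc_health() {
  # Wait until every pod in cnrm-system reports Ready, or give up after 60s.
  kubectl wait -n cnrm-system --for=condition=Ready pod --all --timeout=60s

  # "-o name" prints entries such as pod/<pod-name>, which kubectl logs accepts.
  for pod in $(kubectl get pods -n cnrm-system -o name); do
    echo "== $pod"
    kubectl logs -n cnrm-system "$pod" --all-containers --tail=20
  done
}
```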

@bshimc
Author

bshimc commented Dec 1, 2020

I deleted the old cluster and went through a manual installation (so not the GKE config add-on) and I can now see events for the resource via 'describe'.

@bshimc
Author

bshimc commented Dec 1, 2020

Thanks for your help @toumorokoshi; given the project's velocity I'm not sure it's worth trying to reproduce what I saw in the original bug report.

@bshimc bshimc closed this as completed Dec 1, 2020
@toumorokoshi
Contributor

Glad to hear that a new cluster worked! And appreciate the flexibility. We're trying to improve debuggability so that hopefully we can capture more details to figure out tricky issues like this one.
