SQLInstance: reference SQLInstance X/Y is not ready #294

Closed
rnaveiras opened this issue Oct 21, 2020 · 27 comments
Labels
question Further information is requested

Comments

@rnaveiras

We have observed that all resources related to sql.cnrm.cloud.google.com/v1beta1 have multiple events showing them transitioning between Ready and DependencyNotReady. It seems that the state is flapping.

$ kubectl get sqldatabase -o json | jq '.items | map(.status.conditions)'                                                                                                                                                                                           
[                                                                                                                                                                                                                                                                                                           
  [                                                                                                                                                                                                                                                                                                         
    {                                                                                                                                                                                                                                                                                                       
      "lastTransitionTime": "2020-10-21T11:44:58Z",                                                                                                                                                                                                                                                         
      "message": "reference SQLInstance X/Y is not ready",                                                                                                                                                                                                                        
      "reason": "DependencyNotReady",                                                                                                                                                                                                                                                                       
      "status": "False",                                                                                                                                                                                                                                                                                    
      "type": "Ready"                                                                                                                                                                                                                                                                                       
    }                                                                                                                                                                                                                                                                                                       
  ]                                                                                                                                                                                                                                                                                                         
]

Sometime later:

$ kubectl get sqldatabase -o json | jq '.items | map(.status.conditions)'
[                                                                                                                                                                                                                                                                                                           
  [                                                                                                                                                                                                                                                                                                         
    {                                                                                                                                                                                                                                                                                                       
      "lastTransitionTime": "2020-10-21T11:44:58Z",                                                                                                                                                                                                                                                         
      "message": "reference SQLInstance X/Y is not ready",                                                                                                                                                                                                                        
      "reason": "DependencyNotReady",                                                                                                                                                                                                                                                                       
      "status": "False",                                                                                                                                                                                                                                                                                    
      "type": "Ready"                                                                                                                                                                                                                                                                                       
    }                                                                                                                                                                                                                                                                                                       
  ]                                                                                                                                                                                                                                                                                                         
]   

Checking the events on the resource:

  Type     Reason              Age                    From                    Message
  ----     ------              ----                   ----                    -------
  Normal   UpToDate            24m (x63 over 22h)     sqldatabase-controller  The resource is up to date
  Warning  DependencyNotReady  4m48s (x855 over 22h)  sqldatabase-controller  reference SQLInstance X/Y is not ready

The same happens for the rest of the resources related to Cloud SQL, such as sqlsslcert.sql.cnrm.cloud.google.com, sqldatabase.sql.cnrm.cloud.google.com, and sqlinstance.sql.cnrm.cloud.google.com.

Example from sqlsslcert.sql.cnrm.cloud.google.com

Events:
  Type     Reason              Age                    From                   Message
  ----     ------              ----                   ----                   -------
  Normal   UpToDate            48m (x59 over 21h)     sqlsslcert-controller  The resource is up to date
  Warning  DependencyNotReady  7m59s (x829 over 21h)  sqlsslcert-controller  reference SQLInstance X/Y is not ready

Could you advise about this issue, please?

rnaveiras added the question label on Oct 21, 2020
@rnaveiras
Author

/cc @jcanseco

@kibbles-n-bytes
Contributor

Hey @rnaveiras, could you share what the events look like for the SQLInstance object being referenced? There may be a bug on our side when handling your instance configuration that is causing it to continuously update, which would explain why its dependent resources keep seeing it as not ready.
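
(For reference, events like the ones requested here can be pulled with standard kubectl commands along these lines; the instance name and namespace are placeholders.)

# kubectl describe prints the resource's recent events at the bottom
$ kubectl describe sqlinstance <instance-name> -n <namespace>

# Or query the Event objects for the instance directly
$ kubectl get events -n <namespace> \
    --field-selector involvedObject.kind=SQLInstance,involvedObject.name=<instance-name> \
    --sort-by=.lastTimestamp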

@rnaveiras
Author

Events in the sqlinstance:

  Normal  Updating  14m (x522 over 7d19h)  sqlinstance-controller  Update in progress
  Normal  UpToDate  11m (x522 over 7d19h)  sqlinstance-controller  The resource is up to date

Events in the namespace:

10m         Normal    Updating             sqlinstance/abacus                                                                           Update in progress
10m         Warning   DependencyNotReady   sqluser/abacus                                                                               reference SQLInstance abacus-sandbox-staging/abacus is not ready
10m         Warning   DependencyNotReady   sqldatabase/abacus                                                                           reference SQLInstance abacus-sandbox-staging/abacus is not ready
8m12s       Normal    UpToDate             sqlinstance/abacus                                                                           The resource is up to date
7m55s       Normal    UpToDate             sqluser/abacus                                                                               The resource is up to date
7m54s       Normal    UpToDate             sqldatabase/abacus                                                                           The resource is up to date
10m         Warning   DependencyNotReady   sqlsslcert/abacus                                                                            reference SQLInstance abacus-sandbox-staging/abacus is not ready
8m1s        Normal    UpToDate             sqlsslcert/abacus                                                                            The resource is up to date

I hope this helps

@caieo
Contributor

caieo commented Oct 30, 2020

@rnaveiras , thank you for sharing the events with us. Would you be able to also share the configuration you're using for your SQLInstance so that we can try to replicate this issue?

@benwh

benwh commented Jan 12, 2021

Hey @caieo - I work on the same team as @rnaveiras

Apologies it took us a while to get back to you on this!

Here's a dump of an instance that's (still) exhibiting this issue:

---
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: resource
    cnrm.cloud.google.com/observed-secret-versions: '{}'
    cnrm.cloud.google.com/project-id: project-redacted
    cnrm.cloud.google.com/supports-ssa: "true"
  creationTimestamp: "2020-09-30T08:48:02Z"
  finalizers:
  - cnrm.cloud.google.com/finalizer
  - cnrm.cloud.google.com/deletion-defender
  generation: 17641
  labels:
    app: abacus
    app.kubernetes.io/instance: prd-abacus-sandbox-staging-abacus
    environment: sandbox-staging
    part-of: abacus
    release: abacus
    service: abacus
  managedFields:
  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cnrm.cloud.google.com/supports-ssa: {}
    manager: supports-ssa
    operation: Apply
    time: "2020-10-08T14:41:36Z"
  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cnrm.cloud.google.com/management-conflict-prevention-policy: {}
          f:cnrm.cloud.google.com/observed-secret-versions: {}
          f:cnrm.cloud.google.com/project-id: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:finalizers:
          v:"cnrm.cloud.google.com/deletion-defender": {}
          v:"cnrm.cloud.google.com/finalizer": {}
        f:labels:
          f:app: {}
          f:app.kubernetes.io/instance: {}
          f:environment: {}
          f:part-of: {}
          f:release: {}
          f:service: {}
      f:spec:
        f:databaseVersion: {}
        f:region: {}
        f:settings:
          f:activationPolicy: {}
          f:availabilityType: {}
          f:backupConfiguration:
            f:enabled: {}
            f:startTime: {}
          f:diskAutoresize: {}
          f:diskSize: {}
          f:diskType: {}
          f:ipConfiguration:
            f:authorizedNetworks: {}
            f:ipv4Enabled: {}
            f:requireSsl: {}
          f:locationPreference:
            f:zone: {}
          f:pricingPlan: {}
          f:replicationType: {}
          f:tier: {}
      f:status:
        f:connectionName: {}
        f:firstIpAddress: {}
        f:ipAddress: {}
        f:publicIpAddress: {}
        f:selfLink: {}
        f:serverCaCert:
          f:cert: {}
          f:commonName: {}
          f:createTime: {}
          f:expirationTime: {}
          f:sha1Fingerprint: {}
        f:serviceAccountEmailAddress: {}
    manager: before-first-apply
    operation: Update
  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
    manager: cnrm-controller-manager
    operation: Update
    time: "2021-01-12T00:17:57Z"
  name: abacus
  namespace: abacus-sandbox-staging
  resourceVersion: "765985076"
  selfLink: /apis/sql.cnrm.cloud.google.com/v1beta1/namespaces/abacus-sandbox-staging/sqlinstances/abacus
  uid: 108919d2-af60-4f72-a478-b7dd20fa2222
spec:
  databaseVersion: POSTGRES_12
  region: europe-west4
  settings:
    activationPolicy: ALWAYS
    availabilityType: REGIONAL
    backupConfiguration:
      enabled: true
      startTime: "07:00"
    diskAutoresize: true
    diskSize: 10
    diskType: PD_SSD
    ipConfiguration:
      authorizedNetworks:
      - name: all
        value: 0.0.0.0/0
      ipv4Enabled: true
      requireSsl: true
    locationPreference:
      zone: europe-west4-a
    pricingPlan: PER_USE
    replicationType: SYNCHRONOUS
    tier: db-custom-1-3840
status:
  conditions:
  - lastTransitionTime: "2021-01-12T00:17:57Z"
    message: The resource is up to date
    reason: UpToDate
    status: "True"
    type: Ready
  connectionName: project-redacted:europe-west4:abacus
  firstIpAddress: 1.2.3.4
  ipAddress:
  - ipAddress: 1.2.3.4
    type: PRIMARY
  publicIpAddress: 1.2.3.4
  selfLink: https://sqladmin.googleapis.com/sql/v1beta4/projects/project-redacted/instances/abacus
  serverCaCert:
    cert: |-
      redacted
    commonName: C=US,O=Google\, Inc,CN=Google Cloud SQL Server CA,dnQualifier=5548eefb-f843-458c-a67a-ea2f396e55c1
    createTime: "2020-09-30T08:49:11.127Z"
    expirationTime: "2030-09-28T08:50:11.127Z"
    sha1Fingerprint: 2b4fc8716cb4fdf29b4269ae79cdbf6a33c11083
  serviceAccountEmailAddress: redacted@gcp-sa-cloud-sql.iam.gserviceaccount.com

@snuggie12

We have this issue for most of our SQL instances. We're currently on 1.34.0 but have seen this on multiple versions.

Same symptoms as above. Eventually all events balance out (e.g. 498 UpToDate and 498 Updating). Based on the fact that the generation keeps increasing, either the reconciliation loop is not ignoring the correct fields (like status) or something is legitimately changing.

@snuggie12

Here is the controller's log for it happening and the diff between the two versions:
controller log

2021-02-02T17:09:42.118728000Z {"severity":"info","logger":"sqlinstance-controller","msg":"starting reconcile","resource":{"namespace":"document-manager","name":"document-manager-db"}}
2021-02-02T17:09:42.284219048Z {"severity":"info","logger":"sqlinstance-controller","msg":"creating/updating underlying resource","resource":{"namespace":"document-manager","name":"document-manager-db"}}
2021-02-02T17:11:21.113692503Z {"severity":"info","logger":"sqlinstance-controller","msg":"successfully finished reconcile","resource":{"namespace":"document-manager","name":"document-manager-db"}}

Diff between the YAMLs captured at the two resourceVersions/generations; the changes are in generation, resourceVersion, status, and the managedFields entry for status (I'm guessing that last one is what is broken):

$ diff -u /tmp/sqlinstance.old /tmp/sqlinstance.new
--- /tmp/sqlinstance.old        2021-02-02 09:13:33.000000000 -0800
+++ /tmp/sqlinstance.new        2021-02-02 09:13:51.000000000 -0800
@@ -10,7 +10,7 @@
   finalizers:
   - cnrm.cloud.google.com/finalizer
   - cnrm.cloud.google.com/deletion-defender
-  generation: 10988
+  generation: 10989
   labels:
     missionlane.com/owner: document-manager
   managedFields:
@@ -81,7 +81,7 @@
         f:conditions: {}
     manager: cnrm-controller-manager
     operation: Update
-    time: "2021-02-02T16:49:41Z"
+    time: "2021-02-02T17:09:42Z"
   name: document-manager-db
   namespace: document-manager
   ownerReferences:
@@ -91,7 +91,7 @@
     kind: FutureObject
     name: document-manager-db
     uid: c66e7275-4b16-4b29-95f9-8a7b5028d1e3
-  resourceVersion: "84322797"
+  resourceVersion: "84344565"
   selfLink: /apis/sql.cnrm.cloud.google.com/v1beta1/namespaces/document-manager/sqlinstances/document-manager-db
   uid: ad6e14dd-4709-41d3-897b-47f007424863
 spec:
@@ -118,10 +118,10 @@
     tier: db-g1-small
 status:
   conditions:
-  - lastTransitionTime: "2021-02-02T16:49:41Z"
-    message: The resource is up to date
-    reason: UpToDate
-    status: "True"
+  - lastTransitionTime: "2021-02-02T17:09:42Z"
+    message: Update in progress
+    reason: Updating
+    status: "False"
     type: Ready
   connectionName: document-manager-dev-a67e:us-east4:document-manager-db
   firstIpAddress: 10.17.16.10

@xiaobaitusi
Contributor

Hi @snuggie12, do your SQLInstance resources have the "cnrm.cloud.google.com/management-conflict-prevention-policy: resource" annotation? If so, that means the label lease is enabled for conflict prevention, and Config Connector needs to update your instance's labels to renew the lease.

You can disable it per https://cloud.google.com/config-connector/docs/concepts/managing-conflicts#modifying_conflict_prevention.
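
(As a concrete sketch, the policy can be switched off on a single resource by setting that annotation to none, either in the manifest or with kubectl annotate; the instance name and namespace below are placeholders.)

$ kubectl annotate sqlinstance <instance-name> -n <namespace> \
    cnrm.cloud.google.com/management-conflict-prevention-policy=none --overwrite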

@snuggie12

@xiaobaitusi We only add the deletion-policy annotation, set to abandon. It looks like the controller adds that annotation itself, though:

cnrm.cloud.google.com/management-conflict-prevention-policy: resource

Based on that link, the default is determined by the resource type and whether it supports labels. I believe SQLInstance does support labels, so that explains the default.

Seeing as we only have one controller, am I understanding you correctly that explicitly setting it to none will tell it to stop this behavior and hopefully stop the constant reconciliations?

@snuggie12

@xiaobaitusi that did indeed fix the problem for us.

@benwh

benwh commented Feb 17, 2021

@xiaobaitusi Just to clarify though, disabling conflict prevention shouldn't really be required right? This still sounds like a bug in the controller if it's not able to renew the lease on the resource without changing its Ready condition to false temporarily.

@jcanseco
Member

jcanseco commented Mar 15, 2021

Hi @benwh, I apologize that we missed your question. Yes, you are correct: the controller should not be marking the resource Ready: false simply to renew the lease. We agree that this is a bug, and we'll work on fixing it.

@snuggie12

I'm seeing new behavior with this.

This seems specific to only one of our SQL instances, and it also seems specific to .metadata.managedFields. I'm thinking there is one specific change that the field managers can't agree on.

I have the deletion policy set to abandon, so my plan is to delete the resource, which should clear out the managedFields changes, right?

Here is an example diff as well as the md5sums of the yaml taken approximately every second:

$ diff -u 1617859845.txt 1617859847.txt
--- 1617859845.txt      2021-04-07 22:30:45.000000000 -0700
+++ 1617859847.txt      2021-04-07 22:30:47.000000000 -0700
@@ -10,7 +10,7 @@
   finalizers:
   - cnrm.cloud.google.com/finalizer
   - cnrm.cloud.google.com/deletion-defender
-  generation: 6848296
+  generation: 6848298
   labels:
     missionlane.com/owner: platform
   managedFields:
@@ -88,7 +88,7 @@
         f:conditions: {}
     manager: cnrm-controller-manager
     operation: Update
-    time: "2021-04-08T05:30:44Z"
+    time: "2021-04-08T05:30:46Z"
   - apiVersion: sql.cnrm.cloud.google.com/v1beta1
     fieldsType: FieldsV1
     fieldsV1:
@@ -99,7 +99,7 @@
           f:diskSize: {}
     manager: manager
     operation: Update
-    time: "2021-04-08T05:30:44Z"
+    time: "2021-04-08T05:30:46Z"
   name: servicing-change-in-terms
   namespace: platform
   ownerReferences:
@@ -109,7 +109,7 @@
     kind: FutureObject
     name: platform-change-in-terms-sqlinstance
     uid: 141b9c95-17c9-4f77-aae1-503c4cd51eea
-  resourceVersion: "159144467"
+  resourceVersion: "159144492"
   selfLink: /apis/sql.cnrm.cloud.google.com/v1beta1/namespaces/platform/sqlinstances/servicing-change-in-terms
   uid: 6481ad8a-98b8-4dd0-8c3a-1fbc874d1fcc
 spec:

183cb8ca236626f72d54c520e0027f27  1617859678.txt
183cb8ca236626f72d54c520e0027f27  1617859681.txt
183cb8ca236626f72d54c520e0027f27  1617859683.txt
183cb8ca236626f72d54c520e0027f27  1617859684.txt
183cb8ca236626f72d54c520e0027f27  1617859685.txt
183cb8ca236626f72d54c520e0027f27  1617859687.txt
183cb8ca236626f72d54c520e0027f27  1617859688.txt
183cb8ca236626f72d54c520e0027f27  1617859690.txt
183cb8ca236626f72d54c520e0027f27  1617859691.txt
183cb8ca236626f72d54c520e0027f27  1617859693.txt
183cb8ca236626f72d54c520e0027f27  1617859694.txt
183cb8ca236626f72d54c520e0027f27  1617859695.txt
183cb8ca236626f72d54c520e0027f27  1617859697.txt
183cb8ca236626f72d54c520e0027f27  1617859698.txt
183cb8ca236626f72d54c520e0027f27  1617859700.txt
183cb8ca236626f72d54c520e0027f27  1617859701.txt
b48f0ecb40925462477f4891092a558c  1617859703.txt
b48f0ecb40925462477f4891092a558c  1617859704.txt
b48f0ecb40925462477f4891092a558c  1617859705.txt
c5ea62c74e002c484df61774e7491552  1617859707.txt
34770102810c54fcd3b2179f708fd3b2  1617859708.txt
c937b3ca8dc5d2cc917a1f7a293323bc  1617859710.txt
4d747ef6a948512bbcf9b9bbd74b9768  1617859711.txt
be75f6c70b92372e079e1eef582ba621  1617859712.txt
2a33d73728210668cd0c551fa300640f  1617859714.txt
a65b7ec782aa65e48c017587a271425a  1617859715.txt
d922349a217c8ca88fcf546acfd8da10  1617859717.txt
f3ded818784d89f68bfc68ba45802f7a  1617859718.txt
7d5259e58c6a9879aa4ca30fdaa4a2fc  1617859719.txt
0a46f8cb6bd7706f3b21e846a04e3ea4  1617859721.txt
ac11183c67a7c38e775e6ca2fce71c5a  1617859722.txt
ee1e591393fb50f5db046d119d7c3f3f  1617859724.txt
5cebead8d298f98ec7b71f55132ea818  1617859725.txt
6756feb3116902401a8eb6ddfa081361  1617859726.txt
6756feb3116902401a8eb6ddfa081361  1617859728.txt
6756feb3116902401a8eb6ddfa081361  1617859729.txt
6756feb3116902401a8eb6ddfa081361  1617859731.txt
6756feb3116902401a8eb6ddfa081361  1617859732.txt
6756feb3116902401a8eb6ddfa081361  1617859733.txt
6756feb3116902401a8eb6ddfa081361  1617859735.txt
6756feb3116902401a8eb6ddfa081361  1617859736.txt
6756feb3116902401a8eb6ddfa081361  1617859738.txt
6756feb3116902401a8eb6ddfa081361  1617859739.txt
6756feb3116902401a8eb6ddfa081361  1617859741.txt
6756feb3116902401a8eb6ddfa081361  1617859742.txt
6756feb3116902401a8eb6ddfa081361  1617859743.txt
6756feb3116902401a8eb6ddfa081361  1617859745.txt
6756feb3116902401a8eb6ddfa081361  1617859746.txt
6756feb3116902401a8eb6ddfa081361  1617859748.txt
6756feb3116902401a8eb6ddfa081361  1617859749.txt
6756feb3116902401a8eb6ddfa081361  1617859750.txt
6756feb3116902401a8eb6ddfa081361  1617859752.txt
6756feb3116902401a8eb6ddfa081361  1617859753.txt
6756feb3116902401a8eb6ddfa081361  1617859755.txt
6756feb3116902401a8eb6ddfa081361  1617859756.txt
6756feb3116902401a8eb6ddfa081361  1617859757.txt
83050d5ec16b29139b4f44fda742940d  1617859759.txt
6c4f35745baee47f72e75b3b66b841c0  1617859760.txt
af19532182de4b3788e0ae2e22e8c72f  1617859762.txt
0987fc4201e7fe8d3faaeba3b2b00c02  1617859763.txt
f2ad24c093bd3d5b66bf32336c4b6635  1617859765.txt
b0fc11c2a26b1c0102bc3fedf1313a16  1617859766.txt
fa776843d2f362641f722bbf0aac7810  1617859767.txt
6e3e20181a3ec12eb65edf809478d8bb  1617859769.txt
3bf16ce2d1d56f4095eff0f79d159ce7  1617859770.txt
750440f9ca32037818ea245a3eeef56a  1617859772.txt
d4067318d9eadae8b44a6c89f8fc5cae  1617859773.txt
519743dee7ec97c6fac97a6667c038c0  1617859774.txt
7624dd67b31ac4e88165bdc08ed50f6f  1617859776.txt
631a22a375918e9b820f16354375ba7f  1617859777.txt
631a22a375918e9b820f16354375ba7f  1617859779.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859780.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859781.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859783.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859784.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859786.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859787.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859789.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859790.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859791.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859793.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859794.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859796.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859797.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859798.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859800.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859801.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859803.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859804.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859805.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859807.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859809.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859817.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859819.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859821.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859822.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859824.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859825.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859827.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859828.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859829.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859831.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859832.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859834.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859835.txt
dc11bc1eeb2557e7cf51a4274923c0e6  1617859837.txt
df65814f3a030182ed7e8170bb31081c  1617859838.txt
8d495c38d80448c506a118ea24fce1f1  1617859839.txt
73fec27523e47fa5935b17d67d6b516d  1617859841.txt
f016d64ac0fed1e19234efc734c88586  1617859842.txt
a231b8a50511edba869bb895ede668e5  1617859844.txt
139376def2feba09280f6f54fd78d71b  1617859845.txt
c11251e4e24ecfe415e24ca946ea4cef  1617859847.txt
c11251e4e24ecfe415e24ca946ea4cef  1617859848.txt
c11251e4e24ecfe415e24ca946ea4cef  1617859849.txt
c11251e4e24ecfe415e24ca946ea4cef  1617859851.txt
c11251e4e24ecfe415e24ca946ea4cef  1617859852.txt
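
(For reference, a small polling loop along these lines, using the resource name from the diff above and hypothetical file paths, is one way to produce per-second snapshots and checksums like those listed:)

$ while true; do
    kubectl get sqlinstance servicing-change-in-terms -n platform -o yaml > "$(date +%s).txt"
    sleep 1
  done
$ md5sum *.txt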

@maqiuyujoyce
Collaborator

Hi @snuggie12 , sorry to hear that you ran into a similar issue again. Just to clarify, are you observing this SQLInstance getting updated regularly?

It also seems specific to .metadata.managedFields. I'm thinking one specific change isn't agreeing.

Did you observe any value changes of any fields? If so, could you share more details?

I have it set to abandon so my plan is to delete the resource which should remove the managedFields changes?

managedFields reflects all the fields and their managers in the underlying GCP resource, so unless you edited it manually, changes in managedFields should be handled by Config Connector and should not cause any issues. Do you happen to know if the SQL instance is also managed by another resource/application/tool? It's possible that Config Connector is doing self-healing (if the updates happen roughly every 10 minutes) because of out-of-band changes.
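
(As a quick sketch, the managers recorded in managedFields can be inspected with jq, e.g.:)

$ kubectl get sqlinstance <instance-name> -n <namespace> -o json \
    | jq '.metadata.managedFields[] | {manager, operation, time}'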

@snuggie12

@maqiuyujoyce yes, it is updating regularly. You can see the update pattern from the file names (they are epoch timestamps).

The changes are also pasted above. Aside from the two expected fields, it is the time field on two of the managedFields entries that keeps changing.

Nothing else manages these resources; Config Connector is rapidly changing those fields. If I delete the Kubernetes resource with the deletion policy set to abandon, I believe managedFields (or at least the changes it is documenting) will be wiped out and Config Connector should stop making updates.

@eyalzek

eyalzek commented Apr 9, 2021

I'm seeing the same behaviour; in my case, setting cnrm.cloud.google.com/management-conflict-prevention-policy: "none" doesn't resolve the issue.

@maqiuyujoyce
Collaborator

@snuggie12 and @eyalzek Thank you for your confirmation & new data point! Could you provide the following information so that we can try to reproduce?

  • K8s version/GKE version
  • Config Connector version (addon or manual installation)
  • Full YAML configuration for the problematic resources (feel free to remove the sensitive information)
  • Steps to reproduce your issue

If I delete the kubernetes resource with abandon set for my delete policy I believe managed fields (or at least any changes it is documenting,) will be wiped out and config connector should stop making updates.

Yes @snuggie12, you can mark the deletion policy as abandon and delete the K8s resource without impacting the underlying SQL instance. After the SQLInstance resource is deleted from K8s, the underlying instance should stop making changes.
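
(A minimal sketch of that abandon-and-delete flow, with placeholder names:)

# Mark the resource so deleting it abandons, rather than deletes, the Cloud SQL instance
$ kubectl annotate sqlinstance <instance-name> -n <namespace> \
    cnrm.cloud.google.com/deletion-policy=abandon --overwrite

# Delete only the K8s resource; the underlying GCP instance is left untouched
$ kubectl delete sqlinstance <instance-name> -n <namespace>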

@snuggie12

  • GKE v1.16.15-gke.7800
  • Config Connector 1.34.0, manual installation, single controller for the whole cluster
  • I'll list the YAML last. As for steps to reproduce, I'm not really sure, since things are self-service here. Based on my limited understanding, managedFields tracks updates, so I take it that the two entries whose timestamps keep changing correspond to fields that were updated after the resource was initially created.

I went heavy on the redaction and added comments on the two fields involved in the reconciliation update loop.

apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  annotations:
    cnrm.cloud.google.com/deletion-policy: abandon
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
    cnrm.cloud.google.com/observed-secret-versions: '{}'
    cnrm.cloud.google.com/project-id: REDACTED
  creationTimestamp: "2021-01-07T17:30:56Z"
  finalizers:
  - cnrm.cloud.google.com/finalizer
  - cnrm.cloud.google.com/deletion-defender
  generation: 7079440 # obviously a problem
  labels:
    REDACTED/owner: platform
  managedFields:
  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cnrm.cloud.google.com/deletion-policy: {}
          f:cnrm.cloud.google.com/observed-secret-versions: {}
          f:cnrm.cloud.google.com/project-id: {}
        f:finalizers:
          v:"cnrm.cloud.google.com/deletion-defender": {}
          v:"cnrm.cloud.google.com/finalizer": {}
        f:labels:
          f:REDACTED/owner: {}
        f:ownerReferences:
          k:{"uid":"REDACTED"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:databaseVersion: {}
        f:region: {}
        f:settings:
          f:activationPolicy: {}
          f:availabilityType: {}
          f:backupConfiguration:
            f:enabled: {}
            f:location: {}
            f:startTime: {}
          f:diskAutoresize: {}
          f:diskType: {}
          f:ipConfiguration:
            f:ipv4Enabled: {}
            f:privateNetworkRef:
              f:external: {}
          f:locationPreference:
            f:zone: {}
          f:pricingPlan: {}
          f:replicationType: {}
          f:tier: {}
      f:status:
        f:connectionName: {}
        f:firstIpAddress: {}
        f:ipAddress: {}
        f:privateIpAddress: {}
        f:selfLink: {}
        f:serverCaCert:
          f:cert: {}
          f:commonName: {}
          f:createTime: {}
          f:expirationTime: {}
          f:sha1Fingerprint: {}
        f:serviceAccountEmailAddress: {}
    manager: before-first-apply
    operation: Update
  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cnrm.cloud.google.com/management-conflict-prevention-policy: {}
    manager: kubectl-edit
    operation: Update
    time: "2021-04-05T00:54:54Z"
  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
    manager: cnrm-controller-manager
    operation: Update
    time: "2021-04-10T15:50:25Z" # the field getting updated and causing the reconciliation update loop
  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:settings:
          f:backupConfiguration:
            f:pointInTimeRecoveryEnabled: {}
          f:diskSize: {}
    manager: manager
    operation: Update
    time: "2021-04-10T15:50:25Z" # Same here. This setting is what I suspect is maybe in the wrong format?
  name: servicing-change-in-terms
  namespace: platform
  ownerReferences:
  - apiVersion: orchestration.cnrm.cloud.google.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: FutureObject
    name: platform-change-in-terms-sqlinstance
    uid: REDACTED
  resourceVersion: "162860575"
  selfLink: /apis/sql.cnrm.cloud.google.com/v1beta1/namespaces/platform/sqlinstances/servicing-change-in-terms
  uid: REDACTED
spec:
  databaseVersion: POSTGRES_10
  region: us-east4
  settings:
    activationPolicy: ALWAYS
    availabilityType: ZONAL
    backupConfiguration:
      enabled: true
      location: us
      pointInTimeRecoveryEnabled: true
      startTime: "20:00"
    diskAutoresize: true
    diskSize: 20
    diskType: PD_SSD
    ipConfiguration:
      ipv4Enabled: false
      privateNetworkRef:
        external: https://www.googleapis.com/compute/v1/projects/REDACTED/global/networks/REDACTED
    locationPreference:
      zone: us-east4-b
    pricingPlan: PER_USE
    replicationType: SYNCHRONOUS
    tier: db-custom-2-4096
status:
  conditions:
  - lastTransitionTime: "2021-04-05T00:56:32Z"
    message: The resource is up to date
    reason: UpToDate
    status: "True"
    type: Ready
  connectionName: REDACTED:us-east4:servicing-change-in-terms
  firstIpAddress: 10.21.16.52
  ipAddress:
  - ipAddress: 10.21.16.52
    type: PRIVATE
  privateIpAddress: 10.21.16.52
  selfLink: https://sqladmin.googleapis.com/sql/v1beta4/projects/REDACTED/instances/servicing-change-in-terms
  serverCaCert:
    cert: |-
      -----BEGIN CERTIFICATE-----
      REDACTED
      -----END CERTIFICATE-----
    commonName: C=US,O=Google\, Inc,CN=Google Cloud SQL Server CA,dnQualifier=REDACTED
    createTime: "2021-01-07T17:32:09.448Z"
    expirationTime: "2031-01-05T17:33:09.448Z"
    sha1Fingerprint: REDACTED
  serviceAccountEmailAddress: REDACTED@gcp-sa-cloud-sql.iam.gserviceaccount.com

@eyalzek

eyalzek commented Apr 12, 2021

@snuggie12 and @eyalzek Thank you for your confirmation & new data point! Could you provide the following information so that we can try to reproduce?

* K8s version/GKE version

* Config Connector version (addon or manual installation)

* Full YAML configuration for the problematic resources (feel free to remove the sensitive information)

* Steps to reproduce your issue


  • GKE v1.19.7-gke.1500
  • Config Connector 1.41.0 (addon installation)
  • The following resources are dependencies and are created and working as expected:
---
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeAddress
metadata:
  annotations:
    cnrm.cloud.google.com/deletion-policy: abandon
    cnrm.cloud.google.com/project-id: ${GCP_PROJECT}
  name: google-managed-services-default
spec:
  addressType: INTERNAL
  description: IP Range for peer networks.
  location: global
  purpose: VPC_PEERING
  prefixLength: 20
  networkRef:
    external: default
---
apiVersion: servicenetworking.cnrm.cloud.google.com/v1beta1
kind: ServiceNetworkingConnection
metadata:
  annotations:
    cnrm.cloud.google.com/deletion-policy: abandon
    cnrm.cloud.google.com/project-id: ${GCP_PROJECT}
  name: peer-network
spec:
  networkRef:
    external: default
  reservedPeeringRanges:
    - name: google-managed-services-default
  service: servicenetworking.googleapis.com

the problematic resource is the instance itself:

---
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  annotations:
    cnrm.cloud.google.com/project-id: ${GCP_PROJECT}
    cnrm.cloud.google.com/deletion-policy: abandon
    #### the resource is still stuck in updating loop even with this annotation set....
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
  name: development-mysql-master
spec:
  databaseVersion: MYSQL_5_7
  region: europe-west4
  settings:
    tier: db-n1-standard-1
    availabilityType: ZONAL
    ipConfiguration:
      ipv4Enabled: true
      privateNetworkRef:
        external: default
    backupConfiguration:
      binaryLogEnabled: false
      enabled: true
      location: eu
      startTime: 00:00
    maintenanceWindow:
      day: 1
      hour: 2
      updateTrack: canary

One thing to note here is that the instance was already created with Terraform beforehand. After applying the manifest, the instance showed as "Updating" in the GCP console but became ready within a minute. I used gcloud to describe it before and after applying the manifest, and the only difference was in the labels applied:

$ diff /tmp/dev-mysql.yaml /tmp/dev-mysql-2.yaml 
4c4
< etag: cbc7ad26f61e42f4baddad8bd3b87a1602391d358255e6785e15d0cfe2b343b7
---
> etag: 69ee84b689d851abc30f769586a793996a0563db8694f120b0fdf5768e7c1d1c
76c76
<   settingsVersion: '146'
---
>   settingsVersion: '150'
79a80,83
>   userLabels:
>     cnrm-lease-expiration: '1617967055'
>     cnrm-lease-holder-id: bvgug84inp3o783qoq40
>     managed-by-cnrm: 'true'

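(A before/after capture along these lines, assuming gcloud's default YAML output, would produce files like the ones diffed above:)

$ gcloud sql instances describe development-mysql-master > /tmp/dev-mysql.yaml
# ... apply the SQLInstance manifest and wait for it to reconcile ...
$ gcloud sql instances describe development-mysql-master > /tmp/dev-mysql-2.yaml
$ diff /tmp/dev-mysql.yaml /tmp/dev-mysql-2.yaml
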
The first time I applied it, the cnrm.cloud.google.com/management-conflict-prevention-policy: none annotation was not set. After I saw it was stuck updating and found this discussion, I deleted the resource and applied it again with the annotation set. The results were the same.

Here are some logs from the cnrm-system components (which were continuously output as part of the reconcile loop):

cnrm-controller-manager-0 manager {"severity":"info","logger":"sqlinstance-controller","msg":"successfully finished reconcile","resource":{"namespace":"cluster-commons","name":"development-mysql-master"}}
cnrm-controller-manager-0 manager {"severity":"info","logger":"sqlinstance-controller","msg":"starting reconcile","resource":{"namespace":"cluster-commons","name":"development-mysql-master"}}
cnrm-controller-manager-0 manager {"severity":"info","logger":"sqlinstance-controller","msg":"creating/updating underlying resource","resource":{"namespace":"cluster-commons","name":"development-mysql-master"}}
cnrm-webhook-manager-77f958d648-pzql8 webhook {"severity":"info","msg":"processing request","operation":"UPDATE","handler":"immutable fields validation","kind":"SQLInstance","resource":{"namespace":"cluster-commons","name":"development-mysql-master"}}
cnrm-webhook-manager-77f958d648-pzql8 webhook {"severity":"info","msg":"done processing request","operation":"UPDATE","handler":"immutable fields validation","kind":"SQLInstance","resource":{"namespace":"cluster-commons","name":"development-mysql-master"},"result-code":200,"result-reason":"ignore non-user requests"}
cnrm-webhook-manager-77f958d648-njlkb webhook {"severity":"info","msg":"processing request","operation":"UPDATE","handler":"unknown fields validation","kind":"SQLInstance","resource":{"namespace":"cluster-commons","name":"development-mysql-master"}}
cnrm-webhook-manager-77f958d648-njlkb webhook {"severity":"info","msg":"done processing request","operation":"UPDATE","handler":"unknown fields validation","kind":"SQLInstance","resource":{"namespace":"cluster-commons","name":"development-mysql-master"},"result-code":200,"result-reason":"admission controller passed"}
cnrm-webhook-manager-77f958d648-njlkb webhook {"severity":"info","msg":"processing request","operation":"UPDATE","handler":"unknown fields validation","kind":"SQLInstance","resource":{"namespace":"cluster-commons","name":"development-mysql-master"}}
cnrm-webhook-manager-77f958d648-pzql8 webhook {"severity":"info","msg":"processing request","operation":"UPDATE","handler":"immutable fields validation","kind":"SQLInstance","resource":{"namespace":"cluster-commons","name":"development-mysql-master"}}
cnrm-webhook-manager-77f958d648-pzql8 webhook {"severity":"info","msg":"done processing request","operation":"UPDATE","handler":"immutable fields validation","kind":"SQLInstance","resource":{"namespace":"cluster-commons","name":"development-mysql-master"},"result-code":200,"result-reason":"ignore non-user requests"}
cnrm-webhook-manager-77f958d648-njlkb webhook {"severity":"info","msg":"done processing request","operation":"UPDATE","handler":"unknown fields validation","kind":"SQLInstance","resource":{"namespace":"cluster-commons","name":"development-mysql-master"},"result-code":200,"result-reason":"admission controller passed"}

@toumorokoshi
Contributor

Thanks for the output! I'm having trouble reproducing the issue.

Here are the highlights:

  • I cannot reproduce with a brand-new configuration, either creating from scratch or re-acquiring an existing resource.
  • My diff did not show any of the userLabels you posted.
  • Your database may still be managed by a separate instance of Config Connector, as the following labels are only attached when conflict prevention is turned on:
>     cnrm-lease-expiration: '1617967055'
>     cnrm-lease-holder-id: bvgug84inp3o783qoq40
>     managed-by-cnrm: 'true'

So I'm wondering if there are any other instances of KCC managing the resource, which would be applying those labels.
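
(One way to check for the lease labels on the underlying instance, a sketch assuming gcloud access to the project:)

$ gcloud sql instances describe development-mysql-master \
    --format='yaml(settings.userLabels)'

If the cnrm-lease-holder-id / cnrm-lease-expiration labels are present and keep changing, that would suggest some Config Connector installation with conflict prevention enabled is still renewing the lease on the instance.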

Here's the manifest I've used (I believe the private network is irrelevant, as it's a hard-coded reference and so doesn't depend on the status of another resource in the cluster).

---
apiVersion: v1
kind: Namespace
metadata:
  name: gh-294
  annotations:
    "cnrm.cloud.google.com/project-id": {project_id}
---
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  annotations:
    cnrm.cloud.google.com/deletion-policy: abandon
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
  namespace: gh-294
  name: gh-294
spec:
  databaseVersion: MYSQL_5_7
  region: europe-west4
  settings:
    tier: db-n1-standard-1
    availabilityType: ZONAL
    backupConfiguration:
      binaryLogEnabled: false
      enabled: true
      location: eu
      startTime: 00:00
    maintenanceWindow:
      day: 1
      hour: 2
      updateTrack: canary
---

With this configuration, I arrived at an up-to-date state that stayed that way for at least 40 minutes, at which point I stopped checking:

, detail:
  Warning  UpdateFailed  13m (x12 over 13m)   sqlinstance-controller  Update call failed: error applying desired state: summary: Error, failed to create instance because the network doesn't have at least 1 private services connection. Please see https://cloud.google.com/sql/docs/mysql/private-ip#network_requirements for how to create this connection., detail:
  Normal   Updating      8m7s (x16 over 13m)  sqlinstance-controller  Update in progress
  Normal   UpToDate      96s                  sqlinstance-controller  The resource is up to date

38 minutes later...

Reason: insufficientPermissions, Message: Insufficient Permission
, detail:
  Warning  UpdateFailed  50m (x12 over 50m)  sqlinstance-controller  Update call failed: error applying desired state: summary: Error, failed to create instance because the network doesn't have at least 1 private services connection. Please see https://cloud.google.com/sql/docs/mysql/private-ip#network_requirements for how to create this connection., detail:
  Normal   Updating      45m (x16 over 50m)  sqlinstance-controller  Update in progress
  Normal   UpToDate      38m                 sqlinstance-controller  The resource is up to date

Note that the number of Updating and UpdateFailed events did not increase. This was a new instance that didn't exist previously.

@snuggie12

snuggie12 commented Apr 16, 2021

@eyalzek Looks like you supplied the YAML that you apply, but not the actual resource on the cluster. Could you provide .metadata.generation and .metadata.managedFields from the live resource?

@toumorokoshi I don't think a fresh installation is going to show any issues. I deleted one of my instances (with abandon on), and because managedFields got cleared out, I don't see the issue anymore. I'm going to try this soon on the one I provided above and hopefully see similar results.

I think specifically what is wrong is this entry:

  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
    manager: cnrm-controller-manager
    operation: Update
    time: "2021-04-10T15:50:25Z"

I'm presuming that, just like with normal infinite reconciliation loops, .status should not be updated as a managed field.

To fully re-create, could you:

  • keep your YAML just as it is, except remove conflict prevention
  • after about 10 minutes, make the same two changes my developer made, separately: they added the conflict-prevention annotation and they set up point-in-time recovery (a rough sketch of the second change follows this list)
  • after those two changes, do you have 3 or 4 managedFields entries? If you have just the 3 (the original, the conflict prevention, and the backups), then I think it won't constantly update. If you have 4, where the 4th is your .status, then I think you'll have tons of Updating events.
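
(For the point-in-time-recovery change, a hypothetical sketch against the Postgres instance posted earlier in the thread; the conflict-prevention change is the same kubectl annotate command shown further up:)

$ kubectl patch sqlinstance servicing-change-in-terms -n platform --type=merge \
    -p '{"spec":{"settings":{"backupConfiguration":{"pointInTimeRecoveryEnabled":true}}}}'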

@snuggie12

After reading up on managedFields I actually found the issue: I have two different controllers trying to take control of two fields.

I spotted this via:

watch "kubectl get sqlinstance --context prod -n platform servicing-change-in-terms -o json | jq '.metadata.managedFields[].manager'"

For the two entries that keep changing, the manager keeps swapping between the two controllers.

This eventually led me to the fact that the non-KCC controller was trying to set diskSize to 20, but GCP and the SQLInstance resource itself said 25. Once I fixed the owning resource to match reality, the updates stopped happening.
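
(A sketch of how the two views of diskSize could be compared; the gcloud projection assumes the Cloud SQL Admin API field name settings.dataDiskSizeGb:)

# What the SQLInstance resource in the cluster says
$ kubectl get sqlinstance servicing-change-in-terms -n platform \
    -o jsonpath='{.spec.settings.diskSize}'

# What the underlying Cloud SQL instance reports
$ gcloud sql instances describe servicing-change-in-terms \
    --format='value(settings.dataDiskSizeGb)'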

@eyalzek

eyalzek commented Apr 16, 2021

I just tried creating it again and monitoring the metadata, but the resource does not have a managedFields section within the metadata. However, for some reason now the instance is Ready and not going into the update loop....

$ k get sqlinstances.sql.cnrm.cloud.google.com 
NAME                       AGE     READY   STATUS     STATUS AGE
development-mysql-master   2m13s   True    UpToDate   101s

We had a cluster upgrade to v1.19.8-gke.20001 a couple of days ago, and it's possible that Config Connector was updated as well. I'll test next week on one of our clusters on the normal release channel that is still running 1.18 and see if it works as expected there.

@toumorokoshi
Contributor

@snuggie12 great! thanks for spelunking. Would it be fair to say your issue is fixed then?

@eyalzek sounds good! keep us posted. If it does happen, please continue to report the GKE master version and Config Connector version so we can try to repro.

@snuggie12

Yes, I'm good. Thanks for your help

@toumorokoshi
Contributor

@eyalzek I'm going to close the issue for now since things seem like they're working ok. Ping me on this thread if it's still not fixed and I'll re-open the issue.

@jcanseco
Member

jcanseco commented Nov 3, 2021

FYI it seems that another way a SQLInstance might find itself repeatedly updating is if spec.settings.diskType is set to pd-ssd or pd-hdd instead of PD_SSD or PD_HDD.

This seems to be due to a bug in Terraform: hashicorp/terraform-provider-google#10492

Until the issue is fixed, we recommend using PD_SSD or PD_HDD instead, which are the values mentioned in the field's description anyway.
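
(A quick way to spot affected resources, a sketch using jq over all SQLInstances in the cluster:)

$ kubectl get sqlinstances --all-namespaces -o json \
    | jq -r '.items[] | select(.spec.settings.diskType != null)
             | "\(.metadata.namespace)/\(.metadata.name): \(.spec.settings.diskType)"'

Anything reporting pd-ssd or pd-hdd (lowercase) should be switched to PD_SSD or PD_HDD.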
