
CRD deletion is stuck with no reason #116

Closed

Jainbrt opened this issue Feb 27, 2020 · 8 comments

@Jainbrt (Member) commented Feb 27, 2020

Describe the bug
CRD deletion is stuck with no reason

To Reproduce
Downloaded the new ibm-spectrum-scale-csi-operator-dev.yaml file from John's check-in:

[root@kubespray1 ~]# wget https://raw.githubusercontent.com/IBM/ibm-spectrum-scale-csi/1ce3f67ce1efaa2d05408c8f845526686119f321/generated/installer/ibm-spectrum-scale-csi-operator-dev.yaml

Created the namespace for the driver installation:

[root@kubespray1 ~]# oc apply -f  k8s/namespace.yaml
namespace/ibm-spectrum-scale-csi-driver created

Applied the operator deployment YAML file:

[root@kubespray1 ~]# oc apply -f ibm-spectrum-scale-csi-operator-dev.yaml
deployment.apps/ibm-spectrum-scale-csi-operator created
clusterrole.rbac.authorization.k8s.io/ibm-spectrum-scale-csi-operator created
clusterrolebinding.rbac.authorization.k8s.io/ibm-spectrum-scale-csi-operator created
serviceaccount/ibm-spectrum-scale-csi-operator created
customresourcedefinition.apiextensions.k8s.io/csiscaleoperators.csi.ibm.com created

Verified the CRD was created and the deployment is running:

[root@kubespray1 ~]# oc get crds|grep csi
csiscaleoperators.csi.ibm.com                    2020-02-25T12:18:49Z
[root@kubespray1 ~]# oc get pods -n ibm-spectrum-scale-csi-driver
NAME                                               READY   STATUS    RESTARTS   AGE
ibm-spectrum-scale-csi-operator-5f9f6c797c-fhnwc   2/2     Running   0          32s

Tried to delete the CRD, but the command hung with no response:
[root@kubespray1 k8s]# kubectl delete crd csiscaleoperators.csi.ibm.com
customresourcedefinition.apiextensions.k8s.io "csiscaleoperators.csi.ibm.com" deleted


^C
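(Side note, not part of the original reproduction: kubectl delete waits by default until the object is fully removed, so a blocked finalizer makes the command hang. Passing --wait=false returns immediately, although the underlying deletion stays pending.)

# Returns immediately; the CRD still remains in Terminating until all instances are gone
kubectl delete crd csiscaleoperators.csi.ibm.com --wait=false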

[root@kubespray1 ~]# oc describe  crds csiscaleoperators.csi.ibm.com
.
.
Status:
  Accepted Names:
    Kind:       CSIScaleOperator
    List Kind:  CSIScaleOperatorList
    Plural:     csiscaleoperators
    Singular:   csiscaleoperator
  Conditions:
    Last Transition Time:  2020-02-27T06:13:17Z
    Message:               no conflicts found
    Reason:                NoConflicts
    Status:                True
    Type:                  NamesAccepted
    Last Transition Time:  <nil>
    Message:               the initial names have been accepted
    Reason:                InitialNamesAccepted
    Status:                True
    Type:                  Established
    Last Transition Time:  2020-02-27T07:39:38Z
    Message:               CustomResource deletion is in progress
    Reason:                InstanceDeletionInProgress
    Status:                True
    Type:                  Terminating
  Stored Versions:
    v1
Events:  <none>

[root@kubespray1 ~]# date --utc
Tue Feb 25 12:28:07 UTC 2020

<<< Waited for almost 8 minutes >>>

Expected behavior
The CRD should be deleted promptly.

Environment
NAME         STATUS   ROLES    AGE    VERSION
kubespray1   Ready    master   246d   v1.14.3
kubespray2   Ready    <none>   246d   v1.14.3
kubespray3   Ready    <none>   246d   v1.14.3

Additional context
I am not sure which logs are required to debug this case.
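A reasonable starting point (a sketch, not from the original report; the deployment name matches the pod listing above) would be the operator's own logs:

# Fetch logs from all containers of the operator deployment
kubectl logs -n ibm-spectrum-scale-csi-driver deploy/ibm-spectrum-scale-csi-operator --all-containers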

@Jainbrt added the Component: Automation, Phase: Test, Severity: 2, Type: Bug, and Target: Operator labels on Feb 27, 2020
@Jainbrt added this to the 1.1.0 milestone on Feb 27, 2020
@Jainbrt (Member, Author) commented Feb 27, 2020

Later state of the CRD:

[root@kubespray1 ~]# kubectl describe crd csiscaleoperators.csi.ibm.com
.
.
Status:
  Accepted Names:
    Kind:       CSIScaleOperator
    List Kind:  CSIScaleOperatorList
    Plural:     csiscaleoperators
    Singular:   csiscaleoperator
  Conditions:
    Last Transition Time:  2020-02-27T06:13:17Z
    Message:               no conflicts found
    Reason:                NoConflicts
    Status:                True
    Type:                  NamesAccepted
    Last Transition Time:  <nil>
    Message:               the initial names have been accepted
    Reason:                InitialNamesAccepted
    Status:                True
    Type:                  Established
    Last Transition Time:  2020-02-27T07:39:38Z
    Message:               could not confirm zero CustomResources remaining: timed out waiting for the condition
    Reason:                InstanceDeletionCheck
    Status:                True
    Type:                  Terminating
  Stored Versions:
    v1
Events:  <none>

@mew2057 (Contributor) commented Feb 27, 2020

You should delete the deployment first, not the CRD. This is something we hit in the scorecard: operator-framework/operator-sdk#2094.

Basically, deleting the CRD first causes a ton of problems when removing an operator that uses a finalizer (which we need for account bindings and pod management). If an instance is already stuck, see the sketch after the command list below.

The way I remove everything is as follows:

kubectl delete csiscaleoperators --all -n <namespace>
kubectl delete -f deploy/operator.yaml
kubectl delete -f deploy/role.yaml
kubectl delete -f deploy/service_account.yaml
kubectl delete -f deploy/role_binding.yaml
kubectl delete -f deploy/crds/csiscaleoperators.csi.ibm.com.crd.yaml
kubectl delete -f deploy/namespace.yaml
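If the CRD deletion still hangs after this sequence, it usually means an instance with a finalizer was left behind. A quick check (a sketch, assuming the resource names used above):

# List any remaining custom resources that would block the CRD deletion
kubectl get csiscaleoperators --all-namespaces
# Show the CRD's Terminating condition message for the reason it is stuck
kubectl get crd csiscaleoperators.csi.ibm.com -o jsonpath='{.status.conditions[?(@.type=="Terminating")].message}'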

@Jainbrt (Member, Author) commented Feb 27, 2020

Based on more debugging, we found that there was one csiscaleoperators instance stuck with the error below.

kubectl get csiscaleoperators -n abhishek ibm-spectrum-scale-csi -o yaml
apiVersion: csi.ibm.com/v1
kind: CSIScaleOperator
metadata:
  creationTimestamp: "2020-02-25T11:09:59Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2020-02-25T11:11:00Z"
  finalizers:
  - finalizer.csiscaleoperators.csi.ibm.com
  generation: 3
  labels:
    app.kubernetes.io/instance: ibm-spectrum-scale-csi-operator
    app.kubernetes.io/managed-by: ibm-spectrum-scale-csi-operator
    app.kubernetes.io/name: ibm-spectrum-scale-csi-operator
  name: ibm-spectrum-scale-csi
  namespace: abhishek
  resourceVersion: "38222291"
  selfLink: /apis/csi.ibm.com/v1/namespaces/abhishek/csiscaleoperators/ibm-spectrum-scale-csi
  uid: 5cfdcbe5-57bf-11ea-8af7-525400af16b7
spec:
  clusters:
  - id: "709351379920172263"
    primary:
      primaryFs: gpfs0
      primaryFset: csifset
    restApi:
    - guiHost: kubespray4
    secrets: guisecret
    secureSslMode: false
  scaleHostpath: /ibm/gpfs0
  trigger: "1"
status:
  conditions:
  - ansibleResult:
      changed: 0
      completion: 2020-02-25T11:10:43.688645
      failures: 1
      ok: 19
      skipped: 8
    lastTransitionTime: "2020-02-25T11:10:44Z"
    message: 'Failed to retrieve requested object: b''{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"clusterroles.rbac.authorization.k8s.io
      \"ibm-spectrum-scale-csi-attacher\" is forbidden: User \"system:serviceaccount:abhishek:ibm-spectrum-scale-csi-operator\"
      cannot get resource \"clusterroles\" in API group \"rbac.authorization.k8s.io\"
      at the cluster scope","reason":"Forbidden","details":{"name":"ibm-spectrum-scale-csi-attacher","group":"rbac.authorization.k8s.io","kind":"clusterroles"},"code":403}\n'''
    reason: Failed
    status: "False"
    type: Failure
  - lastTransitionTime: "2020-02-25T11:11:00Z"
    message: Running reconciliation
    reason: Running
    status: "True"
    type: Running
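The Failure condition above is an RBAC error. One way to confirm whether the operator's service account lacks that permission (a sketch, not part of the original debugging session):

# Ask the API server whether the service account can read cluster roles
kubectl auth can-i get clusterroles --as=system:serviceaccount:abhishek:ibm-spectrum-scale-csi-operator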

@mew2057 (Contributor) commented Mar 6, 2020

This was an issue with the environment; I forgot to copy my conversations over.

@mew2057 (Contributor) commented Mar 6, 2020

The namespace was forcibly deleted:

[root@kubespray1 ~]# kubectl edit -n abhishek CSIScaleOperator
error: csiscaleoperators.csi.ibm.com "ibm-spectrum-scale-csi" could not be found on the server
The edits you made on deleted resources have been saved to "/tmp/kubectl-edit-k1qxe.yaml"

This was the workaround:

kubectl create namespace abhishek
# Delete the finalizer
kubectl edit -n abhishek csiscaleoperator ibm-spectrum-scale-csi
kubectl delete namespace abhishek
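The same finalizer removal can also be done non-interactively with kubectl patch (a sketch using the resource and namespace names from the example above):

# Clear all finalizers on the stuck instance so deletion can complete
kubectl patch csiscaleoperator ibm-spectrum-scale-csi -n abhishek --type merge -p '{"metadata":{"finalizers":[]}}'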

This was likely a case of user automation leaving behind artifacts that weren't properly cleaned up.

@mew2057 added the Severity: 4 and Type: Working As Designed labels and removed the Severity: 2 label on Mar 6, 2020
@Jainbrt (Member, Author) commented Mar 11, 2020

Thanks, John, for your help in debugging the issue. I will keep this issue open in case it occurs again.

@mew2057 added the Customer Impact: Localized low impact (2) and Customer Probability: Can't Happen (0) labels on Mar 12, 2020
@Jainbrt removed this from the 1.1.0 milestone on Mar 20, 2020
@aspalazz added this to Open defects in CSI 2.0.0 on Mar 26, 2020
@smitaraut (Member) commented
Did we have any more recreates of this?

@Jainbrt (Member, Author) commented May 21, 2020

We can close this issue for now.

@Jainbrt closed this as completed on May 21, 2020
CSI 2.0.0 automation moved this from Needs assignment to Done May 21, 2020