Added note to check the method of boot for each NCN type if desired.
SeanWallace committed Oct 12, 2021
1 parent 69b4d10 commit e29c24a
Showing 1 changed file with 54 additions and 33 deletions.
87 changes: 54 additions & 33 deletions operations/node_management/Reboot_NCNs.md
@@ -183,16 +183,23 @@ Before rebooting NCNs:

Ensure the power is reporting as on. This may take 5-10 seconds to update.
4. Watch on the console until the node has successfully booted and the login prompt is reached.

5. If desired, verify that the boot method is as expected. If `/proc/cmdline` begins with `BOOT_IMAGE`, then this NCN booted from disk:

```bash
ncn# egrep -o '^(BOOT_IMAGE.+/kernel)' /proc/cmdline
BOOT_IMAGE=(mduuid/a3899572a56f5fd88a0dec0e89fc12b4)/boot/grub2/../kernel
```
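
If the node booted over the network instead, `/proc/cmdline` will not begin with `BOOT_IMAGE`. A minimal sketch that reports either case (the exact contents of a network-boot command line vary by system and are not assumed here):

```bash
ncn# if grep -q '^BOOT_IMAGE' /proc/cmdline; then echo "Booted from disk"; else echo "Booted over the network"; fi
```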

6. Retrieve the `XNAME` for the node being rebooted.

This xname is available on the node being rebooted in the following file:

```bash
ncn# ssh NODE cat /etc/cray/xname
```
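
The file contains the node's component name (xname). A hypothetical example, shown only to illustrate the format (the actual node name and xname depend on the system's hardware topology):

```bash
ncn# ssh ncn-s002 cat /etc/cray/xname
x3000c0s13b0n0
```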

7. Confirm what the Configuration Framework Service (CFS) configurationStatus is for the desiredConfig after rebooting the node.

The following command will indicate if a CFS job is currently in progress for this node. Replace the `XNAME` value in the following command with the xname of the node being rebooted.

@@ -211,30 +218,30 @@ Before rebooting NCNs:
If `configurationStatus` is `failed`, see [Troubleshoot Ansible Play Failures in CFS Sessions](../configuration_management/Troubleshoot_Ansible_Play_Failures_in_CFS_Sessions.md) for how to analyze the pod logs from cray-cfs to determine why the configuration may not have completed.
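
The command referenced above is in the collapsed portion of the hunk. For reference, a typical form of the check, assuming the `cray` CLI is installed and authenticated (the `jq` filter is only an optional convenience):

```bash
ncn# cray cfs components describe XNAME --format json | jq -r '.desiredConfig, .configurationStatus'
```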
8. Run the platform health checks from the [Validate CSM Health](../validate_csm_health.md) procedure.
**Troubleshooting:** If the slurmctld and slurmdbd pods do not start after powering back up the node, check for the following error:
```bash
ncn-m001# kubectl describe pod -n user -lapp=slurmctld
Warning FailedCreatePodSandBox 27m kubelet, ncn-w001 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "82c575cc978db00643b1bf84a4773c064c08dcb93dbd9741ba2e581bc7c5d545": Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to allocate for range 0: no IP addresses available in range set: 10.252.2.4-10.252.2.4
```
```bash
ncn-m001# kubectl describe pod -n user -lapp=slurmdbd
Warning FailedCreatePodSandBox 29m kubelet, ncn-w001 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "314ca4285d0706ec3d76a9e953e412d4b0712da4d0cb8138162b53d807d07491": Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to allocate for range 0: no IP addresses available in range set: 10.252.2.4-10.252.2.4
```
Remove the following files on every worker node to resolve the failure:
- /var/lib/cni/networks/macvlan-slurmctld-nmn-conf
- /var/lib/cni/networks/macvlan-slurmdbd-nmn-conf
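
Because the files must be removed on every worker node, the cleanup can be scripted from a master node. A minimal sketch, assuming workers named `ncn-w001` through `ncn-w003` (adjust the list to match the actual system):

```bash
ncn-m001# for ncn in ncn-w001 ncn-w002 ncn-w003; do ssh "$ncn" rm -rf /var/lib/cni/networks/macvlan-slurmctld-nmn-conf /var/lib/cni/networks/macvlan-slurmdbd-nmn-conf; done
```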
9. Disconnect from the console.
10. Repeat all of the sub-steps above for the remaining storage nodes, going from the highest to lowest number until all storage nodes have successfully rebooted.
**Important:** Ensure `ceph -s` shows that Ceph is healthy BEFORE MOVING ON to reboot the next storage node. Once Ceph has recovered the downed mon, it may take several minutes for Ceph to resolve clock skew.
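
One way to confirm recovery before proceeding is to poll the cluster health from a node with Ceph admin access (typically a storage or master node) until it reports `HEALTH_OK`; a simple sketch:

```bash
ncn# until ceph health | grep -q HEALTH_OK; do echo "Waiting for Ceph to recover..."; sleep 30; done
```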
#### NCN Worker Nodes
@@ -309,16 +316,23 @@ Before rebooting NCNs:
Ensure the power is reporting as on. This may take 5-10 seconds to update.
6. Watch on the console until the node has successfully booted and the login prompt is reached.
7. If desired, verify that the boot method is as expected. If `/proc/cmdline` begins with `BOOT_IMAGE`, then this NCN booted from disk:
```bash
ncn# egrep -o '^(BOOT_IMAGE.+/kernel)' /proc/cmdline
BOOT_IMAGE=(mduuid/a3899572a56f5fd88a0dec0e89fc12b4)/boot/grub2/../kernel
```
8. Retrieve the `XNAME` for the node being rebooted.
This xname is available on the node being rebooted in the following file:
```bash
ncn# ssh NODE cat /etc/cray/xname
```
9. Confirm what the Configuration Framework Service (CFS) configurationStatus is for the desiredConfig after rebooting the node.
The following command will indicate if a CFS job is currently in progress for this node. Replace the `XNAME` value in the following command with the xname of the node being rebooted.
@@ -337,29 +351,29 @@ Before rebooting NCNs:
If `configurationStatus` is `failed`, see [Troubleshoot Ansible Play Failures in CFS Sessions](../configuration_management/Troubleshoot_Ansible_Play_Failures_in_CFS_Sessions.md) for how to analyze the pod logs from cray-cfs to determine why the configuration may not have completed.
10. Uncordon the node
```bash
ncn-m# kubectl uncordon <node you just rebooted>
```
11. Verify pods are running on the rebooted node.
Within a minute or two, the following command should begin to show pods in a `Running` state (replace `<node to be rebooted>` in the command below with the name of the worker node):
```bash
ncn-m# kubectl get pods -o wide -A | grep <node to be rebooted>
```
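
To narrow the output to only the pods scheduled on the rebooted node, a field selector can be used instead of `grep`; a sketch, with `<node>` standing in for the worker's name. Any lines remaining besides the header indicate pods that are still starting:

```bash
ncn-m# kubectl get pods -A -o wide --field-selector spec.nodeName=<node> | grep -v Running
```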
12. Run the platform health checks from the [Validate CSM Health](../validate_csm_health.md) procedure.
Verify that the `Check the Health of the Etcd Clusters in the Services Namespace` check from the ncnHealthChecks.sh script returns a healthy report for all members of each etcd cluster.
If terminating pods are reported when checking the status of the Kubernetes pods, wait for all pods to recover before proceeding.
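
A quick way to check for lingering terminating pods across all namespaces (no output means none remain):

```bash
ncn-m# kubectl get pods -A | grep Terminating
```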
13. Disconnect from the console.
14. Repeat all of the sub-steps above for the remaining worker nodes, going from the highest to lowest number until all worker nodes have successfully rebooted.
1. Ensure that BGP sessions are reset so that all BGP peering sessions with the spine switches are in an ESTABLISHED state.
@@ -404,16 +418,23 @@ Before rebooting NCNs:
Ensure the power is reporting as on. This may take 5-10 seconds to update.
4. Watch on the console until the node has successfully booted and the login prompt is reached.
5. If desired, verify that the boot method is as expected. If `/proc/cmdline` begins with `BOOT_IMAGE`, then this NCN booted from disk:
```bash
ncn# egrep -o '^(BOOT_IMAGE.+/kernel)' /proc/cmdline
BOOT_IMAGE=(mduuid/a3899572a56f5fd88a0dec0e89fc12b4)/boot/grub2/../kernel
```
6. Retrieve the `XNAME` for the node being rebooted.
This xname is available on the node being rebooted in the following file:
```bash
ncn# ssh NODE cat /etc/cray/xname
```
7. Confirm what the Configuration Framework Service (CFS) configurationStatus is for the desiredConfig after rebooting the node.
The following command will indicate if a CFS job is currently in progress for this node. Replace the `XNAME` value in the following command with the xname of the node being rebooted.
@@ -432,11 +453,11 @@ Before rebooting NCNs:
If `configurationStatus` is `failed`, see [Troubleshoot Ansible Play Failures in CFS Sessions](../configuration_management/Troubleshoot_Ansible_Play_Failures_in_CFS_Sessions.md) for how to analyze the pod logs from cray-cfs to determine why the configuration may not have completed.
8. Run the platform health checks in [Validate CSM Health](../validate_csm_health.md).
9. Disconnect from the console.
10. Repeat all of the sub-steps above for the remaining master nodes \(excluding `ncn-m001`\), going from the highest to lowest number until all master nodes have successfully rebooted.
2. Reboot `ncn-m001`.
