Added note to check the method of boot for each NCN type if desired.
SeanWallace committed Oct 12, 2021
1 parent 69b4d10 commit e29c24a
Showing 1 changed file with 54 additions and 33 deletions.
87 changes: 54 additions & 33 deletions operations/node_management/Reboot_NCNs.md
@@ -183,16 +183,23 @@ Before rebooting NCNs:

Ensure the power is reporting as on. This may take 5-10 seconds to update.
4. Watch on the console until the node has successfully booted and the login prompt is reached.

5. If desired, verify that the boot method is as expected. If `/proc/cmdline` begins with `BOOT_IMAGE`, then this NCN booted from disk:

```bash
ncn# egrep -o '^(BOOT_IMAGE.+/kernel)' /proc/cmdline
BOOT_IMAGE=(mduuid/a3899572a56f5fd88a0dec0e89fc12b4)/boot/grub2/../kernel
```
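
If the node booted over the network instead, `/proc/cmdline` will not begin with `BOOT_IMAGE`. A minimal sketch that reports either case (the exact contents of a network-boot command line vary by system and are not assumed here):

```bash
ncn# if grep -q '^BOOT_IMAGE' /proc/cmdline; then echo "Booted from disk"; else echo "Booted over the network"; fi
```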

6. Retrieve the `XNAME` for the node being rebooted.

This xname is available on the node being rebooted in the following file:

```bash
ncn# ssh NODE cat /etc/cray/xname
```
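
The file contains the node's component name (xname). A hypothetical example, shown only to illustrate the format (the actual node name and xname depend on the system's hardware topology):

```bash
ncn# ssh ncn-s002 cat /etc/cray/xname
x3000c0s13b0n0
```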

7. Confirm what the Configuration Framework Service (CFS) configurationStatus is for the desiredConfig after rebooting the node.

The following command will indicate if a CFS job is currently in progress for this node. Replace the `XNAME` value in the following command with the xname of the node being rebooted.

@@ -211,30 +218,30 @@ Before rebooting NCNs:
If `configurationStatus` is `failed`, see [Troubleshoot Ansible Play Failures in CFS Sessions](../configuration_management/Troubleshoot_Ansible_Play_Failures_in_CFS_Sessions.md) for how to analyze the pod logs from cray-cfs to determine why the configuration may not have completed.
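
The command referenced above is in the collapsed portion of the hunk. For reference, a typical form of the check, assuming the `cray` CLI is installed and authenticated (the `jq` filter is only an optional convenience):

```bash
ncn# cray cfs components describe XNAME --format json | jq -r '.desiredConfig, .configurationStatus'
```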
8. Run the platform health checks from the [Validate CSM Health](../validate_csm_health.md) procedure.
**Troubleshooting:** If the slurmctld and slurmdbd pods do not start after powering back up the node, check for the following error:
```bash
ncn-m001# kubectl describe pod -n user -lapp=slurmctld
Warning FailedCreatePodSandBox 27m kubelet, ncn-w001 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "82c575cc978db00643b1bf84a4773c064c08dcb93dbd9741ba2e581bc7c5d545": Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to allocate for range 0: no IP addresses available in range set: 10.252.2.4-10.252.2.4
```
```bash
ncn-m001# kubectl describe pod -n user -lapp=slurmdbd
Warning FailedCreatePodSandBox 29m kubelet, ncn-w001 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "314ca4285d0706ec3d76a9e953e412d4b0712da4d0cb8138162b53d807d07491": Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to allocate for range 0: no IP addresses available in range set: 10.252.2.4-10.252.2.4
```
Remove the following files on every worker node to resolve the failure:
- /var/lib/cni/networks/macvlan-slurmctld-nmn-conf
- /var/lib/cni/networks/macvlan-slurmdbd-nmn-conf
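
Because the files must be removed on every worker node, the cleanup can be scripted from a master node. A minimal sketch, assuming workers named `ncn-w001` through `ncn-w003` (adjust the list to match the actual system):

```bash
ncn-m001# for ncn in ncn-w001 ncn-w002 ncn-w003; do ssh "$ncn" rm -rf /var/lib/cni/networks/macvlan-slurmctld-nmn-conf /var/lib/cni/networks/macvlan-slurmdbd-nmn-conf; done
```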
9. Disconnect from the console.
10. Repeat all of the sub-steps above for the remaining storage nodes, going from the highest to lowest number until all storage nodes have successfully rebooted.
**Important:** Ensure `ceph -s` shows that Ceph is healthy BEFORE MOVING ON to reboot the next storage node. Once Ceph has recovered the downed mon, it may take several minutes for Ceph to resolve clock skew.
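
One way to confirm recovery before proceeding is to poll the cluster health from a node with Ceph admin access (typically a storage or master node) until it reports `HEALTH_OK`; a simple sketch:

```bash
ncn# until ceph health | grep -q HEALTH_OK; do echo "Waiting for Ceph to recover..."; sleep 30; done
```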
#### NCN Worker Nodes
@@ -309,16 +316,23 @@ Before rebooting NCNs:
Ensure the power is reporting as on. This may take 5-10 seconds to update.
6. Watch on the console until the node has successfully booted and the login prompt is reached.
7. If desired, verify that the boot method is as expected. If `/proc/cmdline` begins with `BOOT_IMAGE`, then this NCN booted from disk:
```bash
ncn# egrep -o '^(BOOT_IMAGE.+/kernel)' /proc/cmdline
BOOT_IMAGE=(mduuid/a3899572a56f5fd88a0dec0e89fc12b4)/boot/grub2/../kernel
```
8. Retrieve the `XNAME` for the node being rebooted.
This xname is available on the node being rebooted in the following file:
```bash
ncn# ssh NODE cat /etc/cray/xname
```
9. Confirm what the Configuration Framework Service (CFS) configurationStatus is for the desiredConfig after rebooting the node.
The following command will indicate if a CFS job is currently in progress for this node. Replace the `XNAME` value in the following command with the xname of the node being rebooted.
@@ -337,29 +351,29 @@ Before rebooting NCNs:
If `configurationStatus` is `failed`, see [Troubleshoot Ansible Play Failures in CFS Sessions](../configuration_management/Troubleshoot_Ansible_Play_Failures_in_CFS_Sessions.md) for how to analyze the pod logs from cray-cfs to determine why the configuration may not have completed.
10. Uncordon the node
```bash
ncn-m# kubectl uncordon <node you just rebooted>
```
11. Verify pods are running on the rebooted node.
Within a minute or two, the following command should begin to show pods in a `Running` state (replace `<node to be rebooted>` in the command below with the name of the worker node):
```bash
ncn-m# kubectl get pods -o wide -A | grep <node to be rebooted>
```
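
To narrow the output to only the pods scheduled on the rebooted node, a field selector can be used instead of `grep`; a sketch, with `<node>` standing in for the worker's name. Any lines remaining besides the header indicate pods that are still starting:

```bash
ncn-m# kubectl get pods -A -o wide --field-selector spec.nodeName=<node> | grep -v Running
```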
12. Run the platform health checks from the [Validate CSM Health](../validate_csm_health.md) procedure.
Verify that the `Check the Health of the Etcd Clusters in the Services Namespace` check from the ncnHealthChecks.sh script returns a healthy report for all members of each etcd cluster.
If terminating pods are reported when checking the status of the Kubernetes pods, wait for all pods to recover before proceeding.
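
A quick way to check for lingering terminating pods across all namespaces (no output means none remain):

```bash
ncn-m# kubectl get pods -A | grep Terminating
```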
13. Disconnect from the console.
14. Repeat all of the sub-steps above for the remaining worker nodes, going from the highest to lowest number until all worker nodes have successfully rebooted.
1. Ensure that BGP sessions are reset so that all BGP peering sessions with the spine switches are in an ESTABLISHED state.
@@ -404,16 +418,23 @@ Before rebooting NCNs:
Ensure the power is reporting as on. This may take 5-10 seconds to update.
4. Watch on the console until the node has successfully booted and the login prompt is reached.
5. If desired, verify that the boot method is as expected. If `/proc/cmdline` begins with `BOOT_IMAGE`, then this NCN booted from disk:
```bash
ncn# egrep -o '^(BOOT_IMAGE.+/kernel)' /proc/cmdline
BOOT_IMAGE=(mduuid/a3899572a56f5fd88a0dec0e89fc12b4)/boot/grub2/../kernel
```
6. Retrieve the `XNAME` for the node being rebooted.
This xname is available on the node being rebooted in the following file:
```bash
ncn# ssh NODE cat /etc/cray/xname
```
7. Confirm what the Configuration Framework Service (CFS) configurationStatus is for the desiredConfig after rebooting the node.
The following command will indicate if a CFS job is currently in progress for this node. Replace the `XNAME` value in the following command with the xname of the node being rebooted.
@@ -432,11 +453,11 @@ Before rebooting NCNs:
If `configurationStatus` is `failed`, see [Troubleshoot Ansible Play Failures in CFS Sessions](../configuration_management/Troubleshoot_Ansible_Play_Failures_in_CFS_Sessions.md) for how to analyze the pod logs from cray-cfs to determine why the configuration may not have completed.
8. Run the platform health checks in [Validate CSM Health](../validate_csm_health.md).
9. Disconnect from the console.
10. Repeat all of the sub-steps above for the remaining master nodes \(excluding `ncn-m001`\), going from the highest to lowest number until all master nodes have successfully rebooted.
2. Reboot `ncn-m001`.
