
Put CASMINST-2105 WAR back and added notes to set and check method of boot if desired #275

Merged
merged 6 commits on Oct 13, 2021
docs-csm.spec (2 changes: 1 addition & 1 deletion)
@@ -21,7 +21,7 @@ is in Markdown format starting at /usr/share/doc/csm/README.md.

%install
install -m 755 -d %{buildroot}/usr/share/doc/csm
cp -pvrR ./*.md ./background ./install ./img ./introduction ./operations ./scripts ./troubleshooting ./update_product_stream ./upgrade ./*example* %{buildroot}/usr/share/doc/csm/ | awk '{print $3}' | sed "s/'//g" | sed "s|$RPM_BUILD_ROOT||g" | tee -a INSTALLED_FILES
cat INSTALLED_FILES | xargs -i sh -c 'test -L {} && exit || test -f $RPM_BUILD_ROOT/{} && echo {} || echo %dir {}' > INSTALLED_FILES_2

%clean
operations/node_management/Reboot_NCNs.md (157 changes: 96 additions & 61 deletions)
@@ -147,13 +147,15 @@ Before rebooting NCNs:

#### Utility Storage Nodes (Ceph)

1. Reboot each of the storage nodes (one at a time).

1. Establish a console session to each storage node.

Use the [Establish a Serial Connection to NCNs](../conman/Establish_a_Serial_Connection_to_NCNs.md) procedure referenced in step 4.
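
For reference, a typical ConMan session from `ncn-m001` resembles the sketch below. The console name is an assumption based on the usual `<node>-mgmt` naming; the linked procedure is the authoritative source.

```bash
# Sketch only: assumes ConMan runs on ncn-m001 and consoles are named after the node BMCs.
ncn-m001# conman -j ncn-s001-mgmt
# Exit the console session with "&." when finished.
```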

2. If booting from disk is desired, then [set the boot order](../../background/ncn_boot_workflow.md#set-boot-order).
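
The linked procedure is the supported method. As a rough illustration of what setting the boot order involves on a UEFI node (the entry number below is a placeholder, not a value from this system), `efibootmgr` can inspect and adjust the order:

```bash
# Illustration only -- boot entry numbers vary per node; follow the linked procedure for the supported steps.
ncn-s# efibootmgr              # list boot entries and the current BootOrder
ncn-s# efibootmgr -n 0001      # example: boot the disk entry (here 0001) on the next boot only
```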

3. Reboot the selected node.

```bash
ncn-s# shutdown -r now
```

@@ -180,17 +182,24 @@ Before rebooting NCNs:

Ensure the power is reporting as on. It may take 5-10 seconds for this status to update.
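
The power status commands for this step are in the collapsed portion of the diff. A typical check (assuming the BMC is reachable as `<node>-mgmt` and root credentials are available) looks like the following sketch:

```bash
# Sketch: the BMC hostname and credentials are assumptions; substitute the values for this system.
ncn-m001# ipmitool -I lanplus -U root -P <password> -H ncn-s002-mgmt chassis power status
Chassis Power is on
```
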
4. Watch on the console until the node has successfully booted and the login prompt is reached.

5. If desired, verify that the boot method is as expected. If `/proc/cmdline` begins with `BOOT_IMAGE`, then this NCN booted from disk:

```bash
ncn# egrep -o '^(BOOT_IMAGE.+/kernel)' /proc/cmdline
BOOT_IMAGE=(mduuid/a3899572a56f5fd88a0dec0e89fc12b4)/boot/grub2/../kernel
```

6. Retrieve the `XNAME` for the node being rebooted.

This xname is available on the node being rebooted in the following file:

```bash
ncn# ssh NODE cat /etc/cray/xname
```

7. Confirm what the Configuration Framework Service (CFS) configurationStatus is for the desiredConfig after rebooting the node.

The following command will indicate if a CFS job is currently in progress for this node. Replace the `XNAME` value in the following command with the xname of the node being rebooted.
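
The exact invocation is in the collapsed portion of this diff; a typical check (assuming the `cray` CLI is initialized on the node where it is run) resembles:

```bash
# Sketch: replace XNAME with the node's xname; jq is used here only to trim the output.
ncn-m001# cray cfs components describe XNAME --format json | jq '.configurationStatus, .desiredConfig'
```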

@@ -209,34 +218,34 @@ Before rebooting NCNs:

If `configurationStatus` is `failed`, see [Troubleshoot Ansible Play Failures in CFS Sessions](../configuration_management/Troubleshoot_Ansible_Play_Failures_in_CFS_Sessions.md) for how to analyze the pod logs from cray-cfs to determine why the configuration may not have completed.

8. Run the platform health checks from the [Validate CSM Health](../validate_csm_health.md) procedure.

**Troubleshooting:** If the slurmctld and slurmdbd pods do not start after powering back up the node, check for the following error:

```bash
ncn-m001# kubectl describe pod -n user -lapp=slurmctld
Warning FailedCreatePodSandBox 27m kubelet, ncn-w001 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "82c575cc978db00643b1bf84a4773c064c08dcb93dbd9741ba2e581bc7c5d545": Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to allocate for range 0: no IP addresses available in range set: 10.252.2.4-10.252.2.4
```

```bash
ncn-m001# kubectl describe pod -n user -lapp=slurmdbd
Warning FailedCreatePodSandBox 29m kubelet, ncn-w001 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "314ca4285d0706ec3d76a9e953e412d4b0712da4d0cb8138162b53d807d07491": Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to allocate for range 0: no IP addresses available in range set: 10.252.2.4-10.252.2.4
```

Remove the following files on every worker node to resolve the failure (a sketch of one way to do this follows the list):

- /var/lib/cni/networks/macvlan-slurmctld-nmn-conf
- /var/lib/cni/networks/macvlan-slurmdbd-nmn-conf
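
One way to remove both directories across the worker nodes is a simple loop over SSH; the worker node names below are examples and should be adjusted to the nodes in this system.

```bash
# Sketch: worker node names are examples; extend the list to cover every worker node.
ncn-m001# for wn in ncn-w001 ncn-w002 ncn-w003; do
              ssh "$wn" rm -rf /var/lib/cni/networks/macvlan-slurmctld-nmn-conf \
                               /var/lib/cni/networks/macvlan-slurmdbd-nmn-conf
          done
```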

9. Disconnect from the console.

10. Repeat all of the sub-steps above for the remaining storage nodes, going from the highest to lowest number until all storage nodes have successfully rebooted.

**Important:** Ensure `ceph -s` shows that Ceph is healthy BEFORE MOVING ON to reboot the next storage node. Once Ceph has recovered the downed mon, it may take several minutes for Ceph to resolve clock skew.
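
A quick spot check is to print just the overall health flag with the standard `ceph health` command; re-run it until any clock skew warning clears.

```bash
ncn-s# ceph health
HEALTH_OK
```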

#### NCN Worker Nodes

1. Reboot each of the worker nodes (one at a time).

**NOTE:** Only one worker node is rebooted at a time; keep track of which `ncn-w0xx` node is currently being worked on throughout these steps.

@@ -246,13 +255,13 @@ Before rebooting NCNs:

See [Establish a Serial Connection to NCNs](../conman/Establish_a_Serial_Connection_to_NCNs.md) for more information.

2. Fail over any postgres leader that is running on the worker node you are rebooting.

```bash
ncn-m# /usr/share/doc/csm/upgrade/1.0/scripts/k8s/failover-leader.sh <node to be rebooted>
```
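
An optional spot check after the script finishes (not part of the documented procedure) is to confirm that no Postgres leader pod remains on the node. The labels below assume the Zalando postgres-operator's standard labelling.

```bash
# Optional sketch: assumes the standard application=spilo and spilo-role labels; an empty result means no leader remains.
ncn-m# kubectl get pods -A -o wide -l application=spilo,spilo-role=master | grep <node to be rebooted>
```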

3. Cordon and drain the node

```bash
ncn-m# kubectl drain --ignore-daemonsets=true --delete-local-data=true <node to be rebooted>
```

@@ -276,7 +285,9 @@ Before rebooting NCNs:

```bash
ncn-m# kubectl drain --ignore-daemonsets=true --delete-local-data=true <node to be rebooted>
```
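
To confirm the drain completed, one general-purpose check (a sketch, not a step from the procedure) is that only DaemonSet-managed pods remain scheduled on the node:

```bash
# Sketch: after a successful drain, only DaemonSet pods (and completed pods) should be listed.
ncn-m# kubectl get pods -A -o wide --field-selector spec.nodeName=<node to be rebooted>
```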

4. If booting from disk is desired, then [set the boot order](../../background/ncn_boot_workflow.md#set-boot-order).

5. Reboot the selected node.

```bash
ncn-w# shutdown -r now
```

@@ -304,17 +315,24 @@ Before rebooting NCNs:

Ensure the power is reporting as on. It may take 5-10 seconds for this status to update.

6. Watch on the console until the node has successfully booted and the login prompt is reached.

7. If desired, verify that the boot method is as expected. If `/proc/cmdline` begins with `BOOT_IMAGE`, then this NCN booted from disk:

```bash
ncn# egrep -o '^(BOOT_IMAGE.+/kernel)' /proc/cmdline
BOOT_IMAGE=(mduuid/a3899572a56f5fd88a0dec0e89fc12b4)/boot/grub2/../kernel
```

8. Retrieve the `XNAME` for the node being rebooted.

This xname is available on the node being rebooted in the following file:

```bash
ncn# ssh NODE cat /etc/cray/xname
```

9. Confirm what the Configuration Framework Service (CFS) configurationStatus is for the desiredConfig after rebooting the node.

The following command will indicate if a CFS job is currently in progress for this node. Replace the `XNAME` value in the following command with the xname of the node being rebooted.

@@ -333,43 +351,45 @@ Before rebooting NCNs:

If `configurationStatus` is `failed`, see [Troubleshoot Ansible Play Failures in CFS Sessions](../configuration_management/Troubleshoot_Ansible_Play_Failures_in_CFS_Sessions.md) for how to analyze the pod logs from cray-cfs to determine why the configuration may not have completed.

10. Uncordon the node

```bash
ncn-m# kubectl uncordon <node you just rebooted>
```

11. Verify pods are running on the rebooted node.

Within a minute or two, the following command should begin to show pods in a `Running` state. Replace `<node to be rebooted>` in the command below with the name of the worker node that was just rebooted:

```bash
ncn-m# kubectl get pods -o wide -A | grep <node to be rebooted>
```

12. Run the platform health checks from the [Validate CSM Health](../validate_csm_health.md) procedure.

Verify that the `Check the Health of the Etcd Clusters in the Services Namespace` check from the ncnHealthChecks.sh script returns a healthy report for all members of each etcd cluster.
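
The health check script is typically invoked as shown below; the path is the usual CSM location for platform-utils and should be verified on this system.

```bash
# Sketch: the script path is an assumption based on the usual platform-utils location.
ncn-m001# /opt/cray/platform-utils/ncnHealthChecks.sh
```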

If terminating pods are reported when checking the status of the Kubernetes pods, wait for all pods to recover before proceeding.

13. Disconnect from the console.

14. Repeat all of the sub-steps above for the remaining worker nodes, going from the highest to lowest number until all worker nodes have successfully rebooted.

1. Ensure that BGP sessions are reset so that all BGP peering sessions with the spine switches are in an ESTABLISHED state.

See [Check BGP Status and Reset Sessions](../network/metallb_bgp/Check_BGP_Status_and_Reset_Sessions.md).

#### NCN Master Nodes

1. Reboot each of the master nodes (one at a time) starting with ncn-m003 then ncn-m001. There are special instructions for ncn-m001 below since its console connection is not managed by conman.

1. Establish a console session to the master node you are rebooting.

See [Establish a Serial Connection to NCNs](../conman/Establish_a_Serial_Connection_to_NCNs.md) for more information.

2. If booting from disk is desired, then [set the boot order](../../background/ncn_boot_workflow.md#set-boot-order).

3. Reboot the selected node.

```bash
ncn-m001# shutdown -r now
```

@@ -397,17 +417,24 @@ Before rebooting NCNs:

Ensure the power is reporting as on. It may take 5-10 seconds for this status to update.

4. Watch on the console until the node has successfully booted and the login prompt is reached.

5. If desired, verify that the boot method is as expected. If `/proc/cmdline` begins with `BOOT_IMAGE`, then this NCN booted from disk:

```bash
ncn# egrep -o '^(BOOT_IMAGE.+/kernel)' /proc/cmdline
BOOT_IMAGE=(mduuid/a3899572a56f5fd88a0dec0e89fc12b4)/boot/grub2/../kernel
```

6. Retrieve the `XNAME` for the node being rebooted.

This xname is available on the node being rebooted in the following file:

```bash
ncn# ssh NODE cat /etc/cray/xname
```

7. Confirm what the Configuration Framework Service (CFS) configurationStatus is for the desiredConfig after rebooting the node.

The following command will indicate if a CFS job is currently in progress for this node. Replace the `XNAME` value in the following command with the xname of the node being rebooted.

@@ -426,19 +453,21 @@ Before rebooting NCNs:

If `configurationStatus` is `failed`, see [Troubleshoot Ansible Play Failures in CFS Sessions](../configuration_management/Troubleshoot_Ansible_Play_Failures_in_CFS_Sessions.md) for how to analyze the pod logs from cray-cfs to determine why the configuration may not have completed.

8. Run the platform health checks in [Validate CSM Health](../validate_csm_health.md).

9. Disconnect from the console.

10. Repeat all of the sub-steps above for the remaining master nodes (excluding `ncn-m001`), going from the highest to lowest number until all master nodes have successfully rebooted.

2. Reboot `ncn-m001`.

1. Determine the CAN IP address for one of the other NCNs in the system to establish an SSH session with that NCN.
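
One way to find a node's CAN address (assuming the CAN is carried on the `vlan007` interface, which is common but should be confirmed for this system) is to query the interface directly:

```bash
# Sketch: the interface name is an assumption; check this system's network configuration if it differs.
ncn-m002# ip -4 addr show vlan007
```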

2. Establish a console session to `ncn-m001` from a remote system, as `ncn-m001` is the NCN that has an externally facing IP address.
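
Because the `ncn-m001` console is not managed by ConMan, a Serial-over-LAN session from the remote system is one option; the BMC address and credentials below are placeholders.

```bash
# Sketch: run from the remote system; substitute the real BMC address and credentials for ncn-m001.
remote$ ipmitool -I lanplus -U root -P <password> -H <ncn-m001 BMC IP> sol activate
```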

3. If booting from disk is desired, then [set the boot order](../../background/ncn_boot_workflow.md#set-boot-order).

4. Power cycle the node.

Ensure the expected results are returned from the power status check before rebooting:

@@ -466,17 +495,17 @@ Before rebooting NCNs:

Ensure the power is reporting as on. It may take 5-10 seconds for this status to update.

5. Watch on the console until the node has successfully booted and the login prompt is reached.

6. Retrieve the `XNAME` for the node being rebooted.

This xname is available on the node being rebooted in the following file:

```bash
ncn# ssh NODE cat /etc/cray/xname
```

7. Confirm what the Configuration Framework Service (CFS) configurationStatus is for the desiredConfig after rebooting the node.

The following command will indicate if a CFS job is currently in progress for this node. Replace the `XNAME` value in the following command with the xname of the node being rebooted.

@@ -495,11 +524,17 @@ Before rebooting NCNs:

If `configurationStatus` is `failed`, see [Troubleshoot Ansible Play Failures in CFS Sessions](../configuration_management/Troubleshoot_Ansible_Play_Failures_in_CFS_Sessions.md) for how to analyze the pod logs from cray-cfs to determine why the configuration may not have completed.

8. Run the platform health checks in [Validate CSM Health](../validate_csm_health.md).

9. Disconnect from the console.

3. Run the CASMINST-2015 script to remove any dynamically assigned interface IPs that were not released automatically:

```bash
ncn-m001# /usr/share/doc/csm/scripts/CASMINST-2015.sh
```

4. Re-run the platform health checks and ensure that all BGP peering sessions are Established with both spine switches.

See [Validate CSM Health](../validate_csm_health.md) for the platform health checks.

operations/node_management/Rebuild_NCNs.md (8 changes: 7 additions & 1 deletion)
@@ -879,7 +879,13 @@ This section applies to storage nodes. Skip this section if rebuilding a master

#### 6. Validation

After rebuilding any NCN(s), run the CASMINST-2015 script to remove any dynamically assigned interface IPs that were not released automatically:

```bash
ncn-m001# /usr/share/doc/csm/scripts/CASMINST-2015.sh
```

Once that is done, follow only the steps in the section for the node type that was rebuilt:

- [Validate Worker Node](#validate_worker_node)
- [Validate Master Node](#validate_master_node)