Merged
2 changes: 1 addition & 1 deletion .github/workflows/docs-ci.yaml
@@ -27,7 +27,7 @@ jobs:
env:
NGC_CLI_API_KEY: ${{ secrets.NVCR_TOKEN }}
run: |
make api-docs helm-docs generate-docs-versions-var
make api-docs helm-docs generate-docs-versions-var nic-conf-docs
- name: Close any existing documentation PRs
run: |
for pr_number in $(gh pr list --search "$PR_TITLE_PREFIX" --json number --jq ".[].number"); do
38 changes: 37 additions & 1 deletion Makefile
@@ -30,19 +30,23 @@ endif

# Network Operator source tar location
REPO_TAR_URL ?= https://github.com/Mellanox/network-operator/archive/refs/$(TAR_PATH)
# NIC Configuration Operator source tar location
NIC_CONF_REPO_TAR_URL ?= https://github.com/Mellanox/nic-configuration-operator/archive/refs/$(TAR_PATH)
# release.yaml location
RELEASE_YAML_URL ?= https://raw.githubusercontent.com/Mellanox/network-operator/$(if $(TAG),$(TAG),$(BRANCH))/hack/release.yaml

# Path to download the crd api to.
CRD_API_DEP_ROOT = $(BUILDDIR)/crd
# Path to download the nic-conf-operator crd api to.
NIC_CONF_CRD_API_DEP_ROOT = $(BUILDDIR)/nic-conf-crd
# Path to download the helm chart to.
HELM_CHART_DEP_ROOT = $(BUILDDIR)/helmcharts
# Helm chart version and url
HELM_CHART_VERSION ?= 24.4.1
NGC_HELM_CHART_URL ?= https://helm.ngc.nvidia.com/nvidia/charts/network-operator-${HELM_CHART_VERSION}.tgz
HELM_CHART_PATH ?=

$(BUILDDIR) $(TOOLSDIR) $(HELM_CHART_DEP_ROOT) $(CRD_API_DEP_ROOT): ; $(info Creating directory $@...)
$(BUILDDIR) $(TOOLSDIR) $(HELM_CHART_DEP_ROOT) $(CRD_API_DEP_ROOT) $(NIC_CONF_CRD_API_DEP_ROOT): ; $(info Creating directory $@...)
mkdir -p $@


@@ -113,16 +117,48 @@ download-api: | $(CRD_API_DEP_ROOT)
curl -sL ${REPO_TAR_URL} \
| tar -xz -C ${CRD_API_DEP_ROOT}

.PHONY: download-nic-conf-api
download-nic-conf-api: | $(NIC_CONF_CRD_API_DEP_ROOT)
curl -sL ${NIC_CONF_REPO_TAR_URL} \
| tar -xz -C ${NIC_CONF_CRD_API_DEP_ROOT}

gen-crd-api-docs: | $(GEN_CRD_API_REFERENCE_DOCS) download-api
cd ${CRD_API_DEP_ROOT}/network-operator-${SRC}/api/v1alpha1 && \
$(GEN_CRD_API_REFERENCE_DOCS) -api-dir=. -config=${CURDIR}/hack/api-docs/config.json \
-template-dir=${CURDIR}/hack/api-docs/templates -out-file=${BUILDDIR}/crds-api.html

gen-nic-conf-crd-api-docs: | $(GEN_CRD_API_REFERENCE_DOCS) download-nic-conf-api
cd ${NIC_CONF_CRD_API_DEP_ROOT}/nic-configuration-operator-${SRC}/api/v1alpha1 && \
$(GEN_CRD_API_REFERENCE_DOCS) -api-dir=. -config=${CURDIR}/hack/api-docs/nic-conf-config.json \
-template-dir=${CURDIR}/hack/api-docs/templates -out-file=${BUILDDIR}/nic-conf-crds-api.html

.PHONY: api-docs
api-docs: gen-crd-api-docs
docker run --rm --volume "`pwd`:/data:Z" pandoc/minimal -f html -t rst --lua-filter=/data/hack/ref_links.lua \
--columns 200 /data/build/_output/crds-api.html -o /data/docs/customizations/crds.rst


.PHONY: nic-conf-api-docs
nic-conf-api-docs: gen-nic-conf-crd-api-docs
docker run --rm --volume "`pwd`:/data:Z" pandoc/minimal -f html -t rst --lua-filter=/data/hack/ref_links.lua \
--columns 200 /data/build/_output/nic-conf-crds-api.html -o /data/docs/nic-conf-operator/crds.rst

.PHONY: nic-conf-api-docs-versioned
nic-conf-api-docs-versioned:
$(eval NIC_CONF_VERSION := $(shell grep "nic-configuration-operator-version" docs/common/vars.rst | sed 's/.*replace:: //'))
@echo "Using NIC Configuration Operator version: $(NIC_CONF_VERSION)"
TAG=$(NIC_CONF_VERSION) make nic-conf-api-docs

.PHONY: fetch-config-docs
fetch-config-docs:
$(eval NIC_CONF_VERSION := $(shell grep "nic-configuration-operator-version" docs/common/vars.rst | sed 's/.*replace:: //'))
@echo "Fetching configuration documentation for version: $(NIC_CONF_VERSION)"
./hack/fetch-config-docs.sh Mellanox/nic-configuration-operator $(NIC_CONF_VERSION) README.md docs/nic-conf-operator/configuration-details.rst

.PHONY: nic-conf-docs
nic-conf-docs: nic-conf-api-docs-versioned fetch-config-docs
@echo "Generated all NIC Configuration Operator documentation"

.PHONY: build-cache
build-cache:
@if [ -d "$(CACHE_DIR)" ]; then \
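The `nic-conf-api-docs-versioned` and `fetch-config-docs` targets both derive `NIC_CONF_VERSION` from `docs/common/vars.rst` with a `grep | sed` pipeline. The sketch below runs the same pipeline against a temporary file. The substitution line format and the `v0.0.0-example` value are assumptions for illustration, since the diff does not show the contents of `vars.rst`.

```shell
#!/bin/sh
# Hypothetical stand-in for docs/common/vars.rst; the real file contents
# and version value are not shown in this diff.
tmp_vars=$(mktemp)
printf '%s\n' '.. |nic-configuration-operator-version| replace:: v0.0.0-example' > "$tmp_vars"

# Same grep | sed pipeline as the Makefile's $(shell ...) call:
# keep the matching line, then strip everything up to "replace:: ".
extract_version() {
    grep "nic-configuration-operator-version" "$1" | sed 's/.*replace:: //'
}

NIC_CONF_VERSION=$(extract_version "$tmp_vars")
echo "Using NIC Configuration Operator version: $NIC_CONF_VERSION"
rm -f "$tmp_vars"
```

If more than one line in the file contains the search string, the pipeline returns one value per matching line, so the variable would then hold several tokens rather than a single version.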
271 changes: 0 additions & 271 deletions docs/getting-started-kubernetes.rst
@@ -2680,274 +2680,3 @@ The ``pod.yaml`` configuration file for such a deployment:
- sh
- -c
- sleep inf


===========================================================================
Configure NIC Firmware using the NIC Configuration Operator
===========================================================================
`NVIDIA NIC Configuration Operator <https://github.com/Mellanox/nic-configuration-operator>`_ provides a Kubernetes API (Custom Resource Definitions) for updating firmware and configuring NVIDIA NICs in a coordinated manner. It deploys a configuration daemon on each selected node to configure the NVIDIA NICs on that node. The NIC Configuration Operator uses the `Maintenance Operator <https://github.com/Mellanox/maintenance-operator>`_ to prepare a node for maintenance before applying the configuration.

.. warning:: NVIDIA NIC Configuration Operator does not support the FW reset flow in DPU mode. See the `limitations <https://github.com/Mellanox/network-operator-docs/blob/main/docs/release-notes.rst>`_.

.. note::
To perform firmware validation and updates on NIC devices, the NIC Configuration Operator requires persistent storage in the cluster.
To set up persistent NFS storage in the cluster, you can follow the `example from the CSI NFS Driver repository <https://github.com/kubernetes-csi/csi-driver-nfs/blob/master/deploy/example/nfs-provisioner/README.md>`_.
After you deploy the NFS server and the NFS CSI driver, the `storage class <https://github.com/kubernetes-csi/csi-driver-nfs/blob/master/deploy/example/storageclass-nfs.yaml>`_ should become available in the cluster. Pass the name of that storage class when configuring the NIC Configuration Operator.

First, install the Network Operator Helm chart with the Maintenance Operator enabled, then deploy a NicClusterPolicy CR with the NIC Configuration Operator enabled:

``values.yaml``:

.. code-block:: yaml

maintenanceOperator:
enabled: true

``nicclusterpolicy.yaml``:

.. code-block:: yaml
:substitutions:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
nicConfigurationOperator:
operator:
image: nic-configuration-operator
repository: |nic-configuration-operator-repository|
version: |nic-configuration-operator-version|
configurationDaemon:
image: nic-configuration-operator-daemon
repository: |nic-configuration-operator-repository|
version: |nic-configuration-operator-version|
nicFirmwareStorage:
create: true
pvcName: nic-fw-storage-pvc
# Name of the storage class is provided by the user
storageClassName: nfs-csi
availableStorageSize: 1Gi

Observe the NicDevice CRs detected in the cluster. The name of each CR is composed of the node name, the NIC type, and the NIC serial number:

.. code:: bash

> kubectl get nicdevices -n nvidia-network-operator

NAME AGE
node1-1015-mt1627x08307 1m
node1-101d-mt1952x03327 1m
node2-1015-mt1627x08305 1m
node2-101d-mt1952x03330 1m
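
The naming scheme described above can be sketched as a small shell helper. The function name `nicdevice_name` is hypothetical and only joins the three parts with hyphens, as the listed names suggest. The operator's actual naming logic may normalize the values differently.

```shell
#!/bin/sh
# Hypothetical helper: build a NicDevice CR name from its three parts.
# Arguments: node name, NIC type (PCI device ID), NIC serial number.
nicdevice_name() {
    printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Reproduces the first name from the listing above.
nicdevice_name node1 1015 mt1627x08307
```

A name built this way can be passed to `kubectl get nicdevice -n nvidia-network-operator <name> -o yaml` to inspect a single device.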

Discover more information about a specific device:

.. code:: bash

kubectl get nicdevice -n nvidia-network-operator node1-101d-mt1952x03327 -o yaml

.. code-block:: yaml

apiVersion: configuration.net.nvidia.com/v1alpha1
kind: NicDevice
metadata:
creationTimestamp: "2024-09-21T08:43:08Z"
generation: 1
name: node1-101d-mt1952x03327
namespace: nvidia-network-operator
ownerReferences:
- apiVersion: v1
kind: Node
name: node1
uid: 25c4f4e2-f7ba-4ba9-9a87-8056313ffc79
resourceVersion: "1177095"
uid: ac6763bf-67c6-4af5-81f8-1aad5da929bf
spec: {}
status:
conditions:
- type: FirmwareUpdateInProgress
status: "False"
reason: DeviceFirmwareSpecEmpty
message: Device firmware spec is empty, cannot update or validate firmware
lastTransitionTime: "2024-09-21T08:43:04Z"
- type: ConfigUpdateInProgress
status: "False"
reason: DeviceConfigSpecEmpty
message: Device configuration spec is empty, cannot update configuration
lastTransitionTime: "2024-09-21T08:43:08Z"
firmwareVersion: 22.39.1015
node: node1
partNumber: mcx623106ac-cdat
ports:
- networkInterface: enp3s0f0np0
pci: "0000:03:00.0"
rdmaInterface: mlx5_0
- networkInterface: enp3s0f1np1
pci: "0000:03:00.1"
rdmaInterface: mlx5_1
psid: mt_0000000436
serialNumber: mt1952x03327
type: 101d

Configure and apply the NicFirmwareSource CR:

.. code-block:: yaml

apiVersion: configuration.net.nvidia.com/v1alpha1
kind: NicFirmwareSource
metadata:
name: connectx6-dx-firmware-22-44-1036
namespace: nvidia-network-operator
finalizers:
- configuration.net.nvidia.com/nic-configuration-operator
spec:
# A list of firmware binary zip archives from the Mellanox website; each entry can point to any URL accessible from the cluster
binUrlSources:
- https://www.mellanox.com/downloads/firmware/fw-ConnectX6Dx-rel-22_44_1036-MCX623106AC-CDA_Ax-UEFI-14.37.14-FlexBoot-3.7.500.signed.bin.zip

Observe the NicFirmwareSource status:

.. code:: bash

> kubectl get nicfirmwaresource -n nvidia-network-operator connectx6-dx-firmware-22-44-1036 -o yaml

...
status:
state: Success
versions:
22.44.1036:
- mt_0000000436

Configure and apply the NicFirmwareTemplate CR:

.. code-block:: yaml

apiVersion: configuration.net.nvidia.com/v1alpha1
kind: NicFirmwareTemplate
metadata:
name: connectx6dx-config
namespace: nvidia-network-operator
spec:
nodeSelector:
kubernetes.io/hostname: node1
nicSelector:
nicType: "101d"
template:
nicFirmwareSourceRef: connectx6-dx-firmware-22-44-1036
updatePolicy: Update

Configure and apply the NicConfigurationTemplate CR:

.. code-block:: yaml

apiVersion: configuration.net.nvidia.com/v1alpha1
kind: NicConfigurationTemplate
metadata:
name: connectx6-config
namespace: nvidia-network-operator
spec:
nodeSelector:
feature.node.kubernetes.io/network-sriov.capable: "true"
nicSelector:
# The nicType selector is mandatory; the rest are optional. Only a single type can be specified.
nicType: 101d
pciAddresses:
- "0000:03:00.0"
- "0000:04:00.0"
serialNumbers:
- "mt1952x03327"
resetToDefault: false # if true, the template is ignored and the device configuration is reset
template:
# numVfs and linkType fields are mandatory, the rest are optional
numVfs: 2
linkType: Ethernet
pciPerformanceOptimized:
enabled: true
maxReadRequest: 4096
roceOptimized:
enabled: true
qos:
trust: dscp
pfc: "0,0,0,1,0,0,0,0"
gpuDirectOptimized:
enabled: true
env: Baremetal
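
The `pfc` value in the template is a comma-separated list of eight flags. Assuming it follows the common per-priority convention (positions 0 through 7, with `1` enabling PFC for that priority), which the text above does not state explicitly, the following sketch lists the enabled priorities. The helper name `pfc_enabled_priorities` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the priorities (0-7) whose flag is 1
# in a comma-separated PFC string such as "0,0,0,1,0,0,0,0".
# Assumption: position N in the string corresponds to priority N.
pfc_enabled_priorities() {
    prio=0
    enabled=""
    old_ifs=$IFS
    IFS=,
    # Unquoted expansion splits the string on commas.
    for flag in $1; do
        if [ "$flag" = "1" ]; then
            enabled="${enabled:+$enabled }$prio"
        fi
        prio=$((prio + 1))
    done
    IFS=$old_ifs
    echo "$enabled"
}

pfc_enabled_priorities "0,0,0,1,0,0,0,0"
```

Under that assumption, the value used in the template enables PFC on priority 3 only.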

.. note:: It is not possible to apply more than one template of each kind (NicFirmwareTemplate or NicConfigurationTemplate) to a single device. If several templates of the same kind match a device, none of them is applied and an error event is emitted for the corresponding NicDevice CR.

.. note:: To use the NIC Configuration Operator together with the SR-IOV Network Operator, the "mellanox" `plugin should be disabled <https://github.com/k8snetworkplumbingwg/sriov-network-operator/tree/master?tab=readme-ov-file#disabling-sr-iov-config-daemon-plugins>`_ in the SR-IOV Network Operator.

For more information about the CRD API, refer to the `API documentation <https://github.com/Mellanox/nic-configuration-operator/blob/main/docs/api-reference.md>`_.
To see which FW parameter each setting corresponds to, refer to the `Configuration details section <https://github.com/Mellanox/nic-configuration-operator?tab=readme-ov-file#configuration-details>`_.

The spec of the NicDevice CR is updated in accordance with the NicFirmwareTemplate and NicConfigurationTemplate CRs that match the device:

.. code-block:: bash

> kubectl get nicdevice -n nvidia-network-operator node1-101d-mt1952x03327 -o jsonpath='{.spec}' | yq -P

template:
firmware:
nicFirmwareSourceRef: connectx6-dx-firmware-22-44-1036
updatePolicy: Update
configuration:
numVfs: 2
linkType: Ethernet
pciPerformanceOptimized:
enabled: true
roceOptimized:
enabled: true
qos:
trust: dscp
pfc: "0,0,0,1,0,0,0,0"
gpuDirectOptimized:
enabled: true
env: Baremetal


The status conditions of the NicDevice CR reflect the progress of the configuration update and report any errors that occur during the process:

.. code-block:: bash

> kubectl get nicdevice -n nvidia-network-operator node1-101d-mt1952x03327 -o jsonpath='{.status.conditions}' | yq -P

- type: FirmwareUpdateInProgress
status: "False"
reason: DeviceFirmwareConfigMatch
message: Firmware matches the requested version
observedGeneration: 4
lastTransitionTime: "2024-09-21T08:42:23Z"
- type: ConfigUpdateInProgress
status: "True"
reason: UpdateStarted
message: ""
lastTransitionTime: "2024-09-21T08:43:08Z"

----------------------------------
NIC Firmware Mismatch Notification
----------------------------------

The NIC Configuration Operator sets the ``FirmwareConfigMatch`` status condition on the NicDevice CR based on the current NIC firmware:

.. code-block:: bash

> kubectl get nicdevice -n nvidia-network-operator node1-101d-mt1952x03327 -o jsonpath='{.status.conditions}' | yq -P

- type: FirmwareConfigMatch
status: "True"
reason: DeviceFirmwareConfigMatch
message: Device firmware '20.42.1000' matches to recommended version '20.42.1000'
lastTransitionTime: "2024-09-21T08:43:10Z"

The ``FirmwareConfigMatch`` condition status is set to ``Unknown`` if the DOCA-OFED driver is not installed. Otherwise, it indicates whether the current NIC firmware matches the version recommended by the DOCA-OFED driver. For example:

.. code-block:: bash

> kubectl get nicdevice -n nvidia-network-operator node1-101d-mt1952x03327 -o jsonpath='{.status.conditions}' | yq -P

- type: FirmwareConfigMatch
status: "True"
reason: DeviceFirmwareConfigMatch
message: Device firmware '20.42.1000' matches to recommended version '20.42.1000'
lastTransitionTime: "2024-11-08T09:19:41Z"
1 change: 1 addition & 0 deletions docs/index.rst
@@ -25,6 +25,7 @@
Platform Support <platform-support.rst>
Getting Started with Kubernetes <getting-started-kubernetes.rst>
Getting Started with Red Hat OpenShift <getting-started-openshift.rst>
NIC Configuration Operator <nic-conf-operator/nic-configuration-operator.rst>
Customization Options and CRDs <customizations/customization.rst>
Life Cycle Management <life-cycle-management.rst>
Advanced Configurations <advanced/advanced.rst>