# Nvidia GPU Operator Lab
This lab explores the Nvidia GPU Operator installation on the OpenShift cluster.

In [None]:
##### Logging into OpenShift #####
### Set Student Number ###
student_number = "##"      # Replace with your student number

if student_number == "##":
    raise ValueError("Please set your student number in the 'student_number' variable.")

### Login to OpenShift ###
!oc login -u s{student_number} -p"!@34QWer" https://api.ocp.ucsx.hl.dns:6443 --insecure-skip-tls-verify
!oc project ai-s{student_number}

## Look at the Node labels
- The feature.node.kubernetes.io/pci-xxxx labels indicate which PCI devices are present on the node
  - 10de is the PCI vendor ID for Nvidia
  - 15b3 is the PCI vendor ID for Mellanox
- The nvidia.com labels are created by the Nvidia GPU Operator and are useful for scheduling GPU workloads onto nodes with GPUs
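As a rough illustration of how these labels can be read programmatically, the sketch below extracts the vendor IDs from `pci-xxxx.present` labels and maps them to vendor names. The helper name `pci_vendors_present` and the lookup table are illustrative, not part of the lab tooling, and the pattern assumes the label format shown in this lab (vendor ID only).

```python
import re

# Vendor IDs named in this lab; extend the table for other hardware.
PCI_VENDORS = {"10de": "NVIDIA", "15b3": "Mellanox"}

# Matches labels such as feature.node.kubernetes.io/pci-10de.present=true
PCI_LABEL = re.compile(r"feature\.node\.kubernetes\.io/pci-([0-9a-f]{4})\.present=true")


def pci_vendors_present(label_text):
    """Return the sorted vendor names whose PCI devices are present on a node."""
    vendor_ids = set(PCI_LABEL.findall(label_text))
    return sorted(PCI_VENDORS.get(v, f"unknown ({v})") for v in vendor_ids)


# Sample labels in the format this lab's nodes report.
sample_labels = """\
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-10de.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
"""

print(pci_vendors_present(sample_labels))
```

The `sriov.capable` labels are ignored on purpose, since only the `present` labels say that a device exists on the node.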

In [None]:
### Look at PCIE Node labels ###
# !oc describe node ocp5 | grep -i feature.node
!oc describe node ocp5 | grep -i pci


# feature.node.kubernetes.io/pci-10de.present=true
# feature.node.kubernetes.io/pci-10de.sriov.capable=true
# feature.node.kubernetes.io/pci-15b3.present=true
# feature.node.kubernetes.io/pci-15b3.sriov.capable=true

In [None]:
### Look at NVIDIA specific Node labels ###
!oc describe node ocp5 | grep -i nvidia

## Look at GPUs on each worker node

In [None]:
### Look at PCIE Devices on the Nodes ###
## Each command creates a debug pod on its node and runs lspci to list PCI devices, filtering for NVIDIA devices
print("### Node: ocp4 ###")
!oc debug node/ocp4 -- chroot /host lspci | grep -i nvidia

print("\n\n### Node: ocp5 ###")
!oc debug node/ocp5 -- chroot /host lspci | grep -i nvidia

print("\n\n### Node: ocp6 ###")
!oc debug node/ocp6 -- chroot /host lspci | grep -i nvidia


### Simplified Output ###
# Node: ocp4
# 31:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
#
# Node: ocp5
# 3d:00.0 3D controller: NVIDIA Corporation AD102GL [L40] (rev a1)
#
# Node: ocp6
# 3d:00.0 3D controller: NVIDIA Corporation AD102GL [L40] (rev a1)
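
If the GPU model per node is needed later in a notebook, the lspci lines can be reduced to a few fields. The following is a minimal sketch that assumes the line layout of the simplified output in this lab; the name `parse_gpu` and the regular expression are illustrative only.

```python
import re

# Simplified lspci output from this lab, keyed by node name.
lspci_by_node = {
    "ocp4": "31:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)",
    "ocp5": "3d:00.0 3D controller: NVIDIA Corporation AD102GL [L40] (rev a1)",
    "ocp6": "3d:00.0 3D controller: NVIDIA Corporation AD102GL [L40] (rev a1)",
}

# Captures the PCI address, the chip name, and the bracketed model name.
LSPCI_GPU = re.compile(r"^(\S+) .*NVIDIA Corporation (\S+) \[([^\]]+)\]")


def parse_gpu(line):
    """Return (pci_address, chip, model) for one NVIDIA lspci line, or None."""
    match = LSPCI_GPU.match(line)
    return match.groups() if match else None


for node, line in lspci_by_node.items():
    address, chip, model = parse_gpu(line)
    print(f"{node}: {model} ({chip}) at {address}")
```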

## Explore Nvidia GPU Operator Pods
- In our environment, pods in the nvidia-gpu-operator namespace usually come in sets of 3
  - One pod is deployed on each worker node that has a GPU

### There are 3 key groups of pods ###
- nvidia-driver-daemonset-…
  -  Installs and manages the NVIDIA kernel driver on each GPU node
  
- nvidia-container-toolkit-daemonset-…
  -  Installs NVIDIA Container Runtime hooks on each node
  
- nvidia-device-plugin-daemonset-…
  -  Exposes GPUs to Kubernetes as resources that can be scheduled to other pods (nvidia.com/gpu)
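
One way to check that each key group has a pod on every GPU node is to match pod names against the three prefixes above. This is only a sketch: `group_key_pods` is a hypothetical helper, and the (pod name, node) pairs are copied by hand from this lab's sample `oc get pods -o wide` output rather than read from the cluster.

```python
from collections import defaultdict

# Name prefixes of the three key DaemonSets described above.
KEY_GROUPS = (
    "nvidia-driver-daemonset",
    "nvidia-container-toolkit-daemonset",
    "nvidia-device-plugin-daemonset",
)


def group_key_pods(pods):
    """Map each key DaemonSet prefix to the sorted nodes running one of its pods."""
    nodes_by_group = defaultdict(list)
    for pod_name, node in pods:
        for prefix in KEY_GROUPS:
            if pod_name.startswith(prefix):
                nodes_by_group[prefix].append(node)
    return {prefix: sorted(nodes) for prefix, nodes in nodes_by_group.items()}


# (pod name, node) pairs taken from the sample listing in this lab.
sample_pods = [
    ("gpu-operator-5dc89c5c45-82ncx", "ocp5"),  # not one of the key groups
    ("nvidia-container-toolkit-daemonset-lqs54", "ocp4"),
    ("nvidia-container-toolkit-daemonset-tmpsb", "ocp5"),
    ("nvidia-container-toolkit-daemonset-xvgrt", "ocp6"),
    ("nvidia-device-plugin-daemonset-8dnbv", "ocp5"),
    ("nvidia-device-plugin-daemonset-sxxrn", "ocp4"),
    ("nvidia-device-plugin-daemonset-vk656", "ocp6"),
    ("nvidia-driver-daemonset-418.94.202510230424-0-dt77s", "ocp4"),
    ("nvidia-driver-daemonset-418.94.202510230424-0-j6jmx", "ocp5"),
    ("nvidia-driver-daemonset-418.94.202510230424-0-kx5q8", "ocp6"),
]

for prefix, nodes in group_key_pods(sample_pods).items():
    print(f"{prefix}: {nodes}")
```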

In [None]:
### Look at all pods in the nvidia-gpu-operator namespace ###
!oc get pods -n nvidia-gpu-operator -o wide

# NAME                                                  READY   STATUS      RESTARTS   AGE     IP             NODE   NOMINATED NODE   READINESS GATES
# gpu-feature-discovery-42brk                           2/2     Running     0          2d22h   10.128.2.99    ocp5   <none>           <none>
# gpu-feature-discovery-65pd5                           2/2     Running     0          2d22h   10.129.2.217   ocp4   <none>           <none>
# gpu-feature-discovery-6b5fl                           2/2     Running     0          2d22h   10.131.0.97    ocp6   <none>           <none>
# gpu-operator-5dc89c5c45-82ncx                         1/1     Running     0          4d12h   10.128.2.44    ocp5   <none>           <none>
# nvidia-container-toolkit-daemonset-lqs54              1/1     Running     0          2d22h   10.129.2.215   ocp4   <none>           <none>
# nvidia-container-toolkit-daemonset-tmpsb              1/1     Running     0          2d22h   10.128.2.101   ocp5   <none>           <none>
# nvidia-container-toolkit-daemonset-xvgrt              1/1     Running     0          2d22h   10.131.0.99    ocp6   <none>           <none>
# nvidia-cuda-validator-gb45d                           0/1     Completed   0          2d22h   10.128.2.104   ocp5   <none>           <none>
# nvidia-cuda-validator-jjqbq                           0/1     Completed   0          2d22h   10.131.0.100   ocp6   <none>           <none>
# nvidia-cuda-validator-w45dk                           0/1     Completed   0          2d22h   10.129.2.220   ocp4   <none>           <none>
# nvidia-dcgm-22fjk                                     1/1     Running     0          2d22h   10.128.2.98    ocp5   <none>           <none>
# nvidia-dcgm-exporter-6qvgd                            1/1     Running     0          2d22h   10.128.2.100   ocp5   <none>           <none>
# nvidia-dcgm-exporter-j97qb                            1/1     Running     0          2d22h   10.131.0.98    ocp6   <none>           <none>
# nvidia-dcgm-exporter-mtz67                            1/1     Running     0          2d22h   10.129.2.219   ocp4   <none>           <none>
# nvidia-dcgm-gkmpt                                     1/1     Running     0          2d22h   10.129.2.218   ocp4   <none>           <none>
# nvidia-dcgm-llcvb                                     1/1     Running     0          2d22h   10.131.0.96    ocp6   <none>           <none>
# nvidia-device-plugin-daemonset-8dnbv                  2/2     Running     0          2d22h   10.128.2.102   ocp5   <none>           <none>
# nvidia-device-plugin-daemonset-sxxrn                  2/2     Running     0          2d22h   10.129.2.214   ocp4   <none>           <none>
# nvidia-device-plugin-daemonset-vk656                  2/2     Running     0          2d22h   10.131.0.95    ocp6   <none>           <none>
# nvidia-driver-daemonset-418.94.202510230424-0-dt77s   4/4     Running     0          2d22h   10.129.2.213   ocp4   <none>           <none>
# nvidia-driver-daemonset-418.94.202510230424-0-j6jmx   4/4     Running     0          2d22h   10.128.2.97    ocp5   <none>           <none>
# nvidia-driver-daemonset-418.94.202510230424-0-kx5q8   4/4     Running     0          2d22h   10.131.0.93    ocp6   <none>           <none>
# nvidia-node-status-exporter-66x7t                     1/1     Running     0          4d      10.131.0.73    ocp6   <none>           <none>
# nvidia-node-status-exporter-q2rjv                     1/1     Running     0          4d      10.128.2.77    ocp5   <none>           <none>
# nvidia-node-status-exporter-xlxtz                     1/1     Running     0          4d      10.129.2.97    ocp4   <none>           <none>
# nvidia-operator-validator-4rz5r                       1/1     Running     0          2d22h   10.131.0.94    ocp6   <none>           <none>
# nvidia-operator-validator-dpzls                       1/1     Running     0          2d22h   10.129.2.216   ocp4   <none>           <none>
# nvidia-operator-validator-pdmhj                       1/1     Running     0          2d22h   10.128.2.103   ocp5   <none>           <none>

In [None]:
### Change the name below to match the name of the driver pod running on ocp4 ###
nv_driver_pod_name = "nvidia-driver-daemonset-418.94.202510230424-0-brdb4"

In [None]:
### Look at the logs of the Nvidia Driver pod ###
!oc logs -n nvidia-gpu-operator {nv_driver_pod_name}


### Summarized Output ###
# ...
# Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 580.95.05...
# + cd NVIDIA-Linux-x86_64-580.95.05
# + sh /tmp/install.sh nvinstall
# ...
# ========== NVIDIA Software Installer ==========

# Starting installation of NVIDIA driver version 580.95.05 for Linux kernel version 5.14.0-427.96.1.el9_4.x86_64

# + echo -e '\n========== NVIDIA Software Installer ==========\n'
# + echo -e 'Starting installation of NVIDIA driver version 580.95.05 for Linux kernel version 5.14.0-427.96.1.el9_4.x86_64\n'
# ...


## Run nvidia-smi inside the NVIDIA Driver Daemonset Pods
- These pods are in charge of managing the NVIDIA drivers on the nodes
- The NVIDIA Driver Daemonset Pods can see all GPU processes on the node
- By contrast, running nvidia-smi in any other pod with GPU access shows only that pod's processes
  - It still shows the VRAM used by other pods
- If you look at the pod running on node ocp4, you will see some of the processes for the Ollama/NVIDIA NIM containers


In [None]:
## Note: The name of the pod might be different
!oc exec -n nvidia-gpu-operator -it {nv_driver_pod_name} -- nvidia-smi

In [None]:
### Look at the Node description to see the GPU capacity ###
## The nvidia.com/gpu capacity will only appear after the Nvidia driver is successfully installed
## It shows 8 GPUs instead of 1 because GPU time-slicing is enabled in this lab environment
!oc describe node ocp4


# Capacity:
#   cpu:                152
#   ephemeral-storage:  523505924Ki
#   hugepages-1Gi:      0
#   hugepages-2Mi:      0
#   memory:             1056450856Ki
#   nvidia.com/gpu:     8
#   pods:               250

## GPU Time Slicing Configuration
- GPU Time Slicing lets multiple containers share a single GPU by loading their data into VRAM, then taking turns executing compute workloads
- VRAM usage is not isolated, so one pod can consume all available VRAM, preventing other pods from running AI workloads
- Time-slicing settings are managed through a ConfigMap in the nvidia-gpu-operator namespace
- For GPUs such as the H100, MIG (Multi-Instance GPU) is often a better option because each MIG instance receives its own dedicated VRAM and guaranteed GPU compute partition
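
The effect of time-slicing on the advertised capacity is simple arithmetic: each physical GPU is advertised once per replica. The sketch below uses values that mirror the time-slicing ConfigMap in this lab environment; the dictionary layout and the function `advertised_resource` are illustrative, and the `.shared` suffix reflects the device plugin's documented behavior when `renameByDefault` is true, which this lab does not use.

```python
# Per-model settings mirroring the time-slicing ConfigMap in this lab.
timeslicing = {
    "NVIDIA-L40": {"replicas": 8, "rename_by_default": False},
    "Tesla-T4": {"replicas": 8, "rename_by_default": False},
}


def advertised_resource(model, physical_gpus, config=timeslicing):
    """Return (resource_name, count) that a node with this GPU model would advertise."""
    settings = config[model]
    # Each physical GPU is advertised `replicas` times.
    count = physical_gpus * settings["replicas"]
    # With renameByDefault: true the resource name gets a ".shared" suffix.
    if settings["rename_by_default"]:
        name = "nvidia.com/gpu.shared"
    else:
        name = "nvidia.com/gpu"
    return name, count


# ocp4 has one Tesla T4, so it advertises 8 schedulable GPU replicas.
print(advertised_resource("Tesla-T4", physical_gpus=1))
```

The replicas are only scheduling slots; as noted above, they do not partition VRAM or compute between pods.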

In [None]:
### Look at the list of config maps in the nvidia-gpu-operator namespace ###
!oc get configmap -n nvidia-gpu-operator

# NAME                            DATA   AGE
# device-plugin-gpu-timeslicing   2      6d23h

In [None]:
### Look at the timeslicing configmap details ###
## This shows that the T4 and L40 GPUs are being timesliced into 8 replicas each
## Which matches the nvidia.com/gpu capacity above
!oc describe configmap device-plugin-gpu-timeslicing -n nvidia-gpu-operator


# Name:         device-plugin-gpu-timeslicing
# Namespace:    nvidia-gpu-operator
# Labels:       <none>
# Annotations:  <none>

# Data
# ====
# NVIDIA-L40:
# ----
# version: v1
# sharing:
#   timeSlicing:
#     renameByDefault: false
#     resources:
#       - name: nvidia.com/gpu
#         replicas: 8
# Tesla-T4:
# ----
# version: v1
# sharing:
#   timeSlicing:
#     renameByDefault: false
#     resources:
#       - name: nvidia.com/gpu
#         replicas: 8


## GPU usage in pods
- To give a pod access to a GPU, specify the resource limit in the manifest, similar to:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1            # requesting 1 GPU
```
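
For context, the fragment above sits under a container entry in the pod spec. The following sketch builds a complete minimal Pod manifest as a Python dictionary and prints it as JSON. The pod name `gpu-smoke-test`, the image `registry.example.com/cuda-base:latest`, and the helper `gpu_pod_manifest` are placeholders chosen for illustration, not values from this lab.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a minimal Pod manifest that requests `gpus` units of nvidia.com/gpu."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,  # placeholder; must be an image that ships nvidia-smi
                    "command": ["nvidia-smi"],
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical name and image, used only to show the structure.
manifest = gpu_pod_manifest("gpu-smoke-test", "registry.example.com/cuda-base:latest")
print(json.dumps(manifest, indent=2))
```

Because time-slicing is enabled in this environment, a limit of 1 corresponds to one of the 8 replicas of a physical GPU rather than a dedicated device.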