<img src="./images/DLI_Header.png" style="width: 400px;">

# 1.0 Overview of the Class Environment

<img src="images/overview_class.png" style="float: right;" width=350>

This notebook introduces the class environment, which is configured to mimic an AI production setup. You will get an overview of the environment: a Kubernetes cluster with 4x A100 80GB GPUs.

You will also experiment with basic [Kubernetes](https://kubernetes.io/) commands. Kubernetes, also known as K8s, is an open-source system for automating the deployment, scaling, and management of containerized applications.

In this class, a K8s cluster has already been launched using [Minikube](https://minikube.sigs.k8s.io/docs/), and GPU acceleration has already been enabled on the cluster using the NVIDIA GPU Operator.

This is our first step toward deploying, monitoring, and managing AI-based applications in production.

The goals of this notebook are to:
* Understand the hardware configuration available for the class
* Understand basic Kubernetes commands
* Run a simple CUDA application

**[1.1 The Hardware Configuration Overview](#1.1-The-Hardware-Configuration-Overview)<br>**
**[1.2 Kubernetes Cluster Basics](#1.2-Kubernetes-Cluster-Basics)<br>**
**[1.3 GPU Application Example](#1.3-GPU-Application-Example)<br>**

---
# 1.1 The Hardware Configuration Overview

NVIDIA has designed DGX servers as a full-stack solution for scalable AI development. Learn more about [DGX systems](https://www.nvidia.com/en-gb/data-center/dgx-systems/). This class environment is built with half the resources of a DGX 8x A100 server (4x A100 GPUs, 4 NVLinks per GPU). However, different deliveries of this course may have different hardware configurations, so for benchmarking purposes we will use 4x A100 80GB as the reference.

<img  src="images/nvlink_v2.png" width="400"/>

The hardware setup for this course has been pre-configured as a GPU cluster unit. The cluster is structured into compute nodes (compute units), which can be managed by a Cluster Manager such as Kubernetes. Alongside CPUs (Central Processing Units) and GPUs (Graphics Processing Units), the cluster incorporates storage and networking components.

Let's look at the hardware available in this class.

## 1.1.1 Check The Available CPUs 

We can check the CPU information of the system using the `lscpu` command. 
This example output shows 48 cores per socket on an AMD `x86_64` CPU.
```
Architecture:                    x86_64
Core(s) per socket:              48
Model name:                      AMD EPYC 7V13 64-Core Processor
```
For a complete description of the CPU processor architecture, check the `/proc/cpuinfo` file.

In [None]:
# Display CPUs information
!lscpu

In [None]:
# Check the number of CPU cores
!grep 'cpu cores' /proc/cpuinfo | uniq
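
To work with these values programmatically, the `lscpu` output can be split into key/value pairs. The sketch below is a minimal example, not part of the course material: `parse_lscpu` is a hypothetical helper, and the sample text is just the three lines quoted above. A real run prints many more fields, including `Socket(s)`, which is assumed to be 1 here when absent.

```python
def parse_lscpu(text):
    """Parse `lscpu`-style output into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a "key: value" shape
            fields[key.strip()] = value.strip()
    return fields


# The three lines quoted in the example output above.
sample = """\
Architecture:                    x86_64
Core(s) per socket:              48
Model name:                      AMD EPYC 7V13 64-Core Processor
"""

info = parse_lscpu(sample)
cores_per_socket = int(info["Core(s) per socket"])
# "Socket(s)" is not in the excerpt, so assume a single socket when absent.
sockets = int(info.get("Socket(s)", 1))
total_cores = cores_per_socket * sockets
print(f"{info['Model name']}: {total_cores} cores ({info['Architecture']})")
```

To parse the live system instead of the sample, pass the output of `subprocess.run(["lscpu"], capture_output=True, text=True).stdout` to `parse_lscpu`.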

## 1.1.2 Check The Available GPUs

The NVIDIA System Management Interface, `nvidia-smi`, is a command-line tool for monitoring NVIDIA GPU devices. It lists several key details, such as the CUDA and GPU driver versions, the number and type of GPUs available, the memory of each GPU, and the running GPU processes.

In the following example, `nvidia-smi` shows the available GPUs, each with approximately 80GB of memory.

<img  src="images/nvidia-smi.png" width="600"/>

For more details, refer to the [nvidia-smi documentation](https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf).

In [None]:
# Display information about GPUs
!nvidia-smi
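
For scripting, `nvidia-smi` can also print selected fields as CSV through its `--query-gpu` and `--format` options. The following is a minimal sketch under stated assumptions: `parse_gpu_csv` and `query_gpus` are hypothetical helper names, and the sample lines are illustrative rather than captured from the class machine, so the GPU name may differ on your system.

```python
import shutil
import subprocess

# Ask for index, name and total memory (MiB) as header-less CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Turn 'index, name, memory_mib' lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, mem_mib = [part.strip() for part in line.split(",")]
        gpus.append(
            {"index": int(index), "name": name, "memory_gib": int(mem_mib) / 1024}
        )
    return gpus


def query_gpus():
    """Query the local GPUs; returns [] when nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Illustrative sample only (assumed values, not real class output).
sample = "0, NVIDIA A100 80GB PCIe, 81920\n1, NVIDIA A100 80GB PCIe, 81920\n"
gpus = parse_gpu_csv(sample)
for gpu in gpus:
    print(f"GPU {gpu['index']}: {gpu['name']} with {gpu['memory_gib']:.0f} GiB")
```

Calling `query_gpus()` in the class environment should return one entry per GPU.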

---
# 1.2 Kubernetes Cluster Basics


In this class, a local Kubernetes cluster is already running using [Minikube](https://minikube.sigs.k8s.io/docs/) and the NVIDIA GPU Operator is already installed.

The [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labelling using GPU Feature Discovery (GFD), DCGM-based monitoring, and others.


The `kubectl get` command provides information about the nodes, pods, and services in your Kubernetes cluster. Learn more about the [kubectl command line tool](https://kubernetes.io/docs/reference/kubectl/overview/).

A Kubernetes cluster has two types of resources: nodes, which run the applications, and a control plane, which coordinates the cluster.

Let's have a look at the lab Kubernetes cluster by checking the status of nodes and Pods.

`kubectl get nodes` fetches the status of all nodes in the Kubernetes cluster from the control plane node. 

In this lab, you should see the control-plane node called minikube with the status `Ready`. 

Let's check that by running the command below.

<div class="alert alert-block alert-warning">
If you get an error when running the command below, the cluster is not ready yet. Try again in a few minutes.
</div>

In [None]:
# Check available nodes in the cluster
! kubectl get nodes

Your cluster is already running a minikube control-plane node.

You can get more output details using the argument `-o wide`. Notice the Internal IP of minikube. 

Kubernetes requires a container runtime. In this class, we are using Docker.

In [None]:
# Check available nodes in the cluster with more details
! kubectl get nodes -o wide
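
The same node details are available in machine-readable form through `kubectl get nodes -o json`. The sketch below shows one way to extract the readiness, internal IP, and container runtime of each node. `summarize_nodes` is a hypothetical helper, and the sample document is a trimmed, illustrative stand-in for real output (the IP address and Docker version are assumed values).

```python
import json


def summarize_nodes(nodes_json):
    """Extract name, readiness, internal IP and runtime from a node list."""
    rows = []
    for node in json.loads(nodes_json)["items"]:
        status = node["status"]
        # A node is ready when its "Ready" condition has status "True".
        ready = any(
            c["type"] == "Ready" and c["status"] == "True"
            for c in status.get("conditions", [])
        )
        internal_ip = next(
            (a["address"] for a in status.get("addresses", [])
             if a["type"] == "InternalIP"),
            None,
        )
        rows.append({
            "name": node["metadata"]["name"],
            "ready": ready,
            "internal_ip": internal_ip,
            "runtime": status["nodeInfo"]["containerRuntimeVersion"],
        })
    return rows


# Trimmed, illustrative stand-in for `kubectl get nodes -o json`.
sample = json.dumps({"items": [{
    "metadata": {"name": "minikube"},
    "status": {
        "conditions": [{"type": "Ready", "status": "True"}],
        "addresses": [{"type": "InternalIP", "address": "192.168.49.2"}],
        "nodeInfo": {"containerRuntimeVersion": "docker://24.0.7"},
    },
}]})

nodes = summarize_nodes(sample)
print(nodes)
```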

Let's now check the pods running on our Kubernetes cluster.
You should see several running pods in the `kube-system` and `gpu-operator` namespaces. Note that we use `-A` to list pods in all namespaces.

```
NAMESPACE      NAME                                                              READY   STATUS      RESTARTS        AGE
gpu-operator   gpu-feature-discovery-24m7m                                       1/1     Running     0               7m26s
gpu-operator   gpu-operator-1708880160-node-feature-discovery-gc-7fd59b8cdnjxb   1/1     Running     0               8m1s
gpu-operator   gpu-operator-1708880160-node-feature-discovery-master-55b7869lr   1/1     Running     0               8m1s
gpu-operator   gpu-operator-1708880160-node-feature-discovery-worker-bxgzc       1/1     Running     0               8m1s
gpu-operator   gpu-operator-fbc85568f-4l2kx                                      1/1     Running     1 (6m32s ago)   8m1s
gpu-operator   nvidia-container-toolkit-daemonset-fb54t                          1/1     Running     0               7m23s
gpu-operator   nvidia-cuda-validator-s8vh2                                       0/1     Completed   0               7m19s
gpu-operator   nvidia-dcgm-exporter-v2b99                                        1/1     Running     0               7m26s
gpu-operator   nvidia-device-plugin-daemonset-2wx9k                              1/1     Running     0               7m26s
gpu-operator   nvidia-mig-manager-55lr2                                          1/1     Running     0               5m53s
gpu-operator   nvidia-operator-validator-7d8hx                                   1/1     Running     0               7m26s
kube-system    coredns-5dd5756b68-wvzkb                                          1/1     Running     0               8m1s
kube-system    etcd-minikube                                                     1/1     Running     0               8m13s
kube-system    kube-apiserver-minikube                                           1/1     Running     0               8m15s
kube-system    kube-controller-manager-minikube                                  1/1     Running     0               8m13s
kube-system    kube-proxy-qw8qm                                                  1/1     Running     0               8m1s
kube-system    kube-scheduler-minikube                                           1/1     Running     0               8m15s
kube-system    nvidia-device-plugin-daemonset-j2b69                              1/1     Running     0               8m1s
kube-system    storage-provisioner                                               1/1     Running     1 (6m31s ago)   8m12s
```


In [None]:
! kubectl get pods -A
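
With many pods, a per-namespace summary is easier to read than the full table. The sketch below counts pods by namespace and phase from `kubectl get pods -A -o json`. `count_pods` is a hypothetical helper and the sample is a hand-written excerpt using three pod names from the listing above. Note that a pod shown as `Completed` in the `STATUS` column has the phase `Succeeded` in the JSON output.

```python
import json
from collections import Counter


def count_pods(pods_json):
    """Count pods per (namespace, phase) pair."""
    counts = Counter()
    for pod in json.loads(pods_json)["items"]:
        counts[(pod["metadata"]["namespace"], pod["status"]["phase"])] += 1
    return counts


# Hand-written excerpt; a real cluster returns many more fields per pod.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "gpu-operator", "name": "nvidia-cuda-validator-s8vh2"},
     "status": {"phase": "Succeeded"}},
    {"metadata": {"namespace": "gpu-operator", "name": "nvidia-dcgm-exporter-v2b99"},
     "status": {"phase": "Running"}},
    {"metadata": {"namespace": "kube-system", "name": "etcd-minikube"},
     "status": {"phase": "Running"}},
]})

pod_counts = count_pods(sample)
for (namespace, phase), count in sorted(pod_counts.items()):
    print(f"{namespace:<14}{phase:<11}{count}")
```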

Let's now check the resources we have in the minikube cluster. Notice the available GPUs and CPUs.

In [None]:
!kubectl get node minikube -o jsonpath='{.status.capacity}'
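
The capacity map can be parsed to pick out the values of interest. This sketch assumes that `kubectl` prints the map as JSON (as recent versions do) and that GPUs are advertised under the `nvidia.com/gpu` resource name used by the NVIDIA device plugin. `summarize_capacity` is a hypothetical helper and the sample values are illustrative.

```python
import json


def summarize_capacity(capacity_json):
    """Pick CPU, GPU and pod capacity out of a node's capacity map."""
    capacity = json.loads(capacity_json)
    return {
        "cpus": int(capacity["cpu"]),
        # The key is absent on nodes without the NVIDIA device plugin.
        "gpus": int(capacity.get("nvidia.com/gpu", 0)),
        "max_pods": int(capacity["pods"]),
    }


# Illustrative values (assumed, not real class output).
sample = '{"cpu":"48","memory":"452088840Ki","nvidia.com/gpu":"4","pods":"110"}'

summary = summarize_capacity(sample)
print(summary)
```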

---
# 1.3 GPU Application Example

In this section, we will deploy a simple GPU-accelerated application. 

We will use the [cuda-vectoradd](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#cuda-vectoradd) toy application, which generates two random vectors and adds them on the GPU.

Let's print out the YAML configuration file needed to deploy this application:

In [None]:
# check the cuda-vectoradd config 
!cat kubernetes-config/gpu-pod.yaml

The config file defines a pod named `gpu-operator-test` that runs the `cuda-vector-add` container from the `nvidia/samples:vectoradd-cuda11.6.0` image and requests 1 GPU.
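
To make the structure of such a manifest concrete, the sketch below builds an equivalent pod specification as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. This is a reconstruction from the description above, not a copy of `kubernetes-config/gpu-pod.yaml`; the `restartPolicy` value in particular is an assumption, and the real file may contain other fields.

```python
import json

# Pod manifest reconstructed from the description; the real file may differ.
manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-operator-test"},
    "spec": {
        # Assumption: do not restart the container after a successful run.
        "restartPolicy": "OnFailure",
        "containers": [{
            "name": "cuda-vector-add",
            "image": "nvidia/samples:vectoradd-cuda11.6.0",
            # Request one GPU through the NVIDIA device plugin resource.
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}

print(json.dumps(manifest, indent=2))
```

The `nvidia.com/gpu` limit is what tells the scheduler to place the pod on a node with a free GPU.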

To deploy an application, execute the `kubectl apply` command, specifying the YAML configuration file with the `-f` file option.

In [None]:
# Deploy the application (run the pod)
!kubectl apply -f kubernetes-config/gpu-pod.yaml

Once the application is deployed, we can check the status of the pod with:

In [None]:
# Get the status of the pod deployed
!kubectl get pods gpu-operator-test

You might see the status `Pending` or `ContainerCreating`. Try again after a few seconds:

In [None]:
# Run again until the pod status is Completed
!kubectl get pods gpu-operator-test
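
Instead of re-running the command by hand, the check can be automated with a small polling loop. The sketch below is one possible approach, not part of the course material: `kubectl_pod_phase` and `wait_for_phase` are hypothetical helpers, and the timeout and interval are arbitrary. It polls the pod phase, where `ContainerCreating` corresponds to phase `Pending` and `Completed` corresponds to phase `Succeeded`. The loop is exercised here with a fake phase sequence so it runs without a cluster.

```python
import subprocess
import time


def kubectl_pod_phase(name):
    """Return the phase of a pod (Pending, Running, Succeeded, Failed...)."""
    result = subprocess.run(
        ["kubectl", "get", "pod", name, "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_phase(get_phase, wanted=("Succeeded", "Failed"),
                   timeout_s=120, interval_s=2):
    """Poll get_phase() until it returns a wanted phase or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        phase = get_phase()
        if phase in wanted:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError(f"pod still in phase {phase!r}")
        time.sleep(interval_s)


# Dry run with a fake phase sequence, so no cluster is needed.
fake_phases = iter(["Pending", "Pending", "Running", "Succeeded"])
final_phase = wait_for_phase(lambda: next(fake_phases), interval_s=0)
print(final_phase)
```

In the class environment, `wait_for_phase(lambda: kubectl_pod_phase("gpu-operator-test"))` would block until the pod finishes.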

Our application status is now `Completed`. Let's have a look at its execution logs with `kubectl logs`. 

You should see outputs similar to:
```
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

In [None]:
# Let's look at the output
!kubectl logs gpu-operator-test
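
The numbers in the log are related: with 256 threads per block, covering 50000 elements takes 196 blocks, because the block count is rounded up so that every element gets a thread. The sketch below reproduces that arithmetic in plain Python, together with a CPU reference for the element-wise addition. It illustrates the usual launch calculation and is not the sample's source code; `blocks_needed` and `vector_add` are hypothetical names.

```python
import random


def blocks_needed(num_elements, threads_per_block):
    """Ceiling division: smallest block count that covers every element."""
    return (num_elements + threads_per_block - 1) // threads_per_block


def vector_add(a, b):
    """CPU reference for the element-wise addition done on the GPU."""
    return [x + y for x, y in zip(a, b)]


num_elements, threads_per_block = 50000, 256
blocks = blocks_needed(num_elements, threads_per_block)
# Threads in the last block that fall past the end of the vectors.
idle_threads = blocks * threads_per_block - num_elements
print(f"{blocks} blocks of {threads_per_block} threads, {idle_threads} idle")

a = [random.random() for _ in range(num_elements)]
b = [random.random() for _ in range(num_elements)]
c = vector_add(a, b)
```

The 176 surplus threads in the last block are why a CUDA kernel of this kind typically guards its work with an index bounds check.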

Now, let's delete the Kubernetes pod `gpu-operator-test`, as we no longer need it:

In [None]:
# Let's delete the pod
!kubectl delete -f kubernetes-config/gpu-pod.yaml

---
<h2 style="color:green;">Congratulations!</h2>

You've made it through the first section. In this notebook, you have:
- Discovered the class environment configuration
- Interacted with K8s using `kubectl`
- Deployed a simple CUDA application

Next, you'll see how to deploy NIMs that form a complex Retrieval Augmented Generation (RAG) application.

Move on to [02_GitOps_Using_ArgoCD.ipynb](02_GitOps_Using_ArgoCD.ipynb)

<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>