<img src="./images/DLI_Header.png" style="width: 400px;">

# 1.0 Overview of the Class Environment

This notebook introduces the basics of working with AI clusters. You will get an overview of the class environment, which is configured as an AI compute cluster, and experiment with basic commands of the [SLURM cluster management](https://slurm.schedmd.com/overview.html) system.

### Learning Objectives

The goals of this notebook are to:
* Understand the hardware configuration available for the class
* Understand the basic commands for job submission with SLURM
* Run simple test scripts allocating different GPU resources
* Connect interactively to a compute node and observe available resources

**[1.1 The Hardware Configuration Overview](#1.1-The-Hardware-Configuration-Overview)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[1.1.1 Check The Available CPUs](#1.1.1-Check-The-Available-CPUs)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.1.2 Check the Available GPUs](#1.1.2-Check-The-Available-GPUs)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.1.3 Check The Interconnect Topology](#1.1.3-Check-The-Interconnect-Topology)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.1.4 Bandwidth & Connectivity Tests](#1.1.4-Bandwidth-and-Connectivity-Tests)<br>
**[1.2 Basic SLURM Commands](#1.2-Basic-SLURM-Commands)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[1.2.1 Check the SLURM Configuration](#1.2.1-Check-the-SLURM-Configuration)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.2.2 Submit Jobs Using SRUN Command](#1.2.2-Submit-jobs-using-SRUN-Command)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.2.3 Submit Jobs Using SBATCH Command](#1.2.3-Submit-jobs-using-SBATCH-Command)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.2.4 Exercise: Submit Jobs Using SBATCH Command Requesting More Resources](#1.2.4-Exercise-Submit-jobs-using-SBATCH-Command)<br>
**[1.3 Run Interactive Sessions](#1.3-Run-Interactive-Sessions)<br>**

---
# 1.1 The Hardware Configuration Overview


A modern AI cluster is infrastructure designed for efficient Deep Learning model development. NVIDIA has designed DGX servers as a full-stack solution for scalable AI development. Click the link to learn more about [DGX systems](https://www.nvidia.com/en-gb/data-center/dgx-systems/).

Different deliveries of this course may have different hardware configurations. For benchmarking purposes, we will be using 4 A100s as a reference. This is about half the resources of a DGX 8xA100 server system (4 A100 GPUs, 4 NVLinks per GPU).

<img  src="images/nvlink_v2.png" width="600"/>

The hardware for this class has already been configured as a GPU cluster unit for Deep Learning. The cluster is organized as compute units (nodes) that can be allocated using a cluster manager (for example, SLURM). Among the hardware components, the cluster includes CPUs (Central Processing Units), GPUs (Graphics Processing Units), storage and networking.

Let's look at the GPUs, CPUs and network design available in this class.

## 1.1.1 Check the Available CPUs 

We can check the CPU information of the system using the `lscpu` command. 

This example output shows an `x86_64` AMD processor with 48 cores per socket.
```
Architecture:                    x86_64
Core(s) per socket:              48
Model name:                      AMD EPYC 7V13 64-Core Processor
```
For a complete description of the CPU processor architecture, check the `/proc/cpuinfo` file.


In [1]:
# Display information CPUs
!lscpu

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             96
On-line CPU(s) list:                0-95
Thread(s) per core:                 1
Core(s) per socket:                 48
Socket(s):                          2
NUMA node(s):                       4
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              1
Model name:                         AMD EPYC 7V13 64-Core Processor
Stepping:                           1
CPU MHz:                            2445.434
BogoMIPS:                           4890.86
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          3 MiB
L1i cache:                          3 MiB
L2 cache:                           48 MiB
L3 cache:         

In [2]:
# Check the number of CPU cores
!grep 'cpu cores' /proc/cpuinfo | uniq

cpu cores	: 48
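The `lscpu` fields above can be combined to recover the total number of logical CPUs. The following is a minimal sketch, not part of the original lab: the helper names `parse_lscpu` and `logical_cpus` are hypothetical, and the sample text is copied from the `lscpu` output shown earlier in this section.

```python
# Sample lines copied from the `lscpu` output shown in this notebook.
LSCPU_SAMPLE = """\
CPU(s):                             96
Thread(s) per core:                 1
Core(s) per socket:                 48
Socket(s):                          2
"""


def parse_lscpu(text):
    """Turn `key: value` lines from lscpu into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def logical_cpus(fields):
    """Logical CPUs = sockets x cores per socket x threads per core."""
    return (
        int(fields["Socket(s)"])
        * int(fields["Core(s) per socket"])
        * int(fields["Thread(s) per core"])
    )


cpu_fields = parse_lscpu(LSCPU_SAMPLE)
total_cpus = logical_cpus(cpu_fields)
print(total_cpus)
```

With 2 sockets, 48 cores per socket and 1 thread per core, the product matches the `CPU(s)` field reported by `lscpu`. This is why `grep 'cpu cores'` reports 48 while the system exposes 96 CPUs in total.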


## 1.1.2 Check the Available  GPUs 

The NVIDIA System Management Interface `nvidia-smi` is a command for monitoring NVIDIA GPU devices. Several key details are listed, such as the CUDA and GPU driver versions, the number and type of GPUs available, the memory of each GPU, and the running GPU processes.

In the following example, the `nvidia-smi` command shows the available GPUs, each with approximately 80 GB of memory.

<img  src="images/nvidia_smi.png" width="600"/>

For more details, refer to the [nvidia-smi documentation](https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf).
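When only a few fields are needed, `nvidia-smi` can print them as CSV through its `--query-gpu` and `--format` options, which is easier to process than the full table. The sketch below parses such output in Python. The sample text is hand-written from the GPU name and memory size shown in this notebook (it is not captured output), and `parse_gpu_query` is a hypothetical helper name.

```python
import csv
import io

# Command that would produce the CSV lines on a machine with NVIDIA GPUs.
QUERY_COMMAND = (
    "nvidia-smi --query-gpu=index,name,memory.total "
    "--format=csv,noheader,nounits"
)

# Hand-written sample (assumption): 4 GPUs with the name and memory
# size reported in this notebook; memory is in MiB.
GPU_QUERY_SAMPLE = """\
0, NVIDIA A100 80GB PCIe, 81920
1, NVIDIA A100 80GB PCIe, 81920
2, NVIDIA A100 80GB PCIe, 81920
3, NVIDIA A100 80GB PCIe, 81920
"""


def parse_gpu_query(text):
    """Parse `index, name, memory` CSV rows into a list of dictionaries."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue
        index, name, memory_mib = row
        gpus.append(
            {"index": int(index), "name": name, "memory_mib": int(memory_mib)}
        )
    return gpus


gpus = parse_gpu_query(GPU_QUERY_SAMPLE)
total_gib = sum(gpu["memory_mib"] for gpu in gpus) / 1024
print(len(gpus), "GPUs,", total_gib, "GiB in total")
```

In a notebook cell, the sample string could be replaced by the captured output of `QUERY_COMMAND`.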

In [3]:
# Display information about GPUs
!nvidia-smi

Thu Mar 21 14:31:15 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                    0 |
| N/A   38C    P0              55W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000002:00:00.0 Off |  

## 1.1.3 Check the Available Interconnect Topology 



<img  align="right" src="images/nvlink_nvidia.png" width="420"/>

The multi-GPU system configuration needs a fast and scalable interconnect. [NVIDIA NVLink technology](https://www.nvidia.com/en-us/data-center/nvlink/) is a direct GPU-to-GPU interconnect providing high bandwidth and improving scalability for multi-GPU systems.

To check the available interconnect topology, we can use the `nvidia-smi topo --matrix` command. In the reference configuration for this class, we should get 4 NVLinks per GPU device; other hardware configurations may report a different number.

```
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV12    SYS     SYS     0-23            N/A
GPU1    NV12     X      SYS     SYS     24-47           N/A
GPU2    SYS     SYS      X      NV12    48-71           N/A
GPU3    SYS     SYS     NV12     X      72-95           N/A

Where X= Self and NV# = Connection traversing a bonded set of # NVLinks
```

In this environment, notice that GPU0 and GPU1 are connected by a bonded set of 12 NVLinks (`NV12`), as are GPU2 and GPU3, while the two pairs reach each other only over the system interconnect (`SYS`).

---

`SYS` in the topology matrix means that two GPUs communicate over PCIe and the SMP interconnect between NUMA nodes, rather than over a direct high-speed NVLink connection.

NVLink is an interconnect technology developed by NVIDIA that enables high-bandwidth data transfer between GPUs and between GPUs and CPUs. It is designed to be much faster than the traditional PCIe bus found in most computers.

The matrix shows the connectivity between four GPUs (GPU0 to GPU3). Here's how to interpret the connections:

- GPU0 is connected to GPU1 through a bonded set of 12 NVLinks (`NV12`) and to GPUs 2 and 3 through the system interconnect (`SYS`).
- GPU1 is similarly connected to GPU0 through 12 NVLinks and to GPUs 2 and 3 through the system interconnect.
- GPU2 has a system interconnect to GPUs 0 and 1 and 12 NVLinks to GPU3.
- GPU3 has a system interconnect to GPUs 0 and 1 and 12 NVLinks to GPU2.

Ideally, every GPU would reach every other GPU over NVLink. In this topology, only the pairs GPU0-GPU1 and GPU2-GPU3 are linked by NVLink, and traffic between the two pairs crosses the slower system interconnect. This can affect the performance of tasks that require high-speed GPU-to-GPU communication, such as deep learning and complex simulations.

In [4]:
# Check Interconnect Topology 
!nvidia-smi topo --matrix

	GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV12	SYS	SYS	0-23	0		N/A
GPU1	NV12	 X 	SYS	SYS	24-47	1		N/A
GPU2	SYS	SYS	 X 	NV12	48-71	2		N/A
GPU3	SYS	SYS	NV12	 X 	72-95	3		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
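The matrix printed above can also be read programmatically, for example to list which GPU pairs are joined by NVLink. The following is a minimal sketch under stated assumptions: `parse_topology` is a hypothetical helper, the sample text reproduces the matrix from this section, and only the GPU-to-GPU columns are interpreted.

```python
import re

# Sample matrix reproduced from the `nvidia-smi topo --matrix` output above.
TOPO_SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV12    SYS     SYS     0-23            N/A
GPU1    NV12     X      SYS     SYS     24-47           N/A
GPU2    SYS     SYS      X      NV12    48-71           N/A
GPU3    SYS     SYS     NV12     X      72-95           N/A
"""

GPU_NAME = re.compile(r"GPU\d+")


def parse_topology(text):
    """Return {(src, dst): link_type} for every GPU pair with src < dst."""
    lines = [line for line in text.splitlines() if line.strip()]
    # The header row names the GPU columns; other header words are ignored.
    gpu_names = [tok for tok in lines[0].split() if GPU_NAME.fullmatch(tok)]
    links = {}
    for line in lines[1:]:
        tokens = line.split()
        if not GPU_NAME.fullmatch(tokens[0]):
            continue  # skip legend lines or anything that is not a GPU row
        src = gpu_names.index(tokens[0])
        for dst, cell in enumerate(tokens[1:1 + len(gpu_names)]):
            if src < dst:
                links[(src, dst)] = cell
    return links


links = parse_topology(TOPO_SAMPLE)
# NV# means a bonded set of # NVLinks, so the number follows the "NV" prefix.
nvlink_counts = {
    pair: int(cell[2:]) for pair, cell in links.items() if cell.startswith("NV")
}
print(nvlink_counts)
```

Every pair that is absent from `nvlink_counts` is connected through one of the PCIe-based paths listed in the legend (here always `SYS`).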


It is also possible to check the NVLink status and bandwidth using the `nvidia-smi nvlink --status` command. You should see similar output for each device.
```
GPU 0: Graphics Device
	 Link 0: 25 GB/s
	 Link 1: 25 GB/s
	 Link 2: 25 GB/s
	 Link 3: 25 GB/s
```
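To turn the per-link listing into a single figure per GPU, the link rates can be counted and summed. This is a rough sketch, assuming the text layout of the example above. `nvlink_summary` is a hypothetical helper, and the sum is a nominal aggregate of the reported link rates, not a measured bandwidth.

```python
import re

# Sample text reproduced from the example output above.
NVLINK_SAMPLE = """\
GPU 0: Graphics Device
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
"""

GPU_LINE = re.compile(r"^GPU\s+(\d+):")
LINK_LINE = re.compile(r"Link\s+\d+:\s+([\d.]+)\s+GB/s")


def nvlink_summary(text):
    """Count the links and sum their reported rates for each GPU."""
    summary = {}
    current = None
    for line in text.splitlines():
        gpu = GPU_LINE.match(line)
        if gpu:
            current = int(gpu.group(1))
            summary[current] = {"links": 0, "total_gb_s": 0.0}
            continue
        link = LINK_LINE.search(line)
        if link and current is not None:
            summary[current]["links"] += 1
            summary[current]["total_gb_s"] += float(link.group(1))
    return summary


nvlink_totals = nvlink_summary(NVLINK_SAMPLE)
print(nvlink_totals)
```

Applied to an environment that reports 12 links of 25 GB/s per GPU, the same function would report a nominal total of 300 GB/s per GPU.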

In [5]:
# Check nvlink status
!nvidia-smi nvlink --status

GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-faed01c7-25fe-0f09-5dee-1012b6f0b3c3)
	 Link 0: 25 GB/s
	 Link 1: 25 GB/s
	 Link 2: 25 GB/s
	 Link 3: 25 GB/s
	 Link 4: 25 GB/s
	 Link 5: 25 GB/s
	 Link 6: 25 GB/s
	 Link 7: 25 GB/s
	 Link 8: 25 GB/s
	 Link 9: 25 GB/s
	 Link 10: 25 GB/s
	 Link 11: 25 GB/s
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-f5a14843-d8a5-c0de-f673-0ede7190fdce)
	 Link 0: 25 GB/s
	 Link 1: 25 GB/s
	 Link 2: 25 GB/s
	 Link 3: 25 GB/s
	 Link 4: 25 GB/s
	 Link 5: 25 GB/s
	 Link 6: 25 GB/s
	 Link 7: 25 GB/s
	 Link 8: 25 GB/s
	 Link 9: 25 GB/s
	 Link 10: 25 GB/s
	 Link 11: 25 GB/s
GPU 2: NVIDIA A100 80GB PCIe (UUID: GPU-0dfcb01d-0d50-c193-026b-a3ab60cff441)
	 Link 0: 25 GB/s
	 Link 1: 25 GB/s
	 Link 2: 25 GB/s
	 Link 3: 25 GB/s
	 Link 4: 25 GB/s
	 Link 5: 25 GB/s
	 Link 6: 25 GB/s
	 Link 7: 25 GB/s
	 Link 8: 25 GB/s
	 Link 9: 25 GB/s
	 Link 10: 25 GB/s
	 Link 11: 25 GB/s
GPU 3: NVIDIA A100 80GB PCIe (UUID: GPU-eaef7178-19f7-0d0c-07ba-c428f755500a)
	 Link 0: 25 GB/s
	 Link 1: 25 GB/

## 1.1.4 Bandwidth & Connectivity Tests


NVIDIA provides an application **p2pBandwidthLatencyTest** that demonstrates CUDA Peer-To-Peer (P2P) data transfers between pairs of GPUs by computing bandwidth and latency while enabling and disabling NVLinks. This tool is part of the code samples for CUDA Developers [cuda-samples](https://github.com/NVIDIA/cuda-samples.git). 

Example outputs are shown below. Notice the Device to Device (D\D) bandwidth differences when enabling and disabling NVLinks (P2P).

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 1529.61 516.36  20.75  21.54 
     1 517.04 1525.88  20.63  21.33 
     2  20.32  20.17 1532.61 517.23 
     3  20.95  20.83 517.98 1532.61 

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 1532.61  18.09  20.79  21.52 
     1  18.11 1531.11  20.65  21.33 
     2  20.32  20.17 1528.12  28.89 
     3  20.97  20.82  28.36 1531.11 
```
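The effect of NVLink becomes easier to see when the two matrices are divided element by element. The sketch below uses the example figures from the matrices above. The function name `p2p_speedup` and the threshold of 2.0 used to flag accelerated pairs are illustrative assumptions.

```python
# Bidirectional bandwidth in GB/s, copied from the example matrices above.
P2P_ENABLED = [
    [1529.61, 516.36, 20.75, 21.54],
    [517.04, 1525.88, 20.63, 21.33],
    [20.32, 20.17, 1532.61, 517.23],
    [20.95, 20.83, 517.98, 1532.61],
]
P2P_DISABLED = [
    [1532.61, 18.09, 20.79, 21.52],
    [18.11, 1531.11, 20.65, 21.33],
    [20.32, 20.17, 1528.12, 28.89],
    [20.97, 20.82, 28.36, 1531.11],
]


def p2p_speedup(enabled, disabled):
    """Ratio enabled/disabled for every pair of distinct devices."""
    size = len(enabled)
    return {
        (src, dst): enabled[src][dst] / disabled[src][dst]
        for src in range(size)
        for dst in range(size)
        if src != dst  # the diagonal is on-device bandwidth, not P2P
    }


speedups = p2p_speedup(P2P_ENABLED, P2P_DISABLED)
# Illustrative threshold (assumption): pairs that gain more than 2x.
accelerated_pairs = sorted(pair for pair, ratio in speedups.items() if ratio > 2.0)

for pair in accelerated_pairs:
    print(pair, round(speedups[pair], 1))
```

Only the pairs (0, 1) and (2, 3), in both directions, gain from enabling P2P, which agrees with the `NV12` entries in the topology matrix. All other pairs stay close to a ratio of 1.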


In [6]:
# Tests on GPU pairs using P2P and without P2P 
#`git clone --depth 1 --branch v11.2 https://github.com/NVIDIA/cuda-samples.git`
!/dli/cuda-samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100 80GB PCIe, pciBusID: 0, pciDeviceID: 0, pciDomainID:1
Device: 1, NVIDIA A100 80GB PCIe, pciBusID: 0, pciDeviceID: 0, pciDomainID:2
Device: 2, NVIDIA A100 80GB PCIe, pciBusID: 0, pciDeviceID: 0, pciDomainID:3
Device: 3, NVIDIA A100 80GB PCIe, pciBusID: 0, pciDeviceID: 0, pciDomainID:4
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matri

---
# 1.2 Basic SLURM Commands

Now that we've seen how GPUs can communicate with each other over NVLink, let's go over how the hardware resources can be organized into compute nodes. These nodes can be managed by a cluster manager such as [*Slurm Workload Manager*](https://slurm.schedmd.com/), an open-source cluster management and job scheduling system for large and small Linux clusters. 


For this lab, we have configured a SLURM manager where the 4 available GPUs are partitioned into 2 nodes: **slurmnode1** 
and **slurmnode2**, each with 2 GPUs. 

Next, let's see some basic SLURM commands. More SLURM commands can be found in the [SLURM official documentation](https://slurm.schedmd.com/).

<img src="images/cluster_overview.png" width="500"/>

## 1.2.1 Check the SLURM Configuration

We can check the available resources in the SLURM cluster by running `sinfo`. The output will show that there are 2 nodes in the cluster **slurmnode1** and **slurmnode2**. Both nodes are currently idle.

In [7]:
# Check available resources in the cluster
!sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmpar*    up   infinite      2   idle slurmnode[1-2]
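The `sinfo` table is plain whitespace-separated text, so it is straightforward to turn into Python data, for example to check programmatically that all nodes are idle before submitting work. The following is a minimal sketch: `parse_sinfo` and `expand_nodelist` are hypothetical helpers, and the node list expansion only handles a single bracket group such as `slurmnode[1-2]` without zero padding.

```python
import re

# Sample text copied from the `sinfo` output above.
SINFO_SAMPLE = """\
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmpar*    up   infinite      2   idle slurmnode[1-2]
"""


def expand_nodelist(nodelist):
    """Expand 'name[1-2]' or 'name[1,3]' into individual node names."""
    match = re.fullmatch(r"(.+?)\[([\d,\-]+)\]", nodelist)
    if not match:
        return [nodelist]  # already a single node name
    prefix, body = match.groups()
    nodes = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-")
            nodes.extend(f"{prefix}{i}" for i in range(int(start), int(end) + 1))
        else:
            nodes.append(f"{prefix}{part}")
    return nodes


def parse_sinfo(text):
    """Parse the default `sinfo` table into a list of dictionaries."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    partitions = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        # sinfo marks the default partition with a trailing asterisk.
        row["DEFAULT"] = row["PARTITION"].endswith("*")
        row["PARTITION"] = row["PARTITION"].rstrip("*")
        row["NODES"] = int(row["NODES"])
        row["NODELIST"] = expand_nodelist(row["NODELIST"])
        partitions.append(row)
    return partitions


partitions = parse_sinfo(SINFO_SAMPLE)
print(partitions[0]["NODELIST"], partitions[0]["STATE"])
```

For node list expressions more complex than this sketch handles, the SLURM command `scontrol show hostnames` performs the expansion.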


##  1.2.2 Submit Jobs Using `srun` Command

The `srun` command runs parallel jobs on the cluster. 

The argument **-N** (or *--nodes*) specifies the number of nodes allocated to a job. It is also possible to allocate a subset of the GPUs available within a node with the argument **-G** (or *--gpus*).

Check out the [SLURM official documentation](https://slurm.schedmd.com/) for more arguments.
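When `srun` is launched from Python rather than with the `!` shell escape, it helps to build the argument list explicitly. The sketch below only assembles the command line from the `-N` and `-G` arguments described above. `build_srun_command` is a hypothetical helper, and whether a `-G` request is honoured depends on how GPUs are configured in the cluster.

```python
import shlex


def build_srun_command(command, nodes=1, gpus=None):
    """Return the argument list for an srun invocation.

    nodes maps to -N (--nodes); gpus, when given, maps to -G (--gpus).
    """
    args = ["srun", "-N", str(nodes)]
    if gpus is not None:
        args += ["-G", str(gpus)]
    args += shlex.split(command)
    return args


one_gpu = build_srun_command("nvidia-smi", nodes=1, gpus=1)
two_nodes = build_srun_command("nvidia-smi", nodes=2)
print(shlex.join(one_gpu))
print(shlex.join(two_nodes))
```

On a machine where SLURM is installed, the list can be passed to `subprocess.run(one_gpu, check=True)` to execute the job.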

To test running parallel jobs, let's submit a job that requests 1 node (2 GPUs) and run a simple command on it: `nvidia-smi`. We should see the output of 2 GPUs available in the allocated node.

In [8]:
# run nvidia-smi slurm job with 1 node allocation
!srun -N 1 nvidia-smi

Thu Mar 21 14:31:22 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                    0 |
| N/A   39C    P0              75W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000002:00:00.0 Off |  

Great! Let's now allocate 2 nodes and run the `nvidia-smi` command again.

We should see the results of both nodes showing the available GPU devices. Notice that the stdout might be scrambled due to the asynchronous and parallel execution of the `nvidia-smi` command on the two nodes.

In [9]:
# run nvidia-smi slurm job with 2 node allocation.
!srun -N 2 nvidia-smi

Thu Mar 21 14:31:22 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                    0 |
| N/A   39C    P0              75W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000002:00:00.0 Off |  
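One way to untangle interleaved output is to run `srun` with the `-l` (`--label`) option, which prefixes every line with the number of the task that produced it, and then group the lines by that prefix. The sketch below shows the grouping step only. The sample lines are hand-written for illustration (they are not captured output), and `group_by_task` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Hand-written sample (assumption) of interleaved `srun -l` output
# from two tasks, one per node.
LABELED_SAMPLE = """\
0: Thu Mar 21 14:31:22 2024
1: Thu Mar 21 14:31:22 2024
1: | NVIDIA-SMI 535.104.12 |
0: | NVIDIA-SMI 535.104.12 |
"""

# A task number, a colon, an optional space, then the original line.
LABEL = re.compile(r"^\s*(\d+): ?(.*)$")


def group_by_task(text):
    """Collect the lines of each task in their original order."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        match = LABEL.match(line)
        if match:
            grouped[int(match.group(1))].append(match.group(2))
    return dict(grouped)


grouped = group_by_task(LABELED_SAMPLE)
for task in sorted(grouped):
    print(f"--- task {task} ---")
    print("\n".join(grouped[task]))
```

Each task's lines stay in order relative to each other, so the per-task output reads like a sequential run.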

## 1.2.3 Submit Jobs Using `sbatch` Command 

In the previous examples, we allocated resources to run one single command. For more complex jobs, the `sbatch` command allows submitting batch scripts to SLURM by specifying the resources and all environment variables required for executing the job. `sbatch` will transfer the execution to the SLURM Manager after automatically populating the arguments.

In the batch script below, `#SBATCH ...` is used to specify resources and other options relating to the job to be executed:

```
        #!/bin/bash
        #SBATCH -N 1                               # Node count to be allocated for the job
        #SBATCH --job-name=dli_firstSlurmJob       # Job name
        #SBATCH -o /dli/nemo/logs/%j.out       # Outputs log file 
        #SBATCH -e /dli/nemo/logs/%j.err       # Errors log file

        srun -l my_script.sh                       # my SLURM script 
```
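Because a batch script is just text, it can also be generated from parameters, which avoids editing the file by hand for every variation. The following sketch renders a script with the same directives as the template above. `render_sbatch` is a hypothetical helper, and its default values are taken from this notebook.

```python
def render_sbatch(command, nodes=1, job_name="dli_firstSlurmJob",
                  log_dir="/dli/nemo/logs"):
    """Render a batch script with the directives used in this section.

    %j in the log file names is replaced by SLURM with the job ID.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH -N {nodes}",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH -o {log_dir}/%j.out",
        f"#SBATCH -e {log_dir}/%j.err",
        "",
        f"srun -l {command}",
    ]
    return "\n".join(lines) + "\n"


script = render_sbatch("/dli/code/test.sh")
print(script)
```

The returned string could be written to a `.sbatch` file with `open(path, "w")` and then submitted with `sbatch`.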

Before we submit the `sbatch` batch script, let's first prepare a job that will be executed: a short batch script that will sleep for 2 seconds before running the `nvidia-smi` command.

In [10]:
!chmod +x /dli/code/test.sh # "chmod" modifies the permissions and access mode of files and directories
                            # +x makes the file executable
# Check the batch script 
!cat /dli/code/test.sh # "cat" displays the file contents
                       # The first line, "#!/bin/bash", tells the system to execute the script with bash

#!/bin/bash
sleep 2
nvidia-smi


To submit this batch script job, let's create an `sbatch` script that initiates the resources to be allocated and submits the test.sh job.

The following cell will edit the `test_sbatch.sbatch` script allocating 1 node.

---
The steps for submitting a job with the `sbatch` command can be summarized as follows:

1. **sbatch**: The `sbatch` command is used to submit a job to the SLURM scheduler. This job will be queued and executed when the resources specified in the `sbatch` script become available. 

2. **SLURM**: The Simple Linux Utility for Resource Management (SLURM) is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

3. **Batch Script**: The batch script shown above is written for the SLURM scheduler. It contains several `#SBATCH` directives which specify job settings like the number of nodes, job name, output, and error log paths.

   - `#SBATCH -N 1` indicates that the job should use one compute node.
   - `#SBATCH --job-name=dli_firstSlurmJob` sets the job name to `dli_firstSlurmJob`.
   - `#SBATCH -o /dli/nemo/logs/%j.out` specifies where to write the standard output of the job. `%j` is replaced by the job ID.
   - `#SBATCH -e /dli/nemo/logs/%j.err` specifies where to write the standard error output of the job.

4. **Script Execution**: The command `srun -l my_script.sh` at the end of the batch script is executed on the allocated node(s). This script contains the actual commands that need to be run as part of the job. The `-l` option prefixes each line of output with the task number.

5. **Job Preparation**: Before the job can be submitted, the script must be made executable (`chmod +x /dli/code/test.sh`). Its contents can be checked with `cat /dli/code/test.sh`.

6. **Job Submission Script**: One cell writes the `test_sbatch.sbatch` script, which sets up the `sbatch` job configuration, and another cell inspects this script using the `cat` command.

7. **Final Submission**: The last step is to submit the job to the SLURM scheduler with `sbatch /dli/code/test_sbatch.sbatch`.

This process is typical for running batch jobs on an HPC cluster, where jobs are queued and managed by a scheduler to optimize the use of compute resources.

In [11]:
%%writefile /dli/code/test_sbatch.sbatch
#!/bin/bash

#SBATCH -N 1
#SBATCH --job-name=dli_firstSlurmJob
#SBATCH -o /dli/nemo/logs/%j.out
#SBATCH -e /dli/nemo/logs/%j.err

srun -l /dli/code/test.sh  

Overwriting /dli/code/test_sbatch.sbatch


In [12]:
# Check the sbatch script 
!cat /dli/code/test_sbatch.sbatch # "cat" command can be used to display the content of a file, copy content from one file to another, concatenate the contents of multiple files, display 
                                  # the line number, display $ at the end of the line, etc.

#!/bin/bash

#SBATCH -N 1
#SBATCH --job-name=dli_firstSlurmJob
#SBATCH -o /dli/nemo/logs/%j.out
#SBATCH -e /dli/nemo/logs/%j.err

srun -l /dli/code/test.sh  


Now let's submit the `sbatch` job and check the SLURM scheduler. The batch script will be queued and executed when the requested resources are available.

The `squeue` command shows the running or pending jobs. An output example is shown below: 

```
Submitted batch job **
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                **  slurmpar test_sba    admin  R       0:01      1 slurmnode1

```

It shows the SLURM job ID, the job name, the user ID, the job status (R = running), the running time and the name of the allocated node.
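The columns described above can be extracted with a few lines of Python. This is a minimal sketch, assuming the default `squeue` column layout. `parse_squeue` is a hypothetical helper, the sample text is the `squeue` output recorded later in this notebook, and the state table covers only a few common SLURM state codes.

```python
# Sample text copied from the `squeue` output recorded in this notebook.
SQUEUE_SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                24  slurmpar dli_firs     root PD       0:00      1 (None)
"""

# A few common SLURM job state codes (not an exhaustive list).
STATE_NAMES = {
    "PD": "pending",
    "R": "running",
    "CG": "completing",
    "CD": "completed",
}


def parse_squeue(text):
    """Parse the default `squeue` table into a list of dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Limit the splits so the last column keeps any embedded spaces.
        fields = line.split(None, len(header) - 1)
        job = dict(zip(header, fields))
        job["STATE"] = STATE_NAMES.get(job["ST"], job["ST"])
        jobs.append(job)
    return jobs


jobs = parse_squeue(SQUEUE_SAMPLE)
print(jobs[0]["JOBID"], jobs[0]["NAME"], jobs[0]["STATE"])
```

Note that the default `NAME` column is truncated to 8 characters, so `dli_firstSlurmJob` appears as `dli_firs`. An empty queue yields an empty list, because only the header line is present.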

The following cell submits the `sbatch` job, collects the `JOBID` variable (for querying the logs later) and checks the jobs in the SLURM scheduling queue.

---

The `JOBID` in a SLURM managed cluster is not reset to 1 upon restarting your kernel or even the SLURM service itself. It is a sequentially increasing number managed by the SLURM controller that uniquely identifies each job submitted to the queue. This is intentional for several reasons:

1. **Uniqueness**: Each job must have a unique identifier so that there is no ambiguity when referring to jobs, especially in a multi-user environment where many jobs may be submitted, running, or have completed.
   
2. **Accounting and Logging**: SLURM keeps detailed accounting logs that track resource usage, job status, and other metadata associated with job execution. Using a continuously incrementing `JOBID` helps maintain accurate and consistent records across system reboots and service restarts.

3. **Resilience**: By not resetting the `JOBID` counter, the system avoids the risk of job ID collisions, which could potentially cause issues with job dependencies, accounting, and tracking.

The `JOBID` is designed to be a system-wide counter and not a session or user-specific counter. This is why when you restart your Jupyter kernel, which is essentially restarting your Python session, the `JOBID` does not reset. The SLURM controller, which assigns these IDs, operates independently of your individual session and maintains its state across individual job submissions and system reboots until it is explicitly reset by an administrator, or it rolls over due to reaching the maximum value.

---

In a Jupyter Notebook, lines starting with an exclamation mark `!` execute shell commands in the underlying operating system. The cell below combines shell commands with Python code to interact with the SLURM job scheduling system.

Here's what each part of the snippet does:

1. `JOBID=!squeue -u root | grep dli | awk '{print $1}'`: This line performs several actions:
   - `squeue -u root`: This is a SLURM command that lists all jobs queued or running for the user `root`.
   - `| grep dli`: The `grep` command filters the output from `squeue` to only include lines that contain the string "dli", which matches the job name `dli_firstSlurmJob` set in the batch script (displayed truncated as `dli_firs` by `squeue`).
   - `| awk '{print $1}'`: This part of the command uses `awk`, a powerful text-processing tool, to print only the first field of each line, which in the output of `squeue` corresponds to the job ID.
   - The `!` after the `=` sign tells the Jupyter Notebook to run the command in the shell, and the output lines are assigned to the Python variable `JOBID`.

2. `slurm_job_output='/dli/nemo/logs/'+JOBID[0]+'.out'`: This line is constructing a file path for the log file associated with the job ID obtained in the previous step. In Python, `JOBID[0]` would refer to the first element of a list named `JOBID`. Since the previous command should return a list with a single string (the job ID), `JOBID[0]` gets that job ID.

   - `/dli/nemo/logs/`: This is the directory path where SLURM log files are stored.
   - `JOBID[0]`: This is the job ID extracted from the previous command.
   - `'.out'`: This is the file extension for the standard output log file. When combined, this creates a path to the SLURM log file for the given job ID.

Together, these commands are used to identify the log file associated with a SLURM job and to construct the path to that log file so that it can be accessed later, possibly to check the output of the job or to diagnose any issues that occurred during its execution.

In [13]:
# Submit the job
!sbatch /dli/code/test_sbatch.sbatch

# Get the JOBID variable
JOBID=!squeue -u root | grep dli | awk '{print $1}'
slurm_job_output='/dli/nemo/logs/'+JOBID[0]+'.out'

"""
Grep, or global regular expression print, is one of the most versatile and useful Linux commands available. It searches for text and strings defined by users in a given file. 
Grep allows users to search files for a specific pattern or word and see which lines contain it.

The awk command's main purpose is to make information retrieval and text manipulation easy to perform in Linux. This Linux command works by scanning a set of input lines in 
order and searches for lines matching the patterns specified by the user.

"""

# check the jobs in the SLURM scheduling queue
!squeue

Submitted batch job 24
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                24  slurmpar dli_firs     root PD       0:00      1 (None)
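Filtering `squeue` by name returns nothing once the job has left the queue, and it can match more than one job. An alternative is to read the job ID from the message that `sbatch` itself prints. The sketch below does this with a regular expression. `parse_job_id` and `log_paths` are hypothetical helpers, and the log directory is the one used in this notebook.

```python
import re


def parse_job_id(sbatch_stdout):
    """Extract the job ID from the 'Submitted batch job N' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_stdout!r}")
    return int(match.group(1))


def log_paths(job_id, log_dir="/dli/nemo/logs"):
    """Build the stdout and stderr log paths that follow the %j pattern."""
    return f"{log_dir}/{job_id}.out", f"{log_dir}/{job_id}.err"


# Message copied from the cell output above.
job_id = parse_job_id("Submitted batch job 24")
out_path, err_path = log_paths(job_id)
print(job_id, out_path, err_path)
```

In a notebook, the message can be captured with `submission = !sbatch /dli/code/test_sbatch.sbatch` and passed in as `submission[0]`. `sbatch` also accepts a `--parsable` option that prints the job ID without the surrounding text.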


The output log file for the executed job (**JOBID.out**) is automatically created to gather the outputs.

In our case, we should see the results of `nvidia-smi` command that was executed in the `test.sh` script submitted with 1 node allocation. Let's have a look at execution logs:


In [14]:
# Wait 3 seconds to let the job execute and get the populated logs 
!sleep 3

# Check the execution logs 
!cat $slurm_job_output

0: Thu Mar 21 14:31:28 2024       
0: +---------------------------------------------------------------------------------------+
0: | NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
0: |-----------------------------------------+----------------------+----------------------+
0: | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
0: | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
0: |                                         |                      |               MIG M. |
0: |   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                    0 |
0: | N/A   38C    P0              55W / 300W |      4MiB / 81920MiB |      0%      Default |
0: |                                         |                      |             Disabled |
0: +-----------------------------------------+----------------------+----------------------+
0: |   1  NVIDIA A100 80GB PCIe    
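A fixed `sleep 3` may be too short when the job is still pending in the queue, as in the `squeue` output shown earlier. Polling for the log file with a timeout is a more robust pattern. The following is a minimal sketch: `wait_for_file` is a hypothetical helper, and its timeout and polling interval are arbitrary defaults.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll_interval=0.5):
    """Wait until `path` exists and is non-empty, or until the timeout.

    Returns True if the file has content, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return True
        time.sleep(poll_interval)
    # One last check in case the file appeared during the final sleep.
    return os.path.exists(path) and os.path.getsize(path) > 0
```

A cell could call `wait_for_file(slurm_job_output)` before displaying the log. A non-empty file only shows that the job has started writing output; checking `squeue` remains the way to confirm that the job has finished.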

## 1.2.4  Exercise: Submit Jobs Using `sbatch` Command  Requesting More Resources


Using what you have learned, submit the previous `test.sh` batch script with the `sbatch` command on **2 nodes** allocation.

To do so, you will need to:
1. Modify the `test_sbatch.sbatch` script to allocate 2 Nodes 
2. Submit the script again using `sbatch` command
3. Check the execution logs 


If you get stuck, you can look at the [solution](solutions/ex1.2.4.ipynb).

- The `#!/bin/bash` line at the top of a script tells the system to execute the script with bash.

In [25]:
%%writefile /dli/code/test_sbatch.sbatch
#!/bin/bash
# 1. Modify the `test_sbatch.sbatch` script to allocate 2 Nodes
# Note: the log directory below must exist before the job runs

#SBATCH -N 2
#SBATCH --job-name=dli_2nodes
#SBATCH -o /dli/nemo/log_2nodes/%j.out
#SBATCH -e /dli/nemo/log_2nodes/%j.err

srun -l /dli/code/test.sh


In [16]:
# 2. Submit the script again using `sbatch` command

In [17]:
# 3. Check the execution logs 

---
# 1.3 Run Interactive Sessions 

Interactive sessions allow you to connect directly to a worker node and interact with it through the terminal. 

The SLURM manager can allocate resources for an interactive session using the `--pty` argument as follows: `srun -N 1 --pty /bin/bash`. 
The session is closed when you exit the node or cancel the interactive session job using the command `scancel JOBID`.


Since this is an interactive session, we first need to launch a terminal window and submit a SLURM job allocating resources in interactive mode. To do so, we will follow these 4 steps: 
1. Launch a terminal session
2. Check the GPU resources using the command `nvidia-smi` 
3. Run an interactive session requesting 1 node by executing `srun -N 1 --pty /bin/bash`
4. Check the GPU resources using the command `nvidia-smi` again 

Let's run our first interactive job requesting 1 node and check what GPU resources are at our disposal. 

![title](images/interactive_launch.png)

Notice that while connected to the session, the host name as displayed in the command line changes from "lab" (login node name) to "slurmnode1" indicating that we are now successfully working on a remote worker node.
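Besides the prompt, SLURM sets environment variables inside an allocation that confirm where a shell is running. The sketch below reads `SLURM_JOB_ID` and `SLURMD_NODENAME`, falling back to the host name. `describe_session` is a hypothetical helper, and the job ID 25 in the example is a made-up value.

```python
import os
import socket


def describe_session(env=None, hostname=None):
    """Report whether the current process runs inside a SLURM job.

    env and hostname default to the real environment and host name;
    they are parameters so the function can be tried with sample values.
    """
    env = os.environ if env is None else env
    hostname = socket.gethostname() if hostname is None else hostname
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        return f"{hostname}: no SLURM allocation"
    node = env.get("SLURMD_NODENAME", hostname)
    return f"{node}: inside SLURM job {job_id}"


# Sample values (assumption): the login node, then an interactive session.
on_login_node = describe_session({}, hostname="lab")
on_worker = describe_session(
    {"SLURM_JOB_ID": "25", "SLURMD_NODENAME": "slurmnode1"}, hostname="slurmnode1"
)
print(on_login_node)
print(on_worker)
```

Calling `describe_session()` without arguments from a Python interpreter started in the terminal reports on the real session.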

Run the following cell to get a link to open a terminal session and the instructions to run an interactive session.

In [18]:
%%html

<pre>
   Step 1: Open a terminal session by following this <a href="" data-commandlinker-command="terminal:create-new">Terminal link</a>
   Step 2: Check the GPU resources: <font color="green">nvidia-smi</font>
   Step 3: Run an interactive session: <font color="green">srun -N 1 --pty /bin/bash</font>
   Step 4: Check the GPU resources again: <font color="green">nvidia-smi</font>
</pre>

---
<h2 style="color:green;">Congratulations!</h2>

You've made it through the first section of the course and are ready to begin training Deep Learning models on multiple GPUs. <br>

Before moving on, we need to make sure that no jobs are still running or waiting on the SLURM queue. 
Let's check the SLURM jobs queue by executing the following cell:

In [19]:
# Check the SLURM jobs queue 
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


If there are still jobs running or pending, execute the following cell to cancel all the user's jobs using the `scancel` command. 

In [20]:
# Cancel all jobs of the current user
!scancel -u $USER

# Check the SLURM jobs queue again (it should be empty, or the ST status column should show CG)
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


Next, we will be running basic GPT language model training on different distribution configurations. Move on to [02_GPT_LM_pretrainings.ipynb](02_GPT_LM_pretrainings.ipynb).