# Using FABRIC GPUs

Your compute nodes can include GPUs. These devices are made available as FABRIC components and can be added to your nodes like any other component.

This notebook demonstrates how to reserve and use Nvidia GPU devices on FABRIC.


## Set Up the Experiment

#### Import FABRIC API

In [1]:
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

try: 
    fablib = fablib_manager()
                     
    fablib.show_config()
except Exception as e:
    print(f"Exception: {e}")

0,1
Credential Manager,cm.fabric-testbed.net
Orchestrator,orchestrator.fabric-testbed.net
Token File,/home/fabric/.tokens.json
Project ID,f8a6e0b0-ad14-47cb-9764-74c20ef3e4fc
Bastion Username,durbek_gafurov_0000000854
Bastion Private Key File,/home/fabric/work/fabric_config/bastionD
Bastion Host,bastion-1.fabric-testbed.net
Bastion Private Key Passphrase,
Slice Public Key File,/home/fabric/work/fabric_config/.ssh/slice_key.pub
Slice Private Key File,/home/fabric/work/fabric_config/.ssh/slice_key


## Create a Node

The cell below creates a slice that contains a single node. The node includes a GPU component.

### Set the Slice Name and FABRIC Site

Use a filter function to pick a random site that currently has the GPU model you want available.


In [18]:
slice_name="MySlice6"

# Pick a random site that has at least one RTX 6000 available.
# Optional extra conditions: and x['disk_available']>10000 and x['cores_available']>2 and x['ram_available']>8
rtx6000_site = fablib.get_random_site(filter_function=lambda x: x['rtx6000_available'] > 0)
# tesla_site = fablib.get_random_site(filter_function=lambda x: x['tesla_t4_available'] > 0 and x['disk_available']>10000 and x['cores_available']>2 and x['ram_available']>8)

rtx6000_node_name='rtx1'
tesla_node_name='tesla1'
rtx6000_site

'UCSD'
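
The inline `lambda` above can also be written as a named function, which is easier to extend and to test. The sketch below is illustrative only: `has_gpu_capacity` is a hypothetical helper, the thresholds are taken from the commented-out conditions in the cell above, and the site records are made-up dictionaries that only mimic the keys used there.

```python
def has_gpu_capacity(site, gpu_key="rtx6000_available",
                     min_cores=2, min_ram=8, min_disk=10000):
    """Return True if the site record has a free GPU and enough spare resources."""
    return (
        site[gpu_key] > 0
        and site["cores_available"] > min_cores
        and site["ram_available"] > min_ram
        and site["disk_available"] > min_disk
    )

# Hypothetical site records, for illustration only (not real FABRIC data).
sites = [
    {"name": "SITE_A", "rtx6000_available": 0, "cores_available": 64,
     "ram_available": 256, "disk_available": 50000},
    {"name": "SITE_B", "rtx6000_available": 2, "cores_available": 32,
     "ram_available": 128, "disk_available": 20000},
    {"name": "SITE_C", "rtx6000_available": 1, "cores_available": 1,
     "ram_available": 4, "disk_available": 500},
]

eligible = [s["name"] for s in sites if has_gpu_capacity(s)]
print(eligible)  # ['SITE_B']
```

Assuming `get_random_site` passes each site record to the filter as it does for the `lambda` above, the function could be supplied as `filter_function=has_gpu_capacity`.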

In [14]:
# rtx6000_site,tesla_site = "STAR","STAR"

In [19]:
try:
    #Create Slice
    slice = fablib.new_slice(name=slice_name)

    # Add node
    rtx_node = slice.add_node(name=rtx6000_node_name, site=rtx6000_site,disk=7000)
    rtx_node.add_component(model='GPU_RTX6000', name='gpu1')

#     tesla_node = slice.add_node(name=tesla_node_name, site=tesla_site,disk=7000)
#     tesla_node.add_component(model='GPU_TeslaT4', name='gpu1')


    #Submit Slice Request
    slice.submit()
except Exception as e:
    print(f"Exception: {e}")


Retry: 8, Time: 183 sec


0,1
ID,41dc5387-4147-49ee-abdd-9f751b2d2f06
Name,MySlice6
Lease Expiration (UTC),2023-01-15 01:01:35 +0000
Lease Start (UTC),2023-01-14 01:01:36 +0000
Project ID,f8a6e0b0-ad14-47cb-9764-74c20ef3e4fc
State,StableOK


ID,Name,Cores,RAM,Disk,Image,Image Type,Host,Site,Username,Management IP,State,Error,SSH Command,Public SSH Key File,Private SSH Key File
52efc8ef-7e3a-48d6-8d19-c3b85762e3b8,rtx1,64,384,4000,default_rocky_8,qcow2,ucsd-w1.fabric-testbed.net,UCSD,rocky,132.249.252.145,Active,,ssh -i /home/fabric/work/fabric_config/.ssh/slice_key -F /home/fabric/work/fabric_config/ssh_config rocky@132.249.252.145,/home/fabric/work/fabric_config/.ssh/slice_key.pub,/home/fabric/work/fabric_config/.ssh/slice_key



Time to stable 183 seconds
Running post_boot_config ... Time to post boot config 183 seconds


## Get the Slice

Retrieve the slice by name and display its attributes.

In [20]:
try:
    slice = fablib.get_slice(name=slice_name)
    slice.show()
except Exception as e:
    print(f"Exception: {e}")

0,1
ID,41dc5387-4147-49ee-abdd-9f751b2d2f06
Name,MySlice6
Lease Expiration (UTC),2023-01-15 01:01:35 +0000
Lease Start (UTC),2023-01-14 01:01:36 +0000
Project ID,f8a6e0b0-ad14-47cb-9764-74c20ef3e4fc
State,StableOK


## Get the Nodes

Retrieve the node and its GPU component, then display their attributes.


In [21]:
try:
    rtx_node = slice.get_node(rtx6000_node_name) 
    rtx_node.show()
    
    rtx_gpu = rtx_node.get_component('gpu1')
    rtx_gpu.show()
    
#     tesla_node = slice.get_node(tesla_node_name) 
#     tesla_node.show()
    
#     tesla_gpu = tesla_node.get_component('gpu1')
#     tesla_gpu.show()
except Exception as e:
    print(f"Exception: {e}")

0,1
ID,52efc8ef-7e3a-48d6-8d19-c3b85762e3b8
Name,rtx1
Cores,64
RAM,384
Disk,4000
Image,default_rocky_8
Image Type,qcow2
Host,ucsd-w1.fabric-testbed.net
Site,UCSD
Username,rocky


0,1
Name,rtx1-gpu1
Details,NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
Disk,0
Units,1
PCI Address,0000:25:00.0
Model,GPU_RTX6000
Type,GPU


Use the RTX6000 node for the rest of the example.

In [22]:
node = rtx_node

### GPU PCI Device

Run `lspci` to list your GPU PCI device(s). At this point the GPU is visible as a raw PCI device but is not yet configured for use. Once it is configured, you can use it as you would any other GPU.

View the node's GPU:

In [23]:
# Raw string so that the backslash in the grep pattern reaches the shell unchanged.
command = r"sudo dnf install -q -y pciutils && lspci | grep 'NVIDIA\|3D controller'"
try:
    stdout, stderr = node.execute(command)
except Exception as e:
    print(f"Exception: {e}")


Installed:
  pciutils-3.7.0-1.el8.x86_64

Importing GPG key 0x6D745A60:
 Userid     : "Release Engineering <infrastructure@rockylinux.org>"
 Fingerprint: 7051 C470 A929 F454 CEBE 37B7 15AF 5DAC 6D74 5A60
 From       : /etc/pki/rpm-gpg/RPM-GPG-KEY-rockyofficial

00:07.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)

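If you want to check for the GPU programmatically rather than by eye, the `lspci` text can be parsed. This is a minimal sketch, assuming output lines have the `address kind: description` shape seen above; `find_nvidia_devices` is a hypothetical helper and the sample line is copied from the output of the previous cell.

```python
import re

# address, device kind, and a description that mentions NVIDIA
GPU_LINE = re.compile(
    r"^(?P<address>\S+)\s+(?P<kind>[^:]+):\s+(?P<description>.*NVIDIA.*)$"
)

def find_nvidia_devices(lspci_output):
    """Return one dict per lspci line whose description mentions NVIDIA."""
    devices = []
    for line in lspci_output.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            devices.append(match.groupdict())
    return devices

lspci_sample = (
    "00:07.0 3D controller: NVIDIA Corporation TU102GL "
    "[Quadro RTX 6000/8000] (rev a1)"
)

for device in find_nvidia_devices(lspci_sample):
    print(device["address"], "|", device["kind"])  # 00:07.0 | 3D controller
```

In the notebook, the `stdout` string returned by `node.execute(command)` could be passed in place of `lspci_sample`.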

## Install Nvidia Drivers

Now, let's run the following commands to install the latest CUDA driver and the CUDA libraries and compiler.

In [24]:
commands = [
    'sudo dnf install -q -y epel-release',
    'sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo',
    'sudo dnf install -q -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda'
]
try:
    print("Installing CUDA...")
    for command in commands:
        stdout, stderr = node.execute(command)
    print("Done installing CUDA. Now, reboot for the changes to take effect.")
except Exception as e:
    print(f"Fail: {e}")

Installing CUDA...

Installed:
  epel-release-8-18.el8.noarch                                                  

Adding repo from: https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Upgraded:
  glibc-2.28-211.el8.x86_64           glibc-all-langpacks-2.28-211.el8.x86_64   
  glibc-common-2.28-211.el8.x86_64    glibc-gconv-extra-2.28-211.el8.x86_64     
  libgcc-8.5.0-15.el8.x86_64          libgomp-8.5.0-15.el8.x86_64               
  libstdc++-8.5.0-15.el8.x86_64      
Installed:
  adwaita-cursor-theme-3.28.0-3.el8.noarch                                      
  adwaita-icon-theme-3.28.0-3.el8.noarch                                        
  alsa-lib-1.2.7.2-1.el8.x86_64                                                 
  at-spi2-atk-2.26.2-1.el8.x86_64                                               
  at-spi2-core-2.28.0-1.el8.x86_64                                              
  atk-2.28.1-1.el8.x86_64                                                  

Once CUDA is installed, reboot the machine so that the new driver is loaded.

In [25]:
reboot = 'sudo reboot'
try:
    print(reboot)
    node.execute(reboot)
    
    slice.wait_ssh(timeout=360,interval=10,progress=True)

    print("Now testing SSH abilities to reconnect...",end="")
    slice.update()
    slice.test_ssh()
    print("Reconnected!")

except Exception as e:
    print(f"Fail: {e}")

sudo reboot
Waiting for slice . Slice state: StableOK
Waiting for ssh in slice ... ssh successful
Now testing SSH abilities to reconnect...Reconnected!


## Testing the GPU and CUDA Installation

First, verify that the Nvidia drivers recognize the GPU by running `nvidia-smi`.

In [26]:
try:
    stdout, stderr = node.execute("nvidia-smi")
    print(f"stdout: {stdout}")
except Exception as e:
    print(f"Exception: {e}")

Sat Jan 14 01:16:20 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro RTX 6000     Off  | 00000000:00:07.0 Off |                    0 |
| N/A   32C    P0    54W / 250W |      0MiB / 23040MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
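
The header row of the `nvidia-smi` output carries the driver and CUDA versions, which can be useful to record or check in a script. The following is a rough sketch, assuming the header keeps the `Driver Version: ... CUDA Version: ...` layout shown above (other `nvidia-smi` releases may format it differently); `parse_versions` is a hypothetical helper and the sample line is copied from the output above.

```python
import re

VERSION_HEADER = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)

def parse_versions(nvidia_smi_output):
    """Return (driver_version, cuda_version), or None if the header is not found."""
    match = VERSION_HEADER.search(nvidia_smi_output)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")

smi_sample = (
    "| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    "
    "CUDA Version: 12.0     |"
)

print(parse_versions(smi_sample))  # ('525.60.13', '12.0')
```

Returning `None` when the header is missing gives a simple way to detect that the driver did not load after the reboot.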

Now, let's upload the following "Hello World" CUDA program file to the node.

`hello-world.cu`

*Source: https://computer-graphics.se/multicore/pdf/hello-world.cu*

*Author: Ingemar Ragnemalm*

>This file is from *"The real "Hello World!" for CUDA, OpenCL and GLSL!"* (https://computer-graphics.se/hello-world-for-cuda.html), written by Ingemar Ragnemalm, programmer and CUDA teacher. The only change from the original file on the website is an added `#include <unistd.h>`, because `sleep()` is declared in the `unistd.h` header.
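
If you start from the unmodified file, the include can also be added programmatically before uploading. This is a minimal sketch: `ensure_include` is a hypothetical helper, not part of fablib, and the short C snippet is only a stand-in for the real file contents.

```python
def ensure_include(source, header="unistd.h"):
    """Return source with '#include <header>' prepended unless it is already present."""
    directive = f"#include <{header}>"
    if directive in source:
        return source
    return f"{directive}\n{source}"

# Tiny stand-in for the downloaded file, for illustration only.
original = '#include <stdio.h>\nint main() { sleep(1); return 0; }\n'
patched = ensure_include(original)

print(patched.splitlines()[0])  # #include <unistd.h>
```

To apply it to the real file, read `./hello-world.cu` into a string, pass it through `ensure_include`, and write the result back before running the upload cell. Calling the function twice is harmless because it leaves an already patched source unchanged.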

In [28]:
node.upload_file('./hello-world.cu', 'hello-world.cu')

<SFTPAttributes: [ size=1108 uid=1000 gid=1000 mode=0o100664 atime=1673659090 mtime=1673659090 ]>

We now compile the `.cu` file using `nvcc`, the CUDA compiler tool installed with CUDA. In this example, we create an executable called `hello_world`.

In [31]:
try:
    stdout, stderr = node.execute("/usr/local/cuda-12.0/bin/nvcc -o hello_world hello-world.cu")
except Exception as e:
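    print(f"Exception: {e}")

hello-world.cu(45): error: identifier "sleep" is undefined

1 error detected in the compilation of "hello-world.cu".

This error means the uploaded file does not include `unistd.h`. Add `#include <unistd.h>` as described above, upload the file again, and rerun the compile cell.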

In [32]:
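node.execute("nvcc hello-world.cu -L /usr/local/cuda/lib -lcudart -o hello-world")

bash: nvcc: command not found

('', 'bash: nvcc: command not found\n')

`nvcc` is not on the default `PATH` on this node, so call it by its full path, as in the previous cell. Note also that this command names the executable `hello-world`, while the next cell runs `hello_world`.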
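
A small helper can build the compile command from the CUDA install directory, so that `nvcc` is always called by its full path and the output name matches the executable that is run in the next step. This is only a sketch: `nvcc_command` is a hypothetical helper, and the default directory is the one used in the earlier compile cell, which may differ on other CUDA versions.

```python
import shlex
from pathlib import PurePosixPath

def nvcc_command(source, output, cuda_home="/usr/local/cuda-12.0"):
    """Build a shell command that compiles `source` into `output` with nvcc."""
    nvcc = PurePosixPath(cuda_home) / "bin" / "nvcc"
    parts = (str(nvcc), "-o", output, source)
    # Quote each part so unusual file names cannot break the shell command.
    return " ".join(shlex.quote(part) for part in parts)

compile_command = nvcc_command("hello-world.cu", "hello_world")
print(compile_command)
# /usr/local/cuda-12.0/bin/nvcc -o hello_world hello-world.cu
```

The resulting string can be passed to `node.execute(...)` in the same way as the hand-written command.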

Finally, run the executable:

In [None]:
try:
    stdout, stderr = node.execute("./hello_world")
    print(f"stdout: {stdout}")
except Exception as e:
    print(f"Exception: {e}")

If you see `Hello World!`, the CUDA program ran successfully. The GPU computes `World!` by adding an array of integer offsets to the characters of the string `Hello `, and the result is printed to stdout.
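
The arithmetic behind that result can be reproduced on the CPU. The sketch below is a plain Python stand-in for the GPU kernel, not the CUDA code itself; it derives the offsets from the two strings instead of copying them from the source file, and `add_offsets` is a hypothetical name.

```python
hello = "Hello "
world = "World!"

# Offsets that turn each character of "Hello " into the matching character of "World!".
offsets = [ord(w) - ord(h) for h, w in zip(hello, world)]

def add_offsets(text, offsets):
    """CPU stand-in for the kernel: add one offset to each character code."""
    return "".join(chr(ord(c) + o) for c, o in zip(text, offsets))

print(offsets)                              # [15, 10, 6, 0, -11, 1]
print(hello + add_offsets(hello, offsets))  # Hello World!
```

On the GPU, each element-wise addition is independent, which is why the work can be spread across one thread per character.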

### Congratulations! You have now successfully run a program on a FABRIC GPU!

## Clean Up Your Experiment

In [None]:
# try:
#     slice = fablib.get_slice(name=slice_name)
#     slice.delete()
# except Exception as e:
#     print(f"Exception: {e}")