# Using FABRIC GPUs

Your compute nodes can include GPUs. These devices are made available as FABRIC components and can be added to your nodes like any other component.

This example notebook demonstrates how to reserve and use NVIDIA GPU devices on FABRIC.


## Set Up the Experiment

#### Import FABRIC API

In [1]:
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

try: 
    fablib = fablib_manager()
                     
    fablib.show_config()
except Exception as e:
    print(f"Exception: {e}")

| Field | Value |
|---|---|
| Credential Manager | cm.fabric-testbed.net |
| Orchestrator | orchestrator.fabric-testbed.net |
| Token File | /home/fabric/.tokens.json |
| Project ID | f8a6e0b0-ad14-47cb-9764-74c20ef3e4fc |
| Bastion Username | durbek_gafurov_0000000854 |
| Bastion Private Key File | /home/fabric/work/fabric_config/bastionD |
| Bastion Host | bastion-1.fabric-testbed.net |
| Bastion Private Key Passphrase | |
| Slice Public Key File | /home/fabric/work/fabric_config/.ssh/slice_key.pub |
| Slice Private Key File | /home/fabric/work/fabric_config/.ssh/slice_key |


## Create a Node

The cell below creates a slice that contains a single node. The node includes a GPU component.

### Set the Slice Name and FABRIC Site

Use a filter function to find random sites with your desired GPUs.


In [2]:
slice_name="gpu_slice"

rtx6000_site = fablib.get_random_site(filter_function=lambda x: x['rtx6000_available'] > 0)
# Optional extra conditions for the filter above:
#     and x['disk_available'] > 10000 and x['cores_available'] > 2 and x['ram_available'] > 8
# tesla_site = fablib.get_random_site(filter_function=lambda x: x['tesla_t4_available'] > 0 and x['disk_available'] > 10000 and x['cores_available'] > 2 and x['ram_available'] > 8)

rtx6000_node_name='rtx1'
tesla_node_name='tesla1'
rtx6000_site

'CLEM'
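The filter function is an ordinary predicate applied to one record per site. The sketch below mimics that selection in plain Python so the logic can be tried without a FABRIC account. The site records and the `name` key are made up for illustration; only the `*_available` field names come from the cell above, and `pick_random_site` is a hypothetical stand-in, not fablib's implementation.

```python
import random

# Hypothetical site records; real values come from the FABRIC resource listing.
SITES = [
    {"name": "SITE_A", "rtx6000_available": 2, "cores_available": 64,
     "ram_available": 384, "disk_available": 20000},
    {"name": "SITE_B", "rtx6000_available": 0, "cores_available": 128,
     "ram_available": 512, "disk_available": 50000},
    {"name": "SITE_C", "rtx6000_available": 1, "cores_available": 1,
     "ram_available": 4, "disk_available": 500},
]


def pick_random_site(sites, filter_function, rng=random):
    """Return the name of a random site that passes the filter, or None."""
    candidates = [site["name"] for site in sites if filter_function(site)]
    return rng.choice(candidates) if candidates else None


def has_gpu(site):
    """Same condition as the active filter in the cell above."""
    return site["rtx6000_available"] > 0


def has_gpu_and_capacity(site):
    """The active filter plus the commented-out capacity conditions."""
    return (site["rtx6000_available"] > 0 and site["disk_available"] > 10000
            and site["cores_available"] > 2 and site["ram_available"] > 8)


print(pick_random_site(SITES, has_gpu))               # SITE_A or SITE_C
print(pick_random_site(SITES, has_gpu_and_capacity))  # SITE_A
```

Adding the capacity conditions narrows the candidate set, which lowers the chance of landing on a site that has a free GPU but too little disk or RAM for the node.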

In [14]:
# rtx6000_site,tesla_site = "STAR","STAR"

In [3]:
try:
    #Create Slice
    slice = fablib.new_slice(name=slice_name)

    # Add node
    rtx_node = slice.add_node(name=rtx6000_node_name, site=rtx6000_site,disk=7000,image="default_ubuntu_20")
    rtx_node.add_component(model='GPU_RTX6000', name='gpu1')

#     tesla_node = slice.add_node(name=tesla_node_name, site=tesla_site,disk=7000)
#     tesla_node.add_component(model='GPU_TeslaT4', name='gpu1')


    #Submit Slice Request
    slice.submit()
except Exception as e:
    print(f"Exception: {e}")


Retry: 9, Time: 203 sec


| Field | Value |
|---|---|
| ID | d092afca-dfd9-4b40-993e-09dc330186b7 |
| Name | gpu_slice |
| Lease Expiration (UTC) | 2023-01-15 02:07:34 +0000 |
| Lease Start (UTC) | 2023-01-14 02:07:34 +0000 |
| Project ID | f8a6e0b0-ad14-47cb-9764-74c20ef3e4fc |
| State | StableOK |


| Field | Value |
|---|---|
| ID | 48d5ea42-43f3-46d3-b599-a2bed6ea935d |
| Name | rtx1 |
| Cores | 64 |
| RAM | 384 |
| Disk | 4000 |
| Image | default_ubuntu_20 |
| Image Type | qcow2 |
| Host | clem-w1.fabric-testbed.net |
| Site | CLEM |
| Username | ubuntu |
| Management IP | 2620:103:a006:12:f816:3eff:fe5a:b752 |
| State | Active |
| Error | |
| SSH Command | ssh -i /home/fabric/work/fabric_config/.ssh/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@2620:103:a006:12:f816:3eff:fe5a:b752 |
| Public SSH Key File | /home/fabric/work/fabric_config/.ssh/slice_key.pub |
| Private SSH Key File | /home/fabric/work/fabric_config/.ssh/slice_key |



Time to stable 203 seconds
Running post_boot_config ... Time to post boot config 203 seconds


## Get the Slice

Retrieve the slice by name and display its attributes.

In [4]:
try:
    slice = fablib.get_slice(name=slice_name)
    slice.show()
except Exception as e:
    print(f"Exception: {e}")

| Field | Value |
|---|---|
| ID | d092afca-dfd9-4b40-993e-09dc330186b7 |
| Name | gpu_slice |
| Lease Expiration (UTC) | 2023-01-15 02:07:34 +0000 |
| Lease Start (UTC) | 2023-01-14 02:07:34 +0000 |
| Project ID | f8a6e0b0-ad14-47cb-9764-74c20ef3e4fc |
| State | StableOK |


## Get the Nodes

Retrieve the node and its GPU component, then display their attributes.


In [5]:
try:
    rtx_node = slice.get_node(rtx6000_node_name) 
    rtx_node.show()
    
    rtx_gpu = rtx_node.get_component('gpu1')
    rtx_gpu.show()
    
#     tesla_node = slice.get_node(tesla_node_name) 
#     tesla_node.show()
    
#     tesla_gpu = tesla_node.get_component('gpu1')
#     tesla_gpu.show()
except Exception as e:
    print(f"Exception: {e}")

| Field | Value |
|---|---|
| ID | 48d5ea42-43f3-46d3-b599-a2bed6ea935d |
| Name | rtx1 |
| Cores | 64 |
| RAM | 384 |
| Disk | 4000 |
| Image | default_ubuntu_20 |
| Image Type | qcow2 |
| Host | clem-w1.fabric-testbed.net |
| Site | CLEM |
| Username | ubuntu |


| Field | Value |
|---|---|
| Name | rtx1-gpu1 |
| Details | NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1) |
| Disk | 0 |
| Units | 1 |
| PCI Address | 0000:25:00.0 |
| Model | GPU_RTX6000 |
| Type | GPU |


Use the RTX6000 node for the rest of the example.

In [6]:
node = rtx_node

### GPU PCI Device

Run <code>lspci</code> to list the node's GPU PCI device(s). At this point the GPU is a raw PCI device with no driver configured. Once drivers are installed, you can use it as you would any other GPU.

View the GPU on node `rtx1`:

In [24]:
# Raw string so that the grep alternation `\|` is not treated as a Python escape sequence.
command = r"sudo apt-get install -q -y pciutils && lspci | grep 'NVIDIA\|3D controller'"
try:
    stdout, stderr = node.execute(command)
except Exception as e:
    print(f"Exception: {e}")

Reading package lists...
Building dependency tree...
Reading state information...
pciutils is already the newest version (1:3.6.4-1ubuntu0.20.04.1).
0 upgraded, 0 newly installed, 0 to remove and 83 not upgraded.
00:07.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
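The same filtering can be done on the Python side once the `lspci` text is available as a string. The sketch below is a minimal illustration: `gpu_lines` and `pci_address` are hypothetical helper names, and the first sample line is invented to show a non-matching device; the second is the line printed above.

```python
import re

# Matches the same lines as: grep 'NVIDIA\|3D controller'
GPU_PATTERN = re.compile(r"NVIDIA|3D controller")


def gpu_lines(lspci_output):
    """Return the lspci lines that mention an NVIDIA device or a 3D controller."""
    return [line for line in lspci_output.splitlines() if GPU_PATTERN.search(line)]


def pci_address(lspci_line):
    """The PCI address is the first whitespace-separated field of an lspci line."""
    return lspci_line.split()[0]


sample = (
    "00:01.0 ISA bridge: Example non-GPU device (hypothetical line)\n"
    "00:07.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)\n"
)

for line in gpu_lines(sample):
    print(pci_address(line), "->", line)
```

Note that the address seen inside the VM (`00:07.0`) differs from the `PCI Address` reported for the component (`0000:25:00.0`), so the two should not be compared directly.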


In [None]:
node.get_ssh_command()

In [None]:
# node.execute("sudo apt-get update")
node.execute("sudo apt-get install -y fakeroot build-essential crash kexec-tools makedumpfile kernel-wedge && sudo apt-get build-dep linux && sudo apt-get install -y git-core libncurses5 libncurses5-dev libelf-dev asciidoc binutils-dev")

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  autoconf automake autopoint autotools-dev binutils binutils-common
  binutils-x86-64-linux-gnu cpp cpp-9 debhelper dh-autoreconf
  dh-strip-nondeterminism dpkg-dev dwz g++ g++-9 gcc gcc-9 gcc-9-base gettext
  intltool-debian libalgorithm-diff-perl libalgorithm-diff-xs-perl
  libalgorithm-merge-perl libarchive-cpio-perl libarchive-zip-perl libasan5
  libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libcroco3
  libcrypt-dev libctf-nobfd0 libctf0 libdebhelper-perl libdpkg-perl libdw1
  libfakeroot libfile-fcntllock-perl libfile-stripnondeterminism-perl
  libgcc-9-dev libgomp1 libisl22 libitm1 liblsan0 libltdl-dev
  libmail-sendmail-perl libmpc3 libquadmath0 libsnappy1v5 libstdc++-9-dev
  libsub-override-perl libsys-hostname-long-perl libtool libtsan0 libubsan1
  linux-libc-dev m4 make manpages-dev po-debconf
Suggested packages:
  autoconf-archive g

## Install Nvidia Drivers

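Run the following commands to install the NVIDIA driver, the CUDA libraries, and the CUDA compiler. Note that `epel-release`, `kernel-devel`, and `kernel-headers` are package names from RPM-based distributions; on the `default_ubuntu_20` image used here, `apt-get` cannot find them, which is why the output below reports errors.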

In [27]:
commands = [
    'sudo apt-get install -q -y epel-release',
    # 'sudo apt config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo',
    'sudo apt-get install -q -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda'
]
try:
    print("Installing CUDA...")
    for command in commands:
        stdout, stderr = node.execute(command)
    print("Done installing CUDA. Now, reboot for the changes to take effect.")
except Exception as e:
    print(f"Fail: {e}")

Installing CUDA...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package epel-release
Reading package lists...
Building dependency tree...
Reading state information...
Package nvidia-driver is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Unable to locate package kernel-devel
E: Unable to locate package kernel-headers
E: Package 'nvidia-driver' has no installation candidate
E: Unable to locate package cuda-driver
E: Unable to locate package cuda
Done installing CUDA. Now, reboot for the changes to take effect.

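Because a failed remote command comes back in `stderr` rather than as an exception, the loop above prints its completion message even when every install fails. The sketch below pairs one possible Ubuntu 20.04 command sequence with a runner that stops at the first apt error. The keyring URL and version are an assumption based on NVIDIA's network-repository instructions and should be verified against the current CUDA installation guide; `run_commands`, `apt_errors`, and `FakeNode` are hypothetical helpers so the logic can be exercised offline.

```python
# Assumed Ubuntu 20.04 counterparts of the commands above.
# Verify the keyring URL and version against NVIDIA's CUDA installation guide.
CUDA_KEYRING_URL = (
    "https://developer.download.nvidia.com/compute/cuda/repos/"
    "ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb"
)

UBUNTU_COMMANDS = [
    "sudo apt-get update -q",
    "sudo apt-get install -q -y build-essential linux-headers-$(uname -r)",
    f"wget -q {CUDA_KEYRING_URL} -O cuda-keyring.deb",
    "sudo dpkg -i cuda-keyring.deb",
    "sudo apt-get update -q",
    "sudo apt-get install -q -y cuda",
]


def apt_errors(stderr):
    """Return the stderr lines that apt marks as errors (they start with 'E:')."""
    stripped = (line.strip() for line in stderr.splitlines())
    return [line for line in stripped if line.startswith("E:")]


def run_commands(node, commands):
    """Run commands in order; return (ok, failed_command, errors)."""
    for command in commands:
        stdout, stderr = node.execute(command)
        errors = apt_errors(stderr)
        if errors:
            return False, command, errors
    return True, None, []


class FakeNode:
    """Offline stand-in for a fablib node: fails commands naming a given package."""

    def __init__(self, missing_package=None):
        self.missing_package = missing_package

    def execute(self, command):
        if self.missing_package and self.missing_package in command:
            return "", f"E: Unable to locate package {self.missing_package}\n"
        return "ok\n", ""


print(run_commands(FakeNode(), UBUNTU_COMMANDS))
print(run_commands(FakeNode(missing_package="epel-release"),
                   ["sudo apt-get install -q -y epel-release"]))
```

With a real node, `run_commands(node, UBUNTU_COMMANDS)` would replace the loop, and the reboot would only be worth doing when the first element of the result is `True`.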

Once CUDA is installed, reboot the machine.

In [17]:
reboot = 'sudo reboot'
try:
    print(reboot)
    node.execute(reboot)
    
    slice.wait_ssh(timeout=360,interval=10,progress=True)

    print("Now testing SSH abilities to reconnect...",end="")
    slice.update()
    slice.test_ssh()
    print("Reconnected!")

except Exception as e:
    print(f"Fail: {e}")

sudo reboot
Waiting for slice . Slice state: StableOK
Waiting for ssh in slice .. ssh successful
Now testing SSH abilities to reconnect...Reconnected!


## Testing the GPU and CUDA Installation

First, verify that the Nvidia drivers recognize the GPU by running `nvidia-smi`.

In [10]:
try:
    stdout, stderr = node.execute("nvidia-smi")
    print(f"stdout: {stdout}")
except Exception as e:
    print(f"Exception: {e}")

bash: nvidia-smi: command not found
stdout:

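For a programmatic check, `nvidia-smi` can print selected fields as CSV, which is easier to parse than the default table. The sketch below assumes the `--query-gpu` and `--format=csv,noheader` options; `parse_gpu_query` is a hypothetical helper, and the sample row is a placeholder rather than output from a FABRIC node.

```python
import csv
import io

# Command to pass to node.execute(); prints one CSV row per GPU.
QUERY = "nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader"


def parse_gpu_query(stdout):
    """Parse the CSV rows printed by QUERY into a list of dicts; [] if there are none."""
    rows = csv.reader(io.StringIO(stdout), skipinitialspace=True)
    return [
        {"name": row[0], "driver_version": row[1], "memory_total": row[2]}
        for row in rows
        if len(row) == 3
    ]


# Placeholder row for illustration only; real values depend on the node.
placeholder_stdout = "Example GPU, 000.00, 1024 MiB\n"

print(parse_gpu_query(placeholder_stdout))
print(parse_gpu_query(""))  # empty stdout, as when nvidia-smi is not installed
```

An empty list then signals the same condition as the `command not found` message above: the driver is not installed, so the install step needs to be revisited before compiling anything.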

Now, let's upload the following "Hello World" CUDA program file to the node.

`hello-world.cu`

*Source: https://computer-graphics.se/multicore/pdf/hello-world.cu*

*Author: Ingemar Ragnemalm*

> This file is from *The real "Hello World!" for CUDA, OpenCL and GLSL!* (https://computer-graphics.se/hello-world-for-cuda.html), written by Ingemar Ragnemalm, programmer and CUDA teacher. The only change from the original file on the website is an added `#include <unistd.h>`, because `sleep()` is declared in the `unistd.h` header.

In [11]:
node.upload_file('./hello-world.cu', 'hello-world.cu')

<SFTPAttributes: [ size=1128 uid=1000 gid=1000 mode=0o100664 atime=1673662301 mtime=1673662301 ]>

Now compile the `.cu` file with `nvcc`, the compiler that ships with the CUDA toolkit. This example creates an executable called `hello_world`.

In [12]:
try:
    stdout, stderr = node.execute("/usr/local/cuda-12.0/bin/nvcc -o hello_world hello-world.cu")
except Exception as e:
    print(f"Exception: {e}")

bash: /usr/local/cuda-12.0/bin/nvcc: No such file or directory

In [13]:
node.execute("nvcc hello-world.cu -L /usr/local/cuda/lib -lcudart -o hello_world")

bash: nvcc: command not found

('', 'bash: nvcc: command not found\n')
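The compile and run steps have to agree on the executable name, and the `nvcc` path depends on which CUDA version was installed. One way to keep both in a single place is to derive the two commands from the same variables. `build_commands` is a hypothetical helper, and `/usr/local/cuda/bin/nvcc` assumes the conventional `/usr/local/cuda` symlink, which may not exist on your node.

```python
import shlex


def build_commands(source, executable, nvcc="nvcc"):
    """Return (compile_command, run_command) that share one executable name."""
    compile_command = (
        f"{shlex.quote(nvcc)} -o {shlex.quote(executable)} {shlex.quote(source)}"
    )
    run_command = f"./{shlex.quote(executable)}"
    return compile_command, run_command


compile_command, run_command = build_commands(
    "hello-world.cu", "hello_world", nvcc="/usr/local/cuda/bin/nvcc"
)
print(compile_command)  # /usr/local/cuda/bin/nvcc -o hello_world hello-world.cu
print(run_command)      # ./hello_world
```

Both strings can then be passed to `node.execute`, and the returned `stderr` checked for `command not found` or `No such file or directory` before moving on.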

Finally, run the executable:

In [14]:
try:
    stdout, stderr = node.execute("./hello_world")
    print(f"stdout: {stdout}")
except Exception as e:
    print(f"Exception: {e}")

bash: ./hello_world: No such file or directory
stdout:


If you see `Hello World!`, the CUDA program ran successfully. `World!` was computed on the GPU by adding an array of integer offsets to the characters of the string `Hello `, and the result was printed to stdout.
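The arithmetic behind that result can be reproduced on the CPU. The sketch below is a plain Python illustration, not the CUDA kernel itself; the offset values are quoted from memory of Ragnemalm's file and should be treated as an assumption to compare against the uploaded `hello-world.cu`.

```python
# Offsets assumed to match the array in hello-world.cu; check against the file.
text = "Hello "
offsets = [15, 10, 6, 0, -11, 1]


def add_offsets(text, offsets):
    """Add one integer offset to each character code.

    In the CUDA program each GPU thread handles one index; here a loop does it.
    """
    return "".join(chr(ord(ch) + off) for ch, off in zip(text, offsets))


print(text + add_offsets(text, offsets))  # Hello World!
```

For example, `'H'` has code 72 and 72 + 15 = 87, which is `'W'`; the trailing space (32) plus 1 gives 33, which is `'!'`.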

### Congratulations! You have now successfully run a program on a FABRIC GPU!

## Clean Up Your Experiment

In [None]:
# try:
#     slice = fablib.get_slice(name=slice_name)
#     slice.delete()
# except Exception as e:
#     print(f"Exception: {e}")