## Launch and set up NVIDIA bare metal server - with python-chi

At the beginning of the lease time, we will bring up our GPU server. We will use `python-chi`, the Python API for Chameleon, to provision the server.

We will execute the cells in this notebook inside the Chameleon Jupyter environment.

Run the following cell, and make sure the correct project is selected:

In [1]:
from chi import server, context, lease
import os

context.version = "1.0" 
context.choose_project()
context.choose_site(default="CHI@TACC")


Change the string in the following cell to reflect the name of *your* lease, then run it to get your lease:

In [2]:
net_ID = "ahmed"

In [3]:
l = lease.get_lease(f"mltrain_{net_ID}") 
l.show()


Lease Details:
Name: mltrain_ahmed
ID: c08a818c-47c2-46b7-8286-314af9d4e904
Status: ACTIVE
Start Date: 2025-06-27 17:13:00
End Date: 2025-06-28 19:55:00
User ID: 873f8aeb9a6c23f0ad0d5881739fd9711619d7b3e7957b2743860414a1ba2191
Project ID: 477960c4206444c3a77b9e4ffa281ade

Node Reservations:
ID: bc27ebde-eb1d-408e-8eee-aed4e37d5f8f, Status: active, Min: 1, Max: 1

Floating IP Reservations:

Network Reservations:

Flavor Reservations:

Events:


The status should show as “ACTIVE” now that we are past the lease start time.
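
To double-check that the current time falls inside the lease window, and to see how much time is left, the dates from the lease details can be compared directly. This is a minimal sketch: `lease_window` is a hypothetical helper (not part of `python-chi`), the "now" value is made up for illustration, and it assumes all timestamps are in the same time zone.

```python
from datetime import datetime

def lease_window(start: str, end: str, now: datetime) -> dict:
    """Report whether `now` is inside the lease and how many hours remain."""
    fmt = "%Y-%m-%d %H:%M:%S"  # format used in the lease details output
    start_dt = datetime.strptime(start, fmt)
    end_dt = datetime.strptime(end, fmt)
    in_window = start_dt <= now < end_dt
    seconds_left = (end_dt - now).total_seconds() if in_window else 0.0
    return {"in_window": in_window, "hours_left": round(seconds_left / 3600, 2)}

# Start and end dates copied from the lease details; "now" is an illustrative value.
info = lease_window(
    "2025-06-27 17:13:00",
    "2025-06-28 19:55:00",
    now=datetime(2025, 6, 27, 18, 19, 35),
)
print(info)
```

In a live notebook you would pass `datetime.now()` or `datetime.utcnow()`, whichever matches the time zone of the lease dates.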

The rest of this notebook can be executed without any interactions from you, so at this point, you can save time by clicking on this cell, then selecting “Run” \> “Run Selected Cell and All Below” from the Jupyter menu.

As the notebook executes, monitor its progress to make sure it does not get stuck on any execution error, and also to see what it is doing!

We will use the lease to bring up a server with the `CC-Ubuntu24.04-CUDA` disk image.

> **Note**: the following cell brings up a server only if you don’t already have one with the same name, regardless of that server’s state. If you already have a server in ERROR state, delete it in the Horizon GUI before you run this cell.

In [4]:
username = os.getenv('USER') # appended to the server name so that it is unique to you
s = server.Server(
    f"node-mltrain-{username}", 
    reservation_id=l.node_reservations[0]["id"],
    image_name="CC-Ubuntu24.04-CUDA"
)
s.submit(idempotent=True)

Waiting for server node-mltrain-ahmed_offsechq_com's status to become ACTIVE. This typically takes 10 minutes for baremetal, but can take up to 20 minutes.



Server has moved to status ACTIVE


| Attribute | node-mltrain-ahmed_offsechq_com |
|---|---|
| Id | d1bcfde2-36e2-4700-a51d-c6a338b8a340 |
| Status | ACTIVE |
| Image Name | CC-Ubuntu24.04-CUDA |
| Flavor Name | baremetal |
| Addresses | sharednet1:  IP: 10.52.1.92 (v4)  Type: fixed  MAC: 44:a8:42:26:62:bb  IP: 129.114.108.204 (v4)  Type: floating  MAC: 44:a8:42:26:62:bb |
| Network Name | sharednet1 |
| Created At | 2025-06-27T18:19:35Z |
| Keypair | ahmed_offsechq_com-jupyter |
| Reservation Id | |
| Host Id | 7c346a87f6e9c6f9d67fb076fe5d76102c3a39bab2eac416bf31a8b7 |


<Server 'node-mltrain-ahmed_offsechq_com'>
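
The rule in the note before the launch cell (create a server only when none with that name exists, and delete a server in ERROR state by hand first) can be written down as a small decision function. This is only an illustration of the rule as stated in this notebook, not how `python-chi` implements `idempotent=True`. The function name and the dictionary of existing servers are hypothetical.

```python
def plan_server_action(name: str, existing: dict[str, str]) -> str:
    """Decide what to do before launching, given {server name: status}."""
    status = existing.get(name)
    if status is None:
        return "create"               # no server with this name yet
    if status == "ERROR":
        return "delete-then-create"   # delete it in Horizon, then re-run the cell
    return "reuse"                    # a server with this name already exists

# Illustrative inventory of servers; the names and statuses are made up.
servers = {"node-mltrain-alice": "ACTIVE", "node-mltrain-bob": "ERROR"}
for candidate in ("node-mltrain-alice", "node-mltrain-bob", "node-mltrain-carol"):
    print(candidate, "->", plan_server_action(candidate, servers))
```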

Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.

Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.

In [None]:
s.associate_floating_ip()

In [5]:
s.refresh()
s.check_connectivity()

Checking connectivity to 129.114.108.204 port 22.



Connection successful
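
The connectivity check above reports whether port 22 on the floating IP accepts connections. If you want to repeat that kind of check yourself, for example from your own laptop, a plain TCP connection attempt is enough. The sketch below uses only the Python standard library; `port_is_open` is a hypothetical helper and is not how `python-chi` performs its check.

```python
import socket

def port_is_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

To use it, call `port_is_open("A.B.C.D")` with your instance's floating IP in place of `A.B.C.D`.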


In the output below, make a note of the floating IP that has been assigned to your instance (in the “Addresses” row).

In [None]:
s.refresh()
s.show(type="widget")
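
If you would rather extract the floating IP programmatically than read it off the table, a regular expression over the "Addresses" text is one option. The sketch below assumes the text has the same layout as the "Addresses" row printed earlier in this notebook. `floating_ip_from_addresses` is a hypothetical helper, not part of `python-chi`.

```python
import re
from typing import Optional

# Matches fragments such as "IP: 10.52.1.92 (v4)  Type: fixed".
_ADDR = re.compile(r"IP:\s*(\d{1,3}(?:\.\d{1,3}){3})\s*\(v4\)\s*Type:\s*(\w+)")

def floating_ip_from_addresses(addresses: str) -> Optional[str]:
    """Return the first IPv4 address labeled 'floating', or None if there is none."""
    for ip, kind in _ADDR.findall(addresses):
        if kind == "floating":
            return ip
    return None

# Text copied from the "Addresses" row shown earlier in this notebook.
addresses = (
    "sharednet1:  IP: 10.52.1.92 (v4)  Type: fixed  MAC: 44:a8:42:26:62:bb  "
    "IP: 129.114.108.204 (v4)  Type: floating  MAC: 44:a8:42:26:62:bb"
)
print(floating_ip_from_addresses(addresses))  # 129.114.108.204
```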

## Retrieve code and notebooks on the instance

Now, we can use `python-chi` to execute commands on the instance, to set it up. We’ll start by cloning the repository that holds the code and other materials onto the instance.

In [None]:
repo = "https://github.com/A7med7x7/ReproGen.git"

In [None]:
s.execute(f"git clone {repo}")
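
A plain `git clone` fails if the target directory already exists, which can happen when this notebook is re-run against the same server. One way to make the step safe to repeat is to clone only when the directory is missing. The sketch below just builds the shell command string; `clone_command` is a hypothetical helper, and passing its result to `s.execute(...)` is left to you.

```python
import shlex
from pathlib import PurePosixPath
from urllib.parse import urlparse

def clone_command(repo_url: str) -> str:
    """Build a shell command that clones repo_url only if its directory is absent."""
    name = PurePosixPath(urlparse(repo_url).path).name
    if name.endswith(".git"):
        name = name[: -len(".git")]  # git clone drops the .git suffix by default
    return f"test -d {shlex.quote(name)} || git clone {shlex.quote(repo_url)}"

cmd = clone_command("https://github.com/A7med7x7/ReproGen.git")
print(cmd)
# test -d ReproGen || git clone https://github.com/A7med7x7/ReproGen.git
```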

## Set up Docker

To use common deep learning frameworks like TensorFlow or PyTorch, and ML training platforms like MLflow and Ray, we can run containers that already include the libraries these frameworks require. Here, we will set up Docker as the container engine.

In [None]:
s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")

## Set up the NVIDIA container toolkit

We will also install the NVIDIA container toolkit, with which we can access GPUs from inside our containers.

In [None]:
s.execute("curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list")
s.execute("sudo apt update")
s.execute("sudo apt-get install -y nvidia-container-toolkit")
s.execute("sudo nvidia-ctk runtime configure --runtime=docker")
# for https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
s.execute("sudo jq 'if has(\"exec-opts\") then . else . + {\"exec-opts\": [\"native.cgroupdriver=cgroupfs\"]} end' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp > /dev/null && sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json")
s.execute("sudo systemctl restart docker")
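
The `jq` filter in the cell above adds an `exec-opts` entry to `/etc/docker/daemon.json` only when that key is not already present, and leaves the file unchanged otherwise. The same logic, written in Python for readability, looks roughly like this. The starting configuration is an illustrative example, not necessarily what your `daemon.json` contains.

```python
import json

def ensure_cgroupfs(daemon_config: dict) -> dict:
    """Mirror the jq filter: add exec-opts only if the key is missing."""
    if "exec-opts" in daemon_config:
        return daemon_config  # keep whatever is already configured
    return {**daemon_config, "exec-opts": ["native.cgroupdriver=cgroupfs"]}

# Illustrative daemon.json content with an NVIDIA runtime entry.
before = {"runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}}}
after = ensure_cgroupfs(before)
print(json.dumps(after, indent=2))
```

Because the key is added only when missing, applying the function twice gives the same result as applying it once, which is why the cell is safe to re-run.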

We can also install `nvtop` to monitor GPU usage:

In [None]:
s.execute("sudo apt -y install nvtop")

### Mount S3 buckets on the filesystem

Modify the FUSE configuration file to enable the `user_allow_other` option. This lets a non-root user mount a filesystem with access permitted for all users, including processes running in Docker containers.

In [None]:
s.execute("sudo sed -i '/^#user_allow_other/s/^#//' /etc/fuse.conf") # uncomment the user_allow_other line
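
The `sed` expression above removes the leading `#` from the `#user_allow_other` line and touches nothing else. For readers less familiar with `sed`, here is the same transformation as a Python sketch applied to sample text. The sample file content is illustrative, and `enable_user_allow_other` is a hypothetical helper.

```python
import re

def enable_user_allow_other(conf_text: str) -> str:
    """Uncomment a line that starts with '#user_allow_other'; leave other lines alone."""
    return re.sub(r"^#(user_allow_other)", r"\1", conf_text, flags=re.MULTILINE)

# Illustrative excerpt in the style of /etc/fuse.conf.
sample = "#mount_max = 1000\n#user_allow_other\n"
result = enable_user_allow_other(sample)
print(result)
```

Running the function on already-updated text changes nothing, so the step can be repeated safely.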

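Mount the buckets using rclone. The cell below assumes that rclone is installed on the instance and that a remote named `rclone_s3` is already configured: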

In [12]:
buckets = {
    'fancyproject-data': 'data',
    'fancyproject-mlflow-metrics': 'metrics'
}

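for bucket_name, mount_dir in buckets.items():
    # create the mount point and make the cc user and group own it
    s.execute(f"sudo mkdir -p /mnt/{mount_dir}")
    s.execute(f"sudo chown -R cc /mnt/{mount_dir}")
    s.execute(f"sudo chgrp -R cc /mnt/{mount_dir}")
    # mount the bucket in the background; --allow-other relies on user_allow_other in /etc/fuse.conf
    s.execute(f"rclone mount rclone_s3:{bucket_name} /mnt/{mount_dir} --allow-other --daemon")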
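
Before running commands like these against the server, it can help to preview exactly what will be executed. The sketch below builds the command list for a bucket mapping without running anything. `mount_commands` is a hypothetical helper, and it combines the owner and group change into a single `chown owner:group` call.

```python
def mount_commands(buckets: dict[str, str], remote: str = "rclone_s3",
                   owner: str = "cc") -> list[str]:
    """Return the shell commands that would mount each bucket under /mnt."""
    commands = []
    for bucket_name, mount_dir in buckets.items():
        path = f"/mnt/{mount_dir}"
        commands += [
            f"sudo mkdir -p {path}",
            f"sudo chown -R {owner}:{owner} {path}",  # set user and group together
            f"rclone mount {remote}:{bucket_name} {path} --allow-other --daemon",
        ]
    return commands

cmds = mount_commands({
    "fancyproject-data": "data",
    "fancyproject-mlflow-metrics": "metrics",
})
for c in cmds:
    print(c)
```

Each string in the list could then be passed to `s.execute(...)` one at a time.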

In [13]:
s.execute("ls -l /mnt/") # we should be able to see the mounted buckets

total 0
drwxrwxr-x 1 cc cc 0 Jun 27 21:12 data
drwxrwxr-x 1 cc cc 0 Jun 27 21:13 metrics


<Result cmd='ls -l /mnt/' exited=0>

## Access the server over SSH


From your local terminal, run

    ssh -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

where

-   in place of `~/.ssh/id_rsa_chameleon`, substitute the path to your own key that you uploaded to CHI@TACC and CHI@UC
-   in place of `A.B.C.D`, use the floating IP address you just associated with your instance.
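
If you expect to connect often, an entry in `~/.ssh/config` saves retyping the key path and address. The sketch below generates such an entry from the same two substitutions. The alias `mltrain` and the helper `ssh_config_entry` are hypothetical names chosen for illustration.

```python
def ssh_config_entry(alias: str, floating_ip: str,
                     key_path: str = "~/.ssh/id_rsa_chameleon",
                     user: str = "cc") -> str:
    """Return an ~/.ssh/config stanza for the instance."""
    return "\n".join([
        f"Host {alias}",
        f"    HostName {floating_ip}",
        f"    User {user}",
        f"    IdentityFile {key_path}",
    ])

# Replace A.B.C.D with your floating IP and adjust the key path if needed.
entry = ssh_config_entry("mltrain", "A.B.C.D")
print(entry)
```

After adding the printed stanza to `~/.ssh/config`, `ssh mltrain` is equivalent to the full command above.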