## Launch and set up GPU for model training


Run the following cell, and make sure the correct project is selected:

In [None]:
from chi import server, context, lease
import os, time

context.version = "1.0" 
context.choose_project()
context.choose_site(default="CHI@UC")

Retrieve an existing lease named `node1_a100_gpu_team36` from Chameleon Cloud.

Change the string in the following cell to reflect the name of *your* lease, then run it to get your lease:

In [None]:
l = lease.get_lease("node1_a100_gpu_team36")
l.show()

## Launch and set up NVIDIA A100 40GB server - with python-chi

At the beginning of the lease time, we will bring up our GPU server. We will use the `python-chi` Python API to Chameleon to provision our server.

> **Note**: if you don’t have access to the Chameleon Jupyter environment, or if you prefer to set up your NVIDIA A100 server by hand, the next section provides alternative instructions. If you want to set up your server “by hand”, skip to the next section.

We will execute the cells in this notebook inside the Chameleon Jupyter environment.

The status should show as “ACTIVE” now that we are past the lease start time.

We will use the lease to bring up a server with the `CC-Ubuntu24.04-CUDA` disk image.

In [None]:
username = os.getenv('USER') # all experiment resources will include this username in their name
s = server.Server(
    f"node-mltrain-{username}", 
    reservation_id=l.node_reservations[0]["id"],
    image_name="CC-Ubuntu24.04-CUDA"
)
s.submit(idempotent=True)
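
`os.getenv('USER')` returns `None` when the variable is unset, which would produce a server named `node-mltrain-None`. The following is a minimal sketch of a more defensive name builder. The helper name `resource_name`, the fallback to `getpass.getuser()`, and the set of allowed characters are illustrative assumptions, not part of `python-chi`.

```python
import getpass
import os
import re


def resource_name(prefix, user=None):
    """Build a name such as 'node-mltrain-<user>' (hypothetical helper)."""
    # Prefer an explicit user, then $USER, then the login name reported by getpass.
    user = user or os.getenv("USER") or getpass.getuser()
    # Assumption: restrict the name to letters, digits, underscores, and hyphens.
    safe_user = re.sub(r"[^A-Za-z0-9_-]", "-", user)
    return f"{prefix}-{safe_user}"


print(resource_name("node-mltrain", user="alice"))
```

The result could be passed as the first argument to `server.Server(...)` in place of the f-string above.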

Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.

In [None]:
# Attach a floating IP to your server
s.associate_floating_ip()

In [None]:
# Refresh info
s.refresh()
s.check_connectivity()
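
If you are working outside the notebook, or want to see what a connectivity check involves, the sketch below polls a TCP port until it accepts connections. It uses only the Python standard library and is independent of `python-chi`; the function name `wait_for_port` and its default timings are assumptions chosen for illustration.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=5.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the SSH port is reachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("A.B.C.D")` would wait up to five minutes for SSH on the floating IP, where `A.B.C.D` stands for your own address.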

In the output below, make a note of the floating IP that has been assigned to your instance (in the “Addresses” row).

In [None]:
s.refresh()
s.show(type="widget")

## Retrieve code and notebooks on the instance

Now, we can use `python-chi` to execute commands on the instance, to set it up. We’ll start by retrieving the code and other materials on the instance.

In [None]:
s.execute("git clone --recurse-submodules https://github.com/Sanjeevan1998/ml-ops-project.git")

(Optional: check out the `model-training` branch.)

In [None]:
s.execute("cd ml-ops-project && git checkout model-training");

In [None]:
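# Each s.execute() call starts a new shell, so a bare `cd` does not carry over to later calls.
# Instead, list the training directory to confirm it exists (path assumed relative to the home directory).
s.execute("ls ml-ops-project/code/model-training/")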
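
Commands sent with `s.execute` are expected to run in separate shell sessions, so a working directory set in one call is not remembered by the next. One way to handle this is to chain the `cd` and the actual command in a single string. The sketch below shows a small helper for that; `in_dir` is a hypothetical name, and the path `ml-ops-project/code/model-training` is assumed from the clone step above.

```python
import shlex


def in_dir(directory, command):
    """Return one shell command that runs `command` inside `directory`."""
    # shlex.quote protects directory names that contain spaces or shell metacharacters.
    return f"cd {shlex.quote(directory)} && {command}"


cmd = in_dir("ml-ops-project/code/model-training", "ls")
print(cmd)  # cd ml-ops-project/code/model-training && ls
```

The resulting string can then be passed to `s.execute(cmd)`.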

## Set up Docker

To use common deep learning frameworks like TensorFlow or PyTorch, and ML training platforms like MLflow and Ray, we can run containers that have all the prerequisite libraries for these frameworks. Here, we will set up Docker as the container framework.

In [None]:
s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")

## Set up the NVIDIA container toolkit

We will also install the NVIDIA container toolkit, with which we can access GPUs from inside our containers.

In [None]:
s.execute("curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list")
s.execute("sudo apt update")
s.execute("sudo apt-get install -y nvidia-container-toolkit")
s.execute("sudo nvidia-ctk runtime configure --runtime=docker")
s.execute("sudo systemctl restart docker")
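
Before moving on, it can help to confirm that Docker and the GPU runtime work together. The sketch below runs a short list of checks through any callable that executes a shell command, such as `s.execute`. The helper name `run_checks` is hypothetical, and the `ubuntu` image in the last check is an assumption; any image would do, because the toolkit injects `nvidia-smi` into the container.

```python
CHECKS = [
    "docker --version",                              # Docker is installed
    "nvidia-smi",                                    # the host driver sees the GPU
    "docker run --rm --gpus all ubuntu nvidia-smi",  # a container sees the GPU
]


def run_checks(execute, commands=CHECKS):
    """Run each command through `execute`, echoing it first (hypothetical helper)."""
    for cmd in commands:
        print(f"$ {cmd}")
        execute(cmd)
```

In the notebook this would be called as `run_checks(s.execute)`. If the `docker` commands fail with a permission error, the new group membership may not have taken effect yet in the current session; prefixing them with `sudo` is one workaround.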

We can also install `nvtop` to monitor GPU usage:

In [None]:
s.execute("sudo apt update")
s.execute("sudo apt -y install nvtop")


### Build a container image - for MLflow section

Finally, we will build a container image to work in during the MLflow section. It includes:

-   a Jupyter notebook server
-   PyTorch and PyTorch Lightning
-   CUDA, which allows deep learning frameworks like PyTorch to use the NVIDIA GPU accelerator
-   and MLflow

You can see our Dockerfile for this image at: [Dockerfile.jupyter-torch-mlflow-cuda](https://github.com/teaching-on-testbeds/mltrain-chi/tree/main/docker/Dockerfile.jupyter-torch-mlflow-cuda)

Building this container may take a bit of time, but that’s OK: we can get it started and then continue to the next section while it builds in the background, since we don’t need this container immediately.

In [None]:
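# Run the build from the training directory, since each s.execute() call starts in the home directory.
# Assumption: the Dockerfile lives in ml-ops-project/code/model-training.
s.execute("cd ml-ops-project/code/model-training && docker build -t jupyter-legal_ai:latest -f DockerFile.jupyter-ray-complete .")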

Leave that cell running, and in the meantime, open an SSH session on your server. From your local terminal, run:

    ssh -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

where

-   in place of `~/.ssh/id_rsa_chameleon`, substitute the path to your own key that you uploaded to CHI@UC
-   in place of `A.B.C.D`, use the floating IP address you just associated with your instance.
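
The substitutions above can also be scripted. The sketch below builds the SSH command string and rejects a malformed address early. `ssh_command` is a hypothetical helper; the default key path and the `cc` user are taken from the command shown above.

```python
import ipaddress


def ssh_command(floating_ip, key_path="~/.ssh/id_rsa_chameleon", user="cc"):
    """Return the ssh command for the instance (hypothetical helper)."""
    # Raises ValueError if floating_ip is not a valid IPv4 or IPv6 address.
    ipaddress.ip_address(floating_ip)
    # The key path is left unquoted so the shell can still expand "~".
    return f"ssh -i {key_path} {user}@{floating_ip}"


print(ssh_command("192.0.2.10"))  # 192.0.2.10 is a documentation-only example address
```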