Make sure you are running this notebook in a uv env that has this repo installed.
The easiest way to do so is to:
# clone this repo
git clone https://github.com/DeanLight/slurm-ops
# cd into it
cd slurm-ops
# create a uv env for it
uv sync
# open this repo in vscode and select the venv you created as the kernel for the notebook

Now the import cell should run:
from slurm_ops.core import (
start_or_connect,
job_stat,
update_ssh_node_config,
find_free_port,
get_port_forwarding_command,
)

These commands print the commands that you want to put in your terminal to interact with tillicum.
The ssh_config_templates/ directory contains two files that need to go
into your ~/.ssh/:
- config - Main SSH config with hosts for klone-login, klone-node, tillicum-login, and moana. Uses ControlMaster for persistent connections so you only authenticate once.
- klone-node-config - Included by the main config for the klone-node host. Contains the ProxyJump setup so you can SSH directly to a compute node through the login node.
Before copying, edit ssh_config_templates/config and replace
deanlcs with your own username.
Then copy them into place:
# ! cp ../ssh_config_templates/* ~/.ssh/
# ! for f in config klone-node-config tillicum-node-config; do sed -i '' 's/deanlcs/<YOUR_CS_ID>/g' ~/.ssh/"$f"; done

Test that your SSH config works. The first time you connect you’ll need
to authenticate (2FA, password, etc). After that, ControlMaster keeps
the connection alive so subsequent SSH commands reuse it without
re-authenticating.
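For reference, the ControlMaster part of such a config looks roughly like the following. These values are illustrative; the actual settings and hostnames are in ssh_config_templates/config, so check there rather than copying this:

```
Host tillicum-login
    User <YOUR_CS_ID>
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist yes
```

ControlMaster auto makes the first connection open a shared socket (at ControlPath), and ControlPersist keeps it open so later ssh commands reuse it without re-authenticating.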
! ssh tillicum-login echo "Connected successfully"

Connected successfully
Running this command will get us a tmux pane with an allocation for a compute node with a single GPU for an hour. If it's already running, we will get the command to reattach to the tmux session.
job_name = "remote_dev"
slurm_host = "tillicum-login"
start_or_connect(job_name, slurm_host,
slurm_args="--qos=debug --gpus=1 --cpus-per-task=8 --mem=200G --time=01:00:00")

No running job 'remote_dev' found. Starting salloc...
Run this in your terminal:
ssh -t tillicum-login "tmux new-session -A -s remote_dev 'salloc --job-name=remote_dev --qos=debug --gpus=1 --cpus-per-task=8 --mem=200G --time=01:00:00'"
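The command above follows a simple pattern: tmux new-session -A either creates the named session or reattaches to it, and the session runs salloc so the allocation lives as long as the tmux session does. A hypothetical sketch of how such a command string could be composed (not the actual slurm_ops code):

```python
def build_salloc_cmd(job_name: str, slurm_args: str) -> str:
    # tmux's -A flag attaches if a session with this name already exists,
    # otherwise it creates a new one; salloc holds the allocation inside it.
    inner = f"salloc --job-name={job_name} {slurm_args}"
    return f"tmux new-session -A -s {job_name} '{inner}'"

print(build_salloc_cmd(
    "remote_dev",
    "--qos=debug --gpus=1 --cpus-per-task=8 --mem=200G --time=01:00:00"))
```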
This is how we can get info about our job
node, job_id = job_stat(job_name, slurm_host)

Job 'remote_dev' status on tillicum-login:
Job ID: 62526
Node: g004
Time left: 31:25
Partition: gpu-h200
Useful commands (run these on the login node or via ssh):
# Get a shell on the compute node:
ssh -t tillicum-login 'srun --jobid=62526 --overlap --pty bash'
# Cancel the job:
ssh tillicum-login 'scancel 62526'
# View job details:
ssh tillicum-login 'scontrol show job 62526'
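Under the hood, these fields can all come from squeue. As a rough illustration (the helper name and the exact field order are assumptions, not the actual slurm_ops implementation), parsing one row of squeue -h -n remote_dev -o '%i %P %N %L' output might look like:

```python
def parse_squeue_row(row: str) -> dict:
    """Parse one row of `squeue -h -o "%i %P %N %L"`-style output.

    %i = job id, %P = partition, %N = node list, %L = time left.
    """
    job_id, partition, node, time_left = row.split()
    return {
        "job_id": job_id,
        "partition": partition,
        "node": node,
        "time_left": time_left,
    }

# A row shaped like the job above:
print(parse_squeue_row("62526 gpu-h200 g004 31:25"))
```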
If we want to have a remote vscode on our node, we need to change our ssh config based on the compute node our job is on
update_ssh_node_config(job_name, host=slurm_host)

Updated /Users/deanlight/.ssh/tillicum-node-config: Hostname → g004
'g004'
! cat /Users/deanlight/.ssh/tillicum-node-config

Host tillicum-node
User deanlcs
Hostname g007
ProxyJump tillicum-login
You will note that the Hostname in the ssh config has changed.
Now, if you installed the Remote Development vscode extension:
- you can look for the command Remote-SSH: Connect to Host
- and choose to connect to tillicum-node
- it will ask for your password, which is your uwid password
- then it will spin up a vscode server on the compute node, with a new vscode window for you to use
Note: the first time you do this, it could take some time for the vscode server to install itself on your home dir in tillicum
Now, you can control the job via the tmux terminal (cancel it by exiting the terminal, etc.), and do the actual work using the vscode interface.
Ok, now that we have access to a strong compute server, we can get ready to do some gnarly ML shit.
Here are a couple of things you want to do on the compute server.
Note: since the file system is shared for all servers, all file changes you make here, including installations, will persist to other slurm jobs. They will also persist to the login node.
Home directories have limited space. Create a scratch directory and
symlink ~/.cache to it so that tools like uv and huggingface-cli
cache to the larger filesystem.
mkdir -p /gpfs/scrubbed/$USER/cache
chmod 700 /gpfs/scrubbed/$USER
ln -sf /gpfs/scrubbed/$USER/cache ~/.cache

Working with project-specific env vars is a hassle, so let's install direnv.
This allows you to set up a .envrc file in different directories; each
time you cd in or out of a directory, direnv updates your environment to
include the env vars defined in the .envrc files along your current path.
curl -sfL https://direnv.net/install.sh | bash

Now follow the config instructions it gives you. They should be something like: hook it into your shell by adding eval "$(direnv hook bash)" (or zsh) to your .bashrc or .zshrc. Finally, create a .envrc file in your project directory, add environment variables, and run direnv allow.

Set up uv:
curl -LsSf https://astral.sh/uv/install.sh | sh

This installs to ~/.local/bin. Make sure it’s on your PATH (the
installer adds it to .bashrc automatically).
Now we need to make sure that uv doesn't install big packages into our
home directory, so let's put the following into a .envrc in our home
directory.
# Add to ~/.envrc
export UV_CACHE_DIR="/gpfs/scrubbed/$USER/cache/uv"
export UV_PYTHON_INSTALL_DIR="/gpfs/scrubbed/$USER/cache/uv/python"
# Then install a Python version once:
uv python install 3.12
# and you can look at your cache dirs to see stuff was installed outside of your homedir

To make auth to GitHub more sensible, we can install the GitHub CLI.
Since we don't have apt-get style access on tillicum, we install it by
downloading the binaries and putting them in our local bin dir.
VERSION="2.86.0"
ARCH="amd64"

mkdir -p ~/.local/bin && \
cd /tmp && \
wget -q "https://github.com/cli/cli/releases/download/v${VERSION}/gh_${VERSION}_linux_${ARCH}.tar.gz" -O gh.tar.gz && \
tar -xzf gh.tar.gz && \
cp gh_${VERSION}_linux_${ARCH}/bin/gh ~/.local/bin/ && \
chmod +x ~/.local/bin/gh && \
rm -rf gh_${VERSION}_linux_${ARCH} gh.tar.gz

Then authenticate. You’ll need a personal access token:
gh auth login

Loading CUDA modules is not a one-time step: you need to run this every time you start a new job that needs cuda in tillicum. Yeah, it sucks, I agree.
module load gcc/13.4.0
module load cuda/13.0.0

When running batch jobs via shell scripts, you can put these module loads at the beginning of the script.
Many times, you’ll want to run some service on tillicum and communicate with it from your pc. A classic example is running a vllm inference server on tillicum, so you can use it for all the llm inference-time tasks in code you are developing on your laptop.
To do so, we will show an example of spinning up a vllm server on a tillicum node, and then port forwarding so you can ask for completion on your local laptop.
On your compute node, do the following:
# let's make an example project; in practice, this stands in for a repo you are developing and have cloned onto tillicum
## Project Setup
# make a new directory
project_name=vllm_worker
mkdir $project_name
cd $project_name
# set up env vars for uv to use your cache directory to install the env
project_venv_dir=${UV_CACHE_DIR}/${project_name}
mkdir -p "$project_venv_dir"
ln -sf "$project_venv_dir" .venv
# start a new uv project and add vllm to it
uv init
uv python pin 3.12
uv add vllm llguidance

Now create a .envrc file with the following content, and then allow
it using direnv allow:
export PORT=8555
export MODEL=Qwen/Qwen2.5-1.5B-Instruct
export API_KEY="123_secret_password"

And create a script called start_vllm.sh with the content:
#!/bin/bash
if [ -z "$MODEL" ]; then
echo "Error: MODEL is not set" >&2
exit 1
fi
if [ -z "$PORT" ]; then
echo "Error: PORT is not set" >&2
exit 1
fi
if [ -z "$API_KEY" ]; then
echo "Error: API_KEY is not set" >&2
exit 1
fi
module load gcc/13.4.0
module load cuda/13.0.0
vllm serve $MODEL --port $PORT --api-key $API_KEY \
--enable-prefix-caching \
--seed 42

Now, when you log in to the cluster and want to run a vllm server, cd to the project directory and run:
uv run bash start_vllm.sh

Once your vllm server is up, we need to do the port forwarding from the node to our local computer.
local_port = find_free_port(above=8000)  # in this run, it found 8555
vllm_port = 8555  # must match the PORT we set in the .envrc on tillicum
_ = get_port_forwarding_command(local_port=local_port, remote_port=vllm_port, node=node, host=slurm_host)

ssh -N -f -L 8555:g007.hyak.local:8555 tillicum-login
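If you're curious what these two helpers do, here is an illustrative sketch. This is assumed behavior, not the actual slurm_ops implementation: find_free_port can scan upward for a port it can bind locally, and the forwarding command is a plain ssh -L tunnel through the login host (the .hyak.local node domain is taken from the output above).

```python
import socket

def find_free_port_sketch(above: int = 8000, upto: int = 9000) -> int:
    """Return the first port above `above` that we can bind on localhost."""
    for port in range(above + 1, upto):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port
            except OSError:
                continue  # port already in use, try the next one
    raise RuntimeError(f"no free port in ({above}, {upto})")

def port_forward_cmd_sketch(local_port: int, remote_port: int,
                            node: str, host: str) -> str:
    """Build an `ssh -L` tunnel command: local_port -> node:remote_port via host."""
    return f"ssh -N -f -L {local_port}:{node}.hyak.local:{remote_port} {host}"

print(port_forward_cmd_sketch(8555, 8555, "g007", "tillicum-login"))
# ssh -N -f -L 8555:g007.hyak.local:8555 tillicum-login
```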
Once we have the port forwarding working, we can talk to our vllm server locally.
import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv('../.envrc') # make sure you have a similar .envrc on your local machine
api_key = os.environ['API_KEY']
model = os.environ['MODEL']
openai_api_base = f"http://localhost:{local_port}/v1"
client = OpenAI(
api_key=api_key,
base_url=openai_api_base,
)
completion = client.completions.create(
model=model,
prompt="who are you?",
)
print("Completion result:", completion)

And there you go!
To see only the repeating instructions, see the notebook
nbs/01_flow.ipynb.