## vCluster Tutorial

This tutorial shows how to launch and use the Snowflake vCluster feature. It is intended for readers with no prior experience with either Snowflake or SPCS (Snowpark Container Services).

As part of this tutorial we will:
* Set up Snowflake resources such as a database, schema, stage, and image repository
* Install the Snowflake cluster client (`spcsclusterctl`)
* Build and push Docker images to the Snowflake image repository
* Provision a vCluster and GPU compute pools
* Run jobs, observe their statuses, and run exec commands against the cluster


NOTE:

Before beginning the tutorial, make sure that you have:

* Docker installed (see https://docs.docker.com/engine/install/)
* kubectl installed (see https://kubernetes.io/docs/tasks/tools/); it is used to access the vCluster
* A working Python installation
* A Snowflake account, with a username, password, and role


In [339]:
# prereqs

# docker, version: 27.4.0
# kubectl 
# spcsclusterctl installed

# Snowflake account locator, username, password, and role


In [None]:

!pip install torchvision
!pip install snowflake-connector-python


In [341]:
# check docker version
! docker --version


Docker version 27.4.0, build bde2b89
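
The prerequisite check can also be done from Python. The following is a minimal sketch, assuming the tools are expected on `PATH`; `missing_tools` is a helper name introduced here for illustration and is not part of any Snowflake tooling.

```python
import shutil

def missing_tools(names):
    """Return the tools from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# docker and kubectl are the command-line prerequisites listed above.
missing = missing_tools(["docker", "kubectl"])
if missing:
    print(f"Missing prerequisites: {', '.join(missing)}")
else:
    print("All prerequisites found on PATH")
```

This only confirms that an executable with that name exists; it does not check versions.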


In [53]:
import os
import torchvision
import torchvision.transforms as transforms

# Snowflake connection variables
# DO NOT CHANGE
#-----
SNOWFLAKE_VCLUSTER_HOST="snowflake.prod3.us-west-2.aws.snowflakecomputing.com"
#-----

SNOWFLAKE_ACCOUNT="YOUR_LOCATOR"
SNOWFLAKE_USER="YOUR_USERNAME"
SNOWFLAKE_PASSWORD="YOUR_PASSWORD"

# Snowflake data related variables
SNOWFLAKE_DATABASE="YOUR_DB"
SNOWFLAKE_SCHEMA="YOUR_SCHEMA"
SNOWFLAKE_ROLE="YOUR_ROLE"

IMAGE_REPO_NAME='test_repo' # image repo is used to store docker images

SNOWFLAKE_DATA_STAGE="DATA_STAGE"
STAGE_PATH=f"{SNOWFLAKE_DATA_STAGE}/test-data"

# Snowflake vCluster related variables
CLUSTER_NAME="TEST10"
WORKER_INSTANCE_TYPE="GPU_NV_S"

# Trainer job resources 
TRAINER_NUM_GPUS="1"
TRAINER_NUM_CPUS="4"
TRAINER_MEM_GI="10Gi"

snowflake_path="~/.snowflake"
snowflake_abs_path=os.path.expanduser("~/.snowflake")

# -------- exporting to env variables for convenience

os.environ['SNOWFLAKE_ABS_PATH']=snowflake_abs_path

# export parameters as env variables for convenience
os.environ['IMAGE_REPO_NAME']=IMAGE_REPO_NAME

os.environ['SNOWFLAKE_DATA_STAGE']=SNOWFLAKE_DATA_STAGE
os.environ['STAGE_PATH']=STAGE_PATH

os.environ['SNOWFLAKE_VCLUSTER_HOST']=SNOWFLAKE_VCLUSTER_HOST
os.environ['SNOWFLAKE_ACCOUNT']=SNOWFLAKE_ACCOUNT
os.environ['SNOWFLAKE_USER']=SNOWFLAKE_USER
os.environ['SNOWFLAKE_PASSWORD']=SNOWFLAKE_PASSWORD

os.environ['SNOWFLAKE_DATABASE']=SNOWFLAKE_DATABASE
os.environ['SNOWFLAKE_SCHEMA']=SNOWFLAKE_SCHEMA
os.environ['SNOWFLAKE_ROLE']=SNOWFLAKE_ROLE

os.environ['CLUSTER_NAME']=CLUSTER_NAME
os.environ['WORKER_INSTANCE_TYPE']=WORKER_INSTANCE_TYPE

os.environ['TRAINER_NUM_GPUS']=TRAINER_NUM_GPUS
os.environ['TRAINER_NUM_CPUS']=TRAINER_NUM_CPUS
os.environ['TRAINER_MEM_GI']=TRAINER_MEM_GI
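
Several of the values above are placeholders that must be replaced before the rest of the notebook can run. A small check such as the one below can catch a forgotten placeholder early. This is a sketch only: `unset_placeholders` is a hypothetical helper, and it assumes every placeholder starts with `YOUR_`, as the ones above do.

```python
def unset_placeholders(settings, marker="YOUR_"):
    """Return the names of settings whose value still starts with the placeholder marker."""
    return sorted(name for name, value in settings.items() if value.startswith(marker))

# Example values only; in the notebook these would be the variables defined above.
settings = {
    "SNOWFLAKE_ACCOUNT": "YOUR_LOCATOR",
    "SNOWFLAKE_DATABASE": "TUTORIAL_DB",
    "SNOWFLAKE_PASSWORD": "YOUR_PASSWORD",
}
print(unset_placeholders(settings))  # ['SNOWFLAKE_ACCOUNT', 'SNOWFLAKE_PASSWORD']
```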


## Setting up snowflake


In [None]:
# create database and schema

import snowflake.connector

connection_parameters = {
    "account": os.environ['SNOWFLAKE_ACCOUNT'],
    "user": os.environ['SNOWFLAKE_USER'],
    "password": os.environ['SNOWFLAKE_PASSWORD'],
    "role": os.environ['SNOWFLAKE_ROLE'],
    "client_session_keep_alive": True
}

if 'SNOWFLAKE_HOST' in os.environ:
    connection_parameters['host'] = os.environ['SNOWFLAKE_HOST']

def upload_files_to_stage(local_dir: str, stage_path: str):
    """Upload every file under local_dir to the given stage path.

    Relies on the global cursor `cur` created in the `with` block below.
    All files land directly under stage_path; sub-directories are not preserved.
    """
    for root, _, files in os.walk(local_dir):
        for file in files:
            local_file_path = os.path.join(root, file)
            put_command = f"PUT 'file://{local_file_path}' @{stage_path} AUTO_COMPRESS=TRUE OVERWRITE=TRUE"
            print(f"Uploading: {local_file_path} to @{stage_path}")
            cur.execute(put_command)


with snowflake.connector.connect(**connection_parameters) as conn:
    cur = conn.cursor()
    print(cur.execute(f"CREATE DATABASE IF NOT EXISTS {SNOWFLAKE_DATABASE}").fetchall())
    print(cur.execute(f"USE DATABASE {SNOWFLAKE_DATABASE}").fetchall())
    print(cur.execute(f"CREATE SCHEMA IF NOT EXISTS {SNOWFLAKE_SCHEMA}").fetchall())
    print(cur.execute(f"USE SCHEMA {SNOWFLAKE_SCHEMA}").fetchall())
    print(cur.execute(f"CREATE IMAGE REPOSITORY IF NOT EXISTS {IMAGE_REPO_NAME}").fetchall())
    image_repos = cur.execute(f"show image repositories like '{IMAGE_REPO_NAME}'").fetchall()
    
    # repository_url is the fifth column (index 4) of SHOW IMAGE REPOSITORIES
    image_url = image_repos[0][4]
        
    # store data and train images for convenience
    DATA_IMAGE_REPO=f"{image_url}/temp/download_data:08"
    TRAIN_IMAGE_REPO=f"{image_url}/temp/train:08"
    
    os.environ['DATA_IMAGE_REPO']=DATA_IMAGE_REPO
    os.environ['TRAIN_IMAGE_REPO']=TRAIN_IMAGE_REPO

    # creating stage for dummy data
    print(cur.execute(f"CREATE STAGE IF NOT EXISTS {SNOWFLAKE_DATA_STAGE}").fetchall())

    # download data to the local machine
    dest_local_dir = "./data"
    os.makedirs(dest_local_dir, exist_ok=True)
    
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    torchvision.datasets.CIFAR10(root=dest_local_dir, train=True, download=True, transform=transform)
    torchvision.datasets.CIFAR10(root=dest_local_dir, train=False, download=True, transform=transform)


    upload_files_to_stage(f"{dest_local_dir}/cifar-10-batches-py", f"{SNOWFLAKE_DATABASE}.{SNOWFLAKE_SCHEMA}.{STAGE_PATH}")
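
`upload_files_to_stage` places every file directly under the stage path, which is sufficient for the flat `cifar-10-batches-py` directory. If a dataset with sub-directories had to be mirrored on the stage, the `PUT` statements could be built roughly as follows. This is a sketch under that assumption; `build_put_commands` is a hypothetical helper that only builds the statements and does not execute them.

```python
import os
import posixpath

def build_put_commands(local_dir, stage_path):
    """Build one PUT statement per file, mirroring sub-directories on the stage."""
    commands = []
    for root, _, files in os.walk(local_dir):
        # Path of this directory relative to local_dir ('.' for the top level).
        rel_dir = os.path.relpath(root, local_dir)
        if rel_dir == ".":
            target = stage_path
        else:
            # Stage paths always use forward slashes, whatever the local OS uses.
            target = posixpath.join(stage_path, *rel_dir.split(os.sep))
        for name in sorted(files):
            local_file = os.path.join(root, name)
            commands.append(
                f"PUT 'file://{local_file}' @{target} AUTO_COMPRESS=TRUE OVERWRITE=TRUE"
            )
    return commands
```

Each returned string could then be passed to `cur.execute(...)` in the same way as in the cell above.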
    


## Setting up vCluster
The commands below set up a vCluster in the customer account.
We use the `spcsclusterctl` program to manage Snowflake vClusters. The current release is available at https://github.com/Snowflake-Labs/spcs-templates/releases/tag/v0.0.1.

In the following sections we download `spcsclusterctl` and install it in `~/.snowflake`.


In [5]:

# retrieve the proper spcsclusterctl binary link
output=!uname -m
print(f'platform: {output}')
platform = output[0]

def get_spcscluster_link():
    # `uname -m` reports the CPU architecture, not the OS. macOS on Apple silicon
    # reports 'arm64', so this notebook treats 'arm64' as macOS and everything
    # else as Linux on amd64.
    linux_link = "https://github.com/Snowflake-Labs/spcs-templates/releases/download/v0.0.1/spcsclusterctl.linux_amd64"
    darwin_link = "https://github.com/Snowflake-Labs/spcs-templates/releases/download/v0.0.1/spcsclusterctl.darwin_amd64"

    if platform=='arm64':
        return darwin_link, 'spcsclusterctl.darwin_amd64'
    else:
        return linux_link, 'spcsclusterctl.linux_amd64'


spcscluster_link, filename = get_spcscluster_link()
os.environ['SPCSCLUSTER_LINK']=spcscluster_link
os.environ['SPCSCLUSTER_FILENAME']=filename




platform: ['arm64']
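
Because `uname -m` describes the CPU rather than the operating system, an alternative is to select the binary by OS name. The sketch below uses Python's standard `platform` module and lists only the two release assets referenced in this notebook; `select_binary` and `BINARIES` are names introduced here for illustration, and the sketch does not check the CPU architecture.

```python
import platform

# Only the two release assets referenced in this notebook are listed here.
BINARIES = {
    "Darwin": "spcsclusterctl.darwin_amd64",
    "Linux": "spcsclusterctl.linux_amd64",
}

def select_binary(system=None):
    """Return the release asset name for an OS name (defaults to the current OS)."""
    system = system or platform.system()
    try:
        return BINARIES[system]
    except KeyError:
        raise RuntimeError(f"No spcsclusterctl binary listed for {system!r}") from None

print(select_binary("Darwin"))  # spcsclusterctl.darwin_amd64
```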


In [6]:
%%bash
# download and install the spcsclusterctl binary

wget -q -P ~/.snowflake $SPCSCLUSTER_LINK
mv ~/.snowflake/$SPCSCLUSTER_FILENAME ~/.snowflake/spcsclusterctl
chmod +x ~/.snowflake/spcsclusterctl
export PATH=$SNOWFLAKE_ABS_PATH:$PATH

# allow binary execution on mac os
if [[ "$OSTYPE" == "darwin"* ]]; then
    if xattr -l ~/.snowflake/spcsclusterctl | grep -q "com.apple.quarantine"; then
        xattr -d com.apple.quarantine ~/.snowflake/spcsclusterctl
    fi    
fi



In [7]:

# exporting to local PATH for convenience
os.environ["PATH"] = f"{snowflake_abs_path}:" + os.environ["PATH"]



In [8]:
!which spcsclusterctl

/Users/aivanou/.snowflake/spcsclusterctl


In [None]:

# create cluster with $CLUSTER_NAME name
!spcsclusterctl create-cluster --cluster=$CLUSTER_NAME


In [None]:

# list existing clusters
!spcsclusterctl list-clusters


In [13]:

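# Add a compute pool (a set of nodes) to the vCluster.
# The pool name uses IPython's {expr} interpolation. Written as
# $CLUSTER_NAME_$WORKER_INSTANCE_TYPE, the shell would look up the unset variable
# `CLUSTER_NAME_`, and the name would collapse to just $WORKER_INSTANCE_TYPE.
!spcsclusterctl create-compute-pool \
    --cluster=$CLUSTER_NAME \
    --compute-pool-name={CLUSTER_NAME}_{WORKER_INSTANCE_TYPE} \
    --num-instances=1 \
    --instance-type=$WORKER_INSTANCE_TYPE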


2025/05/16 13:02:47 unable to create compute pool: 002002 (42710): SQL compilation error:
Object 'GPU_NV_S' already exists.
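
The error above names the object `GPU_NV_S`, without any cluster prefix. A pool name written as `$CLUSTER_NAME_$WORKER_INSTANCE_TYPE` would produce exactly that, because an underscore is a valid character in shell variable names and `$CLUSTER_NAME_` is read as one (unset) variable. The sketch below demonstrates the difference using `/bin/sh`; it assumes a POSIX shell is available, and `shell_expand` is a helper written for this illustration.

```python
import subprocess

# The only variables visible to the shell in this demonstration.
env = {"CLUSTER_NAME": "TEST10", "WORKER_INSTANCE_TYPE": "GPU_NV_S"}

def shell_expand(expression):
    """Expand `expression` with /bin/sh using only the variables in `env`."""
    result = subprocess.run(
        ["/bin/sh", "-c", f'printf "%s" "{expression}"'],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout

print(shell_expand("$CLUSTER_NAME_$WORKER_INSTANCE_TYPE"))      # GPU_NV_S
print(shell_expand("${CLUSTER_NAME}_${WORKER_INSTANCE_TYPE}"))  # TEST10_GPU_NV_S
```

Braces mark where each variable name ends. Inside an IPython `!` command, braces are themselves interpolated as Python expressions, so the shell form has to be written as `${{CLUSTER_NAME}}`, or the Python variables can be used directly as `{CLUSTER_NAME}_{WORKER_INSTANCE_TYPE}`.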


In [14]:

# Examine nodes with type $WORKER_INSTANCE_TYPE
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- get nodes -l snowflake.com/instance-type-name=$WORKER_INSTANCE_TYPE

# Example of the output:
# NAME                 STATUS   ROLES    AGE    VERSION
# node-10-16-113-122   Ready    <none>   121m   v1.30.2



NAME                STATUS   ROLES    AGE   VERSION
node-10-16-36-230   Ready    <none>   52s   v1.30.11


In [15]:

# Export node name that we will be using to run jobs on
NODE_HOSTNAME="node-10-16-36-230"
os.environ['NODE_HOSTNAME']=NODE_HOSTNAME
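
Instead of copying the node name by hand, it could be read from the `get nodes` output. The sketch below parses the tabular output shown above; `ready_node_names` is a hypothetical helper, and it assumes the default column layout with `NAME` first and `STATUS` second.

```python
def ready_node_names(lines):
    """Return names of nodes whose STATUS column is exactly 'Ready'."""
    names = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        if fields[1] == "Ready":
            names.append(fields[0])
    return names

sample = [
    "NAME                STATUS   ROLES    AGE   VERSION",
    "node-10-16-36-230   Ready    <none>   52s   v1.30.11",
]
print(ready_node_names(sample))  # ['node-10-16-36-230']
```

In the notebook, the input lines can be captured with the same `output = !command` pattern used earlier for `uname -m`.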


## Building and running data containers

The cells below build the `download-data` container, which copies data from the stage used above onto the node.

In [19]:

!docker login $DATA_IMAGE_REPO -u $SNOWFLAKE_USER -p $SNOWFLAKE_PASSWORD


Login Succeeded


In [45]:
!echo $DATA_IMAGE_REPO

sfengineering-xaccounttest2.registry.snowflakecomputing.com/aivanoudb/public/test_repo/temp/download_data:08


In [None]:

# Build docker image
!docker build \
 --platform linux/amd64 \
 -t $DATA_IMAGE_REPO \
 -f ./download_data/Dockerfile ./download_data


In [None]:

# push image to the Snowflake Image Registry
!docker push $DATA_IMAGE_REPO


In [55]:

# Render the pod template with the current environment variables and save it to `./download_data/pod.yaml`
!envsubst < ./download_data/pod.template.yaml > ./download_data/pod.yaml
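
`envsubst` replaces `$NAME` and `${NAME}` references in the template with values from the environment, and substitutes an empty string for any variable that is not set. Where a missing variable should fail loudly instead, Python's `string.Template` gives similar substitution with stricter behavior. The sketch below is illustrative only: the template fragment is hypothetical and is not the contents of `pod.template.yaml`.

```python
from string import Template

def render(template_text, variables):
    """Replace $NAME and ${NAME} references; raise KeyError if a variable is missing."""
    return Template(template_text).substitute(variables)

# Hypothetical template fragment, for illustration only.
template_text = "image: ${DATA_IMAGE_REPO}\nnodeName: $NODE_HOSTNAME\n"
print(render(template_text, {
    "DATA_IMAGE_REPO": "registry.example/repo/download_data:08",
    "NODE_HOSTNAME": "node-10-16-36-230",
}))
```

One difference to keep in mind: `string.Template` treats `$$` as an escaped dollar sign, which `envsubst` does not.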


In [56]:

# delete previous execution, if any (a NotFound error on the first run is harmless)
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- delete -f ./download_data/pod.yaml
# start new execution
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- apply -f ./download_data/pod.yaml


Error from server (NotFound): error when deleting "./download_data/pod.yaml": jobs.batch "download-data" not found
2025/05/16 15:35:43 exit status 1
job.batch/download-data created


In [None]:

# list current pods in vCluster
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- get pods -A

# Example of output
# NAMESPACE     NAME                       READY   STATUS      RESTARTS   AGE
# default       download-data-67kgt        1/1     Running     0          4s



In [None]:

# get logs of the job (kubectl resolves job/download-data to one of the job's pods)
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- logs job/download-data


## Building and running a dummy job



In [None]:

# Build an image
# Note: This is not actually a trainer but an example of how to run trainer-like containers that have access to GPUs and hostpaths
!docker build \
 --platform linux/amd64 \
 -t $TRAIN_IMAGE_REPO \
 -f ./trainer/Dockerfile ./trainer


In [None]:

# Push image to the image repository
!docker push $TRAIN_IMAGE_REPO


In [63]:


# Render the pod template with the current environment variables and save it to `./trainer/pod.yaml`
!envsubst < ./trainer/pod.template.yaml > ./trainer/pod.yaml

# delete the existing job, if any
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- delete -f ./trainer/pod.yaml

# create new job
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- apply -f ./trainer/pod.yaml



Error from server (NotFound): error when deleting "./trainer/pod.yaml": jobs.batch "trainer" not found
2025/05/16 15:40:18 exit status 1
job.batch/trainer created


In [None]:

!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- get pods -A

# Expected output
# NAMESPACE     NAME                       READY   STATUS      RESTARTS   AGE
# default       download-data-67kgt        0/1     Completed   0          40m
# default       trainer-f2npd              1/1     Running     0          10s


In [87]:

# Delete the trainer job (uncomment to run)
# !spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- delete -f ./trainer/pod.yaml


job.batch "trainer" deleted


In [67]:

# Get the trainer pod name by its kubectl label
OUTPUT=!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- get pod -l workload-type=trainer -o jsonpath="{.items[0].metadata.name}"
TRAINER_ID=OUTPUT[0]
os.environ['TRAINER_ID'] = TRAINER_ID

print(f"trainer id: {TRAINER_ID}")


trainer id: trainer-bjx5j
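
The jsonpath expression `.items[0]` takes the first pod that matches the label, whatever state it is in. If pods from earlier runs may still be listed, the selection could be made on the pod phase instead, using `-o json` output. The sketch below assumes the standard Kubernetes pod list structure (`items[].metadata.name`, `items[].status.phase`); `first_running_pod` is a hypothetical helper and the pod name `trainer-aaaaa` is invented for the example.

```python
import json

def first_running_pod(pod_list_json):
    """Return the name of the first pod whose phase is 'Running', or None."""
    for item in json.loads(pod_list_json).get("items", []):
        if item.get("status", {}).get("phase") == "Running":
            return item["metadata"]["name"]
    return None

# Trimmed, made-up example of `get pod -l workload-type=trainer -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "trainer-aaaaa"}, "status": {"phase": "Succeeded"}},
        {"metadata": {"name": "trainer-bjx5j"}, "status": {"phase": "Running"}},
    ]
})
print(first_running_pod(sample))  # trainer-bjx5j
```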


In [None]:

# Run nvidia-smi inside the trainer container
# NOTE: wait until the container is in the `Running` state before running this
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- exec -it $TRAINER_ID -- nvidia-smi
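
Waiting for the `Running` state can be automated with a small polling loop. The sketch below is generic: `wait_until` is a hypothetical helper, and the caller supplies the `check` function, so nothing here depends on a specific `spcsclusterctl` or kubectl invocation.

```python
import time

def wait_until(check, timeout_s=300.0, interval_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call `check()` until it returns True or `timeout_s` elapses.

    Returns True if the check succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A `check` function for this notebook could, for example, query the pod with `-o jsonpath={.status.phase}` and compare the result with `Running`.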



In [None]:

!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- exec -it $TRAINER_ID -- ls /data/v1


In [None]:

# Get trainer logs
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- logs $TRAINER_ID


## Quick iteration between local and remote environments


In [335]:

# Local directory to copy, change accordingly!
local_dir = "/Users/aivanou/code/spcs-templates/stanford_beacon/"
os.environ['LOCAL_DIR_TO_COPY']=local_dir

!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- cp $LOCAL_DIR_TO_COPY $TRAINER_ID:/app/code

!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- exec -it $TRAINER_ID -- ls /app/code


 data		 download_data.py   trainer		 '~'
 download_data	 stanford_beacon    vcluster-demo.ipynb


In [338]:

# run command remotely
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- exec -it $TRAINER_ID -- python /app/main.py


E0305 15:50:33.777110   21327 websocket.go:296] Unknown stream id 1, discarding message
Running dummy workload
iteration: 0
iteration: 1
^C
Traceback (most recent call last):
  File "/app/main.py", line 33, in <module>
    time.sleep(2)
KeyboardInterrupt
command terminated with exit code 130
2025/03/05 15:50:38 exit status 130
