## vCluster Tutorial 

This tutorial shows how to launch and use the Snowflake vCluster feature. It is aimed at readers with no prior experience with Snowflake or SPCS (Snowpark Container Services).

As part of this tutorial we will:
* Set up Snowflake resources: a database, schema, stage, and image repository
* Install the Snowflake cluster client (`spcsclusterctl`)
* Build and push Docker images to the Snowflake image repository
* Provision a vCluster and GPU compute pools
* Run jobs, observe their statuses, and run exec commands against the cluster


NOTE: Before beginning the tutorial, make sure that you have:

* Docker installed (see https://docs.docker.com/engine/install/)
* kubectl installed (see https://kubernetes.io/docs/tasks/tools/); it is used to access the vCluster
* A working Python installation
* A Snowflake account, with a username, password, and role
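
The command-line prerequisites listed above can be checked from Python before going further. The following is a minimal sketch, assuming the tools are expected on `PATH` under the names `docker` and `kubectl`; the helper name `missing_tools` is illustrative and not part of any Snowflake tooling.

```python
import shutil

# Tools the tutorial expects to find on PATH (names taken from the list above).
REQUIRED_TOOLS = ["docker", "kubectl"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print(f"Missing from PATH: {', '.join(missing)}")
else:
    print("All required tools found")
```

The check only confirms that an executable with that name exists; it says nothing about versions or whether the Docker daemon is running.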


In [339]:
# prereqs

# docker, version: 27.4.0
# kubectl 
# spcsclusterctl installed

# Snowflake account locator
# Snowflake 


In [340]:

!pip install torchvision
!pip install snowflake-connector-python




In [341]:
# check docker version
! docker --version


Docker version 27.4.0, build bde2b89


In [353]:
import os
import torchvision
import torchvision.transforms as transforms

# Snowflake connection variables
# Keep SNOWFLAKE_HOST as is; replace the YOUR_* placeholders below with your own values
SNOWFLAKE_HOST="snowflake.prod3.us-west-2.aws.snowflakecomputing.com"
SNOWFLAKE_ACCOUNT="YOUR_LOCATOR"
SNOWFLAKE_USER="YOUR_USERNAME"
SNOWFLAKE_PASSWORD="YOUR_PASSWORD"

# Snowflake data related variables
SNOWFLAKE_DATABASE="YOUR_DB"
SNOWFLAKE_SCHEMA="YOUR_SCHEMA"
SNOWFLAKE_ROLE="YOUR_ROLE"

IMAGE_REPO_NAME='test_repo' # image repo is used to store docker images

SNOWFLAKE_DATA_STAGE="DATA_STAGE"
STAGE_PATH=f"{SNOWFLAKE_DATA_STAGE}/test-data"

# Snowflake vCluster related variables
CLUSTER_NAME="TEST10"
WORKER_INSTANCE_TYPE="GPU_NV_S"

# Trainer job resources 
TRAINER_NUM_GPUS="1"
TRAINER_NUM_CPUS="4"
TRAINER_MEM_GI="10Gi"

snowflake_path="~/.snowflake"
snowflake_abs_path=os.path.expanduser("~/.snowflake")

# -------- exporting to env variables for convenience

os.environ['SNOWFLAKE_ABS_PATH']=snowflake_abs_path

# export parameters as env variables for convenience
os.environ['IMAGE_REPO_NAME']=IMAGE_REPO_NAME

os.environ['SNOWFLAKE_DATA_STAGE']=SNOWFLAKE_DATA_STAGE
os.environ['STAGE_PATH']=STAGE_PATH

os.environ['SNOWFLAKE_HOST']=SNOWFLAKE_HOST
os.environ['SNOWFLAKE_ACCOUNT']=SNOWFLAKE_ACCOUNT
os.environ['SNOWFLAKE_USER']=SNOWFLAKE_USER
os.environ['SNOWFLAKE_PASSWORD']=SNOWFLAKE_PASSWORD

os.environ['SNOWFLAKE_DATABASE']=SNOWFLAKE_DATABASE
os.environ['SNOWFLAKE_SCHEMA']=SNOWFLAKE_SCHEMA
os.environ['SNOWFLAKE_ROLE']=SNOWFLAKE_ROLE

os.environ['CLUSTER_NAME']=CLUSTER_NAME
os.environ['WORKER_INSTANCE_TYPE']=WORKER_INSTANCE_TYPE

os.environ['TRAINER_NUM_GPUS']=TRAINER_NUM_GPUS
os.environ['TRAINER_NUM_CPUS']=TRAINER_NUM_CPUS
os.environ['TRAINER_MEM_GI']=TRAINER_MEM_GI
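
Because the configuration cell ships with `YOUR_...` placeholders, it can help to fail early when one of them was left unchanged. The following is a small sketch of such a check; the function name `unset_placeholders` and the rule "empty or starting with `YOUR_`" are assumptions made for illustration, and the sample dictionary is made up.

```python
def unset_placeholders(settings):
    """Return the names whose values are empty or still look like YOUR_* placeholders."""
    return sorted(
        name
        for name, value in settings.items()
        if not value or value.startswith("YOUR_")
    )


# Hypothetical sample values; in the notebook these would come from the variables above.
example = {
    "SNOWFLAKE_ACCOUNT": "YOUR_LOCATOR",
    "SNOWFLAKE_USER": "alice",
    "SNOWFLAKE_ROLE": "",
}
print(unset_placeholders(example))
```

In the notebook the same function could be called with a dictionary built from `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, and the other variables before opening a connection.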


## Setting up Snowflake


In [345]:
# create database and schema

import snowflake.connector

connection_parameters = {
    "account": os.environ['SNOWFLAKE_ACCOUNT'],
    "host": os.environ['SNOWFLAKE_HOST'],
    "user": os.environ['SNOWFLAKE_USER'],
    "password": os.environ['SNOWFLAKE_PASSWORD'],
    "role": os.environ['SNOWFLAKE_ROLE'],
    "client_session_keep_alive": True
}

def upload_files_to_stage(local_dir:str, stage_path:str):
    """Upload every file under local_dir to the stage path with PUT.

    Uses the module-level cursor `cur`, so call it only while the connection below is open.
    All files land directly under stage_path; subdirectories are not preserved.
    """
    for root, _, files in os.walk(local_dir):
        for file in files:
            local_file_path = os.path.join(root, file)

            put_command = f"PUT 'file://{local_file_path}' @{stage_path} AUTO_COMPRESS=TRUE OVERWRITE=TRUE"
            print(f"Uploading: {local_file_path} to @{stage_path}")
            cur.execute(put_command)


with snowflake.connector.connect(**connection_parameters) as conn:
    cur = conn.cursor()
    print(cur.execute(f"CREATE DATABASE IF NOT EXISTS {SNOWFLAKE_DATABASE}").fetchall())
    print(cur.execute(f"USE DATABASE {SNOWFLAKE_DATABASE}").fetchall())
    print(cur.execute(f"CREATE SCHEMA IF NOT EXISTS {SNOWFLAKE_SCHEMA}").fetchall())
    print(cur.execute(f"USE SCHEMA {SNOWFLAKE_SCHEMA}").fetchall())
    print(cur.execute(f"CREATE IMAGE REPOSITORY IF NOT EXISTS {IMAGE_REPO_NAME}").fetchall())
    image_repos = cur.execute(f"show image repositories like '{IMAGE_REPO_NAME}'").fetchall()
    
    # the repository URL is the fifth column (index 4) of the SHOW IMAGE REPOSITORIES output
    image_url = image_repos[0][4]
        
    # store data and train images for convenience
    DATA_IMAGE_REPO=f"{image_url}/temp/download_data:08"
    TRAIN_IMAGE_REPO=f"{image_url}/temp/train:08"
    
    os.environ['DATA_IMAGE_REPO']=DATA_IMAGE_REPO
    os.environ['TRAIN_IMAGE_REPO']=TRAIN_IMAGE_REPO

    # creating stage for dummy data
    print(cur.execute(f"CREATE STAGE IF NOT EXISTS  {SNOWFLAKE_DATA_STAGE} ").fetchall())

    # download data to the local machine
    dest_local_dir = "./data"
    os.makedirs(dest_local_dir, exist_ok=True)
    
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    torchvision.datasets.CIFAR10(root=dest_local_dir, train=True, download=True, transform=transform)
    torchvision.datasets.CIFAR10(root=dest_local_dir, train=False, download=True, transform=transform)


    upload_files_to_stage(f"{dest_local_dir}/cifar-10-batches-py", f"{SNOWFLAKE_DATABASE}.{SNOWFLAKE_SCHEMA}.{STAGE_PATH}")
    


[('AIVANOUDB already exists, statement succeeded.',)]
[('Statement executed successfully.',)]
[('PUBLIC already exists, statement succeeded.',)]
[('Statement executed successfully.',)]
[('TEST_REPO already exists, statement succeeded.',)]
[('DATA_STAGE already exists, statement succeeded.',)]
Files already downloaded and verified
Files already downloaded and verified
Uploading: ./data/cifar-10-batches-py/data_batch_1 to @AIVANOUDB.PUBLIC.DATA_STAGE/test-data
Uploading: ./data/cifar-10-batches-py/readme.html to @AIVANOUDB.PUBLIC.DATA_STAGE/test-data
Uploading: ./data/cifar-10-batches-py/batches.meta to @AIVANOUDB.PUBLIC.DATA_STAGE/test-data
Uploading: ./data/cifar-10-batches-py/data_batch_2 to @AIVANOUDB.PUBLIC.DATA_STAGE/test-data
Uploading: ./data/cifar-10-batches-py/data_batch_5 to @AIVANOUDB.PUBLIC.DATA_STAGE/test-data
Uploading: ./data/cifar-10-batches-py/test_batch to @AIVANOUDB.PUBLIC.DATA_STAGE/test-data
Uploading: ./data/cifar-10-batches-py/data_batch_4 to @AIVANOUDB.PUBLIC.DAT
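
The setup cell above reads the repository URL by position (`image_repos[0][4]`), which breaks silently if the column order ever changes. A minimal sketch of a name-based lookup follows. It assumes the cursor exposes a DB-API style `description` whose entries start with the column name, and that the URL column is called `repository_url`; both are worth checking against your own `SHOW IMAGE REPOSITORIES` output. The sample description and row are made up.

```python
def rows_as_dicts(description, rows):
    """Pair each row with lower-cased column names taken from a DB-API description."""
    names = [column[0].lower() for column in description]
    return [dict(zip(names, row)) for row in rows]


# Hypothetical stand-ins for cur.description and cur.fetchall().
sample_description = [("name",), ("database_name",), ("repository_url",)]
sample_rows = [("TEST_REPO", "MY_DB", "example.registry.invalid/my_db/public/test_repo")]

repos = rows_as_dicts(sample_description, sample_rows)
print(repos[0]["repository_url"])
```

In the notebook this would be called as `rows_as_dicts(cur.description, image_repos)` right after the `show image repositories` statement.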

## Setting up vCluster
The commands below set up a vCluster in the customer account.
We use the `spcsclusterctl` program to manage Snowflake vClusters. The current release is available at https://github.com/Snowflake-Labs/spcs-templates/releases/tag/v0.0.1.

In the following cells we download `spcsclusterctl` and install it into `~/.snowflake`.


In [346]:

# Pick the spcsclusterctl binary for this machine
output=!uname -m
print(f'platform: {output}')
platform = output[0]

def get_spcscluster_link():
    linux_link = "https://github.com/Snowflake-Labs/spcs-templates/releases/download/v0.0.1/spcsclusterctl.linux_amd64"
    darwin_link = "https://github.com/Snowflake-Labs/spcs-templates/releases/download/v0.0.1/spcsclusterctl.darwin_amd64"

    # `uname -m` prints arm64 on Apple Silicon Macs; every other machine is
    # treated as Linux x86_64, so on an Intel Mac choose darwin_link by hand.
    if platform=='arm64':
        return darwin_link, 'spcsclusterctl.darwin_amd64'
    else:
        return linux_link, 'spcsclusterctl.linux_amd64'


spcscluster_link, filename = get_spcscluster_link()
os.environ['SPCSCLUSTER_LINK']=spcscluster_link
os.environ['SPCSCLUSTER_FILENAME']=filename




platform: ['arm64']
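
The cell above chooses the binary from `uname -m` alone. A slightly stricter sketch using Python's standard `platform` module is shown below. It looks at both the operating system and the CPU architecture and raises an error for combinations it does not know. The mapping only covers the two release assets named in this tutorial, and sending Apple Silicon (`arm64`) to the `darwin_amd64` build mirrors what the cell above does rather than anything documented by the release.

```python
import platform

# (operating system, machine) -> release asset named in this tutorial.
ASSETS = {
    ("Linux", "x86_64"): "spcsclusterctl.linux_amd64",
    ("Darwin", "x86_64"): "spcsclusterctl.darwin_amd64",
    ("Darwin", "arm64"): "spcsclusterctl.darwin_amd64",  # same choice as the cell above
}


def pick_asset(system, machine):
    """Return the asset file name for a platform, or raise ValueError if unknown."""
    try:
        return ASSETS[(system, machine)]
    except KeyError:
        raise ValueError(f"no known spcsclusterctl asset for {system}/{machine}") from None


try:
    print(pick_asset(platform.system(), platform.machine()))
except ValueError as err:
    print(err)
```

Failing loudly on an unknown platform avoids downloading a binary that cannot run.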


In [347]:
%%bash
#download and install spcscluster binary

wget -q -P ~/.snowflake $SPCSCLUSTER_LINK
mv ~/.snowflake/$SPCSCLUSTER_FILENAME ~/.snowflake/spcsclusterctl
chmod +x ~/.snowflake/spcsclusterctl
export PATH=$SNOWFLAKE_ABS_PATH:$PATH

# allow binary execution on mac os
if [[ "$OSTYPE" == "darwin"* ]]; then
    if xattr -l ~/.snowflake/spcsclusterctl | grep -q "com.apple.quarantine"; then
        xattr -d com.apple.quarantine ~/.snowflake/spcsclusterctl
    fi    
fi



In [348]:

# exporting to local PATH for convenience
os.environ["PATH"] = f"{snowflake_abs_path}:" + os.environ["PATH"]



In [349]:
!which spcsclusterctl

/Users/aivanou/.snowflake/spcsclusterctl


In [350]:

# create cluster with $CLUSTER_NAME name
!spcsclusterctl create-cluster --cluster=$CLUSTER_NAME


2025/03/05 15:53:54 unable to create cluster: 002002 (42710): SQL compilation error:
Object 'TEST10' already exists.


In [351]:

# list existing clusters
!spcsclusterctl list-clusters


TEST10                                 RUNNING wss://bvb4q26h-sfengineering-xaccounttest2.snowflakecomputing.app/proxy-connect


In [306]:

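# Add a compute pool (a set of nodes) to the vCluster.
# The pool name uses IPython's {var} interpolation. Written as $CLUSTER_NAME_$WORKER_INSTANCE_TYPE,
# the shell looks up the undefined variable CLUSTER_NAME_ and the name collapses to just
# the instance type (which is why the recorded output below shows a pool named GPU_NV_S).
!spcsclusterctl create-compute-pool \
    --cluster=$CLUSTER_NAME \
    --compute-pool-name={CLUSTER_NAME}_{WORKER_INSTANCE_TYPE} \
    --num-instances=1 \
    --instance-type=$WORKER_INSTANCE_TYPE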


2025/03/05 15:22:12 Waiting for compute pool GPU_NV_S to become IDLE OR ACTIVE, currently STARTING, 5s...
2025/03/05 15:22:17 Waiting for compute pool GPU_NV_S to become IDLE OR ACTIVE, currently STARTING, 10s...
2025/03/05 15:22:22 Waiting for compute pool GPU_NV_S to become IDLE OR ACTIVE, currently STARTING, 15s...
2025/03/05 15:22:27 Waiting for compute pool GPU_NV_S to become IDLE OR ACTIVE, currently STARTING, 20s...
2025/03/05 15:22:32 Waiting for compute pool GPU_NV_S to become IDLE OR ACTIVE, currently STARTING, 25s...
2025/03/05 15:22:38 Waiting for compute pool GPU_NV_S to become IDLE OR ACTIVE, currently STARTING, 31s...
2025/03/05 15:22:42 Waiting for compute pool GPU_NV_S to become IDLE OR ACTIVE, currently STARTING, 35s...
2025/03/05 15:22:47 Waiting for compute pool GPU_NV_S to become IDLE OR ACTIVE, currently STARTING, 40s...
2025/03/05 15:22:52 Waiting for compute pool GPU_NV_S to become IDLE OR ACTIVE, currently STARTING, 45s...
2025/03/05 15:22:57 Waiting for comput
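
The log above shows the client checking the compute pool state every few seconds until it becomes IDLE or ACTIVE. The same wait-with-timeout pattern is useful in your own cells, for example when waiting for a pod to start. The sketch below is a generic polling helper, not the implementation used by `spcsclusterctl`; the name `wait_until` and its parameters are illustrative.

```python
import time


def wait_until(check, timeout_s=300.0, interval_s=5.0):
    """Call check() every interval_s seconds until it returns True or timeout_s elapses.

    Returns True if the condition was met, False on timeout.
    """
    start = time.monotonic()
    while True:
        if check():
            return True
        if time.monotonic() - start >= timeout_s:
            return False
        time.sleep(interval_s)


# Simulated sequence of pool states standing in for a real status query.
states = iter(["STARTING", "STARTING", "IDLE"])
ready = wait_until(lambda: next(states) in {"IDLE", "ACTIVE"}, timeout_s=1.0, interval_s=0.01)
print(ready)
```

In practice `check` would wrap whatever status query you use, and the timeout keeps a stuck resource from blocking the notebook indefinitely.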

In [352]:

# Examine nodes with type $WORKER_INSTANCE_TYPE
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- get nodes -l snowflake.com/instance-type-name=$WORKER_INSTANCE_TYPE

# Example of the output:
# NAME                 STATUS   ROLES    AGE    VERSION
# node-10-16-113-122   Ready    <none>   121m   v1.30.2



NAME               STATUS   ROLES    AGE   VERSION
node-10-16-22-87   Ready    <none>   30m   v1.30.2


In [309]:

# Export node name that we will be using to run jobs on
NODE_HOSTNAME="node-10-16-22-87"
os.environ['NODE_HOSTNAME']=NODE_HOSTNAME
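
The node name above is copied by hand from the `get nodes` output. As a sketch of how that step could be automated, the function below parses the default table layout printed by `kubectl get nodes` and returns the names of nodes whose status is exactly `Ready`. The function name is illustrative, and the sample text is the output recorded earlier in this tutorial.

```python
def ready_node_names(kubectl_output):
    """Return names of nodes reported as Ready in `kubectl get nodes` table output."""
    rows = [line.split() for line in kubectl_output.strip().splitlines() if line.strip()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    return [
        row[name_idx]
        for row in body
        if len(row) > max(name_idx, status_idx) and row[status_idx] == "Ready"
    ]


sample = """
NAME               STATUS   ROLES    AGE   VERSION
node-10-16-22-87   Ready    <none>   30m   v1.30.2
"""
print(ready_node_names(sample))
```

In the notebook, the text could come from `output = !spcsclusterctl kubectl ...` followed by `"\n".join(output)`, in the same way the trainer pod name is captured later on.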


## Building and running data containers

The cells below build the `download-data` container, which downloads data from the stage populated above onto the node.

In [311]:

# Build docker image
!docker build \
 --platform linux/amd64 \
 -t $DATA_IMAGE_REPO \
 -f ./download_data/Dockerfile ./download_data


[1A[1B[0G[?25l[+] Building 0.0s (0/0)  docker:desktop-linux
[?25h[1A[0G[?25l[+] Building 0.0s (0/0)  docker:desktop-linux
[?25h[1A[0G[?25l[+] Building 0.0s (0/1)                                    docker:desktop-linux
[?25h[1A[0G[?25l[+] Building 0.2s (1/2)                                    docker:desktop-linux
[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 376B                                       0.0s
[0m => [internal] load metadata for docker.io/library/python:3.10.16-bullsey  0.2s
[?25h[1A[1A[1A[1A[0G[?25l[+] Building 0.3s (1/2)                                    docker:desktop-linux
[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 376B                                       0.0s
[0m => [internal] load metadata for docker.io/library/python:3.10.16-bullsey  0.3s
[?25h[1A[1A[1A[1A[0G[?25l[+] Buildi

In [312]:

# push image to the Snowflake Image Registry
!docker push $DATA_IMAGE_REPO


The push refers to repository [sfengineering-xaccounttest2.registry.snowflakecomputing.com/aivanoudb/public/test_repo/temp/download_data]

60a6aa12: Preparing
329d7ea5: Preparing
dc793194: Preparing
257de18f: Preparing
99b5e532: Preparing
d9368c48: Preparing
3388e560: Preparing
d9f33740: Preparing
9ede331e: Preparing
8371eb4d: Preparing
7edfab12: Preparing
59c16281: Preparing
8371eb4d: Pushed

In [313]:

# Fill the pod template with environment variables and save it to `./download_data/pod.yaml`
!envsubst < ./download_data/pod.template.yaml > ./download_data/pod.yaml
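
`envsubst` replaces unset variables with an empty string, so a missing variable can produce a broken manifest without any error. The sketch below shows a stricter alternative using `string.Template` from the Python standard library, which raises `KeyError` when a referenced variable is missing. The template text here is a made-up fragment for illustration; the real `pod.template.yaml` is not shown in this tutorial. Note also that `string.Template` treats `$$` as an escaped dollar sign, which differs from `envsubst`.

```python
from string import Template


def render_template(text, variables):
    """Substitute $NAME and ${NAME} in text; raise KeyError if a name is missing."""
    return Template(text).substitute(variables)


# Hypothetical template fragment and values, for illustration only.
template_text = "image: ${TRAIN_IMAGE_REPO}\nnodeName: $NODE_HOSTNAME\n"
values = {
    "TRAIN_IMAGE_REPO": "example.registry.invalid/repo/train:08",
    "NODE_HOSTNAME": "node-10-16-22-87",
}
print(render_template(template_text, values))
```

Passing `os.environ` as `variables` gives behavior close to the `envsubst` call above, with the difference that a forgotten variable stops the cell instead of yielding an empty field.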


In [314]:

# Delete the previous execution, if any (a NotFound error on the first run is expected and harmless)
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- delete -f ./download_data/pod.yaml
# Start a new execution
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- apply -f ./download_data/pod.yaml


Error from server (NotFound): error when deleting "./download_data/pod.yaml": jobs.batch "download-data-pod" not found
2025/03/05 15:25:31 exit status 1
job.batch/download-data-pod created


In [317]:

# list current pods in vCluster
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- get pods -A

# Example of output
# NAMESPACE     NAME                       READY   STATUS      RESTARTS   AGE
# default       download-data-67kgt        1/1     Running     0          4s



NAMESPACE     NAME                       READY   STATUS      RESTARTS   AGE
default       download-data-pod-5k68k    0/1     Completed   0          59s
kube-system   coredns-85c984886d-79zds   1/1     Running     0          22m


In [318]:

# Get the logs of the job's pod (use the pod name printed by `get pods` above)
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- logs download-data-pod-5k68k


Downloading data from DATA_STAGE/test-data to /data/v1
Executing SQL: GET @DATA_STAGE/test-data file:///data/v1/
[('batches.meta', 158, 'DOWNLOADED', ''), ('batches.meta.gz', 172, 'DOWNLOADED', ''), ('data_batch_1', 31035704, 'DOWNLOADED', ''), ('data_batch_1.gz', 28299022, 'DOWNLOADED', ''), ('data_batch_2', 31035320, 'DOWNLOADED', ''), ('data_batch_2.gz', 28336334, 'DOWNLOADED', ''), ('data_batch_3', 31035999, 'DOWNLOADED', ''), ('data_batch_3.gz', 28350244, 'DOWNLOADED', ''), ('data_batch_4', 31035696, 'DOWNLOADED', ''), ('data_batch_4.gz', 28339657, 'DOWNLOADED', ''), ('data_batch_5', 31035623, 'DOWNLOADED', ''), ('data_batch_5.gz', 28321928, 'DOWNLOADED', ''), ('readme.html', 88, 'DOWNLOADED', ''), ('readme.html.gz', 121, 'DOWNLOADED', ''), ('test_batch', 31035526, 'DOWNLOADED', ''), ('test_batch.gz', 28324909, 'DOWNLOADED', '')]
Successfully downloaded files to: /data/v1 in 5.2539403438568115 seconds


## Building and running a dummy job



In [319]:

# Build an image
# Note: This is not actually a trainer but an example of how to run trainer-like containers that have access to GPUs and hostpaths
!docker build \
 --platform linux/amd64 \
 -t $TRAIN_IMAGE_REPO \
 -f ./trainer/Dockerfile ./trainer


[1A[1B[0G[?25l[+] Building 0.0s (0/0)  docker:desktop-linux
[?25h[1A[0G[?25l[+] Building 0.0s (0/0)  docker:desktop-linux
[?25h[1A[0G[?25l[+] Building 0.0s (0/1)                                    docker:desktop-linux
[?25h[1A[0G[?25l[+] Building 0.2s (1/2)                                    docker:desktop-linux
[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 291B                                       0.0s
[0m => [internal] load metadata for nvcr.io/nvidia/pytorch:23.08-py3          0.2s
[?25h[1A[1A[1A[1A[0G[?25l[+] Building 0.3s (1/2)                                    docker:desktop-linux
[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 291B                                       0.0s
[0m => [internal] load metadata for nvcr.io/nvidia/pytorch:23.08-py3          0.3s
[?25h[1A[1A[1A[1A[0G[?25l[+] Buildi

In [320]:

# Push image to the image repository
!docker push $TRAIN_IMAGE_REPO


The push refers to repository [sfengineering-xaccounttest2.registry.snowflakecomputing.com/aivanoudb/public/test_repo/temp/train]

6a5c0c0a: Preparing
c532dc3a: Preparing
d40829a3: Preparing
f06889a8: Preparing
de957bf9: Preparing
d72b366b: Preparing
d60948c2: Preparing
c62aa077: Preparing
f8e11cd2: Preparing
cc44fb7f: Preparing
e47f9c7b: Preparing
8d31a53c: Preparing
9aa07ba1: Preparing
ee369218: Preparing
fcec3069: Preparing
d969b81e: Preparing
b8c94866: Preparing
bf18a086: Preparing
5dd672e1: Preparing
ed85754c: Preparing
c5d7c0ee: Preparing
ce70c059: Preparing
ce6da356: Preparing
fc451872: Preparing
96e1b637: Preparing
7c3d5c98: Preparing
a808a6ab: Preparing
5b39fd80: Preparing
0f687071: Preparing
c62aa077: Waiting
f8e11cd2: Waiting
fcec3069: Waiting
5dd672e1: Waiting
ed85754c: Waiting
c5d7c0ee: Waiti

In [321]:


# Fill the pod template with environment variables and save it to `./trainer/pod.yaml`
!envsubst < ./trainer/pod.template.yaml > ./trainer/pod.yaml

# delete existing job if exists
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- delete -f ./trainer/pod.yaml

# create new job
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- apply -f ./trainer/pod.yaml



Error from server (NotFound): error when deleting "./trainer/pod.yaml": jobs.batch "trainer" not found
2025/03/05 15:33:11 exit status 1
job.batch/trainer created


In [327]:

!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- get pods -A

# Expected output
# NAMESPACE     NAME                       READY   STATUS      RESTARTS   AGE
# default       download-data-67kgt        0/1     Completed   0          40m
# default       trainer-f2npd              1/1     Running     0          10s


NAMESPACE     NAME                       READY   STATUS      RESTARTS   AGE
default       download-data-pod-5k68k    0/1     Completed   0          15m
default       trainer-sb798              1/1     Running     0          8m15s
kube-system   coredns-85c984886d-79zds   1/1     Running     0          37m


In [87]:

# Delete the trainer job (uncomment to run)
# !spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- delete -f ./trainer/pod.yaml


job.batch "trainer" deleted


In [328]:

# Get trainer based on kubectl labels
OUTPUT=!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- get pod -l workload-type=trainer -o jsonpath="{.items[0].metadata.name}"
TRAINER_ID=OUTPUT[0]
os.environ['TRAINER_ID'] = TRAINER_ID

print(f"trainer id: {TRAINER_ID}")


trainer id: trainer-sb798


In [329]:

# Run the nvidia-smi command inside the trainer container
# NOTE: wait until the container is in the `Running` state before running this
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- exec -it $TRAINER_ID -- nvidia-smi



E0305 15:41:37.464117   17196 websocket.go:296] Unknown stream id 1, discarding message
Wed Mar  5 23:41:37 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   17C    P8              9W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+---------

In [330]:

!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- exec -it $TRAINER_ID -- ls /data/v1


batches.meta	 data_batch_2	  data_batch_4	   readme.html
batches.meta.gz  data_batch_2.gz  data_batch_4.gz  readme.html.gz
data_batch_1	 data_batch_3	  data_batch_5	   test_batch
data_batch_1.gz  data_batch_3.gz  data_batch_5.gz  test_batch.gz


In [331]:

# Get trainer logs
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- logs $TRAINER_ID


## Quick iteration between local and remote environments


In [335]:

# Local directory to copy, change accordingly!
local_dir = "/Users/aivanou/code/spcs-templates/stanford_beacon/"
os.environ['LOCAL_DIR_TO_COPY']=local_dir

# Copy the local directory into the trainer pod
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- cp $LOCAL_DIR_TO_COPY $TRAINER_ID:/app/code

# List the copied files; the second `--` separates the command from kubectl's own arguments
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- exec -it $TRAINER_ID -- ls /app/code



 data		 download_data.py   trainer		 '~'
 download_data	 stanford_beacon    vcluster-demo.ipynb


In [338]:

# Run a command remotely
!spcsclusterctl kubectl --cluster=$CLUSTER_NAME -- exec -it $TRAINER_ID -- python /app/main.py


E0305 15:50:33.777110   21327 websocket.go:296] Unknown stream id 1, discarding message
Running dummy workload
iteration: 0
iteration: 1
^C
Traceback (most recent call last):
  File "/app/main.py", line 33, in <module>
    time.sleep(2)
KeyboardInterrupt
command terminated with exit code 130
2025/03/05 15:50:38 exit status 130
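
When the remote commands above move from notebook cells into a script, building the argument list explicitly avoids shell quoting problems. The following is a minimal sketch that assembles the same `spcsclusterctl kubectl ... exec` invocation used in this tutorial; the helper name `exec_argv` is illustrative, and only flags that already appear above are used.

```python
import shlex


def exec_argv(cluster, pod, command):
    """Build the argument list for running `command` inside `pod` on `cluster`."""
    return [
        "spcsclusterctl", "kubectl", f"--cluster={cluster}",
        "--", "exec", pod,
        "--", *command,
    ]


argv = exec_argv("TEST10", "trainer-sb798", ["python", "/app/main.py"])
# shlex.join renders the list as a single shell-safe string for display.
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)`, which raises an exception if the remote command exits with a non-zero status.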
