<a href="https://colab.research.google.com/github/SattamAltwaim/SaSOKE/blob/main/notebooks/2_train_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SOKE Stage 1: Train Decoupled Tokenizer (DETO)
Trains VQ-VAE models to discretize continuous sign language poses into tokens.
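
To make "discretize into tokens" concrete, here is a toy sketch of the nearest-codebook lookup at the heart of vector quantization. It is not the repository's DETO implementation: the `quantize` helper, the 2-D vectors, and the three-entry codebook are all made up for illustration, and the encoder that would normally produce the continuous vectors is omitted.

```python
def quantize(frames, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry
        dists = [sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

# Hypothetical codebook and encoded pose features (illustrative values only)
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.4]]

tokens = quantize(frames, codebook)
print(tokens)  # one integer token per frame
```

In a trained VQ-VAE the codebook is learned jointly with the encoder and decoder; the lookup step itself stays this simple.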


In [None]:
# Clone repo if not already present
import os
if not os.path.exists('/content/SaSOKE'):
    !git clone https://github.com/SattamAltwaim/SaSOKE.git
%cd /content/SaSOKE

# Mount Drive for data
from google.colab import drive
drive.mount('/content/drive')

drive_data = '/content/drive/MyDrive/GraduationProject/CodeFiles/SaSOKE'
print("Code repo:", os.getcwd())
print("Data location:", drive_data)


## Configuration Setup


In [None]:
# Update config paths for Colab/CUDA environment
import yaml

with open('configs/deto.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Update for CUDA
config['ACCELERATOR'] = 'gpu'
config['DEVICE'] = [0]

# Point to Drive for data/models
config['DATASET']['H2S']['ROOT'] = f'{drive_data}/data/How2Sign'
config['DATASET']['H2S']['MEAN_PATH'] = f'{drive_data}/smpl-x/mean.pt'
config['DATASET']['H2S']['STD_PATH'] = f'{drive_data}/smpl-x/std.pt'

# Reduce workers for Colab
config['TRAIN']['NUM_WORKERS'] = 2

# Save updated config
with open('configs/deto_colab.yaml', 'w') as f:
    yaml.dump(config, f, sort_keys=False)  # keep the original key order

print("Config updated - code from GitHub, data from Drive")
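
Before launching a long training run, it can help to confirm that the Drive paths written into the config actually exist. The helper below is a minimal sketch; `missing_paths` is a hypothetical name, not part of the SaSOKE code, and the example paths are placeholders so the block runs on its own.

```python
import os

def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Standalone example: the current directory exists, the second path should not
example = missing_paths([os.getcwd(), os.path.join(os.getcwd(), "no_such_dir_12345")])
print("Missing:", example)
```

In this notebook, the same check could be applied to `config['DATASET']['H2S']['ROOT']`, `config['DATASET']['H2S']['MEAN_PATH']`, and `config['DATASET']['H2S']['STD_PATH']`; an empty result means all three were found.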


## Train Tokenizer


In [None]:
# Start training
!python -m train --cfg configs/deto_colab.yaml --nodebug


## Monitor Training
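Check the TensorBoard logs in `experiments/mgpt/DETO/`, or use Weights & Biases (W&B) if configured. The training cell blocks until it finishes, so run the cell below before starting training if you want to watch progress live.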


In [None]:
# Load the TensorBoard extension and point it at the DETO logs
%load_ext tensorboard
%tensorboard --logdir experiments/mgpt/DETO/


## Copy Checkpoint
After training, copy the final checkpoint (`last.ckpt`) to Drive, where the SOKE training stage expects to find the tokenizer.


In [None]:
# Copy checkpoint to Drive
# Paths are quoted so the commands still work if the Drive path contains spaces
!mkdir -p "{drive_data}/checkpoints/vae"
!cp experiments/mgpt/DETO/checkpoints/last.ckpt "{drive_data}/checkpoints/vae/tokenizer.ckpt"
print(f"Tokenizer saved to {drive_data}/checkpoints/vae/tokenizer.ckpt")
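
The shell copy above prints a success message even if `last.ckpt` is missing. As an alternative, the sketch below does the same copy in Python and fails loudly when the source file is absent. `copy_checkpoint` is a hypothetical helper, not part of the repository, and the paths in the commented usage are the ones assumed by this notebook.

```python
import os
import shutil

def copy_checkpoint(src, dst):
    """Copy `src` to `dst`, creating the destination folder; raise if `src` is missing."""
    if not os.path.isfile(src):
        raise FileNotFoundError(f"Checkpoint not found: {src}")
    dst_dir = os.path.dirname(dst)
    if dst_dir:
        os.makedirs(dst_dir, exist_ok=True)
    shutil.copy2(src, dst)
    return dst

# Usage in this notebook (assumes `drive_data` is defined as in the first cell):
# copy_checkpoint('experiments/mgpt/DETO/checkpoints/last.ckpt',
#                 f'{drive_data}/checkpoints/vae/tokenizer.ckpt')
```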
