## 3. Run Model Training on Ray Cluster

This section runs the `legal_bert_triplet_finetune_a100.py` training script as a job on the Ray cluster.


### Step 3.1: Navigate to Docker Training Directory

The following command changes the current directory to `~/ml-ops-project/code/model-training/docker_training/`. This directory contains the `docker-compose-ray-cuda.yaml` file needed to start the Ray cluster, and possibly other Docker-related files.

In [None]:
# Run this in CHI@UC GPU node terminal
cd ~/ml-ops-project/code/model-training/docker_training/

### Step 3.2: Ensure Clean Docker Environment

Before starting the Ray cluster, stop any Docker services that could conflict with it.
1. If a local MLflow stack (from `docker-compose-mlflow.yaml`) is running on this node for testing, stop it now, as we'll be using the centralized KVM@TACC MLflow.
2. Stop any other standalone Docker containers (e.g., a `legalai_env` container) that might use the GPU or conflict with ports.

In [2]:
# Run this in CHI@UC GPU node terminal, if applicable

# If local MLflow stack (from a different compose file) was running:
# sudo docker compose -f docker-compose-mlflow.yaml down

# If there's any specific standalone container like 'legalai_env':
# sudo docker stop legalai_env
# sudo docker rm legalai_env
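
To see which running containers might conflict before stopping anything, a minimal sketch along these lines may help. The helper name `containers_using_port` is hypothetical, and port 8265 is only an example (it is the Ray Dashboard port used later in this section); any other ports published by your compose file would need to be checked separately.

```shell
# Hypothetical helper: reads "name ports" lines (as printed by
# `docker ps --format '{{.Names}} {{.Ports}}'`) on stdin and prints the
# names of containers that publish the given host port.
containers_using_port() {
  port="$1"
  # Published ports look like "0.0.0.0:8265->8265/tcp"; match ":<port>->".
  awk -v pat=":${port}->" 'index($0, pat) > 0 { print $1 }'
}

# Usage on the GPU node (8265 is an example port; adjust as needed):
#   sudo docker ps --format '{{.Names}} {{.Ports}}' | containers_using_port 8265
```

An empty result suggests no running container currently publishes that port.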

### Step 3.3: Start the Ray Cluster

First, ensure any previous instances of this Ray cluster are completely removed, including their volumes, to prevent state conflicts. Then, start the Ray cluster services (Ray head, Ray worker, internal MinIO, Grafana) in detached mode (`-d`) using your `docker-compose-ray-cuda.yaml` file.

In [None]:
# Run this in CHI@UC GPU node terminal (ensure you are in ~/ml-ops-project/code/model-training/docker_training/)

# Clean up previous Ray cluster attempt (removes containers, networks, AND volumes)
sudo docker compose -f docker-compose-ray-cuda.yaml down -v

# Start the Ray cluster
sudo docker compose -f docker-compose-ray-cuda.yaml up -d

### Step 3.4: Verify Ray Cluster Status

Wait about 60 seconds for all services and healthchecks to initialize. Then, check if the Ray cluster components are running correctly.

In [None]:
# Run this in your CHI@UC GPU node terminal
sudo docker ps

The output should list `ray_head_legalai` (ideally with `(healthy)` in its status if the healthcheck is passing), `ray_worker_1_legalai`, `ray_minio_for_internal_use`, and `ray_grafana_legalai` as `Up`.

To check containers that might have exited (e.g., `ray_minio_internal_create_bucket` should exit with code 0 after success):
`sudo docker ps -a`

If `ray_head_legalai` or `ray_worker_1_legalai` is not `Up` or keeps restarting, check its logs:
`sudo docker compose -f docker-compose-ray-cuda.yaml logs ray-head`
`sudo docker compose -f docker-compose-ray-cuda.yaml logs ray-worker-1`

Finally, try opening the Ray Dashboard in your web browser at `http://<CHI_UC_NODE_FLOATING_IP>:8265` (e.g., `http://192.5.87.28:8265`).
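
Instead of waiting a fixed 60 seconds, the healthcheck status can be polled. The following is a minimal sketch, assuming the `ray_head_legalai` container defines a healthcheck in the compose file; the helper name `wait_for_healthy` and its default attempt count and delay are illustrative choices, not part of the project.

```shell
# Hypothetical helper: poll a container's healthcheck status until it reports
# "healthy" or the attempts run out. If the container has no healthcheck,
# `docker inspect` cannot resolve .State.Health and the status stays unknown.
wait_for_healthy() {
  name="$1"
  attempts="${2:-12}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    status="$(sudo docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null || true)"
    if [ "$status" = "healthy" ]; then
      echo "$name is healthy"
      return 0
    fi
    echo "attempt $i/$attempts: $name status is '${status:-unknown}'"
    sleep "$delay"
    i=$((i + 1))
  done
  echo "$name did not become healthy" >&2
  return 1
}

# Usage: wait up to 12 x 5 = 60 seconds for the Ray head container.
#   wait_for_healthy ray_head_legalai 12 5
```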

### Step 3.5: Set Environment Variables for Ray Job Submission

These environment variables must be set in the terminal session on the CHI@UC GPU node from which the Ray job is submitted.

In [None]:
# Run these commands in CHI@UC GPU node terminal
# Ensure you are in a directory where app-cred-legalai-model-access-openrc.sh is accessible or provide its full path.
# cd ~/ml-ops-project # Or similar

echo "Setting up KVM@TACC MLflow and MinIO environment variables..."
# Replace the ports (5000 for MLflow and 9000 for the MinIO API in this example) and the KVM MinIO credentials with your actual values.
export MLFLOW_TRACKING_URI="http://129.114.27.166:5000"
export MLFLOW_S3_ENDPOINT_URL="http://129.114.27.166:9000"
export AWS_ACCESS_KEY_ID="YOUR_KVM_MINIO_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_KVM_MINIO_SECRET_KEY"

echo "Sourcing CHI@UC Swift credentials..."
source ~/ml-ops-project/app-cred-legalai-model-access-openrc.sh

echo "Environment variables set. Verify OS_AUTH_URL is populated: $OS_AUTH_URL"
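
Because the submission command in Step 3.6 splices these variables into a JSON string, an unset variable silently becomes an empty value. A small check such as the sketch below can catch that before submitting. The helper name `require_env_vars` is hypothetical, and it only detects unset or empty variables, not placeholder values such as `YOUR_KVM_MINIO_ACCESS_KEY`.

```shell
# Hypothetical helper: report any of the named environment variables that are
# unset or empty, and return non-zero if at least one is missing.
# Uses printenv, so only exported variables count as set.
require_env_vars() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name" || true)" ]; then
      echo "Missing or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage before `ray job submit` (names taken from the exports above and from
# the submission command in Step 3.6):
#   require_env_vars MLFLOW_TRACKING_URI MLFLOW_S3_ENDPOINT_URL \
#     AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
#     OS_AUTH_URL OS_APPLICATION_CREDENTIAL_ID OS_APPLICATION_CREDENTIAL_SECRET \
#     && echo "All required variables are set."
```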

### Step 3.6: Submit the Training Job to Ray Cluster

Now, submit the Python training script (`legal_bert_triplet_finetune_a100.py`) to the running Ray cluster. 

In [None]:
ray job submit --address http://127.0.0.1:8265 \
 --working-dir /home/jovyan/work/code/model-training/training_script/ \
 --runtime-env-json '{
    "pip": "requirements.txt",
    "env_vars": {
        "MLFLOW_TRACKING_URI": "'"${MLFLOW_TRACKING_URI}"'",
        "MLFLOW_S3_ENDPOINT_URL": "'"${MLFLOW_S3_ENDPOINT_URL}"'",
        "AWS_ACCESS_KEY_ID": "'"${AWS_ACCESS_KEY_ID}"'",
        "AWS_SECRET_ACCESS_KEY": "'"${AWS_SECRET_ACCESS_KEY}"'",
        "OS_AUTH_URL": "'"${OS_AUTH_URL}"'",
        "OS_IDENTITY_API_VERSION": "'"${OS_IDENTITY_API_VERSION}"'",
        "OS_PROJECT_ID": "'"${OS_PROJECT_ID}"'",
        "OS_PROJECT_NAME": "'"${OS_PROJECT_NAME}"'",
        "OS_USER_DOMAIN_NAME": "'"${OS_USER_DOMAIN_NAME}"'",
        "OS_PROJECT_DOMAIN_ID": "'"${OS_PROJECT_DOMAIN_ID}"'",
        "OS_APPLICATION_CREDENTIAL_ID": "'"${OS_APPLICATION_CREDENTIAL_ID}"'",
        "OS_APPLICATION_CREDENTIAL_NAME": "'"${OS_APPLICATION_CREDENTIAL_NAME}"'",
        "OS_APPLICATION_CREDENTIAL_SECRET": "'"${OS_APPLICATION_CREDENTIAL_SECRET}"'",
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"
    }
 }' \
 -- python3 legal_bert_triplet_finetune_a100.py \
    --data_path "/home/jovyan/work/code/model-training/training_data/legal_data.jsonl" \
    --model_name_or_path "swift://object-store-persist-group36/model/Legal-BERT/" \
    --local_model_temp_dir "/home/jovyan/work/temp_swift_downloads_ray_job" \
    --output_dir "/home/jovyan/work/sbert_output_ray_job" \
    --num_epochs 1 \
    --batch_size 4 \
    --mlflow_experiment_name "LegalAI-RayJob-KVM-MLflow" \
    --mlflow_run_name "rayjob-kvm-$(date +%Y%m%d-%H%M%S)-bs4-final" \
    --dev_split_ratio 0.2 \
    --evaluation_steps 50 \
    --evaluate_base_model \
    --random_seed 42 \
    --upload_model_to_swift \
    --swift_container_name "object-store-persist-group36" \
    --swift_upload_prefix "models/my_finetuned_legal_bert_rayjob_kvm/run_$(date +%Y%m%d-%H%M%S)_bs4-final"
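
Note that the command above calls `$(date ...)` twice, so the MLflow run name and the Swift upload prefix can end up with timestamps that differ by a second. One possible way to keep them identical is to capture the timestamp once, as in this sketch; the variable names `RUN_TS`, `RUN_NAME`, and `SWIFT_PREFIX` are illustrative and not used elsewhere in the project.

```shell
# Capture the timestamp once so both names refer to the same run.
RUN_TS="$(date +%Y%m%d-%H%M%S)"
RUN_NAME="rayjob-kvm-${RUN_TS}-bs4-final"
SWIFT_PREFIX="models/my_finetuned_legal_bert_rayjob_kvm/run_${RUN_TS}_bs4-final"

echo "MLflow run name:     ${RUN_NAME}"
echo "Swift upload prefix: ${SWIFT_PREFIX}"

# Then pass them to the script in place of the inline $(date ...) calls:
#   --mlflow_run_name "${RUN_NAME}" \
#   --swift_upload_prefix "${SWIFT_PREFIX}"
```

With matching timestamps, the run shown in MLflow can be traced directly to its upload location in Swift.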

### Step 3.7: Monitor the Ray Job

1.  **Terminal Output:** The `ray job submit` command will stream logs from the job to your terminal. Watch for progress and any error messages.
2.  **Ray Dashboard:** Open `http://<CHI_UC_NODE_FLOATING_IP>:8265` (e.g., `http://192.5.87.28:8265`) in a web browser. Navigate to the "Jobs" section to see the status, logs, and resource usage of your submitted job.
3.  **MLflow UI (KVM@TACC):** Open `http://129.114.27.166:<KVM_MLFLOW_PORT>` in a browser (port `5000` if you kept the `MLFLOW_TRACKING_URI` from Step 3.5). Look for the experiment `LegalAI-RayJob-KVM-MLflow` and the run name you used. The run should show the logged parameters, metrics, and artifacts (such as the model's Swift URI, if the upload succeeded).
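
If the terminal that submitted the job is closed, the job can still be inspected from another shell with the Ray Jobs CLI. The sketch below wraps `ray job status` and `ray job logs`; the helper name `show_ray_job` is hypothetical, and the default address is assumed to be the one used in Step 3.6.

```shell
# Hypothetical helper: print the status and the logs captured so far for a
# submitted job. The second argument is optional and defaults to the
# dashboard address used for submission in Step 3.6.
show_ray_job() {
  job_id="$1"
  address="${2:-http://127.0.0.1:8265}"
  ray job status --address "$address" "$job_id"
  ray job logs --address "$address" "$job_id"
}

# Usage (replace the placeholder with the submission ID printed by
# `ray job submit`):
#   show_ray_job <SUBMISSION_ID>
# To list all jobs known to the cluster:
#   ray job list --address http://127.0.0.1:8265
```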

### Step 3.8: Clean Up Ray Cluster

Once the training job is complete and the results are verified, stop and remove the Ray cluster and its associated Docker containers and volumes.

In [None]:
# Run this in your CHI@UC GPU node terminal (in ~/ml-ops-project/code/model-training/docker_training/)
sudo docker compose -f docker-compose-ray-cuda.yaml down -v