# OpenTinker-Miles: Build, Deploy & Test Guide

This notebook walks through the complete workflow for building, deploying, and testing the OpenTinker-Miles training API.

## Contents
1. [Build Docker Image](#1-build-docker-image)
2. [Start Container](#2-start-container)
3. [Health Check & Service Status](#3-health-check--service-status)
4. [Model Creation Test](#4-model-creation-test)
5. [View Logs & Ray Status](#5-view-logs--ray-status)
6. [Cleanup](#6-cleanup)

## Configuration

In [None]:
import subprocess
import json
import time
import requests
from IPython.display import display, Markdown, HTML

# Configuration
IMAGE_NAME = "us-west1-docker.pkg.dev/devv-404803/gmi-test-repo/opentinker-miles:latest"
CONTAINER_NAME = "opentinker-miles-test"
DATA_MOUNT = "/mnt/slime-data-tinker"  # Host path with pre-downloaded models/datasets
SHM_SIZE = "16g"

# Port mappings (host:container)
API_PORT = 8001        # Training API
RAY_DASHBOARD_PORT = 8266  # Ray Dashboard
RAY_CLIENT_PORT = 10002    # Ray Client

# API Configuration
GMI_BASE_URL = f"http://localhost:{API_PORT}"
GMI_API_KEY = "slime-dev-key"
MODEL_PATH = "/data/models/Qwen2.5-0.5B-Instruct_torch_dist"

print(f"Image: {IMAGE_NAME}")
print(f"Container: {CONTAINER_NAME}")
print(f"API URL: {GMI_BASE_URL}")
print(f"Data Mount: {DATA_MOUNT}")

## Helper Functions

In [None]:
def run_cmd(cmd, check=True, capture=True):
    """Run a shell command and return output"""
    result = subprocess.run(cmd, shell=True, capture_output=capture, text=True)
    if capture:
        if result.returncode != 0 and check:
            print(f"Error: {result.stderr}")
        return result.stdout.strip(), result.stderr.strip(), result.returncode
    return None, None, result.returncode

def docker_exec(cmd):
    """Execute command inside the container"""
    return run_cmd(f"docker exec {CONTAINER_NAME} {cmd}")

def wait_for_api(timeout=60):
    """Wait for API to be ready"""
    start = time.time()
    while time.time() - start < timeout:
        try:
            resp = requests.get(f"{GMI_BASE_URL}/health", timeout=5)
            if resp.status_code == 200:
                return True
        except requests.RequestException:
            pass  # API not reachable yet; retry after a short sleep
        time.sleep(2)
    return False

def poll_future(request_id, timeout=180):
    """Poll for async operation completion"""
    headers = {"X-API-Key": GMI_API_KEY, "Content-Type": "application/json"}
    start = time.time()
    
    while time.time() - start < timeout:
        resp = requests.post(
            f"{GMI_BASE_URL}/api/v1/retrieve_future",
            json={"request_id": request_id},
            headers=headers,
            timeout=30
        )
        
        if resp.status_code == 200:
            return resp.json()
        elif resp.status_code == 408:
            time.sleep(2)
            continue
        else:
            raise RuntimeError(f"Error {resp.status_code}: {resp.text}")
    
    raise TimeoutError(f"Operation timed out after {timeout}s")

print("✓ Helper functions defined")

---
## 1. Build Docker Image

Build the OpenTinker-Miles Docker image using the provided build script.

In [None]:
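# Check if we're in the right directory (run this notebook from the project root)
import os
# __file__ is not defined in a notebook, so use the working directory,
# which is also what the relative docker/ paths below resolve against
project_dir = os.getcwd()
print(f"Project directory: {project_dir}")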

# List docker directory contents
stdout, stderr, rc = run_cmd("ls -la docker/")
print("\nDocker directory contents:")
print(stdout)

In [None]:
# Build the Docker image
print("Building Docker image...")
print("This may take a few minutes if layers are not cached.\n")

stdout, stderr, rc = run_cmd("./docker/build.sh", check=False)
print(stdout)
if stderr:
    print(f"\nStderr:\n{stderr}")

if rc == 0:
    print("\n✓ Docker image built successfully!")
else:
    print(f"\n✗ Build failed with exit code {rc}")

In [None]:
# Verify the image exists
stdout, _, _ = run_cmd(f"docker images | grep opentinker-miles | head -3")
print("Docker images:")
print(stdout)

---
## 2. Start Container

Start the container with:
- All GPUs (`--gpus all`)
- Shared memory for NCCL (`--shm-size=16g`)
- Data volume mount (`-v /mnt/slime-data-tinker:/data`)
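
The flags listed above can also be assembled as an argument list rather than a single shell string, which avoids quoting problems if a path contains spaces. The following is a minimal sketch; `build_docker_run` is a hypothetical helper and not part of this notebook, and the values passed in simply mirror the configuration cell.

```python
import shlex


def build_docker_run(image, name, data_mount, shm_size="16g", ports=None, env=None):
    """Assemble a `docker run` argv list (hypothetical helper for illustration)."""
    argv = [
        "docker", "run", "-d",
        "--name", name,
        "--gpus", "all",
        f"--shm-size={shm_size}",
        "-v", f"{data_mount}:/data",
    ]
    # ports maps host port -> container port
    for host_port, container_port in (ports or {}).items():
        argv += ["-p", f"{host_port}:{container_port}"]
    # env maps variable name -> value
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]
    argv.append(image)
    return argv


argv = build_docker_run(
    image="us-west1-docker.pkg.dev/devv-404803/gmi-test-repo/opentinker-miles:latest",
    name="opentinker-miles-test",
    data_mount="/mnt/slime-data-tinker",
    ports={8001: 8000, 8266: 8265, 10002: 10001},
    env={"LOG_LEVEL": "INFO"},
)
# shlex.join renders the list as a copy-pasteable shell command
print(shlex.join(argv))
```

The list form can be passed directly to `subprocess.run(argv)` without `shell=True`.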

In [None]:
# Stop and remove existing container if it exists
print("Cleaning up existing container...")
run_cmd(f"docker stop {CONTAINER_NAME} 2>/dev/null", check=False)
run_cmd(f"docker rm {CONTAINER_NAME} 2>/dev/null", check=False)
print("✓ Cleanup done")

In [None]:
# Start the container
docker_run_cmd = f"""
docker run -d \
  --name {CONTAINER_NAME} \
  --gpus all \
  --shm-size={SHM_SIZE} \
  -v {DATA_MOUNT}:/data \
  -p {API_PORT}:8000 \
  -p {RAY_DASHBOARD_PORT}:8265 \
  -p {RAY_CLIENT_PORT}:10001 \
  -e LOG_LEVEL=INFO \
  {IMAGE_NAME}
""".strip()

print("Starting container with command:")
print(docker_run_cmd)
print()

stdout, stderr, rc = run_cmd(docker_run_cmd)
if rc == 0:
    print(f"✓ Container started: {stdout[:12]}...")
else:
    print(f"✗ Failed to start container: {stderr}")

In [None]:
# Wait for API to be ready
print("Waiting for API to be ready...")
if wait_for_api(timeout=60):
    print("✓ API is ready!")
else:
    print("✗ API failed to start within timeout")
    print("\nChecking container logs:")
    # docker logs replays the container's stderr on stderr, so merge it into stdout
    stdout, _, _ = run_cmd(f"docker logs {CONTAINER_NAME} --tail 30 2>&1")
    print(stdout)

---
## 3. Health Check & Service Status

Verify the service is running correctly.
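
For automated runs, the same check can be reduced to a function that returns a list of problems instead of printing fields. This is a sketch under assumptions: `check_health` and its `expected_models` parameter are hypothetical, the field names are the ones read from `/health` in this notebook, and treating a falsy `ray_initialized` as a failure is an assumption about what the service reports.

```python
def check_health(health, expected_models=None):
    """Return a list of problems found in a /health payload (empty list means OK)."""
    problems = []
    if health.get("status") != "healthy":
        problems.append(f"status is {health.get('status')!r}, expected 'healthy'")
    # Assumption: a healthy service reports ray_initialized as a truthy value
    if not health.get("ray_initialized"):
        problems.append("Ray is not initialized")
    missing = set(expected_models or []) - set(health.get("model_ids", []))
    if missing:
        problems.append(f"missing model IDs: {sorted(missing)}")
    return problems


# Illustrative payloads; the values are made up for the example
sample = {"status": "healthy", "ray_initialized": True, "model_ids": []}
print(check_health(sample))
print(check_health({"status": "degraded", "ray_initialized": False}))
```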

In [None]:
# Health check
print("=" * 60)
print("HEALTH CHECK")
print("=" * 60)

try:
    resp = requests.get(f"{GMI_BASE_URL}/health", timeout=10)
    health = resp.json()
    
    print(f"Status:          {health.get('status')}")
    print(f"Version:         {health.get('version')}")
    print(f"Ray Initialized: {health.get('ray_initialized')}")
    print(f"Active Clients:  {health.get('active_training_clients')}")
    print(f"Model IDs:       {health.get('model_ids', [])}")
    print(f"Futures Count:   {health.get('futures_count')}")
    print("=" * 60)
    
    if health.get('status') == 'healthy':
        print("✓ Service is healthy!")
    else:
        print("⚠ Service may have issues")
        
except Exception as e:
    print(f"✗ Health check failed: {e}")

In [None]:
# Check container status
print("Container Status:")
stdout, _, _ = run_cmd(f"docker ps --filter name={CONTAINER_NAME} --format 'table {{{{.Names}}}}\t{{{{.Status}}}}\t{{{{.Ports}}}}'")
print(stdout)

---
## 4. Model Creation Test

Create a training model and verify it's ready.
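
Model creation is asynchronous: the request returns a `request_id`, and the result is fetched by polling until the operation finishes. The polling logic can be sketched independently of the network, which makes it easy to test. In the sketch below, `poll_until_done` and its `fetch`, `sleep`, and `clock` parameters are hypothetical names; the status-code convention (200 means finished, 408 means still pending) follows the `poll_future` helper in this notebook.

```python
import time


def poll_until_done(fetch, timeout=180.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it reports completion or the timeout elapses.

    fetch is a hypothetical callable returning (status_code, body):
    200 means finished, 408 means still pending, anything else is an error.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        status, body = fetch()
        if status == 200:
            return body
        if status != 408:
            raise RuntimeError(f"Error {status}: {body}")
        sleep(interval)  # still pending; wait before asking again
    raise TimeoutError(f"Operation timed out after {timeout}s")


# Fake backend for illustration: pending twice, then done
responses = iter([
    (408, None),
    (408, None),
    (200, {"model_id": "demo", "status": "ready"}),
])
result = poll_until_done(lambda: next(responses), sleep=lambda _: None)
print(result["model_id"])
```

Injecting `sleep` and `clock` keeps the example fast to run; in real use the defaults apply and `fetch` would wrap the `retrieve_future` request.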

In [None]:
# Create a model
print("=" * 60)
print("MODEL CREATION TEST")
print("=" * 60)
print(f"Base Model: {MODEL_PATH}")
print(f"LoRA: disabled (rank=0)")
print("=" * 60)

headers = {"X-API-Key": GMI_API_KEY, "Content-Type": "application/json"}

try:
    # Submit create model request
    print("\n[1/3] Submitting create model request...")
    resp = requests.post(
        f"{GMI_BASE_URL}/api/v1/create_model",
        json={
            "base_model": MODEL_PATH,
            "lora_config": {"rank": 0, "alpha": 0}
        },
        headers=headers,
        timeout=30
    )
    
    if resp.status_code != 200:
        print(f"✗ Failed: {resp.status_code} - {resp.text}")
    else:
        result = resp.json()
        request_id = result["request_id"]
        print(f"✓ Request submitted: {request_id}")
        
        # Poll for completion
        print("\n[2/3] Waiting for model creation (this may take 1-2 minutes)...")
        result = poll_future(request_id, timeout=180)
        
        model_id = result.get("model_id")
        print(f"✓ Model created: {model_id}")
        print(f"  Status: {result.get('status')}")
        
        # Verify via health check
        print("\n[3/3] Verifying model state...")
        health = requests.get(f"{GMI_BASE_URL}/health", timeout=10).json()
        print(f"✓ Active training clients: {health.get('active_training_clients')}")
        print(f"  Model IDs: {health.get('model_ids', [])}")
        
        print("\n" + "=" * 60)
        print("✓ MODEL CREATION TEST PASSED!")
        print("=" * 60)
        
except Exception as e:
    print(f"\n✗ Test failed: {e}")
    import traceback
    traceback.print_exc()

---
## 5. View Logs & Ray Status

Monitor service logs and Ray cluster status.

In [None]:
# View container logs (last 50 lines)
print("=" * 60)
print("CONTAINER LOGS (last 50 lines)")
print("=" * 60)

stdout, stderr, _ = run_cmd(f"docker logs {CONTAINER_NAME} --tail 50 2>&1")
print(stdout)

In [None]:
# View Ray cluster status
print("=" * 60)
print("RAY CLUSTER STATUS")
print("=" * 60)

stdout, _, _ = docker_exec("ray status")
print(stdout)

In [None]:
# List Ray actors
print("=" * 60)
print("RAY ACTORS")
print("=" * 60)

stdout, _, _ = docker_exec("ray list actors 2>/dev/null")
print(stdout)

In [None]:
# List only ALIVE actors
print("=" * 60)
print("ALIVE RAY ACTORS")
print("=" * 60)

stdout, _, _ = docker_exec("ray list actors --filter state=ALIVE 2>/dev/null")
print(stdout)

In [None]:
# Check GPU status
print("=" * 60)
print("GPU STATUS")
print("=" * 60)

stdout, _, _ = docker_exec("nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv")
print(stdout)
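
The CSV output of the `nvidia-smi` query above can be turned into structured records for further checks. This is a minimal sketch, assuming the usual header layout in which unit suffixes appear in the column names (for example `memory.used [MiB]`); `parse_gpu_csv`, `memory_fraction`, and the sample text are illustrative and not part of this notebook.

```python
import csv
import io


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv` output into a list of dicts keyed by header."""
    # skipinitialspace drops the space that follows each comma
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


def memory_fraction(row):
    """Fraction of GPU memory in use, assuming values such as '1024 MiB'."""
    used = float(row["memory.used [MiB]"].split()[0])
    total = float(row["memory.total [MiB]"].split()[0])
    return used / total


# Made-up sample in the assumed layout
sample = (
    "index, name, memory.used [MiB], memory.total [MiB], utilization.gpu [%]\n"
    "0, Example GPU, 20480 MiB, 81920 MiB, 35 %\n"
)
for row in parse_gpu_csv(sample):
    print(row["index"], row["name"], f"{memory_fraction(row):.0%}")
```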

In [None]:
# Check Ray placement groups
print("=" * 60)
print("RAY PLACEMENT GROUPS")
print("=" * 60)

stdout, _, _ = docker_exec("ray list placement-groups 2>/dev/null")
print(stdout)

### Log Viewing Commands

Use these commands to view logs in different ways:

In [None]:
# Print useful log commands
print("Useful commands for viewing logs:\n")
print(f"# View all logs")
print(f"docker logs {CONTAINER_NAME}")
print()
print(f"# View last N lines")
print(f"docker logs {CONTAINER_NAME} --tail 100")
print()
print(f"# Follow logs in real-time")
print(f"docker logs {CONTAINER_NAME} -f")
print()
print(f"# View logs with timestamps")
print(f"docker logs {CONTAINER_NAME} -t")
print()
print("# View logs from the last 10 minutes")
print(f"docker logs {CONTAINER_NAME} --since 10m")
print()
print(f"# Combined: last 50 lines with timestamps, follow")
print(f"docker logs {CONTAINER_NAME} --tail 50 -f -t")

---
## 6. Cleanup

Clean up models and optionally stop the container.

In [None]:
# Cleanup active models
print("=" * 60)
print("CLEANUP")
print("=" * 60)

headers = {"X-API-Key": GMI_API_KEY, "Content-Type": "application/json"}

# Get active models
try:
    health = requests.get(f"{GMI_BASE_URL}/health", timeout=10).json()
    model_ids = health.get("model_ids", [])
    
    if not model_ids:
        print("✓ No active models to cleanup")
    else:
        print(f"Found {len(model_ids)} active model(s): {model_ids}")
        
        for model_id in model_ids:
            print(f"\nDeleting model: {model_id}...")
            resp = requests.post(
                f"{GMI_BASE_URL}/api/v1/delete_model",
                json={"model_id": model_id},
                headers=headers,
                timeout=30
            )
            
            if resp.status_code == 200:
                result = resp.json()
                req_id = result.get("request_id")
                if req_id:
                    poll_future(req_id, timeout=60)
                print(f"✓ Deleted: {model_id}")
            else:
                print(f"⚠ Delete returned: {resp.status_code}")
        
        # Verify cleanup
        health = requests.get(f"{GMI_BASE_URL}/health", timeout=10).json()
        print(f"\nActive clients after cleanup: {health.get('active_training_clients')}")
        
except Exception as e:
    print(f"✗ Cleanup error: {e}")
else:
    # Only report success when no exception was raised during cleanup
    print("\n" + "=" * 60)
    print("✓ CLEANUP COMPLETE")
    print("=" * 60)

In [None]:
# Optional: Stop the container
# Uncomment the following lines to stop and remove the container

# print("Stopping container...")
# run_cmd(f"docker stop {CONTAINER_NAME}")
# print("✓ Container stopped")

# print("Removing container...")
# run_cmd(f"docker rm {CONTAINER_NAME}")
# print("✓ Container removed")

In [None]:
# Restart container (if needed)
# Uncomment to restart the container

# print("Restarting container...")
# run_cmd(f"docker restart {CONTAINER_NAME}")
# print("Waiting for API...")
# if wait_for_api(timeout=60):
#     print("✓ Container restarted and API is ready!")
# else:
#     print("✗ API failed to come up after restart")

---
## Quick Reference

### Docker Commands
```bash
# Start container
docker run -d --name opentinker-miles-test --gpus all --shm-size=16g \
  -v /mnt/slime-data-tinker:/data \
  -p 8001:8000 -p 8266:8265 -p 10002:10001 \
  us-west1-docker.pkg.dev/devv-404803/gmi-test-repo/opentinker-miles:latest

# View logs
docker logs opentinker-miles-test --tail 100

# Restart container
docker restart opentinker-miles-test

# Stop and remove
docker stop opentinker-miles-test && docker rm opentinker-miles-test
```

### Ray Commands (inside container)
```bash
# Check Ray status
docker exec opentinker-miles-test ray status

# List all actors
docker exec opentinker-miles-test ray list actors

# List alive actors only
docker exec opentinker-miles-test ray list actors --filter state=ALIVE

# List placement groups
docker exec opentinker-miles-test ray list placement-groups
```

### API Endpoints
- Health: `GET http://localhost:8001/health`
- Create Model: `POST http://localhost:8001/api/v1/create_model`
- Delete Model: `POST http://localhost:8001/api/v1/delete_model`
- Save Weights: `POST http://localhost:8001/api/v1/save_weights`
- Retrieve Future: `POST http://localhost:8001/api/v1/retrieve_future`
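
The endpoint list above can be captured in a small lookup table so that requests are built consistently. The sketch below uses the standard library's `urllib.request` rather than `requests` (a substitution made only to keep the example dependency-free); `ENDPOINTS` and `build_request` are hypothetical names, and the request is constructed but not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"

# Method and path for each endpoint listed in this section
ENDPOINTS = {
    "health": ("GET", "/health"),
    "create_model": ("POST", "/api/v1/create_model"),
    "delete_model": ("POST", "/api/v1/delete_model"),
    "save_weights": ("POST", "/api/v1/save_weights"),
    "retrieve_future": ("POST", "/api/v1/retrieve_future"),
}


def build_request(name, payload=None, api_key="slime-dev-key"):
    """Build (but do not send) a urllib Request for a named endpoint."""
    method, path = ENDPOINTS[name]
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"X-API-Key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


req = build_request("retrieve_future", {"request_id": "example-id"})
print(req.get_method(), req.full_url)
```

To actually send it, pass the object to `urllib.request.urlopen(req, timeout=10)` while the container is running.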

### URLs
- Training API: http://localhost:8001
- Ray Dashboard: http://localhost:8266
- API Docs: http://localhost:8001/docs