# OR-SIM — End-to-End: Kaggle Backend + Local Frontend via ngrok

**Environment**: Kaggle GPU T4 x2 or P100 (>=15 GB VRAM)

```
Your Machine                          Kaggle Notebook (GPU)
-------------------------------       -------------------------------------
Browser  http://localhost:5173        uvicorn :8000  (FastAPI + MedGemma)
   Vite dev-server                    MedASR ONNX    (CPU/GPU)
        |  HTTPS / WSS                     |
   ngrok public URL  <------------->  ngrok tunnel
```

## Steps at a glance
| # | Cell | Action |
|---|------|--------|
| 1 | GPU check | Verify T4/P100 is present |
| 2 | Clone repo | Pull OR-SIM from GitHub |
| 3 | llama-cpp | CUDA 12.4 prebuilt wheel (fast, no compile) |
| 4 | Python deps | FastAPI, ONNX, pydantic, sounddevice, etc. |
| 5 | MedGemma | Download GGUF from HuggingFace (~2.1 GB) |
| 6 | MedASR | Verify ONNX model files are in the repo |
| 7 | ngrok | Install pyngrok + set auth token |
| 8 | Server | Launch FastAPI backend (subprocess) |
| 9 | Tunnel | Open ngrok — **copy the HTTPS URL** |
| 10 | Frontend | Instructions for your local machine (markdown) |
| 11 | Sanity | Confirm tunnel reaches /api/health and /api/state |
| 12 | Keep-alive | Ping loop — prevents Kaggle idle timeout |

## Prerequisites
| Requirement | Where |
|---|---|
| Kaggle GPU notebook (T4 x2 or P100) | Notebook Settings > Accelerator |
| Internet access ON | Notebook Settings > Internet |
| ngrok auth token | https://dashboard.ngrok.com/get-started/your-authtoken |
| OR-SIM pushed to GitHub | Already done — all phases committed |


---
## Cell 1 — Verify GPU


In [None]:
import subprocess, sys

result = subprocess.run(
    ['nvidia-smi', '--query-gpu=name,memory.total,driver_version', '--format=csv,noheader'],
    capture_output=True, text=True,
)
if result.returncode == 0:
    print('GPU:', result.stdout.strip())
else:
    print('No GPU detected — go to Notebook Settings > Accelerator and enable GPU.')
    sys.exit('GPU required')
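
The check above only confirms that `nvidia-smi` runs. A minimal sketch of also enforcing the VRAM requirement follows, assuming each `csv,noheader` row has the form `name, <N> MiB, driver`. The helper names `parse_gpu_csv` and `has_enough_vram` and the 15,000 MiB threshold are illustrative choices, not part of OR-SIM.

```python
def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` rows into (name, MiB) pairs."""
    gpus = []
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(',')]
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        name, memory = fields[0], fields[1]
        # memory looks like '15360 MiB'; keep only the number
        mib = int(memory.split()[0])
        gpus.append((name, mib))
    return gpus


def has_enough_vram(gpus, min_mib=15_000):
    """True if at least one GPU meets the (assumed) minimum VRAM."""
    return any(mib >= min_mib for _, mib in gpus)


# Sample row in the assumed format; on Kaggle pass result.stdout instead
sample = 'Tesla T4, 15360 MiB, 560.35.03'
gpus = parse_gpu_csv(sample)
print(gpus, has_enough_vram(gpus))
```

In the notebook, `has_enough_vram(parse_gpu_csv(result.stdout))` could gate the rest of the run in the same way the return-code check does.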


---
## Cell 2 — Clone OR-SIM repository

> **`REPO_URL` is pre-filled with your GitHub repo.  No changes needed.**

In [None]:
import os

REPO_URL = 'https://github.com/Aditya-Lingam-9000/OR-SIM.git'
REPO_DIR = '/kaggle/working/OR-SIM'

if not os.path.isdir(os.path.join(REPO_DIR, '.git')):
    subprocess.run(['git', 'clone', '--depth', '1', REPO_URL, REPO_DIR], check=True)
    print(f'Cloned -> {REPO_DIR}')
else:
    subprocess.run(['git', '-C', REPO_DIR, 'pull', '--ff-only'], check=True)
    print(f'Updated existing clone at {REPO_DIR}')

# Add repo root to sys.path so backend.* packages are importable in all later cells
if REPO_DIR not in sys.path:
    sys.path.insert(0, REPO_DIR)

# Show the 5 most recent commits as a sanity check
subprocess.run(['git', '-C', REPO_DIR, 'log', '--oneline', '-5'], check=True)


---
## Cell 3 — Install llama-cpp-python (CUDA 12.4 prebuilt wheel — no compilation)

Kaggle T4 instances ship CUDA 12.4.  
We use the prebuilt wheel from `abetlen.github.io`, the same approach as in Phase 3.
Installation takes ~30 s, versus ~5 minutes when building from source.

In [None]:
subprocess.run(
    [
        sys.executable, '-m', 'pip', 'install',
        'llama-cpp-python',
        '--extra-index-url', 'https://abetlen.github.io/llama-cpp-python/whl/cu124',
        '--quiet',
    ],
    check=True,
)

# Verify GPU offload was compiled in — MUST be True for decent latency
import llama_cpp
lib    = getattr(llama_cpp, 'llama_cpp', None)
gpu_ok = lib.llama_supports_gpu_offload() if lib else False
print(f'llama_supports_gpu_offload() = {gpu_ok}')
if not gpu_ok:
    raise RuntimeError(
        'GPU offload not compiled in.  '
        'Check that nvidia-smi reports CUDA >= 12.4, which the cu124 wheel requires.'
    )

from llama_cpp import Llama
print('llama-cpp-python imported successfully')
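
The wheel index URL used above encodes the CUDA version as a `cuXYZ` tag. The sketch below shows how the same install arguments could be derived for another CUDA version. The function names are hypothetical, and which tags are actually published should be checked against the llama-cpp-python README before relying on one.

```python
def cuda_wheel_index(cuda_version):
    """Map a CUDA version such as '12.4' to the matching wheel index URL."""
    major, minor = cuda_version.split('.')[:2]
    tag = f'cu{int(major)}{int(minor)}'
    return f'https://abetlen.github.io/llama-cpp-python/whl/{tag}'


def pip_install_args(cuda_version):
    """Arguments for `python -m pip ...` that install the CUDA wheel."""
    return [
        '-m', 'pip', 'install', 'llama-cpp-python',
        '--extra-index-url', cuda_wheel_index(cuda_version),
        '--quiet',
    ]


print(cuda_wheel_index('12.4'))
```

With these helpers, the install call would read `subprocess.run([sys.executable] + pip_install_args('12.4'), check=True)`.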


---
## Cell 4 — Install OR-SIM Python dependencies

There is no `requirements.txt` in the repo — we install each package explicitly.  
`llama-cpp-python` was installed in Cell 3 from the CUDA wheel and is **intentionally omitted**
here to prevent pip from replacing it with the CPU-only build from PyPI.

In [None]:
# PortAudio C library — required by sounddevice on headless Linux.
# sounddevice is imported at module level inside backend/asr, so it must
# be importable even though there is no real microphone on Kaggle.
subprocess.run(
    ['apt-get', 'install', '-y', '-q', 'libportaudio2'],
    check=False,  # non-fatal — already present on some Kaggle images
)

PACKAGES = [
    # Audio / ASR
    'numpy',
    'scipy',
    'sounddevice',        # LiveTranscriber import (headless OK — device open only on mic use)
    'onnxruntime',        # MedASR CTC model (CPU inference)
    # Backend server
    'loguru',             # structured logging
    'pydantic>=2.0',      # Pydantic v2 data models
    'fastapi',            # REST + WebSocket framework
    'uvicorn[standard]',  # ASGI server (includes httptools + websocket extras)
    'websockets',         # WebSocket transport layer
    'httpx',              # async HTTP client
    # Model download (used in Cell 5)
    'huggingface_hub',
]

subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet'] + PACKAGES, check=True)
print('All OR-SIM Python dependencies installed')
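
Because the install runs with `--quiet`, a short post-install check can make a missing distribution visible. This is a sketch using the standard-library `importlib.metadata`. The helper names are hypothetical, and the requirement parsing handles only simple specifiers like the ones in `PACKAGES`.

```python
import re
from importlib import metadata


def distribution_name(requirement):
    """Strip extras and version specifiers: 'uvicorn[standard]' -> 'uvicorn'."""
    return re.split(r'[\[<>=!~; ]', requirement, maxsplit=1)[0]


def find_missing(requirements, lookup=metadata.version):
    """Return the requirements whose distribution is not installed."""
    missing = []
    for req in requirements:
        try:
            lookup(distribution_name(req))
        except metadata.PackageNotFoundError:
            missing.append(req)
    return missing


print(distribution_name('pydantic>=2.0'), distribution_name('uvicorn[standard]'))
```

Calling `find_missing(PACKAGES + ['llama-cpp-python'])` after this cell should return an empty list. It does not tell you whether `llama-cpp-python` is still the CUDA build, so re-running the `llama_supports_gpu_offload()` check from Cell 3 remains the way to confirm that.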


---
## Cell 5 — Download MedGemma GGUF from HuggingFace

Model: `unsloth/medgemma-4b-it-GGUF`, file `medgemma-4b-it-Q3_K_M.gguf` (~2.1 GB)  
Downloaded directly into the cloned repo at the path the backend expects.  
**No dataset upload required** — identical to the Phase 3 approach.

In [None]:
from huggingface_hub import hf_hub_download
from pathlib import Path

GGUF_DIR = Path(REPO_DIR) / 'models' / 'medgemma'
GGUF_DIR.mkdir(parents=True, exist_ok=True)

EXPECTED_GGUF = GGUF_DIR / 'medgemma-4b-it-Q3_K_M.gguf'

# Skip download if already present (e.g. re-running the notebook)
if EXPECTED_GGUF.exists():
    MODEL_PATH = EXPECTED_GGUF
    print(f'GGUF already present ({MODEL_PATH.stat().st_size / 1e9:.2f} GB) — skipping download.')
else:
    print('Downloading medgemma-4b-it-Q3_K_M.gguf from HuggingFace...')
    print('(~2.1 GB — takes 2-5 minutes on Kaggle)')
    MODEL_PATH = Path(hf_hub_download(
        repo_id   = 'unsloth/medgemma-4b-it-GGUF',
        filename  = 'medgemma-4b-it-Q3_K_M.gguf',
        local_dir = str(GGUF_DIR),
    ))

print(f'Model path : {MODEL_PATH}')
print(f'File size  : {MODEL_PATH.stat().st_size / 1e9:.2f} GB')
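
A printed file size does not show whether the file is really a GGUF model. GGUF files begin with the four ASCII bytes `GGUF`, so a cheap header check can catch an empty file or a saved error page. The sketch below uses a hypothetical helper name and does not prove that the download is complete.

```python
from pathlib import Path

GGUF_MAGIC = b'GGUF'  # first four bytes of a GGUF file


def looks_like_gguf(path):
    """Cheap sanity check: the file exists and starts with the GGUF magic."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open('rb') as fh:
        return fh.read(4) == GGUF_MAGIC
```

In this notebook the call would be `looks_like_gguf(MODEL_PATH)`. A `False` result suggests deleting the file and re-running the cell.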


---
## Cell 6 — Verify MedASR ONNX model files

`models/medasr/model.int8.onnx` and `tokens.txt` are committed to the repo.  
This cell confirms they were cloned successfully.

In [None]:
asr_dir = Path(REPO_DIR) / 'models' / 'medasr'
all_ok  = True

for fname in ('model.int8.onnx', 'tokens.txt'):
    fpath = asr_dir / fname
    if fpath.exists():
        print(f'  OK  {fname:<30}  ({fpath.stat().st_size / 1e6:.1f} MB)')
    else:
        print(f'MISSING: {fpath}')
        all_ok = False

if not all_ok:
    raise FileNotFoundError(
        'MedASR model files missing from cloned repo.\n'
        'Run locally: git add models/medasr/ && git push'
    )
print('MedASR model files verified')


---
## Cell 7 — Install pyngrok and configure your auth token

1. Sign up (free) at https://ngrok.com  
2. Copy your token from https://dashboard.ngrok.com/get-started/your-authtoken  
3. Either paste it directly OR store it as a Kaggle Secret (recommended):
   - *Notebook Settings > Add-ons > Secrets > Add*, key: `NGROK_TOKEN`
   - Uncomment the `UserSecretsClient` lines and remove the placeholder string

In [None]:
subprocess.run(
    [sys.executable, '-m', 'pip', 'install', 'pyngrok', '--quiet'],
    check=True,
)
from pyngrok import ngrok
print('pyngrok installed')

# ---- CONFIGURE ----
# Option A — paste token directly (do NOT commit a notebook with a real token)
NGROK_AUTH_TOKEN = 'YOUR_NGROK_AUTH_TOKEN'

# Option B — Kaggle Secret (recommended for shared notebooks)
# from kaggle_secrets import UserSecretsClient
# NGROK_AUTH_TOKEN = UserSecretsClient().get_secret('NGROK_TOKEN')
# -------------------

if NGROK_AUTH_TOKEN == 'YOUR_NGROK_AUTH_TOKEN':
    raise ValueError(
        'Paste your real ngrok token above or use a Kaggle Secret.'
        '  Get it from https://dashboard.ngrok.com/get-started/your-authtoken'
    )

ngrok.set_auth_token(NGROK_AUTH_TOKEN)
print('ngrok auth token configured')


---
## Cell 8 — Start the FastAPI backend server

The server is launched as a **subprocess** (`python -m backend.server`).
Using a subprocess (not a thread) avoids asyncio event-loop conflicts between uvicorn
and Jupyter's own event loop — the most common cause of silent startup failures on Kaggle.

MedGemma loads into GPU VRAM during startup — allow up to **60 seconds**.

In [None]:
import socket, time, urllib.request, json as _json

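# Stop any server left from a previous run of this cell
try:
    if _server_proc.poll() is None:
        _server_proc.terminate()
        try:
            _server_proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            _server_proc.kill()   # force-stop if SIGTERM was not honoured in time
            _server_proc.wait()
        print('Stopped previous server instance.')
except NameError:
    pass  # first run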

env = os.environ.copy()
env['PYTHONPATH'] = REPO_DIR

# python -m backend.server  is the canonical entry-point (same as local dev)
_server_proc = subprocess.Popen(
    [sys.executable, '-m', 'backend.server', '--host', '0.0.0.0', '--port', '8000'],
    cwd=REPO_DIR,
    env=env,
    # stdout/stderr flow to the Kaggle cell output so startup logs are visible
)
print(f'Server PID: {_server_proc.pid}')
print('Waiting for FastAPI to bind port 8000', end='', flush=True)

deadline   = time.time() + 60
connected  = False
while time.time() < deadline:
    if _server_proc.poll() is not None:
        raise RuntimeError(
            f'Server process exited early (exit code {_server_proc.returncode}).\n'
            'Scroll up for the traceback.'
        )
    try:
        with socket.create_connection(('127.0.0.1', 8000), timeout=1):
            connected = True
            break
    except OSError:
        print('.', end='', flush=True)
        time.sleep(1)

print()
if not connected:
    raise TimeoutError('Server did not bind within 60 s — check the logs above.')

# Quick health-check via plain urllib (no extra libraries needed)
try:
    with urllib.request.urlopen('http://127.0.0.1:8000/api/health', timeout=5) as r:
        health = _json.loads(r.read())
    print(f'Server healthy: {health}')
except Exception as e:
    print(f'Health check failed: {e}. The server may still be loading MedGemma; '
          'wait a few seconds and retry. Note that re-running this cell restarts the server.')
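
The health check in this cell runs only once, and the port can be bound before the `/api/health` route answers. The sketch below polls the endpoint with retries instead of relying on a single attempt. The name `wait_for_health`, its defaults, and the injectable `opener` parameter (there to make the function testable) are assumptions, not OR-SIM code.

```python
import json
import time
import urllib.request


def wait_for_health(url, attempts=10, delay=3.0, opener=urllib.request.urlopen):
    """Poll `url` until it returns JSON, or raise after `attempts` tries."""
    last_error = None
    for _ in range(attempts):
        try:
            with opener(url, timeout=5) as resp:
                return json.loads(resp.read())
        except (OSError, ValueError) as exc:
            # OSError covers URLError/HTTPError and refused connections;
            # ValueError covers a body that is not valid JSON yet.
            last_error = exc
            time.sleep(delay)
    raise TimeoutError(f'{url} not healthy after {attempts} attempts: {last_error}')
```

A call such as `wait_for_health('http://127.0.0.1:8000/api/health')` waits roughly `attempts * delay` seconds in the worst case, without restarting the server process.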


---
## Cell 9 — Open ngrok tunnel and print the public URLs

**Copy the HTTPS URL printed below — you will paste it into `frontend/.env.local`.**

In [None]:
# Close any stale tunnels from previous runs
for _t in ngrok.get_tunnels():
    ngrok.disconnect(_t.public_url)

tunnel      = ngrok.connect(8000, bind_tls=True)
NGROK_HTTPS = tunnel.public_url
NGROK_WSS   = NGROK_HTTPS.replace('https://', 'wss://')

print()
print('=' * 65)
print('  OR-SIM BACKEND IS LIVE')
print('=' * 65)
print(f'  HTTPS (REST API) : {NGROK_HTTPS}')
print(f'  WSS  (WebSocket) : {NGROK_WSS}/ws/state')
print('=' * 65)
print()
print('Copy the HTTPS URL, then follow Cell 10 on your local machine.')
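
The cell derives the WebSocket URL with a plain string replace. A slightly more defensive sketch using `urllib.parse` is shown below. The function name `to_ws_url` is hypothetical, and the `/ws/state` default is taken from the output above.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ws_url(http_url, path='/ws/state'):
    """Convert an http(s) base URL into the matching ws(s) URL for `path`."""
    parts = urlsplit(http_url.rstrip('/'))
    # Unknown schemes raise KeyError rather than producing a wrong URL
    scheme = {'https': 'wss', 'http': 'ws'}[parts.scheme]
    return urlunsplit((scheme, parts.netloc, path, '', ''))


print(to_ws_url('https://abcd-1234.ngrok-free.app/'))
```

This version tolerates a trailing slash and also maps a plain `http://` URL to `ws://`, which is useful when testing against `127.0.0.1:8000` without the tunnel.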


---
## Cell 10 — Connect the local frontend  *(run on YOUR machine, not Kaggle)*

### Step 1 — Create `frontend/.env.local`

```
# d:\OR-SIM\frontend\.env.local
VITE_BACKEND_URL=https://abcd-1234.ngrok-free.app
```

Replace `https://abcd-1234.ngrok-free.app` with the **HTTPS URL from Cell 9**. No trailing slash.

PowerShell one-liner (edit the URL first):
```powershell
'VITE_BACKEND_URL=https://abcd-1234.ngrok-free.app' | Out-File -Encoding utf8 d:\OR-SIM\frontend\.env.local
```
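
If you prefer to script this step, the following Python sketch writes the same file and enforces the "no trailing slash" rule. The function name `write_env_local` is hypothetical, and the frontend path is whatever your local checkout uses.

```python
from pathlib import Path


def write_env_local(frontend_dir, backend_url):
    """Write frontend/.env.local with a normalised VITE_BACKEND_URL."""
    url = backend_url.strip().rstrip('/')
    if not url.startswith('https://'):
        raise ValueError('expected the HTTPS URL printed by Cell 9')
    env_file = Path(frontend_dir) / '.env.local'
    env_file.write_text(f'VITE_BACKEND_URL={url}\n', encoding='utf-8')
    return env_file
```

For example, `write_env_local(r'd:\OR-SIM\frontend', 'https://abcd-1234.ngrok-free.app/')` produces a file whose only line is `VITE_BACKEND_URL=https://abcd-1234.ngrok-free.app`.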

### Step 2 — Start the Vite dev server

```powershell
cd d:\OR-SIM\frontend
npm run dev
```

Expected output:
```
  VITE v6.x.x  ready in ~641 ms
  -> Local:   http://localhost:5173/
```

### Step 3 — Open the simulator

1. Open **http://localhost:5173** in Chrome or Edge.
2. Pick a surgery (Heart Transplant / Liver Resection / Kidney PCNL).
3. Click **Start Session**.
4. Allow microphone access when the browser asks.
5. Speak surgical commands; machines glow ON/OFF in the 3D room in real time.

### Stopping

- Click **Stop Session** in the browser.
- Stop the Kaggle kernel (square button) to free GPU VRAM.
- The ngrok tunnel closes automatically with the kernel.

### Troubleshooting

| Symptom | Fix |
|---|---|
| WS status dot stays red | Check `.env.local` — no trailing slash, correct URL |
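| `ERR_NGROK_3200` | The tunnel is offline; re-run Cell 9 and put the new URL in `.env.local` |
| `ERR_NGROK_108` | Free tier allows 1 agent session; close other open ngrok sessions |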
| `ModuleNotFoundError: backend` | Re-run Cell 2 (sys.path), then Cell 8 |
| `sounddevice` ImportError | Re-run Cell 4 (installs libportaudio2) |
| Server exits immediately | Scroll up in Cell 8 for the Python traceback |
| GGUF not found | Check Cell 5 printed a valid file size |
| Kaggle kernel idle timeout | Re-run Cells 8 and 9 to restart server and tunnel |


---
## Cell 11 — Sanity check: confirm tunnel reaches the API


In [None]:
# ngrok adds a browser-warning page for unauthenticated requests.
# The header below skips it so we get the raw JSON response.
SKIP_WARN = {'ngrok-skip-browser-warning': '1'}

for path, label in [('/api/health', 'Health'), ('/api/state', 'State')]:
    url = f'{NGROK_HTTPS}{path}'
    try:
        req = urllib.request.Request(url, headers=SKIP_WARN)
        with urllib.request.urlopen(req, timeout=10) as r:
            data = _json.loads(r.read())
        preview = _json.dumps(data, indent=2)[:400]
        print(f'{label}:\n{preview}\n')
    except Exception as e:
        print(f'{label} check failed: {e}')


---
## Cell 12 — Keep-alive ping *(optional — run while using the simulator)*

Kaggle kills idle kernels after ~30 minutes.  Interrupt the cell (square button) to stop.

In [None]:
import datetime

PING_INTERVAL = 60     # seconds between pings
MAX_PINGS     = 720    # 720 x 60 s = 12 hours (full Kaggle session limit)

print(f'Keep-alive running — ping every {PING_INTERVAL}s.  Stop with the interrupt button.')
for _i in range(1, MAX_PINGS + 1):
    time.sleep(PING_INTERVAL)

    if _server_proc.poll() is not None:
        print(f'Server process died (code {_server_proc.returncode}).  Re-run Cell 8 + 9.')
        break

    try:
        _req = urllib.request.Request(f'{NGROK_HTTPS}/api/health', headers=SKIP_WARN)
        with urllib.request.urlopen(_req, timeout=5) as _r:
            _status = _r.status
    except Exception:
        _status = 'error'

    _ts = datetime.datetime.now(datetime.timezone.utc).strftime('%H:%M:%S UTC')
    print(f'  [{_ts}] ping #{_i:03d} -> HTTP {_status}')
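
The `MAX_PINGS` value follows from the session length divided by the ping interval. The sketch below recomputes it for other settings, using a hypothetical helper name.

```python
import math


def pings_for(hours, interval_s=60):
    """Number of pings needed to cover `hours` at one ping per interval."""
    return math.ceil(hours * 3600 / interval_s)


print(pings_for(12))       # 12 h at 60 s per ping -> 720
print(pings_for(12, 90))   # same window with a 90 s interval -> 480
```

Each loop iteration also spends up to 5 s on the request timeout, so the real coverage is slightly longer than `MAX_PINGS * PING_INTERVAL`.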
