# Voice Symptom Intake & Documentation Assistant - Google Colab Deployment

This notebook deploys the Voice Symptom Intake & Documentation Assistant on Google Colab with GPU support.

## ✅ Advantages of Colab Deployment:
- **Free GPU Access** (Tesla T4 with 16GB VRAM)
- **No Local Setup Issues** (no Windows file locking, FFmpeg, etc.)
- **Faster Inference** (MedASR & MedGemma both run on GPU)
- **Public URL Access** via ngrok

## ⚠️ Important Notes:
- Sessions last up to 12 hours (free tier)
- You'll need a **Hugging Face token** with "Read" access
- Accept model terms for `google/medasr` and `google/medgemma-1.5-4b-it`

## 🆕 Latest Updates:
- Enhanced JSON parsing for reliable documentation generation
- Robust error recovery with intelligent fallback strategies
- Optimized generation parameters for medical documentation

## Step 1: Check GPU Availability

In [None]:
!nvidia-smi
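
The same check can be done from Python, which is handy if you want a later cell to stop early on a CPU-only runtime. This is a minimal sketch: the helper name `gpu_summary` is made up for illustration, and it relies only on the `nvidia-smi` query flags `--query-gpu` and `--format`.

```python
import shutil
import subprocess


def gpu_summary():
    """Return nvidia-smi's "name, total memory" line, or None if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None  # driver tools are absent: almost certainly a CPU runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # nvidia-smi exists but could not talk to a GPU
    return result.stdout.strip() or None


summary = gpu_summary()
if summary is None:
    print("No GPU detected. Select a GPU runtime under Runtime > Change runtime type.")
else:
    print(f"GPU detected: {summary}")
```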

## Step 2: Install Dependencies

Install all required packages (this takes ~3-5 minutes)

In [None]:
%%capture
# Install transformers from specific commit for MedASR support
!pip install git+https://github.com/huggingface/transformers.git@65dc261512cbdb1ee72b88ae5b222f2605aad8e5

# Install other dependencies
!pip install fastapi uvicorn[standard] python-multipart
!pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install accelerate librosa soundfile noisereduce audioread
!pip install pydantic pydantic-settings python-dotenv
!pip install pyngrok nest-asyncio

print("✅ All dependencies installed successfully!")
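
Because `%%capture` hides everything the install cell prints, including pip errors, it can help to confirm afterwards that the packages are really present. The sketch below uses only the standard library; the `REQUIRED` list is copied from the install commands above, and the helper name `installed_versions` is an illustrative choice.

```python
from importlib import metadata

# Distribution names taken from the pip commands in this step
REQUIRED = [
    "transformers", "torch", "torchaudio", "fastapi", "uvicorn",
    "python-multipart", "accelerate", "librosa", "soundfile", "noisereduce",
    "pydantic-settings", "python-dotenv", "pyngrok", "nest-asyncio",
]


def installed_versions(names):
    """Split names into ({name: version} for installed ones, [missing names])."""
    found, missing = {}, []
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return found, missing


found, missing = installed_versions(REQUIRED)
for name, version in sorted(found.items()):
    print(f"{name}=={version}")
if missing:
    print("Missing packages:", ", ".join(missing))
```

If anything is listed as missing, rerun the install cell without `%%capture` to see the pip output.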

## Step 3: Upload Your Code

**Option A:** Upload project folder from your computer
- Click the folder icon on the left sidebar
- Upload the entire `voice-symptom-triage-assistant` folder

**Option B:** Clone from GitHub (if you've pushed your code)

In [None]:
# Option B: Clone from GitHub (uncomment and modify if using)
# !git clone https://github.com/YOUR_USERNAME/voice-symptom-triage-assistant.git
# %cd voice-symptom-triage-assistant

# Option A: If uploaded manually, navigate to the folder
%cd /content/voice-symptom-triage-assistant
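
A quick layout check catches a partial upload before the later steps fail with less obvious errors. This is a sketch under two assumptions: `app/models/medgemma_service.py` is the path that Step 5 edits, and `app/main.py` is inferred from the `app.main:app` target in Step 7. The helper name `missing_project_files` is hypothetical.

```python
from pathlib import Path

# Paths the later steps rely on (app/main.py is inferred, see the note above)
EXPECTED_FILES = ["app/main.py", "app/models/medgemma_service.py"]


def missing_project_files(root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist as files under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


missing = missing_project_files(".")
if missing:
    print("Missing files (was the whole folder uploaded?):", ", ".join(missing))
else:
    print("Project layout looks complete.")
```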

## Step 4: Configure Hugging Face Token & Environment

**Get your token from:** https://huggingface.co/settings/tokens

**Make sure you've accepted terms for:**
- https://huggingface.co/google/medasr
- https://huggingface.co/google/medgemma-1.5-4b-it
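
Most token problems are simple ones: the placeholder was never replaced, or whitespace was pasted along with the token. The sketch below checks for these before anything is written to `.env`. It is a heuristic only: the `hf_` prefix test assumes a current Hugging Face user access token, the helper name `token_problem` is made up, and passing the check does not prove the token is valid or that the model terms were accepted.

```python
import os

PLACEHOLDER = "hf_YOUR_TOKEN_HERE"  # the placeholder used in the cell below


def token_problem(token):
    """Return a short description of what looks wrong with the token, or None."""
    if not token:
        return "token is empty"
    if token == PLACEHOLDER:
        return "token is still the placeholder value"
    if token != token.strip():
        return "token has leading or trailing whitespace"
    if not token.startswith("hf_"):
        return "token does not start with 'hf_'"
    return None


# Reading from an environment variable avoids saving the token in the notebook
candidate = os.environ.get("HF_TOKEN", "")
problem = token_problem(candidate)
print("Token looks plausible." if problem is None else f"Check your token: {problem}")
```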

In [None]:
import os

# REPLACE WITH YOUR HUGGING FACE TOKEN
HF_TOKEN = "hf_YOUR_TOKEN_HERE"

# Create .env file with enhanced MedGemma parameters
with open('.env', 'w') as f:
    f.write(f"HF_TOKEN={HF_TOKEN}\n")
    f.write("MEDASR_MODEL=google/medasr\n")
    f.write("MEDGEMMA_MODEL=google/medgemma-1.5-4b-it\n")
    f.write("DEVICE=cuda\n")
    f.write("ENABLE_GPU=true\n")
    
    # MedGemma Generation Parameters (optimized for JSON output)
    f.write("MEDGEMMA_TEMPERATURE=0.1\n")
    f.write("MEDGEMMA_MAX_TOKENS=1024\n")
    f.write("MEDGEMMA_REPETITION_PENALTY=1.1\n")
    
    # Audio settings
    f.write("AUDIO_SAMPLE_RATE=16000\n")
    f.write("MAX_AUDIO_DURATION_SECONDS=300\n")
    
    # Model cache directory
    f.write("MODEL_CACHE_DIR=./models\n")
    f.write("LOG_LEVEL=INFO\n")

print("✅ Environment configured with enhanced MedGemma parameters!")
print("   - Temperature: 0.1 (deterministic JSON output)")
print("   - Max Tokens: 1024 (complete documentation)")
print("   - Repetition Penalty: 1.1 (prevent loops)")
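
To confirm what was actually written, the file can be read back and printed with the token masked. This is a small sketch of a `KEY=VALUE` reader, not a full `.env` parser (no quoting or variable expansion); `parse_env_text` and `redact` are illustrative names.

```python
from pathlib import Path


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def redact(key, value):
    """Mask secrets: keys ending in TOKEN (HF_TOKEN, but not ..._MAX_TOKENS)."""
    return "***" if key.upper().endswith("TOKEN") else value


env_path = Path(".env")
if env_path.exists():
    for key, value in parse_env_text(env_path.read_text()).items():
        print(f"{key}={redact(key, value)}")
else:
    print("No .env file in the current directory.")
```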

## Step 5: Enable GPU for MedGemma (Colab T4 has 16GB VRAM!)

Update MedGemma service to use GPU acceleration on Colab.

**Note:** Your local code is optimized for CPU due to limited VRAM, but Colab can handle GPU inference.

In [None]:
# Update medgemma_service.py to enable GPU on Colab
import re

# Read the current service file
with open('app/models/medgemma_service.py', 'r') as f:
    content = f.read()

# Replace the CPU-forced loading with GPU-capable loading
gpu_loading_code = '''        try:
            logger.info(f"Loading MedGemma model on device: {self.device}")
            
            # Colab T4 GPU has 16GB VRAM - can fit MedGemma with float16!
            dtype = torch.float16 if torch.cuda.is_available() else torch.float32
            
            self.tokenizer = AutoTokenizer.from_pretrained(
                settings.medgemma_model,
                token=settings.hf_token if settings.hf_token else None
            )
            
            self.model = AutoModelForCausalLM.from_pretrained(
                settings.medgemma_model,
                torch_dtype=dtype,
                device_map="auto" if settings.enable_gpu else None,
                token=settings.hf_token if settings.hf_token else None,
                low_cpu_mem_usage=True
            )
            
            if not settings.enable_gpu:
                self.model = self.model.to("cpu")
                self.device = "cpu"
            
            self.model.eval()
            
            logger.info(f"MedGemma model loaded successfully on {self.device}")
'''

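# Find and replace the model-loading section of _load_model.
# Group 1 keeps the method header and the line after it; the rest of the match
# (from "try:" through the original success log line) is replaced by gpu_loading_code,
# which already contains its own "try:" and success log.
pattern = r'(\s+def _load_model\(self\):[^\n]*\n[^\n]+\n)\s+try:.*?logger\.info\("MedGemma model loaded successfully"\)[^\n]*\n'

# A function replacement keeps re from interpreting backslashes in gpu_loading_code
content_updated, n_replaced = re.subn(
    pattern, lambda m: m.group(1) + gpu_loading_code, content, count=1, flags=re.DOTALL
)

if n_replaced == 1:
    # Write back
    with open('app/models/medgemma_service.py', 'w') as f:
        f.write(content_updated)
    print("✅ MedGemma service updated for GPU acceleration!")
    print("   Model will use float16 precision on T4 GPU")
    print("   Expected speedup: 5-10x faster than CPU")
else:
    print("⚠️ Pattern not found: medgemma_service.py was left unchanged.")
    print("   The file may already be patched, or _load_model no longer matches the pattern.")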
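
A regex patch that finds no match leaves the file unchanged without raising, so it helps to rehearse the substitution on a small sample first. The sketch below does that with `re.subn`, which also returns the number of replacements, and then uses `ast.parse` to confirm the patched text is still valid Python. `SAMPLE`, `NEW_BODY`, `load_on_cpu` and `load_on_gpu` are made-up stand-ins, not the contents of `medgemma_service.py`.

```python
import ast
import re

# Made-up stand-in for a service file with a CPU-only loader
SAMPLE = '''class Service:
    def _load_model(self):
        """Load the model."""
        try:
            self.model = load_on_cpu()
            logger.info("model loaded")
        except Exception:
            raise
'''

# Replacement block: starts at "try:" and ends with the success log line
NEW_BODY = '''        try:
            self.model = load_on_gpu()
            logger.info("model loaded")
'''

PATTERN = re.compile(
    # Group 1: method header plus an optional one-line docstring (kept as-is)
    r'(def _load_model\(self\):[^\n]*\n(?:[ \t]*"""[^\n]*\n)?)'
    # Replaced span: from "try:" through the original success log line
    r'[ \t]*try:.*?logger\.info\("model loaded"\)[^\n]*\n',
    re.DOTALL,
)


def patch_source(source):
    """Return (patched_source, number_of_replacements)."""
    return PATTERN.subn(lambda m: m.group(1) + NEW_BODY, source, count=1)


patched, count = patch_source(SAMPLE)
ast.parse(patched)  # raises SyntaxError if the patch broke the file
print(f"replacements: {count}")
print(patched)
```

Passing a function as the replacement avoids having `re` interpret backslashes in the inserted code, and checking the count before writing the file turns a silent no-op into a visible one.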

## Step 6: Set Up ngrok Tunnel

**Get your ngrok authtoken from:** https://dashboard.ngrok.com/get-started/your-authtoken

In [None]:
from pyngrok import ngrok
import nest_asyncio

# REPLACE WITH YOUR NGROK AUTHTOKEN
NGROK_AUTH_TOKEN = "YOUR_NGROK_TOKEN_HERE"

# Set up ngrok
ngrok.set_auth_token(NGROK_AUTH_TOKEN)
nest_asyncio.apply()

print("✅ ngrok configured!")

## Step 7: Start the Server

This will:
1. Load MedASR model on GPU (~30 seconds)
2. Load MedGemma model on GPU (~2-3 minutes first time)
3. Start the FastAPI server
4. Create a public ngrok URL

**Note:** With the enhanced JSON parsing, documentation generation is now much more reliable!

In [None]:
# Start ngrok tunnel; connect() returns an NgrokTunnel object, so read its public_url
tunnel = ngrok.connect(8000)
public_url = tunnel.public_url
print(f"\n{'='*60}")
print(f"🌐 PUBLIC URL: {public_url}")
print(f"{'='*60}\n")
print("Open this URL in your browser once uvicorn logs 'Application startup complete'.\n")
print("🆕 Features in this deployment:")
print("   ✓ Enhanced JSON parsing for reliable documentation")
print("   ✓ Intelligent fallback if JSON parsing fails")
print("   ✓ Optimized generation parameters")
print("   ✓ GPU acceleration for faster results\n")

# Start uvicorn server
!python -m uvicorn app.main:app --host 0.0.0.0 --port 8000
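
The cell above blocks until the server stops, so nothing else in the notebook can tell you when the app is ready. One alternative, sketched here, is to start uvicorn in the background and poll the port until it accepts connections. `wait_for_port` is a hypothetical helper built on the standard library, and the approach assumes the app loads its models during startup, so that the port only opens once they are ready.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet: wait and retry
    return False


# Hypothetical usage in a notebook cell (replaces the blocking "!python -m uvicorn" line):
# import subprocess, sys
# server = subprocess.Popen(
#     [sys.executable, "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
# )
# print("Server ready" if wait_for_port("127.0.0.1", 8000) else "Server did not start in time")
# ...
# server.terminate()  # stop the server when finished
```

With this variant the server's log output no longer streams into the cell, so keep the blocking form if you want to watch the logs live.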

## Step 8: Access Your Application

1. **Copy the public ngrok URL** from the output above
2. **Open it in your browser**
3. **Start recording** or **upload audio files**
4. **View transcription and documentation results**

## ✅ Features:
- Audio recording directly in browser
- File upload (WAV, MP3, M4A, FLAC, OGG)
- Real-time transcription with MedASR
- Structured documentation with MedGemma
- **Enhanced JSON parsing** - fewer "N/A" fields
- **Robust error recovery** - falls back to text extraction when JSON parsing fails
- Export results as JSON
- Copy to clipboard

## 🛑 To Stop:
- Click the **Stop** button on the running cell in Colab
- Or choose **Runtime → Interrupt execution** from the menu

## 📝 Notes:
- Free Colab sessions last **up to 12 hours**
- The ngrok URL **changes each time** you restart
- Models are **cached** after first download for as long as the runtime lives (faster server restarts); a new runtime downloads them again
- GPU inference is roughly **5-10x faster** than CPU
- Documentation fields should now populate correctly (check logs for parsing_method)

## Troubleshooting

### If models fail to load:
1. Check your HF_TOKEN is valid
2. Verify you accepted model terms on Hugging Face
3. Try restarting the runtime: Runtime → Restart runtime

### If ngrok fails:
1. Verify your ngrok authtoken
2. Free ngrok accounts have limits (1 tunnel at a time)
3. Try getting a new authtoken from ngrok dashboard
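
If a tunnel from an earlier run of the Step 7 cell is still open, closing it usually clears the one-tunnel limit. The sketch below uses pyngrok's `get_tunnels()` and `disconnect()`; the helper name `close_all_tunnels` is made up, and it takes the `ngrok` module as an argument so the snippet has no import-time dependency. It only sees tunnels opened by the current runtime's ngrok process.

```python
def close_all_tunnels(ngrok_module):
    """Disconnect every tunnel known to this pyngrok process and return their URLs."""
    closed = []
    # Copy the list first so disconnecting does not disturb the iteration
    for tunnel in list(ngrok_module.get_tunnels()):
        ngrok_module.disconnect(tunnel.public_url)
        closed.append(tunnel.public_url)
    return closed


# Hypothetical usage in a notebook cell:
# from pyngrok import ngrok
# print("Closed:", close_all_tunnels(ngrok))
# ngrok.kill()  # optionally stop the local ngrok process as well
```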

### If audio fails:
1. Check browser microphone permissions
2. Hard refresh browser (Ctrl+Shift+R)
3. Ensure audio files are under 5 minutes (default limit)
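
For uncompressed WAV files, the duration can be checked before uploading with the standard-library `wave` module. This is a sketch only: it does not handle MP3, M4A, FLAC or OGG, the helper names are illustrative, and `MAX_SECONDS` mirrors the `MAX_AUDIO_DURATION_SECONDS=300` value written in Step 4.

```python
import wave

MAX_SECONDS = 300  # mirrors MAX_AUDIO_DURATION_SECONDS from Step 4


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_within_limit(path, max_seconds=MAX_SECONDS):
    """True if the WAV file is no longer than max_seconds."""
    return wav_duration_seconds(path) <= max_seconds


# Hypothetical usage:
# print(wav_duration_seconds("recording.wav"), is_within_limit("recording.wav"))
```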

### If documentation shows "N/A" fields:
1. Check the server logs for "parsing_method" messages
2. Look for "json_successful" or "text_extraction_fallback"
3. If seeing frequent fallback, the model may need more warmup time
4. Check logs for any JSON extraction warnings
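
To make the two log labels concrete, here is a minimal sketch of a "JSON first, text fallback" parser. It is not the app's implementation, which this notebook does not show: the function name `parse_model_output`, the brace-slicing heuristic, and the `Key: value` fallback format are all assumptions. Only the labels `json_successful` and `text_extraction_fallback` come from the steps above.

```python
import json
import re


def parse_model_output(text):
    """Return (fields, parsing_method) for a model response."""
    # 1) Try the whole response, then the span from the first "{" to the last "}"
    #    (covers JSON wrapped in prose or a Markdown code fence)
    candidates = [text.strip()]
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed, "json_successful"

    # 2) Fallback: collect "Key: value" lines from free text
    fields = {}
    line_pattern = r"^\s*([A-Za-z][A-Za-z _]*?)\s*:\s*(.+?)\s*$"
    for match in re.finditer(line_pattern, text, re.MULTILINE):
        key = match.group(1).strip().lower().replace(" ", "_")
        fields[key] = match.group(2)
    return fields, "text_extraction_fallback"


fields, method = parse_model_output("Chief Complaint: cough\nDuration: 3 days")
print(method, fields)
```

In a scheme like this, a field shows "N/A" when neither path produces a value for it, which is why the parsing label in the logs is the first thing to check.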

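### Debug Mode:
To see detailed MedGemma output for debugging, raise the log level of the service logger:
```python
# Only affects the Python process this runs in
import logging
logging.getLogger('app.models.medgemma_service').setLevel(logging.DEBUG)
```
Step 7 starts uvicorn as a separate process, so running this in a notebook cell does not change the server's log level. If the app reads `LOG_LEVEL` from `.env` (Step 4 writes `LOG_LEVEL=INFO`), change that value to `DEBUG` instead, then restart the server to see detailed parsing logs.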