# 🚀 Free GPU LLM Host (Ollama + Ngrok)
Run this notebook to host `llama3.1:8b` (Q4 quantized) on a Google Colab T4 GPU and expose its HTTP API through an Ngrok tunnel so your Render backend can call it.

**IMPORTANT**: Make sure to select **Runtime** → **Change runtime type** → **T4 GPU** before running!

In [None]:
# 0. Verify GPU is Available
!nvidia-smi
import torch
print(f"\n✅ CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU Device: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ WARNING: No GPU detected! Go to Runtime → Change runtime type → T4 GPU")

In [None]:
# 1. Install Ollama & Ngrok
!curl -fsSL https://ollama.com/install.sh | sh
!pip install pyngrok

In [None]:
# 2. Configure Ngrok (REQUIRED)
import getpass
from pyngrok import ngrok, conf

print("Enter your Ngrok Authtoken (from dashboard.ngrok.com):")
token = getpass.getpass()
conf.get_default().auth_token = token
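
If you re-run this notebook often, typing the token every time gets tedious. The sketch below is one possible variation, not part of the original flow: it reads the token from the `NGROK_AUTHTOKEN` environment variable when that is set and only falls back to the interactive prompt otherwise. The helper name `resolve_ngrok_token` and its parameters are hypothetical.

```python
import getpass
import os


def resolve_ngrok_token(env=os.environ, prompt=getpass.getpass):
    """Return the Ngrok authtoken from NGROK_AUTHTOKEN, or ask for it.

    `env` and `prompt` are injectable (an assumption made for this sketch)
    so the helper can be exercised without a real terminal.
    """
    token = env.get("NGROK_AUTHTOKEN", "").strip()
    if not token:
        # No environment variable set: fall back to a hidden prompt.
        token = prompt("Ngrok authtoken: ").strip()
    if not token:
        raise ValueError("An Ngrok authtoken is required to open a tunnel.")
    return token


# Usage inside the notebook (requires pyngrok, installed in step 1):
#   conf.get_default().auth_token = resolve_ngrok_token()
```

The rest of the cell stays the same; only the source of `token` changes.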

In [None]:
# 3. Start Ollama Server with GPU Support
import os
import threading
import time
import subprocess

def start_ollama():
    # Configure Ollama environment
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
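    # Ollama detects CUDA and offloads layers to the GPU automatically.
    # NOTE: OLLAMA_GPU_LAYERS is not among Ollama's documented environment
    # variables, so this line is probably ignored; it is kept as a harmless hint.
    os.environ['OLLAMA_GPU_LAYERS'] = '999'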
    subprocess.run(["ollama", "serve"])

threading.Thread(target=start_ollama, daemon=True).start()
print("⏳ Starting Ollama server...")
time.sleep(8)

# Verify server is running
!curl -s http://localhost:11434 && echo "✅ Ollama server is UP" || echo "❌ Server failed to start"

# Pull Model (should automatically use GPU)
print("\n⬇️ Pulling llama3.1:8b model (Q4 quantized, optimized for T4)...")
!ollama pull llama3.1:8b

# Test inference; --verbose prints timing stats (e.g. eval rate) that help confirm GPU usage
print("\n🧪 Testing GPU inference...")
!ollama run llama3.1:8b "Say hi in 3 words" --verbose
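
The fixed `time.sleep(8)` in this cell can be too short on a slow runtime and wastes time on a fast one. A minimal sketch of a polling alternative follows, assuming the server answers plain `GET /` with HTTP 200 once it is up (which is what the `curl` check relies on). The function name `wait_for_server` and its defaults are hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:11434", timeout=60.0, interval=0.5):
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse.

    Returns True if the server answered in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting connections yet; retry.
        time.sleep(interval)
    return False


# Usage in place of time.sleep(8):
#   if not wait_for_server():
#       raise RuntimeError("Ollama server did not start within 60 seconds")
```

Failing loudly here stops the cell before it attempts the model pull against a server that never came up.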

In [None]:
# 4. Expose Public URL
public_url = ngrok.connect(11434, bind_tls=True).public_url
print("✅ Universal LLM API Ready!")
print(f"🔑 COPY THIS URL: {public_url}")
print("\n📋 Add this to your Render Environment Variables:")
print(f"   OLLAMA_BASE_URL={public_url}")
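
Once `OLLAMA_BASE_URL` is set on Render, the backend can call the tunnel like any other Ollama server. The sketch below shows one way to do that with only the standard library, using Ollama's `POST /api/generate` endpoint with streaming disabled. The helper names `build_generate_request` and `generate` are hypothetical and not taken from any existing backend code.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, model="llama3.1:8b"):
    """Build a non-streaming POST request for Ollama's /api/generate."""
    url = base_url.rstrip("/") + "/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, model="llama3.1:8b", timeout=120):
    """Send the prompt and return the generated text from the JSON reply."""
    req = build_generate_request(base_url, prompt, model)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Usage on the backend (assumes OLLAMA_BASE_URL holds the URL printed above):
#   import os
#   print(generate(os.environ["OLLAMA_BASE_URL"], "Say hi in 3 words"))
```

With a free Ngrok account the public URL typically changes each time the tunnel is restarted, so the environment variable needs updating after every new notebook session.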