# Italian Exercise Generator - Colab Inference API

This notebook creates a FastAPI inference service for the Italian Exercise Generator model.

**What it does:**
- Merges your fine-tuned `italian_exercise_generator_lora` LoRA adapter into the base model (one-time step), then loads the merged model with vLLM (4.4x faster inference)
- Exposes a FastAPI endpoint for generating Italian exercises
- Opens a public ngrok tunnel so your local API can reach the Colab server

**Usage:**
1. Run all cells in order
2. Copy the ngrok URL from the output
3. Export it locally: `export INFERENCE_API_URL="https://your-url.ngrok.io"`
4. Start your local API: `./run_api.sh`
5. Your local API now uses the Colab GPU for homework generation.
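
On the local side, steps 3 and 4 amount to reading `INFERENCE_API_URL` and appending the endpoint paths to it. The following is a minimal sketch of that lookup, assuming the local API is written in Python. The helper names `resolve_inference_url` and `endpoint` are hypothetical and are not taken from `run_api.sh`.

```python
import os
from urllib.parse import urlparse


def resolve_inference_url(env=None) -> str:
    """Return the base URL of the Colab inference API without a trailing slash."""
    env = os.environ if env is None else env
    raw = env.get("INFERENCE_API_URL", "").strip()
    if not raw:
        raise RuntimeError(
            "INFERENCE_API_URL is not set; copy the ngrok URL from the notebook output"
        )
    parsed = urlparse(raw)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"INFERENCE_API_URL does not look like a URL: {raw!r}")
    return raw.rstrip("/")


def endpoint(base_url: str, path: str) -> str:
    """Join the base URL and an endpoint path with exactly one slash."""
    return f"{base_url}/{path.lstrip('/')}"


# Demonstration with an explicit mapping instead of the real environment.
base = resolve_inference_url({"INFERENCE_API_URL": "https://your-url.ngrok.io/"})
print(endpoint(base, "/health"))
print(endpoint(base, "/generate"))
```

Failing early with a clear message when the variable is missing tends to be easier to debug than a connection error raised later inside a request.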

In [1]:
# Cell 1: Install dependencies
!pip install fastapi uvicorn pyngrok vllm nest-asyncio spacy -q
!python -m spacy download it_core_news_sm
print("✅ Dependencies installed")
print("✅ Italian NLP model installed")

Collecting it-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-3.8.0/it_core_news_sm-3.8.0-py3-none-any.whl (13.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 76.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('it_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
✅ Dependencies installed
✅ Italian NLP model installed


In [2]:
# Cell 2: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
print("✅ Google Drive mounted")

Mounted at /content/drive
✅ Google Drive mounted


In [3]:
# Cell 3: Setup paths and verify model exists
import os
import sys

PROJECT_ROOT = "/content/drive/MyDrive/Colab Notebooks/italian_teacher"
LORA_PATH = os.path.join(PROJECT_ROOT, "models/italian_exercise_generator_v4")
BASE_MODEL = "swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA"

# Add project to Python path for imports
sys.path.insert(0, PROJECT_ROOT)

# Verify LoRA adapter exists
if not os.path.exists(LORA_PATH):
    print(f"❌ LoRA adapter not found at: {LORA_PATH}")
    print("Please update LORA_PATH to point to your italian_exercise_generator_lora model")
else:
    print(f"✅ LoRA adapter found at: {LORA_PATH}")
    print(f"✅ Base model: {BASE_MODEL}")
    print(f"✅ Project root: {PROJECT_ROOT}")

✅ LoRA adapter found at: /content/drive/MyDrive/Colab Notebooks/italian_teacher/models/italian_exercise_generator_v4
✅ Base model: swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
✅ Project root: /content/drive/MyDrive/Colab Notebooks/italian_teacher
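
`os.path.exists` only confirms that the folder is there. A slightly stricter check, sketched below, also looks for the files PEFT normally writes when it saves an adapter: `adapter_config.json` plus `adapter_model.safetensors` (or the older `adapter_model.bin`). The helper name `check_adapter_dir` is hypothetical.

```python
import os

ADAPTER_CONFIG = "adapter_config.json"
ADAPTER_WEIGHTS = ("adapter_model.safetensors", "adapter_model.bin")


def check_adapter_dir(path: str) -> list:
    """Return a list of problems found in a LoRA adapter folder (empty if it looks fine)."""
    if not os.path.isdir(path):
        return [f"not a directory: {path}"]
    problems = []
    if not os.path.isfile(os.path.join(path, ADAPTER_CONFIG)):
        problems.append(f"missing {ADAPTER_CONFIG}")
    if not any(os.path.isfile(os.path.join(path, name)) for name in ADAPTER_WEIGHTS):
        problems.append("missing adapter weights (" + " or ".join(ADAPTER_WEIGHTS) + ")")
    return problems
```

In the notebook this could be called as `check_adapter_dir(LORA_PATH)`, printing each returned problem before the merge step is attempted.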


In [4]:
# Cell 4: Merge LoRA adapter with base model (one-time, takes ~2-3 minutes)
import torch
import os

MERGED_MODEL_PATH = os.path.join(PROJECT_ROOT, "models/italian_exercise_generator_v4_merged")

# Check if already merged
if os.path.exists(MERGED_MODEL_PATH):
    print(f"✅ Merged model already exists at: {MERGED_MODEL_PATH}")
    print("Skipping merge step...")
    MODEL_PATH = MERGED_MODEL_PATH
else:
    print("⏳ Merging LoRA adapter with base model...")
    print(f"   Base model: {BASE_MODEL}")
    print(f"   LoRA adapter: {LORA_PATH}")
    print("")

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Load base model
    print("1. Loading base model from HuggingFace (~8GB)...")
    base_model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

    # Load LoRA adapter
    print("2. Loading LoRA adapter from Google Drive...")
    model = PeftModel.from_pretrained(base_model, LORA_PATH)

    # Merge and unload
    print("3. Merging LoRA weights into base model...")
    model = model.merge_and_unload()

    # Save merged model
    print(f"4. Saving merged model to {MERGED_MODEL_PATH}...")
    model.save_pretrained(MERGED_MODEL_PATH)
    tokenizer.save_pretrained(MERGED_MODEL_PATH)

    MODEL_PATH = MERGED_MODEL_PATH
    print("✅ Model merged and saved successfully!")

    # Free memory
    del model
    del base_model
    torch.cuda.empty_cache()

print(f"\n✅ Ready to load merged model for vLLM: {MODEL_PATH}")

✅ Merged model already exists at: /content/drive/MyDrive/Colab Notebooks/italian_teacher/models/italian_exercise_generator_v4_merged
Skipping merge step...

✅ Ready to load merged model for vLLM: /content/drive/MyDrive/Colab Notebooks/italian_teacher/models/italian_exercise_generator_v4_merged
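
Conceptually, merging folds the low-rank update into each adapted weight matrix, `W' = W + (alpha / r) * B @ A`, so inference no longer needs the adapter at all. The toy example below illustrates that arithmetic in plain Python on a 2x2 matrix. It is a sketch of the idea under PEFT's default `alpha / r` scaling, not the code that `merge_and_unload()` runs, and every name in it is illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A) as a new matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


W = [[1, 0], [0, 1]]   # frozen base weight (2x2)
A = [[1, 2]]           # LoRA "A" matrix, shape r x in  = 1 x 2
B = [[3], [4]]         # LoRA "B" matrix, shape out x r = 2 x 1

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[7.0, 12.0], [8.0, 17.0]]
```

After the merge, a single matrix multiply with `merged` gives the same result as the base weight plus the adapter path, which is why the merged checkpoint can be served without any LoRA-specific handling.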


In [5]:
# Cell 5: Load merged model with vLLM (~30 seconds)
from vllm import LLM

print("⏳ Loading merged model with vLLM for fast inference...")

llm = LLM(
    model=MODEL_PATH,
    tensor_parallel_size=1,
    dtype="half",
    max_model_len=2048,
    gpu_memory_utilization=0.85,
    trust_remote_code=True
)

print("✅ Italian Exercise Generator model loaded with vLLM!")
print(f"🔥 GPU: {torch.cuda.get_device_name() if torch.cuda.is_available() else 'CPU'}")
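# Note: memory_allocated() only counts tensors owned by this notebook process,
# so it can read 0.00GB if vLLM holds the weights in its own worker process.
print(f"💾 GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f}GB")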

INFO 10-10 19:13:21 [__init__.py:216] Automatically detected platform cuda.
⏳ Loading merged model with vLLM for fast inference...
INFO 10-10 19:13:30 [utils.py:233] non-default args: {'trust_remote_code': True, 'dtype': 'half', 'max_model_len': 2048, 'gpu_memory_utilization': 0.85, 'disable_log_stats': True, 'model': '/content/drive/MyDrive/Colab Notebooks/italian_teacher/models/italian_exercise_generator_v4_merged'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


INFO 10-10 19:13:47 [model.py:547] Resolved architecture: LlamaForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 10-10 19:13:47 [model.py:1510] Using max model len 2048
INFO 10-10 19:13:50 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 10-10 19:18:26 [llm.py:306] Supported_tasks: ['generate']
✅ Italian Exercise Generator model loaded with vLLM!
🔥 GPU: NVIDIA L4
💾 GPU Memory: 0.00GB
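
A rough back-of-the-envelope check can show why `dtype="half"`, `max_model_len=2048` and `gpu_memory_utilization=0.85` fit on this GPU. The figures below are assumptions, not values read from the notebook: about 8 billion parameters, the Llama 3 8B attention layout (32 layers, 8 KV heads, head dimension 128), and 24 GB of memory on the L4, treated here as 24 GiB for simplicity.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights; 2 bytes per parameter in half precision."""
    return n_params * bytes_per_param


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Memory for one token's keys and values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


weights = weight_bytes(8_000_000_000)               # assumed parameter count
per_token = kv_cache_bytes_per_token(32, 8, 128)    # 131072 bytes = 128 KiB
per_sequence = per_token * 2048                     # one full-length sequence
budget = 0.85 * 24 * GIB                            # share of the GPU given to vLLM

print(f"weights          : {weights / GIB:.1f} GiB")
print(f"KV cache / seq   : {per_sequence / GIB:.2f} GiB")
print(f"vLLM budget      : {budget / GIB:.1f} GiB")
print(f"left for KV cache: {(budget - weights) / GIB:.1f} GiB")
```

Under these assumptions the weights take most of the budget and the remainder goes to the KV cache, which is what limits how many sequences can be batched at once. Activation memory and framework overhead are ignored, so the real headroom is smaller.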


In [6]:
# Cell 6: Create FastAPI application
import nest_asyncio
from src.api.inference import create_inference_app

# Allow nested event loops (required for Colab)
nest_asyncio.apply()

# Port for Colab API (8001 to avoid conflict with local API on 8000)
COLAB_PORT = 8001

# Create the FastAPI app
app = create_inference_app(llm, port=COLAB_PORT)

print(f"✅ FastAPI application created (port {COLAB_PORT})")
print(f"📋 Version: 2.0.0")
print("🚀 Ready to start server!")

✅ FastAPI application created (port 8001)
📋 Version: 2.0.0
🚀 Ready to start server!
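
The reason for `nest_asyncio.apply()` is that the notebook kernel already runs an asyncio event loop, and stock asyncio refuses to start a second one from inside it. The standalone sketch below reproduces that refusal outside a notebook (run it as a plain script). The function names are illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    """Try to start a second event loop while one is already running."""
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


nested_result = asyncio.run(outer())
print(nested_result)
```

`nest_asyncio` patches the loop so that such nested calls are allowed, which is what lets async server code be driven from a notebook cell.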


In [7]:
# Cell 7: Setup ngrok tunnel
from pyngrok import ngrok

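# Read your ngrok auth token from the environment instead of hard-coding it
# (get a free token at https://ngrok.com); falls back to an interactive prompt.
import os
from getpass import getpass
NGROK_AUTH_TOKEN = os.environ.get("NGROK_AUTH_TOKEN") or getpass("ngrok auth token: ")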

# Authenticate ngrok
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# Create tunnel
tunnel = ngrok.connect(COLAB_PORT)
public_url = str(tunnel.public_url)

print("🌐 ngrok tunnel created!")
print(f"\n📍 Public URL: {public_url}")
print(f"\n🔗 API Endpoints:")
print(f"   Health: {public_url}/health")
print(f"   Generate: {public_url}/generate")
print(f"\n✅ Copy the public URL above for use in your local environment")

🌐 ngrok tunnel created!

📍 Public URL: https://orthoscopic-nonengrossingly-lashon.ngrok-free.dev

🔗 API Endpoints:
   Health: https://orthoscopic-nonengrossingly-lashon.ngrok-free.dev/health
   Generate: https://orthoscopic-nonengrossingly-lashon.ngrok-free.dev/generate

✅ Copy the public URL above for use in your local environment


In [8]:
# Cell 8: Start FastAPI server
import uvicorn
from threading import Thread
import time
import requests
import json

print(f"🚀 Starting FastAPI server on port {COLAB_PORT}...")

# Create uvicorn config
config = uvicorn.Config(
    app=app,
    host="0.0.0.0",
    port=COLAB_PORT,
    log_level="error"
)

# Create server
server = uvicorn.Server(config)

# Start in background thread
def run_server():
    import asyncio
    asyncio.run(server.serve())

server_thread = Thread(target=run_server, daemon=True)
server_thread.start()

# Wait for server to be ready
print("⏳ Waiting for server to start...")
time.sleep(3)

# Test if it's working (a non-2xx response raises, so the except branch reports it)
try:
    response = requests.get(f"http://localhost:{COLAB_PORT}/health", timeout=2)
    response.raise_for_status()

    if response.status_code == 200:
        print("\n✅ SERVER IS RUNNING!")
        print(f"📡 Listening on http://0.0.0.0:{COLAB_PORT}\n")

        print("🧪 Health check response:")
        print(json.dumps(response.json(), indent=2))

        print("\n" + "="*70)
        print("🌐 YOUR NGROK PUBLIC URL:")
        print("="*70)
        print(f"\n{public_url}\n")
        print("="*70)

        print("\n📋 COPY AND RUN ON YOUR MAC:\n")
        print(f'export INFERENCE_API_URL="{public_url}"')
        print("./run_api.sh")

        print("\n" + "="*70)
        print("\n⚡ Server is running! Keep this notebook open!")
        print("🛑 To stop: Runtime → Interrupt execution")
        print("="*70)

except Exception as e:
    print(f"\n❌ Server failed to start: {e}")
    print("\n🔄 Try this:")
    print("   1. Runtime → Restart runtime")
    print("   2. Re-run all cells")

🚀 Starting FastAPI server on port 8001...
⏳ Waiting for server to start...

✅ SERVER IS RUNNING!
📡 Listening on http://0.0.0.0:8001

🧪 Health check response:
{
  "status": "healthy",
  "gpu_available": true,
  "gpu_memory_allocated_gb": 0.0,
  "model_loaded": true,
  "port": 8001
}

🌐 YOUR NGROK PUBLIC URL:

https://orthoscopic-nonengrossingly-lashon.ngrok-free.dev


📋 COPY AND RUN ON YOUR MAC:

export INFERENCE_API_URL="https://orthoscopic-nonengrossingly-lashon.ngrok-free.dev"
./run_api.sh


⚡ Server is running! Keep this notebook open!
🛑 To stop: Runtime → Interrupt execution
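
A fixed `time.sleep(3)` can be too short on a cold runtime. One alternative is to poll the health endpoint until it answers or a deadline passes. The sketch below uses only the standard library. `wait_for_health` and its parameters are hypothetical names, not part of the project.

```python
import time
import urllib.request


def wait_for_health(url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts and HTTP error statuses
            # (urllib.error.URLError is a subclass of OSError).
            pass
        time.sleep(interval_s)
    return False
```

In Cell 8 this could replace the sleep with something like `wait_for_health(f"http://localhost:{COLAB_PORT}/health")`, printing the failure hints when it returns `False`.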
