# LLaVA Vision-Language Model - 5 Practical Uses Demo

This notebook serves LLaVA (`llava-hf/llava-1.5-7b-hf`) through a FastAPI server and an ngrok tunnel, so that a local client can exercise five practical uses:
1. Image Captioning
2. Tag/Object/Concept Extraction
3. Instruction Following with Images
4. OCR-like Understanding (Text in Images)
5. Structured Output Generation
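
As a rough illustration, each of these uses can be driven by a different prompt sent to the same model. The prompt wording and the names `USE_CASE_PROMPTS` and `build_prompt` are assumptions made for this sketch; only the captioning prompt matches the one that appears in the run log.

```python
# Hypothetical prompt table: one example prompt per use case.
USE_CASE_PROMPTS = {
    "caption": "Describe this image in one concise sentence.",
    "tags": "List the main objects and concepts in this image as comma-separated tags.",
    "instruction": "Suggest three improvements to the composition of this photo.",
    "ocr": "Transcribe any text that is visible in this image.",
    "structured": (
        "Return a JSON object with the keys 'scene', 'objects', and 'mood' "
        "describing this image. Output only the JSON."
    ),
}


def build_prompt(use_case: str) -> str:
    """Return the example prompt for a use case, or raise for an unknown one."""
    try:
        return USE_CASE_PROMPTS[use_case]
    except KeyError:
        raise ValueError(f"Unknown use case: {use_case!r}") from None


print(build_prompt("caption"))
```

Keeping the prompts in one table makes it easy to compare how the same image is handled across the five uses.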

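**Libraries used:**
- transformers: LLaVA model and processor
- torch: PyTorch backend
- pillow (PIL): Image loading and processing
- fastapi, uvicorn: HTTP API and ASGI server
- pyngrok: Public tunnel to the Colab server
- accelerate: Device placement for `device_map="auto"`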



In [None]:
# -*- coding: utf-8 -*-
"""
LLaVA FastAPI Server for Google Colab
This server runs the LLaVA model and exposes it via FastAPI with ngrok tunneling
"""

# Install required packages
print("Installing required packages...")
import subprocess
import sys

packages = [
    "fastapi",
    "uvicorn",
    "pyngrok",
    "transformers",
    "torch",
    "pillow",
    "accelerate",
]

for package in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

print("✓ All packages installed successfully!")


Installing required packages...
✓ All packages installed successfully!
✓ Libraries imported successfully!

STARTING LLAVA FASTAPI SERVER
Loading LLaVA model and processor...
This may take 2-5 minutes on first run...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


`torch_dtype` is deprecated! Use `dtype` instead!


model-00001-of-00003.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.18G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

✓ Model loaded successfully!
✓ Model device: cuda:0
✓ Model dtype: torch.float16

Starting ngrok tunnel...

🌐 PUBLIC URL: NgrokTunnel: "https://kenna-explosible-nonmonistically.ngrok-free.dev" -> "http://localhost:8000"

⚠️  IMPORTANT: Copy this URL and use it in your local app.py
   Update COLAB_SERVER_URL = 'NgrokTunnel: "https://kenna-explosible-nonmonistically.ngrok-free.dev" -> "http://localhost:8000"'




INFO:     Started server process [1167]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)



✓ Server is running!
✓ Keep this cell running - do NOT stop it
✓ Use the public URL above in your local application

Processing image: grindelwald.webp
Image size: (750, 500)
Prompt: Describe this image in one concise sentence.
Response: A beautiful blue lake is surrounded by mountains and a road.
INFO:     192.116.63.38:0 - "POST /v1/analyze HTTP/1.1" 200 OK
INFO:     192.116.63.38:0 - "GET /health HTTP/1.1" 200 OK


In [None]:

# Import libraries
from fastapi import FastAPI, File, UploadFile, Form, HTTPException
from fastapi.responses import JSONResponse
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import io
import json
from pyngrok import ngrok
import uvicorn
from typing import Optional
import threading

print("✓ Libraries imported successfully!")


In [None]:

# Initialize FastAPI app
app = FastAPI(
    title="LLaVA Vision-Language Model API",
    description="API for image analysis using LLaVA model",
    version="1.0.0"
)

# Global variables for model and processor
processor = None
model = None
model_loaded = False
device = None
dtype = None


In [None]:

def load_llava_model():
    """
    Load the LLaVA model and processor from Hugging Face
    Model: llava-hf/llava-1.5-7b-hf
    """
    global processor, model, model_loaded, device, dtype

    if model_loaded:
        print("Model already loaded!")
        return

    print("=" * 70)
    print("Loading LLaVA model and processor...")
    print("This may take 2-5 minutes on first run...")
    print("=" * 70)

    if torch.cuda.is_available():
        device = torch.device("cuda")
        dtype = torch.float16
        print("✓ CUDA available - using GPU")
        print(f"✓ GPU: {torch.cuda.get_device_name(0)}")

    else:
        device = torch.device("cpu")
        dtype = torch.float32
        print("⚠️  CUDA not available - using CPU")
        print("⚠️  Note: CPU inference will be slower")

    model_id = "llava-hf/llava-1.5-7b-hf"

    # Load processor
    processor = AutoProcessor.from_pretrained(model_id)

    # Load model with appropriate settings for CPU or GPU
    if device.type == "cuda":
        model = LlavaForConditionalGeneration.from_pretrained(
            model_id,
            torch_dtype=dtype,
            device_map="auto"
        )
    else:
        # For CPU, load without device_map and move manually
        model = LlavaForConditionalGeneration.from_pretrained(
            model_id,
            torch_dtype=dtype,
            low_cpu_mem_usage=True
        )
        model = model.to(device)

    model_loaded = True

    print("✓ Model loaded successfully!")
    print(f"✓ Model device: {model.device}")
    print(f"✓ Model dtype: {model.dtype}")
    print("=" * 70)
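
The choice of `torch.float16` on GPU and `torch.float32` on CPU mostly affects memory. As a back-of-the-envelope check, the three shard sizes printed during the checkpoint download can be summed and rescaled. This sketch assumes the shards store weights at 2 bytes per parameter, and it ignores activations and the generation cache, so the figures are lower bounds rather than measurements. The function name is hypothetical.

```python
# Shard sizes in GB, as printed while the checkpoint was downloaded.
SHARD_SIZES_GB = [4.99, 4.96, 4.18]

# Assumption: the checkpoint stores weights in 16-bit precision.
BYTES_PER_PARAM_FP16 = 2
BYTES_PER_PARAM_FP32 = 4


def estimate_weight_memory_gb(shard_sizes_gb, stored_bytes, target_bytes):
    """Rescale the on-disk checkpoint size to a different parameter width."""
    total_gb = sum(shard_sizes_gb)
    return total_gb * target_bytes / stored_bytes


fp16_gb = estimate_weight_memory_gb(
    SHARD_SIZES_GB, BYTES_PER_PARAM_FP16, BYTES_PER_PARAM_FP16
)
fp32_gb = estimate_weight_memory_gb(
    SHARD_SIZES_GB, BYTES_PER_PARAM_FP16, BYTES_PER_PARAM_FP32
)
print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"float32 weights: ~{fp32_gb:.1f} GB")
```

Under these assumptions the CPU branch needs roughly twice the weight memory of the GPU branch.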


In [None]:

def generate_response(image: Image.Image, prompt: str) -> str:
    """
    Generate a response from LLaVA given an image and text prompt

    Args:
        image: PIL Image object
        prompt: Text prompt/instruction

    Returns:
        Generated text response
    """
    if not model_loaded:
        raise HTTPException(status_code=503, detail="Model not loaded yet")

    # Prepare conversation format (LLaVA uses a specific format)
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ],
        },
    ]

    # Apply chat template
    prompt_text = processor.apply_chat_template(conversation, add_generation_prompt=True)

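    # Process inputs and move them to the model's device. Floating-point
    # tensors are cast to the model dtype, so this also works on CPU (float32).
    inputs = processor(images=image, text=prompt_text, return_tensors="pt").to(model.device, model.dtype)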

    # Generate response
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

    # Decode and extract only the generated text (remove prompt)
    generated_text = processor.decode(output[0], skip_special_tokens=True)

    # Extract only the assistant's response
    if "ASSISTANT:" in generated_text:
        response = generated_text.split("ASSISTANT:")[-1].strip()
    else:
        response = generated_text.strip()

    return response
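
The last step of `generate_response` relies on the decoded text containing an `ASSISTANT:` marker. A standalone sketch of that extraction follows. `extract_assistant_reply` is a hypothetical helper name, and the sample string only approximates what LLaVA 1.5 decodes to; the exact whitespace around the image placeholder may differ.

```python
def extract_assistant_reply(generated_text: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker, or the whole text if it is absent."""
    if marker in generated_text:
        return generated_text.split(marker)[-1].strip()
    return generated_text.strip()


# Approximate decoded output for a single-turn conversation (assumed format).
sample = (
    "USER:  \nDescribe this image in one concise sentence. "
    "ASSISTANT: A beautiful blue lake is surrounded by mountains and a road."
)
print(extract_assistant_reply(sample))
print(extract_assistant_reply("  no marker here  "))
```

Because the split keeps only the text after the last marker, a reply that itself contains `ASSISTANT:` would be cut short. That is an accepted limitation of this string-based approach.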


In [None]:

@app.get("/")
async def root():
    """Root endpoint"""
    return {
        "message": "LLaVA Vision-Language Model API",
        "status": "running",
        "model_loaded": model_loaded,
        "endpoints": {
            "analyze": "/v1/analyze",
            "health": "/health"
        }
    }

@app.get("/health")
async def health_check():
    """Health check endpoint"""
    return {
        "status": "healthy",
        "model_loaded": model_loaded,
        "cuda_available": torch.cuda.is_available(),
        "device": str(model.device) if model_loaded else "N/A"
    }

@app.post("/v1/analyze")
async def analyze_image(
    file: UploadFile = File(...),
    prompt: str = Form(...)
):
    """
    Analyze an uploaded image using LLaVA model

    Args:
        file: Image file to analyze
        prompt: Text prompt/instruction for the model

    Returns:
        JSON response with analysis results
    """
    try:
        # Validate file type (content_type can be None if the client omits it)
        if not (file.content_type or "").startswith("image/"):
            raise HTTPException(status_code=400, detail="File must be an image")

        # Read and process image
        image_bytes = await file.read()
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

        print(f"Processing image: {file.filename}")
        print(f"Image size: {image.size}")
        print(f"Prompt: {prompt}")

        # Generate response using LLaVA
        response = generate_response(image, prompt)

        print(f"Response: {response}")

        return {
            "success": True,
            "filename": file.filename,
            "image_size": {
                "width": image.size[0],
                "height": image.size[1]
            },
            "prompt": prompt,
            "response": response
        }

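    except HTTPException:
        # Re-raise deliberate HTTP errors (400 for non-images, 503 while the
        # model is loading) instead of converting them into a generic 500
        raise
    except Exception as e:
        print(f"Error processing image: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing image: {str(e)}")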
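
A client on the local machine can call `/v1/analyze` with a multipart form carrying the `file` and `prompt` fields that the endpoint declares. The sketch below uses only the standard library. The `COLAB_SERVER_URL` value is a placeholder, and `encode_multipart` and `analyze_image_remote` are illustrative names that are not part of the notebook.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

# Placeholder: replace with the public URL printed by the server cell.
COLAB_SERVER_URL = "https://example.ngrok-free.dev"


def encode_multipart(prompt: str, filename: str, image_bytes: bytes):
    """Build a multipart/form-data body with 'prompt' and 'file' fields."""
    boundary = uuid.uuid4().hex
    # The server rejects uploads whose content type does not start with "image/".
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    prompt_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="prompt"\r\n\r\n'
        f"{prompt}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = prompt_part + file_header + image_bytes + closing
    return body, f"multipart/form-data; boundary={boundary}"


def analyze_image_remote(image_path: str, prompt: str,
                         base_url: str = COLAB_SERVER_URL) -> dict:
    """POST an image and a prompt to /v1/analyze and return the parsed JSON."""
    path = Path(image_path)
    body, content_type = encode_multipart(prompt, path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/analyze",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


# Example call (requires a running server and a local image file):
# result = analyze_image_remote("photo.png", "Describe this image in one concise sentence.")
# print(result["response"])
```

The content type is guessed from the file extension, so an extension that the local Python version does not recognise falls back to `application/octet-stream` and would be rejected by the endpoint's image check.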


In [None]:
# Setup ngrok tunnel and run server
if __name__ == "__main__":
    print("\n" + "=" * 70)
    print("STARTING LLAVA FASTAPI SERVER")
    print("=" * 70)

    # Load model before starting server
    load_llava_model()

    # Set the ngrok auth token from the NGROK_AUTHTOKEN environment variable
    # instead of hardcoding the secret in the notebook.
    # Get a free token from: https://dashboard.ngrok.com/get-started/your-authtoken
    import os
    ngrok.set_auth_token(os.environ["NGROK_AUTHTOKEN"])

    # Start ngrok tunnel
    print("\nStarting ngrok tunnel...")
    # ngrok.connect returns an NgrokTunnel object; keep only its URL string
    public_url = ngrok.connect(8000).public_url
    print(f"\n{'=' * 70}")
    print(f"🌐 PUBLIC URL: {public_url}")
    print(f"{'=' * 70}")
    print(f"\n⚠️  IMPORTANT: Copy this URL and use it in your local app.py")
    print(f"   Update COLAB_SERVER_URL = '{public_url}'\n")
    print(f"{'=' * 70}\n")

    # Run FastAPI server using threading to avoid event loop issues
    config = uvicorn.Config(app, host="127.0.0.1", port=8000, log_level="info")
    server = uvicorn.Server(config)

    # Run server in a thread to work in Colab
    thread = threading.Thread(target=server.run)
    thread.start()

    print("\n✓ Server is running!")
    print("✓ Keep this cell running - do NOT stop it")
    print("✓ Use the public URL above in your local application\n")

    # Keep the main thread alive
    thread.join()
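
Before sending images, a client may want to confirm that the tunnel is up and the model has finished loading. The `/health` endpoint defined earlier returns a `model_loaded` flag that can be polled. `wait_until_ready` and its `timeout` and `interval` parameters are hypothetical names for this sketch.

```python
import json
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout: float = 300.0,
                     interval: float = 5.0) -> bool:
    """Poll GET /health until model_loaded is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    url = f"{base_url.rstrip('/')}/health"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                if json.load(response).get("model_loaded"):
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            # Server not reachable yet, or the body was not valid JSON; retry.
            pass
        time.sleep(interval)
    return False
```

A client could call `wait_until_ready(COLAB_SERVER_URL)` once at startup and only proceed to `/v1/analyze` when it returns `True`.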