# LLaVA Vision-Language Model - 5 Practical Uses Demo

This notebook serves LLaVA (`llava-hf/llava-1.5-7b-hf`) behind a FastAPI server; the 5 practical uses below are selected by the prompt sent with each image:
1. Image Captioning
2. Tag/Object/Concept Extraction
3. Instruction Following with Images
4. OCR-like Understanding (Text in Images)
5. Structured Output Generation

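**Libraries used:**
- transformers: LLaVA and CLIP models and their processors
- torch: PyTorch backend
- PIL (pillow): Image loading and processing
- fastapi, uvicorn: HTTP API server
- pyngrok: Public tunnel to the Colab-hosted server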



In [None]:
"""
LLaVA FastAPI Server for Google Colab
This server runs the LLaVA model and exposes it via FastAPI with ngrok tunneling
"""

# Install required packages
print("Installing required packages...")
import subprocess
import sys

packages = [
    "fastapi",
    "uvicorn",
    "pyngrok",
    "transformers",
    "torch",
    "pillow",
    "accelerate",
]

for package in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

print("✓ All packages installed successfully!")

Installing required packages...
✓ All packages installed successfully!


In [None]:

# Import libraries
from fastapi import FastAPI, File, UploadFile, Form, HTTPException
from fastapi.responses import JSONResponse
import torch
import torch.nn.functional as F
from transformers import AutoProcessor, LlavaForConditionalGeneration, CLIPModel, CLIPProcessor
from PIL import Image
import io
import json
from pyngrok import ngrok
import uvicorn
from typing import Optional
import threading

print("✓ Libraries imported successfully!")


✓ Libraries imported successfully!


In [None]:

# Initialize FastAPI app
app = FastAPI(
    title="LLaVA Vision-Language Model API",
    description="API for image analysis using LLaVA model",
    version="1.0.0"
)

# Global variables for LLaVA model and processor
processor = None
model = None
model_loaded = False
device = None
dtype = None

# Global variables for CLIP model and processor
clip_model = None
clip_processor = None
clip_model_loaded = False


In [None]:

def load_llava_model():
    """
    Load the LLaVA model and processor from Hugging Face
    Model: llava-hf/llava-1.5-7b-hf
    """
    global processor, model, model_loaded, device, dtype

    if model_loaded:
        print("Model already loaded!")
        return

    print("=" * 70)
    print("Loading LLaVA model and processor...")
    print("This may take 2-5 minutes on first run...")
    print("=" * 70)

    if torch.cuda.is_available():
        device = torch.device("cuda")
        dtype = torch.float16
        print("✓ CUDA available - using GPU")
        print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
    else:
        device = torch.device("cpu")
        dtype = torch.float32
        print("⚠️  CUDA not available - using CPU")
        print("⚠️  Note: CPU inference will be slower")

    model_id = "llava-hf/llava-1.5-7b-hf"

    # Load processor
    processor = AutoProcessor.from_pretrained(model_id)

    # Load model with appropriate settings for CPU or GPU
    if device.type == "cuda":
        model = LlavaForConditionalGeneration.from_pretrained(
            model_id,
            torch_dtype=dtype,
            device_map="auto"
        )
    else:
        # For CPU, load without device_map and move manually
        model = LlavaForConditionalGeneration.from_pretrained(
            model_id,
            torch_dtype=dtype,
            low_cpu_mem_usage=True
        )
        model = model.to(device)

    model_loaded = True

    print("✓ Model loaded successfully!")
    print(f"✓ Model device: {model.device}")
    print(f"✓ Model dtype: {model.dtype}")
    print("=" * 70)


In [None]:
def load_clip_model():
    """
    Load the CLIP model (ViT-B/32)
    This model acts as the OpenVision encoder
    It converts images into 512-dimensional embeddings
    """
    global clip_model, clip_processor, clip_model_loaded

    if clip_model_loaded:
        print("CLIP model already loaded!")
        return

    print("=" * 70)
    print("Loading CLIP model and processor...")
    print("=" * 70)

    MODEL_NAME = "openai/clip-vit-base-patch32"

    clip_processor = CLIPProcessor.from_pretrained(MODEL_NAME)
    clip_model = CLIPModel.from_pretrained(MODEL_NAME)

    # Move to GPU if available
    if torch.cuda.is_available():
        clip_model = clip_model.to("cuda")
        print("✓ CLIP model loaded on GPU")
    else:
        print("✓ CLIP model loaded on CPU")

    # Set model to evaluation mode (important: no training)
    clip_model.eval()

    clip_model_loaded = True
    print("✓ CLIP model loaded successfully")
    print("=" * 70)

In [None]:
def generate_embeddings(image: Image.Image):
    """
    Preprocess image and convert it to model input
    Generate normalized embeddings for the image

    Args:
        image: PIL Image object

    Returns:
        Normalized embedding tensor of shape [1, 512]
    """
    if not clip_model_loaded:
        raise HTTPException(status_code=503, detail="CLIP model not loaded yet")

    # Preprocess image
    inputs = clip_processor(images=image, return_tensors="pt")

    # Move inputs to whichever device the CLIP model actually lives on
    inputs = {k: v.to(clip_model.device) for k, v in inputs.items()}

    # Disable gradient calculation (we are not training)
    with torch.no_grad():
        image_features = clip_model.get_image_features(**inputs)

    # Normalize embeddings to unit length (L2 normalization)
    # This is required for cosine similarity to work correctly
    image_embedding = F.normalize(image_features, p=2, dim=1)

    return image_embedding

In [None]:
def cosine_similarity(embedding1, embedding2):
    """
    Calculate cosine similarity between two embeddings

    Args:
        embedding1: First embedding tensor [1, 512]
        embedding2: Second embedding tensor [1, 512]

    Returns:
        Cosine similarity score (float between -1 and 1)
    """
    cosine = torch.nn.CosineSimilarity(dim=1)
    similarity = cosine(embedding1, embedding2)
    return similarity.item()
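
The comment in `generate_embeddings` says L2 normalization is needed for cosine similarity. A minimal sketch of why, using plain Python lists rather than CLIP tensors: once two vectors have unit length, their dot product equals their cosine similarity. The helper names and the 3-dimensional vectors are illustrative assumptions, not part of the server.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_sim(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy "embeddings" (hypothetical values, not real CLIP outputs)
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

raw = cosine_sim(a, b)                         # full cosine formula
unit = dot(l2_normalize(a), l2_normalize(b))   # dot product of unit vectors
print(round(raw, 4), round(unit, 4))           # both values agree
```

Because the embeddings returned by `generate_embeddings` are already normalized, a client that stores them can compare any two with a dot product alone.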

In [None]:

def generate_response(image: Image.Image, prompt: str) -> str:
    """
    Generate a response from LLaVA given an image and text prompt

    Args:
        image: PIL Image object
        prompt: Text prompt/instruction

    Returns:
        Generated text response
    """
    if not model_loaded:
        raise HTTPException(status_code=503, detail="Model not loaded yet")

    # Prepare conversation format (LLaVA uses a specific format)
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ],
        },
    ]

    # Apply chat template
    prompt_text = processor.apply_chat_template(conversation, add_generation_prompt=True)

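    # Process inputs and move them to the model's device and dtype
    # (a hardcoded GPU index and float16 would fail on the CPU path)
    inputs = processor(images=image, text=prompt_text, return_tensors="pt").to(model.device, model.dtype)

    # Generate response without tracking gradients (inference only)
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=200, do_sample=False)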

    # Decode and extract only the generated text (remove prompt)
    generated_text = processor.decode(output[0], skip_special_tokens=True)

    # Extract only the assistant's response
    if "ASSISTANT:" in generated_text:
        response = generated_text.split("ASSISTANT:")[-1].strip()
    else:
        response = generated_text.strip()

    return response
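
The last step of `generate_response` keeps only the text after the final `ASSISTANT:` marker, because the decoded output still contains the prompt. A small standalone sketch of that step, assuming a hypothetical helper name `extract_assistant_reply` and a made-up decoded string in the LLaVA 1.5 `USER: ... ASSISTANT: ...` layout:

```python
def extract_assistant_reply(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker, or the whole string if it is absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()

# Hypothetical decoded output: the prompt followed by the model's answer
sample = (
    "USER:  \nDescribe this image in one concise sentence. "
    "ASSISTANT: A plant with green leaves and red spots."
)
print(extract_assistant_reply(sample))
print(extract_assistant_reply("  no marker here  "))
```

Splitting on the last marker matters for multi-turn prompts, where earlier `ASSISTANT:` turns would otherwise be returned along with the newest answer.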


In [None]:

@app.get("/")
async def root():
    """Root endpoint"""
    return {
        "message": "LLaVA Vision-Language Model API",
        "status": "running",
        "model_loaded": model_loaded,
        "endpoints": {
            "analyze": "/v1/analyze",
            "embed": "/v1/embed",
            "cosine_sim": "/v1/cosine-sim",
            "health": "/health"
        }
    }

@app.get("/health")
async def health_check():
    """Health check endpoint"""
    return {
        "status": "healthy",
        "model_loaded": model_loaded,
        "cuda_available": torch.cuda.is_available(),
        "device": str(model.device) if model_loaded else "N/A"
    }

@app.post("/v1/analyze")
async def analyze_image(
    file: UploadFile = File(...),
    prompt: str = Form(...)
):
    """
    Analyze an uploaded image using LLaVA model

    Args:
        file: Image file to analyze
        prompt: Text prompt/instruction for the model

    Returns:
        JSON response with analysis results
    """
    try:
        # Validate file type
        if not (file.content_type or "").startswith("image/"):
            raise HTTPException(status_code=400, detail="File must be an image")

        # Read and process image
        image_bytes = await file.read()
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

        print(f"Processing image: {file.filename}")
        print(f"Image size: {image.size}")
        print(f"Prompt: {prompt}")

        # Generate response using LLaVA
        response = generate_response(image, prompt)

        print(f"Response: {response}")

        return {
            "success": True,
            "filename": file.filename,
            "image_size": {
                "width": image.size[0],
                "height": image.size[1]
            },
            "prompt": prompt,
            "response": response
        }

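    except HTTPException:
        # Re-raise deliberate HTTP errors (400, 503) instead of masking them as 500
        raise
    except Exception as e:
        print(f"Error processing image: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing image: {str(e)}")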

@app.post("/v1/embed")
async def embed_image(
    file: UploadFile = File(...)
):
    """
    Generate embeddings for an uploaded image using CLIP model

    Args:
        file: Image file to generate embeddings for

    Returns:
        JSON response with embedding vector (512-dimensional)
    """
    try:
        # Validate file type
        if not (file.content_type or "").startswith("image/"):
            raise HTTPException(status_code=400, detail="File must be an image")

        # Read and process image
        image_bytes = await file.read()
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

        print(f"Generating embeddings for: {file.filename}")
        print(f"Image size: {image.size}")

        # Generate embeddings using CLIP
        embedding = generate_embeddings(image)

        # Convert to list for JSON serialization
        embedding_list = embedding.cpu().numpy().tolist()[0]

        print(f"Embedding shape: {embedding.shape}")
        print(f"Embedding generated successfully")

        return {
            "success": True,
            "filename": file.filename,
            "image_size": {
                "width": image.size[0],
                "height": image.size[1]
            },
            "embedding": embedding_list,
            "embedding_shape": list(embedding.shape),
            "embedding_dim": embedding.shape[1]
        }

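    except HTTPException:
        # Re-raise deliberate HTTP errors (400, 503) instead of masking them as 500
        raise
    except Exception as e:
        print(f"Error generating embeddings: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error generating embeddings: {str(e)}")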

@app.post("/v1/cosine-sim")
async def calculate_cosine_similarity(
    file1: UploadFile = File(...),
    file2: UploadFile = File(...)
):
    """
    Calculate cosine similarity between two uploaded images using CLIP embeddings

    Args:
        file1: First image file
        file2: Second image file

    Returns:
        JSON response with cosine similarity score (range: -1 to 1)
        - 1.0 means identical images
        - 0.0 means orthogonal/unrelated
        - -1.0 means opposite (rare in practice)
    """
    try:
        # Validate file types
        if not (file1.content_type or "").startswith("image/"):
            raise HTTPException(status_code=400, detail="File1 must be an image")
        if not (file2.content_type or "").startswith("image/"):
            raise HTTPException(status_code=400, detail="File2 must be an image")

        # Read and process first image
        image1_bytes = await file1.read()
        image1 = Image.open(io.BytesIO(image1_bytes)).convert("RGB")

        # Read and process second image
        image2_bytes = await file2.read()
        image2 = Image.open(io.BytesIO(image2_bytes)).convert("RGB")

        print(f"Calculating similarity between: {file1.filename} and {file2.filename}")
        print(f"Image1 size: {image1.size}")
        print(f"Image2 size: {image2.size}")

        # Generate embeddings for both images
        embedding1 = generate_embeddings(image1)
        embedding2 = generate_embeddings(image2)

        # Calculate cosine similarity
        similarity_score = cosine_similarity(embedding1, embedding2)

        print(f"Cosine similarity: {similarity_score:.4f}")

        return {
            "success": True,
            "file1": {
                "filename": file1.filename,
                "size": {"width": image1.size[0], "height": image1.size[1]}
            },
            "file2": {
                "filename": file2.filename,
                "size": {"width": image2.size[0], "height": image2.size[1]}
            },
            "cosine_similarity": similarity_score,
            "interpretation": {
                "score": similarity_score,
                "description": (
                    "Very similar" if similarity_score > 0.9 else
                    "Similar" if similarity_score > 0.7 else
                    "Somewhat similar" if similarity_score > 0.5 else
                    "Different" if similarity_score > 0.3 else
                    "Very different"
                )
            }
        }

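    except HTTPException:
        # Re-raise deliberate HTTP errors (400, 503) instead of masking them as 500
        raise
    except Exception as e:
        print(f"Error calculating cosine similarity: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error calculating cosine similarity: {str(e)}")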
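
The `/v1/cosine-sim` handler maps the score to a label with a chain of conditional expressions. One possible way to make those thresholds easier to read and test is a small lookup table; the names `SIMILARITY_BANDS` and `describe_similarity` are assumptions for this sketch, while the cut-offs are copied from the handler.

```python
# (exclusive lower bound, label) pairs, checked from highest to lowest
SIMILARITY_BANDS = [
    (0.9, "Very similar"),
    (0.7, "Similar"),
    (0.5, "Somewhat similar"),
    (0.3, "Different"),
]

def describe_similarity(score: float) -> str:
    """Return the label of the first band whose lower bound the score exceeds."""
    for lower_bound, label in SIMILARITY_BANDS:
        if score > lower_bound:
            return label
    return "Very different"

# Score taken from the server log for apple_leaf.png vs tomato-healthy.png
print(describe_similarity(0.7979))
```

The bounds are exclusive, as in the handler, so a score of exactly 0.9 falls into "Similar" rather than "Very similar".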


In [None]:
# Setup ngrok tunnel and run server
if __name__ == "__main__":
    print("\n" + "=" * 70)
    print("STARTING LLAVA FASTAPI SERVER")
    print("=" * 70)

    # Load models before starting server
    load_llava_model()
    load_clip_model()

    # Read the ngrok auth token from the environment; never hardcode it in the notebook
    # Get a free token from: https://dashboard.ngrok.com/get-started/your-authtoken
    import os
    ngrok_token = os.environ.get("NGROK_AUTHTOKEN")
    if not ngrok_token:
        raise RuntimeError("Set the NGROK_AUTHTOKEN environment variable first")
    ngrok.set_auth_token(ngrok_token)

    # Start ngrok tunnel
    print("\nStarting ngrok tunnel...")
    # ngrok.connect returns an NgrokTunnel object; keep only its URL string
    public_url = ngrok.connect(8000).public_url
    print(f"\n{'=' * 70}")
    print(f"🌐 PUBLIC URL: {public_url}")
    print(f"{'=' * 70}")
    print("\n⚠️  IMPORTANT: Copy this URL and use it in your local app.py")
    print(f"   Update COLAB_SERVER_URL = '{public_url}'\n")
    print(f"{'=' * 70}\n")

    # Run FastAPI server using threading to avoid event loop issues
    config = uvicorn.Config(app, host="127.0.0.1", port=8000, log_level="info")
    server = uvicorn.Server(config)

    # Run server in a thread to work in Colab
    thread = threading.Thread(target=server.run)
    thread.start()

    print("\n✓ Server is running!")
    print("✓ Keep this cell running - do NOT stop it")
    print("✓ Use the public URL above in your local application\n")

    # Keep the main thread alive
    thread.join()


STARTING LLAVA FASTAPI SERVER
Loading LLaVA model and processor...
This may take 2-5 minutes on first run...
✓ CUDA available - using GPU
✓ GPU: Tesla T4


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


✓ Model loaded successfully!
✓ Model device: cuda:0
✓ Model dtype: torch.float16
Loading CLIP model and processor...
✓ CLIP model loaded on GPU
✓ CLIP model loaded successfully

Starting ngrok tunnel...

🌐 PUBLIC URL: NgrokTunnel: "https://kenna-explosible-nonmonistically.ngrok-free.dev" -> "http://localhost:8000"

⚠️  IMPORTANT: Copy this URL and use it in your local app.py
   Update COLAB_SERVER_URL = 'NgrokTunnel: "https://kenna-explosible-nonmonistically.ngrok-free.dev" -> "http://localhost:8000"'




INFO:     Started server process [4348]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)



✓ Server is running!
✓ Keep this cell running - do NOT stop it
✓ Use the public URL above in your local application

Calculating similarity between: apple_leaf.png and tomato-healthy.png
Image1 size: (212, 214)
Image2 size: (274, 156)
Cosine similarity: 0.7979
INFO:     192.116.63.38:0 - "POST /v1/cosine-sim HTTP/1.1" 200 OK
Processing image: WhatsApp Image 2025-12-27 at 19.16.53.jpeg
Image size: (1200, 800)
Prompt: Describe this image in one concise sentence.
Response: A plant with green leaves and red spots.
INFO:     192.116.63.38:0 - "POST /v1/analyze HTTP/1.1" 200 OK
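
The log lines above show a client posting to `/v1/cosine-sim` and `/v1/analyze`. A minimal client sketch for the second endpoint follows, assuming the third-party `requests` package is installed on the client machine; the helper names, the example base URL, and the timeout value are hypothetical. The form field names `file` and `prompt` match the server's `analyze_image` signature.

```python
import mimetypes
from pathlib import Path

def endpoint_url(base_url: str, path: str) -> str:
    """Join the public base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")

def analyze_image(base_url: str, image_path: str, prompt: str, timeout: float = 120.0) -> str:
    """Upload one image with a prompt to /v1/analyze and return the model's reply."""
    import requests  # third-party; imported here so endpoint_url works without it

    path = Path(image_path)
    # The server rejects uploads whose content type does not start with "image/"
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    with path.open("rb") as fh:
        resp = requests.post(
            endpoint_url(base_url, "/v1/analyze"),
            files={"file": (path.name, fh, content_type)},
            data={"prompt": prompt},
            timeout=timeout,
        )
    resp.raise_for_status()
    return resp.json()["response"]

# Example call (placeholder URL and file name):
# reply = analyze_image("https://example.ngrok-free.dev", "leaf.jpg",
#                       "Describe this image in one concise sentence.")
```

A generous timeout is a deliberate choice here, since generation with `max_new_tokens=200` on a shared GPU can take several seconds per request.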
