# Deploying pretrained VLM to FastAPI

### Integrating the pretrained VLM (CLIP + OD) into FastAPI

To aid deployment, we define the following functions and class.

**FastAPI Application**
- `test` is a simple GET endpoint that returns a greeting.
- `predict` is the handler name shared by three POST endpoints; each takes a `VLMInput` object and returns predictions for the image, such as the bounding box coordinates of detected objects.

1. **Test Endpoint**
    - This is a simple test endpoint that returns a hello message containing the item ID.
    - Request Method: GET
    - Request Body: None
    - Response: {"Hello": "World_{item_id}"}

2. **od_predict - Object Detection Endpoint**
    - This endpoint takes an image URL or path and runs object detection on it using the SSD300 model.
    - Request Method: POST
    - Request Body: VLMInput object with path_or_url (the labels and threshold fields are accepted but not used by this endpoint)
    - Response: Object detection predictions in the format {"label": [x1, y1, x2, y2]}, with one box per detected label

3. **clip_predict - CLIP Endpoint**
    - This endpoint takes an image URL or path and runs CLIP (Contrastive Language-Image Pre-training) on it with the given labels.
    - Request Method: POST
    - Request Body: VLMInput object with path_or_url and labels (a comma-separated string) fields
    - Response: CLIP predictions in the format {"label": probability}

4. **clip_od_predict - CLIP + OD Endpoint**
    - This endpoint takes an image URL or path and runs CLIP + OD on it to find the bounding box of the object that best matches the label.
    - Request Method: POST
    - Request Body: VLMInput object with path_or_url, labels, and threshold fields
    - Response: The bounding box of the best match in the format [x1, y1, x2, y2]
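
The `labels` field is a single comma-separated string, which `app.py` splits with `labels.split(",")`. A plain split keeps surrounding whitespace and empty entries, so `"cat, dog"` would produce the label `" dog"`. The helper below is a minimal sketch of a more forgiving parser; `parse_labels` is a hypothetical name and is not part of the original `app.py`.

```python
def parse_labels(labels: str) -> list[str]:
    """Split a comma-separated label string, dropping whitespace and empty entries."""
    return [part.strip() for part in labels.split(",") if part.strip()]


print(parse_labels("photo of a cat, photo of a dog"))
# ['photo of a cat', 'photo of a dog']
```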

## Notes
This is a copy of `app.py`, which can be found in `/vlm_app` along with `requirements.txt` and the `Dockerfile`.

```python
from os import remove
from urllib.request import urlretrieve

import numpy as np
import torch
import torchvision
from fastapi import FastAPI
from PIL import Image
from pydantic import BaseModel
from torchvision import transforms
from transformers import (
    AutoImageProcessor,
    AutoModelForObjectDetection,
    CLIPModel,
    CLIPProcessor,
)

app = FastAPI()

device = "cuda" if torch.cuda.is_available() else "cpu"
detr_model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50", device_map=device)
detr_processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", device_map=device)
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", device_map=device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32", device_map=device)

def load_image(path_or_url):
    """Loads an image from a given URL or path. If the input is a URL,
    it downloads the image and saves it as a temporary file. If the input is a path,
    it loads the image from the path. The image is then converted to RGB format and returned.
    """
    if path_or_url.startswith("http"):  # assume URL if starts with http
        urlretrieve(path_or_url, "tmp.png")
        img = Image.open("tmp.png").convert("RGB")
        remove("tmp.png")  # cleanup temporary file
    else:
        img = Image.open(path_or_url).convert("RGB")
    return img


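def object_detection_predict(image):
    """Runs object detection on a given image using the SSD300 model.
    The image is preprocessed, and the model is run on the device (either CPU or GPU).
    Returns a dict mapping each detected label to one bounding box."""
    weights = torchvision.models.detection.SSD300_VGG16_Weights.DEFAULT
    # Pass the weights enum; `weights=True` is a deprecated legacy form.
    # Note: the model is rebuilt on every call; loading it once at module level would speed up requests.
    ssd_model = torchvision.models.detection.ssd300_vgg16(
        weights=weights, box_score_thresh=0.9
    )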

    transform = transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ]
    )
    processed_image = transform(image).unsqueeze(0)

    # Label Encoding
    id_2_label = {idx: x for idx, x in enumerate(weights.meta["categories"])}

    # Run inference
    ssd_model.to(device).eval()  # Set the model to evaluation mode
    with torch.no_grad():
        detections = ssd_model(processed_image.to(device))[0]

    boxes = detections["boxes"].tolist()
    labels = detections["labels"].tolist()
    scores = detections["scores"].tolist()

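    # Keep one box per label; a later detection of the same label overwrites the earlier one.
    # box_score_thresh=0.9 above already filters low scores, so this check is only a safety net.
    detected_dict = {}
    for box, label, score in zip(boxes, labels, scores):
        if score > 0.1:
            detected_dict[id_2_label[label]] = box
    return detected_dict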


def clip_predict(image, labels):
    """Scores an image against comma-separated labels with CLIP and
    returns a {label: probability} dict."""
    inputs = clip_processor(
        text=labels.split(","), images=image, return_tensors="pt", padding=True
    )
    inputs.to(device)
    with torch.no_grad():
        outputs = clip_model(**inputs)
    logits_per_image = (
        outputs.logits_per_image
    )  # this is the image-text similarity score
    probs = logits_per_image.softmax(
        dim=1
    )  # we can take the softmax to get the label probabilities
    return {x: y.item() for x, y in zip(labels.split(","), probs[0])}


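def clip_od_predict(img, labels, threshold):
    """Detects objects with DETR, crops each detected box, and uses CLIP to pick
    the crop that best matches the label. Returns that box as [x1, y1, x2, y2].
    Note: `threshold` is currently unused; DETR detections are filtered at a fixed 0.5."""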
    def detect_objects(image):
        with torch.no_grad():
            inputs = detr_processor(images=image, return_tensors="pt").to(device)
            outputs = detr_model(**inputs)
            target_sizes = torch.tensor([image.size[::-1]])
            results = detr_processor.post_process_object_detection(
                outputs, threshold=0.5, target_sizes=target_sizes
            )[0]
        return results["boxes"]

    def object_images(image, boxes):
        image_arr = np.array(image)
        all_images = []
        for box in boxes:
            # DETR post-processing returns boxes as (x_min, y_min, x_max, y_max)
            x1, y1, x2, y2 = [int(val) for val in box]
            _image = image_arr[y1:y2, x1:x2]
            all_images.append(_image)
        return all_images


    def identify_target(labels, images):
        inputs = clip_processor(
            text=labels.split(","), images=images, return_tensors="pt", padding=True
        ).to(device)
        with torch.no_grad():
            outputs = clip_model(**inputs)
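        logits_per_image = outputs.logits_per_image  # shape: (num_crops, num_labels)
        # Pick the crop that scores highest for the label.
        # .item() assumes a single label; several labels would yield one index per label.
        most_similar_idx = torch.argmax(logits_per_image, dim=0).item()
        return most_similar_idx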

    # detect object bounding boxes
    detected_objects = detect_objects(img)

    # get images of objects
    images = object_images(img, detected_objects)

    # identify target
    idx = identify_target(labels, images)

    # return bounding box of best match
    return [int(val) for val in detected_objects[idx].tolist()]


class VLMInput(BaseModel):
    path_or_url: str
    labels: str = "None"
    threshold: float = 0.01


@app.get("/{item_id}")
def test(item_id: str):
    return {"Hello": f"World_{item_id}"}


@app.post("/od_predict")
async def predict(data: VLMInput):
    img = load_image(data.path_or_url)
    predict_dict = object_detection_predict(img)
    print(predict_dict)
    return predict_dict


@app.post("/clip_predict")
async def predict(data: VLMInput):
    img = load_image(data.path_or_url)
    predict_dict = clip_predict(img, data.labels)
    return predict_dict


@app.post("/clip_od_predict")
async def predict(data: VLMInput):
    img = load_image(data.path_or_url)
    predict_dict = clip_od_predict(img, data.labels, data.threshold)
    return predict_dict


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```
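
`load_image` above saves every download to the same `tmp.png`, so two requests handled at the same time could overwrite or delete each other's file. One possible alternative, sketched below under the assumption that the image fits in memory, reads the bytes directly. `fetch_image_bytes` is a hypothetical helper that is not part of the original `app.py`; its result could be passed to `Image.open(io.BytesIO(data)).convert("RGB")`.

```python
import urllib.request


def fetch_image_bytes(url: str, timeout_s: float = 10.0) -> bytes:
    """Download a resource into memory instead of a shared temporary file."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.read()
```

Keeping the download in memory avoids both the fixed file name and the cleanup step.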

### Create a Dockerfile
Create a `Dockerfile` in the same directory as your FastAPI app (`app.py`). This file will define the Docker image that includes your app and all its dependencies.

```docker
FROM us-docker.pkg.dev/deeplearning-platform-release/gcr.io/pytorch-gpu.2-2.py310

# Set the working directory in the container
WORKDIR /usr/src/app

# Install any needed packages specified in requirements.txt
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Make port 8000 available to the world outside this container
EXPOSE 8000

# Define environment variable
# ENV MODEL_PATH=/usr/src/app/models

# Copy the application code into the container
COPY . /usr/src/app
# List the copied files in the build log (debugging aid)
RUN ls

# Run app.py when the container launches
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Create a Requirements File
Create a `requirements.txt` file that lists the packages your app depends on, such as fastapi, uvicorn, and transformers. Torch isn't listed in this `requirements.txt` because it's already included in the base Docker image (i.e. the image named in the `FROM` line of the `Dockerfile`).

```txt
fastapi
uvicorn[standard]
pydantic
timm
transformers==4.37.0
accelerate
```


### Build the Docker Image
From your project directory (where your `Dockerfile` and `app.py` are located), run the following command to build the Docker image:
```bash
docker build -t vlm_app .
```

### Run the Docker Container
```bash
docker run -p 8000:8000 --gpus all vlm_app

```

Docker runs the container and maps port 8000 of the container to port 8000 on the host, allowing us to access the FastAPI application from a browser, the `requests` library, or Postman. The `--gpus all` flag gives the container access to all the GPUs on the system so that the models run on the GPU using CUDA rather than on the CPU.
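
Because `app.py` loads the model weights when it starts, the container may need some time before it accepts requests. The sketch below polls the server with the standard library until it answers. `wait_until_ready`, its parameters, and the `/ping` path are illustrative assumptions (any single path segment matches the `/{item_id}` test route).

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until the server answers or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s):
                return True
        except urllib.error.HTTPError:
            # Any HTTP response, even an error status, means the server is up.
            return True
        except OSError:
            # Connection refused or timed out: the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_until_ready("http://localhost:8000/ping")` could be called once before sending the test requests.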

### Testing `vlm_app` using `requests`

```python
import requests

# The endpoint URL
url = "http://localhost:8000/od_predict"

# Example url and context
data = {
    "path_or_url": "imgs/dog1.jpg",
}

# Sending a POST request
response = requests.post(url, json=data)

# Print the response from the server
print("Status Code:", response.status_code)
print("Response:", response.json())
```

```txt
Status Code: 200
Response: {'dog': [42.76615524291992, 2520.119873046875, 3812.784423828125, 6082.93212890625]}
```


```python
import requests

# The endpoint URL
url = "http://localhost:8000/clip_predict"

# Example url and context
data = {
    "path_or_url": "imgs/horserider.jpg",
    "labels": "horserider",
}

# Sending a POST request
response = requests.post(url, json=data)

# Print the response from the server
print("Status Code:", response.status_code)
print("Response:", response.json())
```

```txt
Status Code: 200
Response: {'horserider': 1.0}
```
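
A probability of exactly 1.0 is expected here and does not by itself indicate high confidence: `clip_predict` applies a softmax across the supplied labels, and a softmax over a single score always yields 1.0. The pure-Python sketch below illustrates this with made-up similarity scores (the numbers are assumptions, not CLIP outputs). Supplying several comma-separated labels gives a more informative distribution.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax([24.5]))        # a single label always gets probability 1.0
print(softmax([24.5, 21.0]))  # two labels share the probability mass
```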


```python
import requests

# The endpoint URL
url = "http://localhost:8000/clip_od_predict"

# Example url and context
data = {
    "path_or_url": "http://images.cocodataset.org/val2017/000000039769.jpg",
    "labels": "photo of a cat",
}

# Sending a POST request
response = requests.post(url, json=data)

# Print the response from the server
print("Status Code:", response.status_code)
print("Response:", response.json())
```

```txt
Status Code: 200
Response: [345, 23, 640, 368]
```


These results show that each endpoint responds successfully to requests.

# Exercise (20 mins)

**1. Object Detection: Refining Object Detection Results: Finding Bounding Box Centers**

Adjust the function's return value to provide the center of the object's bounding box. You will need to modify `app.py`, rebuild the Docker image, and validate the change with a request.
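
As a starting point, the center of a box in `(x1, y1, x2, y2)` corner format is the midpoint of its opposite corners. The helper below is a sketch only; `box_center` is a hypothetical name, and wiring it into `app.py` is left as part of the exercise.

```python
def box_center(box: list[float]) -> tuple[float, float]:
    """Return the (cx, cy) center of a box given as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2


# Using the box returned by the clip_od_predict request above
print(box_center([345, 23, 640, 368]))  # (492.5, 195.5)
```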

**2. Optimizing the Dockerfile: Setup (Bonus)**

Loading models from the internet on every Docker run can be slow and may not be optimal for production environments. Saving the model files and copying them into the Docker image can significantly improve performance and reliability.

Here's an updated approach to attempt:
1. Download the CLIP and DETR models using Hugging Face's transformers library and save them to a local directory, e.g., models/.
2. Update the Dockerfile to COPY the model files into the Docker image.

**3. Object Detection: Testing with Uncommon Objects (Discussion) (Bonus)**

DETR (object detection) is pretrained on the COCO dataset and CLIP on large-scale web image-text pairs, so both mostly see common objects. How could this pipeline be adapted to detect uncommon objects?
