# API Model Integration
In this notebook, we integrate a fine-tuned vision-language model (VLM) pipeline for object detection into a real-world application using FastAPI. We create an API that receives an image and a caption as input and returns the bounding box of the object that best matches the caption. As before, we package the FastAPI app in a Docker image to deploy it.

```python
# needed for DETR
! pip install timm
```



### Saving the Models and Processors
After training your models, you can save them to a directory with the `save_pretrained()` method provided by the Hugging Face Transformers library; the processors have the same method. Here we use the pre-trained versions of DETR (DEtection TRansformer, discussed in more detail in Unit 6) and CLIP as stand-ins for fine-tuned models.

```python
from transformers import (
    AutoImageProcessor,
    AutoModelForObjectDetection,
    CLIPProcessor,
    CLIPModel,
)


# DETR
detr_checkpoint = "facebook/detr-resnet-50"
detr_model = AutoModelForObjectDetection.from_pretrained(detr_checkpoint)
detr_processor = AutoImageProcessor.from_pretrained(detr_checkpoint)

# save_pretrained() writes a directory; ".pth" is just part of the folder name
detr_model_path = "detr_model.pth"

# CLIP
clip_checkpoint = "openai/clip-vit-base-patch32"
clip_model = CLIPModel.from_pretrained(clip_checkpoint)
clip_processor = CLIPProcessor.from_pretrained(clip_checkpoint)

clip_model_path = "clip_model.pth"

# Assume the rest of your model training setup is here
# ....

# After training, save each model together with its processor in the same directory
detr_model.save_pretrained(detr_model_path)
detr_processor.save_pretrained(detr_model_path)

clip_model.save_pretrained(clip_model_path)
clip_processor.save_pretrained(clip_model_path)
```

```text
Some weights of the model checkpoint at facebook/detr-resnet-50 were not used when initializing DetrForObjectDetection: ['model.backbone.conv_encoder.model.layer1.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing DetrForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DetrForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
```

The unused entries are BatchNorm `num_batches_tracked` counters, which are not needed for inference, so this warning can be ignored.

### Integrating the Saved Model into FastAPI
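Now that the models and processors are saved, your FastAPI application can load them from the saved directories and use them for inference. The app uses DETR to detect candidate objects, crops each detected region out of the image, and uses CLIP to pick the crop that best matches the caption; the endpoint returns that crop's bounding box. The example code below is stored in `app.py` in the `vlm_app` folder.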

```python
import base64
import io
import os

import numpy as np
import torch
from fastapi import FastAPI, HTTPException
from PIL import Image
from pydantic import BaseModel
from transformers import (
    AutoImageProcessor,
    AutoModelForObjectDetection,
    CLIPProcessor,
    CLIPModel,
)

app = FastAPI()

# Fetch the model directory from the environment variable
model_directory = os.getenv("MODEL_PATH", "/usr/src/app/models")
# save_pretrained() wrote directories, so these names refer to folders despite the ".pth" suffix
detr_model_filename = "detr_model.pth"
clip_model_filename = "clip_model.pth"

# Full paths to the saved model directories
detr_model_path = os.path.join(model_directory, detr_model_filename)
clip_model_path = os.path.join(model_directory, clip_model_filename)

# Load the models once at startup (device_map requires the accelerate package).
# Processors have no device; their outputs are moved to the device with .to(device).
device = "cuda" if torch.cuda.is_available() else "cpu"
detr_model = AutoModelForObjectDetection.from_pretrained(
    detr_model_path, device_map=device
)
detr_processor = AutoImageProcessor.from_pretrained(detr_model_path)

clip_model = CLIPModel.from_pretrained(clip_model_path, device_map=device)
clip_processor = CLIPProcessor.from_pretrained(clip_model_path)


class VLMInput(BaseModel):
    image: str  # base64-encoded image bytes
    caption: str  # text describing the target object


def detect_objects(image):
    """Return DETR's bounding boxes with a confidence score above 0.5."""
    with torch.no_grad():
        inputs = detr_processor(images=image, return_tensors="pt").to(device)
        outputs = detr_model(**inputs)
        # PIL gives (width, height); post-processing expects (height, width)
        target_sizes = torch.tensor([image.size[::-1]])
        results = detr_processor.post_process_object_detection(
            outputs, threshold=0.5, target_sizes=target_sizes
        )[0]
    return results["boxes"]


def object_images(image, boxes):
    """Crop each detected box out of the image as a NumPy array."""
    image_arr = np.array(image)
    height, width = image_arr.shape[:2]
    all_images = []
    for box in boxes:
        # DETR's post-processing returns (x_min, y_min, x_max, y_max) in pixels;
        # clamp to the image bounds so out-of-range values don't produce bad slices
        x1, y1, x2, y2 = [int(val) for val in box]
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, width), min(y2, height)
        all_images.append(image_arr[y1:y2, x1:x2])
    return all_images


def identify_target(query, images):
    """Return the index of the crop whose CLIP embedding best matches the query."""
    inputs = clip_processor(
        text=[query], images=images, return_tensors="pt", padding=True
    ).to(device)
    with torch.no_grad():
        outputs = clip_model(**inputs)
    # Shape (num_images, 1): one similarity score per crop for the single caption
    logits_per_image = outputs.logits_per_image
    most_similar_idx = torch.argmax(logits_per_image, dim=0).item()
    return most_similar_idx


@app.post("/predict")
def predict(data: VLMInput):
    # A plain (non-async) def lets FastAPI run the blocking inference in a thread pool
    image_bytes = base64.b64decode(data.image)
    # Convert to RGB so grayscale or RGBA uploads match what the processors expect
    im = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    # Detect object bounding boxes
    detected_objects = detect_objects(im)
    if len(detected_objects) == 0:
        raise HTTPException(status_code=404, detail="No objects detected in the image")

    # Crop the detected objects
    images = object_images(im, detected_objects)

    # Identify the crop that best matches the caption
    idx = identify_target(data.caption, images)

    # Return the bounding box of the best match as [x_min, y_min, x_max, y_max]
    return [int(val) for val in detected_objects[idx].tolist()]


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```
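`identify_target` relies on CLIP's `logits_per_image`, which scores each crop by the cosine similarity between its image embedding and the caption's text embedding, multiplied by a learned scale. The sketch below reproduces only that ranking step with plain Python lists, so the selection logic can be checked without loading a model. The function names, the toy 3-dimensional embeddings, and the fixed `logit_scale=100.0` are illustrative assumptions, not values produced by the CLIP checkpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_crops(text_embedding, crop_embeddings, logit_scale=100.0):
    """Score each crop against one caption and return (best_index, logits).

    The logits play the role of logits_per_image[:, 0] for a single caption.
    """
    if not crop_embeddings:
        raise ValueError("no crops to rank")
    logits = [logit_scale * cosine_similarity(crop, text_embedding) for crop in crop_embeddings]
    # argmax over the crops, like torch.argmax(logits_per_image, dim=0)
    best_index = max(range(len(logits)), key=logits.__getitem__)
    return best_index, logits


# Toy embeddings for illustration only (CLIP ViT-B/32 embeddings have 512 dimensions)
caption_embedding = [1.0, 0.0, 0.0]
crop_embeddings = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]
best, logits = rank_crops(caption_embedding, crop_embeddings)
print(best)  # 1
```

Because the scale is positive, it changes the size of the logits but not which crop wins, which is why `identify_target` can take the `argmax` of the logits directly.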

### Create a Dockerfile
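Create a `Dockerfile` in the same directory as your FastAPI app (`app.py`). This file defines the Docker image that contains your app, its dependencies, and the saved models. Because `COPY .` copies the whole project folder into `/usr/src/app` and `MODEL_PATH` points to `/usr/src/app/models`, place the `detr_model.pth` and `clip_model.pth` directories saved earlier in a `models` folder inside `vlm_app`.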

```docker
FROM us-docker.pkg.dev/deeplearning-platform-release/gcr.io/pytorch-gpu.2-2.py310

# Set the working directory in the container
WORKDIR /usr/src/app

# Install the packages in requirements.txt first so this layer is cached between code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the app code and the saved models (expected in ./models)
COPY . /usr/src/app

# Document that the app listens on port 8000 (publish it with `docker run -p`)
EXPOSE 8000

# Directory that app.py loads the models from
ENV MODEL_PATH=/usr/src/app/models

# Serve the FastAPI app in app.py with uvicorn when the container launches
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```
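A missing or misplaced model folder only shows up as a load error when the container starts, so a quick check before `docker build` can save a rebuild. The hypothetical helper `missing_model_dirs` below is a minimal sketch, assuming each saved directory should contain the `config.json` that `save_pretrained()` writes for a model; it falls back to `MODEL_PATH` with the same default as `app.py`.

```python
import os
from pathlib import Path

# Folder names used by the notebook and app.py; save_pretrained() creates directories
MODEL_DIR_NAMES = ("detr_model.pth", "clip_model.pth")


def missing_model_dirs(model_directory=None, names=MODEL_DIR_NAMES):
    """Return the expected model directories that are absent or lack a config.json."""
    # Same default as app.py when no directory is passed explicitly
    root = Path(model_directory or os.getenv("MODEL_PATH", "/usr/src/app/models"))
    missing = []
    for name in names:
        model_dir = root / name
        if not (model_dir / "config.json").is_file():
            missing.append(str(model_dir))
    return missing


if __name__ == "__main__":
    # Run from inside vlm_app/, before `docker build`
    problems = missing_model_dirs("models")
    print("missing:", problems if problems else "none")
```

An empty list means both folders are in place; checking only `config.json` is a deliberately loose test and does not confirm that the weight files are complete.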

### Create a Requirements File
Create a `requirements.txt` file that lists the packages your app depends on, such as FastAPI, Uvicorn, Transformers, and any other required libraries. PyTorch isn't listed because it's already included in the base Docker image (the image named in the `FROM` line of the `Dockerfile`). `timm` is needed by the DETR checkpoint, and `accelerate` is needed for the `device_map` argument used when loading the models.

```txt
fastapi
uvicorn[standard]
pydantic
timm
transformers==4.37.0
accelerate
```
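Since only `transformers` is pinned and PyTorch comes from the base image, it can help to confirm what actually ended up in the container. The sketch below uses the standard library's `importlib.metadata`; the `EXPECTED` mapping and the `check_requirements` name are hypothetical and simply mirror the `requirements.txt` above plus `torch`. Saved as, for example, `check_requirements.py` in `vlm_app` (a hypothetical file name), it could be run after the build with `docker run --rm vlm_app python check_requirements.py`, which overrides the image's default `CMD`.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical expectations mirroring requirements.txt; None means "any version"
EXPECTED = {
    "fastapi": None,
    "uvicorn": None,
    "pydantic": None,
    "timm": None,
    "transformers": "4.37.0",
    "accelerate": None,
    "torch": None,  # provided by the base image, not by requirements.txt
}


def check_requirements(expected=EXPECTED):
    """Return {package: problem} for packages that are missing or have the wrong version."""
    problems = {}
    for package, pinned in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if pinned is not None and installed != pinned:
            problems[package] = f"installed {installed}, expected {pinned}"
    return problems


if __name__ == "__main__":
    print(check_requirements() or "all requirements satisfied")
```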


### Build the Docker Image
From your project directory (where your `Dockerfile` and `app.py` are located), run the following command to build the Docker image:
```bash
docker build -t vlm_app .
```

### Run the Docker Container
```bash
docker run -p 8000:8000 vlm_app
```

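Docker runs the container and maps port 8000 of the container to port 8000 on your host, so you can access the FastAPI application from a browser (for example, the interactive docs at `http://localhost:8000/docs`), the `requests` library, or Postman. For the app to use a GPU, the host needs an NVIDIA GPU with the NVIDIA Container Toolkit installed, and you need to add `--gpus all` to the `docker run` command; otherwise `app.py` falls back to the CPU.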
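`app.py` loads both models at import time, so the server may take a while before it answers requests, and with Docker's port mapping a plain TCP connection can succeed before the app is actually listening. A minimal sketch, assuming FastAPI's default `/openapi.json` route is enabled, is to poll over HTTP until it returns 200; `wait_until_ready` and its timeout values are hypothetical choices.

```python
import http.client
import time
import urllib.request


def wait_until_ready(url="http://localhost:8000/openapi.json", timeout=300.0, interval=2.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds have passed."""
    # Ignore proxy settings from the environment: the server runs on this machine
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener.open(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused/reset, timeouts, and HTTP errors (URLError is an OSError)
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_ready()` before the test request keeps a slow model load from showing up as a connection error.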

### Testing `vlm_app` using `requests`

```python
import requests
from base64 import b64encode

# The endpoint URL
url = "http://localhost:8000/predict"

# Download a sample COCO image and base64-encode it so it can be passed in JSON
image_bytes = requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", timeout=30).content
image = b64encode(image_bytes).decode("utf-8")

# Request body: the encoded image and the caption describing the target object
data = {
    "image": image,
    "caption": "photo of a cat",
}

# Send the POST request (model inference can take a while on CPU)
response = requests.post(url, json=data, timeout=120)

# Print the response from the server
print("Status Code:", response.status_code)
print("Response:", response.json())
```


```text
Status Code: 200
Response: [345, 23, 640, 368]
```


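The response is a bounding box in `[x_min, y_min, x_max, y_max]` pixel coordinates for the detected object that CLIP ranked as the best match for the caption, which shows that the containerized model is serving requests successfully.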
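For your own images, the request body can be built from a local file instead of a URL, and the returned list can be checked before it is used. `build_payload` and `parse_box` are hypothetical names for a minimal client-side sketch; they assume only the request and response shapes used above (a base64 string plus a caption in, four pixel coordinates out).

```python
import base64
from pathlib import Path


def build_payload(image_path, caption):
    """Read a local image file and build the JSON body expected by /predict."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return {"image": image_b64, "caption": caption}


def parse_box(response_json):
    """Check that /predict returned [x_min, y_min, x_max, y_max] and add width/height."""
    if not (isinstance(response_json, list) and len(response_json) == 4):
        raise ValueError(f"expected a list of 4 coordinates, got {response_json!r}")
    x_min, y_min, x_max, y_max = (int(v) for v in response_json)
    if x_max <= x_min or y_max <= y_min:
        raise ValueError(f"degenerate box: {response_json!r}")
    return {"box": (x_min, y_min, x_max, y_max), "width": x_max - x_min, "height": y_max - y_min}


# The response printed above
print(parse_box([345, 23, 640, 368]))
# {'box': (345, 23, 640, 368), 'width': 295, 'height': 345}
```

A request from a local file would then look like `requests.post(url, json=build_payload("my_photo.jpg", "photo of a cat"), timeout=120)`, where `my_photo.jpg` is a placeholder path. If the server returns an error body instead of a box, `parse_box` raises `ValueError` rather than passing bad coordinates along.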