# Faster Foundation Models with `torch.compile`

## Introduction to `torch.compile()`

This guide benchmarks the inference speed-ups that `torch.compile()` brings to foundation models in 🤗 Transformers, with no reduction in prediction quality.

The most commonly used `torch.compile` modes are the following:

- "default" is the default mode, which is a good balance between performance and overhead.

- "reduce-overhead" reduces Python overhead by using CUDA graphs, which is useful for small batches. It can consume a lot of memory, and for now it only works for CUDA-only graphs that do not mutate inputs.

If you have memory to spare, `reduce-overhead` usually gives the best speed-up. How much speed-up you get depends on the model, so in this tutorial we will check the most used foundation models.
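
As a minimal sketch of how the mode is chosen, the mode string is passed through the `mode` keyword of `torch.compile`. The helper name `compile_for_inference` and the `SUPPORTED_MODES` tuple are hypothetical and only cover the two modes discussed here. `torch` is imported inside the function so the validation logic can be read and tried on its own.

```python
# Hypothetical helper, not part of PyTorch or Transformers.
SUPPORTED_MODES = ("default", "reduce-overhead")  # only the modes discussed above


def compile_for_inference(model, mode="default"):
    """Return `model` wrapped by torch.compile using the requested mode."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"mode must be one of {SUPPORTED_MODES}, got {mode!r}")
    import torch  # imported lazily; requires PyTorch 2.0 or newer

    # Compilation is lazy: the actual work happens on the first forward passes.
    return torch.compile(model, mode=mode)
```

For example, `compile_for_inference(model, mode="reduce-overhead")` would request the CUDA graphs variant, assuming `model` already lives on a CUDA device.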

## OWLv2

OWLv2 is a zero-shot object detection model released by Google Brain. We will load the base version.

Let's first load a test image, then the model and processor for OWLv2.

In [None]:
from PIL import Image
import requests

url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg'
image = Image.open(requests.get(url, stream=True).raw)

In [None]:
from transformers import AutoProcessor, Owlv2ForObjectDetection
import torch
import numpy as np

processor = AutoProcessor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble").to("cuda")

texts = [["a photo of a bee", "a photo of a bird"]]
inputs = processor(text=texts, images=image, return_tensors="pt").to("cuda")

We can now get to benchmarking. We will benchmark the model itself and the compiled model.

In [None]:
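# CUDA events measure elapsed GPU time in milliseconds.
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 30
timings = np.zeros((repetitions, 1))

# Warm-up runs (not timed); no_grad avoids building autograd graphs.
with torch.no_grad():
    for _ in range(10):
        _ = model(**inputs)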

with torch.no_grad():
    for rep in range(repetitions):
        torch.cuda.synchronize()
        starter.record()
        output = model(**inputs)
        ender.record()
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)
        timings[rep] = curr_time

mean_syn = np.mean(timings)  # mean latency in milliseconds
print(mean_syn)
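
The timing loop above (warm-up, synchronize, time, synchronize) can also be factored into a reusable helper. The sketch below is an assumption rather than the notebook's method: it uses the wall clock via `time.perf_counter` instead of CUDA events, and the names `benchmark` and `sync` are hypothetical.

```python
import statistics
import time
from typing import Callable, Optional


def benchmark(fn: Callable[[], object],
              warmup: int = 10,
              repetitions: int = 30,
              sync: Optional[Callable[[], None]] = None) -> dict:
    """Time fn() and return mean and standard deviation in milliseconds.

    `sync` is an optional barrier called before starting and before stopping
    the clock, so that asynchronously queued GPU work is included.
    """
    for _ in range(warmup):  # warm-up runs are not timed
        fn()

    timings_ms = []
    for _ in range(repetitions):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(timings_ms),
        "stdev_ms": statistics.stdev(timings_ms) if len(timings_ms) > 1 else 0.0,
        "runs": len(timings_ms),
    }
```

With the objects defined earlier, a call could look like `benchmark(lambda: model(**inputs), sync=torch.cuda.synchronize)` inside a `torch.no_grad()` block. Reporting the standard deviation alongside the mean makes it easier to tell a real speed-up from run-to-run noise.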


In [None]:
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
timings = np.zeros((repetitions, 1))

# Compile the model (already on the GPU) with CUDA graphs enabled.
compiled_model = torch.compile(model, mode="reduce-overhead")

# Warm-up runs trigger compilation and CUDA graph capture before timing.
with torch.no_grad():
    for _ in range(30):
        _ = compiled_model(**inputs)


with torch.no_grad():
    for rep in range(repetitions):
        torch.cuda.synchronize()
        starter.record()
        output = compiled_model(**inputs)
        ender.record()
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)
        timings[rep] = curr_time

mean_syn = np.mean(timings)  # mean latency in milliseconds
print(mean_syn)

We got a nearly 40 percent speed-up! You can try bigger models and see how much speed-up you get.
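
A percentage speed-up can be computed from the two mean latencies in more than one way, and the text does not say which definition the figure above uses. The sketch below shows two common ones. The function names are hypothetical, and the timings are made-up placeholders, not measurements from this notebook.

```python
def latency_reduction_pct(baseline_ms: float, compiled_ms: float) -> float:
    """Percentage of time saved per forward pass."""
    return (1.0 - compiled_ms / baseline_ms) * 100.0


def throughput_gain_pct(baseline_ms: float, compiled_ms: float) -> float:
    """Percentage increase in forward passes per second."""
    return (baseline_ms / compiled_ms - 1.0) * 100.0


# Hypothetical timings in milliseconds, chosen only for illustration.
baseline_ms, compiled_ms = 100.0, 60.0
print(f"latency reduction: {latency_reduction_pct(baseline_ms, compiled_ms):.1f}%")  # 40.0%
print(f"throughput gain:   {throughput_gain_pct(baseline_ms, compiled_ms):.1f}%")    # 66.7%
```

The same pair of timings yields quite different percentages, so it helps to state which definition a reported speed-up refers to.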