![nebullvm nebuly AI accelerate inference optimize DeepLearning](https://user-images.githubusercontent.com/38586138/201391643-a80407e5-2c28-409c-90c9-327795cd27e8.png)

# Accelerate Stable Diffusion with Speedster


Hi and welcome 👋

In this notebook we will see how, in just a few steps, you can speed up deep learning model inference using Speedster, an app from the open-source library nebullvm.

With Speedster's latest API, you can speed up models by up to 10 times without any loss of accuracy (option A), or by up to 20-30 times by specifying how much accuracy/precision you are willing to trade for even lower response time (option B). To accelerate your model, Speedster takes advantage of various optimization techniques: deep learning compilers (in both option A and option B), plus quantization, half precision, and so on (option B).

Let's jump to the code.

## Installation

Install Speedster:

In [None]:
!pip install speedster

Install deep learning compilers:

In [None]:
!python -m nebullvm.installers.auto_installer --frameworks diffusers --compilers all

## Model and Dataset setup

First of all, we have to choose the version of Stable Diffusion we want to optimize. Speedster officially supports the most widely used versions:
- `CompVis/stable-diffusion-v1-4`
- `runwayml/stable-diffusion-v1-5`
- `stabilityai/stable-diffusion-2-1-base`

Other Stable Diffusion versions from the Diffusers library should work but have never been tested. If you try a version not included among these and it works, please feel free to report it to us on [Discord](https://discord.com/invite/RbeQMu886J) so we can add it to the list of supported versions. If you try a version that does not work, you can open an issue and possibly a PR on [GitHub](https://github.com/nebuly-ai/nebullvm/issues).

For this notebook, we are going to select Stable Diffusion 1.4. Let's download and load it using the diffusers API:

In [None]:
import torch
from diffusers import StableDiffusionPipeline

# Select Stable Diffusion version
model_id = "CompVis/stable-diffusion-v1-4"

device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cuda":
    # On GPU we load by default the model in half precision, because it's faster and lighter.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision='fp16', torch_dtype=torch.float16)
else:
    pipe = StableDiffusionPipeline.from_pretrained(model_id)


We can easily test the loaded model by generating a sample image.

Let's now create an example dataset with some random sentences, which will be used later in the optimization process:

In [None]:
input_data = [
    "a photo of an astronaut riding a horse on mars",
    "a monkey eating a banana in a forest",
    "white car on a road surrounded by palm trees",
    "a fridge full of bottles of beer",
    "madara uchiha throwing asteroids against people"
]

## Speed up inference with Speedster

It's now time to improve the inference speed. Let's use `Speedster`.

In [None]:
from speedster import optimize_model, save_model, load_model

Using Speedster is simple and straightforward! Just call the `optimize_model` function and provide the model, some example input data, and the optimization time mode. Optionally, a `dynamic_info` dictionary can also be provided to support inputs with dynamic shape.
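As a rough sketch of what a `dynamic_info` dictionary might look like, assuming the format described in the Speedster documentation: an `"inputs"` and an `"outputs"` list with one dict per tensor, each mapping a dynamic axis index to a name. The shapes and axis names below are hypothetical and describe an image model rather than Stable Diffusion; this notebook does not pass `dynamic_info` at all.

```python
# Hypothetical model: one input tensor of shape (batch, channels, height, width)
# and one output tensor of shape (batch, num_classes).
# Assumption: each dict maps a dynamic axis index to a descriptive name.
dynamic_info = {
    "inputs": [
        {0: "batch", 2: "height", 3: "width"},  # channels (axis 1) stay fixed
    ],
    "outputs": [
        {0: "batch"},  # only the batch axis of the output varies
    ],
}

# Print the dynamic axes declared for each tensor.
for kind, tensors in dynamic_info.items():
    for idx, axes in enumerate(tensors):
        print(f"{kind}[{idx}] dynamic axes: {axes}")
```

If it matches your Speedster version, a dictionary like this would be passed as `optimize_model(..., dynamic_info=dynamic_info)`.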

**Optimisation of Stable Diffusion requires a lot of RAM. If you are running this notebook on Google Colab, make sure to use the high RAM option, otherwise the kernel may crash. If the kernel still crashes with the high RAM option, try also adding `"torchscript"` to the `ignore_compilers` list.**

In [None]:
optimized_model = optimize_model(
    model=pipe,
    input_data=input_data,
    optimization_time="unconstrained",
    ignore_compilers=["torch_tensor_rt", "tvm"],  # Some compilers have issues with Stable Diffusion, so it's better to skip them.
    metric_drop_ths=0.1,
)

If running on GPU, you should obtain a speedup of about 60% on the UNet. We ran the optimization on a **3090Ti** and here are our results:
- **Original Model (PyTorch, fp16): 51.298 ms/batch**
- **Optimized Model (TensorRT, fp16): 32.164 ms/batch**

If the optimized model you obtained is not a TensorRT one, there was probably an error during the optimization. On Colab, the standard GPU may not have enough memory to run the optimization, so we suggest selecting a premium GPU with more memory.


If everything worked correctly, let's check the output of the optimized model:

In [None]:
test_prompt = "futuristic llama with a cyberpunk city on the background"


In [None]:
optimized_model(test_prompt).images[0]

Let's run the prediction 10 times (2 warm-up iterations followed by 8 timed ones) to calculate the average response time of the original model.

In [None]:
if device == "cuda":
    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision='fp16', torch_dtype=torch.float16)
else:
    pipe = StableDiffusionPipeline.from_pretrained(model_id)

pipe.to(device)

In [None]:
import time

times = []

# Warmup for 2 iterations
for _ in range(2):
    with torch.no_grad():
        final_out = pipe(test_prompt).images[0]

# Benchmark
for _ in range(8):
    st = time.time()
    with torch.no_grad():
        final_out = pipe(test_prompt).images[0]
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)
print(f"Average response time for original Stable Diffusion 1.4: {original_model_time} s")

Let's run the prediction 10 times (2 warm-up iterations followed by 8 timed ones) to calculate the average response time of the optimized model.

In [None]:
times = []

# Warmup for 2 iterations
for _ in range(2):
    with torch.no_grad():
        final_out = optimized_model(test_prompt).images[0]

# Benchmark
for _ in range(8):
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(test_prompt).images[0]
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)
print(f"Average response time for optimized Stable Diffusion 1.4: {optimized_model_time} s")
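The warm-up and timing loop above is repeated for each model, so it can be factored into a small helper. The following is a minimal sketch, not part of Speedster: the name `benchmark` and the stand-in workload are assumptions made for illustration, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic, higher-resolution clock.

```python
import time
from statistics import mean
from typing import Any, Callable


def benchmark(fn: Callable[[], Any], warmup: int = 2, runs: int = 8) -> float:
    """Return the mean wall-clock time in seconds of `fn` over `runs` calls."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Stand-in workload so the sketch runs anywhere. In this notebook you would
# pass something like `lambda: optimized_model(test_prompt).images[0]` instead.
def dummy_workload() -> int:
    return sum(i * i for i in range(10_000))


avg = benchmark(dummy_workload)
print(f"Average response time: {avg:.6f} s")
```

Running the helper once on the original pipeline and once on the optimized one, then dividing the first result by the second, gives the end-to-end speedup factor.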

## Save and reload the optimized model

We can easily save the optimized model to disk with the following line:

In [None]:
save_model(optimized_model, "model_save_path")

We can then load the model again:

In [None]:
optimized_model = load_model("model_save_path", pipe=pipe)

## Advanced: Further increase performance with TensorRT Plugins (GPU Only)

Reference: https://github.com/NVIDIA/TensorRT/tree/main/demo/Diffusion

To achieve the best results on GPU, we have to activate the TensorRT Plugins. `Speedster` supports their usage on Stable Diffusion models from version `0.9.0`.

First of all, install them by following this [guide](https://github.com/nebuly-ai/nebullvm/tree/main/notebooks/speedster/diffusers#setup-tensorrt-plugins-optional), then edit the cell below with the correct paths.

If you are working in the nebullvm docker image, the plugins are installed and activated by default, so the optimization results of this section should be the same as in the previous one.

In [None]:
plugin_path = "/content/TensorRT/build/out/libnvinfer_plugin.so"  # EDIT THIS PATH
tensorrt_lib_path = "/content/TensorRT-8.5.3.1/lib"  # EDIT THIS PATH

import os
if not os.path.exists(plugin_path) or not os.path.exists(tensorrt_lib_path):
    raise Exception("The paths provided above are invalid, please edit them according to your configuration.")

os.environ["LD_PRELOAD"] = plugin_path
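# Append the TensorRT lib directory, avoiding a leading ":" when the variable is unset or empty
existing_ld_path = os.environ.get("LD_LIBRARY_PATH", "")
os.environ["LD_LIBRARY_PATH"] = (
    f"{existing_ld_path}:{tensorrt_lib_path}" if existing_ld_path else tensorrt_lib_path
)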
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

Let's repeat the optimization process. This time we will use only TensorRT from the ONNX pipeline, for faster optimization:

In [None]:
if not torch.cuda.is_available():
    raise Exception("You are running on a CPU-only machine, TensorRT can be used only on GPU.")

**Optimisation of Stable Diffusion requires a lot of RAM. If you are running this notebook on Google Colab, make sure you use the high RAM option, otherwise the kernel may crash.**

In [None]:
optimized_model = optimize_model(
    model=pipe,
    input_data=input_data,
    optimization_time="unconstrained",
    ignore_compilers=["torch_tensor_rt", "onnxruntime", "torchscript", "tvm"],
    metric_drop_ths=0.1,
)

If running on GPU, you should obtain a speedup of about 160% on the UNet. We ran the optimization on a **3090Ti** and here are our results:
- **Original Model (PyTorch, fp16): 51.298 ms/batch**
- **Optimized Model (TensorRT with Plugins, fp16): 19.8 ms/batch**
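The speedup percentages quoted in this notebook can be reproduced from the reported UNet latencies, taking them as 51.298, 32.164 and 19.8 ms/batch. This is a small sketch of the arithmetic; the helper name `speedup_percent` is an assumption made for illustration, and speedup is assumed to mean how much faster the optimized model is relative to the original.

```python
def speedup_percent(original_ms: float, optimized_ms: float) -> float:
    """Return how much faster the optimized model is, as a percentage."""
    return (original_ms / optimized_ms - 1.0) * 100.0


# UNet latencies reported in this notebook for a 3090Ti (ms/batch)
original_ms = 51.298
tensorrt_ms = 32.164          # TensorRT, fp16
tensorrt_plugins_ms = 19.8    # TensorRT with Plugins, fp16

print(f"TensorRT: {speedup_percent(original_ms, tensorrt_ms):.1f}% faster")
print(f"TensorRT with Plugins: {speedup_percent(original_ms, tensorrt_plugins_ms):.1f}% faster")
```

The two results land close to the "about 60%" and "about 160%" figures quoted for the two optimization runs.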

If the optimized model you obtained is again not a TensorRT one, there was probably an error during the optimization. On Colab, the standard GPU may not have enough memory to run the optimization, so we suggest selecting a premium GPU with more memory.


Let's run the prediction 10 times (2 warm-up iterations followed by 8 timed ones) to calculate the average response time of the optimized model.

In [None]:
import time

times = []

# Warmup for 2 iterations
for _ in range(2):
    with torch.no_grad():
        final_out = optimized_model(test_prompt).images[0]

# Benchmark
for _ in range(8):
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(test_prompt).images[0]
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)
print(f"Average response time for optimized Stable Diffusion 1.4: {optimized_model_time} s")

Finally, we can use the optimized model to generate a sample image and see the result:

In [None]:
optimized_model(test_prompt).images[0]

Great! Was it easy? How are the results? Do you have any comments?
Share your optimization results and thoughts with <a href="https://discord.gg/RbeQMu886J" target="_blank"> our community on Discord</a>, where we chat about Speedster and AI acceleration.

Note that the acceleration of Speedster depends very much on the hardware configuration and your AI model. Given the same input model, Speedster can accelerate it by 10 times on some machines and perform poorly on others.

If you want to learn more about how Speedster works, look at other tutorials and performance benchmarks, check out the links below or write to us on Discord.

<center> 
    <a href="https://discord.com/invite/RbeQMu886J" target="_blank" style="text-decoration: none;"> Join the community </a> |
    <a href="https://nebuly.gitbook.io/nebuly/welcome/questions-and-contributions" target="_blank" style="text-decoration: none;"> Contribute to the library </a>
</center>

<center> 
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#key-concepts" target="_blank" style="text-decoration: none;"> How speedster works </a> •
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#documentation" target="_blank" style="text-decoration: none;"> Documentation </a> •
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#quick-start" target="_blank" style="text-decoration: none;"> Quick start </a> 
</center>