# Intro


In my professional life, I wear many hats: researcher, AI engineer, AI application developer, software engineer, and so on. Pick whatever label you like.
The code I write eventually ends up on a production environment, supported by sophisticated DevOps infrastructure. This system 
leverages a suite of tools such as Kubernetes, Rancher, Karpenter, Helm charts, Argo CD, GitHub Actions, 
and of course AWS. I'm fortunate to work alongside an exceptional DevOps team that keeps this complex machinery running smoothly. 
While I'm not deeply involved in the nitty-gritty of DevOps and infrastructure, I'm certainly exposed to it.
I have a solid high-level understanding of the underlying infrastructure that powers my code, including job scheduling, worker management, 
API interactions, and resource allocation for CPU and memory. I firmly believe that understanding the operational 
environment, at least to some level, is crucial for effective development.


On the other hand, I also crave the simplicity of building and tinkering without infrastructure concerns, especially in my free time. 
I shouldn't need a DevOps team for personal projects. Ideally, I'd work directly with Python code using just my IDE and terminal. 
I'd rather avoid writing another YAML file or learning yet another framework. The last thing I want is to worry about spinning up instances, 
managing IAM roles, installing CUDA drivers, or juggling multiple microservices and containers. What I seek is a streamlined development experience 
that lets me focus on creativity and problem-solving, not infrastructure management.



This is where Modal enters the picture. I'm genuinely excited about [Modal](https://modal.com/) and consider it 
the most impressive platform I've encountered for running code without infrastructure concerns. Modal is a serverless 
platform designed for Data/ML/AI teams that seamlessly bridges the gap between local development and cloud execution. 
The primary interface is a Python SDK, where decorators are used to quickly move function execution into the cloud.
You write your code as if it were running locally, and Modal effortlessly deploys and runs it in the cloud. This 
approach offers the best of both worlds: the simplicity of local development with the power and scalability of 
cloud computing. 


Modal didn't simply create a wrapper on top of Kubernetes or Docker.
While I won't even pretend to understand the engineering behind it,
it's clearly their secret sauce. From what I've read and heard,
they've built their own systems from scratch in Rust, including a container runtime, custom file system, custom image builder, and custom job scheduler.
This allows for launching containers in the cloud within seconds. 


I should note that I have only used Modal for personal projects and tinkering around with various ideas. 
However, I anticipate incorporating it more into my professional work, particularly for
research projects and proofs of concept. Looking ahead, I can envision leveraging Modal directly in our production environment as well. 
It seems particularly well-suited for deploying complex AI models that require specific container configurations 
and GPU resources, especially in scenarios with unpredictable or spiky traffic patterns.



If you want to learn more about the history of Modal or keep up with the latest news, I recommend the following resources:

- [Modal Website](https://modal.com/)
- [Modal X Account](https://x.com/modal_labs)
- [Modal Slack Community](https://modal.com/slack) (They are so helpful and responsive on Slack)
- [Charles Frye X Account](https://x.com/charles_irl) (AI Engineer at Modal)
- [Erik Bernhardsson X Account](https://x.com/bernhardsson) (CEO at Modal)
- [1 to 100: Modal Labs](https://www.youtube.com/watch?v=MGVeavVJiWw) (Interview with Erik Bernhardsson)
- [Why you should join Modal](https://whyyoushouldjoin.substack.com/p/modal) (Article)
- [What I have been working on: Modal](https://erikbern.com/2022/12/07/what-ive-been-working-on-modal.html) (Older article with relevant background)

## Why Am I Writing This Post?


I could simply direct you to the Modal Documentation, which is exceptionally comprehensive and well-crafted. 
In fact, it's so good that I doubt I could do it justice in a single post. However, I'm currently investing time 
in learning Modal, and what better way to solidify my understanding than by writing about it? Even if it means 
repeating some of the information in the documentation, it will still be a valuable exercise. Moreover, I'm eager 
to spread the word about this game-changing platform that I believe is still flying under the radar for many developers. 
By sharing my experiences and insights, I hope to contribute to the growing community of Modal enthusiasts.


# Setting Up Modal

- [getting started documentation](https://modal.com/docs/guide)

```
# create an account at modal.com
pip install modal
modal setup
```

🚀✨ That is like zero friction! ✨🚀

# Hello Modal

Okay, let’s write our first function and run it in the cloud.

```python
import modal

app = modal.App("hello-modal")


@app.function()
def f(i):
    print(f"hello modal! {i} + {i} is {i+i}\n")
    return i


@app.local_entrypoint()
def main():
    print("This is running locally")
    print(f.local(1))

    print("This is running remotely on Modal")
    print(f.remote(2))

    print("This is running in parallel and remotely on Modal")
    total = 0
    for ret in f.map(range(2500)):
        total += ret
    print(total)

```

We decorated the function containing the main logic with `@app.function()` and the entry point with `@app.local_entrypoint()`.
That lets us call the function locally with `f.local`, remotely with `f.remote`, or remotely and in parallel with `f.map`.
We run it with the command below, and the video shows the output.


`modal run hello_modal.py`

{{< video https://www.youtube.com/watch?v=-BgAiGW4o5c >}}

Take a moment to let that sink in! We can run the code on a remote server and see the output and print statements locally. 
Imagine trying to do that with a traditional server where you have to log in and manually copy the logs. 
This is a simple function, but the ability to run it remotely on Modal and get the output locally is quite impressive. 
Modal handles spinning up containers and managing everything else seamlessly.
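
For intuition about what the `f.map` loop in `hello_modal.py` computes, here is a purely local analogue built on the standard library. It is only a sketch: `f_local` is a hypothetical stand-in for the decorated `f`, and the thread pool is a substitute for Modal's remote containers, not a description of how Modal works internally.

```python
from concurrent.futures import ThreadPoolExecutor


def f_local(i):
    # Hypothetical stand-in for the Modal function `f`: returns its input unchanged.
    return i


def parallel_total(n, max_workers=8):
    # Fan the inputs out to worker threads and sum the results,
    # mirroring the `for ret in f.map(range(2500))` loop.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(f_local, range(n)))


if __name__ == "__main__":
    print(parallel_total(2500))  # 3123750, i.e. 0 + 1 + ... + 2499
```

The total is the same number the Modal run prints at the end; the difference is that Modal spreads the 2500 calls across containers in the cloud instead of threads on your laptop.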

## Shell into your container

We will see in later examples how to customize the environment of the container.
But even with this simple example we can shell into the default container and poke around.
There are numerous ways to [develop and debug your application with Modal](https://modal.com/docs/guide/developing-debugging#developing-and-debugging).

Here we will use the `modal shell` command to quickly create a container and shell into it.

```
modal shell hello_modal.py::f
```

This video shows how easy it is to shell into the container.

{{< video https://www.youtube.com/watch?v=5yw29jQEn3E >}}

By shelling into the container you get direct access to that isolated environment.
You can inspect the file system, test the installation of additional dependencies, and generally poke around to ensure
your application is configured correctly.
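
Once inside the shell, a quick way to poke around is to start `python` and check what is actually installed. The snippet below is a minimal sketch using only the standard library; `check_packages` is a hypothetical helper name and the package names are just examples.

```python
import importlib.util
import platform
import sys


def check_packages(names):
    # Map each top-level package name to True if it can be imported here.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print(f"Python {sys.version.split()[0]} on {platform.platform()}")
    # Example package names only; swap in whatever your app needs.
    for name, found in check_packages(["json", "torch", "diffusers"]).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

Anything reported as missing is a candidate for the image definition, which is the iterative loop described in the next section.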


# Image Generation with Flux Models from Black Forest Labs

Let's dive into our first "real" and exciting example. If you haven't heard already, the new Flux image generation models from [Black Forest Labs](https://blackforestlabs.ai/) are truly impressive. One of the easiest ways to try them out is through [Replicate](https://replicate.com/black-forest-labs).

What's particularly appealing about the two smaller Flux models is that their weights are openly available for download. Running these models using the transformers and diffusers libraries is relatively straightforward. You can find an example in the model card [here](https://huggingface.co/black-forest-labs/FLUX.1-schnell#diffusers) - it only takes a handful of lines of code!

However, there's a catch: you need access to a GPU environment with CUDA installed. This can be a barrier for developers, including myself, who don't have access to a local GPU. This is where Modal can really shine, providing easy access to GPU resources in the cloud within an isolated environment. You can experiment freely in that environment without worrying about breaking anything on your local machine.

Here is some code to create a simple endpoint hosted through Modal that allows you to generate images.
You can read the Modal documentation for all the details but I will call out some of the important features here.

- We defined a specific container with `Image.debian_slim(python_version="3.11").run_commands(.....`
    - I did not know what to put here initially. I just started with a blank slate and then shelled into the container and tried running snippets of code until I figured out what I needed. It's an iterative process where you build up your container with whatever dependencies you need.
- We use the [Modal class syntax](https://modal.com/docs/guide/lifecycle-functions). 
    - `@enter` - Called when a new container is started. Useful for loading weights into memory for example.
    - `@build` - Code that runs as a part of the container image build process. Useful for downloading weights. 
    - In the case of Hugging Face diffusers models, the weights only have to be downloaded once and future containers will only need to load the weights into memory. This means the initial build process takes longer, but subsequent runs are much faster, especially since Modal has done all the heavy lifting and engineering to make containers load very fast.
- `@modal.web_endpoint(method="POST", docs=True)` is used to create a web server using FastAPI under the hood. See [here](https://modal.com/docs/guide/webhooks) for more information on creating web endpoints.
- Modal has multiple ways to deal with secrets and environment variables. See [here](https://modal.com/docs/guide/secrets) for more information. Here I am making use of `Secret.from_dotenv()` to load the Hugging Face token from a .env file.
- `gpu="A100"` - [GPU acceleration!](https://modal.com/docs/guide/gpu#gpu-acceleration). 


```python
import modal
from modal import Image, build, enter
import os
from dotenv import load_dotenv

load_dotenv()
app = modal.App("black-forest-labs-flux")

image = Image.debian_slim(python_version="3.11").run_commands(
    "apt-get update && apt-get install -y git",
    "pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124",
    "pip install transformers",
    "pip install accelerate",
    "pip install sentencepiece",
    "pip install git+https://github.com/huggingface/diffusers.git",
    "pip install python-dotenv",
    f'huggingface-cli login --token {os.environ["HUGGING_FACE_ACCESS_TOKEN"]}',
)


@app.cls(image=image, secrets=[modal.Secret.from_dotenv()], gpu="A100", cpu=4, timeout=600, container_idle_timeout=300)
class Model:
    @build()
    @enter()
    def setup(self):
        import torch
        from diffusers import FluxPipeline
        from transformers.utils import move_cache

        # black-forest-labs/FLUX.1-schnell
        # black-forest-labs/FLUX.1-dev
        self.model = "black-forest-labs/FLUX.1-schnell"
        self.pipe = FluxPipeline.from_pretrained(self.model, torch_dtype=torch.bfloat16).to("cuda")
        move_cache()

    @modal.web_endpoint(method="POST", docs=True)
    def f(self, data: dict):
        import torch
        import random
        from io import BytesIO
        import base64

        prompts = data["prompts"]
        fnames = data["fnames"]
        num_inference_steps = data.get("num_inference_steps", 4)
        seed = data.get("seed", random.randint(1, 2**63 - 1))
        guidance_scale = data.get("guidance_scale", 3.5)

        results = []
        for prompt, fname in zip(prompts, fnames):
            image = self.pipe(
                prompt,
                output_type="pil",
                num_inference_steps=num_inference_steps,
                generator=torch.Generator("cpu").manual_seed(seed),
                guidance_scale=guidance_scale,
            ).images[0]

            # Convert PIL image to bytes
            buffered = BytesIO()
            image.save(buffered, format="PNG")
            img_str = base64.b64encode(buffered.getvalue()).decode()

            results.append(
                {
                    "filename": f"{fname}_guidance_scale_{guidance_scale}_num_inference_steps_{num_inference_steps}_seed_{seed}_model_{self.model.replace('/', '_')}.png",
                    "image": img_str,
                }
            )

        return results


```
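
One detail worth knowing about the endpoint above: `zip(prompts, fnames)` stops at the shorter list, so a payload with mismatched lengths silently drops items. A small check like the following could be called at the top of the endpoint. This is a minimal sketch and not part of `flux.py`; `validate_payload` is a hypothetical helper name.

```python
def validate_payload(data):
    # Hypothetical helper: returns (prompt, fname) pairs or raises ValueError.
    prompts = data.get("prompts")
    fnames = data.get("fnames")
    if not isinstance(prompts, list) or not isinstance(fnames, list):
        raise ValueError("'prompts' and 'fnames' must both be lists")
    if len(prompts) != len(fnames):
        raise ValueError(f"got {len(prompts)} prompts but {len(fnames)} fnames")
    return list(zip(prompts, fnames))


if __name__ == "__main__":
    pairs = validate_payload({"prompts": ["a cat", "a dog"], "fnames": ["cat", "dog"]})
    print(pairs)  # [('a cat', 'cat'), ('a dog', 'dog')]
```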



You can serve it for testing with this command:

```
modal serve flux.py
```


{{< video https://www.youtube.com/watch?v=692hHe6Irjg >}}

I sped up the video but here are some takeaways:

- The app starts immediately with 0 containers so the cost is zero. Containers are only spun up when the first request comes in.
- I had already built the container image so this time around the model weights were already within the container image. On the first request to the endpoint the container spins up and the model weights are loaded into memory. The container boots up in about 25 seconds.
- Once the weights are loaded into memory, subsequent requests are much faster.
- `container_idle_timeout` is set to 300 seconds. This means the container will be terminated after 300 seconds of inactivity so it scales down to 0 containers. But the endpoint application is still running and can scale back up when the next request comes in.
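
To make the idle-timeout behavior concrete, here is a toy model of how often a cold start would happen for a given pattern of requests. It is a deliberate simplification under stated assumptions (a single container, requests that finish instantly, made-up arrival times) and is not how Modal's scheduler is implemented; `count_cold_starts` is a hypothetical name.

```python
def count_cold_starts(request_times, idle_timeout):
    # request_times: sorted arrival times in seconds.
    # Toy model: one container, instant requests, and the container is torn
    # down once it has been idle for longer than idle_timeout seconds.
    cold_starts = 0
    last_seen = None
    for t in request_times:
        if last_seen is None or t - last_seen > idle_timeout:
            cold_starts += 1
        last_seen = t
    return cold_starts


if __name__ == "__main__":
    times = [0, 60, 200, 600, 650]  # made-up arrival times
    print(count_cold_starts(times, idle_timeout=300))  # 2
    print(count_cold_starts(times, idle_timeout=10))   # 5
```

A longer timeout trades idle GPU time for fewer cold starts, which is the knob `container_idle_timeout` exposes.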


Here is some inference code so we can hit the endpoint and download the images locally. This code will work as long as the endpoint is running.






```python
def main():
    import requests
    import base64
    import os

    os.makedirs("images", exist_ok=True)
    # Your API endpoint URL
    API_URL = "https://drchrislevy--black-forest-labs-flux-model-f-dev.modal.run"  # Replace with your actual Modal app URL

    # Sample data
    data = {
        "prompts": [
            "A serene mountain landscape at sunset",
            "A futuristic cityscape with flying cars",
            "An underwater scene with colorful coral reefs",
            "A steampunk-inspired clockwork dragon",
            "A bioluminescent forest at midnight",
            "An ancient library filled with floating books",
            "A surreal Salvador Dali-inspired melting cityscape",
            "A cyberpunk street market in neon-lit rain",
            "A whimsical tea party on a giant mushroom",
            "An intergalactic spaceport with alien travelers",
        ],
        "fnames": [
            "mountain_sunset",
            "future_city",
            "underwater_coral",
            "steampunk_dragon",
            "bioluminescent_forest",
            "floating_library",
            "melting_cityscape",
            "cyberpunk_market",
            "mushroom_teaparty",
            "alien_spaceport",
        ],
        "num_inference_steps": 4,
        "guidance_scale": 7,
    }

    # Make the API request
    response = requests.post(API_URL, json=data)

    if response.status_code == 200:
        results = response.json()

        for result in results:
            filename = result["filename"]
            img_data = base64.b64decode(result["image"])

            with open(os.path.join("images", filename), "wb") as f:
                f.write(img_data)
            print(f"Saved: {filename}")
        # Only report success when the request actually succeeded.
        print("All images have been downloaded to the 'images/' folder.")
    else:
        print(f"Error: {response.status_code}")
        print(response.text)


if __name__ == "__main__":
    main()
```
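
The endpoint base64-encodes each image and the client decodes it because JSON cannot carry raw bytes. The round trip can be sketched in isolation with the standard library. The helper names are hypothetical and `fake_png` is placeholder bytes, not a real image.

```python
import base64
import json


def encode_image_bytes(raw: bytes) -> str:
    # Server side: raw bytes -> ASCII string that is safe to put in JSON.
    return base64.b64encode(raw).decode()


def decode_image_str(img_str: str) -> bytes:
    # Client side: the JSON string -> the original bytes.
    return base64.b64decode(img_str)


if __name__ == "__main__":
    fake_png = b"\x89PNG\r\n\x1a\n" + bytes(range(256))  # placeholder, not a real image
    body = json.dumps([{"filename": "example.png", "image": encode_image_bytes(fake_png)}])
    result = json.loads(body)[0]
    assert decode_image_str(result["image"]) == fake_png
    print("round trip OK")
```

Base64 inflates the payload by roughly a third, which is fine for a handful of images per request.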

Here is a video of running the inference code, inspecting the Modal application dashboard, and viewing downloaded images on my local machine.
The video is also sped up.

{{< video https://www.youtube.com/watch?v=MI91tqjT9z4 >}}


One great thing about Modal is that it scales automatically: containers are spun up and down dynamically based on the number of incoming requests.
You can tweak parameters to control the scaling behavior; refer to the [documentation](https://modal.com/docs/guide/concurrent-inputs) for the details. Let's illustrate this by running the larger Flux model, `"black-forest-labs/FLUX.1-dev"`.

I'm going to change the request payload to:

```python
# Sample data
data = {
    "prompts": [
        "Futuristic spaceship wreckage overgrown with lush forest vegetation",
        "Pristine tropical island with crystal-clear blue waters and white sandy beaches",
        "Abandoned, overgrown streets of post-apocalyptic Boston from The Last of Us",
    ],
    "fnames": ["sci_fi_forest_ship", "tropical_island_paradise", "last_of_us_boston"],
    "num_inference_steps": 50,
    "guidance_scale": 3.5,
}
```

In the video I kick off 10 requests to the endpoint at the same time, one from each of 10 shells.
You will see the containers spin up and then back down automatically. For this demo
I also changed `container_idle_timeout` to 10 seconds so the containers are terminated quickly.
The video is sped up 5x.


{{< video https://www.youtube.com/watch?v=K9vDW8J440k >}}
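
The video uses 10 separate shells, but a similar burst could be produced from a single script with a thread pool. The sketch below is one way to do it, not what was used in the video. `fire_concurrent` is a hypothetical helper, and `fake_post` stands in for the real HTTP call (for example `lambda p: requests.post(API_URL, json=p)`) so the example runs without a live endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fire_concurrent(post, payload, n=10):
    # `post` is any callable that takes the payload and returns a response.
    # Submitting all n calls before collecting results keeps them in flight together.
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(post, payload) for _ in range(n)]
        return [fut.result() for fut in futures]


if __name__ == "__main__":

    def fake_post(payload):
        # Stand-in for the real request; sleeps briefly to mimic latency.
        time.sleep(0.1)
        return {"status": 200, "n_prompts": len(payload["prompts"])}

    responses = fire_concurrent(fake_post, {"prompts": ["a", "b", "c"]}, n=10)
    print(len(responses))  # 10
```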

# TODO

````
allow_concurrent_inputs=1,
concurrency_limit=2,
````