# Project 5: **Build a Multi-Modal Generation Agent**

Welcome to the final project! In this project, you'll use open-source text-to-image and text-to-video models to generate content. Next, you'll build a **unified multi-modal agent** similar to modern chatbots, where a single agent can support general questions, image generation, and video generation requests.

By the end of this project, you'll understand how to integrate multiple model types under one routing system capable of deciding what modality to use based on the user's intent.



## Learning Objectives

* Use **Text-to-Image** models to generate images from text prompts
* Generate short clips with a **Text-to-Video** model
* Build a **Multi-Modal Agent** that answers questions and routes media requests
* Build a simple **Gradio** UI and interact with the multi-modal agent

## Roadmap
1. Environment setup
2. Text-to-Image
3. Text-to-Video
4. Multimodal Agent
5. Gradio UI
6. Celebrate

## 1 - Environment Setup

In this project, we'll use open-source Text-to-Image and Text-to-Video models to generate visuals from natural-language prompts. These models are computationally heavy and perform best on GPUs, so we recommend running this notebook in Google Colab or another GPU-enabled environment. We'll load all models from Hugging Face, which requires authentication.

Before continuing:
1. Open this project in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/Rheaxu/ai-engineer-projects/blob/main/Multi_Modal_Gen_Agent/multimodal_agent.ipynb)
2. Create a Hugging Face account and generate an access token at huggingface.co/settings/tokens
3. Paste your token in the field below to log in.
4. In the Colab environment, enable GPU acceleration by selecting Runtime → Change runtime type → GPU.

In [None]:
from huggingface_hub import login

# Replace the placeholder string below with your Hugging Face access token
login(token="HUGGING_FACE_API_KEY")

<fieldset>
  <legend><b>Note</b></legend>

# AI Development Libraries: Overview

The combination of `torch`, `diffusers`, and `transformers` represents the standard "Hugging Face" stack for running state-of-the-art Generative AI.

---

## 1. torch (PyTorch)
**The Mathematical Engine.**
PyTorch is an open-source machine learning library primarily developed by Meta's AI Research lab.

* **Core Function:** It provides the data structures (Tensors) and the computational graphs needed to run deep learning models.
* **GPU Acceleration:** It allows code to run on NVIDIA GPUs via CUDA, which is essential for the speed required in AI generation.
* **Role:** It is the "foundation" that `diffusers` and `transformers` are built upon.

## 2. diffusers
**The Diffusion Pipeline.**
Developed by Hugging Face, this library is the go-to tool for diffusion models (like Stable Diffusion).

* **Core Function:** It simplifies the complex math of "denoising" (turning random noise into a coherent image).
* **Features:** It provides pre-built "pipelines" that combine the U-Net, VAE, and Schedulers into a few lines of code.
* **Role:** It manages the actual image-creation logic.



## 3. transformers
**The Language Interpreter.**
Another Hugging Face library, this is the industry standard for Natural Language Processing (NLP).

* **Core Function:** It provides the "Text Encoders" (like CLIP or T5).
* **Role:** In a text-to-image workflow, the computer doesn't "see" your words. The `transformers` library converts your text prompt into a vector (a string of numbers) that the `diffusers` library can then use as a guide.

## 4. gc (Garbage Collector)
**The Memory Manager.**
A built-in Python module used for memory management.

* **Core Function:** It manually triggers the release of memory that is no longer being used by the program.
* **Role:** Because AI models are massive, they often fill up your VRAM/RAM. Developers use `gc.collect()` to free Python objects that are no longer reachable (often together with `torch.cuda.empty_cache()` for GPU memory) to prevent the dreaded `Out of Memory (OOM)` errors.
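
As a small illustration of what `gc.collect()` does, the sketch below builds two objects that reference each other, drops the names pointing at them, and then runs the collector. The `Node` class and the variable names are made up for this example; it only demonstrates the Python side of memory cleanup, not GPU memory.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a large object such as a model."""

    def __init__(self):
        self.partner = None


gc.disable()  # pause automatic collection so the effect is easy to observe

a, b = Node(), Node()
a.partner, b.partner = b, a   # reference cycle: a -> b -> a
probe = weakref.ref(a)        # lets us check whether the object still exists

del a, b                      # the names are gone, but the cycle keeps both alive
alive_before = probe() is not None

gc.collect()                  # the cycle detector finds and frees the pair
alive_after = probe() is not None

gc.enable()
print("alive before collect:", alive_before)
print("alive after collect: ", alive_after)
```

Objects without reference cycles are freed as soon as their last reference disappears, so `gc.collect()` mostly matters for cyclic structures like this one.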

---

### Comparison Summary

| Library | Category | Primary Job |
| :--- | :--- | :--- |
| **torch** | Framework | Low-level math and GPU hardware interface. |
| **diffusers** | Generative AI | Image/Video generation via diffusion. |
| **transformers** | NLP | Understanding and encoding the text prompt. |
| **gc** | Utility | Preventing memory crashes by cleaning up. |

---

</fieldset>

In [None]:
import torch, diffusers, transformers, os, random, gc
print('torch', torch.__version__, '| CUDA:', torch.cuda.is_available())

## 2 - Text-to-Image (T2I)
T2I models translate natural-language descriptions into images. They are typically based on diffusion models, which gradually refine random noise into a coherent picture guided by the text prompt. In this section, you'll load and test one such model to generate images directly from text inputs.

### 2.1: Load a T2I Model
We'll use `Stable Diffusion XL` (SDXL) by `Stability AI`, a widely used open-source diffusion model. It produces high-quality, detailed images, but it is larger than earlier Stable Diffusion versions, so memory-efficient loading matters.

You'll load the model from Hugging Face using the diffusers library, which simplifies running diffusion-based pipelines. To learn more about diffusers, read: https://huggingface.co/docs/diffusers/main/index


<fieldset>
  <legend><b>Note</b></legend>

# Improve model efficiency
Loading the `stabilityai/stable-diffusion-xl-base-1.0` model with default settings can run out of GPU memory. Here are some ways to make it more efficient.

## Method 1: Load the model with FP16 precision
By default, neural network models, including diffusion models, are trained and saved in float32, which means that each weight (parameter) of the network takes 4 bytes.

To make it more efficient, we ask PyTorch (through Hugging Face) to load the parameters as float16 (i.e., 2 bytes each), which halves the memory needed for the weights.
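
To make the saving concrete, here is a rough back-of-the-envelope sketch. The parameter count of 3.5 billion is a round number assumed purely for illustration, not a measured figure for any specific checkpoint, and the helper name `weight_memory_gib` is hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory needed just to store the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


# Assumption: a round 3.5 billion parameters, chosen only for illustration.
num_params = 3_500_000_000

fp32_gib = weight_memory_gib(num_params, "float32")
fp16_gib = weight_memory_gib(num_params, "float16")
print(f"float32: {fp32_gib:.1f} GiB | float16: {fp16_gib:.1f} GiB")
```

This counts the weights only. Activations and intermediate tensors created during generation need additional memory on top of that.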

## Method 2: Attention
Another way is to make attention more efficient. As the sequence length increases (e.g., in high-resolution images or videos), attention usually becomes the bottleneck.
### Understanding the Attention Bottleneck

In Generative AI, **Attention** is the mechanism that allows models to understand relationships between different parts of an input (like words in a prompt or pixels in an image). However, it is also the primary reason high-resolution generation is so computationally expensive.

---

### 1. What is Attention?
Attention allows a model to assign "importance" to different parts of an input.

* **In Text:** In the sentence "The **bank** of the river," the model uses attention to link "bank" to "river" so it knows we aren't talking about money.
* **In Images:** To draw a person's left eye, the model "attends" to the right eye to ensure they are the same color, size, and alignment.



---

### 2. The Quadratic Scaling Problem ($O(n^2)$)
The "bottleneck" exists because of how the math scales. In **Full Self-Attention**, every single token must be compared against every other token.

* If you have **$n$** tokens, you must perform **$n \times n$** operations.
* This is known as **Quadratic Complexity**.

| Sequence Length ($n$) | Total Operations ($n^2$) | Growth Factor |
| :--- | :--- | :--- |
| 1,000 | 1,000,000 | Baseline |
| 2,000 | 4,000,000 | **4x** more work |
| 4,000 | 16,000,000 | **16x** more work |

[Image showing a graph of quadratic growth ($n^2$) vs linear growth ($n$) to illustrate computational scaling]

---

### 3. Why High-Resolution is the Bottleneck
When we move from text to high-resolution images or video, the number of tokens ($n$) explodes.

#### **The Pixel-to-Token Explosion**
In models like Stable Diffusion, the image is broken into "patches" (tokens).
* **$512 \times 512$ Image:** Roughly 1,000 patches (for example, a $32 \times 32$ grid of $16 \times 16$-pixel patches). This fits on most consumer GPUs.
* **$1024 \times 1024$ Image:** This has **4x** as many pixels. Because of $n^2$ scaling, the attention mechanism requires **16x more memory and computation**.
* **Video Generation:** Video adds the dimension of **time**. If a 1-second video has 24 frames, you are essentially trying to run attention on 24 images simultaneously.
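
The arithmetic behind these bullets can be sketched in a few lines. The patch size of 16 pixels is an assumption chosen so that a 512 × 512 image yields about 1,000 tokens; real models may tokenize differently, and the helper names are hypothetical.

```python
def num_patches(height: int, width: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches (tokens) in an image."""
    return (height // patch_size) * (width // patch_size)


def attention_pairs(n_tokens: int) -> int:
    """Entries in the full n x n attention matrix."""
    return n_tokens * n_tokens


base = num_patches(512, 512)
for side in (512, 1024, 2048):
    n = num_patches(side, side)
    ratio = attention_pairs(n) / attention_pairs(base)
    print(f"{side}x{side}: {n:>6} tokens | attention cost x{ratio:.0f}")
```

Doubling the side length quadruples the token count, and squaring that inside attention is what turns 4x more pixels into 16x more work.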

#### **Why Attention Is Still Needed**
Attention is what makes the generation "coherent." Without it, the model might draw a hand in one corner of an image and a body in another, but they wouldn't connect properly because the pixels wouldn't "know" about each other's existence.

---

## 4. Demonstrating the Bottleneck (Python/PyTorch)
This script simulates how the memory usage for the **Attention Matrix** ($QK^T$) grows as the sequence length increases.

```python
import torch
import gc

def measure_attention_memory(seq_len):
    gc.collect()
    torch.cuda.empty_cache()
    
    d_model = 64 # Feature dimension
    
    try:
        # Create Query and Key tensors
        query = torch.randn(1, 1, seq_len, d_model, device="cuda")
        key = torch.randn(1, 1, seq_len, d_model, device="cuda")
        
        torch.cuda.reset_peak_memory_stats()
        
        # This step creates the [seq_len, seq_len] matrix (The Bottleneck)
        attention_matrix = torch.matmul(query, key.transpose(-1, -2))
        
        mem_used = torch.cuda.max_memory_allocated() / (1024 ** 2)
        print(f"Seq Length: {seq_len:>5} | GPU Memory: {mem_used:>8.2f} MB")
        
    except RuntimeError:
        print(f"Seq Length: {seq_len} | FAILED: Out of Memory")

lengths = [1024, 2048, 4096, 8192, 16384]
for seq_len in lengths:
    measure_attention_memory(seq_len)
```
## 5. How we bypass the bottleneck

To generate high-resolution images or long videos without crashing the GPU, researchers use several key architectural strategies:

### A. Latent Diffusion
Instead of performing attention on raw pixels (e.g., $512 \times 512 = 262,144$ tokens), the image is compressed by a **VAE (Variational Autoencoder)** into a "Latent Space." For example, a $512 \times 512$ image becomes a $64 \times 64$ grid of data. This reduces $n$ from **262,144** to just **4,096**, making the $n^2$ calculation significantly cheaper.
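
To get a feel for the numbers, the sketch below estimates the size of a single full attention matrix before and after compression. It assumes one attention head, one matrix, and 2 bytes per value (float16); these are simplifying assumptions, and `attention_matrix_bytes` is a hypothetical helper.

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of one full n x n attention matrix (single head, float16 by default)."""
    return n_tokens * n_tokens * bytes_per_value


pixel_tokens = 512 * 512   # attention on raw pixels
latent_tokens = 64 * 64    # attention on the compressed latent grid

pixel_gib = attention_matrix_bytes(pixel_tokens) / (1024 ** 3)
latent_mib = attention_matrix_bytes(latent_tokens) / (1024 ** 2)

print(f"pixel space : {pixel_tokens:>7} tokens -> {pixel_gib:.0f} GiB")
print(f"latent space: {latent_tokens:>7} tokens -> {latent_mib:.0f} MiB")
```

Reducing $n$ by a factor of 64 shrinks the $n \times n$ matrix by a factor of $64^2 = 4{,}096$.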

### B. FlashAttention
This is a memory-efficient way to calculate attention. Standard attention tries to write the entire $n \times n$ matrix to the GPU's High Bandwidth Memory (HBM). **FlashAttention** breaks the matrix into smaller blocks and calculates them in the GPU's fast cache (SRAM), meaning the giant bottleneck matrix never actually has to exist in its full form in the main VRAM.
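
The core numerical trick that makes block-by-block processing possible is an "online" softmax: a running maximum and running sums are updated per block, so the full row of scores never has to be held at once. The pure-Python sketch below shows only that idea for a single query with scalar values. It is not the FlashAttention kernel itself, and the function names are hypothetical.

```python
import math


def naive_attention_row(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def blockwise_attention_row(scores, values, block_size=4):
    """Same result, but only one block of scores is visited at a time."""
    running_max = float("-inf")
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running weighted sum of values
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated so far to the new reference maximum.
        rescale = math.exp(running_max - new_max)
        denom *= rescale
        acc *= rescale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25, -2.0, 4.0, 0.75, 2.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
full = naive_attention_row(scores, values)
tiled = blockwise_attention_row(scores, values, block_size=3)
print("results match:", abs(full - tiled) < 1e-9)
```

Because the partial results can be rescaled exactly, the block size affects only how much data is in flight at once, not the final answer.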

### C. Tiled VAE & Windowed Attention
* **Tiled VAE:** If an image is too large to decode at once, the model breaks it into overlapping "tiles," processes them individually, and stitches them back together.
* **Windowed/Local Attention:** Instead of every pixel looking at every other pixel in the entire image, pixels only look at their immediate neighbors (a local window). For a fixed window size $w$, the cost drops from **Quadratic** ($O(n^2)$) to roughly $O(n \cdot w)$, which is **Linear** in $n$.
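
To see the difference in scaling, the sketch below counts query-key pairs for full attention versus a local window on a 1D sequence. The 1D layout and the window radius of 64 are assumptions made for simplicity; image models typically use 2D windows, and the function names are hypothetical.

```python
def full_attention_pairs(n: int) -> int:
    """Every position attends to every position."""
    return n * n


def windowed_attention_pairs(n: int, radius: int) -> int:
    """Each position attends to itself and up to `radius` neighbors per side."""
    total = 0
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n - 1, i + radius)
        total += hi - lo + 1
    return total


for n in (1024, 2048, 4096):
    full = full_attention_pairs(n)
    local = windowed_attention_pairs(n, radius=64)
    print(f"n={n:>5} | full={full:>10} | windowed={local:>8}")
```

Doubling `n` roughly doubles the windowed count while the full count quadruples. The trade-off is that distant positions no longer see each other directly.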

### D. Spatio-Temporal Decoupling (For Video)
In video models such as Stable Video Diffusion, the model often separates "Spatial Attention" (looking at pixels within one frame) from "Temporal Attention" (looking at the same pixel across different frames). This prevents the model from having to compare every pixel in Frame 1 to every pixel in Frame 100 simultaneously.
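
A rough count of attention pairs shows why this split helps. The sizes below (16 frames, 4,096 tokens per frame) are assumed for illustration, and the two helpers are hypothetical simplifications that ignore heads, layers, and batch size.

```python
def joint_pairs(frames: int, tokens_per_frame: int) -> int:
    """Full attention over every token in every frame at once."""
    n = frames * tokens_per_frame
    return n * n


def factorized_pairs(frames: int, tokens_per_frame: int) -> int:
    """Spatial attention per frame plus temporal attention per position."""
    spatial = frames * tokens_per_frame ** 2    # each frame attends within itself
    temporal = tokens_per_frame * frames ** 2   # each position attends across frames
    return spatial + temporal


frames, tokens_per_frame = 16, 4096  # assumed sizes for illustration
joint = joint_pairs(frames, tokens_per_frame)
split = factorized_pairs(frames, tokens_per_frame)
print(f"joint: {joint:,} pairs | factorized: {split:,} pairs | saving: x{joint / split:.1f}")
```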

</fieldset>

In [None]:
from diffusers import DiffusionPipeline
# Define the Stable Diffusion XL model ID from Hugging Face and load the pre-trained model
model_id = "stabilityai/stable-diffusion-xl-base-1.0"

# Load the model with FP16 precision for efficiency
pipe_img = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, variant="fp16").to("cuda" if torch.cuda.is_available() else "cpu")
pipe_img.enable_attention_slicing()


### 2.2: Generate an image

In [None]:
# Generate and display an image from a text prompt using the loaded pipeline
prompt = "cinematic photograph of a futuristic neon cityscape at dusk, 35mm lens."
image = pipe_img(prompt).images[0]
image

### 2.3: Experimenting with `num_inference_steps`

The number of inference steps determines how many refinement passes the diffusion model makes. Fewer steps give quicker but less detailed images, while more steps improve clarity and structure at the cost of speed.

Try generating images with different step counts and compare the results.

In [None]:
import matplotlib.pyplot as plt

# Generate an image for different values of num_inference_steps (here 5, 15, and 30) and compare sharpness and detail
images = []

prompt = "cinematic photograph of a futuristic neon cityscape at dusk, 35mm lens."
step_list = [5, 15, 30]

for steps in step_list:
    image = pipe_img(prompt, num_inference_steps=steps, guidance_scale=7.5).images[0]
    images.append((steps, image))

# Plot results side-by-side
plt.figure(figsize=(12, 4))
for i, (steps, img) in enumerate(images, 1):
    plt.subplot(1, len(images), i)
    plt.imshow(img)
    plt.axis("off")
    plt.title(f"{steps} steps")
plt.tight_layout()
plt.show()


### 2.4 (Optional): Visualizing the Diffusion Process
Diffusion models start from random noise and iteratively refine it into an image that matches the prompt. If you are curious, visualize all intermediate steps to see how the noise gradually turns into a coherent picture.

In [None]:
import torch
import matplotlib.pyplot as plt

# Step 1: Run the pipeline with 50 inference steps
# Step 2: Capture intermediate latents or images during generation
# Step 3: Plot them sequentially to show noise evolving into structure
"""
YOUR CODE HERE (~10-12 lines)
"""

### 2.5 (Optional): Experiment with other models.
Different text-to-image models vary in speed, style, and visual quality. Try swapping in other open-source diffusion models and compare how their outputs differ in detail, realism, or artistic tone.

You can browse available models on Hugging Face here: https://huggingface.co/models?library=diffusers

In [None]:
# Step 1: Replace model_id with another text-to-image model from Hugging Face
# Step 2: Reload the pipeline and generate a few test images
# Step 3: Compare image quality, color balance, and prompt fidelity
"""
YOUR CODE HERE
"""

## 3 - Text-to-Video (T2V)
T2V models extend the idea of diffusion from still images to moving sequences. Instead of generating one frame, they create a series of coherent frames that depict motion consistent with the text prompt. These models are computationally heavier and often generate short clips (typically 2-10 seconds).

In this section, you'll load an open-source video diffusion model and prepare it for generation.

### 3.1: Load a T2V model

We'll use the model `damo-vilab/text-to-video-ms-1.7b`, which can produce short video clips from text prompts. This model benefits from a specialized scheduler (DPMSolverMultistepScheduler) that improves stability and speed during sampling.

In [None]:
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

video_model_id = 'damo-vilab/text-to-video-ms-1.7b'

# Load the model with FP16 precision for efficiency
pipe_vid = DiffusionPipeline.from_pretrained(video_model_id, torch_dtype=torch.float16, variant="fp16").to("cuda" if torch.cuda.is_available() else "cpu")


In [None]:
pipe_vid.scheduler = DPMSolverMultistepScheduler.from_config(pipe_vid.scheduler.config)
pipe_vid.enable_model_cpu_offload()

### 3.2: Generate a clip
Create a short video clip from a text prompt using a text-to-video model.

In [None]:
# Step 1: Write a text prompt describing the video you want to generate
# Step 2: Run the text-to-video pipeline with your chosen prompt
prompt = "astronaut walking on Mars at sunrise"
vid_frames = pipe_vid(prompt, num_inference_steps=25, num_frames=16).frames[0]
print(vid_frames.shape)

### 3.3: Frame inspection
Inspect a single frame to sanity-check colors, resolution, and subject positioning before writing a full video.

In [None]:
import numpy as np
from PIL import Image

# Step 1: Select one frame from vid_frames (e.g., index 0)
# Step 2: Convert float [0,1] frame to uint8 [0,255]
# Step 3: Display as a PIL image
Image.fromarray(np.array(vid_frames[0] * 255, dtype=np.uint8))
# Alternatively
# Image.fromarray((vid_frames[0]*255).astype(np.uint8))

### 3.4: Convert frames to MP4
Write the generated frames to an MP4 file so you can preview and share the result.

In [None]:
# Step 1: Use diffusers.utils.export_to_video to write vid_frames to an MP4
# Step 2: Capture and print the saved video path
from diffusers.utils import export_to_video

video_path = export_to_video(vid_frames)
print(video_path)

### 3.5: Video inspection
Play the saved video inside the notebook to check motion and temporal consistency.

In [None]:
# Display the saved MP4 inline
from IPython.display import Video

Video(video_path, embed=True)

### 3.6 (Optional): Experiment with different configs
Increase `num_frames` or decrease `num_inference_steps` to experiment with clip length versus quality.

## 4 - Multimodal Generation Agent
Now that you have text-to-image and text-to-video generation working, you will add an LLM for question answering and build a single agent that routes user requests to the right capability. The agent will read a prompt, infer intent (chat vs. image vs. video), and return the appropriate output.

### 4.1: Load an LLM for generic queries
Use a small LLM as the default chat brain. We will start with `gemma-3-1b-it` and keep the loading logic simple. You can swap to another compact chat model later.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch, textwrap, json, re

# Load google/gemma-3-1b-it using Hugging Face

model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)
gemma_llm = pipeline("text-generation", model=model, tokenizer=tokenizer)

### 4.2: Build a routing mechanism to route requests

In [None]:
def generate_media(prompt: str, mode: str):
    # Produce either an image or a short video clip from a text prompt.
    if mode == "image":
        return pipe_img(prompt).images[0]
    else:
        frames = pipe_vid(prompt, num_inference_steps=25, num_frames=16).frames[0]
        return frames

def llm_generate(prompt, max_new_tokens=64, temperature=0.7):
    # Return a response to the prompt with the loaded gemma
    outputs = gemma_llm(prompt, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature)
    return outputs[0][0]["generated_text"][-1]["content"]
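
The chain of indices in `llm_generate` is easier to follow on a mocked result. The structure below is an assumption that mirrors the indexing used in the function for a batch containing one conversation; the exact nesting can differ across `transformers` versions and input shapes, and the message contents are made up.

```python
# Mocked pipeline output for a batch containing one conversation.
mock_outputs = [                # one entry per conversation in the batch
    [                           # one entry per returned sequence
        {
            "generated_text": [  # the chat history, ending with the new reply
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What's the capital of Iceland?"},
                {"role": "assistant", "content": "Reykjavik."},
            ]
        }
    ]
]

first_conversation = mock_outputs[0]
first_sequence = first_conversation[0]
chat_history = first_sequence["generated_text"]
reply = chat_history[-1]["content"]  # the last message is the assistant's answer
print(reply)
```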

In [None]:
def classify_prompt(prompt: str):
    """Classify the user prompt into QA, image, or video."""

    # Step 1: Define a system prompt explaining how to classify requests (qa, image, video)
    # Step 2: Format the user message and system message as input to the LLM
    # Step 3: Generate a response with llm_generate() and parse it using regex
    # Step 4: Extract fields "type" and "expanded_prompt" from the LLM response
    # Step 5: Return a dict with classification results or default to {"type": "qa"} on failure

    system = textwrap.dedent("""You are a routing assistant for a multimodal generation system.
        Decide whether the USER request is:
          • a factual or conversational question  →  type = "qa"
          • an IMAGE generation request          →  type = "image"
          • a VIDEO generation request           →  type = "video"
        If it is for image or video, produce an improved, vivid, detailed `expanded_prompt`.
        Respond ONLY in this format: {"type": "...", "expanded_prompt": "..."}
    """)

    messages = [
      [
          {
              "role": "system",
              "content": [{"type": "text", "text": system},]
          },
          {
              "role": "user",
              "content": [{"type": "text", "text": prompt},]
          },
      ],
    ]
    response = llm_generate(messages, temperature=0.2)
    match = re.search(r'"type"\s*:\s*"([^"]+)"\s*,\s*"expanded_prompt"\s*:\s*"([^"]+)', response)
    if match:
        try:
            result = {
              "type": match.group(1),
              "expanded_prompt": match.group(2)
            }
            return result
        except Exception:
            pass
    # fallback
    return {"type": "qa"}


<fieldset>
  <legend><b>Explain the Regex in the above cell</b></legend>

## Explain the regex:
```
match = re.search(r'"type"\s*:\s*"([^"]+)"\s*,\s*"expanded_prompt"\s*:\s*"([^"]+)', response)
```

This line uses a **regular expression (regex)** to extract values from the LLM's response string. It looks for a pattern that resembles a JSON object containing the keys `"type"` and `"expanded_prompt"`, in that order.

---

### 1. The Regex Pattern Explained
`r'"type"\s*:\s*"([^"]+)"\s*,\s*"expanded_prompt"\s*:\s*"([^"]+)'`

| Segment | Meaning |
| :--- | :--- |
| `r'...'` | **Raw string:** Tells Python to treat backslashes literally (standard for regex). |
| `"type"` | Matches the literal text `"type"`. |
| `\s*:\s*` | Matches a colon, allowing for any amount of **whitespace** before or after it. |
| `"([^"]+)"` | **Capture Group 1:** Matches the value inside quotes. `[^"]+` means "one or more characters that are NOT a double quote." |
| `\s*,\s*` | Matches the comma between the two JSON keys, allowing for whitespace. |
| `"expanded_prompt"` | Matches the literal text `"expanded_prompt"`. |
| `\s*:\s*` | Matches the colon and surrounding whitespace. |
| `"([^"]+)` | **Capture Group 2:** Matches the value of the expanded prompt until the next quote. |

---

### 2. Why use this instead of `json.loads()`?
In AI engineering, we often use Regex instead of standard JSON parsing for two reasons:

1. **Handling "Chatty" LLMs:** Models often wrap their JSON in conversational text (e.g., *"Here is the data: { ... }"*). `json.loads()` would throw an error, but `re.search()` will ignore the extra text and find the data inside.
2. **Resilience:** If the LLM forgets to close the final curly brace `}`, the Regex can still extract the specific fields it found.
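
One possible middle ground, which the notebook itself does not use, is to try strict JSON parsing first and fall back to the regex only when that fails. The sketch below assumes the reply contains at most one JSON object starting at the first `{`; `parse_routing_reply` is a hypothetical helper name.

```python
import json
import re

PATTERN = re.compile(
    r'"type"\s*:\s*"([^"]+)"\s*,\s*"expanded_prompt"\s*:\s*"([^"]+)'
)


def parse_routing_reply(text: str) -> dict:
    """Try strict JSON first, then fall back to the regex from the cell above."""
    start = text.find("{")
    if start != -1:
        try:
            # raw_decode parses one JSON value and ignores any trailing text.
            obj, _ = json.JSONDecoder().raw_decode(text[start:])
            if isinstance(obj, dict) and "type" in obj:
                return {
                    "type": obj["type"],
                    "expanded_prompt": obj.get("expanded_prompt", ""),
                }
        except json.JSONDecodeError:
            pass  # malformed or truncated JSON: let the regex try
    match = PATTERN.search(text)
    if match:
        return {"type": match.group(1), "expanded_prompt": match.group(2)}
    return {"type": "qa"}


chatty = 'Here is the data: {"type": "image", "expanded_prompt": "A neon city"} Enjoy!'
truncated = '{"type": "video", "expanded_prompt": "A paper plane'
print(parse_routing_reply(chatty))
print(parse_routing_reply(truncated))
print(parse_routing_reply("Reykjavik is the capital of Iceland."))
```

With this ordering, well-formed replies get proper JSON handling (including escaped quotes), while truncated replies still benefit from the regex.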

---

### 3. How to use the results in Python
After running the search, you access the data using the `.group()` method:

```python
import re

response = 'The model output is: {"type": "image_gen", "expanded_prompt": "A futuristic city"}'
match = re.search(r'"type"\s*:\s*"([^"]+)"\s*,\s*"expanded_prompt"\s*:\s*"([^"]+)', response)

if match:
    # group(1) is the first ([^"]+) - the "type"
    gen_type = match.group(1)
    # group(2) is the second ([^"]+) - the "expanded_prompt"
    prompt = match.group(2)
    
    print(f"Detected Type: {gen_type}")
    print(f"Extracted Prompt: {prompt}")
```  
### 4. Limitations & Risks

* **Escaped Quotes:** This regex will break if your prompt contains escaped quotes (e.g., `"expanded_prompt": "A photo of a \"cool\" car"`). It will stop reading at the quote before `cool`.
* **Ordering:** This specific regex assumes `"type"` comes **before** `"expanded_prompt"`. If the LLM swaps the order, the match will fail.
* **Nested Objects:** If the JSON structure is deeper than one level, this simple regex will likely fail to capture the correct data.
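
The first two limitations can be softened with a slightly different pattern. The sketch below searches for each key separately (so order no longer matters) and lets the value contain backslash-escaped characters. It is one possible variant rather than the notebook's method, and `extract_field` is a hypothetical helper.

```python
import re

# A JSON-style string body: runs of characters that are neither a quote nor a
# backslash, or a backslash followed by any single character.
STRING_BODY = r'((?:[^"\\]|\\.)*)'


def extract_field(text: str, key: str):
    """Find "key": "value" anywhere in the text, allowing escaped quotes."""
    pattern = r'"' + re.escape(key) + r'"\s*:\s*"' + STRING_BODY
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Keys appear in swapped order and the prompt contains escaped quotes.
reply = '{"expanded_prompt": "A photo of a \\"cool\\" car", "type": "image"}'
print(extract_field(reply, "type"))
print(extract_field(reply, "expanded_prompt"))
```

The extracted value still contains the backslash escapes, so it may need unescaping before it is passed to a generator. Nested objects remain out of reach for this approach.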

</fieldset>

### 4.3: Build the multimodal agent
This agent takes a single user prompt, passes it to `classify_prompt` to determine what kind of task it is, and then calls the appropriate module:
- QA: use the chat LLM to generate an answer
- Image: use the text-to-image generator
- Video: use the text-to-video generator

Start with a simple version first. You can improve it later by adding better prompts, guardrails, and citation handling.

In [None]:
def multimodal_agent(user_prompt: str):
    # Step 1: Classify the request
    # Step 2: Route the prompt and generate output

    decision = classify_prompt(user_prompt)
    kind = decision.get('type', 'qa')

    # Route to media generation only for known media types; any other label
    # (including unexpected ones from the LLM) falls back to plain QA.
    if kind in ('image', 'video'):
        return generate_media(decision.get('expanded_prompt', user_prompt), mode=kind)

    system = "You are a helpful assistant."
    messages = [
        [
            {
                "role": "system",
                "content": [{"type": "text", "text": system},]
            },
            {
                "role": "user",
                "content": [{"type": "text", "text": user_prompt},]
            },
        ],
    ]
    return llm_generate(messages)
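
The routing logic can be tried out without loading any models. The sketch below swaps in a keyword-based `fake_classifier` and string-returning handlers, all of which are hypothetical stand-ins for `classify_prompt`, the LLM, and the diffusion pipelines.

```python
from typing import Callable, Dict


def fake_classifier(prompt: str) -> dict:
    """Hypothetical keyword router standing in for classify_prompt."""
    lowered = prompt.lower()
    if "video" in lowered:
        return {"type": "video", "expanded_prompt": prompt}
    if "image" in lowered:
        return {"type": "image", "expanded_prompt": prompt}
    return {"type": "qa"}


# Stub handlers that return strings instead of running real models.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "qa": lambda p: f"[text answer to] {p}",
    "image": lambda p: f"[image for] {p}",
    "video": lambda p: f"[video for] {p}",
}


def route(prompt: str, classify=fake_classifier) -> str:
    decision = classify(prompt)
    kind = decision.get("type", "qa")
    handler = HANDLERS.get(kind, HANDLERS["qa"])  # unknown labels fall back to QA
    if kind in ("image", "video"):
        text = decision.get("expanded_prompt", prompt)
    else:
        text = prompt
    return handler(text)


for p in ["What's the capital of Iceland?",
          "Generate an image of a neon dragon",
          "Create a short video of a paper plane"]:
    print(route(p))
```

A dispatch table like `HANDLERS` keeps the routing decision separate from the generation code, which makes it easier to add a new modality later.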

### 4.4: Test the agent
Now let's test your multimodal agent end to end. Each prompt is automatically routed to the correct capability (text Q&A, image generation, or video generation), and the corresponding output is displayed.

In [None]:
from diffusers.utils import export_to_video
from IPython.display import display, Video

# Step 1: Define a few diverse prompts (QA, image, video)
# Step 2: For each prompt, call multimodal_agent and inspect the returned result

for p in [
    "What's the capital of Iceland?",
    "Generate an image of a neon dragon flying over Tokyo at night",
    "Create a short video of a paper plane folding itself"
]:
    result = multimodal_agent(p)
    print('\nPROMPT:', p)
    if isinstance(result, str):
        print(result)
    else:
        if hasattr(result, 'save'):
            display(result)
        else:
            vid = export_to_video(result)
            print(f"video path: {vid}")
            display(Video(vid, embed=True))

Replace the sample queries with your own and verify that the agent chooses the correct generation path.

## 5 - Interactive Web UI

Launch a simple Gradio web interface so you (or your users) can play with the multimodal agent from the browser.


In [None]:
import gradio as gr
with gr.Blocks() as demo:
    gr.Markdown('# Multimodal Agent')
    inp = gr.Textbox(placeholder='Ask or create...')
    btn = gr.Button('Submit')
    out_text = gr.Markdown()
    out_img = gr.Image()
    out_vid = gr.Video()

    def handle(prompt):
        res = multimodal_agent(prompt)
        if isinstance(res, str):
            return res, None, None
        elif hasattr(res, 'save'):
            return '', res, None
        else:
            vid = export_to_video(res)
            return '', None, vid

    btn.click(handle, inp, [out_text, out_img, out_vid])

demo.launch()

After the UI launches, open the link and generate your own images and videos directly from the browser.

## 🎉 Congratulations!

* You have built a **multi-modal agent** capable of understanding various requests, and routing them to the proper model.
* Try experimenting with other T2I and T2V models.
* Try making your system more efficient. For example, load a lightweight LLM for routing and a more capable LLM for QA.


👏 **Great job!** Take a moment to celebrate. The techniques you implemented here power many production agents and chatbots.