# Vision Agents

In this repo we will implement a multimodal agentic system. 


Specifically, we continue with the arXiv agent we implemented in [voice agents/](../voice_agents/) to search for interesting papers on arXiv.

Then, once a paper (a single one for simplicity) is selected, the system produces both a written and a visual summary. For the latter, we use Gemini's NanoBanana model, which is known for producing great "whiteboard summaries" (check example below).

After the image generation node, an image reviewer checks the quality of the generated image. Since the text summary and visual summary nodes run in parallel, we need a 'fan-out' node (`create_report`) and a 'fan-in' node (`reduce`).



<img src="../projects/vision_agents/graph_plot/arxiv_20260107_163830.png" width=400>
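
To make the fan-out/fan-in shape concrete, here is a minimal plain-Python sketch. It is not LangGraph code and not the project's implementation: the branch functions `text_summary` and `visual_summary`, the helpers `fan_out` and `fan_in`, and the state keys are all hypothetical stand-ins. The point it illustrates is that each parallel branch returns only the keys it owns, and the fan-in step merges those partial updates into one state.

```python
from concurrent.futures import ThreadPoolExecutor


def text_summary(state: dict) -> dict:
    # Hypothetical branch: returns only the key it is responsible for.
    return {"text_summary": f"Summary of {state['paper']}"}


def visual_summary(state: dict) -> dict:
    # Hypothetical branch: stands in for the image generation node.
    return {"generated_images": [f"whiteboard for {state['paper']}"]}


def fan_out(state: dict, branches) -> list[dict]:
    # Run every branch on the same input state, in parallel threads.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(branch, state) for branch in branches]
        return [future.result() for future in futures]


def fan_in(state: dict, updates: list[dict]) -> dict:
    # Merge the partial updates; branches write disjoint keys, so order is irrelevant.
    merged = dict(state)
    for update in updates:
        merged.update(update)
    return merged


state = {"paper": "example-paper"}
final_state = fan_in(state, fan_out(state, [text_summary, visual_summary]))
```

If two branches wrote to the same key, the merge step would need an explicit rule for combining the values, which is the role reducers play in a graph framework.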


## Multimodal Inputs

Multimodality refers to the ability to work with data that comes in different forms, such as text, audio, images, and video. LangChain includes standard types for these data that can be used across providers.

Chat models can accept multimodal data as input and generate it as output. 

For LangChain, we need to structure additional input as content blocks, like this: 

```python

# From base64 data
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the content of this image."},
        {
            "type": "image",
            "base64": "AAAAIGZ0eXBtcDQyAAAAAGlzb21tcDQyAAACAGlzb2...",
            "mime_type": "image/jpeg",
        },
    ]
}
```

Content blocks are just a list of typed dictionaries.

Find all examples here: [link](https://docs.langchain.com/oss/python/langchain/messages#multimodal)
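
As a small illustration of the structure above, the sketch below builds the same kind of message from raw bytes using only the standard library. The helper names `image_block` and `user_message` are hypothetical (they are not part of LangChain), and the bytes are a placeholder rather than a real JPEG.

```python
import base64


def image_block(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    # Encode raw bytes as a base64 string and wrap them in an image content block.
    return {
        "type": "image",
        "base64": base64.b64encode(image_bytes).decode("utf-8"),
        "mime_type": mime_type,
    }


def user_message(text: str, *blocks: dict) -> dict:
    # The text block comes first, followed by any extra content blocks.
    return {"role": "user", "content": [{"type": "text", "text": text}, *blocks]}


fake_image = b"placeholder bytes, not a real JPEG"
message = user_message("Describe the content of this image.", image_block(fake_image))
```

In practice the bytes would come from reading a file in binary mode.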

In our implementation we attach images to the input like this (functions defined in [`utils.py`](../projects/vision_agents/src/utils.py)):

```python

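from typing import Literal

from langchain_core.messages import HumanMessage

# MyState is the graph state type, defined elsewhere in the project


def add_imgs(state: MyState, mime_type: Literal["image/jpeg", "image/png"]) -> HumanMessage: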
    """
    Helper to create multimodal message from state

    Args:
        state (MyState): The state of the graph
        mime_type (Literal["image/jpeg", "image/png"]): The mime type of the images
    Returns:
        message (HumanMessage): The multimodal message
    """    
    msg = "Here are the images to review"
    
    content_blocks = [{"type": "text", "text": msg}]   # it is a list of typed dicts, see https://docs.langchain.com/oss/python/langchain/messages#multimodal
    
    # Add images
    for img_b64 in state.get("generated_images", []):
        content_blocks.append({
            "type": "image",
            "base64": img_b64,
            "mime_type": mime_type
        })

    # construct the messages as HumanMessage(content_blocks=...)
    message = HumanMessage(content_blocks=content_blocks)  # v1 format, see https://docs.langchain.com/oss/python/langchain/messages#multimodal
    
    return message   # NOTE: returns msg as is, then you need to wrap it in a list!
```

and PDF files like this:

```python
import base64


def add_pdfs(state: MyState) -> HumanMessage:
    """
    Helper to add the pdf to the input message

    Args:
        state (MyState): The state of the graph

    Returns:
        message (HumanMessage): The message with the pdf
    """    
    msg = "Summarize the content of this document"
    content_blocks = [{"type": "text", "text": msg}]   # it is a list of typed dicts, see https://docs.langchain.com/oss/python/langchain/messages#multimodal
    
    for pdf_path in state.get("downloaded_papers_paths", []):
        with open(pdf_path, "rb") as f:
            pdf_b64 = base64.b64encode(f.read()).decode("utf-8")
        content_blocks.append({
            "type": "file",
            "base64": pdf_b64,  # same key as in the image blocks: "base64", not "base_64"
            "mime_type": "application/pdf"
        })
    message = HumanMessage(content_blocks=content_blocks)  # v1 format, see https://docs.langchain.com/oss/python/langchain/messages#multimodal
    return message   # NOTE: returns msg as is, then you need to wrap it in a list!
```
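
Because content blocks are plain dicts, a misspelled key (for example `base_64` instead of `base64`) does not fail when the block is built. The sketch below is a hypothetical sanity check, not part of LangChain: it only covers the text and base64 forms used in this README, and the names `REQUIRED_KEYS` and `check_blocks` are made up for illustration.

```python
# Keys each block type must carry, for the block forms shown in this README only.
REQUIRED_KEYS = {
    "text": {"text"},
    "image": {"base64", "mime_type"},
    "file": {"base64", "mime_type"},
}


def check_blocks(content_blocks: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for index, block in enumerate(content_blocks):
        block_type = block.get("type")
        required = REQUIRED_KEYS.get(block_type)
        if required is None:
            problems.append(f"block {index}: unexpected type {block_type!r}")
            continue
        missing = required - set(block)
        if missing:
            problems.append(f"block {index}: missing {sorted(missing)}")
    return problems


good = [{"type": "text", "text": "hi"}, {"type": "file", "base64": "UERG", "mime_type": "application/pdf"}]
bad = [{"type": "text", "text": "hi"}, {"type": "file", "base_64": "UERG", "mime_type": "application/pdf"}]
good_report = check_blocks(good)
bad_report = check_blocks(bad)
```

A check like this could run on `content_blocks` just before the `HumanMessage` is constructed.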

## Generative Models Without LangChain

Another interesting point is that LangChain can sometimes limit us in terms of compatibility with LLM providers.

Of course LangChain/LangGraph have their strengths (otherwise this course would not make a lot of sense), and one of the main pros of using LangChain is that we can manage different LLM providers through a unified interface.

But this can leave some 'blind spots' for the most recent models or for some specific applications: for example, using LangChain's ChatOpenAI wrapper to call Gemini's NanoBanana makes it hard to access the images generated by the model.

It's simpler to call the model through the OpenAI SDK, or through Gemini's or OpenRouter's own APIs. You can see [here](https://openrouter.ai/google/gemini-3-pro-image-preview/api) that we have several choices.

In our application we call NanoBanana like this:

```python
import base64
import os
from pathlib import Path

from openai import OpenAI


def nanobanana_generate(state: MyState, nanobanana_prompt: str) -> list[str]:
    """
    Generates an image from a PDF using the Nanobanana model.

    Args:
        state (MyState): The state of the graph
        nanobanana_prompt (str): The prompt for the Nanobanana model
    Returns:
        image_urls (list[str]): The list of image URLs

    Raises:
        RuntimeError: If the image generation fails
    """
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.getenv("OPENROUTER_API_KEY")
    )
    # Find example.jpg using glob - search from the src directory
    utils_dir = Path(__file__).parent
    example_files = list(utils_dir.glob("**/example.jpg"))
    if not example_files:
        raise FileNotFoundError("Could not find example.jpg in the repository")
    example_file_path = example_files[0]
    with open(example_file_path, "rb") as f:
        example_img = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="google/gemini-3-pro-image-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text", 
                        "text": nanobanana_prompt
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:application/pdf;base64,{state.get('pdf_base64')}"
                        }
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{example_img}"
                        }
                    }
                ]
            }
        ],
        extra_body={"modalities": ["image", "text"]}
    )

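    image_urls = []
    message = response.choices[0].message
    # `images` is not a declared field of the OpenAI SDK's message type,
    # so read it defensively instead of assuming the attribute exists
    images = getattr(message, "images", None)
    if images:
        for image in images:
            image_url = image["image_url"]["url"]  # Base64 data URL
            image_urls.append(image_url)
    else:
        raise RuntimeError("Failed to generate image")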

    return image_urls
```

You can see that we are putting several things together in this function:

- we are using multimodal inputs, as we want the model to both:
    - see the example image before generating;
    - read the PDF that we need a summary image of;
- we are getting the PDF in base64 encoding from state (we had already encoded it when downloading the PDF in the tools, and saved the base64 form to state);
- we are using OpenAI's SDK;
- we are passing the prompt as the input text to the model (see next section).
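
Since `nanobanana_generate()` returns base64 data URLs, a caller that wants the raw image (to save it, or to store only the bare base64 payload) has to split off the `data:<mime>;base64,` prefix. Here is a minimal sketch of that step; `decode_data_url` is a hypothetical helper, not a function from the project, and the sample payload is placeholder bytes rather than a real image.

```python
import base64
import re

# Matches "data:<mime>;base64,<payload>" and captures both parts.
DATA_URL = re.compile(r"^data:(?P<mime>[\w/+.-]+);base64,(?P<payload>.+)$", re.DOTALL)


def decode_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into its MIME type and decoded bytes."""
    match = DATA_URL.match(data_url)
    if match is None:
        raise ValueError("Not a base64 data URL")
    return match["mime"], base64.b64decode(match["payload"])


# Build a sample data URL from placeholder bytes and decode it again.
sample_url = "data:image/png;base64," + base64.b64encode(b"fake image bytes").decode("utf-8")
mime_type, image_bytes = decode_data_url(sample_url)
```

The decoded bytes can then be written to disk with `Path(...).write_bytes(image_bytes)`, using the MIME type to choose the file extension.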

## Adding Images (or Audio) to System Prompts

We already saw how important prompts are in agentic workflows. So it's natural that when using a model with multimodal input we'd want to have a **multimodal prompt**, by adding images or audio to our system prompt.

How can we do that? 

Well, if we understand that a system prompt is just a message that is always prepended to the other input messages, we automatically know the answer: we construct our messages so that they incorporate both the textual part of the prompt **and** the image/audio parts, using content blocks, exactly as we do for normal messages.
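
As a generic sketch of this idea, the hypothetical helper below (`multimodal_prompt` is an illustrative name, not a library function) puts the prompt text first and then appends one data-URL block per attachment, using the OpenAI-style `image_url` block format. The base64 payload in the example is a placeholder.

```python
def multimodal_prompt(prompt_text: str, attachments: list[tuple[str, str]]) -> list[dict]:
    """Build a one-message list: prompt text first, then one block per attachment.

    Each attachment is a (mime_type, base64_payload) pair.
    """
    content = [{"type": "text", "text": prompt_text}]
    for mime_type, payload in attachments:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime_type};base64,{payload}"},
        })
    return [{"role": "user", "content": content}]


messages = multimodal_prompt(
    "Follow the visual style of the example image.",
    [("image/jpeg", "AAAA")],  # placeholder payload, not a real image
)
```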

That is exactly what we do in the `nanobanana_generate()` function:

```python
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text", 
                        "text": nanobanana_prompt
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:application/pdf;base64,{state.get('pdf_base64')}"
                        }
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{example_img}"
                        }
                    }
                ]
            }
        ],
```

where the nanobanana prompt is:

nanobanana_prompt = """
You are an expert technical educator. You are provided with a PDF scientific paper. 
Your task is to synthesize the core concepts and key insights of this document into a SINGLE, high-resolution blackboard visual summary.

Visual Style Requirements:

- **Surface**: Professional magnetic whiteboard (bright white, slight glossy reflection).
- **Ink**: Bold, wet-erase markers in Deep Blue, Emerald Green, and Safety Orange.
- **Layout**: 
    - Use 'Notes App Chic' hierarchy. 
    - Center: A central 'Main Concept' box with a brain or robot icon.
    - Left Column: 'Core Principles' in Green. Use code-like syntax (e.g., <input> tags).
    - Right Column: 'Multimodal Context' in Blue with hand-drawn icons for files and eyes.
    - Bottom: An 'Example Template' section using a purple frame.
- **Annotations**: Add 'The Engineering Mindset' box in Orange at the bottom right with a warning icon.
- **Text**: Neat, professional handwriting. No generic fonts.

You will get an example image of the visual style you should follow. 
"""