In [1]:
import dotenv
dotenv.load_dotenv()

True

In [2]:
LLAVA_MODE = "remote" # Either "local" or "remote"
assert LLAVA_MODE in ["local", "remote"]

In [3]:
import json
import os
import random
from typing import Any, Callable, Dict, List, Optional, Tuple, Type, Union

import autogen
import requests
from autogen import AssistantAgent, Agent, UserProxyAgent, ConversableAgent
from termcolor import colored

In [14]:
if LLAVA_MODE == "remote":
    import replicate
    
    llava_config_list = [
        {
            "model": "whatever, will be ignored for remote", # The model name doesn't matter here right now.
            "api_key": "None",  # Not read from here: set the Replicate key in os.environ["REPLICATE_API_TOKEN"] instead.
            "base_url": "yorickvp/llava-13b:2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591",
        }
    ]
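
The config comment above says the Replicate key comes from the `REPLICATE_API_TOKEN` environment variable rather than from the `api_key` field. A minimal sketch of a fail-fast check follows. The helper name `require_env` is hypothetical (it is not part of AutoGen or Replicate), and the `environ` parameter exists only so the check can be tried without touching the real environment.

```python
import os
from typing import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the stripped value of an environment variable, or raise a clear error.

    `require_env` is a hypothetical helper for this notebook, not a library call.
    """
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell"
        )
    return value
```

Calling `require_env("REPLICATE_API_TOKEN")` right after `dotenv.load_dotenv()` would surface a missing token before any remote call is attempted.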

In [15]:
from autogen.agentchat.contrib.llava_agent import llava_call

In [16]:
rst = llava_call(
    "Describe this AutoGen framework <img https://raw.githubusercontent.com/microsoft/autogen/main/website/static/img/autogen_agentchat.png> with bullet points.",
    llm_config={"config_list": llava_config_list, "temperature": 0},
)

print(rst)

* AutoGen framework is a tool for creating and managing conversational agents.
* It allows for the creation of conversational agents using a visual interface, where users can design the agent's dialogue flow and responses.
* The framework supports multiple conversational agents, enabling them to interact with each other and share information.
* The visual interface provides a clear representation of the agent's dialogue flow, making it easier to understand and modify the conversational agent's behavior.
* The framework allows for the integration of different conversational agents, enabling them to work together and provide more comprehensive and personalized interactions with users.
* The visual representation of the agent's dialogue flow and the ability to integrate multiple agents make the AutoGen framework a powerful tool for creating and managing conversational agents.


In [4]:
import os

GOOG_API_KEY = os.getenv("GOOGLE_API_KEY")

CFIG_LIST = [
    {
        "model": "gemini-pro-vision",
        "api_key": GOOG_API_KEY,
        "api_type": "google"
    }
]

os.environ["CFIG_LIST"] = json.dumps(CFIG_LIST)

gemini_vision_config = autogen.config_list_from_json(
    "CFIG_LIST",
    filter_dict={"model": ["gemini-pro-vision"]}
)
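
The `filter_dict` argument above keeps only the entries whose `model` is in the given list. The sketch below illustrates that selection rule in plain Python. It is an illustration under that assumption, not AutoGen's own implementation, and `filter_configs` and the sample entries are hypothetical names and data.

```python
from typing import Any, Dict, List


def filter_configs(
    configs: List[Dict[str, Any]], filter_dict: Dict[str, List[Any]]
) -> List[Dict[str, Any]]:
    """Keep configs whose value for every filtered key is in that key's allowed list."""
    return [
        cfg
        for cfg in configs
        if all(cfg.get(key) in allowed for key, allowed in filter_dict.items())
    ]


# Hypothetical sample data: two Google entries, only one of them a vision model.
example_configs = [
    {"model": "gemini-pro", "api_type": "google"},
    {"model": "gemini-pro-vision", "api_type": "google"},
]
selected = filter_configs(example_configs, {"model": ["gemini-pro-vision"]})
print(selected)
```

With a single-entry list such as `CFIG_LIST`, the filter is effectively a guard: it returns an empty list if the model name is misspelled.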

In [5]:
gemini_vision_config

[{'model': 'gemini-pro-vision',
  'api_key': '<redacted>',
  'api_type': 'google'}]
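
Displaying a config list echoes its `api_key` into the notebook output, which is then saved with the notebook. A minimal sketch of one way to inspect the config without exposing the key follows. `redact_config` and the `"<redacted>"` placeholder are hypothetical choices, not part of AutoGen.

```python
from typing import Any, Dict, Iterable, List


def redact_config(
    configs: List[Dict[str, Any]], secret_keys: Iterable[str] = ("api_key",)
) -> List[Dict[str, Any]]:
    """Return shallow copies of the configs with secret fields replaced by a placeholder."""
    secret_keys = tuple(secret_keys)
    redacted = []
    for cfg in configs:
        clean = dict(cfg)  # copy so the real config list is left untouched
        for key in secret_keys:
            if key in clean:
                clean[key] = "<redacted>"
        redacted.append(clean)
    return redacted


# Hypothetical sample entry standing in for the real config list.
sample = [{"model": "gemini-pro-vision", "api_key": "not-a-real-key", "api_type": "google"}]
print(redact_config(sample))
```

In the cell above, `redact_config(gemini_vision_config)` could be displayed in place of the raw list.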

In [11]:
from autogen.agentchat.contrib.llava_agent import LLaVAAgent
from prompting import INITIAL_PROMPT
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

image_agent = MultimodalConversableAgent(
    "Gemini Vision",
    llm_config={"config_list": gemini_vision_config, "seed": 42},
    max_consecutive_auto_reply=1,
    code_execution_config=False,
)

user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
)

# Ask the question with an image
user_proxy.initiate_chat(
    image_agent,
    message=f"""
{INITIAL_PROMPT}
Where is this image located?
<img https://ik.imagekit.io/sfwall/NY5_J2A4Si6Z6.png?updatedAt=1702987667392>
""",
)

user_proxy (to Gemini Vision):



You will receive a picture of a location in the world. Note the important features in it.
Write OpenStreetMap queries to find these locations.
In your queries, make use of OSM's proximity features, and prefer regular expressions over exact string matches. Always use them
when you're not exactly sure of the specific wording of signs etc.
Finally, output a single query to find the image's likely location, including the headers and bounding boxes, etc.
If you are able to guess the general region of the image, please include that in the OSM query.

A good example of a query would be:

```
area["name"~".*Washington.*"];
way["name"~"Monroe.*St.*NW"](area) -> .mainway;

(
  nwr(around.mainway:500)["name"~"Korean.*Steak.*House"];

  // Find nearby businesses with CA branding
  nwr(around.mainway:500)["name"~"^CA.*"];
  
  // Look for a sign with the words "Do not block"
  node(around.mainway:500)["traffic_sign"~"Do not block"];
);

out center;
```


Model gemini-pro-vision not found. Using cl100k_base encoding.


Gemini Vision (to user_proxy):

 This image is located in New York City. The street signs say "East 96th Street" and "2nd Avenue". There are also signs for "Hammarskjold Plaza" and "John Finley Walk".

```
area["name"~"New York City"];
way["name"~"2nd Ave.*"](area) -> .mainway;
way["name"~"E 96th St.*"](area) -> .crossway;
(
  nwr(around.crossway:500)["name"~"Hammarskjold.*Plaza"];
  nwr(around.crossway:500)["name"~"John Finley.*Walk"];
);
out center;
```

--------------------------------------------------------------------------------
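
The reply above mixes free text with a fenced Overpass query, so the query has to be pulled out before it can be sent anywhere. A minimal sketch of that extraction follows, assuming the model wraps the query in triple-backtick fences as it did here. `extract_fenced_blocks` is a hypothetical helper, and the reply string is a shortened, hard-coded stand-in for the agent's message.

```python
import re
from typing import List

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example
FENCE_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_fenced_blocks(text: str) -> List[str]:
    """Return the contents of every fenced block in `text`, whitespace-stripped."""
    return [block.strip() for block in FENCE_RE.findall(text)]


# Shortened stand-in for the agent's reply.
reply = "\n".join(
    [
        "This image is located in New York City.",
        "",
        FENCE,
        'area["name"~"New York City"];',
        "out center;",
        FENCE,
    ]
)
queries = extract_fenced_blocks(reply)
print(queries[0])
```

If the model returns several fenced blocks, taking the last one matches the prompt's request to finish with a single final query. That is a design choice, not something the reply format guarantees.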
