# Chat with Multimodal Models: LLaVA

This notebook uses **LLaVA** as an example of the multimodal feature. More information about LLaVA is available on its [GitHub page](https://github.com/haotian-liu/LLaVA).


### Before you start, install AutoGen with the `lmm` option
```bash
pip install "pyautogen[lmm]>=0.2.3"
```

In [1]:
import glob
import inspect
import json
import os
import random
import time
from typing import Any, Callable, Dict, List, Optional, Tuple, Type, Union

import cv2
import matplotlib.pyplot as plt
import numpy as np
import requests
import yolov7_package
from natsort import natsorted
from PIL import Image
from termcolor import colored
from yolov7_package import Yolov7Detector

import autogen
from autogen import Agent, AssistantAgent, ConversableAgent, UserProxyAgent
from autogen.agentchat.contrib.llava_agent import LLaVAAgent, llava_call

# LLAVA_MODE controls where LLaVA is hosted: locally or remotely.
# More details are in the two setup options below.
LLAVA_MODE = "local"  # Either "local" or "remote"
assert LLAVA_MODE in ["local", "remote"]

<a id="local"></a>
## [Option 2] Set Up LLaVA Locally


### Install the LLaVA library

Please follow the LLaVA GitHub [page](https://github.com/haotian-liu/LLaVA/) to install LLaVA.


#### Download the package
```bash
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
```

#### Install the inference package
```bash
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```



Some helpful packages and dependencies:
```bash
conda install -c nvidia cuda-toolkit
```


### Launch

In one terminal, start the controller first:
```bash
python -m llava.serve.controller --host 0.0.0.0 --port 10000
```


Then, in another terminal, start the worker, which loads the model onto the GPU:
```bash
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
```

In [2]:
# Run this code block only if you want to run LLaVA locally
if LLAVA_MODE == "local":
    llava_config_list = [
        {
            "model": "llava-v1.5-13b",
            "api_key": "None",
            "base_url": "http://0.0.0.0:10000",
        }
    ]

In the user proxy agent, `human_input_mode` controls whether a human is asked for input during the conversation. Setting it to `"ALWAYS"` lets you interact with LLaVA in a multi-round dialogue and give feedback as the conversation unfolds. Here we use `human_input_mode="NEVER"` for conciseness.

In [20]:
kb_data = """{'00011': {'building_0': {'b_box': [28, 177, 70, 231], 'color': (0, 0, 0), '3d_cord': [-20.999107408509367, -65.09723296637905, 117.63]}, 'building_1': {'b_box': [0, 0, 110, 105], 'color': (0, 0, 0), '3d_cord': [-22.51538232668848, -10.49146087473966, 19.81]}, 'building_2': {'b_box': [0, 173, 28, 245], 'color': (0, 0, 0), '3d_cord': [-25.423623921451945, -92.72145194882475, 125.66]}, 'building_3': {'b_box': [0, 93, 84, 159], 'color': (0, 0, 0), '3d_cord': [-24.04879500148765, -22.51841713775662, 36.74]}, 'building_5': {'b_box': [9, 353, 85, 472], 'color': (0, 0, 0), '3d_cord': [39.47277595953585, -18.807616780720025, 39.02]}, 'building_6': {'b_box': [21, 228, 70, 256], 'color': (0, 0, 0), '3d_cord': [-1.2783100267777445, -60.71972627194286, 107.41]}, 'building_11': {'b_box': [0, 153, 31, 177], 'color': (0, 0, 0), '3d_cord': [-104.69503124070216, -168.06307646533767, 231.5]}, 'vegetation_0': {'b_box': [0, 231, 111, 480], 'color': (0, 0, 0), '3d_cord': [12.859982148170186, -13.72555786968164, 20.78]}, 'vehicles_0': {'b_box': [125, 201, 223, 297], 'color': (0, 0, 0), '3d_cord': [0.12079738173162746, 0.9905385301993452, 4.06]}, 'vehicles_1': {'b_box': [138, 0, 248, 97], 'color': (0, 0, 0), '3d_cord': [-3.966319547753645, 1.0730139839333532, 3.22]}}, '00015': {'building_0': {'b_box': [24, 184, 71, 247], 'color': (0, 0, 0), '3d_cord': [-15.62249330556382, -63.79184766438559, 109.39]}, 'building_1': {'b_box': [0, 0, 64, 62], 'color': (0, 0, 0), '3d_cord': [-20.2719428741446, -10.184468908063076, 16.3]}, 'building_2': {'b_box': [0, 186, 24, 270], 'color': (0, 0, 0), '3d_cord': [-11.904552216602202, -87.53347218089854, 117.68]}, 'building_3': {'b_box': [0, 60, 99, 159], 'color': (0, 0, 0), '3d_cord': [-21.672835465635224, -15.041594763463253, 27.18]}, 'building_5': {'b_box': [0, 407, 92, 480], 'color': (0, 0, 0), '3d_cord': [41.00029753049687, -17.376316572448676, 32.81]}, 'building_6': {'b_box': [7, 243, 70, 275], 'color': (0, 0, 0), '3d_cord': 
[9.304849747099077, -58.15531091936923, 97.73]}, 'building_11': {'b_box': [0, 167, 24, 189], 'color': (0, 0, 0), '3d_cord': [-80.13329366260041, -170.9510264802142, 224.44]}, 'vegetation_0': {'b_box': [0, 264, 93, 366], 'color': (0, 0, 0), '3d_cord': [9.956322523058613, -14.866289794703958, 22.92]}, 'vehicles_0': {'b_box': [125, 192, 221, 287], 'color': (0, 0, 0), '3d_cord': [-0.02398095804819994, 0.9832192799761976, 4.03]}}, '00020': {'building_0': {'b_box': [22, 138, 71, 229], 'color': (0, 0, 0), '3d_cord': [-38.07807200238024, -58.345432906872944, 103.21]}, 'building_2': {'b_box': [0, 166, 21, 255], 'color': (0, 0, 0), '3d_cord': [-25.30092234454031, -83.8925319845284, 111.89]}, 'building_3': {'b_box': [0, 0, 111, 112], 'color': (0, 0, 0), '3d_cord': [-22.845700684320143, -9.807497768521273, 19.39]}, 'building_5': {'b_box': [0, 363, 86, 440], 'color': (0, 0, 0), '3d_cord': [38.53555489437667, -21.62963403748884, 41.78]}, 'building_6': {'b_box': [3, 225, 70, 258], 'color': (0, 0, 0), '3d_cord': [0.0, -55.98012496280869, 92.23]}, 'building_11': {'b_box': [0, 144, 24, 167], 'color': (0, 0, 0), '3d_cord': [-111.00934245760189, -161.3507884558167, 216.92]}, 'pedestrian_3': {'b_box': [84, 5, 117, 17], 'color': (0, 0, 0), '3d_cord': [-18.107110978875333, -2.8341565010413565, 13.23]}, 'vegetation_0': {'b_box': [0, 249, 101, 435], 'color': (0, 0, 0), '3d_cord': [11.286283844094019, -15.415412079738172, 23.13]}, 'vehicles_0': {'b_box': [125, 193, 222, 287], 'color': (0, 0, 0), '3d_cord': [-0.02398095804819994, 0.9832192799761976, 4.03]}, 'vehicles_6': {'b_box': [80, 33, 91, 60], 'color': (0, 0, 0), '3d_cord': [-33.310086283844086, -8.837369830407615, 28.56]}}}"""

kb_data_subset = """{'00011': {'vehicles_0': {'b_box': [125, 201, 223, 297], 'color': (0, 0, 0), '3d_cord': [0.12079738173162746, 0.9905385301993452, 4.06]}, 'vehicles_1': {'b_box': [138, 0, 248, 97], 'color': (0, 0, 0), '3d_cord': [-3.966319547753645, 1.0730139839333532, 3.22]}}, '00015': {'vehicles_0': {'b_box': [125, 192, 221, 287], 'color': (0, 0, 0), '3d_cord': [-0.02398095804819994, 0.9832192799761976, 4.03]}}, '00020': {'pedestrian_3': {'b_box': [84, 5, 117, 17], 'color': (0, 0, 0), '3d_cord': [-18.107110978875333, -2.8341565010413565, 13.23]}, 'vehicles_0': {'b_box': [125, 193, 222, 287], 'color': (0, 0, 0), '3d_cord': [-0.02398095804819994, 0.9832192799761976, 4.03]}, 'vehicles_6': {'b_box': [80, 33, 91, 60], 'color': (0, 0, 0), '3d_cord': [-33.310086283844086, -8.837369830407615, 28.56]}}}"""
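The two strings above are Python literals rather than JSON (single-quoted keys, tuples for `color`), so `json.loads` would reject them. The following is a minimal sketch of how such a string could be parsed and narrowed to a subset, assuming the subset is meant to keep only the `vehicles_*` and `pedestrian_*` entries. `filter_objects` is a hypothetical helper, not part of AutoGen, and the sample string is a shortened, rounded stand-in for `kb_data`.

```python
import ast


def filter_objects(kb_text, prefixes=("vehicles", "pedestrian")):
    """Parse a knowledge-base string and keep objects whose name starts with one of `prefixes`."""
    # ast.literal_eval safely evaluates Python literals (dicts, lists, tuples, numbers, strings).
    kb = ast.literal_eval(kb_text)
    return {
        frame: {name: info for name, info in objects.items() if name.startswith(prefixes)}
        for frame, objects in kb.items()
    }


# Shortened sample in the same layout as kb_data (values rounded for illustration).
sample = """{'00011': {'building_0': {'b_box': [28, 177, 70, 231], 'color': (0, 0, 0), '3d_cord': [-20.99, -65.09, 117.63]}, 'vehicles_0': {'b_box': [125, 201, 223, 297], 'color': (0, 0, 0), '3d_cord': [0.12, 0.99, 4.06]}}}"""

filtered = filter_objects(sample)
print(filtered)
```

Applied to `kb_data`, this should keep the same objects that appear in `kb_data_subset`; the result can be turned back into prompt text with `str(filtered)`.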

In [21]:
video_agent = LLaVAAgent(
    name="video-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": llava_config_list, "temperature": 0.5, "max_new_tokens": 100000},
)

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    # Set use_docker=True if Docker is available to run the generated code.
    # Using Docker is safer than running the generated code directly.
    code_execution_config={
        "last_n_messages": 3,
        "work_dir": "groupchat",
        "use_docker": False,
    },
    human_input_mode="NEVER",  # Try either "ALWAYS" or "NEVER"
    max_consecutive_auto_reply=0,
)

# Ask the question about the spatial data


In [None]:
user_proxy.initiate_chat(
    video_agent,
    message=f"""
        Here is a dictionary containing the spatial information for 3 sampled frames from 10 consecutive frames 
        (frames 11-20) 
        of a video. Each frame is numbered in the format "00011", "00015", and so on. The 
        dictionary provides information about the objects present in each frame. For every object, 
        the dictionary includes the bounding box (identified by the "b_box" key), which indicates 
        the object's location within the 2D video frame. In addition to the bounding box, the 3D 
        coordinates of the object, estimated from the depth map, are provided for each object. 
        These coordinates are in the format [x, y, z], where "z" represents the distance of the 
        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, 
        coincides with the camera's location in each frame. Based on this spatial information, 
        provide a detailed description of the events occurring within the video's 10 frames, 
        including scene composition and the relationships between objects (e.g., if two objects move closer 
        together or farther apart). Here is the dictionary: {kb_data}.
    """,
)
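The prompt asks the model whether objects move closer together or farther apart. As a rough way to check such claims by hand, the sketch below computes the Euclidean distance between two objects in every frame that contains both. `pairwise_distance` is a hypothetical helper, and the dictionary holds only the `3d_cord` values for `vehicles_0` and `vehicles_1` copied from `kb_data`. Because the coordinates are relative to the camera in each frame, the distance between two objects in the same frame does not depend on camera motion.

```python
import math


def pairwise_distance(kb, name_a, name_b):
    """Return {frame: Euclidean distance} for frames that contain both objects."""
    distances = {}
    for frame, objects in sorted(kb.items()):
        if name_a in objects and name_b in objects:
            distances[frame] = math.dist(
                objects[name_a]["3d_cord"], objects[name_b]["3d_cord"]
            )
    return distances


# 3D coordinates copied from kb_data; other keys are omitted for brevity.
kb = {
    "00011": {
        "vehicles_0": {"3d_cord": [0.12079738173162746, 0.9905385301993452, 4.06]},
        "vehicles_1": {"3d_cord": [-3.966319547753645, 1.0730139839333532, 3.22]},
    },
    "00015": {
        "vehicles_0": {"3d_cord": [-0.02398095804819994, 0.9832192799761976, 4.03]},
    },
}

distances = pairwise_distance(kb, "vehicles_0", "vehicles_1")
for frame, d in distances.items():
    print(f"{frame}: {d:.2f}")
```

Frame `00015` is skipped because `vehicles_1` is not listed there, which is itself a detail the model's description should account for.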

In [22]:
user_proxy.initiate_chat(
    video_agent,
    message=f"""
        Here is a dictionary containing the spatial information for 3 sampled frames from 10 consecutive frames 
        (frames 11-20) 
        of a video. Each frame is numbered in the format "00011", "00015", and so on. The 
        dictionary provides information about the objects present in each frame. For every object, 
        the dictionary includes the bounding box (identified by the "b_box" key), which indicates 
        the object's location within the 2D video frame. In addition to the bounding box, the 3D 
        coordinates of the object, estimated from the depth map, are provided for each object. 
        These coordinates are in the format [x, y, z], where "z" represents the distance of the 
        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, 
        coincides with the camera's location in each frame. Based on this spatial information, 
        provide a detailed description of the events occurring within the video's 10 frames, 
        including scene composition and the relationships between objects (e.g., if two objects move closer 
        together or farther apart). Here is the dictionary: {kb_data_subset}.
    """,
)

User_proxy (to video-explainer):


        Here is a dictionary containing the spatial information for 3 sampled frames from 10 consecutive frames 
        (frames 11-20) 
        of a video. Each frame is numbered in the format "00011", "00015", and so on. The 
        dictionary provides information about the objects present in each frame. For every object, 
        the dictionary includes the bounding box (identified by the "b_box" key), which indicates 
        the object's location within the 2D video frame. In addition to the bounding box, the 3D 
        coordinates of the object, estimated from the depth map, are provided for each object. 
        These coordinates are in the format [x, y, z], where "z" represents the distance of the 
        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, 
        coincides with the camera's location in each frame. Based on this spatial information, 
        provide a detailed description of t

ChatResult(chat_history=[{'content': '\n        Here is a dictionary containing the spatial information for 3 sampled frames from 10 consecutive frames \n        (frames 11-20) \n        of a video. Each frame is numbered in the format "00011", "00015", and so on. The \n        dictionary provides information about the objects present in each frame. For every object, \n        the dictionary includes the bounding box (identified by the "b_box" key), which indicates \n        the object\'s location within the 2D video frame. In addition to the bounding box, the 3D \n        coordinates of the object, estimated from the depth map, are provided for each object. \n        These coordinates are in the format [x, y, z], where "z" represents the distance of the \n        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, \n        coincides with the camera\'s location in each frame. Based on this spatial information, \n        provide a detailed descript

In [23]:
user_proxy.send(
    message="consider vehicles_0 and detect in which direction the car is moving",
    recipient=video_agent,
)

User_proxy (to video-explainer):

consider vehicles_0 and detect in which direction the car is moving

--------------------------------------------------------------------------------
>>>>>>>> USING AUTO REPLY...
You are an AI agent and you can view images.
###Human: 
        Here is a dictionary containing the spatial information for 3 sampled frames from 10 consecutive frames 
        (frames 11-20) 
        of a video. Each frame is numbered in the format "00011", "00015", and so on. The 
        dictionary provides information about the objects present in each frame. For every object, 
        the dictionary includes the bounding box (identified by the "b_box" key), which indicates 
        the object's location within the 2D video frame. In addition to the bounding box, the 3D 
        coordinates of the object, estimated from the depth map, are provided for each object. 
        These coordinates are in the format [x, y, z], where "z" represents the distan

ChatResult(chat_history=[{'content': '\n        Here is a dictionary containing the spatial information for 3 sampled frames from 10 consecutive frames \n        (frames 11-20) \n        of a video. Each frame is numbered in the format "00011", "00015", and so on. The \n        dictionary provides information about the objects present in each frame. For every object, \n        the dictionary includes the bounding box (identified by the "b_box" key), which indicates \n        the object\'s location within the 2D video frame. In addition to the bounding box, the 3D \n        coordinates of the object, estimated from the depth map, are provided for each object. \n        These coordinates are in the format [x, y, z], where "z" represents the distance of the \n        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, \n        coincides with the camera\'s location in each frame. Based on this spatial information, \n        provide a detailed descript
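
To sanity-check the model's answer about `vehicles_0`, the per-axis change in its coordinates between sampled frames can be computed directly. This is a minimal sketch: `displacement` is a hypothetical helper, and the values are the `3d_cord` entries for `vehicles_0` copied from `kb_data`. Since the origin coincides with the camera in each frame, the result describes motion relative to the camera, not absolute motion in the scene.

```python
def displacement(kb, name):
    """Return (previous_frame, current_frame, [dx, dy, dz]) for consecutive frames containing `name`."""
    frames = [frame for frame in sorted(kb) if name in kb[frame]]
    steps = []
    for prev, curr in zip(frames, frames[1:]):
        p = kb[prev][name]["3d_cord"]
        c = kb[curr][name]["3d_cord"]
        # Round to three decimals to suppress floating-point noise.
        steps.append((prev, curr, [round(cv - pv, 3) for pv, cv in zip(p, c)]))
    return steps


# 3D coordinates of vehicles_0 copied from kb_data.
kb = {
    "00011": {"vehicles_0": {"3d_cord": [0.12079738173162746, 0.9905385301993452, 4.06]}},
    "00015": {"vehicles_0": {"3d_cord": [-0.02398095804819994, 0.9832192799761976, 4.03]}},
    "00020": {"vehicles_0": {"3d_cord": [-0.02398095804819994, 0.9832192799761976, 4.03]}},
}

steps = displacement(kb, "vehicles_0")
for prev, curr, delta in steps:
    print(f"{prev} -> {curr}: dx={delta[0]}, dy={delta[1]}, dz={delta[2]}")
```

The changes are small on every axis, which suggests that `vehicles_0` stays nearly fixed relative to the camera across the sampled frames. A model answer that reports a large movement for this object would be worth questioning.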