# Chat with Multimodal Models: LLaVA

This notebook uses **LLaVA** to demonstrate AutoGen's multimodal feature. More information about LLaVA is available on its [GitHub page](https://github.com/haotian-liu/LLaVA).


### Before you start, install AutoGen with the `lmm` option
```bash
pip install "pyautogen[lmm]>=0.2.3"
```

In [1]:
# The LLAVA_MODE variable, defined after the imports, controls where LLaVA is hosted: locally or remotely.
# See the setup section below for details.
import json
import os
import random
import time
from typing import Any, Callable, Dict, List, Optional, Tuple, Type, Union

import matplotlib.pyplot as plt
import requests
from PIL import Image
from termcolor import colored
import numpy as np

import autogen
from autogen import Agent, AssistantAgent, ConversableAgent, UserProxyAgent
from autogen.agentchat.contrib.llava_agent import LLaVAAgent, llava_call
import glob
from natsort import natsorted

from yolov7_package import Yolov7Detector
import cv2
import inspect
import yolov7_package

LLAVA_MODE = "local"  # Either "local" or "remote"
assert LLAVA_MODE in ["local", "remote"]

<a id="local"></a>
## [Option 2] Set Up LLaVA Locally


### Install the LLaVA library

Please follow the LLaVA GitHub [page](https://github.com/haotian-liu/LLaVA/) to install LLaVA.


#### Download the package
```bash
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
```

#### Install the inference package
```bash
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```



Some helpful packages and dependencies:
```bash
conda install -c nvidia cuda-toolkit
```


### Launch

In one terminal, start the controller first:
```bash
python -m llava.serve.controller --host 0.0.0.0 --port 10000
```


Then, in another terminal, start the worker, which will load the model to the GPU:
```bash
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
```

In [2]:
# Run this code block only if you want to run LLaVA locally
if LLAVA_MODE == "local":
    llava_config_list = [
        {
            "model": "llava-v1.5-13b",
            "api_key": "None",
            "base_url": "http://0.0.0.0:10000",
        }
    ]

The user proxy agent's `human_input_mode` setting determines whether a human takes part in the conversation. This notebook uses `human_input_mode="NEVER"` for conciseness. Setting it to `"ALWAYS"` lets you interact with LLaVA in a multi-round dialogue and provide feedback as the conversation unfolds.

In [27]:
kb_data = """
    Frame: 11:
    The b_box of building_0 in frame 11 is [28, 177, 70, 231]
    The color of building_0 in frame 11 is (0, 0, 0)
    The 3d_cord of building_0 in frame 11 is [-20.999107408509367, -65.09723296637905, 117.63]
    The b_box of building_1 in frame 11 is [0, 0, 110, 105]
    The color of building_1 in frame 11 is (0, 0, 0)
    The 3d_cord of building_1 in frame 11 is [-22.51538232668848, -10.49146087473966, 19.81]
    The b_box of building_2 in frame 11 is [0, 173, 28, 245]
    The color of building_2 in frame 11 is (0, 0, 0)
    The 3d_cord of building_2 in frame 11 is [-25.423623921451945, -92.72145194882475, 125.66]
    The b_box of building_3 in frame 11 is [0, 93, 84, 159]
    The color of building_3 in frame 11 is (0, 0, 0)
    The 3d_cord of building_3 in frame 11 is [-24.04879500148765, -22.51841713775662, 36.74]
    The b_box of building_5 in frame 11 is [9, 353, 85, 472]
    The color of building_5 in frame 11 is (0, 0, 0)
    The 3d_cord of building_5 in frame 11 is [39.47277595953585, -18.807616780720025, 39.02]
    The b_box of building_6 in frame 11 is [21, 228, 70, 256]
    The color of building_6 in frame 11 is (0, 0, 0)
    The 3d_cord of building_6 in frame 11 is [-1.2783100267777445, -60.71972627194286, 107.41]
    The b_box of building_11 in frame 11 is [0, 153, 31, 177]
    The color of building_11 in frame 11 is (0, 0, 0)
    The 3d_cord of building_11 in frame 11 is [-104.69503124070216, -168.06307646533767, 231.5]
    The b_box of vegetation_0 in frame 11 is [0, 231, 111, 480]
    The color of vegetation_0 in frame 11 is (0, 0, 0)
    The 3d_cord of vegetation_0 in frame 11 is [12.859982148170186, -13.72555786968164, 20.78]
    The b_box of vehicles_0 in frame 11 is [125, 201, 223, 297]
    The color of vehicles_0 in frame 11 is (0, 0, 0)
    The 3d_cord of vehicles_0 in frame 11 is [0.12079738173162746, 0.9905385301993452, 4.06]
    The b_box of vehicles_1 in frame 11 is [138, 0, 248, 97]
    The color of vehicles_1 in frame 11 is (0, 0, 0)
    The 3d_cord of vehicles_1 in frame 11 is [-3.966319547753645, 1.0730139839333532, 3.22]
    Frame: 15:
    The b_box of building_0 in frame 15 is [24, 184, 71, 247]
    The color of building_0 in frame 15 is (0, 0, 0)
    The 3d_cord of building_0 in frame 15 is [-15.62249330556382, -63.79184766438559, 109.39]
    The b_box of building_1 in frame 15 is [0, 0, 64, 62]
    The color of building_1 in frame 15 is (0, 0, 0)
    The 3d_cord of building_1 in frame 15 is [-20.2719428741446, -10.184468908063076, 16.3]
    The b_box of building_2 in frame 15 is [0, 186, 24, 270]
    The color of building_2 in frame 15 is (0, 0, 0)
    The 3d_cord of building_2 in frame 15 is [-11.904552216602202, -87.53347218089854, 117.68]
    The b_box of building_3 in frame 15 is [0, 60, 99, 159]
    The color of building_3 in frame 15 is (0, 0, 0)
    The 3d_cord of building_3 in frame 15 is [-21.672835465635224, -15.041594763463253, 27.18]
    The b_box of building_5 in frame 15 is [0, 407, 92, 480]
    The color of building_5 in frame 15 is (0, 0, 0)
    The 3d_cord of building_5 in frame 15 is [41.00029753049687, -17.376316572448676, 32.81]
    The b_box of building_6 in frame 15 is [7, 243, 70, 275]
    The color of building_6 in frame 15 is (0, 0, 0)
    The 3d_cord of building_6 in frame 15 is [9.304849747099077, -58.15531091936923, 97.73]
    The b_box of building_11 in frame 15 is [0, 167, 24, 189]
    The color of building_11 in frame 15 is (0, 0, 0)
    The 3d_cord of building_11 in frame 15 is [-80.13329366260041, -170.9510264802142, 224.44]
    The b_box of vegetation_0 in frame 15 is [0, 264, 93, 366]
    The color of vegetation_0 in frame 15 is (0, 0, 0)
    The 3d_cord of vegetation_0 in frame 15 is [9.956322523058613, -14.866289794703958, 22.92]
    The b_box of vehicles_0 in frame 15 is [125, 192, 221, 287]
    The color of vehicles_0 in frame 15 is (0, 0, 0)
    The 3d_cord of vehicles_0 in frame 15 is [-0.02398095804819994, 0.9832192799761976, 4.03]
    Frame: 20:
    The b_box of building_0 in frame 20 is [22, 138, 71, 229]
    The color of building_0 in frame 20 is (0, 0, 0)
    The 3d_cord of building_0 in frame 20 is [-38.07807200238024, -58.345432906872944, 103.21]
    The b_box of building_2 in frame 20 is [0, 166, 21, 255]
    The color of building_2 in frame 20 is (0, 0, 0)
    The 3d_cord of building_2 in frame 20 is [-25.30092234454031, -83.8925319845284, 111.89]
    The b_box of building_3 in frame 20 is [0, 0, 111, 112]
    The color of building_3 in frame 20 is (0, 0, 0)
    The 3d_cord of building_3 in frame 20 is [-22.845700684320143, -9.807497768521273, 19.39]
    The b_box of building_5 in frame 20 is [0, 363, 86, 440]
    The color of building_5 in frame 20 is (0, 0, 0)
    The 3d_cord of building_5 in frame 20 is [38.53555489437667, -21.62963403748884, 41.78]
    The b_box of building_6 in frame 20 is [3, 225, 70, 258]
    The color of building_6 in frame 20 is (0, 0, 0)
    The 3d_cord of building_6 in frame 20 is [0.0, -55.98012496280869, 92.23]
    The b_box of building_11 in frame 20 is [0, 144, 24, 167]
    The color of building_11 in frame 20 is (0, 0, 0)
    The 3d_cord of building_11 in frame 20 is [-111.00934245760189, -161.3507884558167, 216.92]
    The b_box of pedestrian_3 in frame 20 is [84, 5, 117, 17]
    The color of pedestrian_3 in frame 20 is (0, 0, 0)
    The 3d_cord of pedestrian_3 in frame 20 is [-18.107110978875333, -2.8341565010413565, 13.23]
    The b_box of vegetation_0 in frame 20 is [0, 249, 101, 435]
    The color of vegetation_0 in frame 20 is (0, 0, 0)
    The 3d_cord of vegetation_0 in frame 20 is [11.286283844094019, -15.415412079738172, 23.13]
    The b_box of vehicles_0 in frame 20 is [125, 193, 222, 287]
    The color of vehicles_0 in frame 20 is (0, 0, 0)
    The 3d_cord of vehicles_0 in frame 20 is [-0.02398095804819994, 0.9832192799761976, 4.03]
    The b_box of vehicles_6 in frame 20 is [80, 33, 91, 60]
    The color of vehicles_6 in frame 20 is (0, 0, 0)
    The 3d_cord of vehicles_6 in frame 20 is [-33.310086283844086, -8.837369830407615, 28.56]
"""

kb_data_subset = """
    Frame: 11:
    The b_box of vehicles_0 in frame 11 is [125, 201, 223, 297]
    The color of vehicles_0 in frame 11 is (0, 0, 0)
    The 3d_cord of vehicles_0 in frame 11 is [0.12079738173162746, 0.9905385301993452, 4.06]
    The b_box of vehicles_1 in frame 11 is [138, 0, 248, 97]
    The color of vehicles_1 in frame 11 is (0, 0, 0)
    The 3d_cord of vehicles_1 in frame 11 is [-3.966319547753645, 1.0730139839333532, 3.22]
    Frame: 15:
    The b_box of vehicles_0 in frame 15 is [125, 192, 221, 287]
    The color of vehicles_0 in frame 15 is (0, 0, 0)
    The 3d_cord of vehicles_0 in frame 15 is [-0.02398095804819994, 0.9832192799761976, 4.03]
    Frame: 20:
    The b_box of pedestrian_3 in frame 20 is [84, 5, 117, 17]
    The color of pedestrian_3 in frame 20 is (0, 0, 0)
    The 3d_cord of pedestrian_3 in frame 20 is [-18.107110978875333, -2.8341565010413565, 13.23]
    The b_box of vehicles_0 in frame 20 is [125, 193, 222, 287]
    The color of vehicles_0 in frame 20 is (0, 0, 0)
    The 3d_cord of vehicles_0 in frame 20 is [-0.02398095804819994, 0.9832192799761976, 4.03]
    The b_box of vehicles_6 in frame 20 is [80, 33, 91, 60]
    The color of vehicles_6 in frame 20 is (0, 0, 0)
    The 3d_cord of vehicles_6 in frame 20 is [-33.310086283844086, -8.837369830407615, 28.56]
"""
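
The facts in `kb_data` and `kb_data_subset` follow a regular sentence pattern, so they can be turned into a nested dictionary before any further analysis. The sketch below is one possible way to do that. `FACT_RE` and `parse_kb` are hypothetical helper names (they are not part of AutoGen or LLaVA), and the regular expression assumes every fact line has the same shape as the lines above.

```python
import re
from collections import defaultdict

# Hypothetical pattern for lines such as:
#   "The 3d_cord of vehicles_0 in frame 11 is [0.12, 0.99, 4.06]"
#   "The color of vehicles_0 in frame 11 is (0, 0, 0)"
FACT_RE = re.compile(
    r"The (?P<attr>b_box|color|3d_cord) of (?P<obj>\w+) in frame (?P<frame>\d+) is "
    r"[\[(](?P<values>[^\])]*)[\])]"
)


def parse_kb(text):
    """Return {frame: {object_name: {attribute: [numbers, ...]}}}."""
    frames = defaultdict(dict)
    for match in FACT_RE.finditer(text):
        numbers = [float(v) for v in match["values"].split(",")]
        obj = frames[int(match["frame"])].setdefault(match["obj"], {})
        obj[match["attr"]] = numbers
    return dict(frames)


# A three-line excerpt of kb_data_subset, used here so the example is self-contained.
sample = """
    Frame: 11:
    The b_box of vehicles_0 in frame 11 is [125, 201, 223, 297]
    The color of vehicles_0 in frame 11 is (0, 0, 0)
    The 3d_cord of vehicles_0 in frame 11 is [0.12079738173162746, 0.9905385301993452, 4.06]
"""

parsed = parse_kb(sample)
print(parsed[11]["vehicles_0"]["3d_cord"])
```

Calling `parse_kb(kb_data)` on the full string should work the same way, provided every line follows this pattern. Lines such as `Frame: 11:` are skipped because they do not match.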

In [28]:
video_agent = LLaVAAgent(
    name="video-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": llava_config_list, "temperature": 0.5, "max_new_tokens": 100000},
)

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    code_execution_config={
        "last_n_messages": 3,
        "work_dir": "groupchat",
        "use_docker": False,
    },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.
    human_input_mode="NEVER",  # Set to "ALWAYS" to give feedback between turns, or "NEVER" to run unattended
    max_consecutive_auto_reply=0,
)

# The cells below ask questions about the frame data in kb_data and kb_data_subset


In [31]:
user_proxy.initiate_chat(
    video_agent,
    message=f"""
        Here are some facts describing the spatial information for 3 sampled frames (11,15, and 20) 
        from 10 consecutive frames (frames 11-20) of a video. The dictionary provides information 
        about the objects present in each frame. For every object, 
        the information includes the bounding box (identified by the "b_box"), which indicates 
        the object's location within the 2D video frame. In addition to the bounding box, the 3D 
        coordinates of the object (3d_cord), estimated from the depth map, are provided for each object. 
        These coordinates are in the format [x, y, z], where "z" represents the distance of the 
        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, 
        coincides with the camera's location in each frame. Based on this spatial information, 
        provide a detailed description of the events occurring within the video's 10 frames, 
        including scene composition and the relationships between objects (e.g., if two objects move closer 
        together or farther apart). Here is the dictionary: {kb_data}.
    """,
)

User_proxy (to video-explainer):


        Here are some facts describing the spatial information for 3 sampled frames (11,15, and 20) 
        from 10 consecutive frames (frames 11-20) of a video. The dictionary provides information 
        about the objects present in each frame. For every object, 
        the information includes the bounding box (identified by the "b_box"), which indicates 
        the object's location within the 2D video frame. In addition to the bounding box, the 3D 
        coordinates of the object (3d_cord), estimated from the depth map, are provided for each object. 
        These coordinates are in the format [x, y, z], where "z" represents the distance of the 
        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, 
        coincides with the camera's location in each frame. Based on this spatial information, 
        provide a detailed description of the events occurring within the video's 10 frames, 
  

ChatResult(chat_history=[{'content': '\n        Here are some facts describing the spatial information for 3 sampled frames (11,15, and 20) \n        from 10 consecutive frames (frames 11-20) of a video. The dictionary provides information \n        about the objects present in each frame. For every object, \n        the information includes the bounding box (identified by the "b_box"), which indicates \n        the object\'s location within the 2D video frame. In addition to the bounding box, the 3D \n        coordinates of the object (3d_cord), estimated from the depth map, are provided for each object. \n        These coordinates are in the format [x, y, z], where "z" represents the distance of the \n        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, \n        coincides with the camera\'s location in each frame. Based on this spatial information, \n        provide a detailed description of the events occurring within the video\'s 10 fra
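
The prompt asks whether objects move closer together or farther apart. As a rough cross-check on the model's answer, the same question can be answered numerically from the `3d_cord` values. The sketch below copies the coordinates of `vehicles_0` and `vegetation_0` from `kb_data` and computes their Euclidean distance in each sampled frame. It assumes both objects are expressed in the same camera-centered coordinate system and the same (unspecified) units within a frame. `coords` and `pair_distance` are hypothetical names used only for this illustration.

```python
import math

# 3D coordinates copied from kb_data for the three sampled frames.
coords = {
    11: {
        "vehicles_0": [0.12079738173162746, 0.9905385301993452, 4.06],
        "vegetation_0": [12.859982148170186, -13.72555786968164, 20.78],
    },
    15: {
        "vehicles_0": [-0.02398095804819994, 0.9832192799761976, 4.03],
        "vegetation_0": [9.956322523058613, -14.866289794703958, 22.92],
    },
    20: {
        "vehicles_0": [-0.02398095804819994, 0.9832192799761976, 4.03],
        "vegetation_0": [11.286283844094019, -15.415412079738172, 23.13],
    },
}


def pair_distance(frame_objects, a, b):
    """Euclidean distance between objects a and b within one frame."""
    return math.dist(frame_objects[a], frame_objects[b])


distances = {
    frame: pair_distance(objs, "vehicles_0", "vegetation_0")
    for frame, objs in sorted(coords.items())
}
for frame, d in distances.items():
    print(f"frame {frame}: distance = {d:.2f}")
```

A distance that grows from frame to frame suggests the two objects are moving apart. Because the depth values are estimates, small changes should be read with caution.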

In [33]:
user_proxy.initiate_chat(
    video_agent,
    message=f"""
        Here are some facts describing the spatial information for 3 sampled frames (11,15, and 20) 
        from 10 consecutive frames (frames 11-20) of a video. The dictionary provides information 
        about the objects present in each frame. For every object, 
        the information includes the bounding box (identified by the "b_box"), which indicates 
        the object's location within the 2D video frame. In addition to the bounding box, the 3D 
        coordinates of the object (3d_cord), estimated from the depth map, are provided for each object. 
        These coordinates are in the format [x, y, z], where "z" represents the distance of the 
        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, 
        coincides with the camera's location in each frame. Based on this spatial information, 
        provide a detailed description of the events occurring within the video's 10 frames, 
        including scene composition and the relationships between objects (e.g., if two objects move closer 
        together or farther apart). Here is the dictionary: {kb_data_subset}.
    """,
)

User_proxy (to video-explainer):


        Here are some facts describing the spatial information for 3 sampled frames (11,15, and 20) 
        from 10 consecutive frames (frames 11-20) of a video. The dictionary provides information 
        about the objects present in each frame. For every object, 
        the information includes the bounding box (identified by the "b_box"), which indicates 
        the object's location within the 2D video frame. In addition to the bounding box, the 3D 
        coordinates of the object (3d_cord), estimated from the depth map, are provided for each object. 
        These coordinates are in the format [x, y, z], where "z" represents the distance of the 
        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, 
        coincides with the camera's location in each frame. Based on this spatial information, 
        provide a detailed description of the events occurring within the video's 10 frames, 
  

ChatResult(chat_history=[{'content': '\n        Here are some facts describing the spatial information for 3 sampled frames (11,15, and 20) \n        from 10 consecutive frames (frames 11-20) of a video. The dictionary provides information \n        about the objects present in each frame. For every object, \n        the information includes the bounding box (identified by the "b_box"), which indicates \n        the object\'s location within the 2D video frame. In addition to the bounding box, the 3D \n        coordinates of the object (3d_cord), estimated from the depth map, are provided for each object. \n        These coordinates are in the format [x, y, z], where "z" represents the distance of the \n        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, \n        coincides with the camera\'s location in each frame. Based on this spatial information, \n        provide a detailed description of the events occurring within the video\'s 10 fra

In [34]:
user_proxy.send(
    message="consider vehicles_0 and detect in which direction the vehicles_0 is moving",
    recipient=video_agent,
)

User_proxy (to video-explainer):

consider vehicles_0 and detect in which direction the vehicles_0 is moving

--------------------------------------------------------------------------------
>>>>>>>> USING AUTO REPLY...
You are an AI agent and you can view images.
###Human: 
        Here are some facts describing the spatial information for 3 sampled frames (11,15, and 20) 
        from 10 consecutive frames (frames 11-20) of a video. The dictionary provides information 
        about the objects present in each frame. For every object, 
        the information includes the bounding box (identified by the "b_box"), which indicates 
        the object's location within the 2D video frame. In addition to the bounding box, the 3D 
        coordinates of the object (3d_cord), estimated from the depth map, are provided for each object. 
        These coordinates are in the format [x, y, z], where "z" represents the distance of the 
        object from the camera. The

ChatResult(chat_history=[{'content': '\n        Here are some facts describing the spatial information for 3 sampled frames (11,15, and 20) \n        from 10 consecutive frames (frames 11-20) of a video. The dictionary provides information \n        about the objects present in each frame. For every object, \n        the information includes the bounding box (identified by the "b_box"), which indicates \n        the object\'s location within the 2D video frame. In addition to the bounding box, the 3D \n        coordinates of the object (3d_cord), estimated from the depth map, are provided for each object. \n        These coordinates are in the format [x, y, z], where "z" represents the distance of the \n        object from the camera. The origin of the point cloud, used to derive the 3D coordinates, \n        coincides with the camera\'s location in each frame. Based on this spatial information, \n        provide a detailed description of the events occurring within the video\'s 10 fra
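
The direction question can also be checked directly from the data. The sketch below takes the `3d_cord` values of `vehicles_0` from `kb_data_subset` and computes the per-axis displacement between consecutive sampled frames. Two caveats apply: the coordinates are relative to the camera in each frame, so the result describes motion relative to the camera rather than absolute motion, and the orientation of the x and y axes is not stated in the source, so only the sign and size of each component are reported. `track` and `displacements` are hypothetical names used only for this illustration.

```python
# 3D coordinates of vehicles_0 copied from kb_data_subset, keyed by frame number.
track = {
    11: [0.12079738173162746, 0.9905385301993452, 4.06],
    15: [-0.02398095804819994, 0.9832192799761976, 4.03],
    20: [-0.02398095804819994, 0.9832192799761976, 4.03],
}


def displacements(track):
    """Per-axis change in position between consecutive sampled frames."""
    frames = sorted(track)
    steps = {}
    for prev, curr in zip(frames, frames[1:]):
        steps[(prev, curr)] = [c - p for p, c in zip(track[prev], track[curr])]
    return steps


steps = displacements(track)
for (prev, curr), (dx, dy, dz) in steps.items():
    print(f"frames {prev}->{curr}: dx={dx:+.3f}, dy={dy:+.3f}, dz={dz:+.3f}")
```

A negative `dz` means the object is slightly closer to the camera than in the previous sampled frame, and a displacement of zero on every axis means its position relative to the camera did not change.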