# Agent for Mixtral

Multimodal Agent: Image Captioning with Mixtral 8x7B via the Together API  

### Tutorial
https://medium.com/@shivansh.kaushik/multimodal-agent-image-captioning-with-mistral-7b-on-cpu-a82d35c84549

# LLM

Using LangChain and Together AI.

In [9]:
import os
from langchain_together import Together
llm = Together(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    temperature=0,
    max_tokens=256,
    top_k=1,
    together_api_key=os.getenv("TOGETHER_API_KEY")
)
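`os.getenv` returns `None` when the variable is unset, so it can help to check for the key explicitly before building the client. A minimal sketch, assuming a hypothetical helper named `require_env` (not part of LangChain or the Together SDK):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable (None) and an empty string.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, `together_api_key=require_env("TOGETHER_API_KEY")` could replace the bare `os.getenv` call above.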

### Base structured-chat-agent prompt

In [10]:
from langchain import hub

# Get the prompt to use - you can modify this!
prompt = hub.pull("hwchase17/structured-chat-agent")

#### The prompt needs to be modified to work with our agent and model

In [11]:
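# Insert a tool-use example into the system message at a hard-coded offset (character 394).
# The offset depends on the exact text of the hub prompt, so re-check it if the prompt changes.
# Note: the example names 'image_captioner_json' and 'query', while the tool defined below
# is 'image_captioner' with an 'image_url' argument.
prompt[0].prompt.template = prompt[0].prompt.template[0:394] + "```\n\n\nFor example, if you want to use a tool to get an image's caption, your $JSON_BLOB might look like this:\n\n```\n{{\n    'action': 'image_captioner_json', \n    'action_input': {{'query': 'images url'}}\n}}```" + prompt[0].prompt.template[394:]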
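Slicing at index 394 only works while the hub prompt keeps exactly the same text. One possible alternative is to insert the example after a marker substring instead. The sketch below is illustrative only: `insert_after_marker` is a hypothetical helper, and `sample_template` is an abbreviated stand-in for the real system template, not a copy of it.

```python
def insert_after_marker(template: str, marker: str, addition: str) -> str:
    """Insert `addition` immediately after the first occurrence of `marker`."""
    index = template.find(marker)
    if index == -1:
        raise ValueError("marker not found in template")
    cut = index + len(marker)
    return template[:cut] + addition + template[cut:]


FENCE = "`" * 3  # a Markdown code fence, built this way to keep the source readable

# Abbreviated stand-in for the system template (assumption, for illustration).
sample_template = (
    "Provide only ONE action per $JSON_BLOB, as shown:\n\n"
    + FENCE + "\n{{\n  \"action\": $TOOL_NAME,\n  \"action_input\": $INPUT\n}}\n" + FENCE + "\n\n"
    + "(rest of the template)\n"
)

# The marker is the end of the generic JSON example, including its closing fence.
marker = "}}\n" + FENCE + "\n\n"

# Double braces keep the text valid inside a format-string template.
example = (
    "For example, to get an image's caption:\n\n"
    + FENCE + "\n{{\n  \"action\": \"image_captioner\",\n"
    + "  \"action_input\": {{\"image_url\": \"<image URL>\"}}\n}}\n" + FENCE + "\n\n"
)

patched = insert_after_marker(sample_template, marker, example)
```

If the marker ever disappears from the prompt, this version raises an error instead of silently inserting the text in the wrong place.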

In [12]:
prompt

ChatPromptTemplate(input_variables=['agent_scratchpad', 'input', 'tool_names', 'tools'], input_types={'chat_history': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['tool_names', 'tools'], template='Respond to the human as helpfully and accurately as possible. You have access to the following tools:\n\n{tools}\n\nUse a json blob to specify a tool by providing an action key (tool name) and an action_input key (tool input).\n\nValid "action" values: "Final Answer" or {tool_names}\n\nProvide only ONE action per $JSON_BLOB, as shown:\n\n```\n{{\n  "action": $TOOL_NAME,\n  "action_input": $INPUT\n}}\n```\n\n```\n\n\nFor example, if you want to use a tool to get an image\'s cap

# Image descriptor model

In [13]:
# from transformers import BlipProcessor, BlipForConditionalGeneration

# blip_processor = BlipProcessor.from_pretrained("kha-white/manga-ocr-base")
# blip_model = BlipForConditionalGeneration.from_pretrained("kha-white/manga-ocr-base")

from transformers import BlipProcessor, BlipForConditionalGeneration

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")



## Tool Initialization

In [14]:
from langchain.tools import tool
from langchain.pydantic_v1 import BaseModel, Field
#from pydantic import BaseModel, Field
import requests
from PIL import Image

class ImageCaptionerInput(BaseModel):
    image_url: str = Field(description="URL of the image that is to be described")


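@tool("image_captioner", return_direct=True, args_schema=ImageCaptionerInput)
def image_captioner(image_url: str) -> str:
    """Provides information about the image"""
    # Download the image and normalize it to RGB.
    response = requests.get(image_url, stream=True, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    raw_image = Image.open(response.raw).convert('RGB')
    # Preprocess for BLIP, generate caption tokens, then decode them to text.
    inputs = blip_processor(raw_image, return_tensors="pt")
    out = blip_model.generate(**inputs)
    return blip_processor.decode(out[0], skip_special_tokens=True)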
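The tool receives whatever string the LLM places in `action_input`, which could be a placeholder such as `'images url'` rather than a real address. A small check before downloading may give a clearer failure. This is a sketch only: `is_http_url` is a hypothetical helper, not part of LangChain or `requests`.

```python
from urllib.parse import urlparse


def is_http_url(candidate: str) -> bool:
    """Return True if `candidate` looks like an absolute http(s) URL."""
    parsed = urlparse(candidate.strip())
    # Require an explicit http/https scheme and a host part.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

The tool could call this first and return a short error message to the agent when the check fails, instead of letting the HTTP request raise.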

In [15]:
tools = [image_captioner]

### Agent

In [16]:
from langchain.agents import create_structured_chat_agent

# Alternative for models with OpenAI-style tool calling:
# create_openai_tools_agent(llm, tools, prompt)
agent = create_structured_chat_agent(llm, tools, prompt)

In [17]:
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, handle_parsing_errors=True)

In [19]:
agent_executor.invoke({"input": "Que hay en esta imagen https://www.barcelo.com/guia-turismo/wp-content/uploads/2020/06/lago-di-como.jpg"})



> Entering new AgentExecutor chain...

Thought: I need to describe the image. I will use the image_captioner tool.
Action:
```
{
  "action": "image_captioner",
  "action_input": "https://www.barcelo.com/guia-turismo/wp-content/uploads/2020/06/lago-di-como.jpg"
}
```

a lake with a boat in it

> Finished chain.


{'input': 'Que hay en esta imagen https://www.barcelo.com/guia-turismo/wp-content/uploads/2020/06/lago-di-como.jpg',
 'output': 'a lake with a boat in it'}
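The trace above shows the model emitting a fenced JSON blob with an `action` and an `action_input`, which the agent then maps to a tool call. The sketch below illustrates that idea with a simplified regex and `json.loads`. It is not LangChain's actual output parser, and `parse_action` is a hypothetical name.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence

# Match a JSON object between two fences, with an optional "json" language tag.
ACTION_PATTERN = re.compile(
    FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL
)


def parse_action(text: str):
    """Extract (action, action_input) from the first fenced JSON blob in `text`."""
    match = ACTION_PATTERN.search(text)
    if match is None:
        raise ValueError("no fenced JSON blob found")
    blob = json.loads(match.group(1))
    return blob["action"], blob["action_input"]


# Model output shaped like the trace above.
model_output = (
    "Thought: I need to describe the image. I will use the image_captioner tool.\n"
    "Action:\n" + FENCE + "\n"
    '{\n  "action": "image_captioner",\n'
    '  "action_input": "https://www.barcelo.com/guia-turismo/wp-content/uploads/2020/06/lago-di-como.jpg"\n}\n'
    + FENCE + "\n"
)

action, action_input = parse_action(model_output)
```

Here `action_input` is a plain string rather than a dictionary keyed by `image_url`, matching what the model produced in the run above.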