# Create a native Agent using OpenVINO Generate API

LLMs are limited to the knowledge on which they have been trained and the additional knowledge provided as context. As a result, if a useful piece of information is missing from the provided knowledge, the model cannot “go around” and try to find it in other sources. This is why we need to introduce the concept of agents.

The core idea of agents is to use a language model to choose a sequence of actions to take. In agents, a language model is used as a reasoning engine to determine which actions to take and in which order. Agents can be seen as applications powered by LLMs and integrated with a set of tools like search engines, databases, websites, and so on. Within an agent, the LLM is the reasoning engine that, based on the user input, is able to plan and execute a set of actions that are needed to fulfill the request.

![agent](https://github.com/openvinotoolkit/openvino_notebooks/assets/91237924/22fa5396-8381-400f-a78f-97e25d57d807)

We previously discussed how to build an instruction-following pipeline using OpenVINO; please check out [this tutorial](../llm-question-answering/llm-question-answering.ipynb) for reference.
In this tutorial, we consider how to use the power of OpenVINO to run a Large Language Model as the reasoning engine of an agent. We will use a pre-trained model from the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library. The [Hugging Face Optimum Intel](https://huggingface.co/docs/optimum/intel/index) library converts the model to the OpenVINO™ IR format. To simplify the user experience, we will use the [OpenVINO Generate API](https://github.com/openvinotoolkit/openvino.genai) for the generation pipeline.

#### Table of contents:

- [Prerequisites](#Prerequisites)
- [Create LLM as agent](#Create-LLM-as-agent)
    - [Download LLM](#Download-LLM)
    - [Select inference device for LLM](#Select-inference-device-for-LLM)
    - [Instantiate pipeline with OpenVINO Generate API](#Instantiate-pipeline-with-OpenVINO-Generate-API)
    - [Create text generation method](#Create-text-generation-method)
- [Create prompt template](#Create-prompt-template)
- [Create parser](#Create-parser)
- [Create tools calling](#Create-tools-calling)
- [Run agent](#Run-agent)
- [Create AI agent demo with Gradio UI](#Create-AI-agent-demo-with-Gradio-UI)

### Installation Instructions

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).


## Prerequisites

[back to top ⬆️](#Table-of-contents:)

In [4]:
import os
import requests


r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
open("notebook_utils.py", "w").write(r.text)

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/pip_helper.py",
)
open("pip_helper.py", "w").write(r.text)

os.environ["GIT_CLONE_PROTECTION_ACTIVE"] = "false"

from pip_helper import pip_install

pip_install(
    "-q",
    "--extra-index-url",
    "https://download.pytorch.org/whl/cpu",
    "transformers>=4.43.1",
    "gradio>=4.19",
)
pip_install(
    "-q",
    "git+https://github.com/huggingface/optimum-intel.git",
    "git+https://github.com/openvinotoolkit/nncf.git",
    "datasets",
    "accelerate",
    "huggingface-hub>=0.26.5",
    "openvino-genai",
)

## Create LLM as agent

[back to top ⬆️](#Table-of-contents:)

### Download LLM

[back to top ⬆️](#Table-of-contents:)

To run the LLM locally, we first have to download the model. It is possible to [export your model](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#export) to the OpenVINO IR format with the CLI and load the model from a local folder.

Large Language Models (LLMs) are a core component of an agent. In this example, we can select `Qwen2.5` as the LLM in the agent pipeline.
* **qwen2.5-3b-instruct/qwen2.5-7b-instruct/qwen2.5-14b-instruct** - Qwen2.5 is the latest series of Qwen large language models. Compared with Qwen2, the Qwen2.5 series brings significant improvements in coding, mathematics, and general knowledge skills. Additionally, it brings long-context and multilingual support, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.
For more details, please refer to [model_card](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), [blog](https://qwenlm.github.io/blog/qwen2.5/), [GitHub](https://github.com/QwenLM/Qwen2.5), and [Documentation](https://qwen.readthedocs.io/en/latest/).

In [5]:
import ipywidgets as widgets

llm_model_ids = ["Qwen/Qwen2.5-3B-Instruct", "Qwen/Qwen2.5-7B-Instruct", "Qwen/qwen2.5-14b-instruct"]

llm_model_id = widgets.Dropdown(
    options=llm_model_ids,
    value=llm_model_ids[0],
    description="Model:",
    disabled=False,
)

llm_model_id

Dropdown(description='Model:', options=('Qwen/Qwen2.5-3B-Instruct', 'Qwen/Qwen2.5-7B-Instruct', 'Qwen/qwen2.5-…

In [8]:
from pathlib import Path

llm_model_path = llm_model_id.value.split("/")[-1]

if not Path(llm_model_path).exists():
    !optimum-cli export openvino --model {llm_model_id.value} --task text-generation-with-past --trust-remote-code --weight-format int4 --group-size 128 --ratio 1.0 --sym {llm_model_path}

### Select inference device for LLM

[back to top ⬆️](#Table-of-contents:)

In [9]:
from notebook_utils import device_widget

llm_device = device_widget("CPU", exclude=["NPU"])

llm_device

Dropdown(description='Device:', options=('CPU', 'GPU', 'AUTO'), value='CPU')

### Instantiate pipeline with OpenVINO Generate API

[back to top ⬆️](#Table-of-contents:)

[OpenVINO Generate API](https://github.com/openvinotoolkit/openvino.genai/blob/master/src/README.md) can be used to create pipelines that run inference with OpenVINO Runtime.

First, we need to create a pipeline with `LLMPipeline`, the main object used for text generation with LLMs in the OpenVINO GenAI API. You can construct it directly from the folder with the converted model: we pass the model directory and the device to `LLMPipeline`, then run the `generate` method and get the output in text format.

Additionally, we can configure decoding parameters. We can create the default config with `openvino_genai.GenerationConfig()`, set up parameters, and apply the updated version with `set_generation_config(config)` or pass the config directly to `generate()`. It is also possible to specify the needed options as arguments of the `generate()` method. For example, `max_new_tokens` stops generation once the specified number of tokens has been generated, even if the end of generation has not been reached. Some of these parameters are set in the text generation method later in this notebook.

Generating a long response may be time-consuming. To access partial results as soon as they are generated, without waiting for the whole process to finish, the Streaming API can be used. Token streaming is the mode in which the generative system returns tokens one by one as the model generates them. This enables showing progressive generations to the user rather than waiting for the whole generation. Streaming is an essential aspect of the end-user experience because it reduces perceived latency, one of the most critical aspects of a smooth experience. In the code below, we implement a simple streamer that prints the output. For a more advanced streamer example, please check the openvino.genai [sample](https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/multinomial_causal_lm).
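
The stop-flag convention of a streaming callback can be tried out without loading a model. The following is a minimal sketch, assuming a callback that receives one text chunk at a time and returns `True` to stop and `False` to continue, as the streamer in this notebook does. `simulate_generation`, `collecting_streamer`, and the hard-coded chunks are hypothetical stand-ins for illustration, not part of OpenVINO GenAI.

```python
from typing import Callable, Iterable, List


def simulate_generation(chunks: Iterable[str], callback: Callable[[str], bool]) -> List[str]:
    """Feed chunks to a callback one by one, as a stand-in for a real pipeline.

    Stops early when the callback returns True (False means "continue").
    """
    delivered = []
    for chunk in chunks:
        delivered.append(chunk)
        if callback(chunk):
            break
    return delivered


collected: List[str] = []


def collecting_streamer(subword: str) -> bool:
    # Store each partial result as soon as it "arrives".
    collected.append(subword)
    # Ask the generator to stop once a sentence has been completed.
    return subword.endswith(".")


# Hard-coded chunks replace real model output in this sketch.
fake_chunks = ["Open", "VINO", " runs", " LLMs", ".", " Extra", " text"]
delivered = simulate_generation(fake_chunks, collecting_streamer)
print("".join(collected))
```

Because the callback returns `True` on the chunk that ends with a period, the two trailing chunks are never delivered.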

In [10]:
import openvino_genai

pipe = openvino_genai.LLMPipeline(llm_model_path, llm_device.value)

tokenizer = pipe.get_tokenizer()
config = openvino_genai.GenerationConfig()

### Create text generation method

[back to top ⬆️](#Table-of-contents:)

In this example, we would like to stream the output text through a streamer and stop text generation before the `Observation` is received from tool calling.

In [11]:
def streamer(subword):
    print(subword, end="", flush=True)
    # Return flag corresponds whether generation should be stopped.
    # False means continue generation.
    return False


def text_completion(prompt: str, stop_words) -> str:
    im_end = "<|im_end|>"
    if im_end not in stop_words:
        stop_words = stop_words + [im_end]

    config.max_new_tokens = 2000
    config.top_k = 1
    config.stop_strings = set(stop_words)
    output = pipe.generate(prompt, config, streamer)
    for stop_str in stop_words:
        idx = output.find(stop_str)
        if idx != -1:
            output = output[: idx + len(stop_str)]
    return output
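
The post-processing step in `text_completion` can be looked at in isolation. The sketch below mirrors its truncation loop on a made-up model output; `truncate_at_stop_words` and the sample string are illustrative names and data, not part of the notebook's pipeline.

```python
from typing import List


def truncate_at_stop_words(output: str, stop_words: List[str]) -> str:
    """Cut `output` right after the first occurrence of each stop word."""
    for stop_str in stop_words:
        idx = output.find(stop_str)
        if idx != -1:
            # Keep the stop word itself so a later parsing step can still see it.
            output = output[: idx + len(stop_str)]
    return output


# Made-up model output that runs past the point where a tool should be called.
raw = 'Thought: check the weather\nAction: get_weather\nAction Input: {"city_name": "London"}\nObservation: made-up result'
stop_words = ["Observation:", "Observation:\n", "<|im_end|>"]
trimmed = truncate_at_stop_words(raw, stop_words)
print(trimmed)
```

Anything the model invents after `Observation:` is discarded, so the real tool result can be appended instead.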

## Create prompt template

[back to top ⬆️](#Table-of-contents:)

A prompt for a language model is a set of instructions or input provided by a user to guide the model's response, helping it understand the context and generate relevant and coherent language-based output, such as answering questions, completing sentences, or engaging in a conversation.

Different agents have different prompting styles for reasoning. In this example, we will use the [ReAct agent](https://react-lm.github.io/) with its typical prompt template. For other agent styles, see the LangChain list of [agent types](https://python.langchain.com/docs/modules/agents/agent_types/).

![react](https://github.com/user-attachments/assets/c26432c2-3cf1-4942-ae03-fd8e8ebb4509)

A ReAct prompt consists of few-shot task-solving trajectories, with human-written text reasoning traces and actions, as well as environment observations in response to actions. ReAct prompting is intuitive and flexible to design, and achieves state-of-the-art few-shot performance across a variety of tasks, from question answering to online shopping.

In the prompt template for the agent, `query` is the user's query, and the other parameters should contain the `descriptions` and `parameters` of the agent's tools.

In [12]:
TOOL_DESC = """{name_for_model}: Call this tool to interact with the {name_for_human} API. What is the {name_for_human} API useful for? {description_for_model} Parameters: {parameters}"""

PROMPT_REACT = """Answer the following questions as best you can. You have access to the following APIs:

{tools_text}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tools_name_text}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {query}"""
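
To see how the placeholders are filled, here is a small sketch that formats shortened stand-ins for the two templates with a single tool. The `echo_text` tool, the shortened template strings, and the query are hypothetical and exist only for this illustration; only the placeholder names follow the templates above.

```python
import json

# Shortened stand-ins for TOOL_DESC and PROMPT_REACT, used only for illustration.
tool_desc = "{name_for_model}: Call this tool to interact with the {name_for_human} API. Parameters: {parameters}"
prompt_react = (
    "You have access to the following APIs:\n\n{tools_text}\n\n"
    "Action: the action to take, should be one of [{tools_name_text}]\n\n"
    "Question: {query}"
)

# Hypothetical tool entry; only the keys mirror the template placeholders.
tool_info = {
    "name_for_human": "echo",
    "name_for_model": "echo_text",
    "parameters": [{"name": "text", "required": True, "schema": {"type": "string"}}],
}

tools_text = tool_desc.format(
    name_for_model=tool_info["name_for_model"],
    name_for_human=tool_info["name_for_human"],
    parameters=json.dumps(tool_info["parameters"], ensure_ascii=False),
)
filled = prompt_react.format(
    tools_text=tools_text,
    tools_name_text=tool_info["name_for_model"],
    query="Say hello",
)
print(filled)
```

The parameter list is serialized with `json.dumps` before formatting, so its braces are passed as a value and do not clash with the template placeholders.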

We also have to create a function that consolidates the tool information and conversation history into the prompt template.

In [13]:
import json
import json5


def build_input_text(chat_history, list_of_tool_info) -> str:
    tools_text = []
    for tool_info in list_of_tool_info:
        tool = TOOL_DESC.format(
            name_for_model=tool_info["name_for_model"],
            name_for_human=tool_info["name_for_human"],
            description_for_model=tool_info["description_for_model"],
            parameters=json.dumps(tool_info["parameters"], ensure_ascii=False),
        )
        if tool_info.get("args_format", "json") == "json":
            tool += " Format the arguments as a JSON object."
        elif tool_info["args_format"] == "code":
            tool += " Enclose the code within triple backticks (`) at the beginning and end of the code."
        else:
            raise NotImplementedError
        tools_text.append(tool)
    tools_text = "\n\n".join(tools_text)

    tools_name_text = ", ".join([tool_info["name_for_model"] for tool_info in list_of_tool_info])

    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for i, (query, response) in enumerate(chat_history):
        if list_of_tool_info:
            if (len(chat_history) == 1) or (i == len(chat_history) - 2):
                query = PROMPT_REACT.format(
                    tools_text=tools_text,
                    tools_name_text=tools_name_text,
                    query=query,
                )
        if query:
            messages.append({"role": "user", "content": query})
        if response:
            messages.append({"role": "assistant", "content": response})

    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    return prompt

## Create parser

[back to top ⬆️](#Table-of-contents:)

A parser converts the raw output of the LLM into the input arguments of tools.

In [14]:
def parse_latest_tool_call(text):
    tool_name, tool_args = "", ""
    i = text.rfind("\nAction:")
    j = text.rfind("\nAction Input:")
    k = text.rfind("\nObservation:")
    for stop_str in ['Observation"}', "Observation}"]:
        idx = text.find(stop_str)
        if idx != -1:
            text = text[:idx]
    if 0 <= i < j:  # If the text has `Action` and `Action input`,
        if k < j:  # but does not contain `Observation`,
            # then it is likely that `Observation` is omitted by the LLM,
            # because the output text may have discarded the stop word.
            text = text.rstrip() + "\nObservation:"  # Add it back.
        k = text.rfind("\nObservation:")
        tool_name = text[i + len("\nAction:") : j].strip()
        tool_args = text[j + len("\nAction Input:") : k].strip()
        text = text[:k]
    return tool_name, tool_args, text
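
The core of the parsing idea is locating the last `Action:` and `Action Input:` markers in the text. The sketch below is a simplified variant written for illustration, not the notebook's parser: `split_action` is a hypothetical helper that only returns the tool name and arguments and skips the stop-word cleanup.

```python
from typing import Tuple


def split_action(text: str) -> Tuple[str, str]:
    """Return (tool_name, tool_args) from the last Action block, or ("", "")."""
    i = text.rfind("\nAction:")
    j = text.rfind("\nAction Input:")
    if not 0 <= i < j:
        # No complete Action / Action Input pair was found.
        return "", ""
    k = text.rfind("\nObservation:")
    if k < j:
        # The stop word may have been cut off; treat the end of text as the boundary.
        k = len(text)
    name = text[i + len("\nAction:") : j].strip()
    args = text[j + len("\nAction Input:") : k].strip()
    return name, args


sample = 'Thought: I need the weather.\nAction: get_weather\nAction Input: {"city_name": "London"}\nObservation:'
print(split_action(sample))
print(split_action("Thought: I now know the final answer\nFinal Answer: 42"))
```

An empty tool name signals that the model produced a final answer rather than a tool request.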

## Create tools calling

[back to top ⬆️](#Table-of-contents:)

In this example, we will create two customized tools for `image generation` and `weather query`. A detailed description of these tools should be defined in JSON format, which will be used as part of the prompt.

In [15]:
tools = [
    {
        "name_for_human": "get weather",
        "name_for_model": "get_weather",
        "description_for_model": "Get the current weather in a given city name.",
        "parameters": [
            {
                "name": "city_name",
                "description": "City name",
                "required": True,
                "schema": {"type": "string"},
            }
        ],
    },
    {
        "name_for_human": "image generation",
        "name_for_model": "image_gen",
        "description_for_model": "AI painting (image generation) service, input text description, and return the image URL drawn based on text information.",
        "parameters": [
            {
                "name": "prompt",
                "description": "describe the image",
                "required": True,
                "schema": {"type": "string"},
            }
        ],
    },
]
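
Since each tool entry marks its parameters as `required`, the arguments produced by the model could be checked before the tool is called. This is an optional sketch and not something the notebook does: `missing_required_args` is a hypothetical helper, and it uses the standard `json` module and assumes strictly valid JSON input.

```python
import json
from typing import Dict, List


def missing_required_args(tool_info: Dict, tool_args: str) -> List[str]:
    """List required parameter names that are absent from the JSON arguments."""
    args = json.loads(tool_args)
    return [
        param["name"]
        for param in tool_info["parameters"]
        if param.get("required") and param["name"] not in args
    ]


# Trimmed copy of a tool entry, kept small for the example.
weather_tool = {
    "name_for_model": "get_weather",
    "parameters": [
        {"name": "city_name", "description": "City name", "required": True, "schema": {"type": "string"}}
    ],
}

print(missing_required_args(weather_tool, '{"city_name": "London"}'))
print(missing_required_args(weather_tool, '{"city": "London"}'))
```

A non-empty result could be reported back to the model as an observation instead of raising an error, which is one possible design choice.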

Then we should implement these tools with inputs and outputs, and execute them according to the output of the LLM.

In [16]:
def call_tool(tool_name: str, tool_args: str) -> str:
    if tool_name == "get_weather":
        city_name = json5.loads(tool_args)["city_name"]
        key_selection = {
            "current_condition": [
                "temp_C",
                "FeelsLikeC",
                "humidity",
                "weatherDesc",
                "observation_time",
            ],
        }
        resp = requests.get(f"https://wttr.in/{city_name}?format=j1")
        resp.raise_for_status()
        resp = resp.json()
        ret = {k: {_v: resp[k][0][_v] for _v in v} for k, v in key_selection.items()}
        return str(ret)
    elif tool_name == "image_gen":
        import urllib.parse

        tool_args = tool_args.replace("(", "").replace(")", "")
        prompt = json5.loads(tool_args)["prompt"]
        prompt = urllib.parse.quote(prompt)
        return json.dumps(
            {"image_url": f"https://image.pollinations.ai/prompt/{prompt}"},
            ensure_ascii=False,
        )
    else:
        raise NotImplementedError


def llm_with_tool(prompt: str, history, list_of_tool_info=()):
    chat_history = [(x["user"], x["bot"]) for x in history] + [(prompt, "")]

    planning_prompt = build_input_text(chat_history, list_of_tool_info)
    text = ""
    while True:
        output = text_completion(planning_prompt + text, stop_words=["Observation:", "Observation:\n"])
        action, action_input, output = parse_latest_tool_call(output)
        if action:
            observation = call_tool(action, action_input)
            output += f"\nObservation: {observation}\nThought:"
            observation = f"{observation}\nThought:"
            print(observation)
            text += output
        else:
            text += output
            break

    new_history = []
    new_history.extend(history)
    new_history.append({"user": prompt, "bot": text})
    return text, new_history
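
The control flow of this loop can be exercised without a model or network access. The sketch below replays it with a scripted stand-in for the LLM and a fake tool; `run_react_loop`, `fake_llm`, `fake_weather`, and the `max_steps` guard are hypothetical names and additions used only for this illustration.

```python
from typing import Callable, Dict, List


def run_react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    prompt: str,
    max_steps: int = 5,
) -> str:
    """Alternate between model output and tool calls until no Action is produced."""
    text = ""
    for _ in range(max_steps):
        output = llm(prompt + text)
        i, j = output.rfind("\nAction:"), output.rfind("\nAction Input:")
        if not 0 <= i < j:
            # No tool requested: treat the output as the final answer.
            return text + output
        action = output[i + len("\nAction:") : j].strip()
        action_input = output[j + len("\nAction Input:") :].strip()
        observation = tools[action](action_input)
        # Feed the observation back so the next step can reason over it.
        text += f"{output}\nObservation: {observation}\nThought:"
    return text


# Scripted stand-in for the model: first asks for a tool, then answers.
scripted: List[str] = [
    "\nThought: I need the weather.\nAction: fake_weather\nAction Input: London",
    " I now know the final answer\nFinal Answer: It is sunny in London.",
]
calls: List[str] = []


def fake_llm(full_prompt: str) -> str:
    calls.append(full_prompt)
    return scripted[len(calls) - 1]


transcript = run_react_loop(
    fake_llm,
    {"fake_weather": lambda city: f"sunny in {city}"},
    "Question: weather in London?",
)
print(transcript)
```

The second call to the stand-in model receives the prompt extended with the observation, which is what lets a real model continue its reasoning from the tool result.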

## Run agent

[back to top ⬆️](#Table-of-contents:)

In [17]:
history = []
query = "get the weather in London, and create a picture of Big Ben based on the weather information"

response, history = llm_with_tool(prompt=query, history=history, list_of_tool_info=tools)

Thought: I should use the get_weather API to get the weather information for London and then use the image_gen API to create a picture of Big Ben based on that information.
Action: get_weather
Action Input: {"city_name": "London"}
Observation"}
{'current_condition': {'temp_C': '1', 'FeelsLikeC': '-2', 'humidity': '93', 'weatherDesc': [{'value': 'Overcast'}], 'observation_time': '06:29 AM'}}
Thought:
 I should use the weather information to create a picture of Big Ben. The current temperature is 1 degree Celsius, it feels like -2 degrees Celsius, the humidity is 93%, and the weather description is Overcast.
Action: image_gen
Action Input: {"prompt": "Big Ben in London, overcast weather, 1 degree Celsius, feels like -2 degrees Celsius, humidity 93%"}
Observation}
{"image_url": "https://image.pollinations.ai/prompt/Big%20Ben%20in%20London%2C%20overcast%20weather%2C%201%20degree%20Celsius%2C%20feels%20like%20-2%20degrees%20Celsius%2C%20humidity%2093%25"}
Thought:
 I now have the image URL 

## Create AI agent demo with Gradio UI

[back to top ⬆️](#Table-of-contents:)

In [18]:
from threading import Thread, Event
import queue


class IterableStreamer(openvino_genai.StreamerBase):
    """
    A custom streamer class for handling token streaming and detokenization with buffering.

    Attributes:
        tokenizer (Tokenizer): The tokenizer used for encoding and decoding tokens.
        tokens_cache (list): A buffer to accumulate tokens for detokenization.
        text_queue (Queue): A synchronized queue for storing decoded text chunks.
        print_len (int): The length of the printed text to manage incremental decoding.
    """

    def __init__(self, tokenizer):
        """
        Initializes the IterableStreamer with the given tokenizer.

        Args:
            tokenizer (Tokenizer): The tokenizer to use for encoding and decoding tokens.
        """
        super().__init__()
        self.tokenizer = tokenizer
        self.tokens_cache = []
        self.text_queue = queue.Queue()
        self.print_len = 0

    def __iter__(self):
        """
        Returns the iterator object itself.
        """
        return self

    def __next__(self):
        """
        Returns the next value from the text queue.

        Returns:
            str: The next decoded text chunk.

        Raises:
            StopIteration: If there are no more elements in the queue.
        """
        value = self.text_queue.get()  # get() will be blocked until a token is available.
        if value is None:
            raise StopIteration
        return value

    def get_stop_flag(self):
        """
        Checks whether the generation process should be stopped.

        Returns:
            bool: Always returns False in this implementation.
        """
        return False

    def put_word(self, word: str):
        """
        Puts a word into the text queue.

        Args:
            word (str): The word to put into the queue.
        """
        self.text_queue.put(word)

    def put(self, token_id: int) -> bool:
        """
        Processes a token and manages the decoding buffer. Adds decoded text to the queue.

        Args:
            token_id (int): The token_id to process.

        Returns:
            bool: True if generation should be stopped, False otherwise.
        """
        self.tokens_cache.append(token_id)
        text = self.tokenizer.decode(self.tokens_cache)

        word = ""
        if len(text) > self.print_len and "\n" == text[-1]:
            # Flush the cache after the new line symbol.
            word = text[self.print_len :]
            self.tokens_cache = []
            self.print_len = 0
        elif len(text) > 0 and text[-1] == chr(65533):
            # Don't print incomplete text: a trailing replacement character
            # (U+FFFD) means the last symbol is only partially decoded.
            pass
        elif len(text) > self.print_len:
            # It is possible to have a shorter text after adding new token.
            # Print to output only if text length is increased.
            word = text[self.print_len :]
            self.print_len = len(text)
        self.put_word(word)

        if self.get_stop_flag():
            # When generation is stopped from streamer then end is not called, need to call it here manually.
            self.end()
            return True  # True means stop generation
        else:
            return False  # False means continue generation

    def end(self):
        """
        Flushes residual tokens from the buffer and puts a None value in the queue to signal the end.
        """
        text = self.tokenizer.decode(self.tokens_cache)
        if len(text) > self.print_len:
            word = text[self.print_len :]
            self.put_word(word)
            self.tokens_cache = []
            self.print_len = 0
        self.put_word(None)

    def reset(self):
        self.tokens_cache = []
        self.text_queue = queue.Queue()
        self.print_len = 0


class ChunkStreamer(IterableStreamer):

    def __init__(self, tokenizer, tokens_len=4):
        super().__init__(tokenizer)
        self.tokens_len = tokens_len

    def put(self, token_id: int) -> bool:
        if (len(self.tokens_cache) + 1) % self.tokens_len != 0:
            self.tokens_cache.append(token_id)
            return False
        return super().put(token_id)


def run_chatbot(history):
    """
    callback function for running chatbot on submit button click

    Params:
      history: conversation history

    """
    chat_history = [(history[-1][0], "")]

    prompt = build_input_text(chat_history, tools)
    text = ""
    while True:
        planning_prompt = prompt + text
        im_end = "<|im_end|>"
        stop_words = ["Observation:", "Observation:\n"]
        if im_end not in stop_words:
            stop_words = stop_words + [im_end]
        streamer = ChunkStreamer(pipe.get_tokenizer())
        config = openvino_genai.GenerationConfig()
        config.max_new_tokens = 2000
        config.stop_strings = set(stop_words)

        stream_complete = Event()

        def generate_and_signal_complete():
            """
            generation function executed in a separate thread
            """
            streamer.reset()
            pipe.generate(planning_prompt, config, streamer)
            stream_complete.set()
            streamer.end()

        t1 = Thread(target=generate_and_signal_complete)
        t1.start()

        output = ""
        output_gui = ""
        show_response = False
        for new_text in streamer:
            output += new_text
            if "Final" in new_text:
                show_response = True
                idx = new_text.find("Final")
                new_text = new_text[idx:]
            if show_response:
                output_gui += new_text
                history[-1][1] = output_gui
                yield history

        # assert buffer.startswith(prompt)
        for stop_str in stop_words:
            idx = output.find(stop_str)
            if idx != -1:
                output = output[: idx + len(stop_str)]
        print(output)
        action, action_input, output = parse_latest_tool_call(output)
        if action:
            observation = call_tool(action, action_input)
            output += f"\nObservation: {observation}\nThought:"
            observation = f"{observation}\nThought:"
            print(observation)
            text += output
        else:
            text += output
            break


def stop(streamer):
    if streamer is not None:
        streamer.end()
    return None
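
The streamer classes above rely on a common pattern: a worker thread pushes items into a queue, and the consumer iterates until a `None` sentinel arrives. The following sketch isolates that pattern with plain strings; `QueueIterator` and `producer` are hypothetical names, and no tokenizer or model is involved.

```python
import queue
from threading import Thread


class QueueIterator:
    """Iterate over items pushed from another thread; None marks the end."""

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, item):
        self._queue.put(item)

    def end(self):
        # The sentinel tells the consumer that no more items will come.
        self._queue.put(None)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # Blocks until the producer pushes something.
        if item is None:
            raise StopIteration
        return item


def producer(stream: QueueIterator):
    # Stand-in for a generation thread that emits text chunks.
    for chunk in ["Final", " Answer:", " done"]:
        stream.put(chunk)
    stream.end()


stream = QueueIterator()
worker = Thread(target=producer, args=(stream,))
worker.start()
received = "".join(stream)
worker.join()
print(received)
```

Because `get()` blocks, the consumer loop can start before the producer has emitted anything, which is what allows a UI to update while generation is still running.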

In [None]:
if not Path("gradio_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/llm-agent-react/gradio_helper.py")
    open("gradio_helper.py", "w").write(r.text)

from gradio_helper import make_demo

examples = [
    ["Based on current weather in Beijing, show me a picture of Great Wall through its URL"],
    ["Create an image of pink cat and return its URL"],
    ["What is the weather like in New York now ?"],
]

demo = make_demo(run_fn=run_chatbot, stop_fn=stop, examples=examples)

try:
    demo.launch()
except Exception:
    demo.launch(share=True)
# If you are launching remotely, specify server_name and server_port
# EXAMPLE: `demo.launch(server_name='your server name', server_port='server port in int')`
# To learn more please refer to the Gradio docs: https://gradio.app/docs/

In [45]:
# please uncomment and run this cell for stopping gradio interface
# demo.close()

Closing server running on port: 5612
