Humans are excellent at **recognizing complex patterns**, but they often rely on tools like books, Google Search, or calculators to supplement their knowledge. Similarly, **Generative AI models** can be trained to **use external tools to access real-time information** or **perform real-world actions**.

For example, a model can access a database to get a customer's *purchase history* and generate *personalized recommendations*, or it can make API calls to send *emails* or complete *financial transactions*.

To do this, the model **must not only** have **access to external tools**, but **must also be able to plan and execute tasks autonomously**. This combination of reasoning, logic, and access to external information creates the concept of an **"agent,"** which is a **program that extends the capabilities of a generative model**.

## What are **AI agents**?
A generative **AI agent** is an application that seeks to achieve a goal by **observing the world and interacting with it** through the tools at its disposal.

**Agents are autonomous** and can act **without human intervention**, especially when they have well-defined goals.



## The Model, Tools, and Orchestration Layer in AI Agents
### The Model 

In the context of an agent, the model refers to the **language model (LM)** that serves as the **agent's decision-making center**. It can be composed of one or more models of various sizes, **capable of following instruction-based reasoning and logic frameworks such as ReAct, Chain-of-Thought, or Tree-of-Thoughts**.

### The Tools
**Language models**, while powerful in generating text and images, **cannot interact directly with the outside world**. To overcome this limitation, tools allow agents to access external data and perform real actions.

**The tools can have different levels of complexity** and **generally operate via web APIs** with **methods such as GET, POST, PATCH, and DELETE**. For example, an agent can use a tool to update a customer's information in a database or to obtain weather data to provide travel recommendations.

With the tools, agents **can support advanced systems such as Retrieval-Augmented Generation (RAG)**, which expands the capabilities of the base model and allows it to work with up-to-date and specific information.

<small>"Retrieval-Augmented Generation (RAG) enhances generative AI models by integrating information retrieval. This allows large language models (LLMs) to respond to queries using both their training data and up-to-date information from external documents."</small>

### The Orchestration Layer
The **orchestration layer defines the cyclical process** by which the agent **acquires information, reasons, and makes decisions**. This cycle continues until the agent reaches a goal or checkpoint.
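
The cycle described above can be sketched as a small loop. This is only an illustration under stated assumptions: `AgentState`, `run_agent`, `toy_decide`, and the `lookup_weather` tool are hypothetical names, and the decision policy is a hard-coded stand-in for a language model.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    goal: str
    observations: list[str] = field(default_factory=list)
    done: bool = False


def run_agent(state, decide, tools, max_steps=5):
    """Repeat reason -> act -> observe until the goal is reached or steps run out."""
    for _ in range(max_steps):
        tool_name, tool_input = decide(state)      # reasoning step
        if tool_name == "finish":                  # goal reached: leave the loop
            state.done = True
            break
        result = tools[tool_name](tool_input)      # acting step
        state.observations.append(result)          # newly acquired information
    return state


def toy_decide(state):
    """Hypothetical policy standing in for the language model."""
    if not state.observations:
        return "lookup_weather", "Zurich"
    return "finish", ""


# Hypothetical tool: a real one would call an external API.
toy_tools = {"lookup_weather": lambda city: f"Sunny in {city}"}

final_state = run_agent(AgentState(goal="Plan a trip"), toy_decide, toy_tools)
print(final_state.observations)
```

The `max_steps` cap is a design choice for the sketch: it guarantees the loop ends even if the policy never decides to finish.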

## Agents vs. Models

To understand the distinction between agents and models, consider the following table:

| **Models** | **Agents** |
|------------|-----------|
| Knowledge is limited to what is available in their training data. | Knowledge is extended through the connection with external systems via tools. |
| Single inference/prediction based on the user query. No management of session history or continuous context (e.g., chat history). | Managed session history (e.g., chat history) enables multi-turn inference and decision-making within the orchestration layer. A "turn" represents one user query and one agent response. |
| No native tool implementation. | Tools are natively implemented in the agent architecture. |
| No native logic layer implemented. Users must rely on structured prompts or reasoning frameworks (CoT, ReAct, etc.) to guide model predictions. | Native cognitive architecture integrates reasoning frameworks like CoT, ReAct, or pre-built agent frameworks like LangChain. |


### **Cognitive Architectures: How AI Agents Operate**

The idea behind the **cognitive architectures** of AI agents is similar to that of a **chef in a busy kitchen**. The chef's goal is to create delicious dishes through a cycle of **information gathering, planning, execution, and adaptation**.

- **Information Gathering**: The chef checks customer orders and available ingredients.
- **Reasoning**: Decides which dishes can be prepared.
- **Action**: Starts cooking, adjusting quantities and preparation methods.
- **Adaptation**: If an ingredient runs out or a customer provides feedback, the chef modifies the recipe in real-time.

AI agents operate with a similar logic: **they process information, make decisions, and refine their actions based on the results obtained**. At the core of this architecture is the **orchestration layer**, which manages memory, state, reasoning, and planning. Orchestration is driven by **prompt engineering frameworks** that enhance the agent's interaction with the environment.

---

### **Reasoning Frameworks for AI Agents**
Agents can use different **reasoning approaches**, including:

1. **ReAct (Reason + Act)**
   - A framework that allows agents to **reason before acting**, improving response consistency and reducing hallucinations.
   - In the original paper it outperformed several reasoning-only and acting-only baselines, and it made model behavior easier for humans to interpret and trust.

2. **Chain-of-Thought (CoT)**
   - A technique that breaks down reasoning into **intermediate steps**, helping models perform complex tasks.
   - Includes variants like **self-consistency, active-prompt, and multimodal CoT**, each with strengths and limitations.

3. **Tree-of-Thoughts (ToT)**
   - Generalizes the logic of **Chain-of-Thought**, allowing agents to **explore multiple reasoning paths** to solve strategic problems.
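
As an illustration of one Chain-of-Thought variant mentioned above, self-consistency can be sketched as sampling several reasoning paths and keeping the most frequent final answer. The sampler here is a hypothetical stub that returns canned answers; in practice each sample would come from a separate model call with a non-zero temperature.

```python
from collections import Counter


def self_consistent_answer(sample_fn, question, n_samples=5):
    """Sample several final answers and return the majority answer with its vote count."""
    answers = [sample_fn(question, i) for i in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes


def toy_sampler(question, seed):
    """Hypothetical stand-in for a model call: returns only the final answer of each path."""
    canned = ["18", "18", "17", "18", "20"]
    return canned[seed % len(canned)]


answer, votes = self_consistent_answer(toy_sampler, "How many apples are left?", 5)
print(answer, votes)
```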

---

### **Example of an Agent with ReAct**
When a user sends a request, an agent using the **ReAct** framework follows this process:

1. **The user sends a query to the agent.**
2. **The agent initiates the ReAct sequence:**
   - **Thought**: Reasons about what to do next based on the query.
   - **Action**: Decides on an action (e.g., search for flights, perform a web search, write code).
   - **Tools**: Selects an appropriate tool, such as a flight API, and the input to pass to it.
   - **Observation**: Obtains the tool's results and updates its reasoning.
3. **The Thought → Action → Observation cycle continues until an optimal response is obtained.**
4. **The agent provides a final response to the user based on real data.**
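
The steps above can be sketched as a loop over a growing transcript. This is a minimal sketch, not a real agent: `scripted_model` is a hypothetical stand-in for an LLM that returns pre-written steps, and `search_flights` is a made-up tool that returns a fixed string instead of calling a flight API.

```python
def scripted_model(transcript):
    """Hypothetical LLM stub: picks the next step from what the transcript already contains."""
    if "Observation:" not in transcript:
        return {
            "thought": "I need flight options first.",
            "action": "search_flights",
            "action_input": "AUS->ZRH",
        }
    return {
        "thought": "I have enough information to answer.",
        "action": "finish",
        "action_input": "The cheapest AUS->ZRH option has 1 stop.",
    }


def search_flights(route):
    """Hypothetical tool: a real implementation would query a flight API."""
    return f"2 flights found for {route}; cheapest has 1 stop"


TOOLS = {"search_flights": search_flights}


def react(query, model, tools, max_turns=4):
    """Run the Thought -> Action -> Observation cycle until the model finishes."""
    transcript = f"Question: {query}\n"
    for _ in range(max_turns):
        step = model(transcript)
        transcript += f"Thought: {step['thought']}\n"
        transcript += f"Action: {step['action']}[{step['action_input']}]\n"
        if step["action"] == "finish":
            return step["action_input"], transcript
        observation = tools[step["action"]](step["action_input"])
        transcript += f"Observation: {observation}\n"
    return None, transcript


answer, transcript = react("Find a flight from Austin to Zurich", scripted_model, TOOLS)
print(transcript)
```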

---

### **Conclusion**
The quality of agent responses depends on their **ability to reason and select the right tools**. Like a chef using fresh ingredients and listening to customer feedback, an AI agent must **rely on reliable information and a solid decision-making process** to deliver accurate and useful results.

In the next section, the document delves into **how agents connect to up-to-date data to further improve their responses**.

The **ReAct** (Reasoning and Acting) framework is a prompting technique for large language models (LLMs) that combines reasoning with action to improve performance on complex tasks. Introduced by Yao et al. in 2022, ReAct allows models to generate both reasoning traces and task-specific actions in an interleaved manner, enabling greater synergy between the two. Reasoning traces help the model induce, track, and update action plans, as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information.

A typical ReAct prompt consists of a task description, few-shot examples showing the desired format, and alternating steps of "Thought", "Action", and "Observation". For example, for a complex question, the model might:

1. **Thought**: Determine what information is needed.
2. **Action**: Perform a search to obtain that information.
3. **Observation**: Analyze the search results.
4. **Thought**: Synthesize the gathered information.
5. **Action**: Provide the final answer.

This approach allows the model to dynamically update its knowledge and make more informed decisions, improving the accuracy and interpretability of responses. ReAct has been successfully applied to various tasks, including question answering, fact-checking, and decision-making in complex environments, outperforming methods based solely on reasoning or action.
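
To make the prompt structure concrete, here is a rough sketch of assembling a ReAct-style prompt and parsing an action out of the model's text. The template layout and the `Name[input]` action syntax are assumptions chosen for illustration; real implementations vary in both.

```python
import re

# Assumed layout: task description, few-shot examples, then the new question.
REACT_TEMPLATE = """{task}

{examples}

Question: {question}
Thought:"""

# Matches lines such as "Action: Search[capital of Switzerland]".
ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)


def build_prompt(task, examples, question):
    """Fill the template; examples are pre-formatted Thought/Action/Observation traces."""
    return REACT_TEMPLATE.format(
        task=task, examples="\n\n".join(examples), question=question
    )


def parse_action(model_output):
    """Return (tool_name, tool_input) from the first Action line, or None if absent."""
    match = ACTION_RE.search(model_output)
    if match is None:
        return None
    return match.group(1), match.group(2)


example_trace = (
    "Question: Who wrote Hamlet?\n"
    "Thought: I should look this up.\n"
    "Action: Search[Hamlet author]\n"
    "Observation: William Shakespeare wrote Hamlet.\n"
    "Thought: I know the answer.\n"
    "Action: Finish[William Shakespeare]"
)
prompt = build_prompt(
    "Answer the question using Thought, Action, Observation steps.",
    [example_trace],
    "What is the capital of Switzerland?",
)
parsed = parse_action("Thought: I need data.\nAction: Search[capital of Switzerland]")
print(parsed)
```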

### **Tools: The Key to the External World**

**Language models** are excellent at processing information, but **they cannot directly interact with the real world**. This limits their usefulness when interaction with external systems or up-to-date data is required.

To overcome this limitation, there are various tools that connect the AI model with the external world, including:
- **Functions**
- **Extensions**
- **Data Stores**
- **Plugins**

These tools allow agents to perform **real-time actions**, such as:
- Adjusting smart home settings
- Updating calendars
- Retrieving information from a database
- Sending emails

Currently, Google's models interact with **three main types of tools**:
1. **Extensions**
2. **Functions**
3. **Data Stores**

---

### **Extensions**
**Extensions** are the bridge between an **AI agent and an API**, standardizing API calls to make them more flexible and scalable.

#### **Practical Example: Flight Booking**
- A user asks: *"I want to book a flight from Austin to Zurich."*
- The agent needs to parse the text and call the **Google Flights** API.
- If the user **does not provide the departure city**, the API would fail due to missing essential data.

A **rigid approach** with custom code might fail in such cases, requiring continuous updates to handle exceptions. **Extensions solve this problem** by teaching the agent:
- **How to use the API** with practical examples
- **Which parameters** are necessary for API calls

---

### **Advantages of Extensions**
- **More robust and flexible than custom code**
- **Built independently of the agent**, so they can be integrated dynamically
- **Allow selecting the most suitable extension based on the user's request**
- **Google provides preconfigured Extensions** to simplify integration, such as a **Code Interpreter** to execute Python code from a textual description

---

### **Conclusion**
Tools like **Extensions** enable AI agents to **interact with the real world**, enhancing their usefulness and accuracy. Thanks to these tools, an AI agent can **dynamically choose the best method to solve a request**, just as a software developer would select an appropriate API for a problem.

### Step 1: Import the Necessary Libraries
First, let's import the required libraries:

- `vertexai` to interact with Google's Vertex AI.
- `pprint` to print the results in a readable format.

```python
%pip install vertexai
```

Note: you may need to restart the kernel to use the updated packages.




```python
import vertexai  # Vertex AI SDK, used to send requests to Google's AI models
import pprint  # formats output so it is easier to read

print(vertexai.__version__)
```

Output: `1.71.1`


```python
PROJECT_ID = "YOUR_PROJECT_ID"  # Replace with your actual Project ID
REGION = "us-central1"  # Region where Vertex AI is active

vertexai.init(project=PROJECT_ID, location=REGION)
```

```python
# Import and load the Code Interpreter extension
from vertexai.preview.extensions import Extension

# Extension.from_hub("code_interpreter") loads the Code Interpreter extension,
# which lets the agent generate Python code on demand and execute it.
extension_code_interpreter = Extension.from_hub("code_interpreter")
```

If Application Default Credentials are not configured, this cell fails with:

```text
DefaultCredentialsError: Your default credentials were not found. To set up Application Default Credentials, see https://cloud.google.com/docs/authentication/external/set-up-adc for more information.
```

```python
# Create a request to generate code.
# CODE_QUERY holds the question sent to the agent:
# here, an algorithm to invert a binary tree.
CODE_QUERY = """Write a python method to invert a binary tree in O(n) time."""
```

```python
# Execute the request with the Code Interpreter.
# operation_id="generate_and_execute" asks the extension to both generate and run the code.
# operation_params={"query": CODE_QUERY} passes the question to the extension.
response = extension_code_interpreter.execute(
    operation_id="generate_and_execute",
    operation_params={"query": CODE_QUERY},
)

print("Generated Code:")
pprint.pprint(response["generated_code"])
```
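
The notebook does not show what the extension returned for this query. For comparison only, here is a hand-written sketch of the task the query describes; it is not the extension's output, and the `Node` class is an assumption made for the example. Each node is visited once, so the running time is O(n).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def invert(root):
    """Swap the left and right subtrees of every node, visiting each node once."""
    if root is None:
        return None
    root.left, root.right = invert(root.right), invert(root.left)
    return root


tree = Node(1, Node(2, Node(4)), Node(3))
inverted = invert(tree)
print(inverted.left.value, inverted.right.value)
```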

### **Functions: Modular Tools for AI Agents**

In the world of software development, **functions** are self-contained code modules that perform specific tasks and can be reused. A developer defines:
- **Which function to use** (e.g., `function_a` or `function_b`).
- **Which parameters to provide** and **what results to expect**.

In **AI agents**, functions work similarly, with the difference that the **choice of function and arguments is managed by the model** itself, rather than by a human developer.

---

### **Differences Between Functions and Extensions**
1. **Functions do not directly execute API calls**
   - The model outputs a function name and its arguments but does not make the API call itself.
   - APIs are called from the **client-side** (e.g., an external application), not by the AI agent.

2. **Functions are executed client-side, while extensions are agent-side**
   - **Extensions** are integrated into the agent's architecture and directly interact with external APIs.
   - **Functions** are executed by **client-side software**, giving the programmer more control over the interaction with APIs.

---

### **Example: Flight Booking with Functions vs. Extensions**
- With an **Extension**, the agent directly calls the **Google Flights API** and returns the results to the user.
- With a **Function**, the agent decides which function to use, but the API call is handled **elsewhere**, such as in the **client backend or middleware**.
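
The Function path can be sketched as follows: the model only emits a function name and arguments, and client code decides how and when to run it. The JSON shape of `model_output` and the `get_flights` helper are assumptions for illustration; a real client function would call the flight API.

```python
import json


def get_flights(origin, destination):
    """Hypothetical client-side function; a real one would call the flight API here."""
    return [f"{origin}->{destination} 08:15", f"{origin}->{destination} 17:40"]


# Registry of functions the client is willing to execute.
CLIENT_FUNCTIONS = {"get_flights": get_flights}

# Assumed shape of what the model emits: a name plus arguments, no execution.
model_output = (
    '{"name": "get_flights", "args": {"origin": "Austin", "destination": "Zurich"}}'
)


def execute_on_client(raw_call, registry):
    """Parse the model's function call and run it in client code."""
    call = json.loads(raw_call)
    func = registry[call["name"]]  # raises KeyError for functions the client does not expose
    return func(**call["args"])


flights = execute_on_client(model_output, CLIENT_FUNCTIONS)
print(flights)
```

Because execution happens in `execute_on_client`, the client can add authentication, review, or batching around the call without changing the agent.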

📌 **Why use Functions instead of Extensions?**
- **If the API call needs to occur at another level** of the application (e.g., frontend, middleware).
- **For security or authentication reasons**, if the API is not directly accessible by the agent.
- **For timing or ordering constraints**, such as batch operations or human review, where the agent cannot call the API in real time.
- **To transform data before passing it to the agent**, if the API does not offer filtering tools.
- **To develop and test agents without modifying existing APIs** (e.g., simulating API calls).

---

### **Conclusion**
**Functions** give developers more control, separating the AI agent's logic from the API infrastructure, while **Extensions** allow for more direct and automated integration. The choice depends on the **security constraints, data flow, and customization needs** of the application.
In summary, **AI agents operate similarly to a programming script** but with an additional level of intelligence and automation. **They autonomously decide which functions to call and with which arguments**, allowing them to respond effectively and dynamically to user requests.

### **Data Stores: The Solution for Dynamic Data in AI Agents**

**Language models** are like a **large library** containing everything learned during training. However, **they cannot automatically update with new information**, which is a limitation when real-world knowledge changes.

**Data Stores solve this problem** by providing access to **real-time updated data**, keeping the model's responses **more accurate and relevant**.

---

### **How Do Data Stores Work?**
Imagine a **developer** needing to provide additional data to an AI agent, such as an **Excel sheet, a PDF, or a structured database**.

**With Data Stores**, the agent can directly access this data without needing to:
- **Manually convert data formats**
- **Retrain the model**
- **Perform fine-tuning**

**How does the integration work?**
1. The **Data Store** converts the document into **vector embeddings** (a mathematical representation of the data).
2. The agent uses these embeddings to **extract useful information** when needed.
3. The model can then **combine the Data Store's data with its pre-trained knowledge** to generate more accurate responses.

---

### **Applications of Data Stores**
**Data Stores** are often implemented as **vector databases**, allowing the agent to retrieve information as needed.

One of the most common uses is in **RAG (Retrieval-Augmented Generation) applications**, which **extend** the model's knowledge with updated data from:
- **Websites**
- **Structured data** (CSV, Excel, database tables)
- **Unstructured data** (HTML, PDF, Word, TXT, digitized documents)

**Example of RAG in action:**
1. The user sends a question to the AI agent.
2. The query is converted into **vector embeddings**.
3. These embeddings are **compared** with those in the **vector database**.
4. The most relevant content is retrieved and passed to the AI agent.
5. The agent combines the information and provides a response to the user.
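
These steps can be sketched with a toy in-memory index. The word-count "embedding" below is a deliberate simplification (real systems use a learned embedding model and a vector database), and the documents and function names are made up for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


DOCUMENTS = [
    "refund policy: refunds are issued within 14 days",
    "shipping times: orders ship within 2 business days",
    "warranty: hardware is covered for one year",
]
INDEX = [(doc, embed(doc)) for doc in DOCUMENTS]  # stand-in for a vector database


def retrieve(query, index, k=1):
    """Embed the query, compare it with stored embeddings, return the top-k documents."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_rag_prompt(query, index):
    """Combine the retrieved content with the user's question for the model."""
    context = "\n".join(retrieve(query, index))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_rag_prompt("how long do refunds take", INDEX))
```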

---

### **Conclusion**
**Data Stores allow AI agents to access updated and accurate information without needing to retrain the model.**
- They are ideal for applications **based on documents, knowledge bases, and structured data retrieval**.
- **RAG + ReAct** enhances the decision-making process of AI agents, making them **smarter and contextually aware**.

**Final result**: AI agents can **query a dynamic database** to obtain responses based on updated and relevant data, instead of relying on static knowledge.

### **Improving Model Performance with Targeted Learning**

To make the best use of an AI model, it is essential that it knows how to **choose the right tools** when generating output. However, many real-world situations require **additional knowledge** beyond what is contained in the training data.

**Similar to cooking**, a chef may have basic knowledge, but to master a specific cuisine (e.g., Japanese), they need **targeted learning**.

There are **three main approaches** to provide models with this specific knowledge:

---

### **1. In-Context Learning**
The model **learns in real-time** by being provided with:
- A **prompt** (like a recipe).
- Some **tools** (ingredients).
- **Few-shot examples** (examples of similar dishes).

**Example (chef):**
The chef receives a recipe with a few ingredients and uses their general knowledge to prepare the dish **on the fly**.

**Example (AI):**
An AI model uses the **ReAct framework**, which allows it to **reason and act dynamically** based on examples in the prompt.
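
A minimal sketch of what "providing examples in the prompt" can look like: the instruction, the few-shot examples, and the new query are concatenated into one prompt, and no model weights change. The `Q:`/`A:` layout and the example content are assumptions for illustration.

```python
# Hypothetical few-shot examples pairing a request with the tool that should handle it.
FEW_SHOT_EXAMPLES = [
    ("What is the weather in Rome?", "weather_lookup"),
    ("Send an email to my manager.", "send_email"),
]


def few_shot_prompt(instruction, examples, query):
    """Build a prompt from an instruction, worked examples, and the new query."""
    lines = [instruction, ""]
    for question, answer in examples:
        lines += [f"Q: {question}", f"A: {answer}", ""]
    lines += [f"Q: {query}", "A:"]
    return "\n".join(lines)


prompt = few_shot_prompt(
    "Choose the tool that should handle each request.",
    FEW_SHOT_EXAMPLES,
    "Book a flight from Austin to Zurich.",
)
print(prompt)
```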

---

### **2. Retrieval-Based In-Context Learning**
The model **dynamically searches** for the most relevant information from an **external memory** (Data Store).
- Retrieves useful data, tools, and examples to respond to the query.

**Example (chef):**
The chef has access to **a pantry full of ingredients and cookbooks**. They can choose what to use based on the customer's request, combining **experience and new information**.

**Example (AI):**
An AI agent with **RAG architecture** can query **databases, documents, and APIs** to retrieve updated information before generating a response.
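
The retrieval step can also be applied to the examples themselves: rather than hard-coding few-shot examples, the agent picks the stored examples closest to the incoming query. The sketch below uses simple word overlap as the similarity measure, which is an assumption made to keep it self-contained; a real system would typically compare embeddings.

```python
# Hypothetical example store; in practice this would live in a Data Store.
EXAMPLE_STORE = [
    {"query": "book a flight to Paris", "tool": "flight_search"},
    {"query": "what is the weather in Rome", "tool": "weather_lookup"},
    {"query": "send an email to my manager", "tool": "send_email"},
]


def overlap(a, b):
    """Jaccard similarity between the word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def select_examples(query, store, k=1):
    """Return the k stored examples most similar to the query."""
    return sorted(store, key=lambda ex: overlap(query, ex["query"]), reverse=True)[:k]


selected = select_examples("book a flight from Austin to Zurich", EXAMPLE_STORE)
print(selected)
```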

---

### **3. Fine-Tuning Based Learning**
The model is **trained further** on a larger dataset of specific examples **before receiving user queries**.
- This method allows it to be **well-prepared** for specific tasks.

**Example (chef):**
The chef **goes back to school** to specialize in an exotic cuisine, so in the future, they can cook complex dishes without improvising.

**Example (AI):**
An AI model is **fine-tuned** on specific data (e.g., a medical dataset) to become **an expert in that domain**, improving its performance without needing to retrieve external data.

---

### **Conclusion**
**Each method has pros and cons in terms of speed, cost, and latency.**
By combining **In-Context Learning, Retrieval-Based Learning, and Fine-Tuning**, we can achieve AI agents that are more **adaptable and accurate**, capable of using the **best of each strategy**.

### **📌 Summary: Fundamentals and Future of AI Agents**

This **whitepaper** explored the **fundamental concepts** of generative AI agents, their **structures**, and the most effective ways to implement them through **cognitive architectures**.

---

### **1. AI Agents Enhance Language Models**
**Agents extend the capabilities of language models** by using tools that allow them to:
   - **Access real-time information**.
   - **Suggest and plan actions in the real world**.
   - **Perform complex tasks autonomously**.

An agent can use **one or more language models** to:
   - **Manage complex states** (logical transitions between actions).
   - **Use external tools to solve problems** that a language model alone could not address.

---

### **2. The Orchestration Layer is the Core of the AI Agent**
The **orchestration layer** is a **cognitive structure** that guides the agent in:
   - **Reasoning**
   - **Planning**
   - **Decision-making**
   - **Generating responses based on logic and data**

**Reasoning techniques** include:
   - **ReAct** (Reason + Act) → To improve responses and reduce errors.
   - **Chain-of-Thought (CoT)** → To break down thinking into logical steps.
   - **Tree-of-Thoughts (ToT)** → To explore multiple solution paths.

---

### **3. Tools Connect Agents to the Real World**
**AI agents use tools** to interact with external systems and **access up-to-date data**.

**Types of tools:**
- **Extensions** → Connect agents to **external APIs** to obtain real-time data.
- **Functions** → Let the model **generate a function call and its arguments**, which are then executed **client-side**.
- **Data Stores** → Provide access to **structured and unstructured data**, enabling data-driven applications.

---

### **4. The Future of AI Agents**
The field of AI agents is **constantly evolving**, and much of its potential is still **unexplored**.

**Future developments will include:**
- **Improved tools and reasoning capabilities** → Agents will become increasingly accurate and effective.
- **Chaining of specialized agents** → Creation of **networks of agents**, each expert in a specific domain.
- **Advanced applications in various sectors** → Business, healthcare, industrial automation, and more.

---

### **Conclusion**
Creating **complex AI agents requires an iterative process** of **continuous experimentation and optimization**.
**There is no perfect agent**: the underlying models are generative, and the right architecture depends on specific needs.
By combining **language models, external tools, and orchestration strategies**, we can build **AI applications capable of generating real-world value**.

# **Endnotes**

1. **Yao, S. et al., 2022**  
   *ReAct: Synergizing Reasoning and Acting in Language Models*.  
   Available at: [https://arxiv.org/abs/2210.03629](https://arxiv.org/abs/2210.03629)

2. **Wei, J., Wang, X. et al., 2023**  
   *Chain-of-Thought Prompting Elicits Reasoning in Large Language Models*.  
   Available at: [https://arxiv.org/pdf/2201.11903.pdf](https://arxiv.org/pdf/2201.11903.pdf)

3. **Wang, X. et al., 2022**  
   *Self-Consistency Improves Chain of Thought Reasoning in Language Models*.  
   Available at: [https://arxiv.org/abs/2203.11171](https://arxiv.org/abs/2203.11171)

4. **Diao, S. et al., 2023**  
   *Active Prompting with Chain-of-Thought for Large Language Models*.  
   Available at: [https://arxiv.org/pdf/2302.12246.pdf](https://arxiv.org/pdf/2302.12246.pdf)

5. **Zhang, H. et al., 2023**  
   *Multimodal Chain-of-Thought Reasoning in Language Models*.  
   Available at: [https://arxiv.org/abs/2302.00923](https://arxiv.org/abs/2302.00923)

6. **Yao, S. et al., 2023**  
   *Tree of Thoughts: Deliberate Problem Solving with Large Language Models*.  
   Available at: [https://arxiv.org/abs/2305.10601](https://arxiv.org/abs/2305.10601)

7. **Long, X., 2023**  
   *Large Language Model Guided Tree-of-Thought*.  
   Available at: [https://arxiv.org/abs/2305.08291](https://arxiv.org/abs/2305.08291)

8. **Google**  
   *Google Gemini Application*.  
   Available at: [http://gemini.google.com](http://gemini.google.com)

9. **Swagger**  
   *OpenAPI Specification*.  
   Available at: [https://swagger.io/specification/](https://swagger.io/specification/)

10. **Xie, M., 2022**  
    *How does in-context learning work? A framework for understanding the differences from traditional supervised learning*.  
    Available at: [https://ai.stanford.edu/blog/understanding-incontext/](https://ai.stanford.edu/blog/understanding-incontext/)

11. **Google Research**  
    *ScaNN (Scalable Nearest Neighbors)*.  
    Available at: [https://github.com/google-research/google-research/tree/master/scann](https://github.com/google-research/google-research/tree/master/scann)

12. **LangChain**  
    *LangChain Documentation*.  
    Available at: [https://python.langchain.com/v0.2/docs/introduction/](https://python.langchain.com/v0.2/docs/introduction/)


# **Glossary of AI Agent Terminology**

### **A**
- **Agent**: A system that extends the capabilities of a language model by incorporating reasoning, tool usage, and real-world actions.
- **Agent Chaining**: A method where multiple specialized agents collaborate, each handling a specific domain or task.
- **Agent Orchestration**: The framework that manages an agent’s reasoning, planning, and decision-making process.

### **C**
- **Chain-of-Thought (CoT)**: A reasoning technique where a model breaks down a problem into intermediate steps to improve logical inference.
- **Client-Side Execution**: When API calls and logic are handled by the application using the AI agent, rather than the agent itself.

### **D**
- **Data Stores**: Storage systems that provide structured or unstructured data to an agent, allowing real-time access to updated knowledge.
- **Decision-Making**: The process through which an AI agent determines the next action based on reasoning and available information.

### **E**
- **Embedding Model**: A machine learning model that converts text or other data into a numerical vector representation for efficient searching and retrieval.
- **Extensions**: Tools that connect agents to external APIs, enabling them to retrieve real-time information or execute API calls.

### **F**
- **Fine-Tuning**: The process of training a language model on additional specific data to enhance its performance in a particular domain.
- **Function Calling**: A method where an AI model generates parameters for a function, but the execution occurs outside the model, typically on a client-side system.

### **H**
- **Human-in-the-Loop (HITL)**: A method where humans supervise, validate, or intervene in the decision-making process of an AI model.

### **I**
- **In-Context Learning (ICL)**: A technique where a model learns how to perform a task dynamically by being provided with examples at inference time.

### **L**
- **Language Model (LM)**: An AI system trained to process and generate text based on vast amounts of training data.
- **LangChain**: A framework for building AI-powered applications by linking language models with external tools and logic.
- **LangGraph**: A library used to create structured agent workflows and reasoning paths.

### **M**
- **Middleware**: A software layer that connects different applications, such as AI agents and databases, allowing seamless data transfer and processing.
- **Multi-Modal Chain-of-Thought**: A CoT reasoning approach that incorporates various data types, such as text and images, to improve decision-making.

### **O**
- **Orchestration Layer**: The central component of an agent that organizes its reasoning and tool usage, ensuring efficient task execution.

### **P**
- **Prompt Engineering**: The process of designing and structuring inputs to guide an AI model in generating better responses.

### **R**
- **ReAct (Reasoning + Acting)**: A framework where an AI agent first reasons about a task before taking an action, enhancing decision-making and reducing errors.
- **Retrieval-Augmented Generation (RAG)**: A technique where an AI model retrieves external information before generating a response.
- **Reasoning Frameworks**: Structured approaches like ReAct, CoT, and Tree-of-Thoughts that guide an AI's logical processing.
- **Reinforcement Learning from Human Feedback (RLHF)**: A training method where human preferences help fine-tune an AI model’s responses.

### **S**
- **Self-Consistency**: A method that improves reasoning in AI models by generating multiple solutions and selecting the most consistent one.
- **Stop Sequence**: A token or string that, once generated, tells the model to stop producing further text.

### **T**
- **Tool**: A function, API, or external system that an AI agent can use to enhance its capabilities and interact with real-world data.
- **Tree-of-Thoughts (ToT)**: A structured reasoning method where an AI explores multiple possible reasoning paths before making a decision.

### **U**
- **Unstructured Data**: Information that is not stored in a predefined format, such as text files, images, or PDFs.

### **V**
- **Vector Database**: A database that stores information as high-dimensional vectors, enabling efficient similarity searches.

### **W**
- **Workflow Automation**: The use of AI agents to manage and streamline repetitive business processes.

