# 1. Introduction to Agents

## Objective

- Understanding Agents
    - What is an Agent, and how does it work?
    - How do Agents make decisions using reasoning and planning?

- The Role of LLMs (Large Language Models) in Agents

    - How LLMs serve as the “brain” behind an Agent.
    - How LLMs structure conversations via the Messages system.

- Tools and Actions
    - How Agents use external tools to interact with the environment.
    - How to build and integrate tools for your Agent.
    
- The Agent Workflow:
    - Think → Act → Observe.


After exploring these topics, **you’ll build your first Agent** using smolagents!

Your Agent, named Alfred, will handle a simple task and demonstrate how to apply these concepts in practice.

You’ll even learn how to **publish your Agent on Hugging Face Spaces**, so you can share it with friends and colleagues.

Finally, at the end of this Unit, you’ll take a quiz. Pass it, and you’ll **earn your first course certification**: the 🎓 Certificate of Fundamentals of Agents.

<img src="../assets/cetification_fundamentals.png" alt="Certification Fundamentals" width="400px" />


## Unit Outline

This Unit is your essential starting point, laying the groundwork for understanding Agents before you move on to more advanced topics.

<img src="../assets/Unit1_outline.png" alt="Unit 1 Outline" width="600px" />


### Step 1: What is an Agent?

An Agent is a system that can perceive its environment, reason about it, and take actions to achieve specific goals.

- Agents understand natural language and can interact with users in a conversational manner.
- They can also use external tools to perform tasks, such as searching the web, accessing databases, or executing code.
- Agents can learn from their experiences and improve their performance over time.
- They can also adapt to new situations and environments, making them versatile and flexible.

**Agent working logic:**

Agents fulfill a request or task by following a simple workflow:
1. **Think**: The Agent uses *reasoning and planning* to understand the task and decide on the best course of action.
2. **Act**: The Agent takes action by using external tools or APIs to perform the task. For this, it is usually given a set of tools along with descriptions of how to use them.
3. **Observe**: The Agent observes the results of its actions and uses this feedback to improve its performance in the future.
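
The three steps above can be sketched as a small loop. This is only an illustration under stated assumptions: `fake_llm_think` is a hypothetical stand-in for a real LLM call, and `get_weather` is a hypothetical tool with a hard-coded result so that the example runs offline.

```python
from typing import Callable

# Hypothetical stand-in for an LLM: it "thinks" by choosing the next step
# from what has been observed so far. A real Agent would call a model here.
def fake_llm_think(task: str, observations: list[str]) -> dict:
    if not observations:
        return {"tool": "get_weather", "args": {"city": "Paris"}}
    return {"final_answer": f"{task} -> {observations[-1]}"}

# Hypothetical tool with a canned result, so the example needs no network.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS: dict[str, Callable[..., str]] = {"get_weather": get_weather}

def run_agent(task: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        decision = fake_llm_think(task, observations)         # Think
        if "final_answer" in decision:
            return decision["final_answer"]
        result = TOOLS[decision["tool"]](**decision["args"])  # Act
        observations.append(result)                           # Observe
    return "Stopped: step limit reached"

answer = run_agent("Weather in Paris?")
print(answer)
```

The `max_steps` limit is a design choice for the sketch: it keeps the loop from running forever if the model never produces a final answer.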

#### **Standard Definition:**

*Short*

An Agent is **an AI model capable of reasoning, planning, and interacting with its environment.**

*Long*

An Agent is a system that **can perceive its environment**, **reason** about it, and **take actions** to achieve specific goals. It uses **reasoning and planning** to understand tasks, takes **action using external tools**, and **learns from its experiences** to improve performance over time. 

It is called an Agent because it has agency, that is, the ability to interact with the environment.

<img src="../assets/Unit1_step1_Agent_Outline.png.png" alt="Agent" width="700px" />


#### Formal Definition AI Agent

"An Agent is a system that leverages an AI model to interact with its environment in order to achieve a user-defined objective. It combines reasoning, planning, and the execution of actions (often via external tools) to fulfill tasks."

The Agent has two parts:

- AI model as its brain
- Environment as its body (capabilities and tools)

#### Spectrum of "Agency"

| Agency Level | Description                                               | What that’s called | Example pattern                                    |
|--------------|-----------------------------------------------------------|--------------------|----------------------------------------------------|
| ☆☆☆          | Agent output has no impact on program flow                | Simple processor   | `process_llm_output(llm_response)`                 |
| ★☆☆          | Agent output determines basic control flow                | Router             | `if llm_decision(): path_a() else: path_b()`       |
| ★★☆          | Agent output determines function execution                | Tool caller        | `run_function(llm_chosen_tool, llm_chosen_args)`   |
| ★★★          | Agent output controls iteration and program continuation  | Multi-step Agent   | `while llm_should_continue(): execute_next_step()` |
| ★★★          | One agentic workflow can start another agentic workflow   | Multi-Agent        | `if llm_trigger(): execute_agent()`                |
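
To make two rows of the table concrete, here is a minimal sketch of the Router and Tool caller patterns. The LLM outputs are hard-coded, hypothetical values (`llm_route`, `llm_tool_call`), and the helper names are invented for illustration; no real model is involved.

```python
# Hypothetical LLM outputs, hard-coded so the sketch runs without a model.
llm_route = "billing"
llm_tool_call = {"tool": "add", "args": {"a": 2, "b": 3}}

# Router: the model's output only selects which branch of the program runs.
def route(decision: str) -> str:
    if decision == "billing":
        return "forwarded to billing"
    return "forwarded to support"

# Tool caller: the model's output selects a function and its arguments.
def add(a: int, b: int) -> int:
    return a + b

def multiply(a: int, b: int) -> int:
    return a * b

TOOLS = {"add": add, "multiply": multiply}

def call_tool(call: dict) -> int:
    return TOOLS[call["tool"]](**call["args"])

routed = route(llm_route)
tool_result = call_tool(llm_tool_call)
print(routed, tool_result)
```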



#### What type of AI Models do we use for Agents?

The most common AI model found in Agents is an LLM (Large Language Model), which takes Text as an input and outputs Text as well.

Well-known examples are GPT-4 from OpenAI, Llama from Meta, Gemini from Google, etc. These models have been trained on a vast amount of text and are able to generalize well. We will learn more about LLMs in the next section.

**It's also possible to use models that accept other inputs as the Agent's core model. For example, a Vision Language Model (VLM) is like an LLM but also understands images as input. We'll focus on LLMs for now and will discuss other options later.**

#### How does an AI take action on its environment?

LLMs are amazing models, but they can only generate text.

However, if you ask a well-known chat application like HuggingChat or ChatGPT to generate an image, they can! How is that possible?

The answer is that the developers of HuggingChat, ChatGPT and similar apps implemented additional functionality (called Tools), that the LLM can use to create images.


#### What type of tasks can an Agent do?

An Agent can perform any task we implement via Tools to complete Actions.

For example, if I write an Agent to act as my personal assistant (like Siri) on my computer, and I ask it to “send an email to my Manager asking to delay today’s meeting”, I can give it some code to send emails. This will be a new Tool the Agent can use whenever it needs to send an email. We can write it in Python:

   ```python
   def send_message_to(recipient, message):
       """Useful to send an e-mail message to a recipient"""
       pass
   ```

The LLM, as we’ll see, will generate code to run the tool when it needs to, and thus fulfill the desired task.

   ```python
   send_message_to("Manager", "Can we postpone today's meeting?")
   ```

The design of the Tools is very important and has a great impact on the quality of your Agent. Some tasks will require very specific Tools to be crafted, while others may be solved with general-purpose tools like “web_search”.

Note that Actions are not the same as Tools. An Action, for instance, can involve the use of multiple Tools to complete.

Letting an agent interact with its environment is what makes it useful in real life for companies and individuals.


##### Example 1: Personal Virtual Assistants

Virtual assistants like Siri, Alexa, or Google Assistant, work as agents when they interact on behalf of users using their digital environments.

They take user queries, analyze context, retrieve information from databases, and provide responses or initiate actions (like setting reminders, sending messages, or controlling smart devices).

##### Example 2: Customer Service Chatbots

Many companies deploy chatbots as agents that interact with customers in natural language.

These agents can answer questions, guide users through troubleshooting steps, open issues in internal databases, or even complete transactions.

Their predefined objectives might include improving user satisfaction, reducing wait times, or increasing sales conversion rates. By interacting directly with customers, learning from the dialogues, and adapting their responses over time, they demonstrate the core principles of an agent in action.

##### Example 3: AI Non-Playable Character in a video game

AI agents powered by LLMs can make Non-Playable Characters (NPCs) more dynamic and unpredictable.

Instead of following rigid behavior trees, they can respond contextually, adapt to player interactions, and generate more nuanced dialogue. This flexibility helps create more lifelike, engaging characters that evolve alongside the player’s actions.


#### Summary:

To summarize, an Agent is a system that uses an AI Model (typically an LLM) as its core reasoning engine, to:

- Understand natural language: Interpret and respond to human instructions in a meaningful way.

- Reason and plan: Analyze information, make decisions, and devise strategies to solve problems.

- Interact with its environment: Gather information, take actions, and observe the results of those actions.

Now that you have a solid grasp of what Agents are, let’s reinforce your understanding with a short, ungraded quiz. After that, we’ll dive into the “Agent’s brain”: the LLMs.

### Step 2: What are LLMs?

<img src="../assets/Unit1_step2.png" alt="LLM" width="700px" />

In the previous section we learned that each Agent needs an AI Model at its core, and that LLMs are the most common type of AI models for this purpose.

Now we will learn what LLMs are and how they power Agents.

This section offers a concise technical explanation of the use of LLMs. If you want to dive deeper, you can check our free Natural Language Processing Course.


#### What is a Large Language Model?

An LLM is a type of AI model that excels at understanding and generating human language. LLMs are trained on vast amounts of text data, allowing them to learn patterns, structure, and even nuance in language. These models typically consist of many millions of parameters.

Most LLMs nowadays are built on the Transformer architecture, a deep learning architecture based on the “Attention” algorithm that has gained significant interest since the release of BERT from Google in 2018.

There are 3 types of transformers:

1. Encoders

    An encoder-based Transformer takes text (or other data) as input and outputs a dense representation (or embedding) of that text.

    - Example: BERT from Google
    - Use Cases: Text classification, semantic search, Named Entity Recognition
    - Typical Size: Millions of parameters

2. Decoders

    A decoder-based Transformer focuses on generating new tokens to complete a sequence, one token at a time.

    - Example: Llama from Meta
    - Use Cases: Text generation, chatbots, code generation
    - Typical Size: Billions (in the US sense, i.e., 10^9) of parameters

3. Seq2Seq (Encoder-Decoder)

    A sequence-to-sequence Transformer combines an encoder and a decoder. The encoder first processes the input sequence into a context representation, then the decoder generates an output sequence.

    - Example: T5, BART
    - Use Cases: Translation, Summarization, Paraphrasing
    - Typical Size: Millions of parameters

Although Large Language Models come in various forms, LLMs are typically decoder-based models with billions of parameters. Here are some of the most well-known LLMs:


| Model        | Provider                    |
|--------------|-----------------------------|
| Deepseek-R1  | DeepSeek                    |
| GPT-4        | OpenAI                      |
| Llama 3      | Meta (Facebook AI Research) |
| SmolLM2      | Hugging Face                |
| Gemma        | Google                      |
| Mistral      | Mistral                     |


The underlying principle of an LLM is simple yet highly effective: its objective is to predict the next token, given a sequence of previous tokens. A “token” is the unit of information an LLM works with. You can think of a “token” as if it was a “word”, but for efficiency reasons LLMs don’t use whole words.

For example, while English has an estimated 600,000 words, an LLM might have a vocabulary of around 32,000 tokens (as is the case with Llama 2). Tokenization often works on sub-word units that can be combined.

For instance, consider how the tokens “interest” and “ing” can be combined to form “interesting”, or “ed” can be appended to form “interested.”
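
The idea of combining sub-word units can be illustrated with a toy tokenizer. This is a simplified sketch that uses greedy longest-match over a tiny, invented vocabulary (`VOCAB`); real tokenizers learn their vocabulary from data with algorithms such as BPE and behave differently in detail.

```python
# Toy vocabulary, invented for illustration; real vocabularies are learned from data.
VOCAB = ["interest", "ing", "ed", "s"]

def tokenize(word: str, vocab: list[str]) -> list[str]:
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so "interest" wins over shorter pieces.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No piece matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("interesting", VOCAB))
print(tokenize("interested", VOCAB))
```

Both words share the token “interest”, which is why a vocabulary of sub-word pieces can cover many words with far fewer entries than a word-level vocabulary.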

Each LLM has some special tokens specific to the model. The LLM uses these tokens to open and close the structured components of its generation, for example to indicate the start or end of a sequence, message, or response. Moreover, the input prompts that we pass to the model are also structured with special tokens. The most important of those is the end-of-sequence (EOS) token.

The forms of special tokens are highly diverse across model providers.

The table below illustrates the diversity of special tokens.

| Model       | Provider                       | EOS Token              | Functionality                    |
|-------------|--------------------------------|-------------------------|----------------------------------|
| GPT-4       | OpenAI                         | `<|endoftext|>`         | End of message text              |
| Llama 3     | Meta (Facebook AI Research)    | `<|eot_id|>`            | End of sequence                  |
| Deepseek-R1 | DeepSeek                       | `<|end_of_sentence|>`   | End of message text              |
| SmolLM2     | Hugging Face                   | `<|im_end|>`            | End of instruction or message    |
| Gemma       | Google                         | `<end_of_turn>`         | End of conversation turn         |


We do not expect you to memorize these special tokens, but it is important to appreciate their diversity and the role they play in the text generation of LLMs. If you want to know more about special tokens, you can check out the configuration of the model in its Hub repository. For example, you can find the special tokens of the SmolLM2 model in its `tokenizer_config.json`.
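
To see how special tokens structure a prompt, here is a minimal sketch that wraps each message in start and end markers. It assumes ChatML-style markers (`<|im_start|>` and `<|im_end|>`); the exact markers and layout differ per model, so treat `render_chat` as a hypothetical helper and check the model's own configuration for the real format.

```python
# Assumed ChatML-style markers; check the model's tokenizer_config.json for the real ones.
MSG_START = "<|im_start|>"
MSG_END = "<|im_end|>"

def render_chat(messages: list[dict]) -> str:
    """Turn a list of {"role", "content"} messages into one prompt string."""
    parts = [f"{MSG_START}{m['role']}\n{m['content']}{MSG_END}\n" for m in messages]
    # Leave an open assistant turn so the model generates the reply next.
    return "".join(parts) + f"{MSG_START}assistant\n"

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```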


#### Understanding next token prediction

LLMs are said to be autoregressive, meaning that the output from one pass becomes the input for the next one. This loop continues until the model predicts the next token to be the EOS token, at which point the model can stop.

In other words, an LLM will decode text until it reaches the EOS. But what happens during a single decoding loop?

While the full process is more technical than we need for learning about Agents, here’s a brief overview:

- Once the input text is tokenized, the model computes a representation of the sequence that captures information about the meaning and the position of each token in the input sequence.
- This representation goes into the model, which outputs scores that rank the likelihood of each token in its vocabulary as being the next one in the sequence.

Based on these scores, we have multiple strategies to select the tokens to complete the sentence.

- The easiest decoding strategy would be to always take the token with the maximum score.
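
Greedy decoding and the autoregressive loop can be sketched with a toy model. Everything here is made up for illustration: `NEXT_TOKEN_SCORES` is a lookup table rather than a neural network, and it only looks at the last token, whereas a real LLM conditions on the whole sequence.

```python
# Toy "model": maps the last token to scores for possible next tokens.
# The tokens and scores are invented for illustration.
EOS = "<eos>"
NEXT_TOKEN_SCORES = {
    "The": {"capital": 0.9, "city": 0.1},
    "capital": {"of": 0.8, "is": 0.2},
    "of": {"France": 0.7, "Spain": 0.3},
    "France": {"is": 0.9, EOS: 0.1},
    "is": {"Paris": 0.95, EOS: 0.05},
    "Paris": {EOS: 1.0},
}

def greedy_decode(prompt: list[str], max_new_tokens: int = 10) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)  # take the highest-scoring token
        if next_token == EOS:
            break                                 # stop once EOS is predicted
        tokens.append(next_token)                 # output becomes part of the next input
    return tokens

print(" ".join(greedy_decode(["The"])))
```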

You can interact with the decoding process yourself with SmolLM2 in this Space (remember, it decodes until reaching an EOS token, which is `<|im_end|>` for this model):

<img src="../assets/Unit1_step2_Example_Tokenization_Decoding.png" alt="Decoding" width="700px" />

- But there are more advanced decoding strategies. For example, beam search explores multiple candidate sequences to find the one with the maximum total score, even if some individual tokens have lower scores.
    - Beam Search Parameters:
        - Sentence to decode from (inputs): the input sequence to your decoder.
        - Number of steps (max_new_tokens): the number of tokens to generate.
        - Number of beams (num_beams): the number of beams to use.
        - Length penalty (length_penalty): the length penalty to apply to outputs. length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences. This parameter will not impact the beam search paths, but only influence the choice of sequences in the end towards longer or shorter sequences.
        - Number of return sequences (num_return_sequences): the number of sequences to be returned at the end of generation. Should be <= num_beams.
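
A minimal sketch of beam search on a toy model shows why keeping several candidates can beat greedy decoding. The probabilities in `PROBS` are invented, the model only looks at the last token, and the sketch deliberately leaves out `length_penalty` and `num_return_sequences`; it is not how any particular library implements beam search.

```python
import math

EOS = "<eos>"
# Made-up next-token probabilities, keyed by the last token only.
PROBS = {
    "<bos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {EOS: 1.0}, "y": {EOS: 1.0}, "z": {EOS: 1.0}, "w": {EOS: 1.0},
}

def beam_search(num_beams: int, max_new_tokens: int) -> list[tuple[list[str], float]]:
    beams = [(["<bos>"], 0.0)]  # (tokens, total log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == EOS:
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for tok, p in PROBS[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the num_beams best partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

best_greedy = beam_search(num_beams=1, max_new_tokens=3)[0]
best_beam = beam_search(num_beams=2, max_new_tokens=3)[0]
print(best_greedy[0], round(math.exp(best_greedy[1]), 2))
print(best_beam[0], round(math.exp(best_beam[1]), 2))
```

With one beam the search commits to “a” because it has the best first-step score, while two beams keep “b” alive long enough to discover that “b z” has a higher total probability.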


#### Attention is all you need

A key aspect of the Transformer architecture is Attention. When predicting the next word, not every word in a sentence is equally important; words like “France” and “capital” in the sentence “The capital of France is …” carry the most meaning.

This process of identifying the most relevant words to predict the next token has proven to be incredibly effective.
Although the basic principle of LLMs (predicting the next token) has remained consistent since GPT-2, there have been significant advancements in scaling neural networks and making the attention mechanism work for longer and longer sequences.
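
As a rough illustration of how attention weighs tokens, the sketch below computes scaled dot-product attention for a single query in plain Python. The 2-dimensional vectors are made up for the example; in a real Transformer the queries, keys, and values are learned projections with many more dimensions and multiple heads.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity of the query with each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Made-up 2-dimensional vectors for three tokens.
tokens = ["The", "capital", "France"]
keys = [[0.1, 0.0], [1.0, 0.5], [0.9, 1.0]]
values = keys
query = [1.0, 1.0]

weights, output = attention(query, keys, values)
for tok, w in zip(tokens, weights):
    print(f"{tok}: {w:.2f}")
```

The weights sum to 1, and with these invented vectors “France” and “capital” receive more weight than “The”.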

If you’ve interacted with LLMs, you’re probably familiar with the term context length, which refers to the maximum number of tokens the LLM can process, and the maximum attention span it has.


#### Prompting the LLM is important

Considering that the only job of an LLM is to predict the next token by looking at every input token, and to choose which tokens are “important”, the wording of your input sequence is very important.

The input sequence you provide an LLM is called a prompt. Careful design of the prompt makes it easier to guide the generation of the LLM toward the desired output.


#### How are LLMs trained?

LLMs are trained on large datasets of text, where they learn to predict the next word in a sequence through a self-supervised or masked language modeling objective.

From this unsupervised learning, the model learns the structure of the language and underlying patterns in text, allowing the model to generalize to unseen data.
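
To make the next-token objective concrete, the sketch below computes the loss for a single prediction. It assumes the standard cross-entropy formulation, where the loss is the negative log of the probability assigned to the true next token; the scores are invented, and `next_token_loss` is a hypothetical helper rather than part of any library.

```python
import math

def next_token_loss(logits: dict[str, float], target: str) -> float:
    """Cross-entropy for one prediction: -log(probability of the true next token)."""
    m = max(logits.values())  # subtract the max for numerical stability
    total = sum(math.exp(v - m) for v in logits.values())
    prob_target = math.exp(logits[target] - m) / total
    return -math.log(prob_target)

# Made-up scores for the token following "The capital of France is".
good_logits = {"Paris": 4.0, "London": 1.0, "banana": -2.0}
bad_logits = {"Paris": 0.0, "London": 1.0, "banana": 2.0}

loss_good = next_token_loss(good_logits, "Paris")
loss_bad = next_token_loss(bad_logits, "Paris")
print(f"{loss_good:.3f} {loss_bad:.3f}")
```

Training nudges the model's parameters so that this loss goes down across the dataset, which means assigning higher probability to the tokens that actually come next.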

After this initial pre-training, LLMs can be fine-tuned on a supervised learning objective to perform specific tasks. For example, some models are trained for conversational structures or tool usage, while others focus on classification or code generation.


#### How can I use LLMs?

You have two main options:

1. Run Locally (if you have sufficient hardware).

2. Use a Cloud/API (e.g., via the Hugging Face Serverless Inference API).

Throughout this course, we will primarily use models via APIs on the Hugging Face Hub. Later on, we will explore how to run these models locally on your hardware.


#### How are LLMs used in AI Agents?

LLMs are a key component of AI Agents, providing the foundation for understanding and generating human language.

They can interpret user instructions, maintain context in conversations, define a plan and decide which tools to use.

We will explore these steps in more detail in this Unit, but for now, what you need to understand is that the LLM is the brain of the Agent.


----

That was a lot of information! We’ve covered the basics of what LLMs are, how they function, and their role in powering AI agents.

If you’d like to dive even deeper into the fascinating world of language models and natural language processing, don’t hesitate to check out our free [NLP course](https://huggingface.co/learn/llm-course/chapter1/1).

Now that we understand how LLMs work, it’s time to see how LLMs structure their generations in a conversational context.

To run [this notebook](https://huggingface.co/agents-course/notebooks/blob/main/dummy_agent_library.ipynb), you need a Hugging Face token that you can get from https://hf.co/settings/tokens.

For more information on how to run Jupyter Notebooks, checkout [Jupyter Notebooks on the Hugging Face Hub](https://huggingface.co/docs/hub/notebooks).

You also need to request access to the [Meta Llama models](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).

### Step 3: What are Tools?

<img src="../assets/Unit1_step3.png" alt="tools" width="700px" />


One crucial aspect of AI Agents is their ability to take actions. As we saw, this happens through the use of Tools.

In this section, we’ll learn what Tools are, how to design them effectively, and how to integrate them into your Agent via the System Message.

By giving your Agent the right Tools, and clearly describing how those Tools work, you can dramatically increase what your AI can accomplish. Let’s dive in!


#### What are AI Tools?
A Tool is a function given to the LLM. This function should fulfill a clear objective.

Here are some commonly used tools in AI agents:

| Tool             | Description                                                   |
|------------------|---------------------------------------------------------------|
| Web Search       | Allows the agent to fetch up-to-date information from the internet. |
| Image Generation | Creates images based on text descriptions.                   |
| Retrieval        | Retrieves information from an external source.               |
| API Interface    | Interacts with an external API (GitHub, YouTube, Spotify, etc.). |


Those are only examples, as you can in fact create a tool for any use case!

A good tool should be something that complements the power of an LLM.

For instance, if you need to perform arithmetic, giving a calculator tool to your LLM will provide better results than relying on the native capabilities of the model.

Furthermore, LLMs predict the completion of a prompt based on their training data, which means that their internal knowledge only includes events prior to their training. Therefore, if your agent needs up-to-date data you must provide it through some tool.

For instance, if you ask an LLM directly (without a search tool) for today’s weather, the LLM will potentially hallucinate random weather.

A Tool should contain:

- A textual description of what the function does.
- A Callable (something to perform an action).
- Arguments with typings.
- (Optional) Outputs with typings.

#### How do tools work?

LLMs, as we saw, can only receive text inputs and generate text outputs. They have no way to call tools on their own. When we talk about providing tools to an Agent, we mean teaching the LLM about the existence of these tools and instructing it to generate text-based invocations when needed.

For example, if we provide a tool to check the weather at a location from the internet and then ask the LLM about the weather in Paris, the LLM will recognize that this is an opportunity to use the “weather” tool. Instead of retrieving the weather data itself, the LLM will generate text that represents a tool call, such as `weather_tool('Paris')`.

The Agent then reads this response, identifies that a tool call is required, executes the tool on the LLM’s behalf, and retrieves the actual weather data.

The Tool-calling steps are typically not shown to the user: the Agent appends them as a new message before passing the updated conversation to the LLM again. The LLM then processes this additional context and generates a natural-sounding response for the user. From the user’s perspective, it appears as if the LLM directly interacted with the tool, but in reality, it was the Agent that handled the entire execution process in the background.
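
The flow described above can be sketched in a few lines. This is an illustration under assumptions: the LLM output is a hard-coded string, the JSON call format and the `"tool"` message role are choices made for this example rather than a standard, and `weather_tool` returns a canned answer instead of querying the internet.

```python
import json

# Hypothetical tool with a canned answer so the example runs offline.
def weather_tool(location: str) -> str:
    return f"Sunny, 22 degrees in {location}"

TOOLS = {"weather_tool": weather_tool}

# Text an LLM might generate; this JSON call format is an assumption, not a standard.
llm_output = '{"tool": "weather_tool", "arguments": {"location": "Paris"}}'

messages = [{"role": "user", "content": "What is the weather in Paris?"}]

def parse_tool_call(text: str):
    """Return the tool call as a dict, or None if the text is a plain answer."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "tool" in call else None

def handle_llm_output(text: str, messages: list[dict]) -> None:
    messages.append({"role": "assistant", "content": text})
    call = parse_tool_call(text)
    if call is not None:
        # The Agent, not the LLM, executes the tool and records the result.
        result = TOOLS[call["tool"]](**call["arguments"])
        messages.append({"role": "tool", "content": result})

handle_llm_output(llm_output, messages)
print(messages[-1])
```

After this step, the Agent would send the updated `messages` list back to the LLM so it can write the final, natural-sounding reply.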

We’ll talk a lot more about this process in future sessions.

#### How do we give tools to an LLM?

The complete answer may seem overwhelming, but we essentially use the system prompt to provide textual descriptions of available tools to the model:

```python
system_message = """You are an AI assistant designed to help users efficiently and accurately. Your primary goal is to provide helpful, precise, and clear responses.

You have access to the following tools:
{tools_description}
"""
```

For this to work, we have to be very precise and accurate about:

1. What the tool does
2. What exact inputs it expects

This is the reason why tool descriptions are usually provided using expressive but precise structures, such as computer languages or JSON. It’s not necessary to do it like that; any precise and coherent format would work.
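
As one possible illustration of a JSON-based description, the sketch below serializes a tool specification and injects it into a system prompt template. The field names (`name`, `description`, `arguments`, `outputs`) and the `weather_tool` entry are assumptions chosen for this example, not a format required by any model or library.

```python
import json

# One possible JSON layout for a tool; the field names are illustrative assumptions.
tools = [
    {
        "name": "weather_tool",
        "description": "Get the current weather at a location.",
        "arguments": {"location": "str"},
        "outputs": "str",
    }
]

SYSTEM_TEMPLATE = """You are an AI assistant designed to help users efficiently and accurately.

You have access to the following tools:
{tools_description}
"""

# One JSON object per line keeps each tool description easy to spot.
tools_description = "\n".join(json.dumps(t) for t in tools)
system_message = SYSTEM_TEMPLATE.format(tools_description=tools_description)
print(system_message)
```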

If this seems too theoretical, let’s understand it through a concrete example.

We will implement a simplified calculator tool that will just multiply two integers. This could be our Python implementation:

```python
def calculator(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b
```

So our tool is called calculator, it multiplies two integers, and it requires the following inputs:

- a (int): An integer.
- b (int): An integer.

The output of the tool is another integer number that we can describe like this:

- (int): The product of a and b.

All of these details are important. Let’s put them together in a text string that describes our tool for the LLM to understand.

        Tool Name: calculator, Description: Multiply two integers., Arguments: a: int, b: int, Outputs: int

"Reminder: This textual description is what we want the LLM to know about the tool."


When we pass the previous string as part of the input to the LLM, the model will recognize it as a tool, and will know what it needs to pass as inputs and what to expect from the output.

If we want to provide additional tools, we must be consistent and always use the same format. This process can be fragile, and we might accidentally overlook some details.

Is there a better way?

#### Auto-formatting Tool sections

Our tool was written in Python, and the implementation already provides everything we need:

- A descriptive name of what it does: calculator
- A longer description, provided by the function’s docstring comment: Multiply two integers.
- The inputs and their type: the function clearly expects two ints.
- The type of the output.

There’s a reason people use programming languages: they are expressive, concise, and precise.

We could provide the Python source code as the specification of the tool for the LLM, but the way the tool is implemented does not matter. All that matters is its name, what it does, the inputs it expects and the output it provides.

We will use Python’s introspection features to read the source code and build a tool description automatically. All we need is for the tool implementation to use type hints, docstrings, and sensible function names. We will write some code to extract the relevant portions from the source code.

After we are done, we’ll only need to use a Python decorator to indicate that the calculator function is a tool:

```python
@tool
def calculator(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

print(calculator.to_string())
```

Note the @tool decorator before the function definition.

With the implementation we’ll see next, we will be able to retrieve the following text automatically from the source code via the `to_string()` function provided by the decorator:

        Tool Name: calculator, Description: Multiply two integers., Arguments: a: int, b: int, Outputs: int

As you can see, it’s the same thing we wrote manually before!


#### Generic Tool implementation

We create a generic Tool class that we can reuse whenever we need to use a tool. This class will be responsible for generating the textual description of the tool, and for calling the function when needed.

Disclaimer: This example implementation is fictional but closely resembles real implementations in most libraries.


```python
from typing import Callable


class Tool:
    """
    A class representing a reusable piece of code (Tool).

    Attributes:
        name (str): Name of the tool.
        description (str): A textual description of what the tool does.
        func (callable): The function this tool wraps.
        arguments (list): A list of (argument name, argument type) pairs.
        outputs (str or list): The return type(s) of the wrapped function.
    """
    def __init__(self,
                 name: str,
                 description: str,
                 func: Callable,
                 arguments: list,
                 outputs: str):
        self.name = name
        self.description = description
        self.func = func
        self.arguments = arguments
        self.outputs = outputs

    def to_string(self) -> str:
        """
        Return a string representation of the tool,
        including its name, description, arguments, and outputs.
        """
        args_str = ", ".join([
            f"{arg_name}: {arg_type}" for arg_name, arg_type in self.arguments
        ])

        return (
            f"Tool Name: {self.name},"
            f" Description: {self.description},"
            f" Arguments: {args_str},"
            f" Outputs: {self.outputs}"
        )

    def __call__(self, *args, **kwargs):
        """
        Invoke the underlying function (callable) with provided arguments.
        """
        return self.func(*args, **kwargs)
```


It may seem complicated, but if we go slowly through it we can see what it does. We define a Tool class that includes:

- name (str): The name of the tool.
- description (str): A brief description of what the tool does.
- func (callable): The function the tool executes.
- arguments (list): The expected input parameters, as (name, type) pairs.
- outputs (str or list): The expected outputs of the tool.
- `__call__()`: Calls the function when the tool instance is invoked.
- `to_string()`: Converts the tool’s attributes into a textual representation.

We could create a Tool with this class using code like the following:

```python
calculator_tool = Tool(
    "calculator",                   # name
    "Multiply two integers.",       # description
    calculator,                     # function to call
    [("a", "int"), ("b", "int")],   # inputs (names and types)
    "int",                          # output
)
```

But we can also use Python’s inspect module to retrieve all the information for us! This is what the @tool decorator does.

If you are interested, here is an example implementation of the @tool decorator:

```python
import inspect

def tool(func):
    """
    A decorator that creates a Tool instance from the given function.
    """
    # Get the function signature
    signature = inspect.signature(func)

    # Extract (param_name, param_annotation) pairs for inputs
    arguments = []
    for param in signature.parameters.values():
        annotation_name = (
            param.annotation.__name__
            if hasattr(param.annotation, '__name__')
            else str(param.annotation)
        )
        arguments.append((param.name, annotation_name))

    # Determine the return annotation
    return_annotation = signature.return_annotation
    if return_annotation is inspect.Signature.empty:
        outputs = "No return annotation"
    else:
        outputs = (
            return_annotation.__name__
            if hasattr(return_annotation, '__name__')
            else str(return_annotation)
        )

    # Use the function's docstring as the description (default if None)
    description = func.__doc__ or "No description provided."

    # The function name becomes the Tool name
    name = func.__name__

    # Return a new Tool instance
    return Tool(
        name=name,
        description=description,
        func=func,
        arguments=arguments,
        outputs=outputs
    )
```

Just to reiterate, with this decorator in place we can implement our tool like this:

```python
@tool
def calculator(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

print(calculator.to_string())
```

And we can use the Tool’s to_string method to automatically retrieve a text suitable to be used as a tool description for an LLM:

        "Tool Name: calculator, Description: Multiply two integers., Arguments: a: int, b: int, Outputs: int"


The description is injected in the system prompt. Taking the example with which we started this section, here is what it would look like after replacing the tools_description:

```python
system_message="""You are an AI assistant designed to help users efficiently and accurately. Your primary goal is to provide helpful, precise, and clear responses.

You have access to the following tools:
Tool Name: calculator, Description: Multiply two integers., Arguments: a: int, b: int, Outputs: int
"""
```
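
The substitution described above can be sketched in a few lines. This is a minimal illustration, not the course's actual implementation: `SYSTEM_TEMPLATE` and `build_system_message` are hypothetical names, and the tool description is hard-coded so the snippet runs on its own.

```python
# Hypothetical template; the {tools_description} placeholder is filled in at runtime.
SYSTEM_TEMPLATE = """You are an AI assistant designed to help users efficiently and accurately. Your primary goal is to provide helpful, precise, and clear responses.

You have access to the following tools:
{tools_description}
"""


def build_system_message(tool_descriptions: list[str]) -> str:
    """Join one description per tool and insert them into the template."""
    return SYSTEM_TEMPLATE.format(tools_description="\n".join(tool_descriptions))


# In the section above, this string would come from calculator.to_string().
calculator_description = (
    "Tool Name: calculator, Description: Multiply two integers., "
    "Arguments: a: int, b: int, Outputs: int"
)

system_message = build_system_message([calculator_description])
print(system_message)
```

With several tools, each description would simply occupy its own line in the prompt.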

In the Actions section, we will learn more about how an Agent can call the tool we just created.

##### Model Context Protocol (MCP): a unified tool interface

Model Context Protocol (MCP) is an open protocol that standardizes how applications provide tools to LLMs. MCP provides:

- A growing list of pre-built integrations that your LLM can directly plug into
- The flexibility to switch between LLM providers and vendors
- Best practices for securing your data within your infrastructure

This means that any framework implementing MCP can leverage tools defined within the protocol, eliminating the need to reimplement the same tool interface for each framework.

----

Tools play a crucial role in enhancing the capabilities of AI agents.

To summarize, we learned:

- What Tools Are: Functions that give LLMs extra capabilities, such as performing calculations or accessing external data.

- How to Define a Tool: By providing a clear textual description, inputs, outputs, and a callable function.

- Why Tools Are Essential: They enable Agents to overcome the limitations of static model training, handle real-time tasks, and perform specialized actions.

Now, we can move on to the Agent Workflow, where you'll see how an Agent observes, thinks, and acts. This brings together everything we've covered so far and sets the stage for creating your own fully functional AI Agent.

But first, it's time for another short quiz!

### Step 4: Agent Workflow

<img src="../assets/Unit1_step4.png" alt="Agent Workflow" width="500px" />

#### Understanding AI Agents through the Thought-Action-Observation Cycle

In the previous sections, we learned:

- How tools are made available to the agent in the system prompt.
- How AI agents are systems that can 'reason', plan, and interact with their environment.

In this section, we'll explore the complete AI Agent Workflow, a cycle we defined as Thought-Action-Observation.

And then, we'll dive deeper into each of these steps.


##### The Core Components

Agents work in a continuous cycle of thinking (Thought) → acting (Act) → observing (Observe).

Let's break down these actions together:

1. Thought: The LLM part of the Agent decides what the next step should be.
2. Action: The agent takes an action, by calling the tools with the associated arguments.
3. Observation: The model reflects on the response from the tool.


##### The Thought-Action-Observation Cycle

The three components work together in a continuous loop. To use an analogy from programming, the agent uses a while loop: the loop continues until the objective of the agent has been fulfilled.
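
The while-loop analogy can be sketched as follows. This is a toy version under stated assumptions: `fake_llm` replays scripted replies instead of calling a real model, `get_weather` is a stub, and the `Action:` / `Final answer:` markers are an illustrative output format rather than any framework's actual convention.

```python
import json


def get_weather(location: str) -> str:
    """Stub tool: returns canned data instead of calling a real API."""
    return f"Current weather in {location}: partly cloudy, 15°C, 60% humidity."


TOOLS = {"get_weather": get_weather}

# Scripted replies standing in for an LLM, so the sketch runs offline.
SCRIPTED_REPLIES = [
    "Thought: I need the current weather for New York.\n"
    'Action: {"action": "get_weather", "action_input": {"location": "New York"}}',
    "Thought: I have the weather data now.\n"
    "Final answer: It is partly cloudy, 15°C, with 60% humidity in New York.",
]


def fake_llm(prompt: str, step: int) -> str:
    """Hypothetical model call: ignores the prompt and replays the script."""
    return SCRIPTED_REPLIES[step]


def run_agent(task: str, max_steps: int = 5) -> str:
    prompt = f"Task: {task}\n"
    step = 0
    while step < max_steps:  # loop until the objective is fulfilled
        reply = fake_llm(prompt, step)  # Thought, plus an Action or an answer
        prompt += reply + "\n"
        if "Final answer:" in reply:
            return reply.split("Final answer:", 1)[1].strip()
        action = json.loads(reply.split("Action:", 1)[1])  # Action
        observation = TOOLS[action["action"]](**action["action_input"])
        prompt += f"Observation: {observation}\n"  # Observation
        step += 1
    return "Stopped: step limit reached."


answer = run_agent("What's the current weather in New York?")
print(answer)
```

The `max_steps` guard is a design choice of this sketch: it keeps the loop from running forever if the model never produces a final answer.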

<img src="../assets/Unit1_step4_TAO_cycle.png" alt="Agent Workflow" width="500px" />

In many Agent frameworks, the rules and guidelines are embedded directly into the system prompt, ensuring that every cycle adheres to a defined logic.

In a simplified version, our system prompt may look like this:

<img src="../assets/Unit1_step4_prompt_example.png" alt="System Prompt" width="500px" />

We see here that in the System Message we defined:

- The Agent's behavior.
- The Tools our Agent has access to, as we described in the previous section.
- The Thought-Action-Observation Cycle, which we bake into the LLM instructions.

##### Example: Alfred the Weather Agent

A user asks Alfred: "What's the current weather in New York?"

Alfred's job is to answer this query using a weather API tool.

Here's how the cycle unfolds:

1. Thought

    - Internal Reasoning:

    Upon receiving the query, Alfred's internal dialogue might be:

    "The user needs current weather information for New York. I have access to a tool that fetches weather data. First, I need to call the weather API to get up-to-date details."

    This step shows the agent breaking the problem into steps: first, gathering the necessary data.



2. Action

    - Tool Usage:

    Based on its reasoning and the fact that Alfred knows about a get_weather tool, Alfred prepares a JSON-formatted command that calls the weather API tool. For example, its first action could be:

    Thought: I need to check the current weather for New York.

    ```json
       {
         "action": "get_weather",
         "action_input": {
           "location": "New York"
         }
       }
    ```
    
    Here, the action clearly specifies which tool to call (here, `get_weather`) and which parameter to pass (`"location": "New York"`).

3. Observation

    - Feedback from the Environment:

    After the tool call, Alfred receives an observation. This might be the raw weather data from the API, such as:

    "Current weather in New York: partly cloudy, 15°C, 60% humidity."


    This observation is then added to the prompt as additional context. It functions as real-world feedback, confirming whether the action succeeded and providing the needed details.

4. Updated thought

    - Reflecting:

    With the observation in hand, Alfred updates its internal reasoning:

    "Now that I have the weather data for New York, I can compile an answer for the user."


5. Final Action

    Alfred then generates a final response formatted as we told it to:

    - Thought: I have the weather data now. The current weather in New York is partly cloudy with a temperature of 15°C and 60% humidity.

    - Final answer: The current weather in New York is partly cloudy with a temperature of 15°C and 60% humidity.

    This final action sends the answer back to the user, closing the loop.



What we see in this example:

- Agents iterate through a loop until the objective is fulfilled:

    Alfred's process is cyclical. It starts with a thought, then acts by calling a tool, and finally observes the outcome. If the observation had indicated an error or incomplete data, Alfred could have re-entered the cycle to correct its approach.

- Tool Integration:

    The ability to call a tool (like a weather API) enables Alfred to go beyond static knowledge and retrieve real-time data, an essential aspect of many AI Agents.

- Dynamic Adaptation:

    Each cycle allows the agent to incorporate fresh information (observations) into its reasoning (thought), ensuring that the final answer is well-informed and accurate.

This example showcases the core concept behind the ReAct cycle (a concept we're going to develop in the next section): the interplay of Thought, Action, and Observation empowers AI agents to solve complex tasks iteratively.

By understanding and applying these principles, you can design agents that not only reason about their tasks but also effectively utilize external tools to complete them, all while continuously refining their output based on environmental feedback.

---

Let's now dive deeper into Thought, Action, and Observation as the individual steps of the process.


##### Thought: Internal Reasoning and the ReAct Approach

        In this section, we dive into the inner workings of an AI agent: its ability to reason and plan. We'll explore how the agent leverages its internal dialogue to analyze information, break down complex problems into manageable steps, and decide what action to take next. Additionally, we introduce the ReAct approach, a prompting technique that encourages the model to think "step by step" before acting.

Thoughts represent the Agent's internal reasoning and planning processes to solve the task.

This makes use of the agent's Large Language Model (LLM) capacity to analyze information presented in its prompt.

Think of it as the agent's internal dialogue, where it considers the task at hand and strategizes its approach.

The Agent's thoughts are responsible for assessing current observations and deciding what the next action(s) should be.

Through this process, the agent can break down complex problems into smaller, more manageable steps, reflect on past experiences, and continuously adjust its plans based on new information.

        "Note: In the case of LLMs fine-tuned for function-calling, the thought process is optional. In case you're not familiar with function-calling, there will be more details in the Actions section."

Here are some examples of common thoughts:

| Type of Thought     | Example                                                                                             |
|---------------------|-----------------------------------------------------------------------------------------------------|
| Planning            | "I need to break this task into three steps: 1) gather data, 2) analyze trends, 3) generate report" |
| Analysis            | "Based on the error message, the issue appears to be with the database connection parameters"       |
| Decision Making     | "Given the user's budget constraints, I should recommend the mid-tier option"                       |
| Problem Solving     | "To optimize this code, I should first profile it to identify bottlenecks"                          |
| Memory Integration  | "The user mentioned their preference for Python earlier, so I'll provide examples in Python"        |
| Self-Reflection     | "My last approach didn't work well, I should try a different strategy"                              |
| Goal Setting        | "To complete this task, I need to first establish the acceptance criteria"                          |
| Prioritization      | "The security vulnerability should be addressed before adding new features"                         |


##### The ReAct Approach

A key method is the ReAct approach, whose name combines "Reasoning" (Think) with "Acting" (Act).

ReAct is a prompting technique that has the model write out its reasoning before each action. A simple way to elicit such reasoning is to append a cue like "Let's think step by step" before letting the LLM decode the next tokens.

Indeed, prompting the model to think "step by step" steers the decoding process toward tokens that lay out a plan rather than a final solution, since the model is encouraged to decompose the problem into sub-tasks.

This allows the model to consider sub-steps in more detail, which in general leads to fewer errors than trying to generate the final solution directly.
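
As a rough illustration of the step-by-step cue, the helper below appends the phrase to a question before it would be sent to a model. `build_step_by_step_prompt` and the `Question:` / `Answer:` layout are hypothetical choices for this sketch, and no model is actually called.

```python
STEP_BY_STEP_CUE = "Let's think step by step."


def build_step_by_step_prompt(question: str) -> str:
    """Append the reasoning cue so the model starts with a plan, not an answer."""
    return f"Question: {question}\nAnswer: {STEP_BY_STEP_CUE}"


prompt = build_step_by_step_prompt("What is 17 * 24?")
print(prompt)
```

Because the cue sits at the very end of the prompt, the next tokens the model decodes continue the reasoning rather than jump to a result.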

        We have recently seen a lot of interest in reasoning strategies. This is what's behind models like DeepSeek-R1 or OpenAI's o1, which have been fine-tuned to "think before answering".
        These models have been trained to always include specific thinking sections (enclosed between <think> and </think> special tokens). This is not just a prompting technique like ReAct, but a training method where the model learns to generate these sections after analyzing thousands of examples that show what we expect it to do.
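
Code that consumes such output often needs to separate the thinking section from the final answer. The sketch below shows one way to do that with a regular expression; `split_thinking` is a hypothetical helper, and the sample completion is made up for illustration.

```python
import re

# Non-greedy match so only the first <think>...</think> section is captured.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(output: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is '' when no section is present."""
    match = THINK_PATTERN.search(output)
    if match is None:
        return "", output.strip()
    thinking = match.group(1).strip()
    answer = (output[: match.start()] + output[match.end():]).strip()
    return thinking, answer


# Made-up completion used only for illustration
raw = "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>\nThe answer is 408."
thinking, answer = split_thinking(raw)
print(answer)
```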



##### Actions: Enabling the Agent to Engage with Its Environment

        In this section, we explore the concrete steps an AI agent takes to interact with its environment.
        We'll cover how actions are represented (using JSON or code), the importance of the stop and parse approach, and introduce different types of agents.

Actions are the concrete steps an AI agent takes to interact with its environment.

Whether it's browsing the web for information or controlling a physical device, each action is a deliberate operation executed by the agent.

For example, an agent assisting with customer service might retrieve customer data, offer support articles, or transfer issues to a human representative.

##### Types of Agent Actions

There are multiple types of Agents that take actions differently:

| Type of Agent          | Description                                                                                      |
|------------------------|--------------------------------------------------------------------------------------------------|
| JSON Agent             | The Action to take is specified in JSON format.                                                  |
| Code Agent             | The Agent writes a code block that is interpreted externally.                                    |
| Function-calling Agent | It is a subcategory of the JSON Agent which has been fine-tuned to generate a new message for each action. |


Actions themselves can serve many purposes:

| Type of Action         | Description                                                                                      |
|------------------------|--------------------------------------------------------------------------------------------------|
| Information Gathering  | Performing web searches, querying databases, or retrieving documents.                            |
| Tool Usage             | Making API calls, running calculations, and executing code.                                      |
| Environment Interaction| Manipulating digital interfaces or controlling physical devices.                                |
| Communication          | Engaging with users via chat or collaborating with other agents.                                 |


The LLM only handles text and uses it to describe the action it wants to take and the parameters to supply to the tool. For an agent to work properly, the LLM must STOP generating new tokens after emitting all the tokens to define a complete Action. This passes control from the LLM back to the agent and ensures the result is parseable - whether the intended format is JSON, code, or function-calling.
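
One common way to enforce this halt is a stop sequence: generation is cut as soon as a marker such as `Observation:` would appear, so the model cannot invent the tool result itself. The sketch below applies that truncation to an already-generated string. The marker and the `truncate_at_stop` helper are assumptions for illustration; inference APIs usually accept stop sequences as a request parameter instead.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].rstrip()


# Made-up raw output where the model kept going and hallucinated a tool result
raw_output = (
    "Thought: I need to check the current weather for New York.\n"
    'Action: {"action": "get_weather", "action_input": {"location": "New York"}}\n'
    "Observation: It is sunny and 30°C.\n"
)

action_text = truncate_at_stop(raw_output, ["Observation:"])
print(action_text)
```

Everything after the action is discarded, and the real observation is supplied later by the framework once the tool has run.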


##### The Stop and Parse Approach

One key method for implementing actions is the stop and parse approach. This method ensures that the agent's output is structured and predictable:

1. Generation in a Structured Format:

    The agent outputs its intended action in a clear, predetermined format (JSON or code).

2. Halting Further Generation:

    Once the text defining the action has been emitted, the LLM stops generating additional tokens. This prevents extra or erroneous output.

3. Parsing the Output:

    An external parser reads the formatted action, determines which Tool to call, and extracts the required parameters.

For example, an agent needing to check the weather might output:

```text
Thought: I need to check the current weather for New York.
Action:
{
  "action": "get_weather",
  "action_input": {"location": "New York"}
}
```

The framework can then easily parse the name of the function to call and the arguments to apply.

This clear, machine-readable format minimizes errors and enables external tools to accurately process the agent's command.
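
The parsing step can be sketched as below. This is a simplified parser written for this example, not the one any particular framework uses: `parse_action` is a hypothetical helper that takes everything between the first `{` and the last `}` as the JSON blob.

```python
import json


def parse_action(llm_output: str) -> tuple[str, dict]:
    """Extract the tool name and arguments from the JSON blob in the output."""
    start = llm_output.find("{")
    end = llm_output.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON action found in the model output.")
    blob = json.loads(llm_output[start : end + 1])
    return blob["action"], blob.get("action_input", {})


llm_output = """Thought: I need to check the current weather for New York.
Action:
{
  "action": "get_weather",
  "action_input": {"location": "New York"}
}"""

tool_name, tool_args = parse_action(llm_output)
print(tool_name, tool_args)
```

This only works if the model stopped right after the action; trailing text containing braces would break the extraction, which is one reason the halting step matters.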

Note: Function-calling agents operate similarly by structuring each action so that a designated function is invoked with the correct arguments. We'll dive deeper into those types of Agents in a future Unit.

##### Code Agents

An alternative approach is using Code Agents. The idea is: instead of outputting a simple JSON object, a Code Agent generates an executable code block, typically in a high-level language like Python.

<img src="../assets/Unit1_step4_Code_Agents.png" alt="Code Agent" width="800px" />

This approach offers several advantages:

- Expressiveness: Code can naturally represent complex logic, including loops, conditionals, and nested functions, providing greater flexibility than JSON.
- Modularity and Reusability: Generated code can include functions and modules that are reusable across different actions or tasks.
- Enhanced Debuggability: With a well-defined programming syntax, code errors are often easier to detect and correct.
- Direct Integration: Code Agents can integrate directly with external libraries and APIs, enabling more complex operations such as data processing or real-time decision making.

For example, a Code Agent tasked with fetching the weather might generate the following Python snippet:

```python
# Code Agent Example: Retrieve Weather Information
import requests


def get_weather(city):
    # Illustrative endpoint; YOUR_API_KEY is a placeholder
    api_url = f"https://api.weather.com/v1/location/{city}?apiKey=YOUR_API_KEY"
    response = requests.get(api_url, timeout=10)
    if response.status_code == 200:
        data = response.json()
        return data.get("weather", "No weather information available")
    return "Error: Unable to fetch weather data."


# Execute the function and prepare the final answer
result = get_weather("New York")
final_answer = f"The current weather in New York is: {result}"
print(final_answer)
```

In this example, the Code Agent:

- Retrieves weather data via an API call,
- Processes the response,
- And uses the `print()` function to output a final answer.

This method also follows the stop and parse approach by clearly delimiting the code block and signaling when execution is complete (here, by printing the `final_answer`).
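
On the framework side, the generated snippet has to be executed and its printed output captured as the observation. The sketch below does this with plain `exec()`, which is an assumption made for brevity: it provides no isolation, so real systems generally need a sandbox or restricted interpreter. `run_generated_code` and the sample snippet are hypothetical.

```python
import contextlib
import io


def run_generated_code(code: str) -> str:
    """Execute a code string and return whatever it printed.

    WARNING: exec() offers no sandboxing; never run untrusted code like this.
    """
    buffer = io.StringIO()
    namespace: dict = {}
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    return buffer.getvalue().strip()


# Made-up snippet standing in for a Code Agent's output (no network needed)
generated_code = '''
temperatures_c = [14, 15, 16]
average = sum(temperatures_c) / len(temperatures_c)
final_answer = f"Average temperature: {average:.1f} C"
print(final_answer)
'''

observation = run_generated_code(generated_code)
print(observation)
```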

---

We learned that Actions bridge an agent's internal reasoning and its real-world interactions by executing clear, structured tasks, whether through JSON, code, or function calls.

This deliberate execution ensures that each action is precise and ready for external processing via the stop and parse approach. In the next section, we will explore Observations to see how agents capture and integrate feedback from their environment.

After this, we will finally be ready to build our first Agent!



##### Observe: Integrating Feedback to Reflect and Adapt

Observations are how an Agent perceives the consequences of its actions.

They provide crucial information that fuels the Agent‚Äôs thought process and guides future actions.

They are signals from the environment (data from an API, error messages, or system logs) that guide the next cycle of thought.

In the observation phase, the agent:

- Collects Feedback: Receives data or confirmation that its action was successful (or not).
- Appends Results: Integrates the new information into its existing context, effectively updating its memory.
- Adapts its Strategy: Uses this updated context to refine subsequent thoughts and actions.

For example, if a weather API returns the data "partly cloudy, 15°C, 60% humidity", this observation is appended to the agent's memory (at the end of the prompt).

The Agent then uses it to decide whether additional information is needed or if it's ready to provide a final answer.

This iterative incorporation of feedback ensures the agent remains dynamically aligned with its goals, constantly learning and adjusting based on real-world outcomes.

These observations can take many forms, from reading webpage text to monitoring a robot arm's position. They can be seen as Tool "logs" that provide textual feedback on the Action's execution.

| Type of Observation   | Example                                                                    |
|------------------------|----------------------------------------------------------------------------|
| System Feedback        | Error messages, success notifications, status codes                       |
| Data Changes           | Database updates, file system modifications, state changes                |
| Environmental Data     | Sensor readings, system metrics, resource usage                           |
| Response Analysis      | API responses, query results, computation outputs                         |
| Time-based Events      | Deadlines reached, scheduled tasks completed                              |


##### How Are the Results Appended?

After performing an action, the framework follows these steps in order:

1. Parse the action to identify the function(s) to call and the argument(s) to use.
2. Execute the action.
3. Append the result as an Observation.
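
These three steps can be sketched with a message list as the agent's memory. The sketch assumes a stub `get_weather` tool and a hypothetical `append_observation` helper, and it stores the observation under the `user` role, which is only one possible convention. A failed call is fed back as an observation too, so the next thought can react to it.

```python
import json


def get_weather(location: str) -> str:
    """Stub tool returning canned data instead of calling a real API."""
    return f"Current weather in {location}: partly cloudy, 15°C, 60% humidity."


TOOLS = {"get_weather": get_weather}


def append_observation(messages: list[dict], action_json: str) -> list[dict]:
    # 1. Parse the action to find the tool and its arguments
    action = json.loads(action_json)
    tool = TOOLS[action["action"]]
    # 2. Execute the action; errors become feedback instead of crashing the loop
    try:
        result = tool(**action["action_input"])
    except Exception as error:
        result = f"Error: {error}"
    # 3. Append the result as an Observation (role choice is an assumption)
    return messages + [{"role": "user", "content": f"Observation: {result}"}]


history = [{"role": "user", "content": "What's the weather in New York?"}]
history = append_observation(
    history, '{"action": "get_weather", "action_input": {"location": "New York"}}'
)
# Wrong argument name on purpose, to show an error observation
failed = append_observation(
    history, '{"action": "get_weather", "action_input": {"city": "New York"}}'
)
print(history[-1]["content"])
print(failed[-1]["content"])
```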

We've now learned the Agent's Thought-Action-Observation Cycle.

If some aspects still seem a bit blurry, don't worry: we'll revisit and deepen these concepts in future Units.

Now, it's time to put your knowledge into practice by coding your very first Agent!

The code for the first agent can be found in the folder [`src/smolagent_basic`]() in the course repository.
