Reflection is one of the most effective design patterns for agents: an LLM enters an iterative loop of problem solving and feedback until a solution is reached. This feedback can be external or intrinsic. External feedback comes from external evaluators such as humans, code execution output, or unit test results, whereas intrinsic feedback is self-critique from the same LLM on its previous response. The reflection process is depicted in @fig-1.

![The reflection process of iteratively using feedback to improve LLM output. Dotted arrows indicate optionality in how the LLM output is processed into feedback, which can be provided by the same LLM (intrinsic feedback), by an external evaluator (external feedback), or both.](./pics/reflection.png){#fig-1 fig-align="center" width=80%}

There has been some debate regarding whether self-feedback from the same LLM alone can drive performance improvements. Madaan _et al._ introduced Self-Refine, a reflection framework that relies entirely on intrinsic feedback [@madaan2023self]. After the initial response is generated, each iteration of Self-Refine consists of a feedback step and a refinement step. Both steps use the same LLM but differ in the prompt, which contains step-specific instructions (e.g. instructions for feedback or refinement) and few-shot examples. Empirically, this was shown to improve performance materially, with most of the improvement occurring in the initial iterations of reflection. The improvement is not necessarily monotonic, however, and Madaan _et al._ observed that Self-Refine was most effective when the feedback is specific, actionable, or broken down into different evaluation dimensions. In contrast, a group of researchers from Google DeepMind independently assessed intrinsic self-correction and found that its effectiveness was limited, instead causing performance degradation with each self-refine iteration [@huang2023large]. They attributed the discrepancy to flawed experimental design in the Self-Refine study, where the complete set of requirements was included in the feedback prompt instead of the initial response instruction. As a result, it is unclear whether the improvement came from "leakage" of requirements through the feedback or from the iterative process of self-improvement. When the complete set of requirements was included in the initial response instruction, Huang _et al._ observed that standard prompting outperformed Self-Refine.
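The feedback and refinement steps of Self-Refine can be sketched as a short loop. This is a minimal illustration and not the authors' implementation: `llm` is a hypothetical callable that maps a prompt string to a response string, and the prompt wording, the `STOP` convention, and `max_iters` are assumptions made for this sketch. A scripted `toy_llm` stands in for a real model so the example runs offline.

```python
from typing import Callable


def self_refine(task: str, llm: Callable[[str], str], max_iters: int = 3) -> str:
    """Iteratively refine a response using feedback from the same LLM."""
    # Initial response construction.
    response = llm(f"Task: {task}\nWrite a response.")
    for _ in range(max_iters):
        # Feedback step: same model, different instructions.
        feedback = llm(
            f"Task: {task}\nResponse: {response}\n"
            "Give specific, actionable feedback, or reply STOP if no changes are needed."
        )
        if feedback.strip() == "STOP":
            break
        # Refinement step: same model again, now conditioned on the feedback.
        response = llm(
            f"Task: {task}\nResponse: {response}\nFeedback: {feedback}\n"
            "Rewrite the response to address the feedback."
        )
    return response


def toy_llm(prompt: str) -> str:
    """Scripted stand-in for a real model (hypothetical, for illustration only)."""
    if "Rewrite the response" in prompt:
        return "draft v2"
    if "Give specific" in prompt:
        return "STOP" if "Response: draft v2" in prompt else "Be more concise."
    return "draft v1"


result = self_refine("Summarize the paper.", toy_llm)
print(result)
```

Note that nothing outside the model enters this loop: the feedback can only draw on what the same LLM already knows, which is the limitation discussed next.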

These studies suggest that external feedback is necessary for reflection to improve performance reliably. Intuitively, this makes sense: if performance is bottlenecked by a lack of certain parametric knowledge (i.e. knowledge trained into the weights of a model), then feedback from the same LLM is unlikely to provide the missing knowledge required to arrive at the solution. We next introduce self-correction from external feedback and ReAct as effective reflection examples that make use of external feedback or signal to drive performance beyond prompting an LLM in isolation.

## Self-Correction with External Feedback

Multiple studies have shown that external feedback reliably and materially improves performance on the problem an agent is trying to solve, with the LLM self-correcting its previous response using the provided feedback [@chen2023teaching; @shinn2023reflexion; @gou2023critic]. Gou _et al._ introduced a simple version of this, in which critiques on correctness are obtained through a set of tools. The LLM uses the critiques to improve its previous response, and the loop terminates when the critiques determine the latest response to be correct [@gou2023critic]. Shinn _et al._ introduced the Reflexion framework, which extends this approach by (1) combining LLM-driven self-reflection with the external feedback signal to form the final feedback and (2) incorporating short-term memory of conversation history (i.e. trajectory) and long-term memory of past verbal feedback. By using an evaluator LLM to score the latest LLM response and using another LLM to suggest actionable feedback based on the score, Reflexion improves the quality of the feedback. With memory of the LLM's past interactions with the environment and their outcomes, the actor LLM essentially undergoes reinforcement learning based on in-context learning examples. The Reflexion framework is best illustrated with the diagram in @fig-2.

![Reflexion framework. Source: "Reflexion: Language Agents with Verbal Reinforcement Learning" (https://arxiv.org/pdf/2303.11366).](./pics/reflexion.png){#fig-2 fig-align="center" width=60%}

Finally, Chen _et al._ applied reflection to programming agents to simulate rubber duck debugging, where debugging is done by explaining code and following the execution results [@chen2023teaching]. Like Reflexion, the feedback step combines the external signal (from code execution) with an LLM-generated explanation of the code, and this combined feedback is provided to the same LLM for the next iteration of code generation. In agreement with other research, the authors found that receiving feedback from code execution is important for improving performance consistently.
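The loop shared by these approaches, generate, evaluate externally, and retry with the critique until the evaluator finds no error, can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any of the cited papers: `run_tests`, `self_correct`, and `toy_llm` are hypothetical names, the candidate code is assumed to define a function `f`, and `exec` is used only because the code here is a trusted toy example.

```python
from typing import Callable, List, Optional, Tuple


def run_tests(code: str, tests: List[Tuple[int, int]]) -> Optional[str]:
    """External evaluator: run the candidate code and return a critique, or None if all tests pass."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # acceptable only for trusted toy code
        for arg, expected in tests:
            got = namespace["f"](arg)
            if got != expected:
                return f"f({arg}) returned {got}, expected {expected}"
    except Exception as exc:
        return f"execution failed: {exc!r}"
    return None


def self_correct(task: str, llm: Callable[[str], str],
                 tests: List[Tuple[int, int]], max_iters: int = 3) -> str:
    """Regenerate code until the external evaluator reports no error."""
    code = llm(task)
    for _ in range(max_iters):
        critique = run_tests(code, tests)
        if critique is None:
            break  # the evaluator judged the latest response correct
        code = llm(f"{task}\nPrevious attempt:\n{code}\nFeedback: {critique}\nFix the code.")
    return code


def toy_llm(prompt: str) -> str:
    """Scripted stand-in for a real model: wrong at first, right after feedback."""
    if "Feedback:" in prompt:
        return "def f(x):\n    return x * x"
    return "def f(x):\n    return x + x"


tests = [(2, 4), (3, 9)]
first_critique = run_tests(toy_llm("Write f(x) that squares x."), tests)
fixed = self_correct("Write f(x) that squares x.", toy_llm, tests)
print(first_critique)
print(run_tests(fixed, tests))
```

Unlike purely intrinsic feedback, the critique here carries information the model did not produce itself, namely the observed behavior of its code on the tests.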

## ReAct

ReAct, which stands for reasoning and acting, is a prompting technique that combines reasoning and acting with language models to solve diverse language reasoning and decision-making tasks [@yao2023react]. By interspersing actions and subsequent observations with an LLM's reasoning traces, Yao _et al._ showed that this reduces hallucination and error propagation, demonstrating superior performance over baselines like standard prompting, chain-of-thought prompting (i.e. reasoning only), and acting only (i.e. ReAct without thoughts). It is believed that reasoning improves an agent's subsequent actions via mechanisms such as introducing notions of planning, injecting extra parametric knowledge, or extracting important parts of observations.

Concretely, ReAct prompting induces a repeating pattern of thinking, acting, and receiving an observation until a task is solved. Starting with a question, ReAct prompting invokes the LLM to output a thinking trace on how to solve the problem, followed by another LLM invocation to choose an action given the thinking. The selected action is executed, the corresponding observation is added to the agent's short-term memory (i.e. conversation history), and the process repeats. Thus, the number of LLM calls is $O(N)$ in the number $N$ of think, act, and observation iterations required to solve a problem, making ReAct prompting relatively expensive. However, ReAct improves the flexibility of an agent to address open-ended requests dynamically. Additionally, Yao _et al._ reported that ReAct introduces strong generalization capabilities to new problems with just a few in-context examples.
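The think, act, observe cycle and its cost can be illustrated with a small offline sketch. Everything here is hypothetical and simplified: `react_loop`, `scripted_llm`, and the `tools` dictionary are names invented for this example, the LLM is replaced by a fixed script of responses, and actions are assumed to follow a `Name[argument]` format. The loop counts LLM invocations to show the two calls per iteration described above.

```python
from typing import Callable, Dict, List, Optional, Tuple


def react_loop(question: str, llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               max_iters: int = 5) -> Tuple[Optional[str], int]:
    """Run think/act/observe iterations; return (answer, number of LLM calls)."""
    history: List[str] = [f"Question: {question}"]  # short-term memory
    llm_calls = 0
    for _ in range(max_iters):
        # Think: ask for a reasoning trace given everything so far.
        thought = llm("\n".join(history) + "\nThought:")
        llm_calls += 1
        history.append(f"Thought: {thought}")
        # Act: ask for an action conditioned on the thought.
        action = llm("\n".join(history) + "\nAction:")
        llm_calls += 1
        history.append(f"Action: {action}")
        name, _, arg = action.partition("[")
        if arg.endswith("]"):
            arg = arg[:-1]
        if name == "Finish":
            return arg, llm_calls
        # Observe: execute the tool and store its output in memory.
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown action: {name}"
        history.append(f"Observation: {observation}")
    return None, llm_calls  # step budget exhausted


# Scripted responses standing in for a real model.
script = iter([
    "I should look up the capital of France.",
    "Search[capital of France]",
    "The observation says Paris, so I can answer.",
    "Finish[Paris]",
])


def scripted_llm(prompt: str) -> str:
    return next(script)


tools = {"Search": lambda query: "Paris is the capital of France."}
answer, calls = react_loop("What is the capital of France?", scripted_llm, tools)
print(answer, calls)
```

With two iterations the sketch makes four LLM calls, and each call receives the full history, so prompt length also grows with the number of iterations.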

We next use ReAct to implement an agent that uses reasoning and internet search to recommend where a user should live in the United States based on the user's preferences. Our ReAct prompt first tells the LLM that we wish to solve the problem using a thinking, action, and observation pattern, then specifies the output format at each step, the action space, and how to end. This is followed by a one-shot example of a thought, action, and observation pattern used to address a sample question. The prompt ends by instructing the agent to _only_ output tokens for action or thinking.


In [37]:
SYSTEM_PROMPT = """
You are an assistant trying to help me determine which city in the United States I should live in.
"""

ROLE_PROMPT = """
Solve a question answering task with interleaving Thought, Action, Observation steps. 
Respond ONLY with ONE JSON-compliant string of the format:

{{
    "type": "Thought" or "Action"
    "content": "content of Thought, Action, or Observation"
}}

The "content" of Action is either "Search[arguments]" or "Finish[answer]", where:

-`"Search[arguments]"` means to search the web with arguments
-`"Finish[answer]"` means to finish the steps with an answer. You should output `"Finish[answer]"` as soon as you've gathered enough information
  to answer the question.

The "content" of Thought is a reasoning about what Action to take next based on previous Observation. DO NOT output new lines for Thought content.

Example pattern:

Question: 
"Which city should I live in and what activities does it provide? I enjoy walkable cities."

Thought 1:
{{
    "type": "Thought"
    "content": "To find which city you should live in, I need to search for a list of most walkable cities in the 
                United States and their temperatures."
}}

Action 1:
{{
    "type": "Action"
    "content": "Search[Most walkable American cities]"
}}

Observation 1:
{{
    "type": "Observation"
    "content": "One of the most walkable cities in the U.S. is New York."
}}

Thought 2:
{{
    "type": "Thought"
    "content": "I need to search what hobbies or activities are available in New York."
}}

Action 2:
{{
    "type": "Action"
    "content": "Search[What hobbies or activities are available in New York?]"
}}

Observation 2:
{{
    "type": "Observation"
    "content": "From big highlights like Times Square to quaint walks locals love. New York City is a hub of culture 
                and adventure, offering an array of activities for everyone. Explore iconic landmarks like the Statue 
                of Liberty, the Empire State Building, and Central Park, or marvel at art of the Museum of Modern Art."
}}

Thought 3:
{{
    "type": "Thought"
    "content": "From the search results, New York City is the most walkable city in America with hobbies or activities 
                like exploring city landmarks or visiting museums."
}}

Action 3:
{{
    "type": "Action"
    "content": "Finish[I recommend you live in New York City, which is one of the most walkable cities in America. There 
                       you can explore city landmarks Statue of Liberty, the Empire State Building, and Central Park, or marvel 
                       at art of the Museum of Modern Art]"
}}

...

Return type and content for ONE step ONLY
"""

To implement the actual agent, we essentially wrap the LLM call within a while loop until a maximum number of steps is reached. The agent class is additionally equipped with:

+ Short-term memory, implemented simply as a string in this prototype.
+ The internet search tool `DuckDuckGoSearchRun`, which simply accepts a natural language query and responds with a paragraph of text results.

Then, in each iteration we parse the LLM response to decide what action to take. The initial question and subsequent thoughts, actions, and observations are continually added to the conversation history to influence what the LLM will output next. For example, if the last item in the history is an observation, then the LLM will output a thinking trace based on following the ReAct prompt.

In [38]:
from langchain_community.tools import DuckDuckGoSearchRun
import json
import boto3
import warnings
warnings.filterwarnings('ignore')

class ReAct_agent:
    def __init__(self, client, system_prompt, role_prompt, max_steps=20, search_max_length=1000):
        self.client = client
        self.SYSTEM_PROMPT = system_prompt
        self.ROLE_PROMPT = role_prompt
        self.ddg_search = DuckDuckGoSearchRun()
        self.max_steps = max_steps
        self.step_num = 0
        self.search_max_length = search_max_length
        self.history = ""

    def ask(self, message, verbose = True):
        # Seed the short-term memory with the question and request the first step.
        self.add_to_memory(message)
        model_response = self.invoke_llm(message)
        # Strip any markdown code fence around the JSON before parsing it.
        model_response_json = json.loads(model_response.replace("```json", "").replace("```", "").strip())
        response_type = model_response_json["type"]
        while self.step_num < self.max_steps:
            self.add_to_memory(model_response)
            if response_type == "Action":
                prefix, content = self.parse_action(model_response_json["content"])
                if prefix == "Finish":
                    # The agent has an answer, so stop the loop.
                    return content
                elif prefix == "Search":
                    # Run the search tool and record its result as the observation.
                    observation = self.search(content)
                    if verbose:
                        print(json.dumps({"type": "Observation", "content": observation}, indent=4))
                    self.add_to_memory(json.dumps({"type": "Observation", "content": observation}))
            # Ask the LLM for the next step given the full history.
            model_response = self.invoke_llm(self.history)
            model_response_json = json.loads(model_response.replace("```json", "").replace("```", "").strip())
            response_type = model_response_json["type"]
            if verbose:
                print(json.dumps(model_response_json, indent=4))
            self.step_num += 1
        # Step budget exhausted without a Finish action.
        return None

    def search(self, message):
        return self.ddg_search.invoke(message)[:self.search_max_length]

    def invoke_llm(self, message):
        # Use the client passed to the constructor rather than a global variable.
        bedrock_runtime_response = self.client.converse(
            modelId = "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
            system = [
                {'text': self.SYSTEM_PROMPT}, 
                {'text': self.ROLE_PROMPT}
            ],
            messages = [{"role": "user", "content": [{"text": message}]}]
        )
        return bedrock_runtime_response["output"]['message']['content'][0]['text']

    def add_to_memory(self, message):
        self.history += ", " + message

    def parse_action(self, message):
        # Split "Prefix[content]" into its parts; rfind keeps any "]" inside the content.
        start = message.find('[')
        end = message.rfind(']')
        prefix = message[:start]
        content = message[start+1:end]
        return prefix, content

Now we ask the agent where we should live if we prefer cities with access to nature and outdoor activities.

In [39]:
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

agent = ReAct_agent(bedrock_runtime, SYSTEM_PROMPT, ROLE_PROMPT, max_steps = 15)

user_input = "Which city should I live in? I enjoy outdoor activities."
answer = agent.ask(user_input)

print(answer)

{
    "type": "Action",
    "content": "Search[best US cities for outdoor activities]"
}
{
    "type": "Observation",
    "content": "Nov 16, 2024 \u00b7 Adventure awaits in these top 10 US cities ! Discover destinations perfect for hiking, biking, skiing, and more outdoor activities across America. Mar 19, 2025 \u00b7 15 U.S . Cities with the Best Outdoor Activities for Nature Lovers March 19, 2025 by Donna Dizon Leave a Comment Many nature lovers long to live in a cabin out in the country, but for most of us, that\u2019s not a reality. Sep 2, 2025 \u00b7 Read Time: 8 min read Summary: Looking for the best cities for outdoor activities in the U.S .? Check out Alexandria, VA for its beautiful parks and trails, Boston, MA for skiing and kayaking, Denver, CO for mountain adventures, Portland, OR for hiking and camping, Roseville, CA for river rafting and national parks, Salt Lake City, UT for skiing and hiking, and Seattle, WA for fishing and ... Image Editorial Credit: Farragutful via W

We see that with ReAct prompting, the agent first got a list of candidate cities, then searched each candidate city for specific outdoor activities before returning the final answer, which is Denver, Colorado. 

## LLM as Optimizers

Yang _et al._ from DeepMind showed that an LLM can be prompted to perform non-gradient optimization of numerical and discrete problems, using linear regression and the traveling salesman problem as example problems [@yang2023large]. The prompt includes a description of the problem and goal, followed by a set of candidate solutions and their corresponding scores, which the authors together refer to as the _meta-prompt_. By proposing new solutions and getting them scored, the LLM can utilize in-context learning to iterate on the problem. Crucially, this requires an external objective function evaluator to score solutions at each iteration, which are then added to the growing set of solution-score pairs. The solution-score pairs are sorted, presumably to help the LLM identify patterns that can be used to further refine the solution. @fig-3 shows how an LLM can be used as an optimizer through prompting.

![Optimization by prompting framework. Source: "Large Language Models as Optimizers" (https://arxiv.org/pdf/2309.03409).](./pics/llm_as_optimizer.png){#fig-3 fig-align="center" width=80%}

We can use the meta-prompt for linear regression from the paper to explain the prompt design:

> <span style="color: orange">Now you will help me minimize a function with two input variables w, b. I have some (w, b) pairs
and the function values at those points. The pairs are arranged in descending order based on their
function values, where lower values are better.</span>
>
> <span style="color: blue">input:<br>w=18, b=15<br>value:<br>10386334</span>
> 
> <span style="color: blue">input:<br>w=17, b=18<br>value:<br>9204724</span>
> 
> <span style="color: orange">Give me a new (w, b) pair that is different from all pairs above, and has a function value lower than
any of the above. Do not write code. The output must end with a pair [w, b], where w and b are
numerical values.</span>

In the above meta-prompt, the orange text contains the instructions for the linear regression task, and the blue text contains the set of candidate solutions: the weight values $w, b$ along with their function values (presumably the sum of squared errors).
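To make the structure of the meta-prompt and the surrounding loop concrete, here is a minimal sketch. It is not the authors' code: the instruction strings are copied from the meta-prompt above, but `sse`, `build_meta_prompt`, `optimize`, `toy_proposer`, the `keep` limit, and the sample data are assumptions for illustration. In particular, `toy_proposer` replaces the LLM with a random nudge of the best pair listed in the prompt, and the objective is assumed to be the sum of squared errors.

```python
import random
import re
from typing import Callable, List, Tuple

Solution = Tuple[int, int]
Scored = Tuple[Solution, int]

HEAD = ("Now you will help me minimize a function with two input variables w, b. "
        "I have some (w, b) pairs and the function values at those points. "
        "The pairs are arranged in descending order based on their function values, "
        "where lower values are better.")
TAIL = ("Give me a new (w, b) pair that is different from all pairs above, and has a "
        "function value lower than any of the above. Do not write code. "
        "The output must end with a pair [w, b], where w and b are numerical values.")


def sse(w: int, b: int, data: List[Tuple[int, int]]) -> int:
    """External objective (assumed): sum of squared errors of y = w*x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in data)


def build_meta_prompt(scored: List[Scored]) -> str:
    """Instructions plus solution-score pairs, worst first and best last."""
    ordered = sorted(scored, key=lambda s: s[1], reverse=True)
    blocks = [f"input:\nw={w}, b={b}\nvalue:\n{v}" for (w, b), v in ordered]
    return "\n\n".join([HEAD, *blocks, TAIL])


def toy_proposer(prompt: str, rng: random.Random) -> Solution:
    """Stand-in for the LLM: nudges the last (best) pair found in the prompt."""
    w, b = map(int, re.findall(r"w=(-?\d+), b=(-?\d+)", prompt)[-1])
    return w + rng.choice([-1, 0, 1]), b + rng.choice([-1, 0, 1])


def optimize(data: List[Tuple[int, int]], propose: Callable[[str], Solution],
             init: List[Solution], steps: int = 200, keep: int = 5) -> Scored:
    scored = [(s, sse(s[0], s[1], data)) for s in init]
    for _ in range(steps):
        candidate = propose(build_meta_prompt(scored))
        if any(candidate == s for s, _ in scored):
            continue  # the prompt asks for a pair different from all listed pairs
        # Score with the external objective and keep only the best few pairs.
        scored.append((candidate, sse(candidate[0], candidate[1], data)))
        scored = sorted(scored, key=lambda s: s[1])[:keep]
    return min(scored, key=lambda s: s[1])


data = [(x, 3 * x + 2) for x in range(5)]  # points on y = 3x + 2
init = [(18, 15), (17, 18)]
rng = random.Random(0)
best_solution, best_value = optimize(data, lambda p: toy_proposer(p, rng), init)
print(best_solution, best_value)
```

The scoring happens entirely outside the proposer, which mirrors the role of the external objective function evaluator: the proposer only ever sees solutions and their scores in the prompt.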

As a more practical application, Yang _et al._ used the LLM to optimize the prompt for a given task. Specifically, the meta-prompt contains a statement that the goal is to generate a prompt (i.e. meta-instructions), examples of the output format, and past prompt-score pairs. After generation, a scorer LLM or objective function computes accuracy scores for these new prompts. The meta-prompt is then updated with the best new prompt-score pairs for further optimization.