### Redundant Execution for Fault Tolerance
When companies deploy agentic systems in production, they run into operational problems: APIs time out, models crash, and networks drop. The previous patterns focused on improving the quality and speed of our agents under ideal conditions.

This pattern, Redundant Execution, focuses on ensuring our system remains reliable and performant even under adverse conditions.

<p align="center">
  <img src="../../figures/redundant_execution.png" width="1000">
</p>

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

llm = ChatHuggingFace(
    llm=HuggingFaceEndpoint(
        model="Qwen/Qwen3-4B-Instruct-2507"
    )
)


The concept is straightforward: for a critical and potentially unreliable step, we execute two or more identical agents in parallel. The system then uses the result from the first agent to finish successfully and cancels (or simply ignores) the rest. This technique provides a defense against intermittent failures and unpredictable performance.

We will build a simple agent that relies on an intentionally unreliable, simulated tool. We will then run it with and without redundant execution and measure the effect on both speed (latency consistency) and success rate (effective accuracy).

First, we need to create our unreliable tool. This is the core of our simulation. It will have a chance to fail, a chance to be very slow, or a chance to be fast, mimicking the unpredictable nature of real-world network dependencies.

In [4]:
from langchain_core.tools import tool
import time
import random

@tool
def get_critical_data(query: str) -> str:
    """A simulated tool that fetches critical data from an external service that can be slow or fail intermittently."""
    # We assign a random ID to each instance of the tool call for clear logging.
    instance_id = random.randint(1000, 9999)
    print(f"--- [Tool Instance {instance_id}] Attempting to fetch data for query: '{query}' ---")
    
    # We simulate unreliability with a random roll.
    roll = random.random()
    if roll < 0.20: # 20% chance of a complete failure.
        print(f"--- [Tool Instance {instance_id}] FAILED: Network connection error. ---")
        raise ConnectionError("Failed to connect to the external service.")
    elif roll < 0.30: # 10% chance of a "long-tail latency" event.
        slow_duration = random.uniform(5, 7)
        print(f"--- [Tool Instance {instance_id}] SLOW: Experiencing high latency. Will take {slow_duration:.2f}s. ---")
        time.sleep(slow_duration)
    else: # 70% chance of a normal, fast execution.
        fast_duration = random.uniform(0.5, 1.0)
        print(f"--- [Tool Instance {instance_id}] FAST: Executing normally. Will take {fast_duration:.2f}s. ---")
        time.sleep(fast_duration)
    
    result = f"Data for '{query}' successfully retrieved by instance {instance_id}."
    print(f"--- [Tool Instance {instance_id}] SUCCESS: {result} ---")
    return result

The get_critical_data tool is our controlled experiment. The random roll and time.sleep calls are a simple but effective way to simulate the two most common problems in distributed systems: transient failures (ConnectionError) and unpredictable latency. Because the roll is not seeded, individual runs will differ, but the tool gives us a clear demonstration of the problem we are trying to solve.
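
To make the expected benefit concrete, the tool's roll thresholds can be turned into a small probability calculation. The sketch below is an illustration, not part of the original notebook: the helper name `redundant_outcomes` is hypothetical, and it assumes the replicas are independent, that failures surface immediately, and that a fast call always finishes before a slow one (true for the tool's sleep ranges, but LLM latency around the tool call could blur this in practice).

```python
# Outcome probabilities for a single call, taken from the tool's roll thresholds.
P_FAIL, P_SLOW, P_FAST = 0.20, 0.10, 0.70

def redundant_outcomes(n: int) -> dict:
    """Outcome probabilities when n independent replicas race and the first success wins.

    Hypothetical helper for illustration; assumes independence between replicas.
    """
    # The run fails only if every replica fails.
    all_fail = P_FAIL ** n
    # A fast result is returned if at least one replica is fast.
    fast = 1 - (1 - P_FAST) ** n
    # A slow result is returned if no replica is fast but at least one succeeds.
    slow = (1 - P_FAST) ** n - all_fail
    return {"fast": fast, "slow": slow, "fail": all_fail}

for n in (1, 2, 3):
    o = redundant_outcomes(n)
    print(f"n={n}: fast={o['fast']:.3f}  slow={o['slow']:.3f}  fail={o['fail']:.3f}")
# n=1: fast=0.700  slow=0.100  fail=0.200
# n=2: fast=0.910  slow=0.050  fail=0.040
# n=3: fast=0.973  slow=0.019  fail=0.008
```

Under these assumptions, two replicas reduce the tool-level failure rate from 20% to 4% and halve the chance of a long-tail result, at the cost of doubling the number of calls.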

### The Baseline - A Simple, Unreliable Agent

In [5]:
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

simple_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a reliable agent. Your job is to use the provided tool to get critical data based on the user's request."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

simple_agent = create_tool_calling_agent(llm, [get_critical_data], simple_prompt)
simple_executor = AgentExecutor(agent=simple_agent, tools=[get_critical_data])

In [6]:
simple_results = []
num_runs = 5
for i in range(num_runs):
    print(f"--- Running Simple Agent (Attempt {i+1}/{num_runs}) ---")
    start_time = time.time()
    try:
        result = simple_executor.invoke({"input": "Please fetch the user's profile"})
        end_time = time.time()
        simple_results.append(("SUCCESS", end_time - start_time, result))
        print(f"SUCCESS in {end_time - start_time:.2f}s. Result: {result}\n")
    except Exception as e:
        end_time = time.time()
        simple_results.append(("FAILURE", end_time - start_time, str(e)))
        print(f"FAILURE in {end_time - start_time:.2f}s. Reason: {e}\n")

--- Running Simple Agent (Attempt 1/5) ---
--- [Tool Instance 8871] Attempting to fetch data for query: 'user profile' ---
--- [Tool Instance 8871] FAST: Executing normally. Will take 0.93s. ---
--- [Tool Instance 8871] SUCCESS: Data for 'user profile' successfully retrieved by instance 8871. ---
SUCCESS in 3.47s. Result: {'input': "Please fetch the user's profile", 'output': "The user's profile has been successfully retrieved. Let me know if you need further details!"}

--- Running Simple Agent (Attempt 2/5) ---
--- [Tool Instance 4390] Attempting to fetch data for query: 'user profile' ---
--- [Tool Instance 4390] FAILED: Network connection error. ---
FAILURE in 0.97s. Reason: Failed to connect to the external service.

--- Running Simple Agent (Attempt 3/5) ---
--- [Tool Instance 6132] Attempting to fetch data for query: 'user profile' ---
--- [Tool Instance 6132] SLOW: Experiencing high latency. Will take 6.37s. ---
--- [Tool Instance 6132] SUCCESS: Data for 'user profile' successf

### Building the Redundant Execution Graph
Now, let's build the fault-tolerant version. The key is to use a ThreadPoolExecutor to launch two identical agent executions and use as_completed to get the result from the one that finishes first.

In [7]:
from typing import TypedDict, Optional, Any
from concurrent.futures import ThreadPoolExecutor, as_completed, Future

class RedundantState(TypedDict):
    input: str
    result: Optional[Any]
    error: Optional[str]
    performance_log: Optional[str]

def redundant_executor_node(state: RedundantState):
    """Executes two identical agents in parallel and returns the first successful result."""
    print("--- [Redundant Executor] Starting 2 agents in parallel... ---")
    start_time = time.time()
    
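    # We deliberately avoid `with ThreadPoolExecutor(...)`: leaving a `with` block
    # calls shutdown(wait=True), which blocks until BOTH tasks finish and would
    # erase the latency benefit of taking the first result.
    executor = ThreadPoolExecutor(max_workers=2)
    # Submit two identical tasks
    futures = [executor.submit(simple_executor.invoke, {"input": state['input']}) for _ in range(2)]

    first_result = None
    for future in as_completed(futures):
        try:
            # future.result() re-raises the exception if that run failed
            first_result = future.result()
            print("--- [Redundant Executor] A task finished successfully. Cancelling others. ---")
            # Once we have one success, we don't need the other. We can break.
            break
        except Exception as e:
            print(f"--- [Redundant Executor] A task failed with error: {e}. Waiting for the other. ---")
            # If one fails, we just wait for the next one to complete.

    # Return immediately instead of waiting for the remaining task.
    # cancel_futures (Python 3.9+) only cancels tasks that have not started yet;
    # a task that is already running cannot be interrupted and finishes in the background.
    executor.shutdown(wait=False, cancel_futures=True)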
    
    execution_time = time.time() - start_time
    log = f"Redundant execution completed in {execution_time:.2f}s."
    print(f"--- [Redundant Executor] {log} ---")
    
    if first_result is not None:
        return {"result": first_result, "performance_log": log, "error": None}
    else:
        return {"result": None, "performance_log": log, "error": "Both redundant executions failed."}
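
One detail of ThreadPoolExecutor matters for this pattern: leaving a `with ThreadPoolExecutor(...)` block calls `shutdown(wait=True)`, so the caller is held until every submitted task has finished, even after a `break`. The standalone sketch below illustrates the difference using plain sleep tasks instead of agents; the function names and delays are made up for the demonstration, and `cancel_futures` requires Python 3.9 or later.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def task(delay: float) -> float:
    """Stand-in for an agent run: sleeps for `delay` seconds and returns it."""
    time.sleep(delay)
    return delay

def race_with_context_manager() -> float:
    """Take the first result, but let the `with` block wait for the slow task."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(task, d) for d in (0.05, 0.5)]
        next(as_completed(futures)).result()  # first finished task
    # __exit__ has called shutdown(wait=True), so the slow task is done by now.
    return time.perf_counter() - start

def race_without_waiting() -> float:
    """Take the first result and return without waiting for the slow task."""
    start = time.perf_counter()
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(task, d) for d in (0.05, 0.5)]
    next(as_completed(futures)).result()
    # Does not interrupt the running slow task; it finishes in the background.
    pool.shutdown(wait=False, cancel_futures=True)
    return time.perf_counter() - start

blocked = race_with_context_manager()
early = race_without_waiting()
print(f"with-block: {blocked:.2f}s, shutdown(wait=False): {early:.2f}s")
```

The first variant takes about as long as the slowest task, while the second returns shortly after the fastest one. The abandoned task still consumes a thread (and, for a real agent, API quota) until it completes, which is the price of not being able to cancel a running thread.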

In [8]:
from langgraph.graph import StateGraph, END

workflow = StateGraph(RedundantState)
workflow.add_node("redundant_executor", redundant_executor_node)
workflow.set_entry_point("redundant_executor")
workflow.add_edge("redundant_executor", END)
app = workflow.compile()

# Run the resilient graph multiple times
redundant_results = []
for i in range(num_runs):
    print(f"--- Running Redundant Agent (Attempt {i+1}/{num_runs}) ---")
    start_time = time.time()
    result = app.invoke({"input": "Please fetch the user's profile"})
    end_time = time.time()
    if result['error']:
        redundant_results.append(("FAILURE", end_time - start_time, result['error']))
        print(f"FAILURE in {end_time - start_time:.2f}s.\n")
    else:
        redundant_results.append(("SUCCESS", end_time - start_time, result['result']))
        print(f"SUCCESS in {end_time - start_time:.2f}s.\n")

--- Running Redundant Agent (Attempt 1/5) ---
--- [Redundant Executor] Starting 2 agents in parallel... ---
--- [Tool Instance 9173] Attempting to fetch data for query: 'user profile' ---
--- [Tool Instance 9173] FAILED: Network connection error. ---
--- [Redundant Executor] A task failed with error: Failed to connect to the external service.. Waiting for the other. ---
--- [Tool Instance 2074] Attempting to fetch data for query: 'user profile' ---
--- [Tool Instance 2074] FAST: Executing normally. Will take 0.66s. ---
--- [Tool Instance 2074] SUCCESS: Data for 'user profile' successfully retrieved by instance 2074. ---
--- [Redundant Executor] A task finished successfully. Cancelling others. ---
--- [Redundant Executor] Redundant execution completed in 3.11s. ---
SUCCESS in 3.11s.

--- Running Redundant Agent (Attempt 2/5) ---
--- [Redundant Executor] Starting 2 agents in parallel... ---
--- [Tool Instance 7266] Attempting to fetch data for query: 'user profile' ---
--- [Tool Instance

### Analysis

In [9]:
import numpy as np

# Assumes the results are stored in the 'simple_results' and 'redundant_results' lists built above.
# --- Reliability Analysis ---
simple_successes = sum(1 for r in simple_results if r[0] == "SUCCESS")
simple_rate = (simple_successes / len(simple_results)) * 100 if simple_results else 0
redundant_successes = sum(1 for r in redundant_results if r[0] == "SUCCESS")
redundant_rate = (redundant_successes / len(redundant_results)) * 100 if redundant_results else 0
print("="*60)
print("                  SYSTEM RELIABILITY ANALYSIS")
print("="*60 + "\n")
print("--- Simple Agent ---")
print(f"Success Rate: {simple_rate:.1f}% ({simple_successes} successes, {len(simple_results) - simple_successes} failures)\n")
print("--- Redundant Agent ---")
print(f"Success Rate: {redundant_rate:.1f}% ({redundant_successes} successes, {len(redundant_results) - redundant_successes} failures)\n")

if simple_rate > 0:
    reliability_increase = ((redundant_rate - simple_rate) / simple_rate) * 100
    # The `+` format flag prints the sign itself, so a decrease shows as "-x%" rather than "+-x%".
    print(f"Accuracy / Reliability Increase: {reliability_increase:+.1f}%\n")

# --- Latency Analysis ---
simple_latencies = [r[1] for r in simple_results if r[0] == "SUCCESS"]
redundant_latencies = [r[1] for r in redundant_results if r[0] == "SUCCESS"]
print("="*60)
print("                  PERFORMANCE & LATENCY ANALYSIS")
print("="*60 + "\n")
print("--- Simple Agent (Successful Runs Only) ---")
print(f"Latencies: {[round(l, 2) for l in simple_latencies]}")
print(f"Average Latency: {np.mean(simple_latencies):.2f} seconds" if simple_latencies else "N/A")
print(f"Max Latency (P100): {np.max(simple_latencies):.2f} seconds" if simple_latencies else "N/A")

print("--- Redundant Agent (Successful Runs Only) ---")
print(f"Latencies: {[round(l, 2) for l in redundant_latencies]}")
print(f"Average Latency: {np.mean(redundant_latencies):.2f} seconds" if redundant_latencies else "N/A")
print(f"Max Latency (P100): {np.max(redundant_latencies):.2f} seconds" if redundant_latencies else "N/A")

                  SYSTEM RELIABILITY ANALYSIS

--- Simple Agent ---
Success Rate: 80.0% (4 successes, 1 failures)

--- Redundant Agent ---
Success Rate: 80.0% (4 successes, 1 failures)

Accuracy / Reliability Increase: +0.0%

                  PERFORMANCE & LATENCY ANALYSIS

--- Simple Agent (Successful Runs Only) ---
Latencies: [3.47, 60.8, 3.49, 58.76]
Average Latency: 31.63 seconds
Max Latency (P100): 60.80 seconds
--- Redundant Agent (Successful Runs Only) ---
Latencies: [3.11, 3.05, 3.36, 2.63]
Average Latency: 3.04 seconds
Max Latency (P100): 3.36 seconds
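
With only five runs per configuration, the measured success rates are noisy, so an identical 80% for both agents says little about the underlying reliability. As a rough cross-check, the sketch below simulates only the tool's failure roll (no LLM, no network) over many trials. The function names are hypothetical, and it assumes each replica fails independently with the tool's 20% probability.

```python
import random

def simulate_tool_call(rng: random.Random) -> bool:
    """Return True if a simulated call succeeds (fast or slow), False on failure."""
    return rng.random() >= 0.20  # 20% failure chance, as in get_critical_data

def estimate_success_rate(replicas: int, trials: int = 20_000, seed: int = 0) -> float:
    """Fraction of trials in which at least one of `replicas` calls succeeds."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    successes = sum(
        any(simulate_tool_call(rng) for _ in range(replicas))
        for _ in range(trials)
    )
    return successes / trials

rate_single = estimate_success_rate(replicas=1)
rate_double = estimate_success_rate(replicas=2)
print(f"Simulated success rate, 1 replica:  {rate_single:.3f}")
print(f"Simulated success rate, 2 replicas: {rate_double:.3f}")
```

The estimates should land close to 0.80 and 0.96 respectively. Real runs can differ from this, since the agent adds its own sources of failure and latency beyond the simulated tool.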
