# Fault Tolerance in LangGraph

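This notebook demonstrates fault tolerance in agentic workflows built with LangGraph. With a checkpointer attached, LangGraph saves the workflow state after each step, so a run that crashes or is interrupted can resume from the last checkpoint instead of starting over.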

**Steps covered:**
1. Define the state and steps for the workflow.
2. Simulate a crash/interruption.
3. Resume the workflow to show fault tolerance.

---

In [None]:
# Import required libraries for LangGraph fault tolerance demo
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import InMemorySaver
from typing import TypedDict
import time  # Used to simulate a long-running process

## 1. Import Libraries

We import `StateGraph` and `END` to build the graph, `InMemorySaver` to checkpoint state, `TypedDict` to type the state, and `time` to simulate a long-running step.

In [None]:
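# Define the state for the workflow
# The checkpointer saves this state after each step, which enables fault tolerance
class CrashState(TypedDict):
    input: str
    step1: str
    step2: str
    done: bool  # Written by the final step; declared here so the update is part of the state schema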

## 2. Define the State

We define `CrashState`, a `TypedDict` whose keys record the input and the progress of each step. This is the state that the checkpointer saves as the workflow runs.
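
Each node returns only the keys it wants to change, and LangGraph merges that partial update into the stored state. The sketch below imitates that merge in plain Python for keys without a custom reducer, where a new value simply overwrites the old one. `SketchState` and `apply_update` are hypothetical names used for illustration and are not part of LangGraph.

```python
from typing import TypedDict


class SketchState(TypedDict, total=False):
    # Hypothetical stand-in for CrashState; total=False lets keys be absent early on
    input: str
    step1: str
    step2: str


def apply_update(state: SketchState, update: dict) -> SketchState:
    """Return a new state with the node's partial update laid over the old values."""
    merged = dict(state)
    merged.update(update)
    return merged  # type: ignore[return-value]


state: SketchState = {"input": "start"}
state = apply_update(state, {"step1": "done"})  # the kind of update step_1 contributes
state = apply_update(state, {"step2": "done"})  # the kind of update step_2 contributes
print(state)
```

Under this model a node does not need to copy unchanged keys such as `input` forward; keys it leaves out keep their previous values.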

In [None]:
# Define workflow steps

step_2_attempts = 0  # Counts Step 2 runs so that only the first attempt hangs


def step_1(state: CrashState) -> CrashState:
    # Step 1: Mark as done and pass input forward
    print("✅ Step 1 executed")
    return {"step1": "done", "input": state["input"]}


def step_2(state: CrashState) -> CrashState:
    # Step 2: Hang on the first attempt to demonstrate interruption.
    # On resume this step runs again, so later attempts must complete normally.
    global step_2_attempts
    step_2_attempts += 1
    if step_2_attempts == 1:
        print("⏳ Step 2 hanging... now manually interrupt from the notebook toolbar (STOP button)")
        time.sleep(1000)  # Simulate long-running hang
    print("✅ Step 2 executed")
    return {"step2": "done"}


def step_3(state: CrashState) -> CrashState:
    # Step 3: Final step after resuming
    print("✅ Step 3 executed")
    return {"done": True}

---
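**Note:** To simulate a crash, manually stop execution during Step 2 using the notebook's STOP button. When the graph is invoked again with the same `thread_id`, it resumes from the checkpoint saved after Step 1 and re-runs Step 2, the step that was in progress when execution stopped.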

## 3. Define Workflow Steps

We create three steps:
- **Step 1:** Executes normally.
- **Step 2:** Simulates a long-running process (hang) to demonstrate interruption.
- **Step 3:** Final step after resuming.

In [None]:
# Build the workflow graph and set up checkpointing
builder = StateGraph(CrashState)
builder.add_node("step_1", step_1)
builder.add_node("step_2", step_2)
builder.add_node("step_3", step_3)

builder.set_entry_point("step_1")
builder.add_edge("step_1", "step_2")
builder.add_edge("step_2", "step_3")
builder.add_edge("step_3", END)

# Use in-memory checkpointing for demonstration
checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)

## 4. Build the Graph

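We construct the workflow graph, add nodes and edges, and compile it with an `InMemorySaver` checkpointer. This checkpointer keeps checkpoints in the memory of the current Python process, so they survive a kernel interrupt but are lost if the kernel restarts.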
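
To make the role of the checkpointer concrete, here is a toy, pure-Python sketch of the idea: save the state after every completed step, and on a later run skip the steps that already finished. It rests on simplifying assumptions (a linear list of steps and a dict as the store). `run_with_checkpoints`, `store`, `step_a`, and `flaky_step` are hypothetical names, and this is not how LangGraph implements checkpointing internally.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Step = Tuple[str, Callable[[State], State]]

# Hypothetical in-memory store: thread_id -> (index of next step to run, saved state)
store: Dict[str, Tuple[int, State]] = {}


def run_with_checkpoints(thread_id: str, steps: List[Step], initial: State) -> State:
    """Run steps in order, checkpointing after each one; resume if a checkpoint exists."""
    next_index, state = store.get(thread_id, (0, dict(initial)))
    for index in range(next_index, len(steps)):
        _name, func = steps[index]
        state = {**state, **func(state)}       # merge the step's partial update
        store[thread_id] = (index + 1, state)  # checkpoint only after the step succeeds
    return state


calls = {"flaky": 0}


def step_a(state: State) -> State:
    return {"step1": "done"}


def flaky_step(state: State) -> State:
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise RuntimeError("simulated crash")  # fails on the first attempt only
    return {"step2": "done"}


steps: List[Step] = [("step_a", step_a), ("flaky_step", flaky_step)]

try:
    run_with_checkpoints("thread-1", steps, {"input": "start"})
except RuntimeError:
    print("crashed; checkpoint:", store["thread-1"])

# The second call finds the checkpoint and starts at flaky_step, not step_a
final = run_with_checkpoints("thread-1", steps, {"input": "start"})
print("final:", final)
```

The key design point is that the checkpoint is written only after a step returns, so a step that fails partway through is retried in full on resume.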

In [None]:
# Run the graph and manually interrupt during Step 2 to simulate a crash
try:
    print("▶️ Running graph: Please manually interrupt during Step 2...")
    graph.invoke({"input": "start"}, config={"configurable": {"thread_id": "thread-1"}})
except KeyboardInterrupt:
    print("❌ Kernel manually interrupted (crash simulated).")

## 5. Simulate Crash/Interruption

We run the graph and manually interrupt execution during Step 2 to simulate a crash. By that point the checkpointer has already saved the state produced by Step 1 under the thread ID `thread-1`, which is what makes recovery possible.

In [None]:
# Resume the graph after interruption to show fault tolerance
print("\n🔁 Re-running the graph to demonstrate fault tolerance...")
final_state = graph.invoke(None, config={"configurable": {"thread_id": "thread-1"}})
print("\n✅ Final State:", final_state)

## 6. Resume and Demonstrate Fault Tolerance

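After the interruption, we invoke the graph again with `None` as the input and the same `thread_id`. Passing `None` tells LangGraph to continue from the saved checkpoint rather than start a new run, so Step 1 is not executed again.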

---
### Summary

This notebook showed how LangGraph can recover from interruptions using checkpointing. You can adapt this pattern for more complex agentic workflows to ensure reliability and fault tolerance.