# **Pydantic Data Validation**

In [1]:
from langgraph.graph import StateGraph, START, END
from pydantic import BaseModel

In [None]:
class State(BaseModel):
    name: str

In [4]:
# Node function: returns a partial update for the "name" field
def example_node(state: State):
    return {"name": "Hello"}

In [7]:
# Build the StateGraph with the Pydantic model as its state schema
builder = StateGraph(State)
builder.add_node("example_node", example_node)

builder.add_edge(START, "example_node")
builder.add_edge("example_node", END)

graph = builder.compile()

In [9]:
graph.invoke({"name":123})

ValidationError: 1 validation error for State
name
  Input should be a valid string [type=string_type, input_value=123, input_type=int]
    For further information visit https://errors.pydantic.dev/2.10/v/string_type
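
The same check can be reproduced without LangGraph, which makes the error easier to inspect. The sketch below is a minimal illustration that assumes Pydantic v2. It redefines the `State` model from the cells above, and `validate_name` is a hypothetical helper that reads the structured error list from `ValidationError.errors()`.

```python
from pydantic import BaseModel, ValidationError


class State(BaseModel):
    name: str


def validate_name(value):
    """Return (ok, error_types) for a candidate `name` value (hypothetical helper)."""
    try:
        State(name=value)
        return True, []
    except ValidationError as exc:
        # Each entry in errors() is a dict that includes "type", "loc", and "msg" keys.
        return False, [err["type"] for err in exc.errors()]


print(validate_name("Alice"))
print(validate_name(123))
```

The `string_type` code returned for `123` is the same one shown in the traceback, so a caller can branch on it instead of parsing the message text.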

## **Notes**

### **Introduction to Pydantic**

**Pydantic** is a powerful Python library that uses **type hints** for **data validation, parsing, and serialization**.
It is widely used in web frameworks such as **FastAPI** and in LLM tooling such as **LangChain** and **LangGraph** because it provides:

* Type validation at runtime, with an opt-in strict mode that disables type coercion
* Automatic serialization/deserialization
* Easy integration with APIs and JSON data
* Cleaner code for defining structured states or schemas

Pydantic models are ideal for **LLM state management**, **RAG data structures**, and **LangGraph node definitions**.
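
To make the validation and serialization points concrete, here is a small sketch assuming Pydantic v2. The `LaxQuery` and `StrictQuery` models are hypothetical names used only for illustration. They show that the default mode coerces compatible input, while `ConfigDict(strict=True)` rejects it.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LaxQuery(BaseModel):
    text: str
    top_k: int


class StrictQuery(BaseModel):
    # Opt in to strict mode: no coercion between types.
    model_config = ConfigDict(strict=True)

    text: str
    top_k: int


payload = {"text": "hello", "top_k": "5"}

# Default (lax) mode converts the numeric string "5" to the int 5.
lax = LaxQuery.model_validate(payload)

# Strict mode rejects the same payload instead of converting it.
try:
    StrictQuery.model_validate(payload)
    strict_rejected = False
except ValidationError:
    strict_rejected = True

print(lax.model_dump())
print(strict_rejected)
```

Whether lax or strict mode is appropriate depends on how much the upstream data can be trusted; the default is lax.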

---

### **Defining a Pydantic Model**

Here’s a simple example of defining a chatbot state with **Pydantic**:

```python
from pydantic import BaseModel, Field
from typing import List, Optional

class ChatState(BaseModel):
    user_messages: List[str] = Field(default_factory=list)
    ai_responses: List[str] = Field(default_factory=list)
    context: Optional[str] = None
```

**Features:**

* Fields are validated automatically when the model is instantiated.
* Default values use `Field(default_factory=...)` for mutable types like lists.
* Invalid input types raise a `ValidationError` that names the offending field.
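
The features listed above can be checked with a short sketch, assuming Pydantic v2. It repeats the `ChatState` model so the block runs on its own.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class ChatState(BaseModel):
    user_messages: List[str] = Field(default_factory=list)
    ai_responses: List[str] = Field(default_factory=list)
    context: Optional[str] = None


first = ChatState()
second = ChatState()
first.user_messages.append("Hi there")

# default_factory builds a new list per instance, so `second` is unaffected.
print(first.user_messages, second.user_messages)

# A plain string is not accepted where a list of strings is expected.
try:
    ChatState(user_messages="not a list")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```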

---

### **Example: Using Pydantic in RAG Pipelines**

In RAG (Retrieval-Augmented Generation) pipelines, Pydantic helps enforce **data consistency** between retrieval, generation, and evaluation stages.

```python
from typing import List, Optional

from pydantic import BaseModel, Field

class RAGState(BaseModel):
    query: str
    retrieved_docs: List[str] = Field(default_factory=list)
    answer: Optional[str] = None

def retrieve_docs(state: RAGState) -> RAGState:
    # Simulated retrieval step
    state.retrieved_docs = ["Document 1 text", "Document 2 text"]
    return state

def generate_answer(state: RAGState) -> RAGState:
    # Combine retrieved docs into an answer
    state.answer = f"Answer based on: {', '.join(state.retrieved_docs)}"
    return state

# Example run
rag_state = RAGState(query="Explain multi-hop reasoning")
rag_state = retrieve_docs(rag_state)
rag_state = generate_answer(rag_state)

print(rag_state.model_dump_json(indent=2))
```

**Output:**

```json
{
  "query": "Explain multi-hop reasoning",
  "retrieved_docs": ["Document 1 text", "Document 2 text"],
  "answer": "Answer based on: Document 1 text, Document 2 text"
}
```

**Why this matters:**

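* Validates the initial state when `RAGState` is constructed, so malformed input fails before the pipeline runs.
* Catches type mismatches at construction time. Later attribute assignments such as `state.answer = ...` are not re-validated unless `validate_assignment=True` is set in the model config.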
* Compatible with JSON storage and APIs.
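
Because the pipeline functions above assign to attributes after the model is built, it helps to see how assignment behaves. The following is a sketch assuming Pydantic v2; `LooseState` and `CheckedState` are hypothetical names used only to contrast the default behavior with `validate_assignment=True`.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class LooseState(BaseModel):
    query: str
    answer: Optional[str] = None


class CheckedState(BaseModel):
    # Re-run field validation whenever an attribute is assigned.
    model_config = ConfigDict(validate_assignment=True)

    query: str
    answer: Optional[str] = None


loose = LooseState(query="q")
loose.answer = 123  # accepted silently: assignment is not validated by default

checked = CheckedState(query="q")
try:
    checked.answer = 123
    assignment_rejected = False
except ValidationError:
    assignment_rejected = True

print(type(loose.answer).__name__, assignment_rejected)
```

Enabling `validate_assignment` adds a validation pass to every attribute write, so it is a trade-off between safety and overhead.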

---

### **Using Pydantic in LangGraph**

In **LangGraph**, a single state object is passed from node to node, and each node returns the fields it wants to update.
Using a **Pydantic** model as the state schema gives the graph a **typed, validated, and serializable** state.

**LangGraph State Using Pydantic**

```python
from typing import Optional

from langgraph.graph import StateGraph, START, END
from pydantic import BaseModel

class ChatGraphState(BaseModel):
    user_input: str
    context: Optional[str] = None
    ai_reply: Optional[str] = None
```

```python
def process_input(state: ChatGraphState):
    # Preprocess or analyze input; return only the fields that changed
    return {"context": f"User said: {state.user_input}"}

def generate_response(state: ChatGraphState):
    return {"ai_reply": f"AI Response to: {state.context}"}

builder = StateGraph(ChatGraphState)
builder.add_node("InputProcessor", process_input)
builder.add_node("Responder", generate_response)
builder.add_edge(START, "InputProcessor")
builder.add_edge("InputProcessor", "Responder")
builder.add_edge("Responder", END)

# Compile the builder into a runnable graph
graph = builder.compile()

**Advantages:**

* Graph input is validated against the schema at runtime, so a malformed state raises a `ValidationError` before node logic runs.
* Each node can rely on the schema for consistency.
* Easy to serialize graph state for logs or debugging.
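
The serialization point can be illustrated without running a graph. This sketch assumes Pydantic v2 and repeats the `ChatGraphState` model; it round-trips a state through JSON, as one might do for a log line or a debugging snapshot.

```python
from typing import Optional

from pydantic import BaseModel


class ChatGraphState(BaseModel):
    user_input: str
    context: Optional[str] = None
    ai_reply: Optional[str] = None


state = ChatGraphState(user_input="Hello", context="User said: Hello")

# Serialize to a JSON string, e.g. for a log line or a debugging snapshot.
snapshot = state.model_dump_json()

# Rebuild (and re-validate) the state from the stored JSON.
restored = ChatGraphState.model_validate_json(snapshot)

print(snapshot)
print(restored == state)
```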

---

### **Example with Validation and Custom Logic**

Pydantic allows adding **custom validators** for extra control.

```python
from pydantic import BaseModel, ValidationError, field_validator

class ValidatedState(BaseModel):
    query: str
    top_k: int = 5

    @field_validator("top_k")
    @classmethod
    def validate_top_k(cls, v: int) -> int:
        # Reject values outside the inclusive range 1..100
        if not (1 <= v <= 100):
            raise ValueError("top_k must be between 1 and 100")
        return v

# Validation example
try:
    bad_state = ValidatedState(query="LangGraph intro", top_k=500)
except ValidationError as e:
    print("Validation Error:", e)
```

**Output:**

```
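Validation Error: 1 validation error for ValidatedState
top_k
  Value error, top_k must be between 1 and 100 [type=value_error, input_value=500, input_type=int]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error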
```
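
For a simple range check like the one in the validator above, Pydantic's declarative constraints are an alternative worth knowing. The sketch below assumes Pydantic v2; `BoundedState` and `top_k_error_type` are hypothetical names introduced for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class BoundedState(BaseModel):
    query: str
    # ge/le declare an inclusive range, matching the 1 <= v <= 100 check.
    top_k: int = Field(default=5, ge=1, le=100)


def top_k_error_type(value):
    """Return the Pydantic error type for a top_k value, or None if it is valid."""
    try:
        BoundedState(query="LangGraph intro", top_k=value)
        return None
    except ValidationError as exc:
        return exc.errors()[0]["type"]


print(top_k_error_type(50))
print(top_k_error_type(500))
print(top_k_error_type(0))
```

Declarative constraints also appear in the generated JSON schema, whereas a custom validator is better suited to rules that depend on arbitrary logic.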

---

### **Pydantic vs. Data Classes vs. TypedDict**

| Feature                                  | **TypedDict**           | **DataClass**               | **Pydantic Model**                           |
| ---------------------------------------- | ----------------------- | --------------------------- | -------------------------------------------- |
| **Runtime Validation**                   | ❌ None                  | ❌ None                      | ✅ Yes                                        |
| **Serialization / JSON Support**         | ✅ Native dict           | ✅ via `asdict()`            | ✅ via `.model_dump()` / `.model_dump_json()` |
| **Default Factories**                    | ❌ No default values     | ✅ Supported                 | ✅ Supported                                  |
| **Custom Validators**                    | ❌                       | ⚠️ Manual (`__post_init__`) | ✅ Built-in                                   |
| **Performance**                          | ⚡ Fastest (lightweight) | ⚡ Fast                      | 🧠 Slightly slower (runtime validation)      |
| **Methods Support**                      | ❌                       | ✅                           | ✅                                            |
| **Use Case**                             | Lightweight schemas     | Structured state containers | Strict validation, API/state management      |
| **Integration in LangChain / LangGraph** | ✅ Simple                | ✅ Complex logic             | ✅ Best for production workflows              |
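
The "Runtime Validation" row is the one that most often matters in practice. As a minimal sketch (the three class names are hypothetical), the same wrong-typed value is passed to each kind of schema:

```python
from dataclasses import dataclass
from typing import TypedDict

from pydantic import BaseModel, ValidationError


class DictState(TypedDict):
    name: str


@dataclass
class DataState:
    name: str


class ModelState(BaseModel):
    name: str


# TypedDict and dataclass annotations are not checked at runtime,
# so an int is stored where a str was declared.
dict_state: DictState = {"name": 123}
data_state = DataState(name=123)

# The Pydantic model rejects the same value when it is constructed.
try:
    ModelState(name=123)
    model_rejected = False
except ValidationError:
    model_rejected = True

print(dict_state["name"], data_state.name, model_rejected)
```

A static type checker would flag the first two assignments, but nothing stops them at runtime.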

---

### **When to Use Pydantic**

| Scenario                                                       | Recommendation                          |
| -------------------------------------------------------------- | --------------------------------------- |
| **You need strict data validation**                            | ✅ Use Pydantic                          |
| **You want fast, flexible structures**                         | ⚡ Use Data Classes                      |
| **You’re defining lightweight dict-like states**               | 🧩 Use TypedDict                        |
| **You’re exposing APIs (FastAPI, Flask)**                      | ✅ Use Pydantic (built into FastAPI; wired in manually with Flask) |
| **You’re building RAG / LangGraph workflows with persistence** | ✅ Use Pydantic for clean state handling |

---

### **Final Example: RAG with Pydantic + LangGraph Integration**

```python
from pydantic import BaseModel, Field
from langgraph.graph import StateGraph, START, END

class RAGGraphState(BaseModel):
    query: str
    retrieved_docs: list[str] = Field(default_factory=list)
    answer: str = ""

def retrieve_node(state: RAGGraphState):
    # Mock retrieval; return only the updated field
    return {"retrieved_docs": ["Doc A", "Doc B"]}

def generate_node(state: RAGGraphState):
    return {"answer": f"Generated answer using {len(state.retrieved_docs)} docs"}

builder = StateGraph(RAGGraphState)
builder.add_node("Retriever", retrieve_node)
builder.add_node("Generator", generate_node)
builder.add_edge(START, "Retriever")
builder.add_edge("Retriever", "Generator")
builder.add_edge("Generator", END)

# Compile and run the graph with an initial state
graph = builder.compile()
result = graph.invoke({"query": "Explain multi-hop reasoning"})
```

Here:

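* LangGraph validates the input against `RAGGraphState`, so a malformed state fails before node logic runs. According to the LangGraph documentation, values returned by nodes are not validated, so nodes remain responsible for returning well-typed updates.
* State fields are typed, and a state can be serialized with `model_dump()` or `model_dump_json()`.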
* Easy to log or store states for debugging or metrics.
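
If node outputs need checking as well, one option is to re-validate after each update. The sketch below is plain Python under stated assumptions: it assumes Pydantic v2, and `apply_update` is a hypothetical helper, not part of LangGraph or Pydantic and not a description of how LangGraph merges state internally.

```python
from pydantic import BaseModel, Field, ValidationError


class RAGGraphState(BaseModel):
    query: str
    retrieved_docs: list[str] = Field(default_factory=list)
    answer: str = ""


def apply_update(state: RAGGraphState, update: dict) -> RAGGraphState:
    """Merge a node's partial update into the state and re-validate the result.

    Hypothetical helper for illustration only.
    """
    return RAGGraphState.model_validate({**state.model_dump(), **update})


state = RAGGraphState(query="Explain multi-hop reasoning")
state = apply_update(state, {"retrieved_docs": ["Doc A", "Doc B"]})
state = apply_update(
    state, {"answer": f"Generated answer using {len(state.retrieved_docs)} docs"}
)
print(state.answer)

# A malformed update is rejected instead of silently corrupting the state.
try:
    apply_update(state, {"retrieved_docs": "Doc C"})
    bad_update_rejected = False
except ValidationError:
    bad_update_rejected = True
print(bad_update_rejected)
```

Re-validating the whole state on every step costs more than a plain dict merge, so this pattern fits best where correctness matters more than throughput.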

---

### **Summary**

* **Pydantic** is the only one of the three options that validates state at runtime, which makes it the safest choice for **RAG**, **LangGraph**, or **multi-agent** states.
* It combines **type checking, validation, and serialization** in a single model, which suits production-grade pipelines.
* **DataClasses** offer lower overhead and simplicity when runtime validation is not needed.
* **TypedDict** is useful for lightweight or prototype-level state handling.

