## **Introduction to Pydantic, Data Validation & Structured Outputs for LLMs**

**Pydantic** is a popular Python library for data validation and parsing based on Python type annotations (type hints). It is often compared to `dataclasses`, a standard-library module whose `@dataclass` decorator automatically generates special methods such as `__init__()` and `__repr__()` for user-defined classes.

```python
# Without dataclasses
class Person:
  def __init__(self, name: str, age: int):
    self.name = name
    self.age = age
```

```python
# With dataclasses
from dataclasses import dataclass

@dataclass
class Person:
  name: str
  age: int
```

However, Pydantic goes further by validating and parsing data, ensuring that the data structures adhere to the specified types and constraints.

Pydantic is particularly useful for building APIs, LLM applications (including agents and RAG systems), and any structured input/output where strict data validation is crucial.

```python
# With Pydantic
from pydantic import BaseModel

class Person(BaseModel):
  name: str
  age: int
```

Highlights from [Pydantic's documentation](https://docs.pydantic.dev/latest/concepts/models/):

> One of the primary ways of defining schema in Pydantic is via models. Models are simply classes which inherit from pydantic.BaseModel and define fields as annotated attributes.

> Models share many similarities with Python's dataclasses, but have been designed with some subtle-yet-important differences that streamline certain workflows related to validation, serialization, and JSON schema generation.
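As a minimal sketch of the serialization and schema generation mentioned in the quote, the snippet below round-trips the illustrative `Person` model through JSON. It assumes Pydantic v2, where the relevant methods are `model_dump`, `model_dump_json`, `model_validate_json`, and `model_json_schema`.

```python
from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int


person = Person(name="Sam", age=10)

as_dict = person.model_dump()                    # plain Python dict
as_json = person.model_dump_json()               # JSON string
restored = Person.model_validate_json(as_json)   # parse and validate the JSON again
schema = Person.model_json_schema()              # JSON Schema as a Python dict

print(as_dict)
print(as_json)
print(schema["required"])
```

A plain dataclass has no built-in equivalent of the last two calls, which is one of the differences the documentation alludes to.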

In [1]:
# @title Install dependencies

# Install the latest versions of Pydantic, Anthropic and OpenAI
!pip install -qqU pydantic
!pip install -qqU anthropic[vertex] # For Claude Sonnet 3.5 using Google Vertex AI API
!pip install -qqU openai
!pip install -qqU beautifulsoup4==4.12.2

import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
os.environ['ANTHROPIC_API_KEY'] = userdata.get('ANTHROPIC_API_KEY')  # Enable to use Claude

### **Pydantic vs. Dataclasses**

While `dataclasses` provide a convenient way to generate boilerplate code for classes, they do not enforce the declared types at runtime.

This means that you can assign incorrect types to fields without raising an error, which could lead to issues later when those fields are used.

On the other hand, Pydantic enforces type validation and raises a `ValidationError` if the data does not match the expected types, making it a safer choice when data integrity is critical.

In [2]:
from dataclasses import dataclass

@dataclass
class Person:
	name: str
	age: int  # The expected type is 'int', but no enforcement occurs.

# No error is raised, even though 'age' is passed as a string.
Person(name="Sam", age="10")

Person(name='Sam', age='10')

In [3]:
# This will raise a TypeError because 'age' is stored as a string, not an integer.
Person(name="Sam", age="10").age + 1

TypeError: can only concatenate str (not "int") to str

In [4]:
# No error is raised, even though 'age' is not a number.
Person(name="Stella", age="twelve")

Person(name='Stella', age='twelve')

In [5]:
from pydantic import BaseModel

class Person(BaseModel):
	name: str
	age: int # The expected type is 'int', and Pydantic enforces this.

# Pydantic automatically converts the string '10' to the integer 10.
Person(name="Sam", age="10")

Person(name='Sam', age=10)

In [6]:
# This works as expected because 'age' is correctly stored as an integer.
Person(name="Sam", age="10").age + 1

11

In [7]:
# Pydantic raises a ValidationError because 'twelve' cannot be converted to an integer.
Person(name="Sam", age="twelve")

ValidationError: 1 validation error for Person
age
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='twelve', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/int_parsing

**Note:** When using `dataclasses`, a value of any type can be passed to a field, even if it doesn't match the annotation, and no error is raised. This can lead to issues later when the field is used, as shown in the example where adding 1 to the `age` field results in a `TypeError` because the field was stored as a string.

In contrast, Pydantic automatically validates and converts types. For example, it converts the string "10" to the integer 10 and raises a ValidationError when the input cannot be converted, such as when trying to set age to "twelve". This makes Pydantic more robust for ensuring data integrity.
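When a `ValidationError` needs to be handled programmatically rather than printed, its `errors()` method returns one dictionary per failed field. The sketch below assumes Pydantic v2 and reuses the illustrative `Person` model from the cells above.

```python
from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: str
    age: int


try:
    Person(name="Sam", age="twelve")
except ValidationError as exc:
    errors = exc.errors()  # list of dicts, one per failed field
    for err in errors:
        # 'loc' is a tuple path to the field; 'type' is a machine-readable error code
        print(err["loc"], err["type"], err["msg"])
```

The `type` code (`int_parsing` here) matches the one shown in the printed error message, so it can be used to branch on specific failure kinds.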

### **Field Validators**

Pydantic provides several ways to validate data, one of which is through Field Validators. These are particularly useful when creating custom constraints to enforce specific validation rules on your data.


##### **Example 1:** Using Field Validators on a single field.

In [8]:
from pydantic import field_validator

class PrimeNumber(BaseModel):
    number: int

    @field_validator('number')
    @classmethod
    def number_must_be_prime(cls, value: int) -> int:
        if value <= 1:
            raise ValueError("Number must be greater than 1")
        for i in range(2, int(value**0.5) + 1):
            if value % i == 0:
                raise ValueError("Number is not prime")
        return value

In [9]:
try:
    prime = PrimeNumber(number=10)
except ValueError as e:
    print(f"Validation error: {e}")

Validation error: 1 validation error for PrimeNumber
number
  Value error, Number is not prime [type=value_error, input_value=10, input_type=int]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error


In [10]:
PrimeNumber(number=7)

PrimeNumber(number=7)

##### **Example 2:** Using Field Validators on multiple fields.

In [11]:
from pydantic import ValidationInfo

class UserModel(BaseModel):
    name: str
    address: str

    @field_validator('address')
    @classmethod
    def address_must_contain_space(cls, v: str) -> str:
        if ' ' not in v:
            raise ValueError('must contain a space')
        return v.title()

    # you can select multiple fields, or use '*' to select all fields
    @field_validator('address', 'name')
    @classmethod
    def check_if_longer_than_5(cls, v: str, info: ValidationInfo) -> str:
        if isinstance(v, str):
            # info.field_name is the name of the field being validated
            is_greater_than_5 = len(v) > 5
            assert is_greater_than_5, f'{info.field_name} must be longer than 5 characters.'
        return v

In [44]:
try:
  UserModel(name="David4", address="123 Main St")
except ValueError as e:
    print(f"Validation error: {e}")
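The call above satisfies both validators, so nothing is printed. As a sketch of the failing case, the snippet below redefines the same `UserModel` so that it runs on its own and passes values that break both rules. The input values are made up for illustration.

```python
from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator


class UserModel(BaseModel):
    name: str
    address: str

    @field_validator('address')
    @classmethod
    def address_must_contain_space(cls, v: str) -> str:
        if ' ' not in v:
            raise ValueError('must contain a space')
        return v.title()

    @field_validator('address', 'name')
    @classmethod
    def check_if_longer_than_5(cls, v: str, info: ValidationInfo) -> str:
        assert len(v) > 5, f'{info.field_name} must be longer than 5 characters.'
        return v


# A valid input: the address validator also title-cases the value.
ok = UserModel(name="David4", address="123 main st")
print(ok.address)

# An invalid input: 'name' is too short and 'address' has no space.
failed_fields = []
try:
    UserModel(name="Sam", address="Main")
except ValidationError as exc:
    failed_fields = [err["loc"][0] for err in exc.errors()]

print(failed_fields)
```

Pydantic collects the failures for all fields into a single `ValidationError` instead of stopping at the first invalid field.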

### **Structured Outputs for LLMs**

From the inception of OpenAI's Function Calling to JSON mode, one thing has been consistent: the need for structured outputs. These outputs are useful building blocks for developers aiming to create reliable applications by ensuring that model-generated responses conform to a particular schema. This need has led to further efforts by the company, resulting in the recent release of a feature called ["Structured Outputs"](https://openai.com/index/introducing-structured-outputs-in-the-api/). Other closed-source and open-source model providers, including Anthropic's Claude series and Meta's Llama, are also addressing these requirements.

It is important to note that each LLM family has its own preferences for defining structured input and output. For example, Claude tends to use XML, while GPT favors Markdown and JSON.

Fortunately, both Anthropic and OpenAI SDKs accept JSON schemas as tools, though there are slight differences in their fields. Let's explore how this is implemented.
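To make the field differences concrete, here is a small sketch that wraps one JSON schema in both tool formats. The helper names `to_openai_tool` and `to_anthropic_tool` are hypothetical and are not part of either SDK. The dictionary shapes follow the hand-written examples in the next sections.

```python
def to_openai_tool(name: str, description: str, schema: dict) -> dict:
    """Wrap a JSON schema in the shape used by OpenAI's `tools` parameter."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": schema,  # OpenAI calls the schema "parameters"
        },
    }


def to_anthropic_tool(name: str, description: str, schema: dict) -> dict:
    """Wrap a JSON schema in the shape used by Anthropic's `tools` parameter."""
    return {
        "name": name,
        "description": description,
        "input_schema": schema,  # Anthropic calls the schema "input_schema"
    }


weather_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string", "description": "The location to get the weather for"},
        "unit": {"type": "string", "enum": ["F", "C"]},
    },
    "required": ["location", "unit"],
}

description = "Fetches the weather in the given location"
openai_tool = to_openai_tool("get_weather", description, weather_schema)
anthropic_tool = to_anthropic_tool("get_weather", description, weather_schema)

print(list(openai_tool["function"]))
print(list(anthropic_tool))
```

The schema itself is identical in both cases. Only the wrapper around it changes.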

#### **Structured Output with JSON Schema**

While JSON Schema is machine-friendly, writing it by hand is error-prone and hard to maintain. Pydantic models express the same structure more concisely and are easier to maintain.

##### **OpenAI Example**

In [13]:
import openai

client = openai.OpenAI()

In [14]:
get_weather = {'type': 'function',
 'function': {'name': 'get_weather',
  'strict': True,
  'parameters': {'description': 'Fetches the weather in the given location',
   'properties': {'location': {'description': 'The location to get the weather for',
     'type': 'string'},
    'unit': {'description': 'The unit to return the temperature in',
     'enum': ['F', 'C'],
     'type': 'string'}},
   'required': ['location', 'unit'],
   'title': 'get_weather',
   'type': 'object',
   'additionalProperties': False},
  'description': 'Fetches the weather in the given location'}}



get_people = {'type': 'function',
 'function': {'name': 'people_list',
  'strict': True,
  'parameters': {'$defs': {'Person': {'description': 'Extract information about a person',
     'properties': {'name': {'description': 'The name of the person',
       'title': 'Name',
       'type': 'string'},
      'age': {'description': 'The age of the person',
       'title': 'Age',
       'type': 'integer'},
      'city': {'description': 'The city they live in',
       'enum': ['Paris', 'London', 'New York'],
       'title': 'City',
       'type': 'string'}},
     'required': ['name', 'age', 'city'],
     'title': 'Person',
     'type': 'object',
     'additionalProperties': False}},
   'properties': {'people': {'items': {'$ref': '#/$defs/Person'},
     'title': 'People',
     'type': 'array'}},
   'required': ['people'],
   'title': 'People',
   'type': 'object',
   'description': 'Extracts a list of people and their information',
   'additionalProperties': False}}}

In [15]:
tools = [get_weather, get_people]

In [53]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the weather in Paris in Celsius?"}],
    tools=tools,
)

In [54]:
response.choices[0].message.tool_calls[0].function

Function(arguments='{"location":"Paris","unit":"C"}', name='get_weather')

In [18]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Noah Lyles, 27, won the gold medal at the 2024 Summer Olympics Games in Paris. Léon Marchand, the 22 year old from Toulouse, France won the highest number of medals, with 4 gold medals and 1 bronze."}],
    tools=tools,
)

In [19]:
response.choices[0].message.tool_calls[0].function.arguments

'{"people": [{"name": "Noah Lyles", "age": 27, "city": "Paris"}, {"name": "Léon Marchand", "age": 22, "city": "Paris"}]}'
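The tool-call `arguments` come back as a JSON string, so they still need to be parsed before use. One option, sketched below under the assumption of Pydantic v2, is to validate the string directly with `model_validate_json`. The `Person` and `People` models here are stand-ins that mirror the hand-written schema above.

```python
from typing import List, Literal

from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int
    city: Literal["Paris", "London", "New York"]


class People(BaseModel):
    people: List[Person]


# The arguments string returned by the tool call above.
arguments = '{"people": [{"name": "Noah Lyles", "age": 27, "city": "Paris"}, {"name": "Léon Marchand", "age": 22, "city": "Paris"}]}'

parsed = People.model_validate_json(arguments)  # raises ValidationError on bad data
print([person.name for person in parsed.people])
```

Note that `city` is restricted to three values by the schema, so the returned city can only ever be one of those three, whatever the input text says.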

##### **Claude Example**

In [None]:
import anthropic

client = anthropic.Anthropic()

In [None]:
get_weather = {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "input_schema": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["C", "F"],
                        "description": "The unit of temperature, either \"C\" (Celsius) or \"F\" (Fahrenheit)"
                    }
                },
                "required": ["location"]
            }
        }


get_people = {
    "name": "people_list",
    "description": "Extracts a list of people and their information",
    "input_schema": {
        "type": "object",
        "properties": {
            "people": {
                "type": "array",
                "description": "A list of people",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": "The name of the person"
                        },
                        "age": {
                            "type": "integer",
                            "description": "The age of the person"
                        },
                        "city": {
                            "type": "string",
                            "enum": ["Paris", "London", "New York"],
                            "description": "The city they live in"
                        }
                    },
                    "required": ["name", "age", "city"]
                }
            }
        },
        "required": ["people"]
    }
}

In [None]:
tools = [get_weather, get_people]

In [None]:
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What is the weather in Paris in Celsius?"}]
)

print(response)
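In the Anthropic Messages API, `response.content` is a list of content blocks, and a tool call appears as a block whose `type` is `"tool_use"` with `name` and `input` attributes. The sketch below pulls those out. `extract_tool_calls` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the SDK's block objects so that the example runs without an API key.

```python
from types import SimpleNamespace


def extract_tool_calls(content_blocks):
    """Return (name, input) pairs for every tool_use block in a response."""
    return [
        (block.name, block.input)
        for block in content_blocks
        if getattr(block, "type", None) == "tool_use"
    ]


# Stand-in for `response.content`; a real response holds SDK block objects.
fake_content = [
    SimpleNamespace(type="text", text="I'll check the weather."),
    SimpleNamespace(
        type="tool_use",
        id="toolu_example",  # made-up id for illustration
        name="get_weather",
        input={"location": "Paris", "unit": "C"},
    ),
]

tool_calls = extract_tool_calls(fake_content)
print(tool_calls)
```

Unlike the OpenAI response, where the arguments arrive as a JSON string, the `input` of a `tool_use` block is already a Python dictionary.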

#### **Pydantic is *All You Need***

Pydantic provides better structure, easier maintenance, and robust data validation out of the box.

In [20]:
from enum import Enum
from pydantic import BaseModel, Field
from typing import Literal

class Unit(str, Enum):
    F = "F"
    C = "C"

class Weather(BaseModel):
  """Fetches the weather in the given location"""
  location: str = Field(description="The location to get the weather for")
  unit: Literal["F", "C"] = Field(description="The unit to return the temperature in")

In [60]:
from pydantic import BaseModel, Field
from typing import Literal, List

class Person(BaseModel):
  """Extract information about a person"""
  name: str = Field(description="The name of the person")
  age: int = Field(description="The age of the person")
  city: Literal["Paris", "London", "New York"] = Field(description="The city they live in")

class People(BaseModel):
  people: List[Person] = Field(description="Extracts a list of people and their information")
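These models can also produce the JSON schema that was written by hand earlier. A minimal sketch, assuming Pydantic v2's `model_json_schema()`; the models are repeated so that the snippet runs on its own.

```python
from typing import List, Literal

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Extract information about a person"""
    name: str = Field(description="The name of the person")
    age: int = Field(description="The age of the person")
    city: Literal["Paris", "London", "New York"] = Field(description="The city they live in")


class People(BaseModel):
    people: List[Person] = Field(description="Extracts a list of people and their information")


schema = People.model_json_schema()

# Nested models are placed under "$defs" and referenced with "$ref".
print(schema["properties"]["people"]["items"])
# The class docstring becomes the schema description; Literal becomes an enum.
print(schema["$defs"]["Person"]["description"])
print(schema["$defs"]["Person"]["properties"]["city"]["enum"])
```

The generated dictionary can then be passed as the `parameters` (OpenAI) or `input_schema` (Anthropic) of a tool definition, so the schema and the validation logic stay in one place.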

##### **OpenAI Structured Outputs with Pydantic**

In [22]:
client = openai.OpenAI()

In [23]:
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "What is the weather in Paris in Celsius?"},
    ],
    response_format=Weather,
)

In [55]:
completion.choices[0].message.parsed


In [25]:
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Noah Lyles, 27, won the gold medal at the 2024 Summer Olympics Games in Paris. Léon Marchand, the 22 year old from Toulouse, France won the highest number of medals, with 4 gold medals and 1 bronze."},
    ],
    response_format=People,
)

In [26]:
completion.choices[0].message.parsed.model_dump()  # .dict() is deprecated in Pydantic v2

{'people': [{'name': 'Noah Lyles', 'age': 27, 'city': 'Paris'},
  {'name': 'Léon Marchand', 'age': 22, 'city': 'Paris'}]}

##### **Validating model output**

In [45]:
messages=[
# {"role": "system", "content": "Return your answers in CAPITAL LETTERS"},
{"role": "user", "content": "Noah Lyles, 27, won the gold medal at the 2024 Summer Olympics Games in Paris. Léon Marchand, the 22 year old from Toulouse, France won the highest number of medals, with 4 gold medals and 1 bronze."},
]

In [46]:
from pydantic import BaseModel, Field, field_validator
from typing import Literal, List

class Person(BaseModel):
  """Extract information about a person"""
  name: str = Field(description="The name of the person.")
  age: int = Field(description="The age of the person")
  city: Literal["Paris", "London", "New York"] = Field(description="The city they live in")

  @field_validator('name')
  @classmethod
  def validate_uppercase(cls, v: str) -> str:
    if not v.isupper():
        error_message = f"{v} must be an upper cased value"
        raise ValueError(error_message)
    return v

class People(BaseModel):
  people: List[Person] = Field(description="Extracts a list of people and their information.")

In [47]:
try:
  completion = client.beta.chat.completions.parse(
      model="gpt-4o-mini",
      messages=messages,
      response_format=People,
  )

  print(completion.choices[0].message.parsed.model_dump())
except ValueError as e:
    print(f"Validation error: {e}")

Validation error: 2 validation errors for People
people.0.name
  Value error, Noah Lyles must be an upper cased value [type=value_error, input_value='Noah Lyles', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error
people.1.name
  Value error, Léon Marchand must be an upper cased value [type=value_error, input_value='Léon Marchand', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error


### **Project: Generate synthetic data (Questions & Answers) from text**

Scrape any website of your choice and use it to create synthetic data, i.e., question-and-answer pairs generated from the scraped text.

In [61]:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://paulgraham.com/greatwork.html')
soup = BeautifulSoup(response.text, 'html.parser')

In [62]:
body = soup.find('body')
text = body.get_text(separator=' ', strip=True)

In [40]:
class QuestionAnswer(BaseModel):
  """
  Question & Answer Pairs from the text
  """
  question: str = Field(description="A question from the text")
  answer: str = Field(description="The answer to the question")
  source: str = Field(description="The ground truth sentence from where the answer is derived")


class Questions(BaseModel):
  """
  Generate questions & answers from the given text
  """
  questions: List[QuestionAnswer] = Field(description="A list of questions and answers from the text")

In [41]:
completion = client.beta.chat.completions.parse(
      model="gpt-4o-mini",
      messages=[
          {"role": "system", "content": "Generate questions and answers from the provided text"},
          {"role": "user", "content": text}
      ],
      response_format=Questions,
)

In [42]:
completion.choices[0].message.parsed.model_dump()

{'questions': [{'question': 'What are the three qualities that the work you choose should have, according to the text?',
   'answer': 'The work you choose needs to have three qualities: it has to be something you have a natural aptitude for, that you have a deep interest in, and that offers scope to do great work.',
   'source': 'The work you choose needs to have three qualities: it has to be something you have a natural aptitude for, that you have a deep interest in, and that offers scope to do great work.'},
  {'question': "What should you do if you're young and don't know what to work on?",
   'answer': 'You should take action and try lots of things, meet people, read books, and ask questions. You need to make yourself a big target for luck by being curious.',
   'source': 'So you need to make yourself a big target for luck, and the way to do that is to be curious. Try lots of things, meet lots of people, read lots of books, ask lots of questions.'},
  {'question': 'Why is it import
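Since the `source` field is meant to hold the sentence the answer was derived from, one possible sanity check (not part of the original notebook) is to keep only the pairs whose `source` actually appears in the scraped text. The helper `grounded_pairs` and the sample text below are hypothetical and only illustrate the idea.

```python
from typing import List

from pydantic import BaseModel


class QuestionAnswer(BaseModel):
    question: str
    answer: str
    source: str


def normalize(s: str) -> str:
    """Collapse runs of whitespace so line breaks don't defeat the match."""
    return " ".join(s.split())


def grounded_pairs(pairs: List[QuestionAnswer], text: str) -> List[QuestionAnswer]:
    """Keep only the pairs whose `source` appears verbatim in `text`."""
    haystack = normalize(text)
    return [qa for qa in pairs if normalize(qa.source) in haystack]


# Made-up sample data for illustration.
sample_text = "You need to be curious.  Try lots of things,\nmeet lots of people."
pairs = [
    QuestionAnswer(
        question="What should you try?",
        answer="Lots of things.",
        source="Try lots of things, meet lots of people.",
    ),
    QuestionAnswer(
        question="What is the best field?",
        answer="Unknown.",
        source="This sentence is not in the text.",
    ),
]

kept = grounded_pairs(pairs, sample_text)
print([qa.question for qa in kept])
```

An exact substring match is deliberately strict. A paraphrased `source` would be dropped, which may or may not be what a given dataset needs.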

### **Building an Agent with Pydantic**