(ch04)=
# Introduction to Pydantic

```{admonition} Starting File: <code>03_manual_valid_molecule.py</code>
:class: important
This chapter will start from the <code>03_manual_valid_molecule.py</code> and end on the <code>04_pydantic_molecule.py</code>.
```

Validating data by hand is hard and time consuming. The last chapter showed just how difficult even simple validation can be. Furthermore, the type hints we've written and the `dataclass` decorator have made the code more legible, but they have not done anything programmatically to validate the data.

In this chapter you'll be introduced to *pydantic*, a powerful third-party library for data validation and settings management which leverages existing Python type hints to handle validation for you. *pydantic* is not the only solution for validating Python data and schema, but it is a natural extension of the type hints and `dataclass` we've already discussed.

```{admonition} Check Out Pydantic
:class: note
We will not be covering all the capabilities of *pydantic* here, and we highly encourage you to visit [the pydantic docs](https://pydantic-docs.helpmanual.io/) to learn about all the powerful and easy-to-execute things *pydantic* can do.
```

```{admonition} Compatibility with Python 3.8 and below
:class: note
If you have Python 3.8 or below, you will need to import container type objects such as `List`, `Tuple`, `Dict`, etc. from the `typing` library instead of using the native types `list`, `tuple`, `dict`, etc. This chapter assumes Python 3.9 or greater; however, both approaches work in Python >=3.9, and the `typing` versions are 1:1 replacements of the same name.
```
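As a small illustration of the note above, the sketch below builds the same type helper both ways using only the standard library. The alias names `lfi_legacy` and `lfi_native` are made up for this example and do not appear in the workshop files.

```python
from typing import List, Union, get_args, get_origin

fi = Union[float, int]

# Python 3.8 and below: generic containers come from `typing`
lfi_legacy = List[fi]

# Python 3.9 and above: the built-in containers can be subscripted directly
lfi_native = list[fi]

# Both spellings describe "a list of float or int"
print(get_origin(lfi_legacy), get_args(lfi_legacy))
print(get_origin(lfi_native), get_args(lfi_native))
```

Either spelling can be used in the annotations that follow; only the import changes.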

## Pydantic's Main Object: BaseModel

*pydantic* works with classes which look and behave similarly to a `dataclass`. However, instead of applying a decorator, the class subclasses a *pydantic* object called `BaseModel`. Let's start with our final `Molecule` object from {ref}`ch03`.

In [1]:
from dataclasses import dataclass
from typing import Union

# Type Helpers
fi = Union[float, int]
lfi = list[fi]
tfi = tuple[fi, ...]
inner = Union[lfi, tfi]
lo = list[inner]
tupo = tuple[inner, ...]


@dataclass
class Molecule:
    name: str
    charge: fi
    symbols: Union[list[str], tuple[str, ...]]
    coordinates: Union[lo, tupo]

    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")

        if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
            raise ValueError(f"'charge' must be a float or int, was {self.charge}")

        try:
            if not (isinstance(self.symbols, list) or isinstance(self.symbols, tuple)):
                raise TypeError
            for content in self.symbols:  # Loop over elements
                if not isinstance(content, str):  # Check content
                    raise ValueError(content, type(content))
        except TypeError as exec:  # Trap not iterable item
            # This will throw if you can't iterate over self.symbols
            raise ValueError(f"'symbols' must be a list or tuple of string, was {type(self.symbols)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec

        try:
            if not (isinstance(self.coordinates, list) or isinstance(self.coordinates, tuple)):
                raise TypeError
            for inner in self.coordinates:  # Loop over elements
                try:
                    if not (isinstance(inner, list) or isinstance(inner, tuple)):
                        raise TypeError
                    for content in inner:  # Loop over elements
                        if not isinstance(content, (int, float)):  # Check content
                            raise ValueError(content, type(content))
                except TypeError as exec:  # Trap not iterable item
                    # This will throw if you can't iterate over the inner element
                    raise ValueError(f"'coordinates' inner elements must be a list or tuple of float/int, was {type(inner)}") from exec
                except ValueError as exec:  # Trap the content error
                    raise ValueError(f"Each inner element of 'coordinates' must be a float or int, was {exec.args[0]} of type {exec.args[1]}") from exec
        except TypeError as exec:  # Trap not iterable item
            # This will throw if you can't iterate over self.coordinates
            raise ValueError(f"'coordinates' must be a list or tuple of int/float, was {type(self.coordinates)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"'coordinates' must be a list or tuple of int/float, however the following error was thrown: {exec}") from exec

    @property
    def num_atoms(self):
        return len(self.symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"

In [6]:
mol_data = {  # Good data
    "coordinates": [[0, 0, 0]], 
    "symbols": ["H", "H", "O"], 
    "charge": 0.0, 
    "name": "water"
}

bad_name = {"name": 789}  # Name is not str
bad_charge = {"charge": [1, 0.0]}  # Charge is not int or float
noniter_symbols = {"symbols": 1234567890}  # Symbols is an int
nonlist_symbols = {"symbols": '["H", "H", "O"]'}  # Symbols is a string (notably is a string-ified list)
tuple_symbols = {"symbols": ("H", "H", "O")}  # Symbols as a tuple?
bad_coords = {"coordinates": ["1", "2", "3"]}  # Coords is a single list of string
bad_symbols_and_cords = {"symbols": ["H", "H", "O"],
                         "coordinates": [[1, 1, 1], [2.0, 2.0, 2.0]]
                        }  # Coordinates top-level list is not the same length as symbols

It took a fair amount of work to get from the original version to this point. However, the class still has several problems: it contains a lot of manually written validation code, it does not actually do anything with the type hints, and it bloats very quickly. Let's fix all of these problems one at a time.

To start, let's convert our `Molecule` from a `dataclass` to a *pydantic* `BaseModel` by importing `BaseModel`, subclassing `BaseModel` into our `Molecule`, and removing the `dataclass` decorator. At this point we don't even need the `dataclasses` import, so let's remove it as well.

In [10]:
from typing import Union

from pydantic import BaseModel

# Type Helpers
fi = Union[float,int]
lfi = list[fi]
tfi = tuple[fi, ...]
inner = Union[lfi, tfi]
lo = list[inner]
tupo = tuple[inner, ...]

class Molecule(BaseModel):
    name: str
    charge: fi
    symbols: Union[list[str], tuple[str, ...]]
    coordinates: Union[lo, tupo]
    
    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")
            
        if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
            raise ValueError(f"'charge' must be a float or int, was {self.charge}")
            
        try:
            if not (isinstance(self.symbols, list) or isinstance(self.symbols, tuple)):
                raise TypeError
            for content in self.symbols:  # Loop over elements
                if not isinstance(content, str):  # Check content
                    raise ValueError(content, type(content))
        except TypeError as exec:  # Trap not iterable item
            # This will throw if you can't iterate over self.symbols
            raise ValueError(f"'symbols' must be a list or tuple of string, was {type(self.symbols)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec
            
        try:
            if not (isinstance(self.coordinates, list) or isinstance(self.coordinates, tuple)):
                raise TypeError
            for inner in self.coordinates:  # Loop over elements
                try:
                    if not (isinstance(inner, list) or isinstance(inner, tuple)):
                        raise TypeError
                    for content in inner:  # Loop over elements
                        if not isinstance(content, (int, float)):  # Check content
                            raise ValueError(content, type(content))
                except TypeError as exec:  # Trap not iterable item
                    # This will throw if you can't iterate over the inner element
                    raise ValueError(f"'coordinates' inner elements must be a list or tuple of float/int, was {type(inner)}") from exec
                except ValueError as exec:  # Trap the content error
                    raise ValueError(f"Each inner element of 'coordinates' must be a float or int, was {exec.args[0]} of type {exec.args[1]}") from exec
        except TypeError as exec:  # Trap not iterable item
            # This will throw if you can't iterate over self.coordinates
            raise ValueError(f"'coordinates' must be a list or tuple of int/float, was {type(self.coordinates)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"'coordinates' must be a list or tuple of int/float, however the following error was thrown: {exec}") from exec
        
    @property
    def num_atoms(self):
        return len(self.symbols)
        
    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"

Even though we have removed the `dataclass` decorator from our code, *pydantic* still structures its inputs in much the same way. `__init__` is handled by the class itself: it assigns class attributes to instance attributes on call (and does many other things, but that's beyond the scope of this workshop). Also as with `dataclass`, you should not implement an `__init__` method on a `BaseModel`.

One main difference between how `BaseModel` and `dataclass` behave on initialization is that `BaseModel` does not accept positional arguments matched 1-to-1 against the listed class attributes; it only accepts keyword arguments. Anticipating this change in behavior is one of the reasons we have been calling our `Molecule` with keyword arguments instead of positional arguments (the other being that it's good practice, for the reasons discussed in {ref}`ch02`).
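A minimal sketch of that difference, assuming *pydantic* is installed. The two `Point` classes are hypothetical stand-ins used only for this comparison; they are not part of the workshop files.

```python
from dataclasses import dataclass

from pydantic import BaseModel


@dataclass
class DataclassPoint:
    x: float
    y: float


class ModelPoint(BaseModel):
    x: float
    y: float


# A dataclass accepts positional arguments in the order the fields are listed
print(DataclassPoint(1.0, 2.0))

# A BaseModel is built from keyword arguments
print(ModelPoint(x=1.0, y=2.0))

# Passing the same values positionally to the BaseModel fails
positional_rejected = False
try:
    ModelPoint(1.0, 2.0)
except TypeError as error:
    positional_rejected = True
    print(f"Positional call failed: {error}")
```

Unpacking a dictionary with `**`, as we do with `mol_data`, always produces keyword arguments, so that calling pattern works for both styles of class.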

```{admonition} Dataclasses can work with Pydantic if you really want to
:class: note
Dataclasses and Pydantic are not mutually exclusive. Pydantic [provides a dataclass decorator](https://pydantic-docs.helpmanual.io/usage/dataclasses/) that almost perfectly mimics the native dataclass, but with all the extra validation pydantic provides.
```
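For the curious, here is a short sketch of that decorator, assuming *pydantic* is installed. The `Atom` class and its fields are hypothetical and exist only to show the idea; the rest of this chapter uses `BaseModel` instead.

```python
from pydantic import ValidationError
from pydantic.dataclasses import dataclass  # drop-in for dataclasses.dataclass


@dataclass
class Atom:  # hypothetical example class
    symbol: str
    atomic_number: int


valid_atom = Atom(symbol="O", atomic_number=8)
print(valid_atom)

# Unlike the native dataclass, bad data is rejected at construction time
rejected = False
try:
    Atom(symbol="O", atomic_number="eight")
except ValidationError as error:
    rejected = True
    print(error)
```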

Let's see what happens if we try to call this class with no other modifications.

In [11]:
water = Molecule(**mol_data)

Huzzah! It worked! But why? We have removed the `dataclass` decorator, but none of *our* validation code ran. `BaseModel` does not have a single specialized function to handle validation (we'll cover custom validation later), so the `__post_init__` function does not run; that was a special method of the `dataclass`. In fact, let's just delete the entire `__post_init__` method as we won't be needing it anymore. Let's also delete the `__str__` method as `BaseModel` provides its own.
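To see that `__post_init__` is a hook belonging to `dataclass` rather than to Python classes in general, here is a standard-library-only sketch. Both class names are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class WithDecorator:
    value: int
    checked: bool = False

    def __post_init__(self):
        # Called automatically by the __init__ that @dataclass generates
        self.checked = True


class WithoutDecorator:
    def __init__(self, value):
        self.value = value
        self.checked = False

    def __post_init__(self):
        # Never called: nothing in a plain class looks for this name
        self.checked = True


print(WithDecorator(value=1).checked)     # True
print(WithoutDecorator(value=1).checked)  # False
```

Since the hook is never called on a `BaseModel`, removing it from our `Molecule` changes nothing about how the model behaves.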

In [12]:
from typing import Union

from pydantic import BaseModel

# Type Helpers
fi = Union[float,int]
lfi = list[fi]
tfi = tuple[fi, ...]
inner = Union[lfi, tfi]
lo = list[inner]
tupo = tuple[inner, ...]

class Molecule(BaseModel):
    name: str
    charge: fi
    symbols: Union[list[str], tuple[str, ...]]
    coordinates: Union[lo, tupo]
    
        
    @property
    def num_atoms(self):
        return len(self.symbols)

That looks simpler, so let's run our model through and actually take a look at the output from `print`.

In [13]:
water = Molecule(**mol_data)
print(water)

name='water' charge=0.0 symbols=['H', 'H', 'O'] coordinates=[[0.0, 0.0, 0.0]]


You can see that *pydantic* provides its own complete representation of the data structure, including all its attributes. The model also allows accessing attributes just as you would on any other class.

In [17]:
print(water.name)
print(water.coordinates)

water
[[0.0, 0.0, 0.0]]


*pydantic* also provides a few built-in methods for quick exporting of models to other data structures like dictionaries and JSON strings.

In [18]:
water.dict()

{'name': 'water',
 'charge': 0.0,
 'symbols': ['H', 'H', 'O'],
 'coordinates': [[0.0, 0.0, 0.0]]}

In [19]:
water.json()

'{"name": "water", "charge": 0.0, "symbols": ["H", "H", "O"], "coordinates": [[0.0, 0.0, 0.0]]}'

Here is where *pydantic* helps us. Because this is a `BaseModel`, our type hints are no longer *hints*, they are *mandates*. Let's show that by feeding in invalid data.

In [20]:
mangle = {**mol_data, **bad_name, **bad_charge, **bad_coords}
water = Molecule(**mangle)

ValidationError: 14 validation errors for Molecule
charge
  value is not a valid float (type=type_error.float)
charge
  value is not a valid integer (type=type_error.integer)
coordinates -> 0
  value is not a valid list (type=type_error.list)
coordinates -> 0
  value is not a valid tuple (type=type_error.tuple)
coordinates -> 1
  value is not a valid list (type=type_error.list)
coordinates -> 1
  value is not a valid tuple (type=type_error.tuple)
coordinates -> 2
  value is not a valid list (type=type_error.list)
coordinates -> 2
  value is not a valid tuple (type=type_error.tuple)
coordinates -> 0
  value is not a valid list (type=type_error.list)
coordinates -> 0
  value is not a valid tuple (type=type_error.tuple)
coordinates -> 1
  value is not a valid list (type=type_error.list)
coordinates -> 1
  value is not a valid tuple (type=type_error.tuple)
coordinates -> 2
  value is not a valid list (type=type_error.list)
coordinates -> 2
  value is not a valid tuple (type=type_error.tuple)

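A new type of error has been thrown. The `ValidationError` is a custom error that *pydantic* throws when you try to insert data which does not adhere to the type annotations of the model. Note that `name` is missing from the list even though we passed the integer `789`; the discussion of data coercion later in this chapter explains why.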

```{admonition} Type Hints No More!
:class: note
We will call *pydantic*'s use of types "type annotations" because, although they are still technically "type hints," they are no longer just hints.
```

*pydantic* reads the type annotations assigned to the attributes and then validates the incoming arguments against those types. Because we were thorough with our type hints last chapter, our type annotations correctly describe the data we expect.

We also now have simultaneous validation of multiple entries. In {ref}`ch03`, our validation code would throw the first error it found without validating anything else. Here, *pydantic* validates everything at once and raises all of the errors together at the end.

Reading the `ValidationError` output takes some getting used to, but once you recognize the pattern it is easy to understand.
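The "everything at once" behavior can also be inspected from code. The sketch below assumes *pydantic* is installed and uses a hypothetical `Reading` model with two fields that both receive bad data; `ValidationError.errors()` returns one dictionary per failed check.

```python
from pydantic import BaseModel, ValidationError


class Reading(BaseModel):  # hypothetical model for illustration
    temperature: float
    count: int


problems = []
try:
    # Both values are wrong, and both are reported in a single exception
    Reading(temperature="warm", count="several")
except ValidationError as error:
    problems = error.errors()

for problem in problems:
    # "loc" is the path to the offending value, "msg" the human-readable reason
    print(problem["loc"], problem["msg"])
```

The exact wording of each message depends on the *pydantic* version, so it is safer to rely on the `loc` entries than on the message text.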

### Reading the Validation Error

```python
charge
  value is not a valid float (type=type_error.float)
charge
  value is not a valid integer (type=type_error.integer)
```

Here `charge` is the attribute, and its `value` is not a valid `float`. In the next entry, `charge` is reported again because the value is not a valid `int` either. *pydantic* treats a `Union` as an either-or: it tries each type separately in the order listed and accepts the first one that succeeds, so when every option fails you see one error per option.

```python
coordinates -> 0
  value is not a valid list (type=type_error.list)
coordinates -> 1
  value is not a valid list (type=type_error.list)
```

`coordinates` is the attribute that did not receive valid data. `-> 0` indicates that at index `0` of `coordinates`, the validator was expecting a `list` but did not get one. `-> 1` on the next entry specifies that index `1` of `coordinates` was also not a `list`.
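The same path that is printed with `->` is available as a tuple in each error's `loc` entry. Here is a hedged sketch, assuming *pydantic* is installed, with a hypothetical `Geometry` model that has only a `coordinates` field.

```python
from pydantic import BaseModel, ValidationError


class Geometry(BaseModel):  # hypothetical model for illustration
    coordinates: list[list[float]]


locations = []
try:
    # Index 1 is not a list at all; index 2 is a list whose last item is not a number
    Geometry(coordinates=[[0, 0, 0], "not a list", [1, 1, "x"]])
except ValidationError as error:
    locations = [problem["loc"] for problem in error.errors()]

# Each tuple reads left to right: attribute name, then one index per nesting level
print(locations)
```

A location such as `("coordinates", 2, 2)` therefore means "the item at index `2` inside the item at index `2` of `coordinates`".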

## Data Coercion

*pydantic* has already helped us simplify our code by providing type checking, but let's simplify further by reducing our type annotation complexity and seeing what *pydantic* does to some invalid types. 

For starters, let's assume `charge` can only be a float. Right now `charge` accepts `float` or `int`, so we might expect *pydantic* to give a different output depending on which type we give it.

In [33]:
int_charge = {**mol_data, **{"charge": 0}}
float_charge = {**mol_data, **{"charge": -1.5}}  # Value that can't be int.
print(type(int_charge["charge"]))
print(type(float_charge["charge"]))

<class 'int'>
<class 'float'>


In [31]:
int_water = Molecule(**int_charge)
float_water = Molecule(**float_charge)
print(f"Integer water has value {int_water.charge} and type {type(int_water.charge)}")
print(f"Float water has value {float_water.charge} and type {type(float_water.charge)}")

Integer water has value 0.0 and type <class 'float'>
Float water has value -1.5 and type <class 'float'>


Uh-oh. What happened? We expected to get an `int` out of the `int_water` object, but instead got a `float`. The reason is that *pydantic* does what is called "data coercion" based on the type annotations you provide.

Data coercion is the process of molding and shaping data to adhere to certain rules. In this case, the `int` was coerced by casting it to a `float` before being stored. The question you may ask is "why did *pydantic* do that if it can accept both `float` or `int`?" And the answer has to do with the *order* we provided the type annotations. Take a look at our first type helpers.

In [27]:
# Type Helpers
fi = Union[float,int]

We specified that `float` was first and `int` was second. From a pure set theory standpoint, order should not matter, but *pydantic* does respect the order of its types. Down at the code level, *pydantic* is doing a few things. Here is a simplified list:

1. Handle pre-validators (covered later in {ref}`ch05`)
2. Attempt to coerce data through the first type annotation encountered
3. Accept the coercion if no error is thrown
4. Otherwise, try the next type annotation if present
5. Repeat 2-4 until resolved or an error is thrown with no resolution
6. Handle user validators (covered later in {ref}`ch05`)
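Steps 2 through 5 can be sketched in plain Python. This is a deliberately simplified stand-in for the idea of "try each type in order"; it is not *pydantic*'s actual implementation, and the function name `coerce_first_match` is made up for this example.

```python
def coerce_first_match(value, candidate_types):
    """Return `value` cast to the first type in `candidate_types` that accepts it."""
    for candidate in candidate_types:
        try:
            return candidate(value)  # accept the first cast that does not raise
        except (TypeError, ValueError):
            continue  # this type failed, so try the next annotation
    raise ValueError(f"{value!r} does not match any of {candidate_types}")


# float listed first: the int is cast to a float
print(repr(coerce_first_match(0, (float, int))))     # 0.0

# int listed first: the float is truncated to an int
print(repr(coerce_first_match(-1.5, (int, float))))  # -1

# Nothing matches a list, so an error is raised once every type has failed
try:
    coerce_first_match([1, 0.0], (float, int))
except ValueError as error:
    print(error)
```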

Because we specified `float` first, the `int` was cast to a `float` and accepted. We could reverse the order and try again to see what happens:

In [28]:
class IntThenFloatMolecule(Molecule):  # Subclass our defined model to inherit attributes
    charge: Union[int, float]

In [34]:
int_water = IntThenFloatMolecule(**int_charge)
float_water = IntThenFloatMolecule(**float_charge)
print(f"Integer water has value {int_water.charge} and type {type(int_water.charge)}")
print(f"Float water has value {float_water.charge} and type {type(float_water.charge)}")

Integer water has value 0 and type <class 'int'>
Float water has value -1 and type <class 'int'>


We have now lost information through the coercion. However, *pydantic* did exactly what it was told. The lesson here is that it is better to be permissive where possible: for a field which can accept both, listing `float` first ensures we don't lose data.

```{admonition} Don't like something? Config it
:class: note
Pydantic's BaseModels are highly configurable through a class you can create in any model called Config. Some configurations will be shown later, but you can always [check the pydantic Config docs](https://pydantic-docs.helpmanual.io/usage/model_config/) for more things you can do. Changing this specific <code>Union</code> behavior is a setting called "Smart Union".
```

In [35]:
from typing import Union

from pydantic import BaseModel

# Type Helpers
fi = Union[float,int]
lfi = list[fi]
tfi = tuple[fi, ...]
inner = Union[lfi, tfi]
lo = list[inner]
tupo = tuple[inner, ...]

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: Union[lo, tupo]
    
        
    @property
    def num_atoms(self):
        return len(self.symbols)

We've simplified some of our type hints now. One of the other changes we've made is setting `symbols` to a list of strings instead of accepting either a `list` or a `tuple`; *pydantic* will coerce a `tuple` into a `list` for us.
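A quick check of that simplification, assuming *pydantic* is installed. The `SymbolsOnly` class is a hypothetical cut-down model that keeps only the `symbols` field.

```python
from pydantic import BaseModel


class SymbolsOnly(BaseModel):  # hypothetical cut-down model
    symbols: list[str]


# A tuple goes in, and a list is what gets stored
from_tuple = SymbolsOnly(symbols=("H", "H", "O"))
print(from_tuple.symbols, type(from_tuple.symbols))
```

Because the stored value is always a `list`, code that uses the model later never has to handle both container types.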

<div class="exercise">
<p class="exercise-title">Practice simplifying type annotations with coercion</p>
    <p>What would the simplified type annotation of <code>coordinates</code> be?</p>

````{admonition} Solution:
:class: dropdown
Either one of the following
```python
coordinates: list[list[float]]
coordinates: list[list[Union[float, int]]]
```
````
    
````{admonition} Incorrect Answer:
:class: dropdown
This option is wrong because it will cast floats to integers by default, which is bad unless you are on a discrete grid of coordinates.
```python
coordinates: list[list[Union[int, float]]]
```
````

</div>

## Preparing for Custom Validation

In [2]:
from pydantic import BaseModel


class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]

    @property
    def num_atoms(self):
        return len(self.symbols)

Above is our final code for this chapter, which is the code in `04_pydantic_molecule.py`. We've converted our original code to a type-validated model which is easy to read. We did this by leveraging the power of the *pydantic* module, building on our understanding of type hints and the `dataclasses` structure native to Python.

Next chapter we'll cover doing so much more with *pydantic* (and yet still so little of what it can do), focusing on writing validators that go beyond simple type checks.
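As a closing check, the sketch below repeats the final model and feeds it integers where floats are annotated. It assumes *pydantic* is installed; the three-atom geometry is made-up illustrative data, not a real water structure.

```python
from pydantic import BaseModel


class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]

    @property
    def num_atoms(self):
        return len(self.symbols)


water = Molecule(
    name="water",
    charge=0,  # int in, float stored
    symbols=["O", "H", "H"],
    coordinates=[[0, 0, 0], [0, 0, 1], [0, 1, 0]],  # illustrative values only
)

print(type(water.charge))    # <class 'float'>
print(water.coordinates[0])  # [0.0, 0.0, 0.0]
print(water.num_atoms)       # 3
```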