Validating Data Beyond Types
========================

```{important} Starting File: CHAPTER_4_WAFFLES
This chapter will start from the CHAPTER_4_WAFFLES and end on the CHAPTER_5_WAFFLES.
```

Data validation goes far beyond just type. *Pydantic* has provided the basic tools for doing data validation on data types, but it also provides the tools for writing custom validators to check so much more.

We'll be covering the *pydantic* `validator` decorator and applying that to our data to check structure and scientific rigor. We'll also cover how to validate types not native to Python, such as NumPy arrays.

```{admonition} Check Out Pydantic!
:class: note
We will not be covering all the capabilities of *pydantic* here, and we highly encourage you to visit [the pydantic docs](https://pydantic-docs.helpmanual.io/) to learn about all the powerful and easy-to-execute things *pydantic* can do.
```



```{admonition} Compatibility with Python 3.8 and below
:class: note
If you have Python 3.8 or below, you will need to import container type objects such as `List`, `Tuple`, `Dict`, etc. from the `typing` library instead of using their native types of `list`, `tuple`, `dict`, etc. This chapter will assume Python 3.9 or greater; however, both approaches work in Python 3.9 and later, and the `typing` versions are 1:1 replacements of the same name.
```

## Pydantic's Validator Decorator

Let's start by looking at the state of our code prior to extending the validators. As usual, let's also define our test data.

In [1]:
from pydantic import BaseModel

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]

    @property
    def num_atoms(self):
        return len(self.symbols)

In [9]:
mol_data = {  # Good data
    "coordinates": [[0, 0, 0], [1, 1, 1], [2, 2, 2]], 
    "symbols": ["H", "H", "O"], 
    "charge": 0.0, 
    "name": "water"
}

bad_name = {"name": 789}  # Name is not str
bad_charge = {"charge": [1, 0.0]}  # Charge is not int or float
noniter_symbols = {"symbols": 1234567890}  # Symbols is an int
nonlist_symbols = {"symbols": '["H", "H", "O"]'}  # Symbols is a string (notably is a string-ified list)
tuple_symbols = {"symbols": ("H", "H", "O")}  # Symbols as a tuple?
bad_coords = {"coordinates": ["1", "2", "3"]}  # Coords is a single list of string
inner_coords_not3d = {"coordinates": [[1, 2, 3], [4, 5]]}
bad_symbols_and_cords = {"symbols": ["H", "H", "O"],
                         "coordinates": [[1, 1, 1], [2.0, 2.0, 2.0]]
                        }  # Coordinates top-level list is not the same length as symbols

You may notice we have extended our "Good Data" here to have `coordinates` actually define the `Nx3` structure where `N = len(symbols)`. This is important for what we plan to validate.

*pydantic* allows you to write custom validators in addition to the type validators that run automatically for a type annotation. The `validator` decorator is imported from the `pydantic` module just like `BaseModel`, and is used to decorate a *class* function you write. Let's look at the most basic `validator` we can write and assign it to `coordinates`.

In [4]:
from pydantic import BaseModel, validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]
        
    @validator("coordinates")
    def ensure_coordinates_is_3D(cls, coords):
        return coords

    @property
    def num_atoms(self):
        return len(self.symbols)

Here we have defined an additional validator which does nothing, but has the basic structure we can look at. For convenience and reference, I've broken the aspects of `validator` into a list.

* The `validator` decorator takes as arguments the *exact* name of the attributes you are validating against as a string. In this case `coordinates`. You could provide multiple string args of each attribute you want to run through the validator if you want to reuse it.
* The function name can be whatever you want it to be. We've called it `ensure_coordinates_is_3D` to be meaningful if anyone ever wants to come back and see what this should be doing.
* The function itself is a *class function*. Similar to what happens when you use the `@classmethod` decorator from native Python, this validator is intended to be called on the non-instanced class. The formal nomenclature for the first variable here is therefore `cls` and not `self`. Your IDE may complain about this, but it should be `cls`.
* The first argument after `cls` can have whatever name you want EXCLUDING the following list: `values`, `config`, and `field` (reasons discussed later in the chapter).
* The return MUST be the validated data to be fed into the attribute. We've done nothing to our variable `coords`, so we simply return it. If you forget the `return` statement, the function returns `None`, and that `None` will be accepted as the validated value.
* `validator` runs *after* type validation, unless specified (see later in this chapter)
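
To make two of these rules concrete, here is a minimal sketch, assuming the *pydantic* 1.x API used throughout this chapter. The `Atom` model and its fields are hypothetical and exist only for illustration; they are not part of our `Molecule`. It shows one validator reused across two fields, and what happens when a validator forgets its `return`.

```python
from pydantic import BaseModel, validator


class Atom(BaseModel):  # hypothetical model, for illustration only
    symbol: str
    label: str
    note: str

    # One validator reused for two fields by listing both names.
    @validator("symbol", "label")
    def strip_and_require_text(cls, text):
        text = text.strip()
        if not text:
            raise ValueError("must not be empty or whitespace")
        return text  # the returned value is what gets stored

    # A validator that forgets to return: the field silently becomes None.
    @validator("note")
    def forgot_to_return(cls, text):
        text.strip()


atom = Atom(symbol=" H ", label=" h1 ", note="first hydrogen")
print(atom.symbol, atom.label)  # H h1
print(atom.note)  # None
```

The second validator raises no error at all, which is exactly why a missing `return` is easy to overlook.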

That may seem like a lot of rules, but most of them are boilerplate and intuitive to use. Let's apply these items to our validator. We want to make sure the inner lists of `coordinates` are 3D, or length 3. We don't have to worry about type checking (that was done before any custom `validator` was run), so we can just iterate over the top list and check each inner list. Let's apply that now.

In [28]:
from pydantic import BaseModel, validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]
        
    @validator("coordinates")
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

In [29]:
good_water = Molecule(**mol_data)
mangled = {**mol_data, **inner_coords_not3d}
water = Molecule(**mangled)

ValidationError: 1 validation error for Molecule
coordinates
  Inner coordinates must be 3D, got [4.0, 5.0] of length 2 (type=value_error)

Here we have checked that the good data still works and that the mangled data raises an error. Note that although our function raised a `ValueError`, what came out in the error report was a `ValidationError`. *pydantic* collects a `ValueError`, `TypeError`, or `AssertionError` raised inside a validator and reports it this way; other exception types are not wrapped. We can also see that the error message is the string we supplied and that the `type` of error reflects the exception we raised. This is why it's very important to have meaningful error strings when your custom validator fails.

With all that said, our validator function really does look like any other function we may call to do a quick check of data, plus some special add-ons to make it work with *pydantic*. There is no practical limit to the number of `validator`s you have in a given class, so validate to your heart's content.
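
If you want to work with the error report programmatically rather than read the printed message, the `ValidationError` can be caught and inspected. The following is a small sketch under the same *pydantic* 1.x assumption, using a trimmed-down `Molecule` with only two fields and a plain loop in place of the walrus operator.

```python
from pydantic import BaseModel, ValidationError, validator


class Molecule(BaseModel):  # trimmed-down version for illustration
    symbols: list[str]
    coordinates: list[list[float]]

    @validator("coordinates")
    def ensure_coordinates_is_3D(cls, coords):
        for inner in coords:
            if len(inner) != 3:
                raise ValueError(f"Inner coordinates must be 3D, got {inner} of length {len(inner)}")
        return coords


try:
    Molecule(symbols=["H", "H"], coordinates=[[1, 2, 3], [4, 5]])
except ValidationError as error:
    report = error.errors()  # one dictionary per failed field

for entry in report:
    # Each entry names the field ("loc"), the error kind ("type"), and the message ("msg").
    print(entry["loc"], entry["type"], entry["msg"])
```

Each entry carries the field location, the error type, and the message we wrote, so a calling program can react to specific failures.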

```{admonition} Python Assignment Expressions "The Walrus Operator" <code>:=</code>
:class: note
Since Python 3.8, there is a new operator for "assignment expressions" called "[The Walrus Operator](https://peps.python.org/pep-0572/)" which allows variables to be assigned inside other expressions. We've used it here to trap the value at time of error and save space. Do not feel compelled to use this yourself, especially if it's not clear what is happening.
```
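
If the walrus form is hard to read, it may help to see it next to a plain loop. The two helper functions below are hypothetical names written just for this comparison; both find the first inner list that is not of length 3.

```python
def first_non_3d_walrus(coords):
    """Return the first inner list whose length is not 3, or None."""
    # `any` stops at the first True, so `failure` holds the offending item.
    if any(len(failure := inner) != 3 for inner in coords):
        return failure
    return None


def first_non_3d_loop(coords):
    """Same check written as an explicit loop."""
    for inner in coords:
        if len(inner) != 3:
            return inner
    return None


coords = [[1, 2, 3], [4, 5], [6]]
print(first_non_3d_walrus(coords))  # [4, 5]
print(first_non_3d_loop(coords))  # [4, 5]
```

Either style works inside a `validator`; the loop is longer but makes the captured value explicit.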

<div class="exercise">
<p class="exercise-title"> Check your knowledge: Validator Basics
    <p>How would you validate that <code>symbols</code> entries are at most 2 characters? There is more than one correct solution beyond what we show here.</p>

```{admonition} Possible Solution:
:class: dropdown
```python
@validator("symbols")
def symbols_are_possible_element_length(cls, symbs):
    if not all(1 <= len(failure := symb) <= 2 for symb in symbs):
        raise ValueError(f"Symbols must be 1 or 2 characters, got {failure}")
    return symbs
```
</div>

## Validating against other fields

*pydantic*'s validators can check fields beyond their own. This is helpful for cross-referencing dependent data. In our example, we want to make sure there are exactly as many `coordinates` as there are `symbols` in our `Molecule`. To check against other fields in a `validator`, we extend the arguments to include one called `values`. We are going to keep our initial validator for now to show a feature of `validator`s, but we could combine them (and will) later.

In [42]:
from pydantic import BaseModel, validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]
        
    @validator("coordinates")
    def ensure_coordinates_match_symbols(cls, coords, values):
        n_symbols = len(values["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @validator("coordinates")
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

We've added a second validator to our code called `ensure_coordinates_match_symbols`, and this function will validate against `coordinates`. There are two main things we can see from adding this function:

1. Multiple functions can be declared to validate against the same field.
2. We've added one of the blocked argument names to our new validator: `values`.

The reason the blocked argument names were given in the list of rules for `validator`s is that *pydantic*'s `validator` reserves those names to inject special data. The addition of `values` as an argument tells the `validator` to also retrieve *all previously validated fields for the model*. In our case, that would be `name`, `charge`, and `symbols`, as those entries appeared before `coordinates` in the list of attributes. Any and all validators which would have been applied to those three entries have already run, and what we have access to is their validated records as a dictionary called `values` in the function itself. [See the *pydantic* docs](https://pydantic-docs.helpmanual.io/usage/validators/) for more details about the special arguments in `validator`.

Let's see this in action:

In [43]:
good_water = Molecule(**mol_data)
mangled = {**mol_data, **bad_symbols_and_cords}
water = Molecule(**mangled)

ValidationError: 1 validation error for Molecule
coordinates
  There must be an equal number of XYZ coordinates as there are symbols. There are 2 coordinates and 3 symbols. (type=value_error)
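
One caveat follows from `values` holding only *previously validated* fields: in *pydantic* 1.x, a field that failed its own validation is left out of the dictionary, so `values["symbols"]` would raise a `KeyError` when `symbols` itself is bad. The sketch below is one possible guard, not part of the chapter's final model; it uses a trimmed-down `Molecule` and simply skips the cross-check when `symbols` is unavailable.

```python
from pydantic import BaseModel, ValidationError, validator


class Molecule(BaseModel):  # trimmed-down version for illustration
    symbols: list[str]
    coordinates: list[list[float]]

    @validator("coordinates")
    def ensure_coordinates_match_symbols(cls, coords, values):
        # `values` holds only fields that already passed validation.
        if "symbols" not in values:
            return coords  # symbols failed; its own error is already reported
        n_symbols = len(values["symbols"])
        if len(coords) != n_symbols:
            raise ValueError(f"There are {len(coords)} coordinates and {n_symbols} symbols.")
        return coords


try:
    Molecule(symbols=1234567890, coordinates=[[0, 0, 0]])
except ValidationError as error:
    failed_fields = [entry["loc"] for entry in error.errors()]

print(failed_fields)  # [('symbols',)]
```

With the guard in place, the user sees a single clear error about `symbols` instead of an unrelated `KeyError`.
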

## Non-native Types in Pydantic

Scientific data does not, and often should not, be confined to native Python types. One of the most common data types in Python, especially in the sciences, is the NumPy array (`ndarray` class). The most natural place for this would be `coordinates`, where we want to simplify this list-of-lists construct. Let's see what happens when we just make the type annotation an `ndarray` and see how *pydantic* handles coercion, or how it does not.

In [44]:
import numpy as np
from pydantic import BaseModel, validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
        
    @validator("coordinates")
    def ensure_coordinates_match_symbols(cls, coords, values):
        n_symbols = len(values["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @validator("coordinates")
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

RuntimeError: no validator found for <class 'numpy.ndarray'>, see `arbitrary_types_allowed` in Config

This error was thrown because *pydantic* is coded to handle certain types of data, but it cannot handle a type it was not programmed to understand. However, *pydantic* does provide a useful error message to fix this.

You can configure your *pydantic* models to modify their behavior by adding a class within the `BaseModel` class explicitly called `Config`. This is not an imported object; it's just a class bearing that name. Within that class, you set class attributes that serve as the options.

```{admonition} More Config settings
:class: note
You can see all of the config settings [in the *pydantic* docs](https://pydantic-docs.helpmanual.io/usage/model_config/)
```

Our particular error is saying we need to configure our model and set `arbitrary_types_allowed`, in this case to `True`. This will tell this particular `BaseModel` to permit types that it does not naturally understand how to handle, and assume the user/programmer will handle it. Let's see what `Molecule` looks like with this set. Note: The location of the `class Config` statement does not matter, and `Config` is on a per-model basis, not a global *pydantic* config.

In [52]:
import numpy as np
from pydantic import BaseModel, validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
        
    class Config:
        arbitrary_types_allowed = True
        
    @validator("coordinates")
    def ensure_coordinates_match_symbols(cls, coords, values):
        n_symbols = len(values["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @validator("coordinates")
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

Our model is now configured to allow arbitrary types; no more error. Let's see what happens when we pass in our data.

In [53]:
water = Molecule(**mol_data)

ValidationError: 1 validation error for Molecule
coordinates
  instance of ndarray expected (type=type_error.arbitrary_type; expected_arbitrary_type=ndarray)

We're still getting a validation error, but it's different. *pydantic* is now telling us that the data given to `coordinates` must be of type `ndarray`. Remember there are two default levels of validation in *pydantic*: type validation, then manually written validators. When we have `arbitrary_types_allowed` configured, any type unknown to *pydantic* is not type-checked or coerced beyond confirming that the value is an instance of the declared type. Effectively, it is a glorified `isinstance` check.

So to fix this, either the user has to have already cast the data to the expected type, or the developer has to preempt the type validation somehow.
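
The first of those options can be sketched quickly. Assuming the same *pydantic* 1.x setup and a trimmed-down `Molecule`, data that the user has already cast to an `ndarray` passes the check, while the equivalent list of lists does not.

```python
import numpy as np
from pydantic import BaseModel, ValidationError


class Molecule(BaseModel):  # trimmed-down version for illustration
    symbols: list[str]
    coordinates: np.ndarray

    class Config:
        arbitrary_types_allowed = True


# Already an ndarray: passes the isinstance-style check untouched.
water = Molecule(symbols=["H", "H", "O"], coordinates=np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]]))
print(type(water.coordinates).__name__)  # ndarray

# A plain list of lists is not coerced, so validation fails.
try:
    Molecule(symbols=["H", "H", "O"], coordinates=[[0, 0, 0], [1, 1, 1], [2, 2, 2]])
    list_accepted = True
except ValidationError:
    list_accepted = False
print(list_accepted)  # False
```

Asking every user to cast their own data works, but it pushes the burden outside the model, which is what the next section avoids.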

## Pre-Validators in Pydantic

Good news! You can make *pydantic* validators that run before the type validation, effectively adding a third layer to the validation stack. These are called "pre-validators" and will run before any other level of validator. The primary use case for these validators is data coercion, and that includes casting incoming data to specific types, e.g., casting a list of lists to a NumPy array because we have `arbitrary_types_allowed` set.

A pre-validator is defined exactly like any other `validator`; it just passes the keyword `pre=True` to the decorator. We're going to use a pre-validator to take the `coordinates` data in and cast it to a NumPy array.

In [54]:
import numpy as np
from pydantic import BaseModel, validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
        
    class Config:
        arbitrary_types_allowed = True
    
    @validator("coordinates", pre=True)
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords
        
    @validator("coordinates")
    def ensure_coordinates_match_symbols(cls, coords, values):
        n_symbols = len(values["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @validator("coordinates")
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

Now we can see what happens when we run our model:

In [56]:
water = Molecule(**mol_data)
water.coordinates

array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2]])
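
Notice that the array printed above holds integers: `np.asarray` keeps whatever dtype it infers from the input. If floating-point coordinates are wanted, one option (our own suggestion, not something the chapter's model does) is to pass `dtype=float` when casting, as this small sketch shows.

```python
import numpy as np

raw = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]

inferred = np.asarray(raw)  # dtype inferred from the integer input
as_float = np.asarray(raw, dtype=float)  # dtype requested explicitly

print(np.issubdtype(inferred.dtype, np.integer))  # True
print(as_float.dtype)  # float64
```

The same `dtype=float` argument could be added to the `np.asarray` call inside the pre-validator if a consistent floating-point array is preferred.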

We now have a NumPy array for our `coordinates`, so we can refine the original `validator`s. We'll condense our normal `coordinates` `validator`s down to a single one.

In [59]:
import numpy as np
from pydantic import BaseModel, validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
        
    class Config:
        arbitrary_types_allowed = True
    
    @validator("coordinates", pre=True)
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords
        
    @validator("coordinates")
    def coords_length_of_symbols(cls, coords, values):
        symbols = values["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

In [60]:
water = Molecule(**mol_data)

In [62]:
mangle = {**mol_data, **bad_charge, **bad_coords}
water = Molecule(**mangle)

ValidationError: 2 validation errors for Molecule
charge
  value is not a valid float (type=type_error.float)
coordinates
  Coordinates must be of shape [Number Symbols, 3], was (3,) (type=value_error)

We've now upgraded our `Molecule` with more advanced data validation leaning into scientific validity, added in custom types which increase our model's usability, and configured our model to further expand our capabilities. The code is now at the source materials labeled CHAPTER_5_WAFFLES.

Next chapter we'll look at nesting models to allow more complicated data structures. Below is a supplementary section on how you can define custom, non-native types without `arbitrary_types_allowed`, giving you greater control over defining custom or even shorthand types.

## Supplemental: Defining Custom Types with Built-In Validators