# Introducing Pydantic: Pythonic Data Validation

[Pydantic](https://docs.pydantic.dev/latest/) is a Python data validation library that lets you define data schemas in a way that is both Pythonic and easy to use.


## TO DO IN THIS NOTEBOOK
- [x] Note the problem of data validation and the complexity of the multiple observers pattern needed to handle that even for this relatively simple dashboard. We want to improve our previous dashboard by better handling data validation and more easily handling the generation of the settings UI.
- [x] Introduce classes
- [x] Use of type annotation in dataclasses and data validation in Python (?)
- [x] Introduce type annotations
- [x] Introduce Pydantic
- [x] Show how to define a Pydantic model and use it to 
  - [x] validate data using a Pydantic model
  - [x] express constraints of smoothing window/order in pydantic
- [ ] Mention other features of Pydantic that may be relevant to our dashboard
- [ ] Have them migrate the existing data validation in the dashboard project to pydantic (might require a lot of refactoring of the code)

## What problem are we trying to solve?

The dashboard we have built so far works, but would be hard to include as part of a larger project, and its individual components would be difficult to reuse. For example, the controls to select a range of years and apply smoothing to the data would be useful in making other graphs. As the code stands now, it would be hard to take those pieces as a unit and integrate them into a different dashboard.

Put differently, one might want a reusable `DataSelector` widget that has as its value a range of years, a smoothing window, and a polynomial order for the smoothing.

### Our route to the solution is indirect

In this notebook we will define a class using Pydantic that does not, by itself, have any widgets attached to it. In the next notebook we will use `ipyautoui` to generate a widget-based user interface from our Pydantic class that will be much easier to reuse.

### Goals for our pydantic class:

The class should have these attributes (also called *fields* in the pydantic documentation):

+ `year_range`, the range of time selected.
+ `window_size`, the size of the smoothing window.
+ `polynomial_order`, the order of the polynomial used in smoothing.

These attributes also have some important constraints:

+ The `window_size` should be an integer larger than one and, to match our earlier example, less than or equal to 100.
+ The `polynomial_order` should be an integer less than or equal to 10, and less than the `window_size`.


## Writing classes

We begin with a quick review of writing classes in Python. The smallest class you can write is below. It does not really do anything, but you can create instances of it.

In [None]:
class Basic:
    """This basic class does nothing!"""

In [None]:
basic = Basic()
print(basic)

There is not much going on here, but we do at least get a docstring:

In [None]:
print(basic.__doc__)

Next, we write a class with a single attribute, called `year_range`. This is not the most compact way to write a class with one attribute, but that is deliberate. Understanding the "plainest" way to do this will help motivate some of the shortcuts we see in a little bit.

This class has a method called `__init__` that is called when an instance of the class is created. It is one of many special methods (called "magic" methods) recognized by Python.

The `print` statement in the `__init__` is there to make it easier to see when it is called.

In [None]:
class DataSelectorPlainPython:
    """
    Partial implementation of a class to hold a data selector widget.
    """
    def __init__(self, year_range_input=(1800, 2000)):
        self.year_range = year_range_input
        print(f"In __init__, {year_range_input=} and {self.year_range=}")

Let's make an instance of this class and print it.

In [None]:
selector_plain = DataSelectorPlainPython()
print(f"{selector_plain=}")
print(f"{selector_plain.year_range=}")

It is great that we can see (and could use) the `year_range` but printing the object itself is not that nice. We will return to that later.

### Exercise

1. Try making another instance of `DataSelectorPlainPython` with a `year_range` of 1950 to 2020. You cannot modify the class definition to do this part.

2. Add a `window_size` attribute and a `polynomial_order` attribute to the class in the cell below.

In [None]:
class DataSelectorPlainPython:
    """
    Partial implementation of a class to hold a data selector widget.
    """
    def __init__(self, year_range_input=(1800, 2000)):
        self.year_range = year_range_input
        # Put your code here and in the __init__ call too

3. Make an instance of the class and print each of its attributes.

In [None]:
# your code here

4. Try setting the attributes to nonsense values, e.g. a string, and see what happens.

In [None]:
sel_plain_2 = DataSelectorPlainPython(year_range_input=(1991, 2018))
sel_plain_3 = DataSelectorPlainPython(year_range_input=(1991, 2018))

In [None]:
sel_plain_2 == selector_plain

In [None]:
sel_plain_3 == sel_plain_2

Both comparisons evaluate to `False`, even though `sel_plain_2` and `sel_plain_3` hold the same `year_range`. By default, Python considers two instances of a class equal only if they are the same object.

At this point we have a class which has all the attributes we want, though it has no widgets attached to it, no notion of what constitutes valid values, and is a little verbose to write.

## `dataclass`: a more compact plain Python approach

Writing a class with multiple attributes gets repetitive in Python. Each attribute typically comes with an argument to the class and lines of boilerplate code to set the attributes of the class to those arguments.

Data classes were introduced in Python 3.7 to make that sort of code more compact to write. They leverage *type annotations*, which allow you to provide some information about the type of a variable. Annotations for function arguments were added to the language in version 3.0, and annotations for variables in version 3.6.
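
It is worth keeping in mind that type annotations are hints, not checks: Python records them but does not enforce them when the code runs. The short sketch below illustrates this with a made-up function, `describe_window`, that is not part of the dashboard.

```python
def describe_window(window_size: int = 2) -> str:
    """Return a short description of a smoothing window."""
    return f"window of {window_size} points"


# Python stores the annotations on the function...
print(describe_window.__annotations__)

# ...but does not check them when the function is called, so a
# string is accepted even though the annotation says int.
print(describe_window("three"))
```

This is why annotations alone will not give us data validation; something has to read the annotations and act on them.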

The class below is an implementation, using data classes, of part of the `DataSelectorPlainPython` we wrote above.

In [None]:
from dataclasses import dataclass

@dataclass 
class DataSelectorDataClass:
    """
    Partial implementation of a class to hold a data selector widget using dataclasses.
    """
    year_range: tuple = (1800, 2000)

This is already much more compact than the initial class definition above. The `dataclass` decorator automatically creates an `__init__` method for this class that sets its attributes. It also comes with some extras:

In [None]:
selector_dc = DataSelectorDataClass()

In [None]:
print(selector_dc)

Compare this to what we get for our plain Python class:

In [None]:
print(selector_plain)

As we will see in a few minutes, in addition to getting a nice string representation of the object for free, we also get the ability to test for equality of two instances.

### Exercise

1. Extend `DataSelectorDataClass` so that it also has a `window_size` and a `polynomial_order`.

2. Recall that testing for equality did not work the way we wanted for the plain Python version of our data selector. Try comparing the two selectors below with each other and with `selector_dc` to see if equality testing works.

In [None]:
sel_dc_2 = DataSelectorDataClass(year_range=(1991, 2018))
sel_dc_3 = DataSelectorDataClass(year_range=(1991, 2018))

In [None]:
sel_dc_2 == selector_dc

In [None]:
sel_dc_2 == sel_dc_3

### Data class summary

Using `dataclass` to define our selector has several advantages:

+ It is less code.
+ It has a human-readable string representation.
+ You can check whether instances of the class are equal.

We could have done those last two things without `dataclass` by defining a couple of special methods in our class definition. It is really nice to just have it happen automatically behind the scenes, though!
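
For comparison, here is a rough sketch of what writing those two special methods by hand might look like. The class name `DataSelectorWithSpecialMethods` is made up for this illustration, and the code that `dataclass` actually generates is more thorough than this.

```python
class DataSelectorWithSpecialMethods:
    """Plain Python selector with hand-written __repr__ and __eq__."""

    def __init__(self, year_range=(1800, 2000)):
        self.year_range = year_range

    def __repr__(self):
        # Controls what print() and the notebook display show
        return f"{type(self).__name__}(year_range={self.year_range!r})"

    def __eq__(self, other):
        # Two selectors are equal when their attributes are equal
        if not isinstance(other, DataSelectorWithSpecialMethods):
            return NotImplemented
        return self.year_range == other.year_range


sel_a = DataSelectorWithSpecialMethods(year_range=(1991, 2018))
sel_b = DataSelectorWithSpecialMethods(year_range=(1991, 2018))

print(sel_a)
print(sel_a == sel_b)
```

Every new attribute would need to be added to `__init__`, `__repr__` and `__eq__`, which is exactly the repetition that `dataclass` removes.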

*Note:* There is much more to data classes than we have covered. Read more about them in the [Python documentation](https://docs.python.org/3/library/dataclasses.html).

## Progress check

The data class was relatively straightforward to write and looks promising for representing our controls:

```python
@dataclass 
class DataSelectorDataClass:
    """
    Partial implementation of a class to hold a data selector widget using dataclasses.
    """
    year_range: tuple = (1800, 2000)
    window_size: int = 2
    polynomial_order: int = 1
```

There are a couple of issues, though:

+ You can set any of the attributes to whatever value you want. This will raise no errors: `DataSelectorDataClass(year_range=5, window_size="three", polynomial_order=-3.14159)`.
+ None of the constraints we wanted are enforced.

Pydantic will help us solve these issues.

## Using pydantic to define our class

The `pydantic` library solves several of our problems and gets us a few more abilities for free:

+ It is designed to help enforce type requirements. By default it does its best to convert values to the required type for you, but it can be configured not to if you prefer.
+ Simple constraints like "this number must be greater than or equal to two" are easy to express.
+ More complicated constraints like "this number must be smaller than this other one" are possible to express.
+ It is straightforward to save objects to disk.
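
As a small preview of the first point, the sketch below contrasts pydantic's default behavior with its strict mode. It assumes pydantic version 2; the class names `LaxExample` and `StrictExample` are made up for this illustration, and `BaseModel` itself is explained in the next section.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LaxExample(BaseModel):
    # Default ("lax") mode: pydantic converts compatible values
    window_size: int = 2


class StrictExample(BaseModel):
    # Strict mode: values must already have the declared type
    model_config = ConfigDict(strict=True)

    window_size: int = 2


# The string "5" is converted to the integer 5
lax = LaxExample(window_size="5")
print(type(lax.window_size), lax.window_size)

# In strict mode the same input is rejected
try:
    StrictExample(window_size="5")
    strict_rejected = False
except ValidationError:
    strict_rejected = True

print(f"{strict_rejected=}")
```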

In [None]:
#| default_exp widgets_pydantic

### Making a class using pydantic

One way to use pydantic to make a class is to import a class called `BaseModel` from it and subclass that. It ends up looking a lot like a data class:

In [None]:
from pydantic import BaseModel

class DataSelectorModelDraft1(BaseModel):
    year_range: tuple = (1800, 2000)
    window_size: int = 2
    polynomial_order: int = 1

As in a data class, attributes are defined by adding a type annotation. Unlike data classes, pydantic enforces types. Try running the cell below, which will raise an exception:

In [None]:
selector_pyd = DataSelectorModelDraft1(year_range=5, window_size="three", polynomial_order=-3.14159)
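
The exception pydantic raises is a `ValidationError`, and it reports every field that failed rather than stopping at the first one. The sketch below, which assumes pydantic version 2 and repeats the model definition so that it runs on its own, catches the exception and lists the problems.

```python
from pydantic import BaseModel, ValidationError


class DataSelectorModelDraft1(BaseModel):
    year_range: tuple = (1800, 2000)
    window_size: int = 2
    polynomial_order: int = 1


problems = []
try:
    DataSelectorModelDraft1(
        year_range=5, window_size="three", polynomial_order=-3.14159
    )
except ValidationError as exc:
    # errors() returns one dictionary per failed field
    problems = exc.errors()

# "loc" names the field and "msg" describes what went wrong
for problem in problems:
    print(problem["loc"], problem["msg"])
```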

### Exercise

1. Make a valid instance of `DataSelectorModelDraft1`, i.e. an instance that does not raise an error when you create it. Feel free to try to come up with an instance that might surprise other people.

There are a few ways of making an instance of a pydantic model:

1. Provide arguments when you call the class; this is what we did above.
2. From a dictionary of values, using the class method `model_validate`.
3. From JSON, using the class method `model_validate_json`.

We will come back to the third way later in the notebook. An example of the second way is below.

In [None]:
my_choices = {
    "year_range": (1900, 1950),
    "window_size": 10,
    "polynomial_order": 2,
}

DataSelectorModelDraft1.model_validate(my_choices)

### Imposing constraints after object creation

By default, pydantic imposes its constraints only when you create the object. Consider this example:

In [None]:
selector_pyd_simple = DataSelectorModelDraft1()
selector_pyd_simple.window_size = "two"
selector_pyd_simple

Note that `window_size` has been set to a string, not an integer.

However, pydantic can be configured to check types when values are assigned by using the `validate_assignment` configuration. There are many more options available in [pydantic configuration](https://docs.pydantic.dev/latest/concepts/config/).

In [None]:
class DataSelectorModelDraft2(BaseModel, validate_assignment=True):
    year_range: tuple = (1800, 2000)
    window_size: int = 2
    polynomial_order: int = 1

Now an exception is raised when we try to assign a string to `window_size`.

In [None]:
selector_pyd_simple2 = DataSelectorModelDraft2()
selector_pyd_simple2.window_size = "two"
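
Configuration can also be written inside the class body by setting `model_config`, which can be easier to read when there are several options. The sketch below assumes pydantic version 2; the class name `DataSelectorConfigDict` is made up for this illustration.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class DataSelectorConfigDict(BaseModel):
    # Same effect as passing validate_assignment=True in the class statement
    model_config = ConfigDict(validate_assignment=True)

    year_range: tuple = (1800, 2000)
    window_size: int = 2
    polynomial_order: int = 1


selector = DataSelectorConfigDict()

try:
    selector.window_size = "two"
    assignment_rejected = False
except ValidationError as exc:
    assignment_rejected = True
    print(exc.errors()[0]["msg"])

# The rejected assignment leaves the previous value in place
print(selector.window_size)
```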

### More specific constraints on types

You might be surprised to see that the line below raises no error.

In [None]:
selector_pyd_simple2.year_range = ("eightteen eighty 5", 8+5j)

The reason is that pydantic simply checks to see that a tuple is being assigned to `year_range` -- the contents of the tuple can be anything at all. This would also raise no errors: `selector_pyd_simple2.year_range = (1, 2, 3)`

Python type annotations let you specify what the tuple should contain by putting the types of its contents in square brackets, as shown in the cell below.

In [None]:
class DataSelectorModelDraft3(BaseModel, validate_assignment=True):
    year_range: tuple[int, int] = (1800, 2000)
    window_size: int = 2
    polynomial_order: int = 1

With this change, trying to assign `("eightteen eighty 5", 8+5j)` to `year_range` will fail.

In [None]:
selector_pyd_simple3 = DataSelectorModelDraft3()
selector_pyd_simple3.year_range = ("eightteen eighty 5", 8+5j)
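
The annotation `tuple[int, int]` also fixes the length of the tuple at exactly two. If you ever need a tuple of integers of any length, the annotation is `tuple[int, ...]`. The sketch below compares the two; it assumes pydantic version 2, and the class names and the `is_valid` helper are made up for this illustration.

```python
from pydantic import BaseModel, ValidationError


class PairOfYears(BaseModel):
    # Exactly two integers
    year_range: tuple[int, int] = (1800, 2000)


class AnyNumberOfYears(BaseModel):
    # Any number of integers, including none at all
    years: tuple[int, ...] = ()


def is_valid(model_class, **values):
    """Return True if model_class accepts the values, False otherwise."""
    try:
        model_class(**values)
    except ValidationError:
        return False
    return True


pair_ok = is_valid(PairOfYears, year_range=(1900, 1950))
triple_as_pair_ok = is_valid(PairOfYears, year_range=(1, 2, 3))
triple_ok = is_valid(AnyNumberOfYears, years=(1, 2, 3))

print(f"{pair_ok=}, {triple_as_pair_ok=}, {triple_ok=}")
```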

### Imposing constraints on field values

*Attributes in a pydantic model are typically called fields, terminology we will use for the remainder of the notebook.*

We have made some progress but we still have not imposed the constraints we want on window size and polynomial order. There are a couple of new concepts we need to do that:

+ `Annotated` from Python's typing system lets you add additional information about the type of an item. Here we will use it to add information about the constraint on a field.
+ The `Field` function from pydantic is what you use to supply that extra information. There are a number of possible arguments to `Field`. Here we use `ge`, short for "greater than or equal to," to impose the constraint that the `window_size` be greater than or equal to 2. `Field` from `pydantic` is somewhat similar to the [`field` function from Python's data classes](https://docs.python.org/3/library/dataclasses.html#dataclasses.field), which also serves the purpose of adding information about a typed field.

In the cell below we define a pydantic model that imposes the constraint that the `window_size` must be greater than or equal to 2. It does that by annotating the `window_size` type, `int`, with `Field(ge=2)`.

In [None]:
from typing import Annotated
from pydantic import Field

class DataSelectorModelDraft4(BaseModel, validate_assignment=True):
    year_range: tuple[int, int] = (1800, 2000)
    window_size: Annotated[int, Field(ge=2)] = 2
    polynomial_order: int = 1

Let's test this by creating an instance and setting the `window_size` to an integer value that should not be allowed.

In [None]:
selector_pyd_simple4 = DataSelectorModelDraft4()

selector_pyd_simple4.window_size = 0

Recall that the `window_size` also had an upper limit of 100 in the earlier dashboard we are trying to reproduce. This version of the class adds that upper limit also.

In [None]:
class DataSelectorModelDraft5(BaseModel, validate_assignment=True):
    year_range: tuple[int, int] = (1800, 2000)
    window_size: Annotated[int, Field(ge=2, le=100)] = 2
    polynomial_order: int = 1
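
Because `Annotated[int, Field(ge=2, le=100)]` is an ordinary Python object, it can be given a name and reused wherever the same constraint is needed. The sketch below assumes pydantic version 2; the names `WindowSize` and `SelectorWithAlias` are made up for this illustration.

```python
from typing import Annotated

from pydantic import BaseModel, Field, ValidationError

# Hypothetical alias: an integer between 2 and 100, inclusive
WindowSize = Annotated[int, Field(ge=2, le=100)]


class SelectorWithAlias(BaseModel, validate_assignment=True):
    window_size: WindowSize = 2


# The upper limit itself is allowed...
selector = SelectorWithAlias(window_size=100)

# ...but one more than the limit is not
try:
    selector.window_size = 101
    too_large_rejected = False
except ValidationError as exc:
    too_large_rejected = True
    print(exc.errors()[0]["msg"])
```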

### Exercise

Add a constraint on the `polynomial_order` that requires it to be greater than or equal to 1 and less than or equal to 10.

In [None]:
#TODO write answer 

With these changes we have some of the constraints we want. 

There is one more thing we need to do: the polynomial order should be less than the window size in addition to being 10 or smaller.

To do that we will add a *model validator* to our pydantic class. The model validator has access to all of the proposed model values and can check them in whatever way it wants. If the values are acceptable then the method returns `self`. If the values are not acceptable then the validator should raise a `ValueError`.

A draft class with the model validator is below.

In [None]:
#| export

from typing import Annotated
from pydantic import model_validator, BaseModel, Field

In [None]:
#| export

class DataSelectorModelDraft6(BaseModel, validate_assignment=True):
    year_range: tuple[int, int] = (1800, 2000)
    window_size: Annotated[int, Field(ge=2, le=100)] = 2
    polynomial_order: Annotated[int, Field(ge=1, le=10)] = 1

    # mode="after" means the validator runs after pydantic has checked that the individual
    # fields have values that are valid.
    @model_validator(mode="after")
    def limit_polynomial_order(self):
        
        if self.polynomial_order > self.window_size - 1:
            # Handle a bad polynomial order or window size
            raise ValueError("Polynomial order must be smaller than window size")
            
        # If we got this far the polynomial order is consistent with the window size
        # so return self. Failing to return self will end up causing an error.
        return self

### Exercise

1. Check whether the new validation works by creating a valid `DataSelectorModelDraft6` and then trying to set the `polynomial_order` or `window_size` to inconsistent values. 

In [None]:
DataSelectorModelDraft6()
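
The model validator also runs when an instance is first created, not only on assignment. The sketch below assumes pydantic version 2 and uses a trimmed-down copy of the model, called `SelectorSketch` here, so that it runs on its own. Each value passes its own field constraint, but the combination is rejected.

```python
from typing import Annotated

from pydantic import BaseModel, Field, ValidationError, model_validator


class SelectorSketch(BaseModel):
    window_size: Annotated[int, Field(ge=2, le=100)] = 2
    polynomial_order: Annotated[int, Field(ge=1, le=10)] = 1

    @model_validator(mode="after")
    def limit_polynomial_order(self):
        if self.polynomial_order > self.window_size - 1:
            raise ValueError("Polynomial order must be smaller than window size")
        return self


# 3 is a valid window size and 5 is a valid polynomial order,
# but 5 is not smaller than 3
error_text = ""
try:
    SelectorSketch(window_size=3, polynomial_order=5)
except ValidationError as exc:
    error_text = str(exc)

# The message from our ValueError is included in the report
print(error_text)
```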

## Another benefit of pydantic: easy to save to a file

One additional benefit of using pydantic to model our control is that pydantic classes come with easy conversion to json, which is in turn easy to save to disk.

We make a model in the cell below.

In [None]:
model = DataSelectorModelDraft6()

We can convert this model to a few different forms:

+ The `model_dump` method converts the pydantic model to a dictionary of values.
+ The `model_dump_json` method converts the pydantic model to json with the model's values.
+ The `model_json_schema` method produces a json schema for the model, which is a description of the model and its restrictions.

In [None]:
print(model.model_dump())

In [None]:
# the indent argument causes the json to have line breaks, with the indentation
# of each new level given by indent
print(model.model_dump_json(indent=2))

In [None]:
model.model_json_schema()
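
The schema is a regular dictionary, so you can look inside it to see how the constraints were recorded. The sketch below assumes pydantic version 2 and uses a one-field model, called `SchemaSketch` here, to keep the output short.

```python
from typing import Annotated

from pydantic import BaseModel, Field


class SchemaSketch(BaseModel):
    window_size: Annotated[int, Field(ge=2, le=100)] = 2


schema = SchemaSketch.model_json_schema()
window_schema = schema["properties"]["window_size"]

# ge and le show up as the JSON Schema keywords "minimum" and "maximum"
print(window_schema)
```

Tools that understand JSON Schema can use this information to decide, for example, what range a slider should cover.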

You might not be surprised to learn that you can also create a model instance from json. In the cell below, we take the json from `model` and use it to create a new instance.

To do that you use the class method `model_validate_json` to make the model.

In [None]:
model_json = model.model_dump_json()

# Use the class method model_validate_json to make a new model
new_model = DataSelectorModelDraft6.model_validate_json(model_json)

new_model

In [None]:
with open("my_selections.json", "w") as f:
    f.write(model_json)

Though we will not have occasion to use it much in this tutorial, the next cell shows how to load the json from disk and make a model from it.

In [None]:
with open("my_selections.json") as f:
    disk_json = f.read()

DataSelectorModelDraft6.model_validate_json(disk_json)
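
One way to convince yourself that nothing is lost in the round trip is to compare the model read from disk with the original. The sketch below assumes pydantic version 2; it uses a small stand-in model, called `SavedSelection` here, and writes to a temporary directory so that it does not leave files behind.

```python
import tempfile
from pathlib import Path

from pydantic import BaseModel


class SavedSelection(BaseModel):
    year_range: tuple[int, int] = (1800, 2000)
    window_size: int = 2


original = SavedSelection(year_range=(1900, 1950), window_size=10)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "my_selections.json"
    # Save the model as json, then read it back into a new model
    path.write_text(original.model_dump_json())
    restored = SavedSelection.model_validate_json(path.read_text())

# Like data classes, pydantic models compare equal when their fields match
print(restored == original)
```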

2. Use nbdev to export the code we want to reuse from this notebook 

In [None]:
from nbdev.export import nb_export

nb_export('03a_pydantic.ipynb', 'dashboard')