# Data Validation with Pydantic

**Data validation** is a method for checking the accuracy and quality of data. This is a crucial step for any data-driven project. 

Data validation ensures that data is complete (no blank or null values), unique (no duplicated values), and within the range of values we expect. Further checks depend on the field types, the allowed values, and the project.
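
To make these three checks concrete before turning to Pydantic, here is a minimal plain-Python sketch. The function `validate_records`, its parameters, and the sample records are hypothetical names chosen for illustration only; they are not part of any library.

```python
def validate_records(records, required, unique_key, ranges):
    """Return a list of human-readable problems found in `records`."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        # completeness: required fields must not be None or empty
        for field in required:
            if rec.get(field) in (None, ""):
                problems.append(f"row {i}: '{field}' is missing")
        # uniqueness: the key must not repeat across records
        key = rec.get(unique_key)
        if key is not None:
            if key in seen:
                problems.append(f"row {i}: duplicate {unique_key}={key!r}")
            seen.add(key)
        # range: numeric fields must fall inside [low, high]
        for field, (low, high) in ranges.items():
            value = rec.get(field)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {i}: '{field}'={value} outside [{low}, {high}]")
    return problems


records = [
    {"id": 1, "name": "Igor", "age": 32},
    {"id": 1, "name": "Nikulin", "age": 33},   # duplicate id
    {"id": 2, "name": None, "age": 140},       # missing name, age out of range
]
problems = validate_records(
    records, required=["name"], unique_key="id", ranges={"age": (0, 120)}
)
for line in problems:
    print(line)
```

Hand-written checks like these grow quickly, which is the motivation for a declarative library such as Pydantic.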

Having the right tools and techniques can help us improve the reliability of our ML models and APIs.  

Currently, Pydantic is one of the most widely used Python libraries for data validation, with more than 100M monthly downloads.

### Key Features of Pydantic:
- **Powered by type hints**: In Pydantic, serialization and schema validation are controlled by Python's type annotations which are familiar to most people using Python and easily integrated with IDEs and static analysis tools.
- **Speed**: Pydantic's core validation logic is written in Rust. As a result, Pydantic is among the fastest data validation libraries for Python.
- **Async-friendly**: Pydantic validation itself runs synchronously, but it is fast enough to be called directly from asynchronous Python code, such as `async` request handlers.
- **JSON Schema**: A JSON Schema can be generated for any Pydantic model, which allows for self-documenting APIs and integration with a wide variety of tools that support JSON Schema.
- **Strict** and **Lax** modes: Pydantic can run in two different modes:
    - `strict=True` mode does not convert data (e.g. the string `"1"` is not converted to the integer `1`);
    - `strict=False` mode, where Pydantic tries to coerce data to the correct type where appropriate.
- **Support of standard library types**: Pydantic supports validation of many standard library types such as `dataclass` and `TypedDict`.
- **Powerful Customization**: Pydantic allows for custom validators and serializers to alter data processing in many powerful ways.
- **Community and Ecosystem**: Pydantic has an active community and is well-maintained. It also integrates well with other Python libraries and tools.
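
The difference between lax and strict handling can be sketched per field with the `StrictInt` type that Pydantic exports. This is only one way to get strict behaviour (it is not the model-wide `strict=True` switch mentioned above), and the model names `LaxModel` and `StrictModel` are made up for the illustration.

```python
from pydantic import BaseModel, StrictInt, ValidationError


class LaxModel(BaseModel):
    value: int          # lax: compatible input is coerced to int


class StrictModel(BaseModel):
    value: StrictInt    # strict: only real integers are accepted


lax = LaxModel(value="1")
print(repr(lax.value))  # the string "1" has been coerced to an integer

try:
    StrictModel(value="1")
    strict_rejected = False
except ValidationError:
    # the same string is rejected when the field is strict
    strict_rejected = True
print(strict_rejected)
```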

### Supported DataTypes:
Where possible Pydantic uses standard library types to define fields.  
There are also more complex types that can be found in the Pydantic Extra Types.  
If no existing type suits your purpose you can also implement your own Pydantic-compatible types with custom properties and validation.

[Pydantic DataTypes documentation](https://docs.pydantic.dev/latest/usage/types/types/)

### Integration in FastAPI:

The integration of Pydantic with FastAPI enhances the development experience by providing automatic validation, documentation generation, and data serialization while promoting code consistency and reusability. These features help us build robust and well-documented APIs efficiently.

# Pydantic usage examples

The examples below use the Pydantic v1 API names, which match the outputs shown. Pydantic v2 renames several of them: `parse_raw` became `model_validate_json`, `@validator` became `@field_validator`, and `.json()` became `.model_dump_json()`.

### BaseModel usage:

In [1]:
from pydantic import BaseModel

input_data = '''{
                    "user_name": "Igor",
                    "user_age": 32,
                    "is_male": "yes"
                }
            '''

# BaseModel is a basic pydantic class
# We define incoming data via typehinting
class User_Data(BaseModel):
    user_name: str
    user_age: int
    is_male: bool

new_user = User_Data.parse_raw(input_data)
new_user

User_Data(user_name='Igor', user_age=32, is_male=True)

### BaseModel with nested class:

In [2]:
from typing import List
from pydantic import BaseModel

input_data = {
              'country': 'Russia',
              'city': 'Novosibirsk',
              'user': [
                       {
                           'name': 'Igor',
                           'age': 32,
                           'is_male': 'yes'
                        },
                       {
                           'name': 'Nikulin',
                           'age': 33,
                           'is_male': 1                         
                        },
                       {
                           'name': 'Tukanova',
                           'age': 33,
                           'is_male': 0
                        }
                      ]
             }
              
class User(BaseModel):
    name: str
    age: int
    is_male: bool

class User_Data(BaseModel):
    country: str
    city: str
    # nest "User" BaseModel class via List
    user: List[User]

new_user = User_Data(**input_data)
print(new_user)

country='Russia' city='Novosibirsk' user=[User(name='Igor', age=32, is_male=True), User(name='Nikulin', age=33, is_male=True), User(name='Tukanova', age=33, is_male=False)]


### Account for possibly missing fields:
We can import `Optional` from Python's `typing` module if we expect some fields in the input data to be missing.  
- `Optional[str]` means that we expect either a string or `None`. The same goes for other standard types.

In [3]:
from typing import Optional
from pydantic import BaseModel

input_data = '''{
                    "user_age": 32,
                    "is_male": "yes"
                }
             '''

class User_Data(BaseModel):
    # "user_name" may be a string, None, or missing entirely;
    # the explicit None default keeps the field optional in Pydantic v2 as well
    user_name: Optional[str] = None
    user_age: int
    is_male: bool

new_user = User_Data.parse_raw(input_data)
new_user

User_Data(user_name=None, user_age=32, is_male=True)

### Validation while expecting input of different types:
We can specify that a field may receive input of different types with `Union` from the `typing` module.

In [4]:
from typing import Union
from pydantic import BaseModel

input_data = '''{
                    "user_name": "Igor",
                    "user_age": 32,
                    "sex": "Fighter Jet"
                }
            '''

class User_Data(BaseModel):
    user_name: str
    user_age: int
    # we expect either boolean or string value in the "sex" field
    sex: Union[bool, str]

new_user = User_Data.parse_raw(input_data)
new_user

User_Data(user_name='Igor', user_age=32, sex='Fighter Jet')

### Field Aliases:
In Pydantic we can define field aliases with the `Field` function imported from Pydantic.

In [5]:
# import Field from pydantic
from pydantic import BaseModel, Field

input_data = """
                {
                    "user_name": "Igor",
                    "user_age": 32,
                    "AbstractUserAbstractNameSexFabricValue": "male"
                }
             """

class User_Data(BaseModel):
    user_name: str
    user_age: int
    # the alias parameter of Field is used for both validation and serialization
    # (Pydantic v2 also offers validation_alias and serialization_alias
    # to set an alias for only one of the two)
    sex: str = Field(..., alias='AbstractUserAbstractNameSexFabricValue')

new_user = User_Data.parse_raw(input_data)
new_user

User_Data(user_name='Igor', user_age=32, sex='male')
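
As a small follow-up sketch, an alias also affects export: field names are used by default, and aliases only when requested with `by_alias=True`. The example uses the v1-style `.dict()` method to match the rest of this notebook (the v2 name is `model_dump()`); the model is a shortened, illustrative copy of the one above.

```python
from pydantic import BaseModel, Field


class UserData(BaseModel):
    user_name: str
    sex: str = Field(..., alias="AbstractUserAbstractNameSexFabricValue")


# input must use the alias as the key
user = UserData(
    **{"user_name": "Igor", "AbstractUserAbstractNameSexFabricValue": "male"}
)

print(user.sex)                  # attribute access uses the field name
print(user.dict())               # keys are field names
print(user.dict(by_alias=True))  # keys are aliases
```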

### Business logic in validation:
With the `@validator` decorator from Pydantic we can create custom validation functions to include business logic in the validation process.

In [6]:
# import validator decorator from pydantic
from pydantic import BaseModel, validator

input_data = """
                {
                    "user_name": "Igor",
                    "user_age": 32,
                    "sex": "Apache Helicopter"
                }
             """

class User_Data(BaseModel):
    user_name: str
    user_age: int
    sex: str

    # use @validator class decorator to implement desired business logic
    @validator('sex', pre=True)
    def no_helicopters_allowed(cls, value: str) -> str:
        if 'Helicopter' in value:
            raise ValueError('No Helicopters allowed:')
        return value

new_user = User_Data.parse_raw(input_data)
new_user

ValidationError: 1 validation error for User_Data
sex
  No Helicopters allowed: (type=value_error)
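
In application code the error is usually caught instead of being left to stop the program. The sketch below assumes a hypothetical helper `validate_user` that returns either the model or the structured error list from `ValidationError.errors()`; the model is a shortened copy of the one above.

```python
from pydantic import BaseModel, ValidationError, validator


class UserData(BaseModel):
    user_name: str
    sex: str

    @validator("sex")
    def no_helicopters_allowed(cls, value):
        if "Helicopter" in value:
            raise ValueError("No helicopters allowed")
        return value


def validate_user(payload: dict):
    """Return (model, errors); exactly one of the two is None."""
    try:
        return UserData(**payload), None
    except ValidationError as exc:
        # errors() returns a list of dicts with "loc", "msg" and "type" keys
        return None, exc.errors()


user, errors = validate_user({"user_name": "Igor", "sex": "Apache Helicopter"})
for err in errors:
    print(err["loc"], err["msg"])
```

Each entry names the failing field in `loc`, which makes it straightforward to map errors back to form fields or API responses.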

### Custom types:
We can define a custom data type and use it for validation.

In the following example we create a custom data type `ThiccStr`, which raises an exception if the provided string is not in ALL CAPS.

In [7]:
# only BaseModel is needed here; the custom type supplies its own validator
from pydantic import BaseModel

input_data = """
                {
                    "user_name": 42,
                    "user_age": 32,
                    "bio": "thinn string with boring info "
                }
             """

class ThiccStr(str):

    @classmethod
    def validate(cls, v):
        if not isinstance(v, str):
            raise ValueError(f"string expected, got {type(v)}")
        if not v.isupper():
            raise ValueError("string is not THICC enough")

        return v

    # Pydantic needs the __get_validators__ method, from which
    # it gets the sequence of validators to run
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

class User_Data(BaseModel):
    # use typehinting to specify our custom data type
    user_name: ThiccStr
    user_age: int
    # use typehinting to specify our custom data type
    bio: ThiccStr

new_user = User_Data.parse_raw(input_data)
new_user

ValidationError: 2 validation errors for User_Data
user_name
  string expected, got <class 'int'> (type=value_error)
bio
  string is not THICC enough (type=value_error)

### Selective Export:
We can exclude fields from the export by passing a set of field names, e.g. `exclude={'<field_name>'}`, to avoid sharing sensitive data.

In [8]:
from pydantic import BaseModel

input_data = '''{
                    "user_name": "Gregor",
                    "user_age": 51,
                    "sex": "male",
                    "dirty_secret": "i feel as being an Apache Helicopter, but please, keep this a secret"
                }
             '''

class User_Data(BaseModel):
    user_name: str
    user_age: int
    sex: str
    dirty_secret: str

new_user = User_Data.parse_raw(input_data)

# we won't share someone's dirty secrets
new_user.json(exclude={'dirty_secret'})

'{"user_name": "Gregor", "user_age": 51, "sex": "male"}'
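
The same filtering works when exporting to a dictionary, and `include` is the mirror image of `exclude`. This is a minimal sketch with a shortened model; it uses the v1-style `.dict()` method to stay consistent with this notebook (the v2 name is `model_dump()`).

```python
from pydantic import BaseModel


class UserData(BaseModel):
    user_name: str
    user_age: int
    dirty_secret: str


user = UserData(user_name="Gregor", user_age=51, dirty_secret="classified")

# drop the listed fields and keep everything else
public_view = user.dict(exclude={"dirty_secret"})

# keep only the listed fields
name_only = user.dict(include={"user_name"})

print(public_view)
print(name_only)
```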

### Validate 2 fields together:
We can validate 2 fields at the same time via custom validators.  
In the example below we check that the **country** and **city** fields are either both filled or both empty, and raise an error otherwise.

**Validator function params**:
- `value`: The value of the field that the validator is applied to, which is the "city" field in this case.

- `values`: A dictionary that gives access to the other fields of the model being validated. The keys are the field names, and the values are the field values. It contains only the fields that were declared, and successfully validated, before the field being validated.
  - `values.get`: looks up a field in the `values` dictionary and returns `None` if it is absent

In [9]:
from typing import Optional
from pydantic import BaseModel, validator

input_data = '''{
                    "user_name": "Gregor",
                    "user_age": 51,
                    "country": "Best Country"
                }
             '''

class User_Data(BaseModel):
    user_name: str
    user_age: int
    country: Optional[str] = None
    city: Optional[str] = None

    # build custom logic into the validator
    # always=True runs the validator even when "city" is missing
    # the validator is attached to "city" (the later field) because `values`
    # only holds fields declared before the one being validated
    @validator('city', pre=True, always=True)
    def country_and_city_together(cls, value, values):
        if bool(value) != bool(values.get('country')):
            raise ValueError('Fill country and city together')
        return value

new_user = User_Data.parse_raw(input_data)
display(new_user)


ValidationError: 1 validation error for User_Data
city
  Fill country and city together (type=value_error)
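
An alternative that does not depend on field order is a model-level validator, which sees all incoming values at once. The sketch below uses `@root_validator(pre=True)` from the v1-style API (Pydantic v2 offers `@model_validator` for the same purpose); the helper `is_valid` is a hypothetical name added only to make the outcome easy to inspect.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, root_validator


class UserData(BaseModel):
    user_name: str
    country: Optional[str] = None
    city: Optional[str] = None

    # pre=True: runs on the raw input, before the individual fields are validated
    @root_validator(pre=True)
    def country_and_city_together(cls, values):
        if bool(values.get("country")) != bool(values.get("city")):
            raise ValueError("Fill country and city together")
        return values


def is_valid(payload: dict) -> bool:
    """Return True if the payload passes validation."""
    try:
        UserData(**payload)
        return True
    except ValidationError:
        return False


print(is_valid({"user_name": "Gregor"}))
print(is_valid({"user_name": "Gregor", "country": "Best Country"}))
print(is_valid({"user_name": "Gregor", "country": "Best Country", "city": "Best City"}))
```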

### Forbid extra data:
We can specify `extra = 'forbid'` in the `Config` class inside the model to reject any incoming fields that the model does not declare.

`extra = 'ignore'` is the default behaviour, which silently drops extra incoming data; `extra = 'allow'` keeps the extra data on the model.

In [10]:
from pydantic import BaseModel

input_data = '''{
                    "user_name": "Gregor",
                    "user_age": 51,
                    "height": 180,
                    "age": 27,
                    "bio": "exciting story about Gregor"
                }
             '''

class User_Data(BaseModel):
    user_name: str
    user_age: int

    class Config:
        extra = 'forbid'
    
new_user = User_Data.parse_raw(input_data)
display(new_user)


ValidationError: 3 validation errors for User_Data
age
  extra fields not permitted (type=value_error.extra)
bio
  extra fields not permitted (type=value_error.extra)
height
  extra fields not permitted (type=value_error.extra)
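
For comparison, here is a minimal sketch of the other two `extra` settings applied to the same payload. The model names `IgnoringUser` and `AllowingUser` are made up for the illustration, and `.dict()` is the v1-style export method used throughout this notebook.

```python
from pydantic import BaseModel

payload = {"user_name": "Gregor", "user_age": 51, "height": 180}


class IgnoringUser(BaseModel):
    # no Config: extra fields are silently dropped (the default, 'ignore')
    user_name: str
    user_age: int


class AllowingUser(BaseModel):
    user_name: str
    user_age: int

    class Config:
        extra = "allow"   # extra fields are kept on the model


ignored = IgnoringUser(**payload).dict()
allowed = AllowingUser(**payload).dict()

print(ignored)
print(allowed)
```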

### Constraining Types:
Pydantic provides several functions that return constrained types, namely:
- [conint()](https://docs.pydantic.dev/latest/api/types/#pydantic.types.conint)
- [confloat()](https://docs.pydantic.dev/latest/api/types/#pydantic.types.confloat)
- [conbytes()](https://docs.pydantic.dev/latest/api/types/#pydantic.types.conbytes)
- [constr()](https://docs.pydantic.dev/latest/api/types/#pydantic.types.constr)
- [conset()](https://docs.pydantic.dev/latest/api/types/#pydantic.types.conset)
- [confrozenset()](https://docs.pydantic.dev/latest/api/types/#pydantic.types.confrozenset)
- [conlist()](https://docs.pydantic.dev/latest/api/types/#pydantic.types.conlist)
- [condecimal()](https://docs.pydantic.dev/latest/api/types/#pydantic.types.condecimal)
- [condate()](https://docs.pydantic.dev/latest/api/types/#pydantic.types.condate)

Details on those types and the parameters they accept can be found in Pydantic's documentation.
In the example below we use `conint` to constrain integer values.

In [11]:
# import the conint function
from pydantic import BaseModel, conint

input_data = '''{
                    "user_name": "Bobby",
                    "user_age": 9
                }
             '''

class User_Data(BaseModel):
    user_name: str
    # use conint to restrict 'user_age' values
    user_age: conint(gt=10, lt=75)
    
new_user = User_Data.parse_raw(input_data)
display(new_user)

ValidationError: 1 validation error for User_Data
user_age
  ensure this value is greater than 10 (type=value_error.number.not_gt; limit_value=10)