# Data Validation in Model Serving

In this notebook, we will go through the process of validating data in the serving phase of the data science pipeline.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NatanMish/data_validation/blob/main/notebooks/3_model_serving_data_validation.ipynb)

#### Install the required packages and import them to the notebook

In [1]:
!pip install pydantic pandas hypothesis



In [2]:
from pydantic import BaseModel
import pandas as pd

Let's remind ourselves of the architecture of a modern machine learning model:

<div>
<img src="https://raw.githubusercontent.com/NatanMish/data_validation/main/notebooks/serving_diagram.png" width="1000"/>
</div>

So we should expect to receive the input for our model via a REST API and return the prediction or classification accordingly. An object arriving at the REST API can be a JSON object, but we can also receive a CSV file, a text file, etc. For the purposes of this tutorial, we will assume that the input is a JSON object and that we receive it in batches. Let's create a few model inputs straight from the test data:

In [3]:
test_house_data = pd.read_csv('https://github.com/NatanMish/data_validation/blob/a77b247b25c6622ce0c8f8cbc505228161c31a3c/data/test.csv?raw=true')

In [4]:
feature_names = ['YearBuilt', 'LotFrontage', 'GarageArea', 'OverallQual', 'OverallCond', 'MSZoning','TotalBsmtSF']

In [5]:
# we'll choose one record from the test data
input_dict = test_house_data[feature_names].loc[1108].to_dict()

In [6]:
input_dict

{'YearBuilt': 1968,
 'LotFrontage': nan,
 'GarageArea': 552.0,
 'OverallQual': 6,
 'OverallCond': 6,
 'MSZoning': 'RL',
 'TotalBsmtSF': 1488.0}

We can already see that one of the features (`LotFrontage`) is missing. If our model is not built to handle missing values, it will break, and no one wants that.

<div>
<img src="https://miro.medium.com/max/959/1*WNd3LXOi5xlDbitxsIARyw.png" width="500"/>
</div>
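A missing value can arrive as `null` in a JSON payload or as `nan` once the data has passed through pandas. As a minimal sketch using only the standard library, the check below flags both cases; the helper name `missing_fields` and the sample payload are assumptions made for illustration and are not part of the original notebook.

```python
import json
import math


def missing_fields(record):
    """Return the names of fields whose value is None or NaN."""
    return [
        name
        for name, value in record.items()
        if value is None or (isinstance(value, float) and math.isnan(value))
    ]


# Hypothetical JSON payload for a single record, with LotFrontage sent as null
payload = '{"YearBuilt": 1968, "LotFrontage": null, "GarageArea": 552.0}'
record = json.loads(payload)

print(missing_fields(record))  # ['LotFrontage']
print(missing_fields({'YearBuilt': 1968, 'LotFrontage': float('nan')}))  # ['LotFrontage']
```

A hand-written check like this works for one or two fields, but it does not scale to types, ranges and allowed values, which is where a validation library becomes useful.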

Pydantic provides data validation and settings management using Python type annotations. It enforces type hints at runtime and provides user-friendly errors when data is invalid. Many great and well-loved projects use Pydantic extensively, even the Jupyter project behind the notebooks we are using right now!
Let's see how we can use it to validate our model inputs:

In [7]:
# The most basic building block of Pydantic is the BaseModel class. We can use it to define our custom models that 
# define the structure of our objects:
class Input(BaseModel):
    YearBuilt: int
    LotFrontage: int
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: str
    TotalBsmtSF: float

In [8]:
# We can now use the Input class and create an Input object from our sample record like so:
model_input_object = Input(
    YearBuilt=input_dict['YearBuilt'],
    LotFrontage=input_dict['LotFrontage'],
    GarageArea=input_dict['GarageArea'],
    OverallQual=input_dict['OverallQual'],
    OverallCond=input_dict['OverallCond'],
    MSZoning=input_dict['MSZoning'],
    TotalBsmtSF=input_dict['TotalBsmtSF']
)

ValidationError: ignored

As we can see, the Input object raised an error while validating because `LotFrontage` is missing (`nan`) and cannot be parsed as an integer.
Let's see how we can adjust the model, so it can handle missing values:

In [9]:
# What's cool about Pydantic is that it allows leveraging the built-in Python typing definitions.
from typing import Optional, Any

class Input(BaseModel):
    YearBuilt: int
    LotFrontage: Optional[Any]
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: str
    TotalBsmtSF: float

In [11]:
input_dict

{'YearBuilt': 1968,
 'LotFrontage': nan,
 'GarageArea': 552.0,
 'OverallQual': 6,
 'OverallCond': 6,
 'MSZoning': 'RL',
 'TotalBsmtSF': 1488.0}

In [10]:
# Instead of specifying each field one by one, we can unpack the dictionary into keyword arguments with the ** syntax:
input_object = Input(**input_dict)
# parse_obj builds the same object directly from the dictionary:
input_object_2 = Input.parse_obj(input_dict)

In [14]:
# And this is what our object looks like:
input_object

Input(YearBuilt=1968, LotFrontage=nan, GarageArea=552.0, OverallQual=6, OverallCond=6, MSZoning='RL', TotalBsmtSF=1488.0)

Here are a few other cool features of Pydantic:

#### 1. Recursive models

Pydantic supports recursive models. This means that we can define a model that contains other models. For example, we can define a model that contains a batch of Input objects:

In [15]:
from typing import List

class InputBatch(BaseModel):
    inputs: List[Input]

In [16]:
# We'll grab a sample of the data:
data_sample_dict = test_house_data[feature_names].sample(n=5).to_dict('index')
data_sample_dict

{1171: {'YearBuilt': 2005,
  'LotFrontage': 92.0,
  'GarageArea': 660.0,
  'OverallQual': 9,
  'OverallCond': 5,
  'MSZoning': 'RL',
  'TotalBsmtSF': 1390.0},
 142: {'YearBuilt': 1896,
  'LotFrontage': 66.0,
  'GarageArea': 330.0,
  'OverallQual': 4,
  'OverallCond': 7,
  'MSZoning': 'C (all)',
  'TotalBsmtSF': 756.0},
 129: {'YearBuilt': 1975,
  'LotFrontage': 65.0,
  'GarageArea': 352.0,
  'OverallQual': 6,
  'OverallCond': 6,
  'MSZoning': 'RL',
  'TotalBsmtSF': 1008.0},
 1036: {'YearBuilt': 1954,
  'LotFrontage': 70.0,
  'GarageArea': 332.0,
  'OverallQual': 6,
  'OverallCond': 6,
  'MSZoning': 'RL',
  'TotalBsmtSF': 988.0},
 1277: {'YearBuilt': 1967,
  'LotFrontage': nan,
  'GarageArea': 506.0,
  'OverallQual': 5,
  'OverallCond': 5,
  'MSZoning': 'RL',
  'TotalBsmtSF': 1584.0}}

In [17]:
# And we can create an InputBatch object from our sample data:
inputs_list = [Input(**record) for record in data_sample_dict.values()]
input_batch = InputBatch(inputs=inputs_list)

In [18]:
# And we can see that our object looks like this:
input_batch

InputBatch(inputs=[Input(YearBuilt=2005, LotFrontage=92.0, GarageArea=660.0, OverallQual=9, OverallCond=5, MSZoning='RL', TotalBsmtSF=1390.0), Input(YearBuilt=1896, LotFrontage=66.0, GarageArea=330.0, OverallQual=4, OverallCond=7, MSZoning='C (all)', TotalBsmtSF=756.0), Input(YearBuilt=1975, LotFrontage=65.0, GarageArea=352.0, OverallQual=6, OverallCond=6, MSZoning='RL', TotalBsmtSF=1008.0), Input(YearBuilt=1954, LotFrontage=70.0, GarageArea=332.0, OverallQual=6, OverallCond=6, MSZoning='RL', TotalBsmtSF=988.0), Input(YearBuilt=1967, LotFrontage=nan, GarageArea=506.0, OverallQual=5, OverallCond=5, MSZoning='RL', TotalBsmtSF=1584.0)])
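Nested models are also validated when the batch is built from raw dictionaries, so the inner objects do not have to be constructed by hand. The sketch below illustrates this with two trimmed-down, hypothetical models (`HouseRecord` and `HouseBatch` stand in for `Input` and `InputBatch`); the sample values are made up.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class HouseRecord(BaseModel):  # hypothetical stand-in for Input
    YearBuilt: int
    LotFrontage: Optional[float] = None
    MSZoning: str


class HouseBatch(BaseModel):  # hypothetical stand-in for InputBatch
    inputs: List[HouseRecord]


raw_records = [
    {'YearBuilt': 2005, 'LotFrontage': 92.0, 'MSZoning': 'RL'},
    {'YearBuilt': 1967, 'LotFrontage': None, 'MSZoning': 'RL'},
]

# Each dictionary is validated and converted into a HouseRecord
batch = HouseBatch(inputs=raw_records)
print(type(batch.inputs[0]).__name__)

# An invalid nested record is reported with its position inside the batch
try:
    HouseBatch(inputs=[{'YearBuilt': 'not a year', 'MSZoning': 'RL'}])
except ValidationError as e:
    bad_location = e.errors()[0]['loc']
    print(bad_location)
```

The `loc` entry of each error points at the offending list index and field, which helps when a single bad record has to be traced inside a large batch.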

#### 2. Enums

Pydantic supports enumerations. This means that we can restrict a field to a set of predefined values. For example, we can define the set of allowed values for the MSZoning feature:

In [19]:
from enum import Enum

class MSZoning(str, Enum):
    C = 'C (all)'
    FV = 'FV'
    RH = 'RH'
    RL = 'RL'
    RM = 'RM'

In [20]:
# We can then include the enum in our model:
class Input(BaseModel):
    YearBuilt: int
    LotFrontage: Optional[float]
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float
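To see the effect of an enum field, here is a small sketch with a hypothetical, shortened enum (`Zoning`) and model (`ZonedInput`). A known value is converted to the enum member, while an unknown value is rejected.

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class Zoning(str, Enum):  # hypothetical, shortened copy of MSZoning
    RL = 'RL'
    RM = 'RM'


class ZonedInput(BaseModel):  # hypothetical model with a single enum field
    MSZoning: Zoning


# A known value is accepted and converted to the enum member
ok = ZonedInput(MSZoning='RL')
print(ok.MSZoning is Zoning.RL)  # True

# A value outside the enum raises a ValidationError
try:
    ZonedInput(MSZoning='XX')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)  # True
```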

#### 3. Custom base models

When developing a model, you might end up referencing its field names in many different places by their literal names, for example:
`year_built = model_input_object.YearBuilt`
If the field name changes in the future, you would have to update every place that uses the literal name, which could be a pain.
Luckily, we can define an extended base model that supports bracket notation for field access. Then we can create an enum class that holds the field names and use it whenever we set or get a specific field. If the field name changes in the source, we only have to update the enum class.

In [21]:
input_batch.inputs

[Input(YearBuilt=2005, LotFrontage=92.0, GarageArea=660.0, OverallQual=9, OverallCond=5, MSZoning='RL', TotalBsmtSF=1390.0),
 Input(YearBuilt=1896, LotFrontage=66.0, GarageArea=330.0, OverallQual=4, OverallCond=7, MSZoning='C (all)', TotalBsmtSF=756.0),
 Input(YearBuilt=1975, LotFrontage=65.0, GarageArea=352.0, OverallQual=6, OverallCond=6, MSZoning='RL', TotalBsmtSF=1008.0),
 Input(YearBuilt=1954, LotFrontage=70.0, GarageArea=332.0, OverallQual=6, OverallCond=6, MSZoning='RL', TotalBsmtSF=988.0),
 Input(YearBuilt=1967, LotFrontage=nan, GarageArea=506.0, OverallQual=5, OverallCond=5, MSZoning='RL', TotalBsmtSF=1584.0)]

In [22]:
class ExtendedBaseModel(BaseModel):
    def __getitem__(self, item):
        return getattr(self, item)

    def __setitem__(self, item, value):
        return setattr(self, item, value)

In [23]:
# by using the extended base model, we can use a bracket notation to access the field values:
class Input(ExtendedBaseModel):
    YearBuilt: int
    LotFrontage: float
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float

input_object = Input(**input_dict)
# now we can use a bracket notation to access the field values:
input_object['YearBuilt']

1968

In [24]:
# Let's create an enum class that holds the field names:
class InputFieldNames(str, Enum):
    YearBuilt = 'YearBuilt'
    LotFrontage = 'LotFrontage'
    GarageArea = 'GarageArea'
    OverallQual = 'OverallQual'
    OverallCond = 'OverallCond'
    MSZoning = 'MSZoning'
    TotalBsmtSF = 'TotalBsmtSF'

In [25]:
# and then we can access the field like so:
input_object[InputFieldNames.YearBuilt.value]

1968

#### 4. Catch invalid data

If an invalid model input arrives, we can catch it, display a warning message, isolate the record and then continue without breaking the model. This can be done using a `try`/`except` block. This is a common and valid practice in Python called EAFP ("easier to ask for forgiveness than permission"), which might not be as well received in other languages.

In [26]:
from pydantic import ValidationError

class Input(ExtendedBaseModel):
    YearBuilt: int
    LotFrontage: int
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float

try:
    model_input_object = Input(**input_dict)
except ValidationError as e:
    print(f'Record: {input_dict}')
    print(f'Bad data: {e.errors()}')

    # isolate the record and then continue

Record: {'YearBuilt': 1968, 'LotFrontage': nan, 'GarageArea': 552.0, 'OverallQual': 6, 'OverallCond': 6, 'MSZoning': 'RL', 'TotalBsmtSF': 1488.0}
Bad data: [{'loc': ('LotFrontage',), 'msg': 'value is not a valid integer', 'type': 'type_error.integer'}]
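Applied to a whole batch, the same `try`/`except` pattern can separate valid records from rejected ones. The following is a minimal sketch under stated assumptions: `HouseRecord` is a hypothetical two-field model and `split_valid_invalid` is an illustrative helper, neither of which appears in the original notebook.

```python
from pydantic import BaseModel, ValidationError


class HouseRecord(BaseModel):  # hypothetical, trimmed-down model
    YearBuilt: int
    GarageArea: float


def split_valid_invalid(records):
    """Validate each record; return (valid objects, rejected records with their errors)."""
    valid = []
    rejected = []
    for record in records:
        try:
            valid.append(HouseRecord(**record))
        except ValidationError as e:
            # Isolate the bad record together with the reason it failed
            rejected.append((record, e.errors()))
    return valid, rejected


records = [
    {'YearBuilt': 1968, 'GarageArea': 552.0},
    {'YearBuilt': 'unknown', 'GarageArea': 330.0},  # wrong type
    {'YearBuilt': 2005},                            # missing field
]

valid, rejected = split_valid_invalid(records)
print(len(valid), len(rejected))  # 1 2
```

The rejected list can then be logged or sent to a separate queue for inspection while the valid records continue to the model.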


#### 5. Custom validation

In [27]:
from pydantic import validator
# Let's create a model input class with a custom validation function that checks that the garage area is divisible by 5
class Input(ExtendedBaseModel):
    YearBuilt: int
    LotFrontage: Optional[float]
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float

    @validator('GarageArea')
    def check_garage_area(cls, v):
        if v % 5 != 0:
            raise ValueError('Garage area must be divisible by 5')
        return v
    
    @validator(
        "LotFrontage",
        "GarageArea",
        pre=False,
        each_item=True,
    )
    def set_metrics_precision(cls, v):
        """Round all figures to 2 decimal places"""
        return round(v, 2)

In [28]:
# The validator should flag this record, because 552 is not divisible by 5:
input_dict['GarageArea']

552.0

In [29]:
input_object = Input(**input_dict)

ValidationError: ignored
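The truncated traceback above hides the message produced by the custom validator. As a sketch, the snippet below shows one way to read it from `e.errors()`. It uses a hypothetical single-field model (`GarageInput`), an illustrative helper (`garage_error_messages`), and the same Pydantic v1 style `validator` decorator as this notebook.

```python
from pydantic import BaseModel, ValidationError, validator


class GarageInput(BaseModel):  # hypothetical, trimmed-down model
    GarageArea: float

    @validator('GarageArea')
    def check_garage_area(cls, v):
        if v % 5 != 0:
            raise ValueError('Garage area must be divisible by 5')
        return v


def garage_error_messages(value):
    """Return the validation messages for one GarageArea value (empty list if valid)."""
    try:
        GarageInput(GarageArea=value)
    except ValidationError as e:
        return [error['msg'] for error in e.errors()]
    return []


print(garage_error_messages(552.0))  # one message mentioning "divisible by 5"
print(garage_error_messages(550.0))  # []
```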

#### Exercise 1 - Pydantic models

Create a model class that represents the model input. Not all of the requested checks were shown above; for some of them you'll need to search the Pydantic documentation. The model class should have the following fields:
1. YearBuilt - int - required - non nullable
2. Fireplaces - int - optional - nullable
3. FireplaceQu - FireplaceQu - optional - nullable - enum values of 'Ex', 'Gd', 'TA', 'Fa', 'Po'. If Fireplaces is greater than 0, FireplaceQu is required.
4. Make the input model object immutable so that once it is created, it cannot be changed.


In [None]:
from enum import Enum

class MSZoning(str, Enum):
    C = 'C (all)'
    FV = 'FV'
    RH = 'RH'
    RL = 'RL'
    RM = 'RM'

In [None]:
class FireplaceQu(str, Enum):
    Ex = 'Ex'
    Gd = 'Gd'
    TA = 'TA'
    Fa = 'Fa'
    Po = 'Po'

class Input(BaseModel):
    YearBuilt: int
    Fireplaces: Optional[int]
    FireplaceQu: Optional[FireplaceQu]

    @validator("FireplaceQu")
    @classmethod
    def validate_fire_places_quality_field(cls, field_value, values):
        if (values.get("Fireplaces") or 0) > 0 and field_value is None:
            raise ValueError("FireplaceQu is required when Fireplaces is greater than 0")
        return field_value

    class Config:
        extra = 'forbid'
        allow_mutation = False

In [31]:
class FireplaceQu(str, Enum):
    Ex = 'Ex'
    Gd = 'Gd'
    TA = 'TA'
    Fa = 'Fa'
    Po = 'Po'

class Input(BaseModel):
    YearBuilt: int
    Fireplaces: Optional[int]
    FireplaceQu: Optional[FireplaceQu]

    @validator("FireplaceQu")
    @classmethod
    def validate_fire_places_quality_field(cls, field_value, values):
        if (values.get("Fireplaces") or 0) > 0 and field_value is None:
            raise ValueError('FireplaceQu is required when Fireplaces > 0')
        return field_value

#    class Config:
#        <YOUR CODE HERE>

In [32]:
# Let's create a few records and see if the validation works, this is a valid record:
input_dict = {
    'YearBuilt': '1901',
    'Fireplaces': '0',
    'FireplaceQu': 'Ex'
}
input_object = Input(**input_dict)

In [38]:
# and this is an invalid record:
input_dict = {
    'YearBuilt': '1901',
    'Fireplaces': '1',
    'FireplaceQu': None
}
input_object = Input(**input_dict)

ValidationError: ignored

*Exercise solutions can be found in the exercise solutions file in the current directory.*

### 6. JSON Serialization
After we have received an object and our model has generated an output, we need to serialize it to JSON so we can send it back to the client. We'll start by creating an output model class to define what our output will look like.

In [None]:
from datetime import datetime
from typing import Tuple
from pydantic import confloat, Field
from uuid import UUID, uuid4

class ImportantFeature(BaseModel):
    FeatureName: str
    FeatureValue: Any
    ImportanceScore: confloat(ge=0, le=1)

class HousePricePrediction(BaseModel):
    PredictionId: UUID = Field(default_factory=uuid4)
    HousePrice: confloat(ge=0)
    PredictionGenerationTime: datetime = Field(default_factory=datetime.now)
    Explanation: Optional[List[ImportantFeature]]
    ConfidenceInterval: Optional[Tuple[float, float]]

In [None]:
# Let's create an output object:
output_object = HousePricePrediction(
    HousePrice=12345,
    Explanation=[
        ImportantFeature(FeatureName='YearBuilt', FeatureValue=1901, ImportanceScore=0.5),
        ImportantFeature(FeatureName='Fireplaces', FeatureValue=0, ImportanceScore=0.5),
        ImportantFeature(FeatureName='FireplaceQu', FeatureValue='Ex', ImportanceScore=0.5)
    ],
    ConfidenceInterval=(12000, 13000)
)
output_object

When we want to send this object over an API to the client, we need to serialize it to JSON. We can do this using the `json()` method. It detects the different field types and converts them to a JSON-friendly format, which a plain `json.dumps(output_object.dict())` cannot do for types such as `UUID` and `datetime`.

In [None]:
output_object.json()
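To illustrate why the plain `json` module is not enough here, the sketch below uses only the standard library. The payload values and the `to_json_friendly` helper are made up for the example; the helper is a hand-written fallback, not what Pydantic does internally.

```python
import json
from datetime import datetime
from uuid import uuid4

# Hypothetical output payload containing types that json cannot serialize natively
payload = {
    'PredictionId': uuid4(),
    'HousePrice': 12345.0,
    'PredictionGenerationTime': datetime(2022, 9, 1, 12, 0, 0),
}

# Plain json.dumps raises a TypeError on UUID and datetime values
try:
    json.dumps(payload)
    plain_dumps_failed = False
except TypeError:
    plain_dumps_failed = True


def to_json_friendly(value):
    """Fallback used by json.dumps for values it cannot serialize itself."""
    if isinstance(value, datetime):
        return value.isoformat()
    return str(value)


serialized = json.dumps(payload, default=to_json_friendly)
round_trip = json.loads(serialized)

print(plain_dumps_failed)                      # True
print(round_trip['PredictionGenerationTime'])  # 2022-09-01T12:00:00
```

The `json()` method saves us from writing and maintaining a fallback like this for every special type.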

We can even define custom rules for serialization using the `json_encoders` config option. This is a dictionary that maps types (not field names) to the functions that will be used to serialize values of that type.

In [None]:
def encrypt_feature_value(feature: ImportantFeature, encryption_key="some_secret_key"):
    # Placeholder: a real implementation would encrypt the value with encryption_key.
    # Here the value is only masked so that the example stays runnable.
    return {**feature.dict(), 'FeatureValue': '***'}

class HousePricePrediction(BaseModel):
    PredictionId: UUID = Field(default_factory=uuid4)
    HousePrice: confloat(ge=0)
    PredictionGenerationTime: datetime = Field(default_factory=datetime.now)
    Explanation: Optional[List[ImportantFeature]]
    ConfidenceInterval: Optional[Tuple[float, float]]

    class Config:
        # json_encoders is keyed by type, not by field name.
        # An encoder for a nested model only runs when calling .json(models_as_dict=False).
        json_encoders = {
            datetime: datetime.isoformat,
            ImportantFeature: encrypt_feature_value
        }

### 7. Hypothesis plugin

Hypothesis is a Python package used for property-based testing. We can use it in combination with Pydantic as a tool for generating random data and testing the validity of a data model.

In [None]:
from hypothesis import given, strategies as st

@given(st.builds(HousePricePrediction))
def test_property(instance):
    # Hypothesis calls this test function many times with varied Models,
    # so you can write a test that should pass given *any* instance.
    assert len(str(instance.PredictionId)) == 36
    assert instance.HousePrice >= 0
    assert instance.PredictionGenerationTime is not None

test_property()

### 8. Base settings
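One of the most useful features in Pydantic is the `BaseSettings` class. This class allows for centralising config keys and variables that are used throughout the project. The example below shows three ways to map environment variables to fields: matching the field name directly (`ENCRYPTION_KEY`), passing `env` to `Field` (`ApiKey`), and listing the variable under `fields` in the `Config` class (`RedisUrl`):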

In [None]:
import os
os.environ['ENCRYPTION_KEY'] = "some_secret_key"
os.environ['MY_API_KEY'] = "my_api_key"
os.environ['REDIS_URL'] = "redis://redis:6379"

In [None]:
from pydantic import BaseSettings, RedisDsn
class ProjectConfig(BaseSettings):
    RedisUrl: RedisDsn
    ApiKey: str = Field(..., env='MY_API_KEY')
    ENCRYPTION_KEY: str

    class Config:
        fields = {
            'RedisUrl': {
                'env': ['REDIS_URL']
            }
        }
        case_sensitive = False

In [None]:
# This is what the config object looks like:
ProjectConfig().dict()

### Exercise 2 - Constrained and Strict Field Types
In this exercise we will explore what the constrained and strict field types in Pydantic offer us.
Create an `Input` model class with the following constraints:
1. Amenities is a list of strings with no more than 5 items.
2. HousePrice is a non-coercible (strict) integer between 0 and 10000000, and a multiple of 100.
3. PoolBool is a strict boolean, i.e. it must be either `True` or `False`.
4. MSZoning is a string with a maximum of 20 characters that is converted to lower case and must contain the word `zone` (hint: use regex).

In [None]:
from pydantic import <YOUR CODE HERE>
class Input(BaseModel):
    <YOUR CODE HERE>

In [None]:
# This Input object is valid:
Input(
  Amenities=['Amenity1', 'Amenity2', 'Amenity3', 'Amenity4', 'Amenity5'],
  HousePrice=125000,
  PoolBool=True,
  MSZoning='1ZONE!'
)

In [None]:
# This Input object is invalid:
Input(
  Amenities=['Amenity1', 'Amenity2', 'Amenity3', 'Amenity4', 'Amenity5', 'Amenity6'],
  HousePrice=1250,
  PoolBool=1,
  MSZoning='1Zne'
)