# Data Validation in Model Serving

In this notebook, we will go through the process of validating data in the serving phase of the data science pipeline.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NatanMish/data_validation/blob/main/notebooks/3_model_serving_data_validation.ipynb)

#### Install the required packages and import them to the notebook

In [None]:
!pip install pydantic pandas

In [None]:
from pydantic import BaseModel
import pandas as pd

Let's remind ourselves of the architecture of a modern machine learning model:

<div>
<img src="https://raw.githubusercontent.com/NatanMish/data_validation/main/notebooks/serving_diagram.png" width="1000"/>
</div>

We should therefore expect to receive the input for our model via a REST API and return the prediction or classification accordingly. An object arriving at the REST API can be a JSON object, but it could also be a CSV file, a text file, etc. For the purposes of this tutorial, we will assume that the input is a JSON object and that we receive it in batches. Let's create a few model inputs straight from the test data:

In [None]:
test_home_data = pd.read_csv('https://github.com/NatanMish/data_validation/blob/a77b247b25c6622ce0c8f8cbc505228161c31a3c/data/test.csv?raw=true')

In [None]:
feature_names = ['YearBuilt', 'LotFrontage', 'GarageArea', 'OverallQual', 'OverallCond', 'MSZoning','TotalBsmtSF']

In [None]:
# we'll choose one record from the test data
input_dict = test_home_data[feature_names].loc[1108].to_dict()

In [None]:
input_dict

We can already see that one of the features is missing. If our model is not built to handle missing values, it will break, and no one wants that.
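
Before bringing in a validation library, it can help to see what "missing" means in practice: pandas represents an absent number as `NaN`, which is still a float. The snippet below is a minimal sketch using only the standard library. The `record` values and the helper name `find_missing_fields` are made up for illustration and are not part of the dataset or of any library.

```python
import math

# Hypothetical record shaped like the one above; the values are made up.
record = {
    "YearBuilt": 1950,
    "LotFrontage": float("nan"),
    "GarageArea": 400.0,
    "MSZoning": "RL",
}


def find_missing_fields(record):
    """Return the names of fields whose value is None or NaN."""
    missing = []
    for name, value in record.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            missing.append(name)
    return missing


missing_fields = find_missing_fields(record)
print(missing_fields)
```

A hand-written check like this works, but it has to be repeated for every rule we care about, which is where a validation library becomes useful.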

<div>
<img src="https://miro.medium.com/max/959/1*WNd3LXOi5xlDbitxsIARyw.png" width="500"/>
</div>

Pydantic provides data validation and settings management using Python type annotations. It enforces type hints at runtime and provides user-friendly errors when data is invalid. Many popular projects use Pydantic extensively, including the Jupyter project behind the notebooks we are using right now!
Let's see how we can use it to validate our model inputs:

In [None]:
# The most basic building block of Pydantic is the BaseModel class. We can use it to define our custom models that define the structure of our objects:
class Input(BaseModel):
    YearBuilt: int
    LotFrontage: int
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: str
    TotalBsmtSF: float

In [None]:
# We can now use the Input class and create an Input object from our sample record like so:
model_input_object = Input(
    YearBuilt=input_dict['YearBuilt'],
    LotFrontage=input_dict['LotFrontage'],
    GarageArea=input_dict['GarageArea'],
    OverallQual=input_dict['OverallQual'],
    OverallCond=input_dict['OverallCond'],
    MSZoning=input_dict['MSZoning'],
    TotalBsmtSF=input_dict['TotalBsmtSF']
)

As we can see, the Input object raised an error while validating because one of the features is missing.
Let's see how we can adjust the model, so it can handle missing values:

In [None]:
# What's cool about Pydantic is that it allows leveraging the built-in Python typing definitions.
from typing import Optional, Any

class Input(BaseModel):
    YearBuilt: int
    LotFrontage: Optional[Any]
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: str
    TotalBsmtSF: float

In [None]:
# Instead of specifying each field one by one, we can unpack the dictionary into keyword arguments with the ** syntax:
input_object = Input(**input_dict)

In [None]:
# And this is what our object looks like:
input_object
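
Note that `Optional[Any]` accepts any value at all, so it effectively switches validation off for that field. A tighter option, sketched below, is to declare the field as `Optional[float]` with a default of `None` and to convert `NaN` to `None` before building the object. The model name `LotInput`, the helper `nan_to_none` and the sample values are hypothetical and only serve the illustration.

```python
import math
from typing import Optional

from pydantic import BaseModel


class LotInput(BaseModel):  # hypothetical, trimmed-down model for illustration
    YearBuilt: int
    LotFrontage: Optional[float] = None  # explicit default: may be absent or None


def nan_to_none(record):
    """Return a copy of the record with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


raw_record = {"YearBuilt": 1950, "LotFrontage": float("nan")}  # made-up values
lot_input = LotInput(**nan_to_none(raw_record))
print(lot_input)
```

With this approach a missing lot frontage is stored as `None`, while a value of the wrong type would still be rejected.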

Here are a few other cool features of Pydantic:

#### 1. Recursive models

Pydantic supports recursive models. This means that we can define a model that contains other models. For example, we can define a model that contains a batch of Input objects:

In [None]:
from typing import List

class InputBatch(BaseModel):
    inputs: List[Input]

In [None]:
# We'll grab a sample of the data:
data_sample_dict = test_home_data[feature_names].sample(n=5).to_dict('index')
data_sample_dict

In [None]:
# And we can create an InputBatch object from our sample data:
inputs_list = [Input(**input_dict) for input_dict in data_sample_dict.values()]
input_batch = InputBatch(inputs=inputs_list)

In [None]:
# And we can see that our object looks like this:
input_batch
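
In a serving setting the batch would typically arrive as a JSON string rather than as ready-made objects. The sketch below assumes a payload with a top-level `inputs` key. That layout, the sample values and the model names `HouseInput` and `HouseInputBatch` are hypothetical stand-ins for the classes above. It shows that nested dictionaries are validated and converted into model objects for us.

```python
import json
from typing import List, Optional

from pydantic import BaseModel


class HouseInput(BaseModel):  # hypothetical, trimmed-down stand-in for Input
    YearBuilt: int
    LotFrontage: Optional[float] = None


class HouseInputBatch(BaseModel):  # hypothetical stand-in for InputBatch
    inputs: List[HouseInput]


# Made-up payload, shaped like a request body a REST API might receive.
payload = (
    '{"inputs": ['
    '{"YearBuilt": 1950, "LotFrontage": 60.0}, '
    '{"YearBuilt": 2003, "LotFrontage": null}'
    "]}"
)

# The plain dictionaries produced by json.loads are validated and
# converted into HouseInput objects when the batch is created.
batch = HouseInputBatch(**json.loads(payload))
print(batch.inputs[1])
```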

#### 2. Enums

Pydantic supports enumerations. This means that we can restrict a field to a set of predefined values. For example, we can define the set of allowed values for the MSZoning feature:

In [None]:
from enum import Enum

class MSZoning(str, Enum):
    C = 'C (all)'
    FV = 'FV'
    RH = 'RH'
    RL = 'RL'
    RM = 'RM'

In [None]:
# We can then include the enum in our model:
class Input(BaseModel):
    YearBuilt: int
    LotFrontage: Optional[float]
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float
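
To see the enum at work, we can try a value that is not one of its members. This is a minimal sketch with a shortened enum and a one-field model. The names `Zoning` and `ZonedInput` and the value `'XX'` are made up for illustration.

```python
from enum import Enum

from pydantic import BaseModel, ValidationError


class Zoning(str, Enum):  # hypothetical, shortened version of MSZoning
    RL = "RL"
    RM = "RM"


class ZonedInput(BaseModel):  # hypothetical one-field model
    MSZoning: Zoning


# A known value is converted into the matching enum member.
accepted = ZonedInput(MSZoning="RL")
print(accepted.MSZoning)

try:
    ZonedInput(MSZoning="XX")  # not a member of the enum
    rejected = False
except ValidationError as e:
    rejected = True
    print(e.errors()[0]["loc"])  # location of the offending field
```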

#### 3. Custom base models

When developing a model, you might end up referencing its field names in many different places by their literal names, for example:
`year_built = model_input_object.YearBuilt`
If the field name changes in the future, you would have to update every place that uses the literal name, which can be a pain.
Luckily, we can define an extended base model that lets us access fields with bracket notation. Then we can create an enum class that holds the field names and use it whenever we set or get a specific field. If the field name changes in the source, we only have to update the enum class.

In [None]:
input_batch.inputs

In [None]:
class ExtendedBaseModel(BaseModel):
    def __getitem__(self, item):
        return getattr(self, item)

    def __setitem__(self, item, value):
        return setattr(self, item, value)

In [None]:
# by using the extended base model, we can use a bracket notation to access the field values:
class Input(ExtendedBaseModel):
    YearBuilt: int
    LotFrontage: float
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float

input_object = Input(**input_dict)
# now we can use a bracket notation to access the field values:
input_object['YearBuilt']

In [None]:
# Let's create an enum class that holds the field names:
class InputFieldNames(Enum):
    YearBuilt = 'YearBuilt'
    LotFrontage = 'LotFrontage'
    GarageArea = 'GarageArea'
    OverallQual = 'OverallQual'
    OverallCond = 'OverallCond'
    MSZoning = 'MSZoning'
    TotalBsmtSF = 'TotalBsmtSF'

In [None]:
# and then we can access the field like so:
input_object[InputFieldNames.YearBuilt.value]

#### 4. Catch bad data

If a bad model input arrives, we can catch it, display a warning message, isolate the record and then continue without breaking the model. This can be done using a `try`/`except` block:

In [None]:
from pydantic import ValidationError

class Input(ExtendedBaseModel):
    YearBuilt: int
    LotFrontage: int
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float

try:
    model_input_object = Input(**input_dict)
except ValidationError as e:
    print(f'Record: {input_dict}')
    print(f'Bad data: {e.errors()}')

    # isolate the record and then continue
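
For a whole batch, "isolate the record and then continue" could look like the sketch below. It keeps valid inputs in one list and rejected records, together with their errors, in another. The model `SimpleInput`, the sample records and the list names are hypothetical.

```python
from pydantic import BaseModel, ValidationError


class SimpleInput(BaseModel):  # hypothetical, trimmed-down model
    YearBuilt: int
    GarageArea: float


# Made-up records; the second one has a value that cannot be parsed as an int.
records = [
    {"YearBuilt": 1950, "GarageArea": 400.0},
    {"YearBuilt": "unknown", "GarageArea": 250.0},
    {"YearBuilt": 2003, "GarageArea": 520.0},
]

valid_inputs = []
rejected_records = []
for record in records:
    try:
        valid_inputs.append(SimpleInput(**record))
    except ValidationError as e:
        # Keep the raw record together with the reasons it failed,
        # then move on to the next record instead of stopping.
        rejected_records.append({"record": record, "errors": e.errors()})

print(f"valid: {len(valid_inputs)}, rejected: {len(rejected_records)}")
```

The rejected records can then be logged or sent for inspection while the valid ones continue on to the model.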

#### 5. Custom validation

In [None]:
from pydantic import validator
# Let's create a model input class with a custom validation function that checks
# that the garage area is greater than 1000
class Input(ExtendedBaseModel):
    YearBuilt: int
    LotFrontage: Optional[float]
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float

    # Validators are class methods: the first argument is the class, the second is the field value
    @validator('GarageArea')
    def check_garage_area(cls, v):
        if v <= 1000:
            raise ValueError('Garage area must be greater than 1000')
        return v

In [None]:
# Judging by the value, the validator should flag this record:
input_dict['GarageArea']

In [None]:
input_object = Input(**input_dict)
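
For simple numeric bounds like this one, a custom function is not strictly necessary. As an alternative sketch, the same rule can be declared with Pydantic's `Field` and its `gt` (greater than) argument. The model name `GarageInput` and the sample values are made up for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class GarageInput(BaseModel):  # hypothetical one-field model
    # `...` marks the field as required; `gt=1000` rejects values
    # that are not strictly greater than 1000.
    GarageArea: float = Field(..., gt=1000)


large_garage = GarageInput(GarageArea=1200.0)

try:
    GarageInput(GarageArea=400.0)
    small_garage_rejected = False
except ValidationError:
    small_garage_rejected = True

print(large_garage, small_garage_rejected)
```

Custom validators remain the better fit when a rule needs real logic, such as a check that depends on more than one field.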

#### Exercise 1 - Pydantic models

Create a model class that represents the model input. Not all of the requested checks were shown above; for some of them you'll need to search the Pydantic documentation. The model class should meet the following requirements:
1. YearBuilt - int - required - non nullable
2. Fireplaces - int - optional - nullable
3. FireplaceQu - FireplaceQu - optional - nullable - enum values of 'Ex', 'Gd', 'TA', 'Fa', 'Po'. If Fireplaces is greater than 0, FireplaceQu is required.
4. The input model object is immutable, so that once it is created, it cannot be changed.


In [None]:
class FireplaceQu(str, Enum):
    <YOUR CODE HERE>

class Input(BaseModel):
    <YOUR CODE HERE>

    @validator("FireplaceQu")
    @classmethod
    def validate_fire_places_quality_field(cls, field_value, values):
        <YOUR CODE HERE>

    class Config:
        <YOUR CODE HERE>

In [None]:
# Let's create a few records and see if the validation works. This is a valid record:
input_dict = {
    'YearBuilt': '1901',
    'Fireplaces': '0',
    'FireplaceQu': 'Ex'
}
input_object = Input(**input_dict)

In [None]:
# and this is an invalid record:
input_dict = {
    'YearBuilt': '1901',
    'Fireplaces': '1',
    'FireplaceQu': None
}
input_object = Input(**input_dict)

*Exercise solutions can be found in the exercise solutions file in the current directory.*
