# Data Validation in Model Serving

In this notebook, we will go through the process of validating data in the serving phase of the data science pipeline.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NatanMish/data_validation/blob/main/notebooks/model_serving_data_validation.ipynb)

#### Install the required packages and import them to the notebook

In [1]:
!pip install pydantic pandas

In [126]:
from pydantic import BaseModel
import pandas as pd

Let's remind ourselves of the serving architecture of a modern machine learning model:

<div>
<img src="https://raw.githubusercontent.com/NatanMish/data_validation/main/notebooks/serving_diagram.png" width="1000"/>
</div>

So we should expect to receive the input for our model via a REST API and return the prediction or classification accordingly. In most cases, an object arriving at the REST API will be a JSON object, but we can also receive a CSV file. For the purposes of this tutorial, we will assume that the input is a JSON object and that we receive it in batches. Let's create a few model inputs straight from the test data:

In [127]:
test_home_data = pd.read_csv('../data/test.csv')

In [128]:
feature_names = ['YearBuilt', 'LotFrontage', 'GarageArea', 'OverallQual', 'OverallCond', 'MSZoning','TotalBsmtSF']

In [137]:
# We'll pick a single record from the test data (row 1108)
model_input_dict = test_home_data[feature_names].loc[1108].to_dict()

In [138]:
model_input_dict

{'YearBuilt': 1968,
 'LotFrontage': nan,
 'GarageArea': 552.0,
 'OverallQual': 6,
 'OverallCond': 6,
 'MSZoning': 'RL',
 'TotalBsmtSF': 1488.0}

We can already see that one of the features, `LotFrontage`, is missing (NaN). If our model cannot handle missing values, it will break, and no one wants that.
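A quick way to spot such gaps before they reach the model is to scan each record for `None` or NaN values. The following is a minimal sketch using only the standard library; the helper name `find_missing_fields` is hypothetical and not part of the notebook.

```python
import math

def find_missing_fields(record):
    """Return the names of fields whose value is None or a float NaN."""
    missing = []
    for name, value in record.items():
        if value is None:
            missing.append(name)
        elif isinstance(value, float) and math.isnan(value):
            missing.append(name)
    return missing

# Same shape as the record shown above; NaN is how pandas marks a missing number.
sample_record = {
    'YearBuilt': 1968,
    'LotFrontage': float('nan'),
    'GarageArea': 552.0,
    'OverallQual': 6,
    'OverallCond': 6,
    'MSZoning': 'RL',
    'TotalBsmtSF': 1488.0,
}

print(find_missing_fields(sample_record))  # ['LotFrontage']
```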

<div>
<img src="https://miro.medium.com/max/959/1*WNd3LXOi5xlDbitxsIARyw.png" width="500"/>
</div>

Pydantic provides data validation and settings management using Python type annotations. It enforces type hints at runtime and produces user-friendly errors when data is invalid. Many popular projects use Pydantic extensively, even the Jupyter project behind the notebooks we are using right now!
Let's see how we can use it to validate our model inputs:

In [139]:
# The most basic building block of Pydantic is the BaseModel class. We can use it to define our custom models that define the structure of our objects:
class ModelInput(BaseModel):
    YearBuilt: int
    LotFrontage: int
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: str
    TotalBsmtSF: float

In [140]:
# We can now use the ModelInput class, and create a ModelInput object from our sample record like so:
model_input_object = ModelInput(
    YearBuilt=model_input_dict['YearBuilt'],
    LotFrontage=model_input_dict['LotFrontage'],
    GarageArea=model_input_dict['GarageArea'],
    OverallQual=model_input_dict['OverallQual'],
    OverallCond=model_input_dict['OverallCond'],
    MSZoning=model_input_dict['MSZoning'],
    TotalBsmtSF=model_input_dict['TotalBsmtSF']
)

ValidationError: 1 validation error for ModelInput
LotFrontage
  value is not a valid integer (type=type_error.integer)

As we can see, creating the ModelInput object raised a validation error: the missing `LotFrontage` value arrives as NaN, which cannot be converted to the declared `int` type.
Let's see how we can adjust the model so that it can handle missing values:

In [141]:
# What's cool about Pydantic is that it lets us leverage the built-in Python typing definitions.
# Optional[float] also allows None; NaN itself is accepted because it is a valid float.
from typing import Optional

class ModelInput(BaseModel):
    YearBuilt: int
    LotFrontage: Optional[float]
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: str
    TotalBsmtSF: float

In [142]:
# Instead of specifying each field one by one, we can unpack the dictionary with ** so that each key is passed as a keyword argument:
model_input_object = ModelInput(**model_input_dict)

In [143]:
# And this is what our object looks like:
model_input_object

ModelInput(YearBuilt=1968, LotFrontage=nan, GarageArea=552.0, OverallQual=6, OverallCond=6, MSZoning='RL', TotalBsmtSF=1488.0)
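One subtlety is worth spelling out: `Optional[float]` is what permits `None`, while NaN passes simply because it is a valid float. If downstream code should see a single "missing" marker, one option is to convert NaN to `None` before validation. The sketch below assumes Pydantic is installed; `LotRecord` and `nan_to_none` are hypothetical names used only for illustration.

```python
import math
from typing import Optional

from pydantic import BaseModel

class LotRecord(BaseModel):
    YearBuilt: int
    # An explicit default of None keeps the field optional across Pydantic versions.
    LotFrontage: Optional[float] = None

def nan_to_none(record):
    """Return a copy of the record with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }

raw = {'YearBuilt': 1968, 'LotFrontage': float('nan')}

as_is = LotRecord(**raw)                 # NaN is accepted because it is a float
cleaned = LotRecord(**nan_to_none(raw))  # NaN replaced by None before validation

print(math.isnan(as_is.LotFrontage))  # True
print(cleaned.LotFrontage)            # None
```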

Here are a few more cool features of Pydantic:

#### 1. Recursive models

Pydantic supports recursive models. This means that we can define a model that contains other models. For example, we can define a model that contains a batch of ModelInput objects:

In [144]:
from typing import List

class ModelInputBatch(BaseModel):
    model_inputs: List[ModelInput]

In [145]:
# We'll grab a sample of the data:
data_sample_dict = test_home_data[feature_names].sample(n=5).to_dict('index')
data_sample_dict

{918: {'YearBuilt': 1970,
  'LotFrontage': 80.0,
  'GarageArea': 570.0,
  'OverallQual': 6,
  'OverallCond': 6,
  'MSZoning': 'RL',
  'TotalBsmtSF': 1008.0},
 483: {'YearBuilt': 2005,
  'LotFrontage': 101.0,
  'GarageArea': 683.0,
  'OverallQual': 8,
  'OverallCond': 5,
  'MSZoning': 'RL',
  'TotalBsmtSF': 1168.0},
 540: {'YearBuilt': 1993,
  'LotFrontage': 91.0,
  'GarageArea': 783.0,
  'OverallQual': 7,
  'OverallCond': 5,
  'MSZoning': 'RL',
  'TotalBsmtSF': 1080.0},
 1320: {'YearBuilt': 1925,
  'LotFrontage': 50.0,
  'GarageArea': 216.0,
  'OverallQual': 5,
  'OverallCond': 7,
  'MSZoning': 'RM',
  'TotalBsmtSF': 844.0},
 279: {'YearBuilt': 2008,
  'LotFrontage': nan,
  'GarageArea': 561.0,
  'OverallQual': 6,
  'OverallCond': 5,
  'MSZoning': 'FV',
  'TotalBsmtSF': 1726.0}}

In [146]:
# And we can create a ModelInputBatch object from our sample data:
model_inputs_list = [ModelInput(**record) for record in data_sample_dict.values()]
model_input_batch = ModelInputBatch(model_inputs=model_inputs_list)

In [147]:
# And we can see that our object looks like this:
model_input_batch

ModelInputBatch(model_inputs=[ModelInput(YearBuilt=1970, LotFrontage=80.0, GarageArea=570.0, OverallQual=6, OverallCond=6, MSZoning='RL', TotalBsmtSF=1008.0), ModelInput(YearBuilt=2005, LotFrontage=101.0, GarageArea=683.0, OverallQual=8, OverallCond=5, MSZoning='RL', TotalBsmtSF=1168.0), ModelInput(YearBuilt=1993, LotFrontage=91.0, GarageArea=783.0, OverallQual=7, OverallCond=5, MSZoning='RL', TotalBsmtSF=1080.0), ModelInput(YearBuilt=1925, LotFrontage=50.0, GarageArea=216.0, OverallQual=5, OverallCond=7, MSZoning='RM', TotalBsmtSF=844.0), ModelInput(YearBuilt=2008, LotFrontage=nan, GarageArea=561.0, OverallQual=6, OverallCond=5, MSZoning='FV', TotalBsmtSF=1726.0)])
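Since nested fields are validated too, a batch model can also be built straight from plain dictionaries, which is closer to what a parsed JSON payload looks like. A minimal sketch, assuming Pydantic is installed; `HouseRecord`, `HouseBatch` and the field name `records` are stand-ins for the notebook's classes.

```python
from typing import List, Optional

from pydantic import BaseModel

class HouseRecord(BaseModel):
    YearBuilt: int
    LotFrontage: Optional[float] = None

class HouseBatch(BaseModel):
    records: List[HouseRecord]

# Plain dictionaries, as they would look after parsing a JSON request body.
payload = {
    'records': [
        {'YearBuilt': 1970, 'LotFrontage': 80.0},
        {'YearBuilt': 2008, 'LotFrontage': None},
    ]
}

batch = HouseBatch(**payload)

# Each dictionary was validated and converted into a HouseRecord instance.
print(type(batch.records[0]).__name__)  # HouseRecord
print(batch.records[1].LotFrontage)     # None
```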

#### 2. Enums

Pydantic also supports enumerations. This means that we can define a model that has a set of predefined values. For example, we can define a model that has a set of values for the MSZoning feature:

In [148]:
from enum import Enum

class MSZoning(str, Enum):
    C = 'C (all)'
    FV = 'FV'
    RH = 'RH'
    RL = 'RL'
    RM = 'RM'

In [149]:
# We can then include the enum in our model:
class ModelInput(BaseModel):
    YearBuilt: int
    LotFrontage: Optional[float]
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float
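
With an enum-typed field, a code outside the predefined set is rejected at validation time, and a valid string is converted into the matching enum member. A small sketch assuming Pydantic is installed; `ZoningCode` and `ZonedRecord` are hypothetical names, and the field is called `zoning` to keep the example independent of the classes above.

```python
from enum import Enum

from pydantic import BaseModel, ValidationError

class ZoningCode(str, Enum):
    FV = 'FV'
    RL = 'RL'
    RM = 'RM'

class ZonedRecord(BaseModel):
    zoning: ZoningCode

# A known code is converted into the enum member.
record = ZonedRecord(zoning='RL')
print(record.zoning is ZoningCode.RL)  # True

# An unknown code fails validation.
try:
    ZonedRecord(zoning='XX')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)  # True
```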

#### 3. Custom base models

When developing a model, you might end up referencing its field names in many different places using the literal name, for example:
`year_built = model_input_object.YearBuilt`
If the field name changes in the future, you will have to update every place that uses the literal name, which can be a pain.
Luckily, we can define an extended base model that supports bracket access by field name. Then we can create an enum class that holds the field names and use it whenever we get or set a specific field. If a field name changes in the source, we only have to update the enum value, not every call site.

In [150]:
class ExtendedBaseModel(BaseModel):
    def __getitem__(self, item):
        return getattr(self, item)

    def __setitem__(self, item, value):
        setattr(self, item, value)

In [151]:
# By inheriting from the extended base model, ModelInput gains bracket access to its fields:
class ModelInput(ExtendedBaseModel):
    YearBuilt: int
    LotFrontage: float
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float

model_input_object = ModelInput(**model_input_dict)
# now we can use a bracket notation to access the field values:
model_input_object['YearBuilt']

1968

In [152]:
# Let's create an enum class that holds the field names:
class ModelInputFieldNames(Enum):
    YearBuilt = 'YearBuilt'
    LotFrontage = 'LotFrontage'
    GarageArea = 'GarageArea'
    OverallQual = 'OverallQual'
    OverallCond = 'OverallCond'
    MSZoning = 'MSZoning'
    TotalBsmtSF = 'TotalBsmtSF'

In [153]:
# and then we can access the field like so:
model_input_object[ModelInputFieldNames.YearBuilt.value]

1968
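
One practical use of a field-name enum is assembling the feature vector in a fixed order by iterating over its members. A minimal sketch assuming Pydantic is installed; `BracketModel`, `SmallInput` and `SmallInputFieldNames` are reduced, hypothetical versions of the classes above.

```python
from enum import Enum

from pydantic import BaseModel

class BracketModel(BaseModel):
    """Base model that also supports bracket access by field name."""

    def __getitem__(self, item):
        return getattr(self, item)

class SmallInput(BracketModel):
    YearBuilt: int
    GarageArea: float

class SmallInputFieldNames(Enum):
    YearBuilt = 'YearBuilt'
    GarageArea = 'GarageArea'

small_input = SmallInput(YearBuilt=1968, GarageArea=552.0)

# Enum members iterate in definition order, so the vector layout is stable.
feature_vector = [small_input[name.value] for name in SmallInputFieldNames]
print(feature_vector)  # [1968, 552.0]
```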

#### 4. Catch bad data

If a bad model input arrives, we can catch the error, display a warning message, isolate the record, and then continue without breaking the serving flow. This can be done with a `try`/`except` block:

In [154]:
from pydantic import ValidationError

try:
    model_input_object = ModelInput(**model_input_dict)
except ValidationError as e:
    print(f'Record: {model_input_dict}')
    print(f'Bad data: {e.errors()}')

    # isolate the record and then continue
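
For a batch, the same pattern can be wrapped in a loop that separates valid records from rejected ones. The sketch below assumes Pydantic is installed; `SimpleInput` and `split_valid_and_rejected` are hypothetical names, and what to do with the rejected records (logging, a quarantine table, and so on) is left open.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

class SimpleInput(BaseModel):
    YearBuilt: int
    LotFrontage: Optional[float] = None

def split_valid_and_rejected(records):
    """Validate each record; return (valid models, [(record, errors), ...])."""
    valid, rejected = [], []
    for record in records:
        try:
            valid.append(SimpleInput(**record))
        except ValidationError as e:
            # Keep the raw record next to its error details for later inspection.
            rejected.append((record, e.errors()))
    return valid, rejected

incoming = [
    {'YearBuilt': 1970, 'LotFrontage': 80.0},
    {'YearBuilt': 'unknown', 'LotFrontage': 50.0},  # cannot be parsed as an int
]

valid, rejected = split_valid_and_rejected(incoming)
print(len(valid), len(rejected))  # 1 1
print(rejected[0][1][0]['loc'])   # ('YearBuilt',)
```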

#### 5. Custom validation

In [155]:
from pydantic import validator
# Let's create a model input class with a custom validation function that checks that the garage area is greater than 1000
class ModelInput(ExtendedBaseModel):
    YearBuilt: int
    LotFrontage: Optional[float]
    GarageArea: float
    OverallQual: int
    OverallCond: int
    MSZoning: MSZoning
    TotalBsmtSF: float

    @validator('GarageArea')
    def check_garage_area(cls, v):
        if v <= 1000:
            raise ValueError('Garage area must be greater than 1000')
        return v

In [156]:
# The garage area of this record is below 1000, so the validator should reject it:
model_input_dict['GarageArea']

552.0

In [157]:
model_input_object = ModelInput(**model_input_dict)

ValidationError: 1 validation error for ModelInput
GarageArea
  Garage area must be greater than 1000 (type=value_error)
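
For simple range checks, Pydantic's `Field` constraints are a declarative alternative to a custom validator function. A minimal sketch assuming Pydantic is installed; the 1 to 10 range for `OverallQual` is an assumption about the rating scale, and `RatedInput` is a hypothetical name.

```python
from pydantic import BaseModel, Field, ValidationError

class RatedInput(BaseModel):
    # ge / le declare inclusive lower and upper bounds.
    OverallQual: int = Field(..., ge=1, le=10)
    # gt declares an exclusive lower bound, mirroring the check above.
    GarageArea: float = Field(..., gt=1000)

ok = RatedInput(OverallQual=6, GarageArea=1200.0)
print(ok.OverallQual)  # 6

try:
    RatedInput(OverallQual=11, GarageArea=552.0)
    failed_fields = []
except ValidationError as e:
    # Collect the name of every field that failed a constraint.
    failed_fields = sorted(err['loc'][0] for err in e.errors())
print(failed_fields)  # ['GarageArea', 'OverallQual']
```

A custom validator remains the better fit when the rule depends on other fields or on logic that a simple bound cannot express.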

#### Exercise 1 - Pydantic models

Create a model class that represents the model input. Not all of the requested checks were shown above; for some of them you'll need a quick search in the Pydantic documentation. The model class should meet the following requirements:
1. YearBuilt - int - required - non-nullable
2. Fireplaces - int - optional - nullable
3. FireplaceQu - FireplaceQu - optional - nullable - enum values of 'Ex', 'Gd', 'TA', 'Fa', 'Po'. If Fireplaces is greater than 0, FireplaceQu is required.
4. The model input object is immutable, so that once it is created, it cannot be changed.


In [None]:
class FireplaceQu(str, Enum):
    <YOUR CODE HERE>

class ModelInput(BaseModel):
    <YOUR CODE HERE>

    @validator("FireplaceQu")
    @classmethod
    def validate_fire_places_quality_field(cls, field_value, values):
        <YOUR CODE HERE>

    class Config:
        <YOUR CODE HERE>

In [None]:
# Let's create a few records and see if the validation works. This is a valid record:
model_input_dict = {
    'YearBuilt': '1901',
    'Fireplaces': '0',
    'FireplaceQu': 'Ex'
}
model_input_object = ModelInput(**model_input_dict)

In [None]:
# and this is an invalid record:
model_input_dict = {
    'YearBuilt': '1901',
    'Fireplaces': '1',
    'FireplaceQu': None
}
model_input_object = ModelInput(**model_input_dict)

#### Exercise solutions
#### Exercise 1 - Pydantic models

In [117]:
class FireplaceQu(str, Enum):
    Ex = 'Ex'
    Gd = 'Gd'
    TA = 'TA'
    Fa = 'Fa'
    Po = 'Po'

class ModelInput(BaseModel):
    YearBuilt: int
    Fireplaces: Optional[int]
    FireplaceQu: Optional[FireplaceQu]

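    # always=True runs the check even when FireplaceQu is omitted from the input
    @validator("FireplaceQu", always=True)
    @classmethod
    def validate_fire_places_quality_field(cls, field_value, values):
        # Fireplaces is nullable, and it is absent from values if it failed validation
        fireplaces = values.get("Fireplaces")
        if fireplaces is not None and fireplaces > 0 and field_value is None:
            raise ValueError("FireplaceQu is required when Fireplaces is greater than 0")
        return field_value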

    class Config:
        extra = 'forbid'
        allow_mutation = False