# Data Validation in Training Pipelines


In this notebook, we will go through the process of validating dataframes in a training pipeline using Pandera.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NatanMish/data_validation/blob/main/notebooks/2_training_pipeline_data_validation.ipynb)


#### Install the required packages and import them to the notebook

In [3]:
!pip install scikit-learn pandera\[strategies\]


In [4]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandera as pa

#### Load the data

In [5]:
home_data = pd.read_csv('https://github.com/NatanMish/data_validation/blob/a77b247b25c6622ce0c8f8cbc505228161c31a3c/data/train.csv?raw=true')

#### Train basic model
We'll start by setting up a training pipeline using scikit-learn's `Pipeline` class. We only want a few basic features for this example, so we'll write a pipeline step class that selects only those features.

In [6]:
class ChooseFeatures(BaseEstimator):
    def __init__(self, features=None):
        self.features = features
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[self.features]

In [7]:
feature_names = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd', 'LotFrontage']

Now we set up the pipeline and fit it to the data.

In [8]:
pipe = Pipeline([
     ('feature_selection', ChooseFeatures(features=feature_names)),
     ('scaler', StandardScaler()),
     ('rf', RandomForestRegressor())
])

In [9]:
X = home_data
y = home_data.SalePrice
pipe.fit(home_data, y)

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
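Before reaching for a validation library, a quick pandas check can show which of the selected columns hold missing values. The sketch below uses a tiny made-up dataframe (`toy`) as a stand-in for `home_data[feature_names]`; the values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for home_data[feature_names]; values are made up for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "LotFrontage": [65.0, np.nan, 68.0],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy.isna().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

On the real data, the same call would be `home_data[feature_names].isna().sum()`.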

Looks like our data has null values and this causes the model to break. Let's take a look at Pandera to see how it can help us with this.

<div>
<img src="https://raw.githubusercontent.com/pandera-dev/pandera/master/docs/source/_static/pandera-banner.png" width="500"/>
</div>

Pandera provides a flexible and expressive API for performing data validation on dataframes to make data processing pipelines more readable and robust. Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings. We'll take a look at these Pandera features:

1. Check the types and properties of columns in a pd.DataFrame or values in a pd.Series.

2. Perform more complex statistical validation like hypothesis testing.

3. Integrate with existing data analysis/processing pipelines via function decorators.

4. Define schema models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.

5. Synthesize data from schema objects for property-based testing with pandas data structures.

6. Lazily validate dataframes so that all validation rules are executed before raising an error.

For more information, see [Pandera's documentation](https://pandera.readthedocs.io/en/latest/).

#### 1. DataFrame Schemas - Type Validation

In [10]:
# We'll add one more feature to make it more interesting
feature_names.append('LotConfig')

In [11]:
# Create a basic schema for the home_data DataFrame to check types for just 2 of the features
basic_types_schema = pa.DataFrameSchema({
    "LotArea": pa.Column(int),
    "LotConfig": pa.Column(str),
    })

In [12]:
# Validate the home_data DataFrame against basic_types_schema.
# Notice that although we only defined two of the features in the schema, validation passes: Pandera ignores the other columns.
basic_types_schema.validate(home_data[feature_names])

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd,LotFrontage,LotConfig
0,8450,2003,856,854,2,3,8,65.0,Inside
1,9600,1976,1262,0,2,3,6,80.0,FR2
2,11250,2001,920,866,2,3,6,68.0,Inside
3,9550,1915,961,756,1,3,7,60.0,Corner
4,14260,2000,1145,1053,2,4,9,84.0,FR2
...,...,...,...,...,...,...,...,...,...
1455,7917,1999,953,694,2,3,7,62.0,Inside
1456,13175,1978,2073,0,2,3,7,85.0,Inside
1457,9042,1941,1188,1152,2,4,9,66.0,Inside
1458,9717,1950,1078,0,1,2,5,68.0,Inside


`validate` returns the validated dataframe, so seeing output here instead of an exception means the data is valid.
There are different ways we can specify the type:
- a string alias, as long as it is recognized by pandas.
- a Python type: int, float, bool, str
- a numpy data type
- a pandas extension type: it can be an instance (e.g. `pd.CategoricalDtype(["a", "b"])`) or a class (e.g. `pandas.CategoricalDtype`) if it can be initialized with default values.
- a pandera DataType: it can also be an instance or a class.

In [13]:
# Now let's create a schema that does not fit the data types in home data
bad_types_schema = pa.DataFrameSchema({
    "LotArea": pa.Column(int),
    "LotConfig": pa.Column(float),
})

In [14]:
# The bad schema validation will throw an error
bad_types_schema.validate(home_data[feature_names])

SchemaError: expected series 'LotConfig' to have type float64, got object

#### 2. DataFrame Schemas - Value Ranges Validation

In [None]:
# Pandera also allows validating value ranges for numerical columns
value_range_schema = pa.DataFrameSchema({
    "LotArea": pa.Column(int, pa.Check(lambda s: s <= 1000000), nullable=False),
    "YearBuilt": pa.Column(int, [pa.Check.in_range(1800, 2022)]),
})

In [None]:
# Validate the home_data DataFrame against the value_range_schema
value_range_schema.validate(home_data[feature_names])

#### 3. DataFrame Schemas - Catch Bad Data

What if, instead of stopping at the first error, we want to keep processing the dataframe or skip the bad rows? Passing `lazy=True` to `validate` makes Pandera run every check before raising, and the raised `SchemaErrors` exception exposes a `failure_cases` dataframe with the index of each offending row. We can catch the exception in a try-except block and use that dataframe to filter the data.

In [15]:
# We'll use a small sample of the data to make the example more clear
sample_data = home_data.sample(n=10)

In [16]:
# Create a schema that will fail on the first bad data point
catch_bad_data_schema = pa.DataFrameSchema({
    "LotArea": pa.Column(int, pa.Check(lambda s: s <= 1000000)),
    "YearBuilt": pa.Column(int, pa.Check.in_range(1900,1990)),  # notice that the year built has a restrictive range
})

In [17]:
# Validating the sample_data DataFrame against the catch_bad_data_schema will throw an error
catch_bad_data_schema.validate(sample_data[feature_names])

SchemaError: <Schema Column(name=YearBuilt, type=DataType(int64))> failed element-wise validator 0:
<Check in_range: in_range(1900, 1990)>
failure cases:
   index  failure_case
0     56          1999
1    517          1996
2    559          2003
3    832          2003
4    670          2005

Now let's use a try-except block to catch the bad data indices. This is a common and valid practice in Python called EAFP, "easier to ask for forgiveness than permission", which might not be as well received in other languages.

In [18]:
try:
    catch_bad_data_schema.validate(sample_data[feature_names], lazy=True)
except pa.errors.SchemaErrors as e:
    failure_cases = e.failure_cases

# Failure cases is a dataframe of the bad data only
failure_cases.head()

Unnamed: 0,schema_context,column,check,check_number,failure_case,index
0,Column,YearBuilt,"in_range(1900, 1990)",0,1999,56
1,Column,YearBuilt,"in_range(1900, 1990)",0,1996,517
2,Column,YearBuilt,"in_range(1900, 1990)",0,2003,559
3,Column,YearBuilt,"in_range(1900, 1990)",0,2003,832
4,Column,YearBuilt,"in_range(1900, 1990)",0,2005,670


In [19]:
# We can easily filter out the bad data from the original dataframe using the failure_cases dataframe
filtered_df = sample_data[~sample_data.index.isin(failure_cases["index"])]

In [20]:
# Let's see that the filtered data passes the validation test
catch_bad_data_schema.validate(filtered_df[feature_names])

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd,LotFrontage,LotConfig
422,21750,1954,988,0,1,2,4,100.0,Inside
335,164660,1965,1619,167,2,3,7,,Corner
1242,10625,1974,1173,0,2,3,6,85.0,Inside
308,12342,1940,861,0,1,1,4,,Inside
486,10289,1965,1073,0,1,3,6,79.0,Inside


#### 4. DataFrame Schemas - Validate acceptable categorical values

In [21]:
lot_config_values = ["Inside", "Corner", "CulDSac", "FR3"]

In [22]:
lot_config_values_schema = pa.DataFrameSchema({
    "LotArea": pa.Column(int, pa.Check(lambda s: s <= 1000000)),
    "LotConfig": pa.Column(str, pa.Check.isin(lot_config_values)),
})

In [23]:
# Validating the home_data DataFrame against the lot_config_values_schema will throw an error
lot_config_values_schema.validate(home_data[feature_names])

SchemaError: <Schema Column(name=LotConfig, type=DataType(str))> failed element-wise validator 0:
<Check isin: isin({'CulDSac', 'Inside', 'FR3', 'Corner'})>
failure cases:
    index failure_case
0       1          FR2
1       4          FR2
2      81          FR2
3     140          FR2
4     195          FR2
5     214          FR2
6     223          FR2
7     228          FR2
8     236          FR2
9     266          FR2
10    364          FR2
11    386          FR2
12    421          FR2
13    480          FR2
14    483          FR2
15    537          FR2
16    541          FR2
17    558          FR2
18    574          FR2
19    611          FR2
20    670          FR2
21    687          FR2
22    761          FR2
23    775          FR2
24    805          FR2
25    849          FR2
26    933          FR2
27    941          FR2
28    959          FR2
29    975          FR2
30    994          FR2
31   1018          FR2
32   1057          FR2
33   1117          FR2
34   1158          FR2
35   1164          FR2
36   1178          FR2
37   1193          FR2
38   1232          FR2
39   1237          FR2
40   1259          FR2
41   1362          FR2
42   1369          FR2
43   1436          FR2
44   1437          FR2
45   1444          FR2
46   1450          FR2

Other useful methods for `pa.Check` are:

- `pandera.checks.Check.eq`
- `pandera.checks.Check.equal_to`
- `pandera.checks.Check.ge`
- `pandera.checks.Check.greater_than`
- `pandera.checks.Check.greater_than_or_equal_to`
- `pandera.checks.Check.gt`
- `pandera.checks.Check.in_range`
- `pandera.checks.Check.isin`
- `pandera.checks.Check.le`
- `pandera.checks.Check.less_than`
- `pandera.checks.Check.less_than_or_equal_to`
- `pandera.checks.Check.lt`
- `pandera.checks.Check.ne`
- `pandera.checks.Check.not_equal_to`
- `pandera.checks.Check.notin`
- `pandera.checks.Check.str_contains`
- `pandera.checks.Check.str_endswith`
- `pandera.checks.Check.str_length`
- `pandera.checks.Check.str_matches`
- `pandera.checks.Check.str_startswith`
- `pandera.checks.Check.__call__`

#### 5. DataFrame Schemas - `Coerce`

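Setting `coerce=True` tells Pandera to cast each column to the dtype declared in the schema before validating it, instead of raising an error on a dtype mismatch.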

In [24]:
home_data.LotArea.dtype

dtype('int64')

In [25]:
coerce_schema = pa.DataFrameSchema(
    columns={"LotArea": pa.Column(float)},
    coerce=False,
)

In [26]:
coerce_schema.validate(home_data)

SchemaError: expected series 'LotArea' to have type float64, got int64

In [27]:
# and if we set coerce to True, we can coerce the dataframe to the schema
coerce_schema = pa.DataFrameSchema(
    columns={"LotArea": pa.Column(float)},
    coerce=True,
)

In [28]:
coerce_schema.validate(home_data)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450.0,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600.0,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250.0,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550.0,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260.0,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917.0,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175.0,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042.0,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717.0,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


#### 6. DataFrame Schemas - `Strict`

In [29]:
# Using `strict` we can specify that the dataframe must have the exact columns specified in the schema
strict_schema = pa.DataFrameSchema(
    columns={"LotArea": pa.Column(int), "YearBuilt": pa.Column(int)},
    strict=True,
)

In [30]:
# Another useful feature is setting `strict` to 'filter' which will filter out any columns that are not in the schema
strict_filter_schema = pa.DataFrameSchema(
    columns={"LotArea": pa.Column(int), "YearBuilt": pa.Column(int)},
    strict="filter",
)
filtered_df = strict_filter_schema.validate(home_data)
filtered_df.head()

Unnamed: 0,LotArea,YearBuilt
0,8450,2003
1,9600,1976
2,11250,2001
3,9550,1915
4,14260,2000


### Exercise 1 - DataFrame Schemas

Create a `pa.DataFrameSchema` object for the `home_data` DataFrame. Not all of the requested checks were shown above, so for some of them you'll need a quick search in the Pandera documentation. The schema should have the following columns and rules:
1. Id is a required and unique column of an integer type and cannot be null.
2. MSZoning is a non-required column of a string type and can be null. If not null it can only accept these values - 'RL', 'RM', 'C (all)', 'RH' and 'FV'.
3. OverallQual is a required column of an integer type, cannot be null and must be in the range 1-10.
4. BsmtCond is a non-required column of a string type and can be null. If not null it can only accept a string of a length of 2.

Bonus:

5. Add the 1stFlrSF and 2ndFlrSF columns to the schema and validate that on average 1stFlrSF>=2ndFlrSF.

Create the schema such that it filters out any other columns that are not in the schema.


In [31]:
exercise_schema = pa.DataFrameSchema(
    columns={
        "Id": <YOUR ANSWER HERE>,
        "MSZoning": <YOUR ANSWER HERE>,
        "OverallQual": <YOUR ANSWER HERE>,
        "BsmtCond": <YOUR ANSWER HERE>,
        "1stFlrSF": <YOUR ANSWER HERE>,
        "2ndFlrSF": <YOUR ANSWER HERE>,
    },
    strict=<YOUR ANSWER HERE>,
    checks=<YOUR ANSWER HERE>,
)

SyntaxError: invalid syntax (2647979350.py, line 3)

In [32]:
exercise_schema.validate(home_data)

NameError: name 'exercise_schema' is not defined

*Exercise solutions can be found in the exercise solutions file in the current directory.*

### 7. Pandera Decorators

Pandera offers decorators which allow a seamless integration of Pandera validations with our code. The available decorators are:
- @check_input
- @check_output
- @check_io
- @check_types

In the next example we define the schemas in a different way, although the same principles apply. We construct class-based Pandera models, written in a Pydantic-style syntax, and use them to validate the inputs and outputs of a function.

In [33]:
from pandera.typing import Series

# Define a class based Pandera model for the input data to the feature engineering step
class FeaturesSchemaPreEngineering(pa.SchemaModel):
    LotArea: Series[int] = pa.Field(nullable=False, ge=0)
    YearBuilt: Series[int] = pa.Field(nullable=False, ge=1700)
    FirstFlrSF: Series[int] = pa.Field(nullable=False, ge=0, alias="1stFlrSF") # alias is used to give the column a different name in the schema because the column name starts with a number
    SecondFlrSF: Series[int] = pa.Field(nullable=False, ge=0, alias="2ndFlrSF")
    FullBath: Series[int] = pa.Field(nullable=False, ge=0)
    BedroomAbvGr: Series[int] = pa.Field(nullable=False, ge=0)
    TotRmsAbvGrd: Series[int] = pa.Field(nullable=False, ge=0)
    LotFrontage: Series[float] = pa.Field(nullable=True, ge=0)
    LotConfig: Series[str] = pa.Field(nullable=True, isin=["Inside", "Corner", "FR2", "FR3", "CulDSac"])
    class Config:
        strict=True

# Define a class-based Pandera model for the output of the feature engineering step.
# Notice how we inherit from FeaturesSchemaPreEngineering and extend it with the new columns.
class FeaturesSchemaPostEngineering(FeaturesSchemaPreEngineering):
    HouseAge: Series[int] = pa.Field(nullable=False, ge=0)
    AllFloorsSF: Series[int] = pa.Field(nullable=False, ge=0)
    NonBedRmAbvGrd: Series[int] = pa.Field(nullable=False, ge=0)

    class Config:
        strict=True

In [34]:
from pandera import check_types
from pandera.typing import DataFrame as DataFramePa
@check_types
def feature_engineering(df: DataFramePa[FeaturesSchemaPreEngineering]) -> DataFramePa[FeaturesSchemaPostEngineering]:
    df = df.copy()
    df["HouseAge"] = 2022 - df["YearBuilt"]
    df["AllFloorsSF"] = df["1stFlrSF"] + df["2ndFlrSF"]
    df["NonBedRmAbvGrd"] = df["TotRmsAbvGrd"] - df["BedroomAbvGr"]
    return df

In [35]:
# This run should complete without error
feature_engineering(home_data[feature_names])

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd,LotFrontage,LotConfig,HouseAge,AllFloorsSF,NonBedRmAbvGrd
0,8450,2003,856,854,2,3,8,65.0,Inside,19,1710,5
1,9600,1976,1262,0,2,3,6,80.0,FR2,46,1262,3
2,11250,2001,920,866,2,3,6,68.0,Inside,21,1786,3
3,9550,1915,961,756,1,3,7,60.0,Corner,107,1717,4
4,14260,2000,1145,1053,2,4,9,84.0,FR2,22,2198,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1455,7917,1999,953,694,2,3,7,62.0,Inside,23,1647,4
1456,13175,1978,2073,0,2,3,7,85.0,Inside,44,2073,4
1457,9042,1941,1188,1152,2,4,9,66.0,Inside,81,2340,5
1458,9717,1950,1078,0,1,2,5,68.0,Inside,72,1078,3


In [36]:
# If we don't filter out any columns, we should get an error, since it is incompatible with the schema
feature_engineering(home_data)

SchemaError: error in check_types decorator of function 'feature_engineering': column 'Id' not in DataFrameSchema {'LotArea': <Schema Column(name=LotArea, type=DataType(int64))>, 'YearBuilt': <Schema Column(name=YearBuilt, type=DataType(int64))>, '1stFlrSF': <Schema Column(name=1stFlrSF, type=DataType(int64))>, '2ndFlrSF': <Schema Column(name=2ndFlrSF, type=DataType(int64))>, 'FullBath': <Schema Column(name=FullBath, type=DataType(int64))>, 'BedroomAbvGr': <Schema Column(name=BedroomAbvGr, type=DataType(int64))>, 'TotRmsAbvGrd': <Schema Column(name=TotRmsAbvGrd, type=DataType(int64))>, 'LotFrontage': <Schema Column(name=LotFrontage, type=DataType(float64))>, 'LotConfig': <Schema Column(name=LotConfig, type=DataType(str))>}

### 8. Data Synthesis

Pandera offers a simple way to generate synthetic data. We can use the `example` method to generate a DataFrame with a given schema.

In [39]:
FeaturesSchemaPreEngineering.example(size=5)

Unnamed: 0,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd,LotFrontage,LotConfig
0,63227,6277358704709369663,8485997686888187618,86,166,55732,255,,
1,83,66253,1,174,7672,61,12937,1.1754939999999999e-38,FR3
2,94,21493,53,6339,329,2243079942065244720,22,,CulDSac
3,36,18356,127,16192,101,50882,195,3.4028230000000003e+38,Inside
4,35129,46389,49948,1827837779226171415,60171,57510,4073655496092455121,,Corner


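Notice how some of the columns have 'crazy' values. This is because the generator draws arbitrary values that satisfy the checks defined in the schema, and most of our checks set only a lower bound (`ge=0`), so extremely large integers are perfectly valid. Adding upper bounds to the schema would produce more realistic examples.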

We can use the hypothesis library to generate data for our schema and then use it in a unit test:

In [44]:
import hypothesis
@hypothesis.given(FeaturesSchemaPreEngineering.strategy(size=5))
def test_processing_fn(dataframe):
    feature_engineering(dataframe)

### 9. Schema Inference

Pandera can infer schemas from data. This is useful when you have a large dataset and you don't want to define a schema manually.

In [None]:
schema = pa.infer_schema(home_data)
print(schema)

### Exercise 2 - Incorporating validation in a training pipeline

Let's use what we have learned to implement data validation in a training pipeline.

In [65]:
# Choose which features to use for training. You can use these, or add your own.
feature_names = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd', 'LotFrontage']

In [66]:
# Define the schemas and checks
class FeaturesSchemaPreEngineering(pa.SchemaModel):
    LotArea: Series[int] = pa.Field(nullable=False, ge=0)
    YearBuilt: Series[int] = pa.Field(nullable=False, ge=1700)
    FirstFlrSF: Series[int] = pa.Field(nullable=False, ge=0, alias="1stFlrSF") # alias is used to give the column a different name in the schema because the column name starts with a number
    SecondFlrSF: Series[int] = pa.Field(nullable=False, ge=0, alias="2ndFlrSF")
    FullBath: Series[int] = pa.Field(nullable=False, ge=0)
    BedroomAbvGr: Series[int] = pa.Field(nullable=False, ge=0)
    TotRmsAbvGrd: Series[int] = pa.Field(nullable=False, ge=0)
    LotFrontage: Series[float] = pa.Field(nullable=False, ge=0)
    class Config:
        strict=True

# Define a class-based Pandera model for the output of the feature engineering step.
# Notice how we inherit from FeaturesSchemaPreEngineering and extend it with the new columns.
class FeaturesSchemaPostEngineering(FeaturesSchemaPreEngineering):
    HouseAge: Series[int] = pa.Field(nullable=False, ge=0)
    AllFloorsSF: Series[int] = pa.Field(nullable=False, ge=0)
    NonBedRmAbvGrd: Series[int] = pa.Field(nullable=False, ge=0)

    class Config:
        strict=True

Define the feature engineering step, validated against the schemas above, and wrap it so that it can be used as a pipeline step:


In [74]:
from sklearn.preprocessing import FunctionTransformer

In [75]:
@check_types
def feat_eng_step_1(df: DataFramePa[FeaturesSchemaPreEngineering]) -> DataFramePa[FeaturesSchemaPostEngineering]:
    df = df.copy()
    df["HouseAge"] = 2022 - df["YearBuilt"]
    df["AllFloorsSF"] = df["1stFlrSF"] + df["2ndFlrSF"]
    df["NonBedRmAbvGrd"] = df["TotRmsAbvGrd"] - df["BedroomAbvGr"]
    return df

def feat_eng_all_steps(X, y):
    try:
        return feat_eng_step_1(X), y
    except pa.errors.SchemaError as e:
        isolated_failure_cases = e.failure_cases
        # Send isolated_failure_cases to a separate function to handle them
        return X[~X.index.isin(isolated_failure_cases["index"])], y[~y.index.isin(isolated_failure_cases["index"])]

In [76]:
feat_eng_with_validation = FunctionTransformer(func=feat_eng_all_steps, validate=False)

In [78]:
pipe = Pipeline([
    ('feature_selection', ChooseFeatures(features=feature_names)),
    ('feature_engineering', feat_eng_with_validation),
    ('scaler', StandardScaler()),
    ('rf', RandomForestRegressor())
])

In [79]:
# Train the model
X = home_data
y = home_data.SalePrice
pipe.fit(home_data, y)

AttributeError: 'NoneType' object has no attribute 'index'