These notes follow the [pandera Schema Models documentation](https://pandera.readthedocs.io/en/stable/schema_models.html#).

# Schema Models

*new in v0.5.0*

`pandera` provides a class-based API that’s heavily inspired by `pydantic`.

In contrast to the [object-based API](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#dataframeschemas),
you can define schema models in much the same way you’d define `pydantic` models.

Schema Models are annotated with the [`pandera.typing`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.typing.html#module-pandera.typing)
module using the standard *typing syntax*.

> Models can be explicitly converted to a `DataFrameSchema` or used to validate a `DataFrame` directly.

> ⚠️ **NOTE**
>
> Due to current limitations in the `pandas` library, `pandera` annotations are only used for **run-time validation**
> and *cannot be leveraged by static type checkers like `mypy`*.
>
> See the discussion [here](https://github.com/pandera-dev/pandera/issues/253#issuecomment-665338337) for more details.

## Basic Usage

In [1]:
import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series  # NOTE.


class InputSchema(pa.SchemaModel):  # NOTE.
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)


class OutputSchema(InputSchema):  # NOTE.
    revenue: Series[float]


@pa.check_types  # NOTE.
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)  # pyright: ignore


df = pd.DataFrame(
    {
        "year": ["2001", "2002", "2003"],
        "month": ["3", "6", "12"],
        "day": ["200", "156", "365"],
    }
)

transform(df)  # pyright: ignore

invalid_df = pd.DataFrame(
    {
        "year": ["2001", "2002", "1999"],
        "month": ["3", "6", "12"],
        "day": ["200", "156", "365"],
    }
)

try:
    transform(invalid_df)  # pyright: ignore
except Exception as e:
    print(e)

error in check_types decorator of function 'transform': <Schema Column(name=year, type=DataType(int64))> failed element-wise validator 0:
<Check greater_than: greater_than(2000)>
failure cases:
   index  failure_case
0      2          1999


> NOTE: `pyright` is not fully compatible with this pattern.
>
> It does not treat `pd.DataFrame` and `pandera.typing.DataFrame` as equivalent, hence the `# pyright: ignore` comments above.

As you can see in the example above, you can define a schema by sub-classing `SchemaModel` and defining
column/index *fields* as class attributes.

The `check_types()` decorator is required to perform validation of the dataframe **at run-time**.

> Note that `Field`s apply to both `Column` and `Index` objects, exposing the built-in `Check`s via keyword arguments.

*(New in 0.6.2)* When you access a class attribute defined on the schema,
it will return the name of the column used in the validated `pd.DataFrame`.
In the example above, this will simply be the string `"year"`.

In [2]:
print(f"Column name for 'year' is {InputSchema.year}\n")
print(df.loc[:, [InputSchema.year, "day"]])

Column name for 'year' is year

   year  day
0  2001  200
1  2002  156
2  2003  365


## Validate on Initialization

*new in 0.8.0*

Pandera provides an interface for validating dataframes on initialization.

This API uses the `pandera.typing.pandas.DataFrame` *generic type* to validate against the `SchemaModel` type variable on initialization:

In [3]:
import pandas as pd
import pandera as pa

from pandera.typing import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# NOTE the below initialization DIRECTLY FROM pandera.typing.DataFrame!
df = DataFrame[Schema](
    {
        "state": ["NY", "FL", "GA", "CA"],
        "city": ["New York", "Miami", "Atlanta", "San Francisco"],
        "price": [8, 12, 10, 16],
    }
)

print(df)

  state           city  price
0    NY       New York      8
1    FL          Miami     12
2    GA        Atlanta     10
3    CA  San Francisco     16


Refer to [Supported DataFrame Libraries](https://pandera.readthedocs.io/en/stable/supported_libraries.html#supported-dataframe-libraries)
to see how this syntax applies to other supported dataframe types.

## Converting to `DataFrameSchema`

You can easily convert a `SchemaModel` class into a `DataFrameSchema`:

In [6]:
print(InputSchema.to_schema())

"""
<Schema DataFrameSchema(
    columns={
        'year': <Schema Column(name=year, type=DataType(int64))>
        'month': <Schema Column(name=month, type=DataType(int64))>
        'day': <Schema Column(name=day, type=DataType(int64))>
    },
    checks=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False
    name=InputSchema,
    ordered=False,
    unique_column_names=False
)>
""";

<Schema DataFrameSchema(
    columns={
        'year': <Schema Column(name=year, type=DataType(int64))>
        'month': <Schema Column(name=month, type=DataType(int64))>
        'day': <Schema Column(name=day, type=DataType(int64))>
    },
    checks=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False
    name=InputSchema,
    ordered=False,
    unique_column_names=False
)>


You can also use the `validate()` method to validate dataframes:

In [8]:
print(Schema.validate(df))

  state           city  price
0    NY       New York      8
1    FL          Miami     12
2    GA        Atlanta     10
3    CA  San Francisco     16


Or you can use the `SchemaModel()` class *directly* to validate dataframes, which is *syntactic sugar* that simply delegates to the `validate()` method.

In [9]:
print(Schema(df))

  state           city  price
0    NY       New York      8
1    FL          Miami     12
2    GA        Atlanta     10
3    CA  San Francisco     16


## Excluded attributes

Class variables which begin with an *underscore* (`_`) will be automatically **excluded** from the model.

[`Config`](https://pandera.readthedocs.io/en/stable/schema_models.html#schema-model-config) is also a reserved name.

However, [`aliases`](https://pandera.readthedocs.io/en/stable/schema_models.html#schema-model-alias)
can be used to circumvent these limitations.

## Supported dtypes

Any `dtype`s supported by `pandera` can be used as type parameters for `Series` and `Index`.

⚠️ There are, however, a couple of gotchas.

### Dtype aliases

In [10]:
import pandera as pa
from pandera.typing import Series, String  # pyright: ignore

# NOTE: The above String is a "Dtype alias"


class Schema(pa.SchemaModel):
    a: Series[String]  # NOTE.

### Type vs instance

You must give a **type**, not an **instance**.

✔ Good:

In [11]:
import pandas as pd


class Schema(pa.SchemaModel):
    a: Series[pd.StringDtype]

✘ Bad:

In [14]:
try:

    class Schema(pa.SchemaModel):
        a: Series[pd.StringDtype()]  # NOTE!

except Exception as e:
    display(type(e))
    print(e)

TypeError

Parameters to generic types must be types. Got string[python].
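The distinction behind that error can be checked with plain `pandas`, without defining a schema. This is only an illustration of "type versus instance"; the variable names are my own.

```python
import pandas as pd

dtype_class = pd.StringDtype       # the class itself: usable as a type parameter
dtype_instance = pd.StringDtype()  # an instance: not a type

print(isinstance(dtype_class, type))
print(isinstance(dtype_instance, type))
```

Only the first value is a type, which is why `Series[pd.StringDtype]` is accepted and `Series[pd.StringDtype()]` is rejected.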


### Parametrized dtypes

Pandas supports a couple of **parametrized dtypes**.

As of pandas 1.2.0:


Kind of Data | Data Type | Parameters
--- | --- | ---
tz-aware datetime| `DatetimeTZDtype` | `unit`, `tz`
Categorical | `CategoricalDtype` | `categories`, `ordered`
period | `PeriodDtype` | `freq` 
sparse | `SparseDtype` | `dtype`, `fill_value`
intervals | `IntervalDtype` | `subtype`

#### Annotated

Parameters can be given via `typing.Annotated`.

It requires `python >= 3.9` or `typing_extensions`, *which is already a requirement of Pandera*.

> Unfortunately `typing.Annotated` has not been backported to python 3.6.

✔ Good:

In [15]:
try:
    from typing import Annotated  # pyright: ignore # python 3.9+
except ImportError:
    from typing_extensions import Annotated


class Schema(pa.SchemaModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]]  # NOTE!

⚠️ Furthermore, you must pass all parameters **in the order defined in the dtype’s constructor**
(see [table](https://pandera.readthedocs.io/en/stable/schema_models.html#parameterized-dtypes)).
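To see why the order matters, it helps to look at what `typing.Annotated` actually stores. The sketch below uses only the standard library and `pandas`. It is not pandera's implementation, just an illustration, under that assumption, of how positional metadata could be forwarded to a dtype constructor. The time zone `"UTC"` is chosen for the example.

```python
from typing import get_args

try:
    from typing import Annotated  # python 3.9+
except ImportError:
    from typing_extensions import Annotated

import pandas as pd

annotation = Annotated[pd.DatetimeTZDtype, "ns", "UTC"]

# get_args returns the annotated type followed by the metadata, in order.
dtype_class, *params = get_args(annotation)

# Passing the metadata positionally only works if it follows the
# constructor's parameter order: DatetimeTZDtype(unit, tz).
dtype = dtype_class(*params)
print(dtype)
```

The metadata is an ordered tuple with no parameter names attached, so `unit` must come before `tz`.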

✘ Bad:

In [17]:
class Schema(pa.SchemaModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "utc"]]


try:
    Schema.to_schema()
except Exception as e:
    display(type(e))
    print(e)

TypeError

Annotation 'DatetimeTZDtype' requires all positional arguments ['unit', 'tz'].


#### Field

✔ Good:

In [18]:
class SchemaFieldDatetimeTZDtype(pa.SchemaModel):
    col: Series[pd.DatetimeTZDtype] = pa.Field(dtype_kwargs={"unit": "ns", "tz": "EST"})

You **cannot** use both `typing.Annotated` and `dtype_kwargs`.

✘ Bad:

In [19]:
class SchemaFieldDatetimeTZDtype(pa.SchemaModel):
    col: Series[Annotated[pd.DatetimeTZDtype, "ns", "est"]] = pa.Field(dtype_kwargs={"unit": "ns", "tz": "EST"})


try:
    # Build the schema defined in this cell (not the earlier `Schema` class).
    SchemaFieldDatetimeTZDtype.to_schema()
except Exception as e:
    display(type(e))
    print(e)

TypeError

The message reports that `dtype_kwargs` is redundant when the parameters are already given through `typing.Annotated`.


## Required Columns

By default all columns specified in the schema are **required**,
meaning that if a column is missing in the input `DataFrame` an exception will be thrown.

If you want to make a column optional, annotate it with `typing.Optional`.

In [20]:
from typing import Optional

import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.SchemaModel):
    a: Series[str]
    b: Optional[Series[int]]  # NOTE.


df = pd.DataFrame({"a": ["2001", "2002", "2003"]})
Schema.validate(df)

      a
0  2001
1  2002
2  2003


## Schema Inheritance

You can also use inheritance to build schemas on top of a base schema.

In [21]:
class BaseSchema(pa.SchemaModel):
    year: Series[str]


class FinalSchema(BaseSchema):  # NOTE inheritance.
    year: Series[int] = pa.Field(ge=2000, coerce=True)  # override the base type
    passengers: Series[int]
    idx: Index[int] = pa.Field(ge=0)


df = pd.DataFrame(
    {
        "year": ["2000", "2001", "2002"],
    }
)


@pa.check_types
def transform(df: DataFrame[BaseSchema]) -> DataFrame[FinalSchema]:
    return (
        df.assign(passengers=[61000, 50000, 45000]).set_index(pd.Index([1, 2, 3])).astype({"year": int})
    )  # pyright: ignore


print(transform(df))  # pyright: ignore

   year  passengers
1  2000       61000
2  2001       50000
3  2002       45000


## Config

Schema-wide options can be controlled via the `Config` class on the `SchemaModel` subclass.

The full set of options can be found in the [`BaseConfig`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.model.BaseConfig.html#pandera.model.BaseConfig) class.

In [22]:
class Schema(pa.SchemaModel):

    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

    # NOTE:
    class Config:
        name = "BaseSchema"
        strict = True
        coerce = True
        foo = "bar"  # Interpreted as dataframe check

It is not required for the `Config` to subclass `BaseConfig` but it must be named "`Config`".

See [Registered Custom Checks with the Class-based API](https://pandera.readthedocs.io/en/stable/extensions.html#class-based-api-dataframe-checks)
for details on using registered dataframe checks.

## MultiIndex

The `MultiIndex` capabilities are also supported with the class-based API:

In [24]:
import pandera as pa
from pandera.typing import Index, Series


class MultiIndexSchema(pa.SchemaModel):

    year: Index[int] = pa.Field(gt=2000, coerce=True)
    month: Index[int] = pa.Field(ge=1, le=12, coerce=True)
    passengers: Series[int]

    class Config:
        # provide multi index options in the config
        multiindex_name = "time"
        multiindex_strict = True
        multiindex_coerce = True


index = MultiIndexSchema.to_schema().index
print(index)

"""
<Schema MultiIndex(
    indexes=[
        <Schema Index(name=year, type=DataType(int64))>
        <Schema Index(name=month, type=DataType(int64))>
    ]
    coerce=True,
    strict=True,
    name=time,
    ordered=True
)>
""";

<Schema MultiIndex(
    indexes=[
        <Schema Index(name=year, type=DataType(int64))>
        <Schema Index(name=month, type=DataType(int64))>
    ]
    coerce=True,
    strict=True,
    name=time,
    ordered=True
)>


In [26]:
from pprint import pprint

pprint({name: col.checks for name, col in index.columns.items()})

"""
{'month': [<Check greater_than_or_equal_to: greater_than_or_equal_to(1)>,
        <Check less_than_or_equal_to: less_than_or_equal_to(12)>],
'year': [<Check greater_than: greater_than(2000)>]}
""";

{'month': [<Check greater_than_or_equal_to: greater_than_or_equal_to(1)>,
           <Check less_than_or_equal_to: less_than_or_equal_to(12)>],
 'year': [<Check greater_than: greater_than(2000)>]}


Multiple Index annotations are **automatically converted** into a [`MultiIndex`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.schema_components.MultiIndex.html#pandera.schema_components.MultiIndex).

`MultiIndex` options are given in the [`Config`](https://pandera.readthedocs.io/en/stable/schema_models.html#schema-model-config).

## Index Name

Use `check_name` to validate the index name of a single-index dataframe:

In [27]:
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series


class Schema(pa.SchemaModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    passengers: Series[int]
    idx: Index[int] = pa.Field(ge=0, check_name=True)  # NOTE: check_name


df = pd.DataFrame(
    {
        "year": [2001, 2002, 2003],
        "passengers": [61000, 50000, 45000],
    }
)

try:
    Schema.validate(df)
except Exception as e:
    display(type(e))
    print(e)

pandera.errors.SchemaError

Expected <class 'pandera.schema_components.Index'> to have name 'idx', found 'None'


The default `check_name=None` is treated as `True` for columns and multi-indexes.

## Custom Checks

Unlike the object-based API, custom checks can be specified *as class methods*.

### Column/Index checks

In [28]:
import pandera as pa
from pandera.typing import Index, Series


class CustomCheckSchema(pa.SchemaModel):

    a: Series[int] = pa.Field(gt=0, coerce=True)
    abc: Series[int]
    idx: Index[str]

    # NOTE all these custom check methods below.
    @pa.check("a", name="foobar")
    def custom_check(cls, a: Series[int]) -> Series[bool]:
        return a < 100  # pyright: ignore

    @pa.check("^a", regex=True, name="foobar")
    def custom_check_regex(cls, a: Series[int]) -> Series[bool]:
        return a > 0  # pyright: ignore

    @pa.check("idx")
    def check_idx(cls, idx: Index[str]) -> Series[bool]:
        return idx.str.contains("dog")  # pyright: ignore

ℹ️ **NOTE**

* You can supply the keyword arguments of the [`Check`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.checks.Check.html#pandera.checks.Check)
class initializer to get the flexibility of [groupby checks](https://pandera.readthedocs.io/en/stable/checks.html#column-check-groups).
* As in pydantic, the [`classmethod()`](https://docs.python.org/3/library/functions.html#classmethod) decorator is added behind the scenes if omitted.
* You may still need to add the `@classmethod` decorator after the `check()` decorator if your static type checker or linter complains.
* Since checks are class methods, the first argument they receive is a `SchemaModel` subclass, not an instance of a model.

In [30]:
from typing import Dict


class GroupbyCheckSchema(pa.SchemaModel):

    value: Series[int] = pa.Field(gt=0, coerce=True)
    group: Series[str] = pa.Field(isin=["A", "B"])

    @pa.check("value", groupby="group", regex=True, name="check_means")
    def check_groupby(cls, grouped_value: Dict[str, Series[int]]) -> bool:
        return grouped_value["A"].mean() < grouped_value["B"].mean()


df = pd.DataFrame(
    {
        "value": [100, 110, 120, 10, 11, 12],
        "group": list("AAABBB"),
    }
)

try:
    print(GroupbyCheckSchema.validate(df))
except Exception as e:
    display(type(e))
    print(e)

pandera.errors.SchemaError

<Schema Column(name=value, type=DataType(int64))> failed series or dataframe validator 1:
<Check check_means>


### DataFrame Checks

You can also define dataframe-level checks, similar to the object-based API, using the `dataframe_check()` decorator:

In [31]:
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series


class DataFrameCheckSchema(pa.SchemaModel):

    col1: Series[int] = pa.Field(gt=0, coerce=True)
    col2: Series[float] = pa.Field(gt=0, coerce=True)
    col3: Series[float] = pa.Field(lt=0, coerce=True)

    @pa.dataframe_check  # NOTE.
    def product_is_negative(cls, df: pd.DataFrame) -> Series[bool]:
        return df["col1"] * df["col2"] * df["col3"] < 0  # pyright: ignore


df = pd.DataFrame(
    {
        "col1": [1, 2, 3],
        "col2": [5, 6, 7],
        "col3": [-1, -2, -3],
    }
)

DataFrameCheckSchema.validate(df)

   col1  col2  col3
0     1   5.0  -1.0
1     2   6.0  -2.0
2     3   7.0  -3.0


### Inheritance

The custom checks *are inherited* and can therefore be overridden by the subclass.


In [32]:
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series


class Parent(pa.SchemaModel):

    a: Series[int] = pa.Field(coerce=True)

    @pa.check("a", name="foobar")
    def check_a(cls, a: Series[int]) -> Series[bool]:
        return a < 100  # pyright: ignore


class Child(Parent):

    a: Series[int] = pa.Field(coerce=False)

    @pa.check("a", name="foobar")
    def check_a(cls, a: Series[int]) -> Series[bool]:
        return a > 100  # pyright: ignore


is_a_coerce = Child.to_schema().columns["a"].coerce
print(f"coerce: {is_a_coerce}")

coerce: False


In [33]:
df = pd.DataFrame({"a": [1, 2, 3]})

try:
    Child.validate(df)
except Exception as e:
    display(type(e))
    print(e)

pandera.errors.SchemaError

<Schema Column(name=a, type=DataType(int64))> failed element-wise validator 0:
<Check foobar>
failure cases:
   index  failure_case
0      0             1
1      1             2
2      2             3


## Aliases

`SchemaModel` supports columns which are *not valid python variable names* via the argument `alias` of [`Field`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.model_components.Field.html#pandera.model_components.Field).

Checks must reference the aliased names.

In [34]:
import pandera as pa
import pandas as pd


class Schema(pa.SchemaModel):
    col_2020: pa.typing.Series[int] = pa.Field(alias=2020)
    idx: pa.typing.Index[int] = pa.Field(alias="_idx", check_name=True)  # NOTE!

    @pa.check(2020)
    def int_column_lt_100(cls, series):
        return series < 100


df = pd.DataFrame({2020: [99]}, index=[0])
df.index.name = "_idx"  # NOTE!

print(Schema.validate(df))

      2020
_idx      
0       99


*(New in 0.6.2)* The alias is respected when using the class attribute to get the underlying `pd.DataFrame` column name or index level name.

In [36]:
print(Schema.col_2020)

# Will show the alias: 2020

2020


Very similar to the example above, you can also use the variable name directly within the class scope, and it will respect the alias.

ℹ️ Note. To access a variable from the class scope, you need to make it a class attribute, and therefore assign it a default `Field`.

In [37]:
import pandera as pa
import pandas as pd


class Schema(pa.SchemaModel):
    a: pa.typing.Series[int] = pa.Field()
    col_2020: pa.typing.Series[int] = pa.Field(alias=2020)

    @pa.check(col_2020)
    def int_column_lt_100(cls, series):
        return series < 100

    @pa.check(a)
    def int_column_gt_100(cls, series):
        return series > 100


df = pd.DataFrame({2020: [99], "a": [101]})
print(Schema.validate(df))

   2020    a
0    99  101
