These notes follow the pandera [DataFrame Schemas documentation](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html).

# DataFrame Schemas

The `DataFrameSchema` class enables the specification of a schema that verifies the columns and index of a pandas `DataFrame` object.

The `DataFrameSchema` object consists of
* `Columns` and
* an `Index`.

> You can refer to [Schema Models](https://pandera.readthedocs.io/en/stable/schema_models.html#schema-models) to see
how to define dataframe schemas using the *alternative* pydantic/dataclass-style syntax.

In [1]:
import pandera as pa

from pandera import Column, DataFrameSchema, Check, Index

schema = DataFrameSchema(
    {
        "column1": Column(int),
        "column2": Column(float, Check(lambda s: s < -1.2)),
        # you can provide a list of validators
        "column3": Column(
            str,
            [Check(lambda s: s.str.startswith("value")), Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)],
        ),
    },
    index=Index(int),
    strict=True,
    coerce=True,
)

## Column Validation

A `Column` object specifies the properties of a column in a dataframe object.

It can be *optionally* verified for
* its data type,
* null values
* or duplicate values.

Also:
* The column can be *coerced* into the specified type,
* and the `required` parameter allows control over whether or not the column is allowed to be missing.

Similarly to pandas, the data type can be specified as:
* a string alias, as long as it is recognized by pandas.
* a python type: `int`, `float`, `bool`, `str`
* a `numpy` data type
* a [pandas extension type](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes):
it can be an instance (e.g. `pd.CategoricalDtype(["a", "b"])`) or a class (e.g. `pandas.CategoricalDtype`) if it can be
initialized with default values.
* a pandera [`DataType`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.dtypes.DataType.html#pandera.dtypes.DataType):
it can also be an instance or a class.

[Column checks](https://pandera.readthedocs.io/en/stable/checks.html#checks) allow for the `DataFrame`'s values to be
checked against a *user-provided function*. `Check` objects also support [grouping](https://pandera.readthedocs.io/en/stable/checks.html#grouping)
by a different column so that the user can make assertions about subsets of the column of interest.

Column Hypotheses enable you to perform statistical hypothesis tests on a DataFrame in either wide or tidy format.
See [Hypothesis Testing](https://pandera.readthedocs.io/en/stable/hypothesis.html#hypothesis) for more details.

### Null Values in Columns

By default, `SeriesSchema`/`Column` objects assume that values are ***not* nullable**.

In order to accept null values, you need to *explicitly specify `nullable=True`*, or else you’ll get an error.

In [2]:
import numpy as np
import pandas as pd
import pandera as pa

from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame({"column1": [5, 1, np.nan]})

non_null_schema = DataFrameSchema({"column1": Column(float, Check(lambda x: x > 0))})

try:
    non_null_schema.validate(df)
except pa.errors.SchemaError as e:
    print(e)

non-nullable series 'column1' contains null values:
2   NaN
Name: column1, dtype: float64


In [3]:
null_schema = DataFrameSchema(
    {"column1": Column(float, Check(lambda x: x > 0), nullable=True)}  # NOTE: `nullable=True`
)

print(null_schema.validate(df))  # Good now.

   column1
0      5.0
1      1.0
2      NaN


### Coercing Types on Columns

If you specify `Column(dtype, ..., coerce=True)` as part of the `DataFrameSchema` definition,
calling `schema.validate` will **first coerce the column into the specified dtype** before applying validation checks.


In [4]:
import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

df = pd.DataFrame({"column1": [1, 2, 3]})
schema = DataFrameSchema({"column1": Column(str, coerce=True)})  # NOTE.

validated_df = schema.validate(df)

validated_df.info()

assert isinstance(validated_df.column1.iloc[0], str)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   column1  3 non-null      object
dtypes: object(1)
memory usage: 152.0+ bytes


> Note the special case of *integer* columns, which do not support `nan` values.
>
> In this case, `schema.validate` will complain if `coerce == True` **and** *null values are allowed* in the column.

In [5]:
df = pd.DataFrame({"column1": [1.0, 2.0, 3, np.nan]})  # NOTE: Integer values with a NaN.

schema = DataFrameSchema(
    {"column1": Column(int, coerce=True, nullable=True)}  # NOTE: Now try to coerce this and allow null...
)

try:
    validated_df = schema.validate(df)
except Exception as e:
    print(e)

Error while coercing 'column1' to type int64: Could not coerce <class 'pandas.core.series.Series'> data_container into type int64:
   index  failure_case
0      3           NaN


> The best way to handle this case is to simply specify the column as `float` or `object`.

🤔 Note the above... The weird int/nan case is still tricky.

In [6]:
schema_object = DataFrameSchema({"column1": Column(object, coerce=True, nullable=True)})  # Treat as object.
schema_float = DataFrameSchema({"column1": Column(float, coerce=True, nullable=True)})  # Treat as float.

# NOTE: 🤔 But again, then they're not strictly integer, are they?!

print(schema_object.validate(df).dtypes)
print(schema_float.validate(df).dtypes)

column1    object
dtype: object
column1    float64
dtype: object


If you want to coerce *all of the columns specified in the `DataFrameSchema`*,
you can specify the `coerce` argument with `DataFrameSchema(..., coerce=True)`.

### Required Columns

By default all columns specified in the schema are required, meaning that if a column is missing in the input `DataFrame`
an exception will be thrown.

If you want to make a column *optional*, specify `required=False` in the column constructor:

In [7]:
import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

df = pd.DataFrame({"column2": ["hello", "pandera"]})

schema = DataFrameSchema({"column1": Column(int, required=False), "column2": Column(str)})  # NOTE: required=False

validated_df = schema.validate(df)
display(validated_df)

Unnamed: 0,column2
0,hello
1,pandera


### Ordered Columns

Column order can be checked with `ordered=True`; see the section *Validating the order of the columns* further down.

### Stand-alone Column Validation

In addition to being used in the context of a `DataFrameSchema`, `Column` objects
**can also be used on their own to validate columns in a dataframe**:

In [8]:
import pandas as pd
import pandera as pa

df = pd.DataFrame(
    {
        "column1": [1, 2, 3],
        "column2": ["a", "b", "c"],
    }
)

column1_schema = pa.Column(int, name="column1")  # NOTE.
column2_schema = pa.Column(str, name="column2")  # NOTE.

# pass the dataframe as an argument to the Column object callable
df = column1_schema(df)  # NOTE.
validated_df = column2_schema(df)  # NOTE.

# or explicitly use the validate method
df = column1_schema.validate(df)  # pyright: ignore  # NOTE.
validated_df = column2_schema.validate(df)  # NOTE.

# use the DataFrame.pipe method to validate two columns
validated_df = df.pipe(column1_schema).pipe(column2_schema)  # NOTE.

For multi-column use cases, the `DataFrameSchema` *is still recommended*,
but if you have one or a small number of columns to verify, using Column objects by themselves is appropriate.

### Column Regex Pattern Matching

In the case that your dataframe has multiple columns that share common statistical properties,
you might want to *specify a regex pattern* that matches a set of meaningfully grouped columns that have `str` names.


In [9]:
import numpy as np
import pandas as pd
import pandera as pa

categories = ["A", "B", "C"]

np.random.seed(100)

dataframe = pd.DataFrame(
    {
        "cat_var_1": np.random.choice(categories, size=100),
        "cat_var_2": np.random.choice(categories, size=100),
        "num_var_1": np.random.uniform(0, 10, size=100),
        "num_var_2": np.random.uniform(20, 30, size=100),
    }
)

schema = pa.DataFrameSchema(
    {
        # NOTE:
        "num_var_.+": pa.Column(
            float,
            checks=pa.Check.greater_than_or_equal_to(0),
            regex=True,
        ),
        # NOTE:
        "cat_var_.+": pa.Column(
            pa.Category,
            checks=pa.Check.isin(categories),
            coerce=True,
            regex=True,
        ),
    }
)

display(schema.validate(dataframe).head())

Unnamed: 0,cat_var_1,cat_var_2,num_var_1,num_var_2
0,A,A,6.804147,24.743304
1,A,C,3.684308,22.774633
2,A,C,5.911288,28.416588
3,C,A,4.790627,21.95125
4,C,B,4.504166,28.563142


You can also regex pattern match on `pd.MultiIndex` columns:

In [10]:
np.random.seed(100)

# NOTE: Multi level columns.
dataframe = pd.DataFrame(
    {
        ("cat_var_1", "y1"): np.random.choice(categories, size=100),
        ("cat_var_2", "y2"): np.random.choice(categories, size=100),
        ("num_var_1", "x1"): np.random.uniform(0, 10, size=100),
        ("num_var_2", "x2"): np.random.uniform(0, 10, size=100),
    }
)

schema = pa.DataFrameSchema(
    {
        ("num_var_.+", "x.+"): pa.Column(  # NOTE.
            float,
            checks=pa.Check.greater_than_or_equal_to(0),
            regex=True,
        ),
        ("cat_var_.+", "y.+"): pa.Column(  # NOTE.
            pa.Category,
            checks=pa.Check.isin(categories),
            coerce=True,
            regex=True,
        ),
    }
)

display(schema.validate(dataframe).head())

Unnamed: 0_level_0,cat_var_1,cat_var_2,num_var_1,num_var_2
Unnamed: 0_level_1,y1,y2,x1,x2
0,A,A,6.804147,4.743304
1,A,C,3.684308,2.774633
2,A,C,5.911288,8.416588
3,C,A,4.790627,1.95125
4,C,B,4.504166,8.563142


### Handling Dataframe Columns *not in the Schema*

By default, columns that *aren’t specified in the schema aren’t checked*.

If you want to check that the `DataFrame` only contains columns in the schema, specify `strict=True`:

In [11]:
import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

schema = DataFrameSchema({"column1": Column(int)}, strict=True)  # NOTE strict=True

df = pd.DataFrame({"column2": [1, 2, 3]})

try:
    schema.validate(df)
except Exception as e:
    display(type(e))
    print(e)

pandera.errors.SchemaError

column 'column2' not in DataFrameSchema {'column1': <Schema Column(name=column1, type=DataType(int64))>}


Alternatively, if your `DataFrame` contains columns that are not in the schema,
and you would like these **to be dropped** on validation, you can specify `strict='filter'`.

In [12]:
import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema

df = pd.DataFrame({"column1": ["drop", "me"], "column2": ["keep", "me"]})
schema = DataFrameSchema({"column2": Column(str)}, strict="filter")  # NOTE: strict='filter'

validated_df = schema.validate(df)

display(df)
display(validated_df)

Unnamed: 0,column1,column2
0,drop,keep
1,me,me


Unnamed: 0,column2
0,keep
1,me


### Validating the order of the columns

For some applications **the order of the columns** is important. For example:
* If you want to use selection by position instead of the more common selection by label.
* ℹ️ **Machine learning**: Many ML libraries will cast a `DataFrame` to `numpy` arrays, for which order becomes crucial.

To validate the order of the `DataFrame` columns, specify `ordered=True`:

In [13]:
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(columns={"a": pa.Column(int), "b": pa.Column(int)}, ordered=True)  # NOTE ordered=True

df = pd.DataFrame({"b": [1], "a": [1]})  # Order is "wrong"!

try:
    schema.validate(df)
except Exception as e:
    display(type(e))
    print(e)

pandera.errors.SchemaError

column 'b' out-of-order


### Validating the joint uniqueness of columns

In some cases you might want to ensure that *a group of columns are unique*:

> That is, they must be unique in terms of their joint contents (values)

In [14]:
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
)
df = pd.DataFrame.from_records(
    [
        {"a": 1, "b": 2, "c": 3},
        {"a": 1, "b": 2, "c": 3},
    ]
)

try:
    schema.validate(df)
except Exception as e:
    display(type(e))
    print(e)

pandera.errors.SchemaError

columns '('a', 'c')' not unique:
  column  index  failure_case
0      a      0             1
1      a      1             1
2      c      0             3
3      c      1             3


To control how unique errors are *reported*, the `report_duplicates` argument accepts:
* `exclude_first`: (default) report all duplicates except first occurrence
* `exclude_last`: report all duplicates except last occurrence
* `all`: report all duplicates

In [15]:
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
    report_duplicates="exclude_first",  # NOTE.
)
df = pd.DataFrame.from_records(
    [
        {"a": 1, "b": 2, "c": 3},
        {"a": 1, "b": 2, "c": 3},
    ]
)

try:
    schema.validate(df)
except Exception as e:
    display(type(e))
    print(e)

pandera.errors.SchemaError

columns '('a', 'c')' not unique:
  column  index  failure_case
0      a      1             1
1      c      1             3


## Index Validation

You can also specify an `Index` in the `DataFrameSchema`.


In [16]:
import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index, Check

schema = DataFrameSchema(
    columns={"a": Column(int)},
    index=Index(str, Check(lambda x: x.str.startswith("index_"))),  # NOTE: Index check.
)

df = pd.DataFrame(data={"a": [1, 2, 3]}, index=["index_1", "index_2", "index_3"])

print(schema.validate(df))

         a
index_1  1
index_2  2
index_3  3


In the case that the `DataFrame` index doesn’t pass the `Check`:

In [17]:
df = pd.DataFrame(data={"a": [1, 2, 3]}, index=["foo1", "foo2", "foo3"])

try:
    schema.validate(df)
except Exception as e:
    display(type(e))
    print(e)

pandera.errors.SchemaError

<Schema Index(name=None, type=DataType(str))> failed element-wise validator 0:
<Check <lambda>>
failure cases:
   index failure_case
0      0         foo1
1      1         foo2
2      2         foo3


## MultiIndex Validation

`pandera` also supports **multi-index *column* and *index* validation**.

### MultiIndex Columns

Specifying multi-index columns follows the `pandas` syntax of
**specifying tuples for each level in the index hierarchy**:

In [18]:
import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index

schema = DataFrameSchema({("foo", "bar"): Column(int), ("foo", "baz"): Column(str)})

df = pd.DataFrame(
    {
        ("foo", "bar"): [1, 2, 3],
        ("foo", "baz"): ["a", "b", "c"],
    }
)

display(schema.validate(df))

Unnamed: 0_level_0,foo,foo
Unnamed: 0_level_1,bar,baz
0,1,a
1,2,b
2,3,c


### MultiIndex Indexes

The `MultiIndex` class allows you to define multi-index indexes by *composing a list of `pandera.Index` objects*.

In [19]:
import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index, MultiIndex, Check  # NOTE: MultiIndex here.

schema = DataFrameSchema(
    columns={"column1": Column(int)},
    # NOTE:
    index=MultiIndex(
        [
            Index(str, Check(lambda s: s.isin(["foo", "bar"])), name="index0"),
            Index(int, name="index1"),
        ]
    ),
)

df = pd.DataFrame(
    data={"column1": [1, 2, 3]},
    index=pd.MultiIndex.from_arrays(
        [["foo", "bar", "foo"], [0, 1, 2]],
        names=["index0", "index1"],
    ),
)

display(schema.validate(df))

Unnamed: 0_level_0,Unnamed: 1_level_0,column1
index0,index1,Unnamed: 2_level_1
foo,0,1
bar,1,2
foo,2,3


## Get Pandas Data Types

Pandas provides a `dtype` parameter for *casting a dataframe to a specific **dtype schema***.

`DataFrameSchema` provides a `dtypes` property, which returns a dictionary whose keys are column names and whose values are `DataType` objects.

Some examples of where this can be provided to pandas are:
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html


In [20]:
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "column1": pa.Column(int),
        "column2": pa.Column(pa.Category),
        "column3": pa.Column(bool),
    },
)

df = (
    pd.DataFrame.from_dict(
        {
            "a": {"column1": 1, "column2": "valueA", "column3": True},
            "b": {"column1": 1, "column2": "valueB", "column3": True},
        },
        orient="index",
    )
    .astype({col: str(dtype) for col, dtype in schema.dtypes.items()})  # NOTE: schema.dtypes
    .sort_index(axis=1)
)

print(schema.dtypes)

display(schema.validate(df))

{'column1': DataType(int64), 'column2': DataType(category), 'column3': DataType(bool)}


Unnamed: 0,column1,column2,column3
a,1,valueA,True
b,1,valueB,True


## `DataFrameSchema` *Transformations*

ℹ️ Once you’ve defined a schema, **you can then make modifications to it**,
* both on the *schema level* - such as *adding or removing columns* and *setting or resetting the index*;
* or on the *column level* - such as changing the data type or checks.

> This is useful for re-using schema objects in a data pipeline when additional computation has been done on a dataframe,
> where the column objects may have changed or perhaps where additional checks may be required.

In [21]:
import pandas as pd
import pandera as pa

data = pd.DataFrame({"col1": range(1, 6)})

schema = pa.DataFrameSchema(
    columns={"col1": pa.Column(int, pa.Check(lambda s: s >= 0))},
    strict=True,
)

# NOTE: .add_columns():
transformed_schema = schema.add_columns(
    {
        "col2": pa.Column(str, pa.Check(lambda s: s == "value")),
        "col3": pa.Column(float, pa.Check(lambda x: x == 0.0)),
    }
)
assert isinstance(transformed_schema, pa.DataFrameSchema)  # For static type checker's sake.

# validate original data
data = schema.validate(data)
print("data")
display(data)

# transformation
transformed_data = data.assign(col2="value", col3=0.0)
print("transformed_data")
display(transformed_data)

# validate transformed data
print("Validating...")
display(transformed_schema.validate(transformed_data))

data


Unnamed: 0,col1
0,1
1,2
2,3
3,4
4,5


transformed_data


Unnamed: 0,col1,col2,col3
0,1,value,0.0
1,2,value,0.0
2,3,value,0.0
3,4,value,0.0
4,5,value,0.0


Validating...


Unnamed: 0,col1,col2,col3
0,1,value,0.0
1,2,value,0.0
2,3,value,0.0
3,4,value,0.0
4,5,value,0.0


Similarly, if you want *dropped columns* to be explicitly validated in a data pipeline:

In [4]:
import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "col1": pa.Column(int, pa.Check(lambda s: s >= 0)),
        "col2": pa.Column(int, pa.Check(lambda x: x <= 0)),  # int, so the `<= 0` check is well-defined
        "col3": pa.Column(object, pa.Check(lambda x: x == 0)),
    },
    strict=True,
)

new_schema = schema.remove_columns(["col2", "col3"])
display(new_schema)

"""
<Schema DataFrameSchema(
    columns={
        'col1': <Schema Column(name=col1, type=DataType(int64))>
    },
    checks=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=True
    name=None,
    ordered=False,
    unique_column_names=False
)>
""";

<Schema DataFrameSchema(columns={'col1': <Schema Column(name=col1, type=DataType(int64))>}, checks=[], index=None, coerce=False, dtype=None, strict=True, name=None, ordered=False, unique_column_names=False)>

If during the course of a data pipeline *one of your columns is moved into the index*,
you can simply *update the initial input schema using the `set_index()` method* to create a schema for the pipeline output.

In [6]:
import pandera as pa

from pandera import Column, DataFrameSchema, Check, Index

schema = DataFrameSchema(
    {"column1": Column(int), "column2": Column(float)},
    index=Index(int, name="column3"),
    strict=True,
    coerce=True,
)

print(schema.set_index(["column1"], append=True))  # NOTE schema.set_index

"""
<Schema DataFrameSchema(
    columns={
        'column2': <Schema Column(name=column2, type=DataType(float64))>
    },
    checks=[],
    coerce=True,
    dtype=None,
    index=<Schema MultiIndex(
        indexes=[
            <Schema Index(name=column3, type=DataType(int64))>
            <Schema Index(name=column1, type=DataType(int64))>
        ]
        coerce=False,
        strict=False,
        name=None,
        ordered=True
    )>,
    strict=True
    name=None,
    ordered=False,
    unique_column_names=False
)>
""";

<Schema DataFrameSchema(
    columns={
        'column2': <Schema Column(name=column2, type=DataType(float64))>
    },
    checks=[],
    coerce=True,
    dtype=None,
    index=<Schema MultiIndex(
        indexes=[
            <Schema Index(name=column3, type=DataType(int64))>
            <Schema Index(name=column1, type=DataType(int64))>
        ]
        coerce=False,
        strict=False,
        name=None,
        ordered=True
    )>,
    strict=True
    name=None,
    ordered=False,
    unique_column_names=False
)>


ℹ️ The available methods for altering the schema are:
* `add_columns()`,
* `remove_columns()`,
* `update_columns()`,
* `rename_columns()`,
* `set_index()`,
* and `reset_index()`.