Following [this](https://pandera.readthedocs.io/en/stable/dtypes.html)

# Pandera Data Types

*new in 0.7.0*

> ⚠️ Tricky

## Motivations

ℹ️ Pandera **defines its own interface for data types** in order to abstract the specifics of dataframe-like data
structures in the Python ecosystem, such as Apache Spark, Apache Arrow and xarray.

**❗ Terminology:**

In the following section:
* *Pandera Data Type* refers to a `pandera.dtypes.DataType` object
* whereas *native data type* refers to data types used by third-party libraries that Pandera supports (e.g. `pandas`).

Most of the time, this distinction is transparent to end users, since pandera columns and indexes accept native data types.

However, it is possible to *extend* the pandera interface by:
* modifying the **data type check** performed during schema validation.
* modifying the behavior of the `coerce` argument for `DataFrameSchema`.
* adding your **own custom data types**.


## DataType basics

All pandera data types inherit from [`pandera.dtypes.DataType`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.dtypes.DataType.html#pandera.dtypes.DataType)
and **must be hashable**.

A data type implements *three* key methods:
* [`pandera.dtypes.DataType.check()`](https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.dtypes.DataType.check.html#pandera.dtypes.DataType.check)
which validates that data types are equivalent.
* [`pandera.dtypes.DataType.coerce()`](https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.dtypes.DataType.coerce.html#pandera.dtypes.DataType.coerce)
which coerces a data container (e.g. `pandas.Series`) to the data type.
* The dunder method `__str__()`, which should output the native alias. For example, `str(pandera.Float64) == "float64"`.

For pandera’s validation methods to be aware of a data type, it has to be **registered** with the *targeted engine* via
[`pandera.engines.engine.Engine.register_dtype()`](https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.engines.engine.Engine.register_dtype.html#pandera.engines.engine.Engine.register_dtype).

An engine is in charge of mapping a pandera [`DataType`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.dtypes.DataType.html#pandera.dtypes.DataType)
to its native data type counterpart in a third-party library.

The mapping can be queried with [`pandera.engines.engine.Engine.dtype()`](https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.engines.engine.Engine.dtype.html#pandera.engines.engine.Engine.dtype).

> As of pandera *0.7.0*, only the pandas `Engine` is supported.
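
To make the register/lookup idea concrete, here is a minimal standard-library sketch of an engine-like registry. `ToyEngine` and `ToyBool` are hypothetical names invented for illustration, and the dictionary lookup is an assumption about the general idea, not pandera's actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict


class ToyEngine:
    """Hypothetical registry mapping native aliases to data type classes."""

    _registry: Dict[Any, type] = {}

    @classmethod
    def register_dtype(cls, dtype_cls=None, *, equivalents=()):
        """Usable as a bare decorator or with an `equivalents` argument."""

        def _wrapper(klass):
            # Every equivalent alias, and the class itself, resolves to klass.
            for key in equivalents:
                cls._registry[key] = klass
            cls._registry[klass] = klass
            return klass

        if dtype_cls is not None:
            return _wrapper(dtype_cls)
        return _wrapper

    @classmethod
    def dtype(cls, native: Any):
        """Return an instance of the data type registered for `native`."""
        try:
            return cls._registry[native]()
        except KeyError:
            raise TypeError(f"data type {native!r} not understood") from None


@ToyEngine.register_dtype(equivalents=["boolean", bool])
@dataclass(frozen=True)
class ToyBool:
    """Hypothetical data type; frozen so that instances are hashable."""

    def __str__(self) -> str:
        return "boolean"


print(ToyEngine.dtype("boolean"))
print(ToyEngine.dtype(bool) == ToyEngine.dtype("boolean"))
```

Both the string alias and the `bool` class resolve to the same registered type, which is the behavior the `equivalents` argument is meant to provide.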

### Example

> ℹ️ Potentially very useful for my use cases!

Let’s extend `pandas.BooleanDtype` coercion to handle the string literals `"True"` and `"False"`.


In [4]:
import pandas as pd
import pandera as pa
from pandera import dtypes
from pandera.engines import pandas_engine


@pandas_engine.Engine.register_dtype  # NOTE: step 1
@dtypes.immutable  # NOTE: step 2
class LiteralBool(pandas_engine.BOOL):  # NOTE: step 3
    def coerce(self, series: pd.Series) -> pd.Series:  # NOTE: Here we're setting up `coerce`.
        """Coerce a pandas.Series to boolean types."""
        if pd.api.types.is_string_dtype(series):
            series = series.replace({"True": 1, "False": 0})
        return series.astype("boolean")


data = pd.Series(["True", "False"], name="literal_bools")
print(data)
print()

# step 4
print(pa.SeriesSchema(LiteralBool(), coerce=True, name="literal_bools").validate(data).dtype)

0     True
1    False
Name: literal_bools, dtype: object

boolean


> [❗] Note the below very carefully!

The example above performs the following steps:
* Register the data type with the pandas engine.
* [❗] [`pandera.dtypes.immutable()`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.dtypes.immutable.html#pandera.dtypes.immutable)
creates an **immutable** (and **hashable**) `dataclass()`.
* [❗] Inherit `pandera.engines.pandas_engine.BOOL`, which is **the `pandera` representation of `pandas.BooleanDtype`**.
This is not mandatory **but it makes our life easier by having already implemented *all the required methods***.
* Check that our new data type can coerce the string literals.
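
The "immutable and hashable" requirement can be illustrated with a frozen dataclass from the standard library. This is only an analogy: `ToyCategory` is a hypothetical class, and the sketch assumes that `immutable` behaves roughly like `dataclasses.dataclass(frozen=True)`.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyCategory:
    """Hypothetical parametrized data type with two fields."""

    categories: Tuple[str, ...] = ()
    ordered: bool = False


a = ToyCategory(("x", "y"))
b = ToyCategory(("x", "y"))

# Frozen dataclasses compare by value and are hashable,
# so equal instances can be used interchangeably as dictionary keys.
lookup = {a: "registered"}
print(lookup[b])

# Attribute assignment is rejected, which keeps the hash stable.
try:
    a.ordered = True
except FrozenInstanceError:
    print("instances cannot be modified")
```

Hashability matters because data types are used as lookup keys, and immutability guarantees that a key cannot change after it has been stored.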

So far, we have not overridden the default behavior, so the built-in `"boolean"` type still fails to coerce the string literals:

In [5]:
import pandera as pa

print(data)
print()

try:
    pa.SeriesSchema("boolean", coerce=True).validate(data)
except Exception as e:
    print(e)

0     True
1    False
Name: literal_bools, dtype: object

Error while coercing 'None' to type boolean: Could not coerce <class 'pandas.core.series.Series'> data_container into type boolean:
   index failure_case
0      0         True
1      1        False


⚠️ To **completely replace the default `BOOL`**, we need to supply all the **`equivalent` representations** to `register_dtype()`.

How does it work?

Behind the scenes, when `pa.SeriesSchema("boolean")` is called, the corresponding pandera data type is looked up using `pandera.engines.engine.Engine.dtype()`.

In [12]:
print(f"before: {pandas_engine.Engine.dtype('boolean').__class__}")


@pandas_engine.Engine.register_dtype(
    equivalents=["boolean", pd.BooleanDtype, pd.BooleanDtype()],  # NOTE equivalents
)
@dtypes.immutable
class LiteralBool(pandas_engine.BOOL):  # As before...
    def coerce(self, series: pd.Series) -> pd.Series:
        """Coerce a pandas.Series to boolean types."""
        if pd.api.types.is_string_dtype(series):
            series = series.replace({"True": 1, "False": 0})
        return series.astype("boolean")


print(f"after: {pandas_engine.Engine.dtype('boolean').__class__}")

print()
print(data)

for dtype in ["boolean", pd.BooleanDtype, pd.BooleanDtype()]:
    pa.SeriesSchema(dtype, coerce=True).validate(data)

"""
before: <class 'pandera.engines.pandas_engine.BOOL'>
after: <class 'LiteralBool'>
""";


# NOTE:
# What now happens in this example is that the data represented by the strings "True" and "False" is successfully
# COERCED to the standard boolean representation!

before: <class '__main__.LiteralBool'>
after: <class '__main__.LiteralBool'>

0     True
1    False
Name: literal_bools, dtype: object


> ℹ️ For convenience, we specified both `pd.BooleanDtype` and `pd.BooleanDtype()` as equivalents.
>
> That gives us more flexibility in what pandera schemas can recognize (see last for-loop above).

### Parametrized data types

> ⚠️ Tricky

Some data types can be **parametrized**. One common example is `pandas.CategoricalDtype`.

The `equivalents` argument of `register_dtype()` does not handle this situation. Instead, `register_dtype()` automatically
registers a `classmethod()` with signature `from_parametrized_dtype(cls, equivalent:...)` if the decorated `DataType` defines it.

The `equivalent` argument **must be type-annotated** because the annotation is used to dispatch the input of `Engine.dtype()` to the
appropriate `from_parametrized_dtype` class method.

For example, here is a snippet from `pandera.engines.pandas_engine.Category`:


In [13]:
from typing import Union

import pandas as pd
from pandera import dtypes


@classmethod
def from_parametrized_dtype(cls, cat: Union[dtypes.Category, pd.CategoricalDtype]):
    """Convert a categorical to
    a Pandera :class:`pandera.dtypes.pandas_engine.Category`."""
    return cls(categories=cat.categories, ordered=cat.ordered)  # type: ignore

> **Note**
>
> The dispatch mechanism relies on `functools.singledispatch()`.
> 
> Unlike the built-in implementation, `typing.Union` is recognized.
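
As a rough illustration of annotation-driven dispatch, the sketch below expands a `typing.Union` annotation and registers each member with `functools.singledispatch`. `from_native` and `register_union` are hypothetical names, and this is not pandera's internal code. (Since Python 3.11 the standard library's `register()` also accepts union annotations directly.)

```python
from functools import singledispatch
from typing import Union, get_args, get_origin, get_type_hints


@singledispatch
def from_native(equivalent):
    """Fallback when no registered type matches."""
    raise TypeError(f"no conversion registered for {type(equivalent).__name__}")


def register_union(func):
    """Hypothetical helper: register `func` for every type named in the
    annotation of its first parameter, unpacking Union if present."""
    hints = get_type_hints(func)
    hints.pop("return", None)
    annotation = next(iter(hints.values()))
    if get_origin(annotation) is Union:
        members = get_args(annotation)
    else:
        members = (annotation,)
    for member in members:
        from_native.register(member, func)
    return func


@register_union
def _(equivalent: Union[int, float]) -> str:
    return f"number:{equivalent}"


@register_union
def _(equivalent: str) -> str:
    return f"text:{equivalent}"


print(from_native(1))
print(from_native(2.5))
print(from_native("a"))
```

The type annotation is the only information the helper uses to pick a target, which is why an unannotated `equivalent` argument could not be dispatched.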

### Defining the `coerce_value` method

For pandera data types to report coercion errors correctly, they need to know how to coerce an individual value into the specified type.

All `pandas` data types are supported: `numpy`-based data types use the *underlying numpy `dtype`* to coerce an individual value.

The `pandas`-native datatypes like `CategoricalDtype` and `BooleanDtype` are also supported.

As an example of a special-cased `coerce_value` implementation, see the source code for
[`pandera.engines.pandas_engine.Category.coerce_value()`](https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.engines.pandas_engine.Category.coerce_value.html#pandera.engines.pandas_engine.Category.coerce_value):

In [15]:
from typing import Any


def coerce_value(self, value: Any) -> Any:
    """Coerce a value to a particular type."""
    if value not in self.categories:  # type: ignore
        raise TypeError(f"value {value} cannot be coerced to type {self.type}")
    return value
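
To see why a per-value coercion function helps with error reporting, consider this standard-library sketch. `find_failure_cases` is a hypothetical helper, not part of pandera: it applies a `coerce_value`-style callable to each element and collects the ones that fail.

```python
from typing import Any, Callable, Iterable, List, Tuple


def find_failure_cases(
    values: Iterable[Any],
    coerce_value: Callable[[Any], Any],
) -> List[Tuple[int, Any]]:
    """Return (index, value) pairs for every value that cannot be coerced."""
    failures = []
    for index, value in enumerate(values):
        try:
            coerce_value(value)
        except (TypeError, ValueError):
            failures.append((index, value))
    return failures


# Using the built-in int() as a stand-in for a data type's coerce_value().
failures = find_failure_cases(["1", "2", "x"], int)
print(failures)  # [(2, 'x')]
```

Coercing the whole container at once would only reveal that something failed; trying values one by one identifies which rows are responsible.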

### Logical data types

Taking inspiration from the [visions project](https://dylan-profiler.github.io/visions/visions/background/data_type_view.html#decoupling-physical-and-logical-types),
pandera provides an interface for defining **logical data types**.

*Physical* types represent the actual, underlying representation of the data (e.g. `Int8`, `Float32`, `String`),
whereas *logical* types represent the **abstracted understanding of that data** (e.g. `IPs`, `URLs`, `paths`).

Validating a *logical* data type consists of:
* validating the supporting physical data type (see [Motivations](https://pandera.readthedocs.io/en/stable/dtypes.html#dtypes-intro)) and
* a check on actual values.

For example, an IP address data type would validate that:
* The data container type is a `String`.
* The actual values are well-formed addresses.
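
The two-step validation can be sketched without pandera, using plain Python sequences and the standard `ipaddress` module. `is_ipv4` and `check_logical_ip` are hypothetical helpers. Returning a single `False` for a physical-type mismatch and a per-value list otherwise is an assumption modeled on the `Union[bool, Iterable[bool]]` return type shown in the example in this section.

```python
import ipaddress
from typing import List, Sequence, Union


def is_ipv4(value: str) -> bool:
    """Return True if `value` parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True


def check_logical_ip(values: Sequence[object]) -> Union[bool, List[bool]]:
    # Step 1: physical check, every element must be a string.
    if not all(isinstance(value, str) for value in values):
        return False
    # Step 2: logical check, one boolean per value.
    return [is_ipv4(value) for value in values]


print(check_logical_ip(["0.0.0.0", "0.0.0.1", "0.0.0.a"]))  # [True, True, False]
print(check_logical_ip([1, "0.0.0.0"]))  # False
```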

💡 Non-native pandas dtypes can also be wrapped in a `numpy.object_` and verified using the data, since the `object`
dtype alone is *not enough to verify the correctness*.
An example would be the standard [`decimal.Decimal`](https://docs.python.org/3/library/decimal.html#decimal.Decimal)
class that can be validated via the pandera `DataType` [`Decimal`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.dtypes.Decimal.html#pandera.dtypes.Decimal).

To implement a logical data type, you just need to:
* implement the method [`pandera.dtypes.DataType.check()`](https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.dtypes.DataType.check.html#pandera.dtypes.DataType.check) and
* make use of the `data_container` argument to perform checks on the values of the data.

**For example**, you can create an `IPAddress` datatype that inherits from the numpy `string` physical type,
thereby storing the values as strings, and checks whether the values actually match an IP address regular expression.

In [17]:
import re
from typing import Optional, Iterable, Union


@pandas_engine.Engine.register_dtype
@dtypes.immutable
class IPAddress(pandas_engine.NpString):

    # NOTE: This implementation.
    def check(
        self,
        pandera_dtype: dtypes.DataType,
        data_container: Optional[pd.Series] = None,
    ) -> Union[bool, Iterable[bool]]:

        # ensure that the data container's data type is a string,
        # using the parent class's check implementation
        correct_type = super().check(pandera_dtype)
        if not correct_type:
            return correct_type

        # ensure each value starts with four dot-separated groups of 1-3 digits
        exp = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")
        return data_container.map(lambda x: exp.match(x) is not None)  # pyright: ignore

    def __str__(self) -> str:
        return str(self.__class__.__name__)

    def __repr__(self) -> str:
        return f"DataType({self})"


schema = pa.DataFrameSchema(columns={"ips": pa.Column(IPAddress)})

df = pd.DataFrame({"ips": ["0.0.0.0", "0.0.0.1", "0.0.0.a"]})
display(df)

try:
    schema.validate(df)
except Exception as e:
    print(e)

       ips
0  0.0.0.0
1  0.0.0.1
2  0.0.0.a


expected series 'ips' to have type IPAddress:
failure cases:
   index failure_case
0      2      0.0.0.a
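
One caveat about the regular expression used in `check()`: `match()` only anchors at the start of the string, so a value with trailing characters still passes. The sketch below compares it with `fullmatch()`. The `loose` and `strict` helpers are hypothetical names, and the capture group from the original pattern is dropped because it does not affect matching.

```python
import re

IPV4_SHAPE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")


def loose(value: str) -> bool:
    """Same test as in the example: the pattern must match a prefix."""
    return IPV4_SHAPE.match(value) is not None


def strict(value: str) -> bool:
    """Stricter variant: the pattern must match the entire string."""
    return IPV4_SHAPE.fullmatch(value) is not None


for candidate in ["0.0.0.1", "0.0.0.a", "0.0.0.1-extra"]:
    print(candidate, loose(candidate), strict(candidate))
```

Both variants check only the shape of the string, so out-of-range octets such as `999.999.999.999` are still accepted. Range validation would need an additional step, for example with the standard `ipaddress` module.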
