# Introduction

This notebook explains what `pandabear` is.

> TL;DR: Library that makes it super easy to define schemas for your `pd.DataFrame`s and then actually *validate* dataframes against those schemas as they are passed in and out of functions.

## Problem

Let's say you have some marketing performance data that looks like this:

In [1]:
import pandas as pd

df = pd.DataFrame({
    'spend_fb': [1000, 2000, 3000],
    'impressions_fb': [10000, 20000, 30000],
    'clicks_fb': [100, 200, 300],
    'spend_gads': [4000, 5000, 6000],
    'impressions_gads': [40000, 50000, 60000],
    'clicks_gads': [400, 500, 600],
})

df

Unnamed: 0,spend_fb,impressions_fb,clicks_fb,spend_gads,impressions_gads,clicks_gads
0,1000,10000,100,4000,40000,400
1,2000,20000,200,5000,50000,500
2,3000,30000,300,6000,60000,600


And somewhere in your pipeline you're dropping the `"spend_"` columns using a function.

In [2]:
import pandas as pd

def drop_spend(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that contain the substring 'spend'"""
    return df.drop(df.filter(regex='spend').columns, axis=1)

drop_spend(df)

Unnamed: 0,impressions_fb,clicks_fb,impressions_gads,clicks_gads
0,10000,100,40000,400
1,20000,200,50000,500
2,30000,300,60000,600


Even with this simple example, if you don't know what `df` looks like when it comes *into* the function you will struggle to know its state when returned.

**PROBLEM**: With complex dataframes and functions that **lack schema definitions**, it is **impossible to know** the state of the data as it is passed around in a pipeline.

## Solution

`pandabear` lets you:
1. Define dataframe schemas
2. Pass those schemas to your function as type hints
3. Decorate functions with a `check_schemas` decorator

If you see this pattern in your code, you KNOW the state of the data by looking at the schema. **No more guessing or running the debugger to find out what a `df` contains!**

### Basic example

In [3]:
import pandabear as pb

# 1. Define dataframe schemas (put this in a separate file to keep things clean)
# ---------------------------

class PlatformPerformanceMetrics(pb.DataFrameModel):
    """A model for platform performance data"""
    spend_fb: int = pb.Field(ge=0)
    impressions_fb: int = pb.Field(ge=0)
    clicks_fb: int = pb.Field(ge=0)
    spend_gads: int = pb.Field(ge=0)
    impressions_gads: int = pb.Field(ge=0)
    clicks_gads: int = pb.Field(ge=0)

class AdPressureMetrics(pb.DataFrameModel):
    """A model for ad pressure data"""
    impressions_fb: int = pb.Field(ge=0)
    clicks_fb: int = pb.Field(ge=0)
    impressions_gads: int = pb.Field(ge=0)
    clicks_gads: int = pb.Field(ge=0)


# 2-3. Pass schemas to function and decorate with `check_schemas`
# ---------------------------------------------------------------

@pb.check_schemas
def drop_spend(df: pb.DataFrame[PlatformPerformanceMetrics]) -> pb.DataFrame[AdPressureMetrics]:
    """Drop columns that contain the substring 'spend'"""
    return df.drop(df.filter(regex='spend').columns, axis=1)

drop_spend(df)

Unnamed: 0,impressions_fb,clicks_fb,impressions_gads,clicks_gads
0,10000,100,40000,400
1,20000,200,50000,500
2,30000,300,60000,600

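The `ge=0` argument in the fields above reads as "greater than or equal to 0". As a rough mental model, such a bound can be thought of as a comparison applied to every value in a column. The sketch below is not `pandabear`'s implementation: the helper `check_bounds` is hypothetical, and plain Python lists stand in for dataframe columns.

```python
import operator

# Map check names to comparison functions from the standard library.
COMPARISONS = {
    "ge": operator.ge,
    "gt": operator.gt,
    "le": operator.le,
    "lt": operator.lt,
}

def check_bounds(values, **bounds):
    """Return the names of the bounds that at least one value violates."""
    failed = []
    for name, bound in bounds.items():
        compare = COMPARISONS[name]
        if not all(compare(value, bound) for value in values):
            failed.append(name)
    return failed

clicks_fb = [100, 200, 300]
print(check_bounds(clicks_fb, ge=0))          # []
print(check_bounds(clicks_fb, ge=0, le=250))  # ['le']
print(check_bounds([-1, 5], ge=0))            # ['ge']
```

An empty list means every value satisfies every bound, which is the situation in the basic example above.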

### Using regex aliases

The schema API allows for a much more concise way to define schemas, using aliases and regex! Basically, if you have lots of columns that are named similarly following some convention, you can define one field that matches them all using a regex alias.

In [4]:
# 1. Define dataframe schemas
# ---------------------------

class PlatformPerformanceMetrics(pb.DataFrameModel):
    """A model for platform performance data"""
    spend: int = pb.Field(ge=0, alias="spend_.+", regex=True)
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)

class AdPressureMetrics(pb.DataFrameModel):
    """A model for ad pressure data"""
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)


# 2-3. Pass schemas to function and decorate with `check_schemas`
# ---------------------------------------------------------------

@pb.check_schemas
def drop_spend(df: pb.DataFrame[PlatformPerformanceMetrics]) -> pb.DataFrame[AdPressureMetrics]:
    """Drop columns that contain the substring 'spend'"""
    return df.drop(df.filter(regex='spend').columns, axis=1)

drop_spend(df)

Unnamed: 0,impressions_fb,clicks_fb,impressions_gads,clicks_gads
0,10000,100,40000,400
1,20000,200,50000,500
2,30000,300,60000,600

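To see which columns each regex alias would pick up, the patterns can be tried against the column names directly. This is a sketch using Python's `re` module. It assumes the alias must match the whole column name (`re.fullmatch`), which may differ from how `pandabear` matches internally, and the helper `columns_matching` is hypothetical.

```python
import re

columns = [
    "spend_fb", "impressions_fb", "clicks_fb",
    "spend_gads", "impressions_gads", "clicks_gads",
]

def columns_matching(pattern, names):
    """Return the names that the regex pattern matches in full."""
    return [name for name in names if re.fullmatch(pattern, name)]

for alias in ("spend_.+", "impressions_.+", "clicks_.+"):
    print(alias, "->", columns_matching(alias, columns))
# spend_.+ -> ['spend_fb', 'spend_gads']
# impressions_.+ -> ['impressions_fb', 'impressions_gads']
# clicks_.+ -> ['clicks_fb', 'clicks_gads']
```

Note that `.+` requires at least one character after the underscore, so under this assumption a column named exactly `spend_` would not match.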

### Coercing dtypes

Sometimes you want to coerce dtypes of specific columns to be a certain type. For example, so far we have defined "spend" columns as `int`, but really they should be `float`.

In [5]:
# 1. Define dataframe schemas
# ---------------------------

class PlatformPerformanceMetrics(pb.DataFrameModel):
    """A model for platform performance data"""
    spend: float = pb.Field(ge=0, alias="spend_.+", regex=True, coerce=True)  # <--- LOOK HERE!
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)

class AdPressureMetrics(pb.DataFrameModel):
    """A model for ad pressure data"""
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)


# 2-3. Pass schemas to function and decorate with `check_schemas`
# ---------------------------------------------------------------

@pb.check_schemas
def drop_spend(df: pb.DataFrame[PlatformPerformanceMetrics]) -> pb.DataFrame[AdPressureMetrics]:
    """Drop columns that contain the substring 'spend'"""
    assert df.filter(regex='spend').iloc[:, 0].dtype == float  # <--- LOOK HERE!
    return df.drop(df.filter(regex='spend').columns, axis=1)

drop_spend(df)

Unnamed: 0,impressions_fb,clicks_fb,impressions_gads,clicks_gads
0,10000,100,40000,400
1,20000,200,50000,500
2,30000,300,60000,600


Type coercion can also be defined at the schema level by overriding the `Config` attribute.

In [6]:
class PlatformPerformanceMetrics(pb.DataFrameModel):
    """A model for platform performance data"""
    spend: float = pb.Field(ge=0, alias="spend_.+", regex=True)
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)

    class Config:
        coerce = True

This applies coercion to all fields in the schema.

### Optional fields

When a column is defined in the schema but missing from the dataframe, validation raises an error. This is not always desired. For optional columns, use the `Optional` type in the field definition.

In [7]:
from typing import Optional

# 1. Define dataframe schemas
# ---------------------------

class PlatformPerformanceMetrics(pb.DataFrameModel):
    """A model for platform performance data"""
    spend: Optional[float] = pb.Field(ge=0, alias="spend_.+", regex=True)  # <--- LOOK HERE!
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)

    class Config:
        coerce = True

class AdPressureMetrics(pb.DataFrameModel):
    """A model for ad pressure data"""
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)


# 2-3. Pass schemas to function and decorate with `check_schemas`
# ---------------------------------------------------------------

@pb.check_schemas
def drop_spend(df: pb.DataFrame[PlatformPerformanceMetrics]) -> pb.DataFrame[AdPressureMetrics]:
    """Drop columns that contain the substring 'spend'"""
    return df.drop(df.filter(regex='spend').columns, axis=1)

df_no_spend = drop_spend(df)  # df without 'spend_.+' columns...
drop_spend(df_no_spend)       # not raising an error!

Unnamed: 0,impressions_fb,clicks_fb,impressions_gads,clicks_gads
0,10000,100,40000,400
1,20000,200,50000,500
2,30000,300,60000,600

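One way a schema library could tell optional fields from required ones is by inspecting the type hints. The sketch below uses only the `typing` module. The names `ExampleSchema`, `is_optional`, and `missing_required` are hypothetical, the check compares exact column names (no regex aliases), and it is not `pandabear`'s implementation.

```python
from typing import Optional, Union, get_args, get_origin, get_type_hints

class ExampleSchema:
    """Hypothetical schema: only the annotations matter here."""
    spend: Optional[float]
    impressions: int
    clicks: int

def is_optional(annotation) -> bool:
    """True if the annotation is Optional[...], i.e. a Union that includes None."""
    return get_origin(annotation) is Union and type(None) in get_args(annotation)

def missing_required(schema, columns):
    """List required (non-Optional) fields that are absent from `columns`."""
    hints = get_type_hints(schema)
    return [
        name for name, annotation in hints.items()
        if not is_optional(annotation) and name not in columns
    ]

print(missing_required(ExampleSchema, ["impressions", "clicks"]))  # []
print(missing_required(ExampleSchema, ["spend", "clicks"]))        # ['impressions']
```

In the first call the absent `spend` field is tolerated because it is `Optional`. In the second, `impressions` is reported because it is required.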

### Handling unexpected columns
If a column is defined in the schema (and not marked `Optional`), it MUST be present in the dataframe. But what about columns that are present in the dataframe but not defined in the schema?

#### Strict mode
Unexpected columns in `df` will, by default, raise an error.

In [9]:
import pytest
from pandabear.exceptions import SchemaValidationError

# 1. Define dataframe schemas
# ---------------------------

class PlatformPerformanceMetrics(pb.DataFrameModel):
    """A model for platform performance data"""
    spend: float = pb.Field(ge=0, alias="spend_.+", regex=True)  # <--- LOOK HERE!
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    # clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)

    class Config:
        coerce = True

class AdPressureMetrics(pb.DataFrameModel):
    """A model for ad pressure data"""
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    # clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)  # <--- LOOK HERE!


# 2-3. Pass schemas to function and decorate with `check_schemas`
# ---------------------------------------------------------------

@pb.check_schemas
def drop_spend(df: pb.DataFrame[PlatformPerformanceMetrics]) -> pb.DataFrame[AdPressureMetrics]:
    """Drop columns that contain the substring 'spend'"""
    return df.drop(df.filter(regex='spend').columns, axis=1)

with pytest.raises(SchemaValidationError):
    drop_spend(df)  # df has 'clicks_.+' columns that the schema does not define

This raises a `SchemaValidationError`, saying that some "clicks_" columns are present in `df` but not in the schema. `strict=True` is the default, but if you want to allow unexpected columns, you can set `strict=False` in the `Config` object.

In [19]:
# 1. Define dataframe schemas
# ---------------------------

class PlatformPerformanceMetrics(pb.DataFrameModel):
    """A model for platform performance data"""
    spend: float = pb.Field(ge=0, alias="spend_.+", regex=True)  # <--- LOOK HERE!
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    # clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)

    class Config:
        coerce = True
        strict = False # <--- LOOK HERE!

class AdPressureMetrics(pb.DataFrameModel):
    """A model for ad pressure data"""
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    # clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)  # <--- LOOK HERE!

    class Config:
        strict = False # <--- LOOK HERE!


# 2-3. Pass schemas to function and decorate with `check_schemas`
# ---------------------------------------------------------------

@pb.check_schemas
def drop_spend(df: pb.DataFrame[PlatformPerformanceMetrics]) -> pb.DataFrame[AdPressureMetrics]:
    """Drop columns that contain the substring 'spend'"""
    return df.drop(df.filter(regex='spend').columns, axis=1)

drop_spend(df)  # unexpected 'clicks_.+' columns are now allowed

Unnamed: 0,impressions_fb,clicks_fb,impressions_gads,clicks_gads
0,10000,100,40000,400
1,20000,200,50000,500
2,30000,300,60000,600


#### Filter mode
Another way to deal with unexpected columns is to just remove them, so they never enter or exit the function. This is done by setting `filter=True` in the `Config` object.

In [20]:
# 1. Define dataframe schemas
# ---------------------------

class PlatformPerformanceMetrics(pb.DataFrameModel):
    """A model for platform performance data"""
    spend: float = pb.Field(ge=0, alias="spend_.+", regex=True)  # <--- LOOK HERE!
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    # clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)

    class Config:
        coerce = True
        filter = True # <--- LOOK HERE!

class AdPressureMetrics(pb.DataFrameModel):
    """A model for ad pressure data"""
    impressions: int = pb.Field(ge=0, alias="impressions_.+", regex=True)
    # clicks: int = pb.Field(ge=0, alias="clicks_.+", regex=True)  # <--- LOOK HERE!


# 2-3. Pass schemas to function and decorate with `check_schemas`
# ---------------------------------------------------------------

@pb.check_schemas
def drop_spend(df: pb.DataFrame[PlatformPerformanceMetrics]) -> pb.DataFrame[AdPressureMetrics]:
    """Drop columns that contain the substring 'spend'"""
    assert not df.filter(regex='clicks').columns.tolist(), "`pandabear` filtered these out in this example!"
    return df.drop(df.filter(regex='spend').columns, axis=1)

drop_spend(df)  # 'clicks_.+' columns are filtered out before the function body runs

Unnamed: 0,impressions_fb,impressions_gads
0,10000,40000
1,20000,50000
2,30000,60000


The `"clicks_*"` columns are filtered out before they enter the function, and so they are also not returned.

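The three ways of handling unexpected columns described in this section can be summarized as logic on column names alone. The following is a sketch under stated assumptions, not `pandabear`'s implementation: `resolve_columns` is a hypothetical helper, `filter_` mirrors the `filter` option in `Config`, `ValueError` stands in for `SchemaValidationError`, and filtering is assumed to take precedence over strictness (as the filter example above suggests, since it runs without error under the default `strict=True`).

```python
def resolve_columns(columns, expected, strict=True, filter_=False):
    """Decide which columns to keep, given the names the schema expects."""
    unexpected = [name for name in columns if name not in expected]
    if filter_:
        # Filter mode: silently drop anything the schema does not define.
        return [name for name in columns if name in expected]
    if strict and unexpected:
        # Strict mode: unexpected columns are an error.
        raise ValueError(f"Unexpected columns: {unexpected}")
    # Non-strict mode: unexpected columns pass through untouched.
    return list(columns)

columns = ["spend_fb", "impressions_fb", "clicks_fb"]
expected = {"spend_fb", "impressions_fb"}

print(resolve_columns(columns, expected, strict=False))  # ['spend_fb', 'impressions_fb', 'clicks_fb']
print(resolve_columns(columns, expected, filter_=True))  # ['spend_fb', 'impressions_fb']
try:
    resolve_columns(columns, expected)  # strict by default
except ValueError as error:
    print(error)  # Unexpected columns: ['clicks_fb']
```
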
## FAQ

**What is `pandabear`?**
> `pandabear` is a runtime pandas data- and schema validator. It is a lightweight alternative to [pandera](https://github.com/unionai-oss/pandera), with an almost identical API.

**What is `pandabear` not?**
> `pandabear` is not a statistical testing library. It is not a data validation library. It is not a type checker. It is not a linter. It is not a static validator. It is not a schema generator. It is not a data generator. It is not a data profiler. It is not a data transformer. It is not a data visualizer. It is not a data explorer.
>
> ... it just checks if your dataframes match their schemas.

**Why use it?**
> Because it has a really simple API that anyone can understand, and adds little to no complexity to code. And because it tells developers exactly what data is being passed around in a pipeline, without having to run the code.

**What is a schema?**
> A schema defines the structure of a pandas dataframe. It is a set of rules that a dataframe must follow in order to be considered valid. At the core of a schema definition is the `Field`. A `Field` is a single rule that a column—or group of columns—must follow. Here is an example of a `Field`:
> ```python
> clicks: int = Field(ge=0, le=100)
> ```
> This `Field` says that the column `clicks` must be an integer between 0 and 100. A schema typically has multiple such fields that define the structure of a dataframe. Fields can also match multiple columns, by passing a pattern as `alias` and setting `regex=True`:
> ```python
> clicks: int = Field(alias="clicks_.+", regex=True, ge=0, le=100)
> ```
> This `Field` says that all columns that match the regex `clicks_.+` must be integers between 0 and 100. This is a very powerful feature that allows you to define schemas for dataframes with many columns, without having to define a `Field` for each column.
>
> Here is an example of a simple schema:
> ```python
> class MySimpleSchema(pandabear.DataFrameModel):
>     spend: float = Field(ge=0)
>     clicks: int = Field(ge=0)
>     impressions: int = Field(ge=0)
>     conversions: int = Field(ge=0)
> ```
> This schema says that a dataframe must have the columns `spend`, `clicks`, `impressions`, and `conversions`, and that all these columns must be non-negative integers or floats.
>
> Here is an example of a more complex schema:
> ```python
> class MyComplexSchema(pandabear.DataFrameModel):
>     geo_zone: Index[str] = Field(unique=True)
>     date: Index[Datetime]  # yes, you can have multiple dataframe indexes!
>     spend: float = Field(alias="spend___.+", regex=True, ge=0)
>     clicks: int = Field(alias="clicks___.+", regex=True, ge=0)
>     impressions: int = Field(alias="impressions___.+", regex=True, ge=0)
>     conversions: int = Field(alias="conversions___.+", regex=True, ge=0)
>     cpa: Optional[float] = Field(alias="cpa___.+", regex=True, ge=0)
>     ctr: Optional[float] = Field(alias="ctr___.+", regex=True, ge=0, le=1)
>
>     @check('date')
>     def check_date_is_ordered(se: pd.Series) -> bool:
>         return se.is_monotonic_increasing
>   
>     @check('cpa')
>     def check_cpa_is_valid(se: pd.Series) -> bool:
>         return se.mean() < 1000
>
>     class Config:
>         filter = True
>         coerce = True
> ```
> This schema says that a dataframe must have the indexes `geo_zone` and `date`, and columns `spend`, `clicks`, `impressions`, `conversions`, `cpa`, and `ctr`. The indexes must be unique strings and datetimes, respectively. The columns must be non-negative integers or floats, except for `ctr` which must be between 0 and 1. Each column field is aliased to match multiple columns using regex (in `df` there may be "spend___fb", "spend___gads", etc.). The `cpa` and `ctr` columns are optional. Then there are custom checks saying the `date` index must be ordered and `cpa` must have a mean less than 1000.



**What is a runtime validator?**
> A runtime validator is called at runtime, meaning that it runs when the code is executed. This is in contrast to a static validator, which analyzes the code without executing it. For example, if one defines a schema check using `pandabear` like so:
> ```python
> @check_schemas
> def my_function(df: DataFrame[MySchema]):
>     ...
> my_function(df)
> ```
> and `df` fails the check, then an error is raised when `my_function(df)` is executed (not when the function is defined).

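The difference between definition time and call time can be illustrated with a small standard-library sketch. It is not `pandabear`'s implementation: `RequiresColumns` and `check_columns` are hypothetical stand-ins for a schema and for `check_schemas`, a list of column names stands in for a dataframe, and `ValueError` stands in for the library's own exception.

```python
import functools
import inspect

class RequiresColumns:
    """Hypothetical stand-in for a schema: a set of required column names."""

    def __init__(self, *names):
        self.names = set(names)

    def validate(self, columns):
        missing = self.names - set(columns)
        if missing:
            raise ValueError(f"Missing columns: {sorted(missing)}")

def check_columns(func):
    """Validate annotated arguments each time `func` is called."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            annotation = signature.parameters[name].annotation
            if isinstance(annotation, RequiresColumns):
                annotation.validate(value)
        return func(*args, **kwargs)

    return wrapper

# Defining the function runs no validation...
@check_columns
def count_columns(columns: RequiresColumns("clicks_fb", "clicks_gads")):
    return len(columns)

# ...the check only happens when the function is called.
print(count_columns(["clicks_fb", "clicks_gads"]))  # 2
try:
    count_columns(["clicks_fb"])
except ValueError as error:
    print(error)  # Missing columns: ['clicks_gads']
```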

**What does it support?**
> You can specify schemas to match *nearly* any form of `pd.DataFrame` data. It supports specifying:
> * Column names (optionally with regex to match multiple columns)
> * Column dtypes
> * Optional columns
> * Indexes (multiple index levels are also supported)
> * Index data types
> * Optional indexes
> * Simple statistical checks (e.g. `ge`, `le`, `gt`, `lt`, `eq`, `ne`)
> * Custom column checks (e.g. `lambda x: x.mean() > 999`)
> * Custom dataframe checks (e.g. `lambda df: (df["spend"] / df["clicks"]).mean() < 1000`)
> * Coercing dtypes
> * Filtering columns (at runtime, remove `df` columns that are not in the schema)
> * All the above, but for `pd.Series`.
> * more ...

**What does it not support?**
> * Multi-index columns. Fringe, too complicated. Let us know if you need it!
> 

**Why not just use `pandera`?**
>Because `pandera` is really an extensive statistical testing library. Runtime schema validation is just a minor feature. Also it has some problems:
>* Single core developer
>* 4,249 dependencies [as of 2023-10-23]
>* 263 open issues (97 bugs) [as of 2023-10-23]
>* It doesn't play nice with other runtime type checkers

**Why not just use `pydantic`?**
>Because `pydantic` is a data validation library, not a pandas dataframe validation library. It has a more versatile API and is not as easy to use as `pandabear`.

**How do I use it in practice?**
> It's up to you. We (Pierre and Ulf) typically define schemas in a `schemas.py` file and import them where necessary. Then we decorate functions with `@check_schemas` and pass the schemas as type hints. You can also define schemas in the same file as the function, or even inline in the function definition (but don't, ok).
>
> We designed it to have as minimal a footprint as possible. But you will have to add the `@check_schemas` decorators in your code, and you will have to pass the schemas as type hints. But that's it. No other changes to your code are necessary.

**Does it work with other runtime type checkers?**
> Yes! At least it should. The trick we use to enable this is to let `pb.DataFrame[MySchema]` evaluate to `pd.DataFrame | MySchema`, so that `isinstance(df, pb.DataFrame[MySchema]) == True`. We have tested it with `beartype` (our favorite). But it should work with `pydantic`, `dataclasses`, `typing`, etc. If you find a bug, please let us know!