# As a Data Scientist / ML Engineer, how do I make my pipeline robust to data errors?
(Side benefit: can I annotate my dataframe?)

```mermaid
graph TD;

A[Raw Data] -->|Data Corruption|B[Data Cleaning]
B -->|Incorrect implementation| C[Data Preprocessing]
C --> D[Feature Selection]
D --> E[Model Training or prediction]

subgraph "Data Validation Steps"
    B -- Data Validation --> BA[Check for known corruption]
    C -- Data Validation --> CA[Validate data function]
end
 
                    
```

## What is Data validation?
### __It is the act of falsifying data against explicit assumptions for some downstream purpose, like analysis, modeling, and visualization__
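To make the definition concrete before any library is introduced, here is a minimal sketch in plain pandas. The `validate_prices` function, the column names, and the allowed values are hypothetical and chosen only for illustration; each `if` encodes one explicit assumption that the data can fail.

```python
import pandas as pd


def validate_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Return `df` unchanged, or raise ValueError listing violated assumptions."""
    problems = []
    # Assumption 1 (hypothetical): only known items appear.
    if not df["item"].isin(["apple", "orange"]).all():
        problems.append("unexpected value in 'item'")
    # Assumption 2 (hypothetical): prices are strictly positive.
    if not (df["price"] > 0).all():
        problems.append("non-positive value in 'price'")
    if problems:
        raise ValueError("; ".join(problems))
    return df


clean = pd.DataFrame({"item": ["apple", "orange"], "price": [0.5, 0.75]})
corrupt = pd.DataFrame({"item": ["applee", "orange"], "price": [0.5, -1000.0]})

validate_prices(clean)  # no exception: every assumption holds

try:
    validate_prices(corrupt)
    message = ""
except ValueError as err:
    message = str(err)
print(message)
```

Hand-written checks like these tend to grow quickly and are easy to forget to update, which is one reason to consider a dedicated schema library.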

## Benefits:
- Faster debugging, so you can focus on what really matters: modeling and data analysis
- Test assumptions about our data
- By validating and annotating data, data documentation becomes an artifact

# Pandera
A data testing and statistical typing library for DS/ML-oriented data containers.

### Schemas
With `pandera`, we can create schemas, which specify types for dataframe-like
objects. We can then use these schemas to assert properties about data at runtime
and try parsing it into a desired state.

Suppose you're working with a transactions dataset of grocery `item`s and their
associated `price`s. We can state our assumptions about the data upfront by
defining a `Schema`.

In [1]:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class PriceSchema(pa.DataFrameModel):
    item: Series[str] = pa.Field(isin=["apple", "orange"], coerce=True, description="item bought by the customers in one transaction ")
    price: Series[float] = pa.Field(gt=0, coerce=True, description= "The price paid by the customers")

You can see that the `PriceSchema` class inherits from [`pandera.DataFrameModel`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.model.DataFrameModel.html#pandera.api.pandas.model.DataFrameModel),
and defines two fields: `item` and `price`.

`pandera` gives you a flexible and concise way to specify not only the datatype of
each column but also other properties, such as set membership with `isin=...` and value ranges with `gt=...`.

Setting `coerce=True` will cause pandera to parse the columns into the expected data types, giving you the ability
to ensure that data flowing through your pipeline is of the expected type.

### Runtime and inline checks
We can now use the `PriceSchema` class either to validate data passing through a function or to validate it inline, so validation can be placed wherever it fits best in the pipeline.

In [2]:
valid_data = pd.DataFrame.from_records([
    {"item": "apple", "price": 0.5},
    {"item": "orange", "price": 0.75}
])
invalid_data = pd.DataFrame.from_records([
    {"item": "applee", "price": 0.5},
    {"item": "orange", "price": -1000}
])

### Runtime validation

In [3]:
@pa.check_types(lazy=True)
def transform_data(data: DataFrame[PriceSchema]):
    """Validate `data` against `PriceSchema` before the function body runs.

    The `@pa.check_types` decorator reads the `data: DataFrame[PriceSchema]`
    annotation and validates dataframe inputs at runtime. With `lazy=True`,
    all failures are collected and raised together as `SchemaErrors`.
    """
    ...

In [4]:
transform_data(valid_data) ## passed

In [5]:
try:
    transform_data(invalid_data)
except pa.errors.SchemaErrors as exc:
    # `exc.failure_cases` is a dataframe with metadata about each
    # failure case that occurred when validating the data.
    display(exc.failure_cases)

| | schema_context | column | check | check_number | failure_case | index |
|---|---|---|---|---|---|---|
| 0 | Column | item | isin(['apple', 'orange']) | 0 | applee | 0 |
| 1 | Column | price | greater_than(0) | 0 | -1000.0 | 1 |


### Inline validation

In [6]:
PriceSchema.validate(valid_data)

| | item | price |
|---|---|---|
| 0 | apple | 0.5 |
| 1 | orange | 0.75 |


## Schemas as Data Quality Checkpoints

With `pandera`, you can use inheritance to indicate the changes to
a dataframe's contents that a function has to implement.

In [7]:
class TransformedSchema(PriceSchema):
    expiry: Series[pd.Timestamp] = pa.Field(coerce=True)

In [8]:
from datetime import datetime
from typing import List


@pa.check_types(lazy=True)
def transform_data(
    data: DataFrame[PriceSchema], ## old schema
    expiry: List[datetime],
) -> DataFrame[TransformedSchema]: ## the new schema 
    return data.assign(expiry=expiry)


transform_data(valid_data, [datetime.now()] * valid_data.shape[0])

| | item | price | expiry |
|---|---|---|---|
| 0 | apple | 0.5 | 2023-07-20 15:20:04.253904 |
| 1 | orange | 0.75 | 2023-07-20 15:20:04.253904 |


Now every time we call the `transform_data` function, not only is the
`data` input argument validated, but the output dataframe is validated
against `TransformedSchema`.

This means that you can catch bugs in your data transformation code
more easily:

In [9]:
@pa.check_types(lazy=True)
def transform_data(
    data: DataFrame[PriceSchema],
    expiry: List[datetime],
) -> DataFrame[TransformedSchema]:
    return data.assign(expiryy=expiry)  # typo bug: 🐛


try:
    transform_data(valid_data, [datetime.now()] * valid_data.shape[0])
except pa.errors.SchemaErrors as exc:
    display(exc.failure_cases)

| | schema_context | column | check | check_number | failure_case | index |
|---|---|---|---|---|---|---|
| 0 | DataFrameSchema | | column_in_dataframe | | expiry | |


## Built-in and custom checks
Checks are one of the fundamental constructs of pandera.
They allow you to specify properties of dataframes, columns, indexes, and series objects. Checks run after data type validation/coercion and the core pandera checks have been applied to the data being validated.

In [10]:
## rich built-in checks for common validation tasks
import pandera as pa
from pandera import Column, Check, DataFrameSchema

schema = DataFrameSchema({
    "small_values": Column(float, Check.less_than(100)),
    "one_to_three": Column(int, Check.isin([1, 2, 3])),
    "phone_number": Column(str, Check.str_matches(r'^[a-z0-9-]+$')),
})

In [11]:
## Custom checks
from typing import Dict
from pandera.typing import Series

class GroupbyCheckSchema(pa.DataFrameModel):

    value: Series[int] = pa.Field(gt=0, coerce=True)
    group: Series[str] = pa.Field(isin=["A", "B"])

    @pa.check("value", groupby="group",  name="check_means")
    def check_groupby(cls, grouped_value: Dict[str, Series[int]]) -> bool:
        return grouped_value["A"].mean() < grouped_value["B"].mean()

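# Group "A" has the larger mean here, so the `check_means` check fails.
bad_df = pd.DataFrame({
    "value": [100, 110, 120, 10, 11, 10],
    "group": list("AAABBB"),
})
# Group "A" has the smaller mean here, so this dataframe satisfies the check.
df = pd.DataFrame({
    "value": [100, 110, 120, 10, 11, 10],
    "group": list("BBBAAA"),
})
try:
    GroupbyCheckSchema.validate(bad_df)
except pa.errors.SchemaError as exc:
    display(exc.failure_cases)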

| | index | failure_case |
|---|---|---|
| 0 | | False |


## Integration with Hypothesis (fake data for unit testing)

Think of a normal unit test as being something like the following:

- Set up some data.

- Perform some operations on the data.

- Assert something about the result.

Hypothesis lets you write tests which instead look like this:

- For all data matching some specification.

- Perform some operations on the data.

- Assert something about the result.
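The two styles can be put side by side in a minimal sketch that uses Hypothesis on its own, without pandera. The sorting property is only an illustrative stand-in for a real data transformation.

```python
from hypothesis import given, strategies as st


def test_sorting_one_example():
    # Classic unit test: one hand-picked input.
    assert sorted([3, 1, 2]) == [1, 2, 3]


@given(st.lists(st.integers()))
def test_sorting_any_list(values):
    # Property-based test: must hold for every generated list of integers.
    result = sorted(values)
    assert len(result) == len(values)
    assert all(a <= b for a, b in zip(result, result[1:]))


test_sorting_one_example()
test_sorting_any_list()  # Hypothesis generates and runs many inputs
```

The specification here is `st.lists(st.integers())`; with pandera, a schema plays that role for dataframes.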

In [12]:
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int, pa.Check.eq(10)),
        "column2": pa.Column(float, pa.Check.le(0.25)),
        "column3": pa.Column(str, pa.Check.eq("foo")),
    }
)
print(schema.example(size=3))

   column1        column2 column3
0       10 -8.107493e+264     foo
1       10 -8.107493e+264     foo
2       10  -6.993283e+16     foo


In [13]:
import hypothesis

out_schema = schema.add_columns({"column4": pa.Column(float)})

@pa.check_output(out_schema)
def processing_fn(df):
    return df.assign(column4=df.column1 * df.column2)

@hypothesis.given(schema.strategy(size=5))
def test_processing_fn(dataframe):
    processing_fn(dataframe) 