# Data validation
   
    
This use case covers validating data before model training.  
Validation scenarios vary widely, and Cascade provides building blocks for implementing most of them.  

In [1]:
import numpy as np
import cascade.data as cdd
import cascade.meta as cde

In [2]:
from cascade.utils.pa_schema_validator import PaSchemaValidator
from cascade.utils.tables import TableDataset

In [3]:
import cascade
cascade.__version__

'0.11.0'

## Validation in general
Cascade provides basic validation building blocks as well as solutions for specific kinds of data. This section covers the general case.  
Given a dataset, you can validate either each element one by one or the dataset as a whole. For these purposes Cascade has the `PredicateValidator` and `AggregateValidator` classes. Let's see them in a real example.
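
As a rough illustration of the two modes, here is a minimal plain-Python sketch that does not use Cascade. The helpers `check_each` and `check_whole` are hypothetical names invented for this example and are not part of Cascade's API.

```python
from typing import Callable, Sequence


def check_each(items: Sequence, predicate: Callable) -> list:
    """Return the indices of items for which the predicate is False."""
    return [i for i, item in enumerate(items) if not predicate(item)]


def check_whole(items: Sequence, predicate: Callable) -> bool:
    """Apply a single predicate to the dataset as a whole."""
    return bool(predicate(items))


# Toy dataset of (features, label) pairs; the last item has a value above 16
toy_ds = [([0.0, 5.0, 13.0], 0), ([1.0, 16.0, 2.0], 1), ([3.0, 4.0, 20.0], 2)]

bad_items = check_each(toy_ds, lambda x: max(x[0]) <= 16)
big_enough = check_whole(toy_ds, lambda ds: len(ds) >= 3)
print(bad_items, big_enough)  # [2] True
```

The per-item check reports where the problem is, while the whole-dataset check answers a single yes-or-no question about the collection.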

### When everything is OK!

Let's load the data. Tabular datasets are covered in a later section; for now, let's load an optical character recognition dataset to demonstrate the validation features.
  
  
  

In [4]:
from sklearn.datasets import load_digits

data = load_digits()


We need to encapsulate the data using Cascade's default `Wrapper` to be able to use it later.  

In [5]:
digits_ds = cdd.Wrapper([(item, label) for item, label in zip(data['data'], data['target'])])

    
  
This dataset will give tuples of data and labels, which will be useful for training the model.
  

In [6]:
digits_ds[0]

(array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
        15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
        12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
         0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
        10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.]),
 0)

To validate the data we need to state our assumptions about it. Let's see the boundaries of values in images of digits.

In [7]:
np.percentile(data['data'], [0, 50, 100])

array([ 0.,  1., 16.])

We see that the values are never lower than 0 or higher than 16. Suppose we don't want future data to fall outside these boundaries, so that new values cannot silently break our pipeline.  
Let's apply `PredicateValidator`. We pass our dataset and a callable that returns a boolean value. If every item in the dataset passes the check, no exception is raised.

In [8]:
check_of_data = lambda x: x[0].max() <= 16 and x[0].min() >= 0
cde.PredicateValidator(digits_ds, check_of_data)

                            

OK!




cascade.meta.validator.PredicateValidator

The next check is stricter: we don't want to add extra labels by mistake, so let's check the labels in the same way.

In [9]:
check_of_label = lambda x: x[1] >= 0 and x[1] < 10
cde.PredicateValidator(digits_ds, check_of_label)

                            

OK!




cascade.meta.validator.PredicateValidator

Validators are simple `Modifiers` that apply no transformation to the dataset, so they can be chained.

In [10]:
validated_digits_ds = cde.PredicateValidator(digits_ds, check_of_data)
validated_digits_ds = cde.PredicateValidator(validated_digits_ds, check_of_label)

                            

OK!


                            

OK!




But if we chain two Validators, does that mean we iterate over the whole dataset twice? Yes, it does, but there is a solution. If your dataset is too big to iterate over multiple times, you can pass all your checks to a single Validator.

In [11]:
validated_digits_ds = cde.PredicateValidator(digits_ds, [check_of_data, check_of_label])

                            

OK!
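
To make the single-pass idea concrete, here is a plain-Python sketch, not Cascade's actual implementation. The function `run_checks` is a hypothetical helper: it walks the data once, applies every check to each item, and records which check positions failed on which items.

```python
def run_checks(items, checks):
    """Iterate once; return {check position: [indices of failing items]}."""
    failed = {}
    for idx, item in enumerate(items):
        for pos, check in enumerate(checks):
            if not check(item):
                failed.setdefault(pos, []).append(idx)
    return failed


# Toy (features, label) pairs: item 1 breaks the value range, item 2 the label range
toy_ds = [([0.0, 16.0], 3), ([2.0, 17.0], 9), ([1.0, 4.0], 12)]
checks = [
    lambda x: min(x[0]) >= 0 and max(x[0]) <= 16,  # position 0: value range
    lambda x: 0 <= x[1] < 10,                      # position 1: label range
]

failed = run_checks(toy_ds, checks)
print(failed)  # {0: [1], 1: [2]}
```

Keying the result by check position keeps the report readable even when the checks are anonymous lambdas.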




Now it's time to check the dataset as a whole. To demonstrate the mechanics, let's check that the dataset is big enough. This time our callable accepts the whole dataset and still returns a boolean.

In [12]:
cde.AggregateValidator(digits_ds, lambda ds: len(ds) > 1000)

OK!


cascade.meta.validator.AggregateValidator

But what if we want to check that our dataset (or pipeline) is **the same across different runs**? Can we do this without specifying each parameter in an `AggregateValidator`?  
Cascade has a special solution for this: `MetaValidator`.  
`MetaValidator` works as follows. During the first run it saves metadata into the `./.cascade` folder. In subsequent runs it checks whether any fields in the meta have changed and raises an exception if they have.

In [13]:
cde.MetaValidator(digits_ds, meta_fmt='.yml')

OK!


cascade.meta.meta_validator.MetaValidator

Let's see what values were saved by the validator.

In [14]:
digits_ds.get_meta()

[{'name': 'cascade.data.dataset.Wrapper',
  'type': 'dataset',
  'len': 1797,
  'obj_type': "<class 'list'>"}]

It is not much, but we can add anything we want to the meta before the first run of the validator, and it will be recorded and checked.  
Now let's simulate a second run.

In [15]:
cde.MetaValidator(digits_ds, meta_fmt='.yml')

OK!


cascade.meta.meta_validator.MetaValidator

Everything is OK: the meta is unchanged. But what if we change the pipeline? `MetaValidator` tracks each unique pipeline separately. If we add a new stage, it will make another record and validate against it in the future. To identify pipelines, it uses the list of dataset names.
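
The save-then-compare idea can be sketched in plain Python. This is an illustration under assumptions, not Cascade's implementation: the `validate_meta` helper, the JSON storage format, the temporary directory and the shortened stage name are all made up for the example.

```python
import json
import tempfile
from pathlib import Path


def validate_meta(meta: list, store_dir: Path) -> bool:
    """Save meta on the first call; compare with the saved copy afterwards."""
    # Assumption: a pipeline is identified by the names of its stages
    key = "_".join(stage["name"] for stage in meta)
    path = store_dir / f"{key}.json"
    if not path.exists():
        path.write_text(json.dumps(meta))  # first run: just record the meta
        return True
    saved = json.loads(path.read_text())
    if saved != meta:
        raise ValueError(f"Meta changed: saved {saved}, got {meta}")
    return True


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    meta = [{"name": "Wrapper", "type": "dataset", "len": 1797}]
    first_run = validate_meta(meta, store)   # records the meta
    second_run = validate_meta(meta, store)  # compares, finds no change

    changed = [{"name": "Wrapper", "type": "dataset", "len": 1000}]
    try:
        validate_meta(changed, store)
        detected = False
    except ValueError:
        detected = True

print(first_run, second_run, detected)  # True True True
```

Because the storage key is built from the stage names, a pipeline with an extra stage would get its own record instead of failing against the old one.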

### When everything is not OK
What if our assumptions turn out to be false because of some error? Validators raise `cascade.meta.DataValidationException` with a detailed description of what went wrong, where possible. Let's see how it works.

For the sake of the experiment, let's suppose that we don't want to see zeros in the labels. Let's check for that.

In [16]:
cde.PredicateValidator(digits_ds, lambda x: x[1] != 0)

                            

DataValidationException: Checks in positions [0] failed
Items failed by check:
0: 0, 10, 20, 30, 36 ... 1739, 1745, 1746, 1768, 1793

The exception lists the items that caused the error, which can help to identify the problem.

In [17]:
cde.AggregateValidator(digits_ds, lambda ds: len(ds) < 1000)

DataValidationException: Checks in positions [0] failed

The exceptions provide information about what went wrong.

Now let's simulate a change in the dataset's metadata. For example, let's manually change its length and see how `MetaValidator` reacts.

In [18]:
digits_ds._data = digits_ds._data[:1000]
digits_ds.get_meta()

[{'name': 'cascade.data.dataset.Wrapper',
  'type': 'dataset',
  'len': 1000,
  'obj_type': "<class 'list'>"}]

In [19]:
cde.MetaValidator(digits_ds, meta_fmt='.yml')

Value of root[0]['len'] changed from 1797 to 1000.


DataValidationException: {'values_changed': {"root[0]['len']": {'new_value': 1000, 'old_value': 1797}}}

It provides a detailed description of which values changed and how.
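
A report of this shape can be produced by a recursive comparison of the saved and current meta. The sketch below is an assumption about how such a diff could be computed, not Cascade's code; `diff_values` is a hypothetical helper, and it only reports changed values, ignoring added or removed keys.

```python
def diff_values(old, new, path="root"):
    """Recursively collect changed values from nested dicts and lists."""
    changes = {}
    if isinstance(old, dict) and isinstance(new, dict):
        # Only keys present on both sides are compared in this sketch
        for key in sorted(old.keys() & new.keys()):
            changes.update(diff_values(old[key], new[key], f"{path}[{key!r}]"))
    elif isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        for i, (o, n) in enumerate(zip(old, new)):
            changes.update(diff_values(o, n, f"{path}[{i}]"))
    elif old != new:
        changes[path] = {"new_value": new, "old_value": old}
    return changes


old_meta = [{"name": "cascade.data.dataset.Wrapper", "type": "dataset", "len": 1797}]
new_meta = [{"name": "cascade.data.dataset.Wrapper", "type": "dataset", "len": 1000}]

report = diff_values(old_meta, new_meta)
print(report)
# {"root[0]['len']": {'new_value': 1000, 'old_value': 1797}}
```

Each key in the report is a path into the meta, so a changed field can be located without printing the whole structure.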

## Validation of tables
Validation of tabular data is a more specific case with more developed tooling. For this purpose Cascade can use existing solutions wrapped in the familiar form of a `Validator`.  
Now let's load tabular data for this section.

In [20]:
from sklearn.datasets import load_iris

data = load_iris()

In [21]:
import pandas as pd

df = pd.DataFrame(data['data'], columns=data['feature_names'])

We will use `TableDataset`, a special container for `pandas.DataFrame`s in Cascade.

In [22]:
iris_ds = TableDataset(t=df)
iris_ds

cascade.utils.tables.TableDataset
      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns]

  
For tabular data validation Cascade uses Pandera. The workflow is simple: you define the schema of the table and the checks that should be made. Then you run `PaSchemaValidator`, and that's all!  
For the documentation of Pandera's classes, please see: [pandera docs](https://pandera.readthedocs.io/en/stable/index.html).
  

In [23]:
import pandera as pa

We add all columns and check that all values are greater than zero.

In [24]:
schema = pa.DataFrameSchema({
    "sepal length (cm)": pa.Column(float, checks=pa.Check.gt(0)),
    "sepal width (cm)": pa.Column(float, checks=pa.Check.gt(0)),
    "petal length (cm)": pa.Column(float, checks=pa.Check.gt(0)),
    "petal width (cm)": pa.Column(float, checks=pa.Check.gt(0)),
})

In [25]:
PaSchemaValidator(iris_ds, schema)

OK!


cascade.utils.pa_schema_validator.PaSchemaValidator

For future use, we can save the schema to YAML and pass the file path to the Validator.

In [26]:
schema.to_yaml('./iris_schema.yml')

In [27]:
PaSchemaValidator(iris_ds, './iris_schema.yml')

OK!


cascade.utils.pa_schema_validator.PaSchemaValidator

Let's manually violate our assumption and see what happens.

In [28]:
iris_ds._table['sepal length (cm)'] *= -1

In [29]:
PaSchemaValidator(iris_ds, './iris_schema.yml')

DataValidationException: <Schema Column(name=sepal length (cm), type=DataType(float64))> failed element-wise validator 0:
<Check greater_than: greater_than(0)>
failure cases:
     index  failure_case
0        0          -5.1
1        1          -4.9
2        2          -4.7
3        3          -4.6
4        4          -5.0
..     ...           ...
145    145          -6.7
146    146          -6.3
147    147          -6.5
148    148          -6.2
149    149          -5.9

[150 rows x 2 columns]

We get a large traceback that shows which values violate our assumptions.
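
The failure cases table pairs each offending row index with the value that broke the check. Here is a plain-Python sketch of that idea, independent of Pandera and Cascade; the `failure_cases` helper and the two-row toy table are invented for the example.

```python
def failure_cases(rows, column_checks):
    """Return (row index, column, value) for every value failing its check."""
    cases = []
    for idx, row in enumerate(rows):
        for column, check in column_checks.items():
            if not check(row[column]):
                cases.append((idx, column, row[column]))
    return cases


def greater_than_zero(value):
    """Column-level check, analogous in spirit to a 'greater than 0' rule."""
    return value > 0


# Toy table as a list of row dicts; the second row has a negated sepal length
rows = [
    {"sepal length (cm)": 5.1, "sepal width (cm)": 3.5},
    {"sepal length (cm)": -4.9, "sepal width (cm)": 3.0},
]
column_checks = {
    "sepal length (cm)": greater_than_zero,
    "sepal width (cm)": greater_than_zero,
}

cases = failure_cases(rows, column_checks)
print(cases)  # [(1, 'sepal length (cm)', -4.9)]
```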

Data validation is an important part of any established ML pipeline. Simple checks can speed up problem identification. With Cascade you can develop your own dataset checks and reuse existing solutions.

## See also:
- [Documentation](https://oxid15.github.io/cascade/)
- [Key concepts](https://oxid15.github.io/cascade/concepts.html)
- [Code reference](https://oxid15.github.io/cascade/modules.html)