# Chain ``Preprocessor`` class overview

In this tutorial, we look at the sequential ``Preprocessor`` class, whose methods wrap most of the data processing classes implemented in *Ambrosia*.

To demonstrate the capabilities of the class, we will use synthetic data on the time spent by users on video and audio content.

In [1]:
import sys, os
sys.path.insert(1, os.path.realpath(os.path.pardir))

In [2]:
import numpy as np
import pandas as pd

from ambrosia.preprocessing import Preprocessor

Load data

In [3]:
data = pd.read_csv('../tests/test_data/pipeline_test.csv')

This is daily per-user data covering a period of one week.

In [4]:
data.head()

Unnamed: 0,id,gender,watched,audio,day,platform
0,0,Male,7.912889,2.210973,1,web
1,1,Male,6.67869,0.020715,1,ios
2,2,Female,721.434299,59.99687,1,ios
3,3,Male,135.248218,18.982887,1,ios
4,4,Female,38.962917,8.324667,1,android


The ``Preprocessor`` class allows one to create custom sequential pipelines that include the steps of data aggregation, outlier removal, and metric transformation. These pipelines can be saved and loaded from files, making them suitable for ongoing data processing.

Let's create a class instance and pass data to it

In [14]:
preprocessor = Preprocessor(dataframe=data, verbose=True)

Now we will apply a number of preprocessing steps: aggregation, outlier removal, and the CUPED metric transformation for variance reduction.

For almost all of the individual data processing classes in *Ambrosia*, the ``Preprocessor`` class has a corresponding method. Check the class documentation to find out their aliases and capabilities.

In [9]:
### Set detailed aggregation parameters
agg_params = {
    'watched': 'sum',
    'audio': 'sum',
    'gender': 'simple',  # simple - choose the first possible value
    'platform': 'mode'
}
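To make the aggregation parameters above more concrete, here is a rough plain-pandas sketch of a comparable per-user aggregation. It assumes that ``'simple'`` means "take the first value in the group" (as the comment above states) and that ``'mode'`` means "take the most frequent value". The toy data and the helper names `first_value` and `most_frequent` are hypothetical, and this is not Ambrosia's implementation.

```python
import pandas as pd

# Toy daily data: two users with several rows each (values are made up).
toy = pd.DataFrame({
    "id": [0, 0, 0, 1, 1],
    "watched": [1.0, 2.0, 3.0, 10.0, 20.0],
    "audio": [0.5, 0.5, 1.0, 2.0, 3.0],
    "gender": ["Male", "Male", "Male", "Female", "Female"],
    "platform": ["web", "ios", "web", "android", "android"],
})


def first_value(series: pd.Series):
    """Assumed meaning of 'simple': the first value in the group."""
    return series.iloc[0]


def most_frequent(series: pd.Series):
    """Assumed meaning of 'mode': the most frequent value in the group."""
    return series.mode().iloc[0]


# One row per user: sums for the metrics, a single label for the categories.
per_user = toy.groupby("id", as_index=False).agg(
    watched=("watched", "sum"),
    audio=("audio", "sum"),
    gender=("gender", first_value),
    platform=("platform", most_frequent),
)
print(per_user)
```

The result has one row per ``id``, which is the shape the later outlier-removal and CUPED steps operate on.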

In [15]:
processed_data = preprocessor.aggregate(groupby_columns='id', agg_params=agg_params)\
                  .robust(['watched', 'audio'], alpha=0.01, tail='right') \
                  .cuped('watched', by='audio', transformed_name='watched_cuped') \
                  .data()

ambrosia LOGGER: Making right-tail robust transformation of columns ['watched', 'audio']
                 with alphas = [0.01 0.01]
ambrosia LOGGER: 

ambrosia LOGGER: Change Mean watched: 5343.8899 ===> 5170.2892
ambrosia LOGGER: Change Variance watched: 10951522.1717 ===> 8739833.1681
ambrosia LOGGER: Change IQR watched: 3958.8107 ===> 3856.7420
ambrosia LOGGER: Change Range watched: 35983.1570 ===> 15681.7113
ambrosia LOGGER: 

ambrosia LOGGER: Change Mean audio: 350.3962 ===> 344.7125
ambrosia LOGGER: Change Variance audio: 17724.3973 ===> 15469.6160
ambrosia LOGGER: Change IQR audio: 176.0167 ===> 172.6091
ambrosia LOGGER: Change Range audio: 1098.9677 ===> 683.7463
ambrosia LOGGER: After transformation СUPED for watched, the variance is 7.8360 % of the original
ambrosia LOGGER: Variance transformation 8739833.1681 ===> 684853.6668
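The log above reports a right-tail robust step followed by CUPED. As a rough illustration of what such steps compute, here is a minimal sketch in NumPy and pandas. It assumes that the robust step drops rows above the (1 - alpha) quantile of each listed column, and that CUPED uses the textbook adjustment Y - theta * (X - mean(X)) with theta = cov(Y, X) / var(X). The helper names `trim_right_tail` and `cuped_adjust` and the synthetic data are hypothetical, and Ambrosia's internals may differ.

```python
import numpy as np
import pandas as pd


def trim_right_tail(df, columns, alpha=0.01):
    """Drop rows whose value exceeds the (1 - alpha) quantile in any listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        keep &= df[col] <= df[col].quantile(1 - alpha)
    return df[keep]


def cuped_adjust(target, covariate):
    """Return the CUPED-adjusted target and the fitted theta."""
    target = np.asarray(target, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    # theta = cov(Y, X) / var(X), both with the unbiased (ddof=1) estimator
    theta = np.cov(target, covariate)[0, 1] / np.var(covariate, ddof=1)
    return target - theta * (covariate - covariate.mean()), theta


# Synthetic, correlated metrics (made-up numbers, not the tutorial data).
rng = np.random.default_rng(0)
audio = rng.gamma(shape=2.0, scale=100.0, size=5000)
watched = 15.0 * audio + rng.normal(0.0, 300.0, size=5000)
toy = pd.DataFrame({"watched": watched, "audio": audio})

trimmed = trim_right_tail(toy, ["watched", "audio"], alpha=0.01)
adjusted, theta = cuped_adjust(trimmed["watched"], trimmed["audio"])

ratio = adjusted.var(ddof=1) / trimmed["watched"].var(ddof=1)
print(f"rows kept: {len(trimmed)} of {len(toy)}")
print(f"variance after CUPED is {100 * ratio:.2f} % of the trimmed variance")
```

The adjustment leaves the mean of the metric unchanged and shrinks its variance in proportion to how strongly the covariate correlates with it, which is why a highly correlated covariate such as ``audio`` can give a large reduction.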


Note that the final ``data()`` call returns the resulting data frame.
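The chaining pattern used above can be illustrated with a small stand-in class. This is only a sketch of the general idea: each step returns the object itself so that calls can be chained, and a terminal ``data()`` call hands back the frame. The `MiniPipeline` class and its `scale` and `clip_upper` steps are hypothetical and are not part of Ambrosia.

```python
import pandas as pd


class MiniPipeline:
    """Hypothetical stand-in that illustrates method chaining; not Ambrosia code."""

    def __init__(self, dataframe: pd.DataFrame):
        self._df = dataframe.copy()  # work on a copy, leave the input untouched
        self._steps = []

    def scale(self, column: str, factor: float) -> "MiniPipeline":
        self._df[column] = self._df[column] * factor
        self._steps.append(("scale", column, factor))
        return self  # returning self is what allows the next call to chain

    def clip_upper(self, column: str, upper: float) -> "MiniPipeline":
        self._df[column] = self._df[column].clip(upper=upper)
        self._steps.append(("clip_upper", column, upper))
        return self

    def data(self) -> pd.DataFrame:
        """Terminal call: return the processed frame instead of the pipeline."""
        return self._df

    def transformations(self) -> list:
        """Return the recorded steps in the order they were applied."""
        return list(self._steps)


frame = pd.DataFrame({"watched": [1.0, 5.0, 50.0]})
pipeline = MiniPipeline(frame)
result = pipeline.scale("watched", 2.0).clip_upper("watched", 20.0).data()
print(result)
print(pipeline.transformations())
```

Without the terminal call, the chain would evaluate to the pipeline object rather than to a data frame.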

In [19]:
processed_data.head()

Unnamed: 0,id,watched,audio,gender,platform,watched_cuped
0,0,2489.224016,213.81713,Male,web,5476.097797
1,1,3970.775664,281.958297,Male,ios,5402.751034
2,2,5900.186483,416.94415,Female,ios,4251.949148
3,3,5557.860998,384.78201,Male,web,4643.524511
4,4,7588.37499,448.263748,Female,android,5225.462582


The ``transformations()`` method returns a list of all applied transformations. The parameters of these transformations were fitted when the corresponding methods were executed.

In [21]:
preprocessor.transformations()

[<ambrosia.preprocessing.aggregate.AggregatePreprocessor at 0x134dad6d0>,
 <ambrosia.preprocessing.robust.RobustPreprocessor at 0x1347cea30>,
 <ambrosia.preprocessing.cuped.Cuped at 0x1349a2850>]

For many scenarios, it is useful to store executed transformations with fitted parameters for future use. \
For example, we may have some continuous batch data that we would like to transform, or we are waiting for some A/B test to finish and we need to process the data with the same pre-experimental parameters.

For this, the ``Preprocessor`` has two methods that save and load fitted transformations: ``store_transformations()`` and ``transform_from_config()``

First, let's store them

In [25]:
store_path = '_examples_configs/preprocessor.json'

In [26]:
preprocessor.store_transformations(store_path=store_path)
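To illustrate why storing fitted parameters is useful, here is a generic sketch of the fit, save, load, and apply cycle for a single CUPED-style adjustment, using only the standard ``json`` module. The JSON layout, the file name, and the `fit_cuped` and `apply_cuped` helpers are assumptions made for illustration. The format that ``store_transformations()`` actually writes is not shown in this tutorial and may look different.

```python
import json
import os
import tempfile

import numpy as np


def fit_cuped(target, covariate):
    """Fit CUPED parameters on historical data and return them as plain floats."""
    target = np.asarray(target, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    theta = np.cov(target, covariate)[0, 1] / np.var(covariate, ddof=1)
    return {"theta": float(theta), "covariate_mean": float(covariate.mean())}


def apply_cuped(target, covariate, params):
    """Apply previously fitted parameters to a new batch without refitting."""
    target = np.asarray(target, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    return target - params["theta"] * (covariate - params["covariate_mean"])


rng = np.random.default_rng(1)

# "Pre-experimental" data used for fitting (synthetic, made-up numbers).
hist_audio = rng.gamma(2.0, 100.0, size=2000)
hist_watched = 15.0 * hist_audio + rng.normal(0.0, 300.0, size=2000)
params = fit_cuped(hist_watched, hist_audio)

# Store the fitted parameters and read them back, as a later job would.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "cuped_params.json")  # hypothetical name
    with open(config_path, "w") as f:
        json.dump(params, f)
    with open(config_path) as f:
        loaded = json.load(f)

# A later batch is transformed with the stored, not refitted, parameters.
new_audio = rng.gamma(2.0, 100.0, size=500)
new_watched = 15.0 * new_audio + rng.normal(0.0, 300.0, size=500)
adjusted = apply_cuped(new_watched, new_audio, loaded)
print(adjusted[:5])
```

Reusing the stored parameters keeps the transformation identical across batches, which matters when an experiment must be processed with the same pre-experimental fit.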

Now imagine that in the future we would like to process the data using these stored transformations. \
For simplicity, we will use the same data

Create new instance with data to process

In [29]:
future_preprocessor = Preprocessor(dataframe=data)

Pass a path to stored transformations

In [30]:
future_preprocessor.transform_from_config(load_path=store_path)

TypeError: transform() got an unexpected keyword argument 'inplace'

---

To learn more about the transformations available in the ``Preprocessor``, their functionality, and their usage, check:
- ``Preprocessor`` class documentation
- An overview of *Ambrosia* main data preprocessing tools
- An overview of advanced metric transformation to learn about different methods for reducing variance