# Data preprocessing with ```cause2e```
This notebook shows how ```cause2e``` can be used to preprocess data. The ```discovery.StructureLearner``` can preprocess the data before learning the causal graph. Afterwards, the ```estimator.Estimator``` can imitate the same preprocessing steps before estimating quantitative causal effects. If you know your way around packages like ```pandas```, you can perform all of the presented transformations without ```cause2e```. However, integrating a few basic steps as built-in methods of ```cause2e```'s classes has proven convenient, because it reduces the number of interfaces between packages and the problems that come with them (e.g. datatype mismatches).

## Imports

In [1]:
import os
import pandas as pd
from cause2e import path_mgr, discovery, estimator

## Set up paths to data and output directories
This step is conveniently handled by the ```PathManager``` class, which avoids having to wrestle with paths throughout the multistep causal analysis. If we want to perform the analysis in a directory ```'dirname'``` that contains ```'dirname/data'``` and ```'dirname/output'``` as subdirectories, we can also use ```PathManagerQuick``` for an even easier setup. The experiment name is used for generating output files with meaningful names, in case we want to study multiple scenarios (e.g. with varying model parameters). For this analysis, we use the sprinkler dataset.

In [2]:
cwd = os.getcwd()
wd = os.path.dirname(cwd)
paths = path_mgr.PathManagerQuick(experiment_name='sprinkler',
                                  data_name='sprinkler.csv',
                                  directory=wd
                                  )
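
The directory convention described above can be pictured with plain ```os.path``` calls. This is only a sketch of the layout, not how ```PathManagerQuick``` is implemented; the helper ```quick_paths``` and its return format are hypothetical.

```python
import os

def quick_paths(directory, data_name, experiment_name):
    """Hypothetical helper: derive data and output locations from one base directory."""
    data_dir = os.path.join(directory, "data")      # assumed 'dirname/data' subdirectory
    output_dir = os.path.join(directory, "output")  # assumed 'dirname/output' subdirectory
    return {
        "data_file": os.path.join(data_dir, data_name),
        "output_dir": output_dir,
        "experiment_name": experiment_name,
    }

layout = quick_paths("dirname", "sprinkler.csv", "sprinkler")
print(layout["data_file"])
print(layout["output_dir"])
```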

## Initialize the StructureLearner
As in the other notebooks, we set up a ```StructureLearner``` and read our data.

In [3]:
learner = discovery.StructureLearner(paths)
learner.read_csv(index_col=0)

The first step in the analysis should be to check which variables we are dealing with. In the sprinkler dataset, each sample tells us
- the current season
- whether it is raining
- whether our lawn sprinkler is activated
- whether our lawn is slippery
- whether our lawn is wet.

In [4]:
print(learner.variables)

{'Season', 'Sprinkler', 'Wet', 'Rain', 'Slippery'}


## Delete a variable
If we are sure that a variable is not related to our quantities of interest, we can delete it from the data. For demonstration purposes, let us remove the ```'Slippery'``` variable.

In [5]:
learner.delete_variable('Slippery')
print(learner.variables)

{'Sprinkler', 'Wet', 'Rain', 'Season'}
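
For comparison, the same deletion in plain ```pandas``` might look as follows. The small data frame is a made-up stand-in for the sprinkler data, used only for illustration.

```python
import pandas as pd

# Toy stand-in for the sprinkler data (values are made up for illustration)
df = pd.DataFrame({
    "Season": ["Winter", "Summer", "Spring"],
    "Sprinkler": [0, 1, 1],
    "Rain": [0, 0, 1],
    "Wet": [0, 1, 1],
    "Slippery": [0, 1, 1],
})

# drop returns a new frame without the column; the original is left untouched
reduced = df.drop(columns=["Slippery"])
print(set(reduced.columns))
```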


## Add a variable
On the other hand, we might have a column of values from another data source that we want to add.

In [6]:
# generate something to add
n_samples = len(learner.data)
fake_slippery = pd.DataFrame([0] * n_samples)
# add it to the data
learner.add_variable('Fake_Slippery', fake_slippery)
# check output
print(learner.variables)
print(learner.data)

{'Season', 'Sprinkler', 'Wet', 'Rain', 'Fake_Slippery'}
     Season  Sprinkler  Rain  Wet  Fake_Slippery
0    Winter          0     0    0              0
1    Winter          0     0    0              0
2    Winter          0     1    1              0
3    Autumn          0     1    1              0
4    Summer          1     0    1              0
..      ...        ...   ...  ...            ...
995  Spring          1     1    1              0
996  Spring          0     0    0              0
997  Spring          0     1    1              0
998  Summer          1     0    1              0
999  Summer          1     1    1              0

[1000 rows x 5 columns]
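
When adding a column with plain ```pandas```, one detail deserves attention: assigning a ```Series``` aligns on the index, so values from another data source can silently turn into ```NaN``` if the row labels differ. The sketch below uses a made-up frame to show the pitfall and one way around it; it says nothing about how ```add_variable``` handles indices internally.

```python
import pandas as pd

# Toy frame whose index does not start at 0 (made up for illustration)
df = pd.DataFrame({"Rain": [0, 1, 1]}, index=[10, 11, 12])
new_vals = pd.Series([0, 0, 0])  # default index 0, 1, 2

# Series assignment aligns on the index, so mismatched labels produce NaN
misaligned = df.assign(Fake_Slippery=new_vals)

# Passing the raw values skips alignment and keeps the row order
aligned = df.assign(Fake_Slippery=new_vals.to_numpy())

print(misaligned)
print(aligned)
```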


## Rename a variable
To keep the data tidy, it often helps to remove naming artefacts such as prefixes from the variable names. We demonstrate this by removing the prefix of the ```'Fake_Slippery'``` variable.

In [7]:
learner.rename_variable('Fake_Slippery', 'Slippery')
print(learner.variables)

{'Season', 'Sprinkler', 'Wet', 'Rain', 'Slippery'}
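
In plain ```pandas```, renaming a single column or stripping a prefix from several columns at once could be sketched like this. The frame and the ```prefix``` variable are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Rain": [0, 1], "Fake_Slippery": [0, 0]})

# Rename a single column by name
renamed = df.rename(columns={"Fake_Slippery": "Slippery"})

# Strip a known prefix from every column that carries it
prefix = "Fake_"
stripped = df.rename(columns=lambda c: c[len(prefix):] if c.startswith(prefix) else c)

print(list(renamed.columns))
print(list(stripped.columns))
```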


## Combine variables into a new variable
If we are not interested in the provided variables themselves, but in quantities derived from them (e.g. deviations), we can use ```cause2e```'s functionality for combining variables into a new variable. Suppose that we are interested in the deviation of ```'Wet'``` from ```'Sprinkler'```, because we want to see whether the sprinkler is the only reason for the lawn being wet. The ```'keep_old'``` argument indicates whether the input columns should stay in the data frame after the new column has been created from them. A look at the new column shows that it is nonzero for some samples: the lawn can be wet while the sprinkler is off, so more than one cause may be at work.

In [8]:
def deviation(data, col1, col2):
    return data[col1] - data[col2]

learner.combine_variables(name='Deviation_Wet_Sprinkler',
                          input_cols=['Wet', 'Sprinkler'],
                          func=deviation,
                          keep_old=True
                          )

print(learner.data)

     Season  Sprinkler  Rain  Wet  Slippery  Deviation_Wet_Sprinkler
0    Winter          0     0    0         0                        0
1    Winter          0     0    0         0                        0
2    Winter          0     1    1         0                        1
3    Autumn          0     1    1         0                        1
4    Summer          1     0    1         0                        0
..      ...        ...   ...  ...       ...                      ...
995  Spring          1     1    1         0                        0
996  Spring          0     0    0         0                        0
997  Spring          0     1    1         0                        1
998  Summer          1     0    1         0                        0
999  Summer          1     1    1         0                        0

[1000 rows x 6 columns]
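
To make the role of ```keep_old``` concrete, here is a minimal ```pandas``` sketch of what such a combination step could do. The function ```combine``` is a hypothetical stand-in written for this illustration, not ```cause2e```'s implementation, and the data is made up.

```python
import pandas as pd

def deviation(data, col1, col2):
    return data[col1] - data[col2]

def combine(data, name, input_cols, func, keep_old=True):
    """Hypothetical stand-in: build a new column from existing ones."""
    out = data.copy()
    out[name] = func(out, *input_cols)
    if not keep_old:
        # Discard the inputs once the derived column exists
        out = out.drop(columns=input_cols)
    return out

df = pd.DataFrame({"Wet": [0, 1, 1], "Sprinkler": [0, 0, 1]})
kept = combine(df, "Deviation_Wet_Sprinkler", ["Wet", "Sprinkler"], deviation)
dropped = combine(df, "Deviation_Wet_Sprinkler", ["Wet", "Sprinkler"], deviation,
                  keep_old=False)

print(kept)
print(dropped)
```

With ```keep_old=False``` only the derived column survives, which can be useful when the inputs would otherwise show up as redundant variables in later steps.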


## Normalize a variable
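If our data is measured on different scales (e.g. kilometres vs centimetres), normalization can be an important step. This can be achieved by a suitable call to the ```combine_variables``` method, but for convenience we have added z-score normalization as a separate method. It rescales the treated data column to mean 0 and standard deviation 1, which helps with putting all the variables on the same scale. In the output below, the mean is zero up to floating-point error. The printed standard deviation differs slightly from 1 because ```pandas```' ```std``` computes the sample standard deviation (```ddof=1```) by default, which is consistent with the column having been scaled by the population standard deviation.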

In [9]:
learner.normalize_variable('Sprinkler')
print(learner.data)
print(f"Sprinkler mean: {learner.data['Sprinkler'].mean(axis=0)}")
print(f"Sprinkler standard deviation: {learner.data['Sprinkler'].std(axis=0)}")

     Season  Sprinkler  Rain  Wet  Slippery  Deviation_Wet_Sprinkler
0    Winter  -0.725753     0    0         0                        0
1    Winter  -0.725753     0    0         0                        0
2    Winter  -0.725753     1    1         0                        1
3    Autumn  -0.725753     1    1         0                        1
4    Summer   1.377879     0    1         0                        0
..      ...        ...   ...  ...       ...                      ...
995  Spring   1.377879     1    1         0                        0
996  Spring  -0.725753     0    0         0                        0
997  Spring  -0.725753     1    1         0                        1
998  Summer   1.377879     0    1         0                        0
999  Summer   1.377879     1    1         0                        0

[1000 rows x 6 columns]
Sprinkler mean: -1.4210854715202004e-17
Sprinkler standard deviation: 1.0005003753127735
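
The printed standard deviation of roughly 1.0005 instead of exactly 1 can be reproduced with a small ```pandas``` sketch. It assumes that the normalization divides by the population standard deviation (```ddof=0```); this is an inference from the output above, not something confirmed from ```cause2e```'s source. Since ```pandas```' ```std``` defaults to ```ddof=1```, the result is then off by a factor of sqrt(n / (n - 1)), which for n = 1000 is about 1.0005.

```python
import math
import pandas as pd

s = pd.Series([0, 0, 0, 1], dtype=float)  # made-up binary column

# z-score with the population standard deviation (ddof=0); an assumption
z = (s - s.mean()) / s.std(ddof=0)

pop_std = z.std(ddof=0)   # 1 up to floating-point error
sample_std = z.std()      # pandas default ddof=1
n = len(s)

print(pop_std, sample_std, math.sqrt(n / (n - 1)))
```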


## Imitate preprocessing steps with the Estimator class
Just like the ```StructureLearner```, the ```Estimator``` also has methods for reading data. Having two independent reading steps instead of passing the data directly comes in handy when we want to use different sample sizes for causal discovery and estimation, when we are dealing with big data that cannot be fully stored in RAM, or when we are dealing with two entirely different datasets that only have the qualitative graph structure in common.

In [10]:
estim = estimator.Estimator(paths)
estim.read_csv(index_col=0)
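
The scenarios mentioned above (different sample sizes, data that does not fit in RAM) correspond to standard options of ```pandas.read_csv```. The sketch below uses plain ```pandas``` on a tiny in-memory CSV that is made up for illustration; whether the ```read_csv``` methods of ```cause2e``` accept the same keyword arguments is not shown in this notebook and should be checked separately.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a large file (made up for illustration)
csv_text = "idx,Rain,Wet\n0,0,0\n1,1,1\n2,1,1\n3,0,0\n"

# Read only the first rows, e.g. a smaller sample for causal discovery
small = pd.read_csv(io.StringIO(csv_text), index_col=0, nrows=2)

# Stream the file in chunks when it is too large to load at once
total_wet = 0
for chunk in pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=2):
    total_wet += int(chunk["Wet"].sum())

print(len(small), total_wet)
```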

A drawback of the approach with separate reading methods is that we lose all the preprocessing that has been applied by the ```StructureLearner```. Notice how e.g. ```'Sprinkler'``` is back in its original state before the normalization.

In [11]:
estim.data

     Season  Sprinkler  Rain  Wet  Slippery
0    Winter          0     0    0         0
1    Winter          0     0    0         0
2    Winter          0     1    1         1
3    Autumn          0     1    1         1
4    Summer          1     0    1         1
..      ...        ...   ...  ...       ...
995  Spring          1     1    1         1
996  Spring          0     0    0         0
997  Spring          0     1    1         1
998  Summer          1     0    1         1


If we are dealing with the same samples for both the ```StructureLearner``` and the ```Estimator```, we can just assign ```estim.data = learner.data``` without having to worry about these issues. There is even a convenience constructor for this specific use case.

In [12]:
estim_2 = estimator.Estimator.from_learner(learner, same_data=True)
print(estim_2.data)
print(f"Data identical to data of the StructureLearner: {estim_2.data.equals(learner.data)}")

     Season  Sprinkler  Rain  Wet  Slippery  Deviation_Wet_Sprinkler
0    Winter  -0.725753     0    0         0                        0
1    Winter  -0.725753     0    0         0                        0
2    Winter  -0.725753     1    1         0                        1
3    Autumn  -0.725753     1    1         0                        1
4    Summer   1.377879     0    1         0                        0
..      ...        ...   ...  ...       ...                      ...
995  Spring   1.377879     1    1         0                        0
996  Spring  -0.725753     0    0         0                        0
997  Spring  -0.725753     1    1         0                        1
998  Summer   1.377879     0    1         0                        0
999  Summer   1.377879     1    1         0                        0

[1000 rows x 6 columns]
Data identical to data of the StructureLearner: True


However, if the data for the estimation part consists of different (or just more) samples, this approach is not possible. Luckily, the ```StructureLearner``` stores all its preprocessing steps in its ```transformations``` attribute. This attribute is passed along implicitly when we initialize the ```Estimator```, so that we can easily transform the new samples in the same way.

The only additional input that we need to provide is an ordered list of the columns that should be used in the ```add_variable``` steps of the preprocessing, since storing the original columns is generally not a good idea with big data.

If you run into any problems, please make sure that you have performed each preprocessing step with the ```StructureLearner``` only once.

In [13]:
# create new Estimator
estim_3 = estimator.Estimator.from_learner(learner)
estim_3.read_csv(index_col=0)
# replicate preprocessing steps from StructureLearner
vals_list = [fake_slippery]
estim_3.imitate_data_trafos(vals_list)
# check results
print(estim_3.data)
print(f"Data identical to data of the StructureLearner: {estim_3.data.equals(learner.data)}")

     Season  Sprinkler  Rain  Wet  Slippery  Deviation_Wet_Sprinkler
0    Winter  -0.725753     0    0         0                        0
1    Winter  -0.725753     0    0         0                        0
2    Winter  -0.725753     1    1         0                        1
3    Autumn  -0.725753     1    1         0                        1
4    Summer   1.377879     0    1         0                        0
..      ...        ...   ...  ...       ...                      ...
995  Spring   1.377879     1    1         0                        0
996  Spring  -0.725753     0    0         0                        0
997  Spring  -0.725753     1    1         0                        1
998  Summer   1.377879     0    1         0                        0
999  Summer   1.377879     1    1         0                        0

[1000 rows x 6 columns]
Data identical to data of the StructureLearner: True


We see that the ```Estimator``` now has the data in the same format as the ```StructureLearner```. This is vital for any subsequent analysis, since the data of the ```StructureLearner``` is used to learn a causal graph which then serves to guide the ```Estimator``` in performing the right estimation steps. If the variables do not match, we will most likely run into problems.
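
The idea of recording preprocessing steps and replaying them on fresh data can be sketched in a few lines of ```pandas```. Everything in this block (the ```RecordingFrame``` class, the ```replay``` function and the format of the log) is hypothetical and only illustrates the pattern; it is not how ```cause2e``` stores or applies its ```transformations```. As in the notebook, values for added columns are not kept in the log and must be supplied again in order.

```python
import pandas as pd

class RecordingFrame:
    """Hypothetical wrapper that logs each preprocessing step for later replay."""

    def __init__(self, data):
        self.data = data.copy()
        self.transformations = []  # ordered list of (step_name, kwargs)

    def delete(self, name):
        self.data = self.data.drop(columns=[name])
        self.transformations.append(("delete", {"name": name}))

    def add(self, name, values):
        self.data[name] = list(values)
        # Only the column name is logged; the values are not stored
        self.transformations.append(("add", {"name": name}))

    def rename(self, old, new):
        self.data = self.data.rename(columns={old: new})
        self.transformations.append(("rename", {"old": old, "new": new}))


def replay(data, transformations, vals_list):
    """Apply logged steps to new data; vals_list feeds the 'add' steps in order."""
    out = data.copy()
    vals = iter(vals_list)
    for step, kw in transformations:
        if step == "delete":
            out = out.drop(columns=[kw["name"]])
        elif step == "add":
            out[kw["name"]] = list(next(vals))
        elif step == "rename":
            out = out.rename(columns={kw["old"]: kw["new"]})
    return out


raw = pd.DataFrame({"Rain": [0, 1], "Slippery": [0, 1]})  # made-up data
rec = RecordingFrame(raw)
rec.delete("Slippery")
rec.add("Fake_Slippery", [0, 0])
rec.rename("Fake_Slippery", "Slippery")

replayed = replay(raw, rec.transformations, vals_list=[[0, 0]])
print(replayed.equals(rec.data))
```

In a design like this, each call appends one entry to the log, so running the same step twice would also replay it twice.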