In [None]:
from pathlib import Path

***Notebooks are written for Jupyter and might not display well on GitHub***


# Loading and processing measured data with MeasuredDats

The goal of this tutorial is to provide a comprehensive workflow for treating measured data using  **CorrAI** <code>MeasuredDats</code>.

## Use case

Measurements were collected from a real-scale benchmark conducted by Nobatek's BEF (Banc d'Essais Façade), which provides experimental cells for testing building façade solutions. Heat exchanges in a cell are limited to five of its faces, while the sixth face is dedicated to the tested solution. Internal temperature and hygrometry conditions can be controlled or monitored, and external conditions, such as temperatures and solar radiation, are measured.

The experimental setup is presented in the following figures:

| Figure 1: picture of the benchmark | Figure 2: wall layers from the inside (right) to the outside (left) |
| :---: | :---: |
|<img src="images/etics_pict.png"  height="300"> | <img src="images/etics_sch.png"  height="300"> |

Additional details about the data:
- The measurement campaign spanned from 07/06/2017 to 20/06/2017.
- The acquisition timestep is 1 minute.


# Measured data analysis and correction

Measured data are loaded using the <code>pandas</code> Python library.

In [None]:
import pandas as pd

In [None]:
raw_data = pd.read_csv(
    Path(r"resources/tuto_data.csv"),
    sep=",",
    index_col=0,
    parse_dates=True
)

Plotting the raw temperatures gives valuable information on the dataset.

In [None]:
raw_data['T_ext'].plot()

At first sight, a dataset may look fine, but missing values or incorrect variations are not always visible on a graph. The following steps are proposed to ensure data quality.

#### 1- Identify anomalies:
- __upper__ and __lower values__ as boundaries. Measured values outside the interval are considered wrong
- __upper__ and __lower rates__. Measured values whose rate of change is above the upper rate or below the lower rate are considered wrong

These boundaries are set depending on the measured physical phenomenon.
For example, the boundaries for power and temperature will be configured differently.
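The two checks can be sketched with plain pandas. This is only an illustration of the idea, not the **corrai** implementation: the toy values, the thresholds and the rule "a point is a spike when the change into it and out of it both exceed the rate" are assumptions made for the example.

```python
import pandas as pd

# Toy 1-minute temperature series with one out-of-range value and one spike
index = pd.date_range("2018-03-25 00:00", periods=6, freq="min")
temps = pd.Series([20.0, 20.1, 150.0, 20.2, 26.0, 20.3], index=index)

# Boundary check: values outside [lower, upper] become NaN
lower, upper = -20.0, 100.0
bounded = temps.where((temps >= lower) & (temps <= upper))

# Rate check: compute the absolute change per minute into each point
max_rate = 2.0  # degrees per minute, illustrative value
minutes = bounded.index.to_series().diff().dt.total_seconds() / 60
rate_in = bounded.diff().abs() / minutes
rate_out = rate_in.shift(-1)

# Flag a point only when it jumps away from both of its neighbours
spike = (rate_in > max_rate) & (rate_out > max_rate)
cleaned = bounded.mask(spike)
print(cleaned)
```

Here the 150.0 reading is removed by the boundary check and the 26.0 reading by the rate check, while the valid point that follows the spike is kept.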

#### 2- Missing data interpolation
Physical models do not tolerate missing values well. Therefore, for each sensor, we provide a method to interpolate missing data. We use a linear interpolation method to fill in the gaps between valid points. Gaps at the beginning or end of the time series are filled with the first or last valid value.
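As a minimal sketch of this filling strategy in plain pandas (toy values, not the **corrai** transformers themselves): interior gaps are interpolated linearly, then the edges are filled with the nearest valid value.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2018-03-25 00:00", periods=6, freq="min")
series = pd.Series([np.nan, 20.0, np.nan, np.nan, 23.0, np.nan], index=index)

# Linear interpolation restricted to gaps surrounded by valid points
inside = series.interpolate(method="linear", limit_area="inside")

# Leading gap takes the first valid value, trailing gap takes the last one
filled = inside.bfill().ffill()
print(filled)
```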

#### 3- Reducing dataset size
Finally, a 1-minute acquisition timestep produces a heavy dataset.
To make the dataset more manageable, we provide an aggregation method to _resample_ the dataset. Resampling allows the data to be aggregated into larger time intervals without losing critical information.
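For reference, this is what a mean resampling looks like with plain pandas on a toy series. The 5-minute rule is only an example value.

```python
import numpy as np
import pandas as pd

# Ten 1-minute samples: 0.0, 1.0, ..., 9.0
index = pd.date_range("2018-03-25 00:00", periods=10, freq="min")
temps = pd.Series(np.arange(10, dtype=float), index=index)

# Aggregate into 5-minute bins, keeping the mean of each bin
resampled = temps.resample("5min").mean()
print(resampled)
```

The ten points are reduced to two, one per 5-minute bin.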

## Using MeasuredDats to perform operations on data

The <code>MeasuredDats</code> **corrai** object is designed to specify transformations to apply to a measured dataset and to visualize their effects.
The measurements are classified into _categories_ (e.g. temperature, power, control, etc.).
There are 3 kinds of transformations:
- _Category level_: specify transformations "in parallel" that will be applied to specified categories
- _Common transformation_: apply transformation to all categories
- _Resampling_: process data using a time rule and an aggregation method. It may be used to align data on a regular time index or reduce the size of the dataset

<code>MeasuredDats</code> uses **Scikit Learn** _pipelines_. The transformers are <code>corrai.custom_transformers</code> objects. They inherit from the scikit-learn base classes <code>BaseEstimator</code> and <code>TransformerMixin</code>. These transformers ensure that Pandas <code>DataFrame</code> objects with a <code>DatetimeIndex</code> are used throughout the process.

You can get the transformer keys using the function <code>corrai.measure.get_transformers_keys()</code>.

Refer to <code>corrai.custom_transformers</code> documentation to configure transformers arguments.
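To give an idea of what such transformers look like, here is a minimal sketch of two scikit-learn compatible transformers chained in a pipeline. The class names <code>DropOutOfRange</code> and <code>FillGaps</code> are hypothetical and written for this example; they are not the **corrai** classes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropOutOfRange(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: replace out-of-range values with NaN."""

    def __init__(self, lower=-20.0, upper=100.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Refuse anything that is not a time-indexed DataFrame
        if not isinstance(X, pd.DataFrame) or not isinstance(
            X.index, pd.DatetimeIndex
        ):
            raise ValueError("X must be a DataFrame with a DatetimeIndex")
        return self

    def transform(self, X):
        return X.where((X >= self.lower) & (X <= self.upper))


class FillGaps(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: linear interpolation, then edge filling."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.interpolate(method="linear").bfill().ffill()


index = pd.date_range("2018-03-25 00:00", periods=4, freq="min")
frame = pd.DataFrame({"T_int_1": [20.0, 500.0, 22.0, 23.0]}, index=index)

# Step names are free; they echo the transformer names used in this tutorial
pipeline = Pipeline([("ANOMALIES", DropOutOfRange()), ("COMMON", FillGaps())])
corrected = pipeline.fit_transform(frame)
print(corrected)
```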

The figure below describes a "pipeline" that applies a series of corrections to the dataset.

<img src="images/pipe.png"  height="300">

4 successive transformations are applied: 2 category transformations, a common transformation and a resampling operation.

The **categories** are specified using <code>category_dict</code>:
```
category_dict = {
    "Temperatures": ["temp_1", "temp_2"],
    "Power": ["pow_1"],
    "radiation": ["Sol_rad"]
    }
```

The **category transformations** are specified using <code>category_transformations</code> :
```
category_transformations = {
    "Temperatures":{
        "ANOMALIES": [
            ["drop_threshold", {"upper": 40000, "lower": 0}],
            ["drop_time_gradient", {"upper_rate": 5000, "lower_rate": 0}],
        ],
    },
    "Power": {
        "ANOMALIES": [
            ["drop_threshold", {"upper": 50, "lower": -2}],
            ["drop_time_gradient", {"upper_rate": 5000, "lower_rate": 0}],
        ],
        "PROCESS": [
            ["apply_expression", {"expression": "x / 1000"}]
        ],
    },
    "radiation": {}
}
```

- The dictionary keys must match the categories defined in <code>category_dict</code>
- For each category, you can specify as many transformers as you want. The same name must be used in each category if you want the transformers to be part of the same _"category transformation"_ (e.g. ANOMALIES)
- For each transformer, a list of transformations is given. Each one is defined by a list of two elements [custom_transformer key, {custom transformer args}]
- If a category doesn't require any transformation, specify an empty dictionary
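The rules above can be checked before building the object. The helper below is a hypothetical sketch written for this tutorial (it is not part of **corrai**), and the threshold values are illustrative: it verifies that every category is known and that every transformation is a two-element list.

```python
category_dict = {
    "Temperatures": ["temp_1", "temp_2"],
    "Power": ["pow_1"],
    "radiation": ["Sol_rad"],
}
category_transformations = {
    "Temperatures": {
        "ANOMALIES": [["drop_threshold", {"upper": 100, "lower": -20}]],
    },
    "Power": {
        "ANOMALIES": [["drop_threshold", {"upper": 50, "lower": -2}]],
    },
    "radiation": {},
}


def check_configuration(categories, transformations):
    """Hypothetical helper: return a list of human-readable problems."""
    problems = []
    for category in transformations:
        if category not in categories:
            problems.append(f"unknown category: {category}")
    for category, named in transformations.items():
        for name, steps in named.items():
            for step in steps:
                # Each step must look like [key, {args}]
                if len(step) != 2 or not isinstance(step[1], dict):
                    problems.append(f"{category}/{name}: expected [key, {{args}}]")
    return problems


problems = check_configuration(category_dict, category_transformations)
print(problems)
```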

The **common transformations** are specified using <code>common_transformations</code>:

```
common_transformations={
    "COMMON": [
        ["interpolate", {"method": 'linear'}],
        ["fill_na", {"method": 'bfill'}]
    ]
}
```
- The dictionary keys are the names of the common transformers
- For each transformer, a list of transformations is given. They are defined by a list of two elements [custom_transformer key, {custom transformer args}]

The **resampler** is configured using <code>resampler_agg_methods</code>:
```
resampler_agg_methods={
    "Temperatures": "mean"
}
```

- An optional key "RESAMPLE" may be given to specify the category aggregation method in case of resampling. By default, the resampling method is the mean. If you want "mean" for all categories, an empty dictionary may be specified (default value).


The **transformers list**:
Lastly, you can specify the order of the transformations using <code>transformers_list</code>. For example <code>transformers_list = ["ANOMALIES", "COMMON", "RESAMPLER", "PROCESS"]</code>.
- If <code>transformers_list</code> is left to <code>None</code>, the transformers list holds all the category transformers, then all the common transformers
- If <code>"RESAMPLER"</code> is not present in <code>transformers_list</code>, but a <code>resampling_rule</code> is provided, the <code>"RESAMPLER"</code> will automatically be added at the end of the <code>transformers_list</code>
- This list can be changed at any time
- You don't have to use all the transformers
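The ordering rules listed above can be written down as a small function. This is a hypothetical sketch that mirrors the rules as stated here, not the code used inside <code>MeasuredDats</code>; in particular, the order of first appearance for category transformers is an assumption.

```python
def resolve_transformers_list(
    category_transformations,
    common_transformations,
    transformers_list=None,
    resampling_rule=None,
):
    """Hypothetical helper mirroring the ordering rules listed above."""
    if transformers_list is None:
        names = []
        # Category transformers first, in order of first appearance
        for named in category_transformations.values():
            for name in named:
                if name not in names:
                    names.append(name)
        # Then the common transformers
        names.extend(common_transformations)
    else:
        names = list(transformers_list)
    # A resampling rule implies a final "RESAMPLER" step
    if resampling_rule is not None and "RESAMPLER" not in names:
        names.append("RESAMPLER")
    return names


category_transformations = {
    "Temperatures": {"ANOMALIES": []},
    "Power": {"ANOMALIES": [], "PROCESS": []},
    "radiation": {},
}
common_transformations = {"COMMON": []}

default_order = resolve_transformers_list(
    category_transformations, common_transformations
)
with_resampling = resolve_transformers_list(
    category_transformations,
    common_transformations,
    transformers_list=["ANOMALIES"],
    resampling_rule="5min",
)
print(default_order)
print(with_resampling)
```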


Here is an example for the dataset we just loaded:

In [None]:
from corrai.measure import MeasuredDats
from corrai.measure import get_transformers_keys

The function <code>get_transformers_keys()</code> is designed to print the available transformer names and help you configure <code>MeasuredDats</code>. More information on these transformers is available in the <code>corrai.custom_transformers.py</code> script.

In [None]:
get_transformers_keys()

In [None]:
my_data = MeasuredDats(
    data = raw_data,
    category_dict = {
        "temperatures": [
            'T_Wall_Ins_1', 'T_Wall_Ins_2', 'T_Ins_Ins_1', 'T_Ins_Ins_2',
            'T_Ins_Coat_1', 'T_Ins_Coat_2', 'T_int_1', 'T_int_2', 'T_ext', 'T_garde'
        ],
        "illuminance": ["Lux_CW"],
        "radiation": ["Sol_rad"]
    },
    category_transformations = {
        "temperatures": {
            "ANOMALIES": [
                ["drop_threshold", {"upper": 100, "lower": -20}],
                ["drop_time_gradient", {"upper_rate": 2, "lower_rate": 0}]
            ],
        },
        "illuminance": {
            "ANOMALIES": [
                ["drop_threshold", {"upper": 1000, "lower": 0}],
            ],
        },
        "radiation": {
            "ANOMALIES": [
                ["drop_threshold", {"upper": 1000, "lower": 0}],
            ],
        }
    },
    common_transformations={
        "COMMON": [
            ["interpolate", {"method": 'linear'}],
            ["fill_na", {"method": 'bfill'}],
        ]
    },
    transformers_list=["ANOMALIES", "COMMON"]
)

Note that <code>transformers_list</code> could have been left to None. Here, we are applying the _anomalies_ transformer, then the _common_ transformer.

In [None]:
my_data.get_pipeline()

The <code>plot</code> method can be used to plot the data.

Provide a <code>list</code> to the argument <code>cols</code> to specify the entries you want to plot. A new y axis will be created for each data type.

In [None]:
my_data.plot(
    cols=['T_Wall_Ins_1', 'Sol_rad', 'Lux_CW'],
    begin='2018-04-15',
    end='2018-04-18',
    title='Plot uncorrected data',
    marker_raw=True,
    line_raw=False,
    plot_raw = True,
)

Plotted data are the <code>corrected_data</code> obtained after going through the pipeline described in the <code>MeasuredDats</code> object's <code>transformers_list</code>. You can specify an alternative transformers_list using the <code>plot</code> function argument <code>transformers_list</code>.

Use <code>plot_raw=True</code> to display raw data. This is useful to assess the impact of the correction and of the resampling methods.


The <code>get_missing_value_stats</code> method gives information on the amount of missing values.
You can get it for raw, corrected or partially corrected data depending on the transformers specified in <code>transformers_list</code>.

Let's have a look at the raw data. We specify an empty <code>transformers_list</code>, which creates a pipeline with a single Identity transformer.

In [None]:
my_data.get_missing_value_stats(transformers_list=[])

Before correction, the journal shows that ~2% of the data is missing for the temperature sensors and ~3% for external temperature, "garde" temperature and solar radiation. This corresponds to rows that have a timestamp but no value. In this specific case, it is not related to sensor errors. Two distinct acquisition devices were used to perform the measurements. Merging the data from the two devices caused trouble in timestamp "alignment". Measurement also stopped a bit earlier for the second device.
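The same kind of statistics can be reproduced on any <code>DataFrame</code> with plain pandas. The sketch below uses toy values, and the output column names are illustrative; they are not necessarily those returned by <code>get_missing_value_stats</code>.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2018-03-25 00:00", periods=4, freq="min")
frame = pd.DataFrame(
    {"T_ext": [10.0, np.nan, 10.2, 10.3], "Sol_rad": [0.0, 0.0, 0.0, 0.0]},
    index=index,
)

# Rows with a timestamp but no value show up as NaN
missing = pd.DataFrame(
    {
        "Number_of_missing": frame.isna().sum(),
        "Percent_of_missing": frame.isna().mean() * 100,
    }
)
print(missing)
```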

#### 1- Identification of anomalies:
Now let's apply the ANOMALIES transformer to delete invalid data according to the specifications.

In [None]:
my_data.get_missing_value_stats(["ANOMALIES"])

It looks like the applied corrections removed a number of data points.
For example, the sensors measuring the cell internal temperature now have up to __4.7%__ of missing data.

Few corrections were applied to the outside temperature sensor.

The journal of correction holds further information on the gaps of data.
For example, we might want to know more about the missing values of <code>T_int_1</code>.

In [None]:
my_data.get_gaps_description(cols=["T_int_1"], transformers_list=["ANOMALIES"])

- There are 11220 gaps
- 75% of these gaps do not exceed 1 timestep (~1min)
- The longest is 1h
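To make these figures concrete, here is one way to measure gaps with plain pandas on a toy series. It is a sketch under simplifying assumptions (regular 1-minute index, gap duration taken as the number of missing timestamps times the timestep), not the method used by <code>get_gaps_description</code>.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2018-03-25 00:00", periods=8, freq="min")
series = pd.Series(
    [1.0, np.nan, 1.0, np.nan, np.nan, np.nan, 1.0, 1.0], index=index
)

is_missing = series.isna()
# A new run starts whenever the missing/valid state changes
run_id = (is_missing != is_missing.shift()).cumsum()
# Keep only the runs of missing values and count their timestamps
gap_lengths = is_missing[is_missing].groupby(run_id[is_missing]).size()

timestep = pd.Timedelta(minutes=1)
gap_durations = gap_lengths * timestep

# count, mean, quartiles and max of the gap durations
summary = gap_durations.describe()
print(summary)
```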

It is also possible to "aggregate" the gaps to know when at least one of the data is missing.

In [None]:
my_data.get_gaps_description(transformers_list=["ANOMALIES"])["combination"]

- There are 28007 gaps (~10% of the dataset).
- The size of 75% of these gaps does not exceed 2 minutes
- The biggest gap lasts about 1 hour

There is not a lot of difference. It looks like the values are missing at the same timestamps. This is good news: it means that there are a lot of periods with all data available.
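The idea of a "combination" can be illustrated with plain pandas: a timestamp belongs to a combined gap as soon as at least one column is missing. The helper <code>count_gaps</code> and the toy values are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd


def count_gaps(mask):
    """Count runs of consecutive True values in a boolean Series."""
    starts = mask & ~mask.shift(fill_value=False)
    return int(starts.sum())


index = pd.date_range("2018-03-25 00:00", periods=6, freq="min")
frame = pd.DataFrame(
    {
        "T_int_1": [20.0, np.nan, np.nan, 20.3, 20.4, np.nan],
        "T_int_2": [20.1, np.nan, 20.2, 20.3, 20.4, np.nan],
    },
    index=index,
)

per_column = {col: count_gaps(frame[col].isna()) for col in frame.columns}
# True where at least one sensor has no value
combined = count_gaps(frame.isna().any(axis=1))
print(per_column, combined)
```

When the gaps overlap, as in this toy example, the combined count stays close to the per-column counts.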

The plotting method <code>plot_gaps</code> can be used to visualize where the gap(s) happened.

This dataset holds a lot of values, hence we only plot the input <code>'T_int_1'</code> here, as it is supposed to have the most gaps.

We are interested in gaps lasting more than 15 minutes.

In [None]:
import datetime as dt
my_data.plot_gaps(
    cols=['T_int_1'],
    begin='2018-03-25',
    end='2018-03-25',
    gaps_timestep=dt.timedelta(minutes=15),
    transformers_list=["ANOMALIES"]
)

There seems to be only one gap longer than 15 minutes. It occurs on 2018-03-25 between ~02:00 and ~03:00.
This is the gap we identified in the correction journal.

#### 2- Missing data interpolation
In our example, the same method is used to fill the gaps for all categories.
It is described in <code>common_transformations</code>.
Below is the object's transformers list:

In [None]:
my_data.transformers_list

To get the corrected data, we just need to call the <code>get_corrected_data</code> method with default arguments.

In [None]:
my_data.get_corrected_data()

We can check the effect of the transformation:

In [None]:
my_data.get_missing_value_stats()

Wow, perfect dataset!

**Be careful:** 0 missing data doesn't mean 0 problems.
If you had a crappy dataset, it is still crappy.
You just filled the gaps by copying values or drawing lines between (_what seem to be_) valid points.

#### 3- Reducing the dataset size
As we noted earlier, a 1-minute timestep is finer than needed.
Given the physical phenomena involved here, a 5-minute timestep is acceptable.
Let's have a look at the corrected data versus the raw data.
We select a period around the gap we identified (2018-03-25, from 00:00 to 05:00).

In [None]:
my_data.plot(
    title="Raw data versus corrected data",
    cols=['T_int_1'],
    begin='2018-03-25 00:00:00',
    end='2018-03-25 05:00:00',
    plot_raw=True,
    plot_corrected=True,
    line_raw=False,
    marker_corrected=True,
    resampling_rule='5min'
)

On the above graph, you can see the effects of mean resampling, which reduces the number of points and smooths out the data.

The gap has been filled using linear interpolation at the required timestep.

It is important to compare your data before and after applying the correction methods. For example, resampling with a large timestep can lead to a loss of information.
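The loss of information can be quantified on a toy example with plain pandas. The series below is made up for the illustration: a single one-minute peak that mean resampling progressively flattens as the timestep grows.

```python
import pandas as pd

# One hour of 1-minute data, flat except for a single one-minute peak
index = pd.date_range("2018-03-25 00:00", periods=60, freq="min")
signal = pd.Series(0.0, index=index)
signal.iloc[30] = 100.0

# Maximum value that survives mean resampling at several timesteps
peaks = {
    rule: signal.resample(rule).mean().max()
    for rule in ["5min", "30min", "60min"]
}
print(signal.max(), peaks)
```

The raw peak of 100 drops to 20 with a 5-minute mean and to less than 2 with a 60-minute mean.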