In [None]:
from pathlib import Path

***Notebooks are written for Jupyter and might not display well on GitHub***


# Loading and processing measured data with MeasuredDats

This tutorial walks through a complete workflow for processing measured data with the **CorrAI** <code>MeasuredDats</code> class.

## Use case

Measurements were collected from a real-scale benchmark conducted by Nobatek's BEF (Banc d'Essais Façade), which provides experimental cells for testing building façade solutions. Heat exchanges in a cell are limited on five of its faces, while the sixth face is dedicated to the tested solution. Internal temperature and hygrometry conditions can be controlled or monitored, and external conditions, such as temperature and solar radiation, are measured.

The experimental setup is presented in the following figures:

| Figure 1: picture of the benchmark | Figure 2: wall layers from the inside (right) to the outside (left) |
| :---: | :---: |
|<img src="images/etics_pict.png"  height="300"> | <img src="images/etics_sch.png"  height="300"> |

Additional details about the data:
- The measurement campaign spanned from 07/06/2017 to 20/06/2017.
- The acquisition timestep is probably 1 minute.


# Measured data analysis and correction

Measured data are loaded using the <code>pandas</code> Python library.

In [None]:
import pandas as pd

In [None]:
raw_data = pd.read_csv(
    Path(r"resources/tuto_data.csv"),
    sep=",",
    index_col=0,
    parse_dates=True
)

Plotting the raw temperatures gives valuable information about the dataset.

In [None]:
raw_data['T_ext'].plot()

At first sight, a dataset may look fine, but missing values or incorrect variations are not always visible on a graph. The following steps are proposed to ensure data quality.
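
Before any correction, a couple of quick numerical checks can complement the plot. The sketch below uses a small synthetic DataFrame (not the tutorial data) to show how the share of missing values and the most common timestep could be estimated with plain pandas. The column name and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for measured data: 1-minute timestep, a few missing values
index = pd.date_range("2018-03-25 00:00", periods=10, freq="min")
demo = pd.DataFrame(
    {"T_ext": [10.0, 10.1, np.nan, 10.3, 10.4, np.nan, np.nan, 10.7, 10.8, 10.9]},
    index=index,
)

# Percentage of missing values per column
percent_missing = demo.isna().mean() * 100

# Most frequent interval between consecutive timestamps
timestep = demo.index.to_series().diff().mode()[0]

print(percent_missing)
print(timestep)
```

The same two expressions could be applied to <code>raw_data</code> to check the assumed acquisition timestep.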

#### 1- Identify anomalies:
- __upper__ and __lower__ values used as boundaries. Measured values outside this interval are considered wrong.
- upper and lower "__rates__". Measured values that vary faster than the upper rate or slower than the lower rate are considered wrong.

These boundaries are set depending on the measured physical phenomenon.
For example, the boundaries for power and temperature will be configured differently.
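
To make the idea concrete, here is a minimal pandas sketch of the two kinds of checks. It is not the CorrAI implementation: the thresholds are hypothetical, the rate is assumed to be an absolute change per sample (the unit used by the library may differ), and only the upper rate is illustrated.

```python
import pandas as pd

index = pd.date_range("2018-03-25 00:00", periods=6, freq="min")
temperature = pd.Series([20.0, 20.5, 150.0, 21.0, 25.0, 25.5], index=index)

upper, lower = 100, -20  # hypothetical min/max bounds
upper_rate = 2           # hypothetical maximum absolute change per sample

# 1) Values outside [lower, upper] are replaced by NaN
bounded = temperature.where((temperature >= lower) & (temperature <= upper))

# 2) Values that changed by more than upper_rate since the previous sample
#    are replaced by NaN (changes next to a NaN are unknown and left alone)
step = bounded.diff().abs()
cleaned = bounded.where(~(step > upper_rate))

print(cleaned)
```

In this toy series, the value 150.0 is rejected by the bounds, and 25.0 is rejected because it jumps by 4 with respect to the previous valid sample.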

#### 2- Missing data interpolation
Physical models do not tolerate missing values well. Therefore, for each sensor, we provide a method to interpolate missing data. We use a linear interpolation method to fill in the gaps between missing points. Errors at the beginning or end of the time series are filled with the first or last correct value.
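
The filling strategy described above can be sketched with plain pandas. This is only an illustration on a synthetic series, under the assumption that the three steps are applied in the order linear interpolation, backward fill, forward fill.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2018-03-25 00:00", periods=6, freq="min")
series = pd.Series([np.nan, 20.0, np.nan, np.nan, 23.0, np.nan], index=index)

# Linear interpolation, restricted to gaps surrounded by valid points
filled = series.interpolate(method="linear", limit_area="inside")

# The leading gap takes the first valid value, the trailing gap the last one
filled = filled.bfill().ffill()

print(filled.tolist())  # [20.0, 20.0, 21.0, 22.0, 23.0, 23.0]
```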

#### 3- Reducing dataset size
Finally, a 1-minute acquisition timestep produces a large dataset.
To make it more manageable, we provide an aggregation method to _resample_ the dataset. Resampling aggregates the data over larger time intervals; with a suitable timestep, this can be done without losing critical information.
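
As a minimal illustration of mean resampling, the sketch below aggregates a synthetic 1-minute series into 5-minute bins with pandas. The data are made up, and the 5-minute rule is only an example.

```python
import pandas as pd

index = pd.date_range("2018-03-25 00:00", periods=10, freq="min")
series = pd.Series(range(10), index=index, dtype=float)

# Each 5-minute bin is replaced by the mean of the samples it contains
resampled = series.resample("5min").mean()

print(resampled.tolist())  # [2.0, 7.0]
```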

In [None]:
from corrai.measure import MeasuredDats

In [None]:
my_data = MeasuredDats(
    data = raw_data,
    data_type_dict = {
        "temperatures": [
            'T_Wall_Ins_1', 'T_Wall_Ins_2', 'T_Ins_Ins_1', 'T_Ins_Ins_2',
            'T_Ins_Coat_1', 'T_Ins_Coat_2', 'T_int_1', 'T_int_2', 'T_ext', 'T_garde'
        ],
        "illuminance": ["Lux_CW"],
        "radiation": ["Sol_rad"]
    },
    corr_dict = {
        "temperatures": {
            "minmax": {
                "upper": 100,
                "lower": -20
            },
            "derivative": {
                "upper_rate": 2,
                "lower_rate": 0,
            },
            "fill_nan": [
                "linear_interpolation",
                "bfill",
                "ffill"
            ],
            "resample": 'mean',
        },
        "illuminance": {
            "minmax": {
                "upper": 1000,
                "lower": 0,
            },
            "derivative": {
                "upper_rate": 10E8,  # Specifying a very high value is a way to skip this correction
                "lower_rate": -1,  # Specifying a negative value is a way to skip this correction
            },
            "fill_nan": [
                "linear_interpolation",
                "bfill",
                "ffill"
            ],
            "resample": 'mean',
        },
        "radiation": {
            "minmax": {
                "upper": 1000,
                "lower": 0,
            },
            "derivative": {
                "upper_rate": 10E8,  # Specifying a very high value is a way to skip this correction
                "lower_rate": -1,  # Specifying a negative value is a way to skip this correction
            },
            "fill_nan": [
                "linear_interpolation",
                "bfill",
                "ffill"
            ],
            "resample": 'mean',
        }
    }
)

The <code>plot</code> method can be used to plot the data.

Pass a <code>list</code> to the <code>cols</code> argument to select the entries you want to plot.

A new y-axis is created for each data type.

In [None]:
my_data.columns

In [None]:
my_data.plot(
    cols=['T_Wall_Ins_1', 'Sol_rad', 'Lux_CW'],
    begin='2018-04-15',
    end='2018-04-18',
    title='Plot uncorrected data',
)

The plotted data are the <code>corrected_data</code>. Use <code>plot_raw=True</code> to display the raw data. This is useful to assess the impact of the correction and resampling methods.

For now, no correction has been applied, so <code>corrected_data</code> is equal to <code>data</code>.


The <code>my_data</code> object contains the original dataset and the configuration of the correction methods.

The <code>correction_journal</code> property holds information about the data.

Let's have a look:

In [None]:
my_data.correction_journal

Before correction, the journal shows that ~2% of the data are missing for the temperature sensors and ~3% for the external temperature, the "garde" temperature and the solar radiation. These are rows that have a timestamp but no value. In this specific case, the gaps are not related to sensor errors. Two distinct acquisition devices were used to perform the measurements, and merging their data caused timestamp alignment problems. The second device also stopped measuring slightly earlier.
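
The following sketch shows how merging two devices with slightly different timestamps can create missing values even though no sensor failed. The device names, columns and values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical device A: one sample per minute
device_a = pd.DataFrame(
    {"T_int_1": [20.0, 20.1, 20.2]},
    index=pd.to_datetime(
        ["2018-03-25 00:00:00", "2018-03-25 00:01:00", "2018-03-25 00:02:00"]
    ),
)

# Hypothetical device B: drifting timestamps, stops earlier
device_b = pd.DataFrame(
    {"T_ext": [5.0, 5.1]},
    index=pd.to_datetime(["2018-03-25 00:00:00", "2018-03-25 00:01:30"]),
)

# The merged index is the union of both indexes, so each column gets NaN
# wherever only the other device has a timestamp
merged = pd.concat([device_a, device_b], axis=1).sort_index()
percent_missing = merged.isna().mean() * 100

print(merged)
print(percent_missing)
```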

#### 1- Identify anomalies:
Now let's apply the <code>remove_anomalies()</code> method to delete invalid data according to the specifications.

In [None]:
my_data.remove_anomalies()

Let's have a look at the <code>correction_journal</code>.
We will not display all of it: it stores the effect of every correction, so it grows quickly.
First, we want to see the new percentage of missing data after correction.

In [None]:
my_data.correction_journal["remove_anomalies"]["missing_values"]["Percent_of_missing"]

It looks like the applied corrections removed a number of values.
For example, the sensors measuring the cell internal temperature now have up to __4.5%__ of missing data.

Few corrections were applied to the outside temperature sensor.

The correction journal holds further information about the data gaps.
For example, we may want to know more about the missing values of <code>T_int_1</code>:

In [None]:
my_data.correction_journal["remove_anomalies"]["gaps_stats"]["T_int_1"]

- There are 11233 gaps.
- 75% of these gaps do not exceed 1 timestep (~1 min).
- The biggest gap is 1 h.
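
For readers curious about how such gap statistics can be obtained, here is a rough pandas sketch that measures runs of consecutive missing values on a synthetic series. It is an assumption about the general approach, not the CorrAI implementation, and the way gap durations are counted may differ from the library.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2018-03-25 00:00", periods=10, freq="min")
series = pd.Series(
    [1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0, 8.0, np.nan, 10.0],
    index=index,
)

is_missing = series.isna()

# A new run starts every time the missing/valid state changes
run_id = (is_missing != is_missing.shift()).cumsum()

# Keep only the missing samples and count them per run
gap_lengths = is_missing[is_missing].groupby(run_id[is_missing]).size()

# Convert a number of samples into a duration (1-minute timestep assumed)
gap_durations = gap_lengths * pd.Timedelta(minutes=1)

print(gap_durations.describe())
```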

It is also possible to "aggregate" the gaps in order to know when at least one of the entries is missing.

In [None]:
my_data.correction_journal["remove_anomalies"]["gaps_stats"]["combination"]

- There are 28066 gaps (~10% of the dataset).
- 75% of these gaps do not exceed 1 timestep (~1 min).
- The biggest gap is 1 h.

The gap size statistics are similar to those of <code>T_int_1</code>. It looks like the values are missing at the same timestamps.

This is good news: it means that there are a lot of periods with all data available.
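
The "combination" view can be pictured as a row-wise test: a timestamp counts as missing as soon as one column is missing. The sketch below illustrates this with pandas on synthetic data. It is an assumption about the principle, not the library code.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2018-03-25 00:00", periods=5, freq="min")
demo = pd.DataFrame(
    {
        "T_int_1": [20.0, np.nan, 20.2, 20.3, 20.4],
        "Sol_rad": [0.0, np.nan, 0.0, np.nan, 0.0],
    },
    index=index,
)

# True where at least one column is missing at that timestamp
any_missing = demo.isna().any(axis=1)

missing_per_column = demo.isna().sum()
missing_combined = int(any_missing.sum())

print(missing_per_column)
print(missing_combined)
```

When the combined count stays close to the count of the worst column, the gaps mostly overlap.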

The plotting method <code>plot_gaps</code> can be used to visualize where the gaps happened.

This dataset holds a lot of values, so we just plot the entry <code>'T_int_1'</code>, which is supposed to have the most gaps.

We are interested in gaps lasting more than 15 minutes.

In [None]:
import datetime as dt

# Plot only 'T_int_1', as stated in the text, and keep gaps longer than 15 minutes
my_data.plot_gaps(
    cols=['T_int_1'],
    begin='2018-03-25',
    end='2018-03-25',
    gaps_timestep=dt.timedelta(minutes=15),
)

There seems to be only one gap longer than 15 minutes. It occurs on 2018-03-25 between ~02:00 and ~03:00.
This is the gap we identified in the correction journal.

We may want to access the corrected dataset to perform further investigations. It is available in the <code>corrected_data</code> attribute of the <code>MeasuredDats</code> object.

_Note that the original dataset is left untouched in <code>data</code>_

#### 2- Missing data interpolation
Fill the missing data with the <code>fill_nan()</code> method, which applies the interpolation methods specified in the configuration.

In [None]:
my_data.fill_nan()

Once again, let's have a look at the <code>correction_journal</code>:

In [None]:
my_data.correction_journal["fill_nan"]["missing_values"]["Percent_of_missing"]

Wow, perfect dataset!

Be careful: zero missing values doesn't mean zero problems.
If you had a crappy dataset, it is still crappy.
You just filled the gaps by copying values or drawing lines between (_what seem to be_) valid points.
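
One way to keep this in mind is to track which values were measured and which were filled. The sketch below does so with a boolean mask on a synthetic series. The same comparison could in principle be made between <code>data</code> and <code>corrected_data</code> before resampling, although that usage is an assumption and is not shown in this tutorial.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2018-03-25 00:00", periods=5, freq="min")
before = pd.Series([20.0, np.nan, np.nan, 23.0, 23.5], index=index)
after = before.interpolate(method="linear")

# True where a value did not exist before and was created by the filling step
was_filled = before.isna() & after.notna()
n_filled = int(was_filled.sum())

print(after[was_filled])
print(n_filled)
```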

#### 3- Reducing dataset size
As we said earlier, a 1 min timestep is too small.
Given the physical phenomena involved here, a 5 min timestep is sufficient.

So let's resample the dataset to this value.

In [None]:
my_data.resample("5T")

Let's have a look at our corrected data versus the raw data.

We select a period around the gap we identified (2018-03-25, from 00:00 to 05:00).

In [None]:
my_data.plot(
    title="Raw data versus corrected data",
    cols=['T_int_1'],
    begin='2018-03-25 00:00:00',
    end='2018-03-25 05:00:00',
    plot_raw=True)

On the graph above, you can see the effect of mean resampling: it reduces the number of points and smooths the data.

The gap has been filled using linear interpolation at the required timestep.

It is important to compare your data before and after applying the correction methods. For example, resampling with a large timestep can lead to a loss of information.
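
As a small illustration of this loss of information, the sketch below resamples a synthetic signal containing a short 3-minute peak with two different timesteps and compares the resulting maxima. The signal and the timesteps are made up for the example.

```python
import pandas as pd

index = pd.date_range("2018-03-25 00:00", periods=60, freq="min")
signal = pd.Series(0.0, index=index)
signal.iloc[30:33] = 900.0  # 3-minute peak

# Maximum of the mean-resampled signal for each timestep
peaks = {rule: signal.resample(rule).mean().max() for rule in ["5min", "30min"]}

print(signal.max())  # 900.0
print(peaks)         # {'5min': 540.0, '30min': 90.0}
```

The larger the timestep, the more the peak is flattened, which is why the choice of the resampling rule should depend on the dynamics of the measured phenomenon.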