# Extract Readings

This notebook shows how to use the CSVLoader class to load readings from a folder
containing readings in the raw format.

Details about the raw readings format can be found in the documentation site.

In this notebook we will:

- Generate a folder with readings in the raw format based on the demo data
- Load the readings needed for our target times
- Explore different options from the CSVLoader
- Load a pipeline and use it on the loaded data
- Load the readings in the unstacked format
- Load an unstacked pipeline and use it on the loaded data

## 0. Set up the logging

This step sets up logging in our environment to increase our visibility over
the steps that GreenGuard performs.

In [1]:
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger().setLevel(level=logging.INFO)

import warnings
warnings.simplefilter("ignore")

## 1. Generate Raw Readings

The first step will be to execute the `generate_raw_readings` function, which will create a
folder in the indicated path and populate it with the raw version of the demo readings.

**NOTE**: if you want to use your own dataset you can skip this step and go directly to step 2.

In [2]:
from greenguard.demo import generate_raw_readings

target_times = generate_raw_readings('readings')

2020-02-10 18:41:33,310 - INFO - demo - Generating file readings/T001/2013-01-.csv
2020-02-10 18:41:34,048 - INFO - demo - Generating file readings/T001/2013-02-.csv
2020-02-10 18:41:34,845 - INFO - demo - Generating file readings/T001/2013-03-.csv
2020-02-10 18:41:35,670 - INFO - demo - Generating file readings/T001/2013-04-.csv
2020-02-10 18:41:36,476 - INFO - demo - Generating file readings/T001/2013-05-.csv
2020-02-10 18:41:37,259 - INFO - demo - Generating file readings/T001/2013-06-.csv
2020-02-10 18:41:38,194 - INFO - demo - Generating file readings/T001/2013-07-.csv
2020-02-10 18:41:39,031 - INFO - demo - Generating file readings/T001/2013-08-.csv
2020-02-10 18:41:39,891 - INFO - demo - Generating file readings/T001/2013-09-.csv
2020-02-10 18:41:40,689 - INFO - demo - Generating file readings/T001/2013-10-.csv
2020-02-10 18:41:41,478 - INFO - demo - Generating file readings/T001/2013-11-.csv
2020-02-10 18:41:42,249 - INFO - demo - Generating file readings/T001/2013-12-.csv


This function will generate a set of reading files in the raw format.

We will load one of them to explore it:

### Readings Format

In [3]:
import pandas as pd

readings_sample = pd.read_csv('readings/T001/2013-01-.csv')

In [4]:
readings_sample.head()

|   | signal_id | timestamp | value |
|---|-----------|-----------|-------|
| 0 | S01 | 01/10/13 00:00:00 | 323.0 |
| 1 | S02 | 01/10/13 00:00:00 | 320.0 |
| 2 | S03 | 01/10/13 00:00:00 | 284.0 |
| 3 | S04 | 01/10/13 00:00:00 | 348.0 |
| 4 | S05 | 01/10/13 00:00:00 | 273.0 |


Here we can clearly see the format in which the data is stored:

* All the data from all the turbines is inside a single folder.
* Inside this folder, one folder exists for each turbine, named exactly like the turbine:
    * `readings/T001`
    * `readings/T002`
    * ...
* Inside each turbine folder one CSV file exists for each month, named `%Y-%m-.csv`.
    * `readings/T001/2010-01-.csv`
    * `readings/T001/2010-02-.csv`
    * `readings/T001/2010-03-.csv`
    * ...
* Each CSV file contains three columns:
    * `signal_id`: name or id of the signal.
    * `timestamp`: timestamp of the reading formatted as `%m/%d/%y %H:%M:%S`.
    * `value`: value of the reading.
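
As a minimal sketch of this layout, the snippet below writes a tiny raw-format folder to a temporary directory and reads one monthly file back with plain pandas, parsing the timestamps with the documented format. The turbine name, month and values are made up for illustration, and this is not how `CSVLoader` reads the files internally.

```python
import os
import tempfile

import pandas as pd

# Build a tiny raw-format folder so the example is self-contained.
# The turbine, month and values below are made up for illustration.
base_path = tempfile.mkdtemp()
turbine_path = os.path.join(base_path, 'T001')
os.makedirs(turbine_path)

raw = pd.DataFrame({
    'signal_id': ['S01', 'S02'],
    'timestamp': ['01/10/13 00:00:00', '01/10/13 00:10:00'],
    'value': [323.0, 320.0],
})

# The file name follows the `%Y-%m-.csv` pattern described above.
month = pd.Timestamp('2013-01-01')
file_name = month.strftime('%Y-%m-.csv')
file_path = os.path.join(turbine_path, file_name)
raw.to_csv(file_path, index=False)

# Read the file back and parse the timestamps with the documented format.
sample = pd.read_csv(file_path)
sample['timestamp'] = pd.to_datetime(sample['timestamp'], format='%m/%d/%y %H:%M:%S')
```

Passing the explicit `format` avoids any ambiguity between month-first and day-first dates.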

### Target Times

The previous function also returned a `target_times` variable,
which is a `pandas.DataFrame` with the three expected columns:

* `turbine_id`
* `cutoff_time`
* `target`
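
If you build this table yourself, a quick sanity check can catch a missing column or an unparsed `cutoff_time` early. The sketch below is only an illustration: `check_target_times` is a hypothetical helper written for this example, not part of GreenGuard, and the rows are made up.

```python
import pandas as pd

# Hypothetical, hand-written target times used only for illustration.
example_targets = pd.DataFrame({
    'turbine_id': ['T001', 'T001'],
    'cutoff_time': ['2013-01-12', '2013-01-13'],
    'target': [0, 1],
})
# Cutoff times are plain strings when they come from text, so parse them.
example_targets['cutoff_time'] = pd.to_datetime(example_targets['cutoff_time'])


def check_target_times(table):
    """Hypothetical helper: verify the columns and the cutoff_time dtype."""
    expected = ['turbine_id', 'cutoff_time', 'target']
    missing = [column for column in expected if column not in table.columns]
    if missing:
        raise ValueError('Missing columns: {}'.format(missing))
    if not pd.api.types.is_datetime64_any_dtype(table['cutoff_time']):
        raise TypeError('cutoff_time must be parsed as datetime')
    return True


is_valid = check_target_times(example_targets)
```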

In [7]:
target_times.shape

(353, 3)

In [8]:
target_times.head()

|   | turbine_id | cutoff_time | target |
|---|------------|-------------|--------|
| 0 | T001 | 2013-01-12 | 0 |
| 1 | T001 | 2013-01-13 | 0 |
| 2 | T001 | 2013-01-14 | 0 |
| 3 | T001 | 2013-01-15 | 1 |
| 4 | T001 | 2013-01-16 | 0 |


In [9]:
target_times.target.mean()

0.3002832861189802

In [10]:
target_times.dtypes

turbine_id             object
cutoff_time    datetime64[ns]
target                  int64
dtype: object

## 2. CSVLoader

The readings in raw format can be arbitrarily big, which might make it impossible to load
them into memory all at once.

In order to load them in an efficient way that allows us to solve Machine Learning problems
using them, GreenGuard provides the `greenguard.loaders.CSVLoader` class.

This class is prepared to, given a target times table, explore a collection of raw readings
and extract only the information needed to solve the corresponding problem.

The first step to use it is to create an instance, passing it the path
to where the reading files are stored.

**NOTE**: If you want to use your own dataset instead of the demo version,
all you have to do is make the `readings_path` variable point at the
folder where your CSV files are stored and load your `target_times` table,
making sure to parse the `cutoff_time` column as datetimes:

```python
readings_path = 'path/to/your/data'
target_times = pd.read_csv('path/to/your/target_times.csv', parse_dates=['cutoff_time'])
```

In [12]:
from greenguard.loaders import CSVLoader

readings_path = 'readings'

csv_loader = CSVLoader(readings_path)

Once we have created our instance, we can load the readings needed for our target times
by calling the `load` method with two arguments:

* `target_times (pandas.DataFrame)`: the `target_times` table.
* `window_size (str)`: the size of the training window, as a timedelta specification
  (amount + time unit). This indicates the minimum amount of data that we need to
  load for each target in the `target_times` table.
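
To make the role of `window_size` concrete, here is a minimal pandas sketch of selecting the readings that fall inside the training window of a single target. The data is made up, and treating the window as the half-open interval `[cutoff_time - window_size, cutoff_time)` is an assumption of this example; the loader may handle the boundaries differently.

```python
import pandas as pd

# Made-up hourly readings for a single signal, for illustration only.
readings_example = pd.DataFrame({
    'turbine_id': 'T001',
    'signal_id': 'S01',
    'timestamp': pd.date_range('2013-01-10', periods=72, freq='h'),
    'value': list(range(72)),
})

cutoff_time = pd.Timestamp('2013-01-12')
window_size = pd.Timedelta('1d')

# Keep only the readings inside [cutoff_time - window_size, cutoff_time).
# The half-open interval is an assumption made for this sketch.
window_start = cutoff_time - window_size
in_window = (
    (readings_example['timestamp'] >= window_start)
    & (readings_example['timestamp'] < cutoff_time)
)
window_readings = readings_example[in_window]
```

Only the 24 hourly readings from the day before the cutoff are kept; everything outside the window can be left on disk.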
  
For example, let's load the readings needed for all our `target_times`, using a
`window_size` of one day.

In [17]:
target_times, readings = csv_loader.load(target_times, '1d')

2020-02-10 19:03:18,638 - INFO - csv - Loaded 1298564 readings from turbine T001
2020-02-10 19:03:18,763 - INFO - csv - Loaded 1298564 turbine readings
2020-02-10 19:03:19,115 - INFO - targets - Dropped 2 invalid targets


In [18]:
readings.shape

(1298564, 4)

In [19]:
readings.head()

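|   | turbine_id | signal_id | timestamp | value |
|---|------------|-----------|-----------|-------|
| 0 | T001 | S01 | 2013-01-12 | 294.0 |
| 1 | T001 | S02 | 2013-01-12 | 310.0 |
| 2 | T001 | S03 | 2013-01-12 | 306.0 |
| 3 | T001 | S04 | 2013-01-12 | 303.0 |
| 4 | T001 | S05 | 2013-01-12 | 265.0 |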


In [20]:
readings.dtypes

turbine_id            object
signal_id             object
timestamp     datetime64[ns]
value                float64
dtype: object

We can see how the readings have been loaded with the expected format, including
the four expected columns:

* `turbine_id`: Unique identifier of the turbine which this reading comes from.
* `signal_id`: Unique identifier of the signal which this reading comes from.
* `timestamp (datetime)`: Time at which the reading took place, as a datetime.
* `value (float)`: Numeric value of this reading.

We can also see a message indicating that 2 invalid targets have been dropped.
This is because within our readings there was not enough data to cover the entire
training window for them, so they cannot be included in the final problem
specification.

In [11]:
target_times.shape

(351, 3)

Let's see what happens if we increase the `window_size` to, for example, 30 days.

In [22]:
target_times, readings = csv_loader.load(target_times, '30d')

2020-02-10 19:08:21,859 - INFO - csv - Loaded 1302308 readings from turbine T001
2020-02-10 19:08:21,955 - INFO - csv - Loaded 1302308 turbine readings
2020-02-10 19:08:22,298 - INFO - targets - Dropped 28 invalid targets


We can see that more targets needed to be dropped now, because there was not enough data
to cover their longer training windows.

In [26]:
target_times.shape

(321, 3)

On the other hand, we can see that the loaded readings table is now
a bit bigger, as more data had to be included to properly cover all the
training windows.

In [27]:
readings.shape

(1302308, 4)

## 3. Preprocessing the data

In some cases, if the number of targets is big enough, fitting high frequency data
into memory will still be a challenge.

For these cases, the `CSVLoader` class also supports passing a resampling rule and
an aggregation function specification, so the data is aggregated to a lower sampling
frequency while it is loaded, reducing the amount of space that it occupies in
memory once loaded.
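
The effect of this kind of aggregation can be sketched with plain pandas. The example below resamples made-up hourly readings into 4 hour buckets and averages them per turbine and signal; it only illustrates the concept and is not necessarily how `CSVLoader` implements it.

```python
import pandas as pd

# Made-up hourly readings for two signals over eight hours.
timestamps = pd.date_range('2013-01-27', periods=8, freq='h')
readings_example = pd.DataFrame({
    'turbine_id': 'T001',
    'signal_id': ['S01'] * 8 + ['S02'] * 8,
    'timestamp': list(timestamps) * 2,
    'value': [float(value) for value in range(16)],
})

# Resample each (turbine, signal) series to 4 hour buckets and average them.
resampled = (
    readings_example
    .set_index('timestamp')
    .groupby(['turbine_id', 'signal_id'])['value']
    .resample('4h')
    .mean()
    .reset_index()
)
```

The 16 original rows become 4, one per signal and 4 hour bucket, while the table keeps the same four columns.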

In order to use the resampling feature, we will need to create a new instance
of the `CSVLoader` passing the following new arguments:

* `rule (str)`: Time-delta specification (amount+unit) of the new sampling frequency.
* `aggregation (str or function)`: Aggregation to apply when resampling.

In [29]:
csv_loader = CSVLoader(readings_path, rule='4h', aggregation='mean')

And then call the `load` method normally.

In [30]:
target_times, readings = csv_loader.load(target_times, '14d')

2020-02-10 19:31:50,932 - INFO - csv - Loaded 1235535 readings from turbine T001
2020-02-10 19:31:50,938 - INFO - csv - Resampling: 4h - mean
2020-02-10 19:31:51,459 - INFO - csv - Loaded 52130 turbine readings
2020-02-10 19:31:51,689 - INFO - targets - Dropped 2 invalid targets


In [31]:
readings.shape

(52130, 4)

In [32]:
readings.head()

|   | turbine_id | signal_id | timestamp | value |
|---|------------|-----------|-----------|-------|
| 0 | T001 | S01 | 2013-01-27 00:00:00 | 791.333333 |
| 1 | T001 | S01 | 2013-01-27 04:00:00 | 746.75 |
| 2 | T001 | S01 | 2013-01-27 08:00:00 | 808.75 |
| 3 | T001 | S01 | 2013-01-27 12:00:00 | 760.125 |
| 4 | T001 | S01 | 2013-01-27 16:00:00 | 720.833333 |


In [33]:
target_times.shape

(319, 3)

## 4. Unstacking

Some of the pipelines included in **GreenGuard** expect a slightly different input format,
where the data has been unstacked by `signal_id`, putting the values of each signal in a
different column instead of having all of them in a single column.
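
As a rough illustration of the target format, the following pandas sketch unstacks a small, made-up readings table by `signal_id`. The `value_` column prefix mirrors the loader output shown in this notebook, but the code itself is an assumption for illustration, not the loader's implementation.

```python
import pandas as pd

# Made-up stacked readings: two signals, two timestamps.
stacked = pd.DataFrame({
    'turbine_id': ['T001'] * 4,
    'signal_id': ['S01', 'S02', 'S01', 'S02'],
    'timestamp': pd.to_datetime([
        '2013-01-28 00:00', '2013-01-28 00:00',
        '2013-01-28 04:00', '2013-01-28 04:00',
    ]),
    'value': [715.75, 709.25, 779.5, 777.5],
})

# Move each signal into its own column, one row per turbine and timestamp.
unstacked = stacked.set_index(['turbine_id', 'timestamp', 'signal_id'])['value'].unstack()
unstacked.columns = ['value_' + str(signal) for signal in unstacked.columns]
unstacked = unstacked.reset_index()
```

The four stacked rows collapse into two rows with the columns `turbine_id`, `timestamp`, `value_S01` and `value_S02`.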

In such cases, the `CSVLoader` can also take care of the unstacking step.

For this, all you need to do is add the `unstack=True` argument when creating the instance
and then use the `load` method as usual.

In [34]:
csv_loader = CSVLoader(readings_path, rule='4h', aggregation='mean', unstack=True)
target_times, readings = csv_loader.load(target_times, '14d')

2020-02-10 19:36:03,403 - INFO - csv - Loaded 1228047 readings from turbine T001
2020-02-10 19:36:03,411 - INFO - csv - Resampling: 4h - mean
2020-02-10 19:36:03,881 - INFO - csv - Loaded 1993 turbine readings
2020-02-10 19:36:04,165 - INFO - targets - Dropped 2 invalid targets


In [35]:
readings.shape

(1993, 28)

In [36]:
readings.head()

|   | turbine_id | timestamp | value_S01 | value_S02 | value_S03 | value_S04 | value_S05 | value_S06 | value_S07 | value_S08 | ... | value_S17 | value_S18 | value_S19 | value_S20 | value_S21 | value_S22 | value_S23 | value_S24 | value_S25 | value_S26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T001 | 2013-01-28 00:00:00 | 715.75 | 709.333333 | 710.208333 | 796.666667 | 771.75 | 732.916667 | 766.166667 | 3361627.0 | ... | 13.4875 | 4272212.0 | 49.041667 | 49.041667 | 49.041667 | 49.041667 | 49.041667 | 49.041667 | 49.041667 | 336.0 |
| 1 | T001 | 2013-01-28 04:00:00 | 779.416667 | 777.5 | 779.666667 | 824.125 | 800.083333 | 765.291667 | 791.958333 | 3362652.0 | ... | 14.695833 | 4279238.0 | 43.875 | 43.875 | 43.875 | 43.875 | 43.916667 | 43.875 | 43.916667 | 301.083333 |
| 2 | T001 | 2013-01-28 08:00:00 | 732.583333 | 757.375 | 738.125 | 794.583333 | 765.291667 | 736.541667 | 766.916667 | 3364190.0 | ... | 14.1 | 4289814.0 | 81.666667 | 82.375 | 82.416667 | 82.875 | 82.541667 | 83.25 | 81.416667 | 564.041667 |
| 3 | T001 | 2013-01-28 12:00:00 | 743.833333 | 779.083333 | 775.833333 | 804.208333 | 771.458333 | 736.166667 | 761.0 | 3366258.0 | ... | 13.691667 | 4304198.0 | 88.25 | 90.833333 | 90.875 | 91.5 | 90.166667 | 90.875 | 88.916667 | 616.833333 |
| 4 | T001 | 2013-01-28 16:00:00 | 640.416667 | 678.0 | 675.958333 | 709.166667 | 675.833333 | 670.666667 | 682.166667 | 3368310.0 | ... | 12.454167 | 4318658.0 | 80.458333 | 83.541667 | 85.333333 | 85.916667 | 83.5 | 86.375 | 83.333333 | 574.958333 |
