# Data Loading
This tutorial will show different ways of loading data sets and how we can process different data types such as date and time.

In MOGPTK there are two data structures that are important, the `Data` and `DataSet` classes. The `Data` class is the basic component that holds all data and related information for a single channel. For example, it contains the `X` and `Y` coordinates of our input data, which coordinates are for training and testing, the name of the channel, the labels for the X and Y axes, the latent function of our data (if exists), data from predictions, data transformations (see tutorial about data preparation), and data formatters (discussed in this tutorial).

The `DataSet` class is essentially an array of `Data` instances and thus represents a complete data set with multiple output (and input) channels. The separate `Data` instances can be obtained by indexing (e.g. `dataset[0]` returns the first data channel) and `DataSet` contains convenient functions over all channels.

In [1]:
import mogptk
import pandas as pd

## Loading from pandas data frames
We will be using the [LoadDataFrame](https://games-uchile.github.io/MultiOutputGP-Toolkit/data.html#mogptk.data.LoadDataFrame) to load the airline passenger data. Loading data from DataFrames is a versatile way because DataFrames provides functions for loading from CSV, Excel, JSON, SQL, and more.

First we will inspect how the first lines of this file look like:

In [2]:
with open('data/Airline_passenger.csv') as f:
    for i in range(5):
        print(f.readline(), end='')

0.000000000000000000e+00 1.120000000000000000e+02
1.000000000000000000e+00 1.180000000000000000e+02
2.000000000000000000e+00 1.320000000000000000e+02
3.000000000000000000e+00 1.290000000000000000e+02
4.000000000000000000e+00 1.210000000000000000e+02


We note that there are no column names in the first row, nor are columns separated by commas which usually is the case with CSV (comma-separated values) files. Our function is able to load this CSV file, but we have to explicitly pass `sep=' '` to tell that columns are separated by spaces. We also pass the `names=['time','passengers]` to set the column names explicitly as they cannot be extracted from the data.

In [3]:
df = pd.read_table('data/Airline_passenger.csv', names=['time','passengers'], sep=' ')
df

Unnamed: 0,time,passengers
0,0.0,112.0
1,1.0,118.0
2,2.0,132.0
3,3.0,129.0
4,4.0,121.0
...,...,...
139,139.0,606.0
140,140.0,508.0
141,141.0,461.0
142,142.0,390.0


Given the DataFrame we can load this into a `DataSet` as follows. This function will by default load the first column as the `X` axis and the second column as the `Y` axis. Using the `dtypes` of the data frame columns it will automatically load `datetime` and similar `dtypes` as numbers, since GPs can only train numbers.

In [4]:
data = mogptk.LoadDataFrame(df)
data

{}


      time  passengers
0      0.0       112.0
1      1.0       118.0
2      2.0       132.0
3      3.0       129.0
4      4.0       121.0
..     ...         ...
139  139.0       606.0
140  140.0       508.0
141  141.0       461.0
142  142.0       390.0
143  143.0       432.0

[144 rows x 2 columns]

## Selecting columns and loading datetime values
Here we will use the air quality data set which includes column names as well as date and time values. First we inspect the first five lines:

In [5]:
with open('data/AirQualityUCI.csv') as f:
    for i in range(5):
        print(f.readline(), end='')

Date;Time;CO(GT);PT08.S1(CO);NMHC(GT);C6H6(GT);PT08.S2(NMHC);NOx(GT);PT08.S3(NOx);NO2(GT);PT08.S4(NO2);PT08.S5(O3);T;RH;AH;;
10/03/2004;18.00.00;2.6;1360;150;11.9;1046;166;1056;113;1692;1268;13.6;48.9;0.7578;;
10/03/2004;19.00.00;2;1292;112;9.4;955;103;1174;92;1559;972;13.3;47.7;0.7255;;
10/03/2004;20.00.00;2.2;1402;88;9.0;939;131;1140;114;1555;1074;11.9;54.0;0.7502;;
10/03/2004;21.00.00;2.2;1376;80;9.2;948;172;1092;122;1584;1203;11.0;60.0;0.7867;;


We note that that the separator between the columns is a semicolon `;`, so this will be our separator for `pandas.read_table`. There is no need to set the `names` parameter since all the column names are given in the first row of the data file.

In [6]:
df = pd.read_table('data/AirQualityUCI.csv', sep=';')

Data loading will automatically try and parse the `Date` column to see if it can be parsed as a datetime type. However we can also explicitly set the DataFrame column's dtype to `datetime64`. The toolkit will automatically recognize this and convert the date time values to numbers which is needed for training.

In [10]:
df['Date'] = pd.to_datetime(df['Date'])

Next we will load the data frame and set our input dimension to the `Date` column, and our output dimensions to the `CO(GT)` and `PT08.S1(CO)` columns. That means, we will have two channels `CO(GT)` and `PT08.S1(CO)` that share the same `X` coordinates.

In [11]:
data = mogptk.LoadDataFrame(df, x_col='Date', y_col=['CO(GT)', 'PT08.S1(CO)'])
data

{'time': <mogptk.data.FormatNumber object at 0x7f21d8233110>, 'passengers': <mogptk.data.FormatNumber object at 0x7f21d8233590>, 'Date': <mogptk.data.FormatDateTime object at 0x7f218d09a490>, 'CO(GT)': <mogptk.data.FormatNumber object at 0x7f21d8b34650>, 'PT08.S1(CO)': <mogptk.data.FormatNumber object at 0x7f218d09a190>}


TypeError: Parser must be a string or character stream, not datetime64

If the values are not automatically recognized by their types, it may be set explicitly using `formats`. The default formatter is `FormatNumber` which will interpret any numeric value (such as a `float` or `int`). The `FormatDateTime` will parse date time values such as `25-10-2002` or `25-10-2002 15:59:13` and convert them internally to numbers.

In [9]:
data = mogptk.LoadDataFrame(df, x_col='Date', y_col=['CO(GT)', 'PT08.S1(CO)'], formats={'Date': mogptk.FormatDateTime})
data

{'Date': <class 'mogptk.data.FormatDateTime'>}


              Date  CO(GT)
0     1.073174e+09     1.6
1     1.073174e+09     2.0
2     1.073174e+09     2.0
3     1.073174e+09     2.5
4     1.073174e+09     4.9
...            ...     ...
9352  1.133568e+09     0.6
9353  1.133568e+09     0.7
9354  1.133568e+09     0.6
9355  1.133568e+09     0.4
9356  1.133568e+09  -200.0

[9357 rows x 2 columns]
              Date  PT08.S1(CO)
0     1.073174e+09       1143.0
1     1.073174e+09       1203.0
2     1.073174e+09       1186.0
3     1.073174e+09       1192.0
4     1.073174e+09       1536.0
...            ...          ...
9352  1.133568e+09       1136.0
9353  1.133568e+09       1180.0
9354  1.133568e+09       1131.0
9355  1.133568e+09       1152.0
9356  1.133568e+09        992.0

[9357 rows x 2 columns]