# Regression Dataset loading and RegressionDataset initialization

This notebook checks that a multi-axis regression dataset loads as expected and that the `RegressionDataset` object
(used in Pipelines) behaves as expected.

## Import libraries

In [1]:
from caits.loading import csv_loader_regression
from caits.dataset import RegressionDataset

## Load Dataset

In [2]:
data = csv_loader_regression(
    "../examples/data/AirQuality.csv",
    X_cols=["Date", "Time", "PT08.S4(NO2)"],
    y_cols=["C6H6(GT)", "PT08.S1(CO)"],
    sep=";",
)
# Other call variants, left commented out:
# data = csv_loader_regression("../examples/data/AirQuality.csv", y_cols=["C6H6(GT)", "PT08.S1(CO)"], sep=";")
# data = csv_loader_regression("../examples/data/AirQuality.csv", sep=";")
print(data["X"].head())
print()
print(data["y"].head())

         Date      Time  PT08.S4(NO2)
0  10/03/2004  18.00.00        1692.0
1  10/03/2004  19.00.00        1559.0
2  10/03/2004  20.00.00        1555.0
3  10/03/2004  21.00.00        1584.0
4  10/03/2004  22.00.00        1490.0

  C6H6(GT)  PT08.S1(CO)
0     11,9       1360.0
1      9,4       1292.0
2      9,0       1402.0
3      9,2       1376.0
4      6,5       1272.0
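
The `C6H6(GT)` values above use a comma as the decimal separator, so they are read as text rather than as numbers. A minimal sketch of two ways to obtain floats with plain pandas follows. The two-row sample is hypothetical, and whether `csv_loader_regression` accepts a `decimal` argument is not shown in this notebook, so the example calls `pandas.read_csv` directly.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout: ";" separates fields, "," marks decimals.
text = "C6H6(GT);PT08.S1(CO)\n11,9;1360\n9,4;1292\n"

# Without a decimal hint, "11,9" stays a string.
raw = pd.read_csv(io.StringIO(text), sep=";")

# Option 1: tell the parser that "," is the decimal separator.
parsed = pd.read_csv(io.StringIO(text), sep=";", decimal=",")

# Option 2: convert an already-loaded text column after the fact.
converted = pd.to_numeric(raw["C6H6(GT)"].str.replace(",", ".", regex=False))

print(raw["C6H6(GT)"].tolist())
print(parsed["C6H6(GT)"].tolist())
print(converted.tolist())
```

Option 2 is the one that applies to a DataFrame that has already been loaded, such as `data["y"]` here.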


## Initialize Regression Dataset object

In [3]:
regression_data = RegressionDataset(**data)
regression_data

RegressionDataset with 9471 instances

In [4]:
regression_data.X.head()

         Date      Time  PT08.S4(NO2)
0  10/03/2004  18.00.00        1692.0
1  10/03/2004  19.00.00        1559.0
2  10/03/2004  20.00.00        1555.0
3  10/03/2004  21.00.00        1584.0
4  10/03/2004  22.00.00        1490.0


In [5]:
regression_data.X.info  # accessed without (), so the bound method's repr is displayed; call .info() for the summary

<bound method DataFrame.info of             Date      Time  PT08.S4(NO2)
0     10/03/2004  18.00.00        1692.0
1     10/03/2004  19.00.00        1559.0
2     10/03/2004  20.00.00        1555.0
3     10/03/2004  21.00.00        1584.0
4     10/03/2004  22.00.00        1490.0
...          ...       ...           ...
9466         NaN       NaN           NaN
9467         NaN       NaN           NaN
9468         NaN       NaN           NaN
9469         NaN       NaN           NaN
9470         NaN       NaN           NaN

[9471 rows x 3 columns]>

In [6]:
regression_data.y.head()

  C6H6(GT)  PT08.S1(CO)
0     11,9       1360.0
1      9,4       1292.0
2      9,0       1402.0
3      9,2       1376.0
4      6,5       1272.0


In [7]:
regression_data.y.info  # accessed without (), so the bound method's repr is displayed; call .info() for the summary

<bound method DataFrame.info of      C6H6(GT)  PT08.S1(CO)
0        11,9       1360.0
1         9,4       1292.0
2         9,0       1402.0
3         9,2       1376.0
4         6,5       1272.0
...       ...          ...
9466      NaN          NaN
9467      NaN          NaN
9468      NaN          NaN
9469      NaN          NaN
9470      NaN          NaN

[9471 rows x 2 columns]>
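
Both frames above end with rows in which every column is NaN. A minimal sketch, on hypothetical toy frames, of dropping such rows with plain pandas while keeping `X` and `y` aligned follows. Whether `RegressionDataset` offers its own method for this is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames whose last row is entirely missing, as in the output above.
X = pd.DataFrame(
    {
        "Date": ["10/03/2004", "10/03/2004", np.nan],
        "PT08.S4(NO2)": [1692.0, 1559.0, np.nan],
    }
)
y = pd.DataFrame({"PT08.S1(CO)": [1360.0, 1292.0, np.nan]})

# A row is dropped only if it is empty in both X and y.
empty_in_both = X.isna().all(axis=1) & y.isna().all(axis=1)
X_clean = X[~empty_in_both]
y_clean = y[~empty_in_both]

print(len(X), len(X_clean))
```

Applying one shared boolean mask to both frames keeps their indexes identical, which matters if the frames are later combined column-wise.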

## Loop over the Regression Dataset object

In [8]:
print("---------------------------------------------------")
for X_row, y_row in regression_data:
    # print(X_row, "\n\n", y_row)
    # print("---------------------------------------------------")
    pass


---------------------------------------------------
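
The loop above unpacks each item into a feature row and a target row. As a rough illustration of how a container can support that pattern, here is a hypothetical stand-in class; it is not the caits implementation, and the name `PairedFrames` is invented for this sketch.

```python
import pandas as pd


class PairedFrames:
    """Hypothetical stand-in that pairs feature rows with target rows."""

    def __init__(self, X: pd.DataFrame, y: pd.DataFrame):
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of rows")
        self.X = X
        self.y = y

    def __len__(self) -> int:
        return len(self.X)

    def __iter__(self):
        # Yield one (X_row, y_row) pair of pandas Series per position.
        for i in range(len(self)):
            yield self.X.iloc[i], self.y.iloc[i]


pairs = PairedFrames(
    pd.DataFrame({"a": [1.0, 2.0]}),
    pd.DataFrame({"t": [10.0, 20.0]}),
)
rows = [(X_row["a"], y_row["t"]) for X_row, y_row in pairs]
print(rows)
```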


## Unify X and y into a single DataFrame

In [9]:
regression_data.unify().head()

         Date      Time  PT08.S4(NO2) C6H6(GT)  PT08.S1(CO)
0  10/03/2004  18.00.00        1692.0      NaN          NaN
1  10/03/2004  19.00.00        1559.0      NaN          NaN
2  10/03/2004  20.00.00        1555.0      NaN          NaN
3  10/03/2004  21.00.00        1584.0      NaN          NaN
4  10/03/2004  22.00.00        1490.0      NaN          NaN
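
The target columns in the unified output above contain no values, although `regression_data.y` does. How `unify()` combines the frames is not shown in this notebook. The sketch below only illustrates, on hypothetical toy frames, how a column-wise `pandas.concat` behaves: it aligns rows on the index, so mismatched indexes are one common source of missing values. Whether that is the cause here is an assumption that would need checking against the caits source.

```python
import pandas as pd

# Hypothetical toy frames sharing the same index.
X = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})
y = pd.DataFrame({"target": [10.0, 20.0, 30.0]})

# With matching indexes, the columns line up row by row.
unified = pd.concat([X, y], axis=1)

# With different index labels, concat aligns on labels and fills the gaps with NaN.
y_shifted = y.copy()
y_shifted.index = [10, 11, 12]
misaligned = pd.concat([X, y_shifted], axis=1)

print(unified)
print(misaligned)
```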


## Convert the Regression Dataset to a dictionary

In [10]:
reg_dict = regression_data.to_dict()

In [11]:
reg_dict.keys()

dict_keys(['X', 'y'])

In [12]:
print(reg_dict["X"].head())
print()
print(reg_dict["y"].head())

         Date      Time  PT08.S4(NO2)
0  10/03/2004  18.00.00        1692.0
1  10/03/2004  19.00.00        1559.0
2  10/03/2004  20.00.00        1555.0
3  10/03/2004  21.00.00        1584.0
4  10/03/2004  22.00.00        1490.0

  C6H6(GT)  PT08.S1(CO)
0     11,9       1360.0
1      9,4       1292.0
2      9,0       1402.0
3      9,2       1376.0
4      6,5       1272.0


## Index the Regression Dataset

In [13]:
print(*regression_data[100], sep="\n\n")

Date            14/03/2004
Time              22.00.00
PT08.S4(NO2)        1587.0
Name: 100, dtype: object

C6H6(GT)          9,0
PT08.S1(CO)    1324.0
Name: 100, dtype: object


## Slice the Regression Dataset

In [14]:
regression_data[20:30]

RegressionDataset with 11 instances
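
The slice `20:30` reports 11 instances rather than 10, which suggests that the end of the slice is included, as with label-based slicing in pandas. This is an inference from the output, not something confirmed from the caits source. The difference between the two pandas slicing styles can be sketched on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..99.
frame = pd.DataFrame({"value": range(100)})

by_label = frame.loc[20:30]      # label-based: both endpoints are included
by_position = frame.iloc[20:30]  # position-based: the end is excluded, as in Python lists

print(len(by_label), len(by_position))
```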

In [15]:
regression_data[20:30].unify().head()

          Date      Time  PT08.S4(NO2) C6H6(GT)  PT08.S1(CO)
20  11/03/2004  14.00.00        1730.0      NaN          NaN
21  11/03/2004  15.00.00        1647.0      NaN          NaN
22  11/03/2004  16.00.00        1591.0      NaN          NaN
23  11/03/2004  17.00.00        1719.0      NaN          NaN
24  11/03/2004  18.00.00        2083.0      NaN          NaN


## Split the Dataset into train/test sets

In [16]:
train_data, test_data = regression_data.train_test_split(test_size=0.2)
print(train_data)
print(test_data)

RegressionDataset with 7576 instances
RegressionDataset with 1895 instances
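
The reported sizes are consistent with rounding the test share up and assigning the remainder to the training set, which is the convention scikit-learn's `train_test_split` follows. Whether `RegressionDataset.train_test_split` delegates to scikit-learn is an assumption; the arithmetic itself can be checked directly:

```python
import math

n_samples = 9471  # number of instances reported for the full dataset
test_size = 0.2

# 9471 * 0.2 = 1894.2, which rounds up to the next whole instance.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)
```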


## Print length of train/test splits

In [17]:
print(len(train_data), len(test_data))

7576 1895


In [18]:
print(len(data["X"]), len(train_data) + len(test_data))

9471 9471
