# Regression Dataset loading and RegressionDataset initialization

This notebook checks that loading a multi-axis regression dataset works as expected and that the
RegressionDataset object (used in Pipelines) behaves correctly.

## Import libraries

In [1]:
from caits.loading import csv_loader_regression
from caits.dataset import RegressionDataset

## Load Dataset

In [2]:
data = csv_loader_regression("../examples/data/AirQuality.csv", X_cols=["Date", "Time", "PT08.S4(NO2)"], y_cols=["C6H6(GT)", "PT08.S1(CO)"], sep=";")
# data = csv_loader_regression("../examples/data/AirQuality.csv", y_cols=["C6H6(GT)", "PT08.S1(CO)"], sep=";")
# data = csv_loader_regression("../examples/data/AirQuality.csv", sep=";")

print(len(data["X"]), len(data["X"][0]))
print(len(data["y"]), len(data["y"][0]))
print(type(data["X"][0]), type(data["y"][0]))

3 9471
2 9471
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
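
The output above suggests that the loader returns a dictionary with keys "X" and "y", each holding a list with one single-column DataFrame per requested column. As a rough illustration, the same structure can be built with plain pandas. The `split_columns` helper and the toy data are hypothetical, and this is not the caits loader itself.

```python
import pandas as pd

# Toy stand-in for a loaded CSV; column names and values are illustrative only.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": ["x", "y", "z"],
    "target": [10.0, 20.0, 30.0],
})

def split_columns(frame, X_cols, y_cols):
    """Hypothetical helper: one single-column DataFrame per requested column."""
    return {
        "X": [frame[[col]] for col in X_cols],
        "y": [frame[[col]] for col in y_cols],
    }

toy = split_columns(df, X_cols=["feature_a", "feature_b"], y_cols=["target"])

# Same checks as in the cell above: number of axes, rows per axis, element type.
print(len(toy["X"]), len(toy["X"][0]))
print(len(toy["y"]), len(toy["y"][0]))
print(type(toy["X"][0]), type(toy["y"][0]))
```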


## Initialize Regression Dataset object

In [3]:
regression_data = RegressionDataset(**data)
regression_data

RegressionDataset with 9471 instances

In [4]:
regression_data.X[0].head()

         Date
0  10/03/2004
1  10/03/2004
2  10/03/2004
3  10/03/2004
4  10/03/2004


In [5]:
regression_data.X[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9471 entries, 0 to 9470
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Date    9357 non-null   object
dtypes: object(1)
memory usage: 74.1+ KB


In [6]:
regression_data.y[0].head()

  C6H6(GT)
0     11,9
1      9,4
2      9,0
3      9,2
4      6,5


In [7]:
regression_data.y[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9471 entries, 0 to 9470
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   C6H6(GT)  9357 non-null   object
dtypes: object(1)
memory usage: 74.1+ KB


## Loop over the Regression Dataset object

In [8]:
print("---------------------------------------------------")
counter = 0
for X_row, y_row in regression_data:
    # Uncomment the next two lines to inspect every instance.
    # print(X_row, "\n\n", y_row)
    # print("---------------------------------------------------")
    counter += 1

print(counter)


---------------------------------------------------
9471


## Convert the Regression Dataset to a dictionary

In [9]:
reg_dict = regression_data.to_dict()

In [10]:
reg_dict.keys()

dict_keys(['X', 'y'])

In [11]:
print(reg_dict["X"][0].head())
print()
print(reg_dict["y"][0].head())

         Date
0  10/03/2004
1  10/03/2004
2  10/03/2004
3  10/03/2004
4  10/03/2004

  C6H6(GT)
0     11,9
1      9,4
2      9,0
3      9,2
4      6,5
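
The C6H6(GT) values printed above use a decimal comma (for example 11,9), which explains why the column is reported with dtype object rather than float64. A minimal sketch of two ways to get numeric values with plain pandas follows. Whether csv_loader_regression exposes an equivalent option is not shown in this notebook, so the sketch bypasses it.

```python
import io
import pandas as pd

# Values copied from the C6H6(GT) output above; they are strings with a decimal comma.
raw = pd.Series(["11,9", "9,4", "9,0", "9,2", "6,5"], name="C6H6(GT)")

# Option 1: convert an already loaded column.
converted = pd.to_numeric(raw.str.replace(",", ".", regex=False))
print(converted.dtype)

# Option 2: parse the decimal comma while reading the file with plain pandas.
csv_text = "C6H6(GT)\n11,9\n9,4\n9,0\n"
parsed = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")
print(parsed["C6H6(GT)"].dtype)
```

Option 1 works on data that is already in memory, while option 2 avoids the intermediate string column altogether.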


## Index the Regression Dataset

In [12]:
print(*regression_data[100], sep="\n\n")

[Date    14/03/2004
Name: 100, dtype: object, Time    22.00.00
Name: 100, dtype: object, PT08.S4(NO2)    1587.0
Name: 100, dtype: float64]

[C6H6(GT)    9,0
Name: 100, dtype: object, PT08.S1(CO)    1324.0
Name: 100, dtype: float64]
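
The loop and the integer indexing shown above both yield a pair: a list with one row per X axis and a list with one row per y axis. A toy class with the same observable behaviour could look as follows. `ToyDataset` is hypothetical and only mimics the outputs seen in this notebook. It is not the caits implementation.

```python
import pandas as pd

class ToyDataset:
    """Hypothetical stand-in that mimics the indexing behaviour seen above."""

    def __init__(self, X, y):
        self.X = X  # list of single-column DataFrames
        self.y = y  # list of single-column DataFrames

    def __len__(self):
        # All axes share the same number of rows.
        return len(self.X[0])

    def __getitem__(self, idx):
        # One pandas Series per axis, for X and for y.
        return [df.iloc[idx] for df in self.X], [df.iloc[idx] for df in self.y]

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]

ds = ToyDataset(
    X=[pd.DataFrame({"a": [1, 2, 3]}), pd.DataFrame({"b": [4, 5, 6]})],
    y=[pd.DataFrame({"t": [7, 8, 9]})],
)

X_row, y_row = ds[1]
print(X_row[0]["a"], X_row[1]["b"], y_row[0]["t"])
print(sum(1 for _ in ds))
```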


## Slice the Regression Dataset

In [13]:
regression_data[20:30].X

[          Date
 20  11/03/2004
 21  11/03/2004
 22  11/03/2004
 23  11/03/2004
 24  11/03/2004
 25  11/03/2004
 26  11/03/2004
 27  11/03/2004
 28  11/03/2004
 29  11/03/2004
 30  12/03/2004,
         Time
 20  14.00.00
 21  15.00.00
 22  16.00.00
 23  17.00.00
 24  18.00.00
 25  19.00.00
 26  20.00.00
 27  21.00.00
 28  22.00.00
 29  23.00.00
 30  00.00.00,
     PT08.S4(NO2)
 20        1730.0
 21        1647.0
 22        1591.0
 23        1719.0
 24        2083.0
 25        2333.0
 26        2191.0
 27        1707.0
 28        1333.0
 29        1252.0
 30        1375.0]
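
Note that the slice 20:30 above returns rows 20 through 30, i.e. 11 rows, whereas a Python list slice would stop at 29. This matches the label-based slicing semantics of pandas. Whether RegressionDataset relies on .loc internally is an assumption. The difference can be reproduced with plain pandas:

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})

label_slice = df.loc[20:30]      # label-based: both endpoints are included
position_slice = df.iloc[20:30]  # position-based: the stop is excluded

print(len(label_slice), len(position_slice))  # 11 10
```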

## Split the Dataset into train/test

In [14]:
train_data, test_data = regression_data.train_test_split(test_size=0.2, random_state=42)
print(train_data, test_data, sep="\n")
print(*train_data.X, sep="\n")
print(*train_data.y, sep="\n")


RegressionDataset with 7576 instances
RegressionDataset with 1895 instances
            Date
774   12/04/2004
4625  19/09/2004
6214  24/11/2004
6465  05/12/2004
2362  17/06/2004
...          ...
4783  26/09/2004
5208  13/10/2004
3232  23/07/2004
5704  03/11/2004
9129  26/03/2005

[7576 rows x 1 columns]
          Time
774   00.00.00
4625  11.00.00
6214  16.00.00
6465  03.00.00
2362  04.00.00
...        ...
4783  01.00.00
5208  18.00.00
3232  10.00.00
5704  10.00.00
9129  03.00.00

[7576 rows x 1 columns]
      PT08.S4(NO2)
774         1094.0
4625        1485.0
6214        1582.0
6465        1105.0
2362        1479.0
...            ...
4783        1413.0
5208        1370.0
3232        1925.0
5704        1786.0
9129        1360.0

[7576 rows x 1 columns]
     C6H6(GT)
774       1,6
4625     10,2
6214     17,4
6465      3,0
2362      3,3
...       ...
4783     11,7
5208     10,1
3232     15,5
5704     21,0
9129      6,4

[7576 rows x 1 columns]
      PT08.S1(CO)
774         840.0
4625    

## Print length of train/test splits

In [15]:
print(len(train_data), len(test_data))

7576 1895


In [16]:
print(len(data["X"][0]), len(train_data) + len(test_data))

9471 9471
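
The split sizes above (7576 and 1895) are consistent with rounding the test share up: ceil(0.2 * 9471) = 1895. A small standard-library sketch of such a split is given below. The `split_indices` helper is hypothetical, and because it uses a different random generator it does not reproduce the row order printed in this notebook.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Hypothetical helper: shuffle row positions and cut off a test share."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded, hence reproducible
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(9471, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 7576 1895
```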


## Unify

In [17]:
print(*train_data.X, sep="\n")
print()
print(*train_data.y, sep="\n")
print("------------------------------------")
print(*test_data.X, sep="\n")
print()
print(*test_data.y, sep="\n")


            Date
774   12/04/2004
4625  19/09/2004
6214  24/11/2004
6465  05/12/2004
2362  17/06/2004
...          ...
4783  26/09/2004
5208  13/10/2004
3232  23/07/2004
5704  03/11/2004
9129  26/03/2005

[7576 rows x 1 columns]
          Time
774   00.00.00
4625  11.00.00
6214  16.00.00
6465  03.00.00
2362  04.00.00
...        ...
4783  01.00.00
5208  18.00.00
3232  10.00.00
5704  10.00.00
9129  03.00.00

[7576 rows x 1 columns]
      PT08.S4(NO2)
774         1094.0
4625        1485.0
6214        1582.0
6465        1105.0
2362        1479.0
...            ...
4783        1413.0
5208        1370.0
3232        1925.0
5704        1786.0
9129        1360.0

[7576 rows x 1 columns]

     C6H6(GT)
774       1,6
4625     10,2
6214     17,4
6465      3,0
2362      3,3
...       ...
4783     11,7
5208     10,1
3232     15,5
5704     21,0
9129      6,4

[7576 rows x 1 columns]
      PT08.S1(CO)
774         840.0
4625       1083.0
6214       1374.0
6465        884.0
2362        804.0
...        

In [18]:
train_data.unify(test_data)
print(*train_data.unify(test_data).X, sep="\n")

     Date
0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
...   ...
9466  NaN
9467  NaN
9468  NaN
9469  NaN
9470  NaN

[9471 rows x 1 columns]
     Time
0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
...   ...
9466  NaN
9467  NaN
9468  NaN
9469  NaN
9470  NaN

[9471 rows x 1 columns]
      PT08.S4(NO2)
0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
...            ...
9466           NaN
9467           NaN
9468           NaN
9469           NaN
9470           NaN

[9471 rows x 1 columns]
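
The unified output above shows only NaN in the displayed rows, so the original values do not appear to be carried over in this run. The cause cannot be determined from the output alone. For comparison, a minimal pandas-only sketch of recombining two shuffled splits into the original row order is given below. It uses toy data and is not the caits unify implementation.

```python
import pandas as pd

# Toy data standing in for one axis of the dataset.
full = pd.DataFrame({"value": [10.0, 11.0, 12.0, 13.0, 14.0]})

# Shuffled, disjoint splits that keep their original index labels.
train = full.loc[[3, 0, 4]]
test = full.loc[[2, 1]]

# Stack the splits and restore the original order via the index labels.
merged = pd.concat([train, test]).sort_index()
print(merged)
print(merged["value"].isna().sum())  # number of missing values after merging
```

Because pd.concat keeps the index labels of each split, sorting by index is enough to recover the original ordering without introducing missing values.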
