# Transformers and Pipelines test on DatasetArray object

In this notebook we test the transformers in `caits.transformers`, as well as scikit-learn Pipelines built from them, on a `DatasetArray` object.


## Importing libraries

In [1]:
import pandas as pd
from caits.filtering import filter_butterworth
from caits.fe import mean_value, std_value, stft, istft, melspectrogram
from caits.dataset._dataset3 import CaitsArray, DatasetArray
from caits.transformers._func_transformer_v2 import FunctionTransformer
from caits.transformers._feature_extractor_v2 import FeatureExtractor
from caits.transformers._func_transformer_2d_v2 import FunctionTransformer2D
from caits.transformers._feature_extractor_2d_v2 import FeatureExtractor2D
from caits.transformers._sliding_window_v2 import SlidingWindow

## Dataset loading

For this notebook, we use the `data/AirQuality.csv` dataset.

In [2]:
data = pd.read_csv("data/AirQuality.csv", sep=";", decimal=",")
data_X = data.iloc[:, 2:-4]
data_X = data_X.fillna(data_X.mean())
data_y = data.iloc[:, -4:-2]
data_y = data_y.fillna(data_y.mean())

In [3]:
data_X_vals = data_X.values
data_X_axis_names = {"axis_1": {name: i for i, name in enumerate(data_X.columns)}}
data_y_vals = data_y.values
data_y_axis_names = {"axis_1": {name: i for i, name in enumerate(data_y.columns)}}
data_X = CaitsArray(values=data_X_vals, axis_names=data_X_axis_names)
data_y = CaitsArray(values=data_y_vals, axis_names=data_y_axis_names)
datasetArrayObj = DatasetArray(data_X, data_y)

## FunctionTransformer

This transformer applies a function to the `X` attribute of the `DatasetArray` object and returns a `CaitsArray` of the same shape.

We test `caits.transformers.FunctionTransformer` using the `caits.filtering.filter_butterworth` function.
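The shape-preserving behaviour described above can be illustrated without `caits`. The following is a minimal, dependency-free sketch: `moving_average` and `apply_columnwise` are hypothetical helpers written for this illustration, and the moving average is only a stand-in for a smoothing filter, not the Butterworth filter that `filter_butterworth` implements.

```python
def moving_average(column, width=3):
    """Stand-in smoothing function: centred moving average, same length as input."""
    half = width // 2
    out = []
    for i in range(len(column)):
        # Near the edges the window is truncated, so the output length is unchanged.
        window = column[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def apply_columnwise(rows, func, **kwargs):
    """Apply func to every column of a row-major table and keep the table shape."""
    columns = [list(col) for col in zip(*rows)]
    transformed = [func(col, **kwargs) for col in columns]
    return [list(row) for row in zip(*transformed)]


# First five rows of CO(GT) and PT08.S1(CO) from the dataset above.
X = [[2.6, 1360.0], [2.0, 1292.0], [2.2, 1402.0], [2.2, 1376.0], [1.6, 1272.0]]
X_smooth = apply_columnwise(X, moving_average, width=3)

shape_before = (len(X), len(X[0]))
shape_after = (len(X_smooth), len(X_smooth[0]))
print(shape_before, shape_after)
```

The point of the sketch is only that the transformed table has the same number of rows and columns as the input, which is the property checked in the cells below.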


In [4]:
functionTransformer = FunctionTransformer(filter_butterworth, fs=200, filter_type='lowpass', cutoff_freq=50)
transformedArray = functionTransformer.fit_transform(datasetArrayObj)

In [5]:
datasetArrayObj.X

                CO(GT)         PT08.S1(CO)             NMHC(GT)            C6H6(GT)      PT08.S2(NMHC)            NOx(GT)  \
   0               2.6              1360.0                150.0                11.9             1046.0              166.0  
   1               2.0              1292.0                112.0                 9.4              955.0              103.0  
   2               2.2              1402.0                 88.0                 9.0              939.0              131.0  
   3               2.2              1376.0                 80.0                 9.2              948.0              172.0  
   4               1.6              1272.0                 51.0                 6.5              836.0              131.0  
 ...               ...                 ...                  ...                 ...                ...                ...  
9466  -34.207523778989  1048.9900609169606  -159.09009297851875  1.8656834455487867  894.5952762637597  168.6169712514695  
9467  -

In [6]:
transformedArray.X

                  CO(GT)         PT08.S1(CO)             NMHC(GT)            C6H6(GT)       PT08.S2(NMHC)             NOx(GT)  \
   0   2.600272319669329  1360.0048242090925   150.01459442014138    11.9003076656988  1045.9967266517287  165.99348725991996  
   1  -1.420593185721463  1349.2114462959894   114.21039387234133   9.880820965996692   971.8875957681379  121.05917607668276  
   2  2.1179588334242263  1366.2579340007192    90.11432905570959   9.048350444158793   942.9659888266447  131.59650643638588  
   3  7.2455165546245945  1359.5097783999854    73.37218810168771   8.493980117895351   920.3551721055646    154.425981261632  
   4  1.7061240496375671  1290.0090330180367    55.14793984613857   6.920656043476639   850.6661770831126  132.42904581122934  
 ...                 ...                 ...                  ...                 ...                 ...                 ...  
9466  -34.20752377898899  1048.9900609169604   -159.0900929785187  1.8656834455487856   894.59527626375

In [7]:
datasetArrayObj.y

                     RH                  AH  
   0               48.9              0.7578  
   1               47.7              0.7255  
   2               54.0              0.7502  
   3               60.0              0.7867  
   4               59.6              0.7888  
 ...                ...                 ...  
9466  39.48537992946458  -6.837603644330447  
9467  39.48537992946458  -6.837603644330447  
9468  39.48537992946458  -6.837603644330447  
9469  39.48537992946458  -6.837603644330447  
9470  39.48537992946458  -6.837603644330447  

CaitsArray with shape (9471, 2)

In [8]:
transformedArray.y

                      RH                  AH  
   0   48.89877956947963  0.7577997353557496  
   1  49.215686988358584  0.7342169806437069  
   2   53.44762785706151  0.7472629765094313  
   3   58.92329987167203  0.7807942388828646  
   4  60.651428752417154   0.794322577154374  
 ...                 ...                 ...  
9466   39.48537992946457  -6.837603644330446  
9467   39.48537992946457  -6.837603644330446  
9468   39.48537992946457  -6.837603644330445  
9469   39.48537992946457  -6.837603644330446  
9470   39.48537992946457  -6.837603644330446  

CaitsArray with shape (9471, 2)

## FeatureExtractor

This transformer extracts one value per feature function for each column (or for each row, if `axis=1`) of every instance in `DatasetArray.X`.

We test `caits.transformers.FeatureExtractor` using `caits.fe.mean_value` and `caits.fe.std_value`.
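As a rough, library-independent sketch of what per-column feature extraction produces: each feature function reduces a column to a single number, so two functions over `n` columns give a table of shape `(2, n)`. The helper `extract_features` is hypothetical; `statistics.pstdev` is the population standard deviation, which corresponds to `ddof=0`.

```python
from statistics import fmean, pstdev


def extract_features(rows, features):
    """Reduce every column of a row-major table with each (name, func) pair."""
    columns = list(zip(*rows))
    return {name: [func(col) for col in columns] for name, func in features}


# First five rows of CO(GT) and PT08.S1(CO) from the dataset above.
X = [[2.6, 1360.0], [2.0, 1292.0], [2.2, 1402.0], [2.2, 1376.0], [1.6, 1272.0]]

features_table = extract_features(
    X,
    [("mean_value", fmean), ("std_value", pstdev)],  # pstdev: ddof=0
)

for name, values in features_table.items():
    print(name, values)
```

The result has one row per feature function and one entry per column, mirroring the `(2, 11)` shape reported for the full dataset below.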

In [9]:
featureExtractor = FeatureExtractor([
    {
        "func": mean_value,
        "params": {}
    },
    {
        "func": std_value,
        "params": {
            "ddof": 0
        }
    }
])

In [10]:
tmp = featureExtractor.fit_transform(datasetArrayObj)
tmp

DatasetArray object with 2 instances.

In [11]:
tmp.X

                       CO(GT)         PT08.S1(CO)             NMHC(GT)            C6H6(GT)      PT08.S2(NMHC)             NOx(GT)  \
mean_value   -34.207523778989  1048.9900609169606  -159.09009297851875  1.8656834455487865  894.5952762637597   168.6169712514695  
 std_value  77.18426094286016  327.82412536597025    138.9378182970468    41.1282131087734  340.2485424943651  255.86616950626888  

                 PT08.S3(NOx)             NO2(GT)        PT08.S4(NO2)        PT08.S5(O3)                   T  
mean_value  794.9901677888212   58.14887250187026  1391.4796409105484  975.0720316340708   9.778305012290264  
 std_value  320.0327052589704  126.16742509610036  464.36495185057805  454.1555648716221  42.940525662335475  

CaitsArray with shape (2, 11)

In [12]:
tmp.y

                           RH                  AH  
mean_value  39.48537992946458  -6.837603644330447  
 std_value  50.90425366174098   38.73931367013947  

CaitsArray with shape (2, 2)

## FeatureExtractor2D

This transformer is mainly used for extracting 2D features per column of `DatasetArray.X`.

We test this using the `caits.fe.melspectrogram` and `caits.fe.stft`.
Applying each of these functions will transform the `CaitsArray` of `DatasetArray.X` into a 3D `CaitsArray`.
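The size of the third axis can be estimated with simple arithmetic. The sketch below assumes that `caits.fe.stft` and `caits.fe.melspectrogram` follow the librosa-style convention of centred framing, where the number of frames is `1 + n_samples // hop_length` and a one-sided STFT has `1 + n_fft // 2` frequency bins. This is an assumption about `caits`, not something stated in this notebook, and `stft_output_shape` is a hypothetical helper.

```python
def stft_output_shape(n_samples, n_fft, hop_length):
    """Expected (frequency bins, frames) of a centred, one-sided STFT.

    Assumes librosa-style conventions; caits may differ.
    """
    n_freq_bins = 1 + n_fft // 2
    n_frames = 1 + n_samples // hop_length
    return n_freq_bins, n_frames


n_freq_bins, n_frames = stft_output_shape(n_samples=9471, n_fft=100, hop_length=10)
print(n_freq_bins, n_frames)  # 51 948
```

Under this assumption the 9471 samples yield 948 frames, which agrees with the last dimension of the mel spectrogram shape shown below. The middle dimension of a mel spectrogram is the number of mel bands rather than the number of STFT bins.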


In [13]:
featureExtractor2D = FeatureExtractor2D(melspectrogram, n_fft=100, hop_length=10)
tmp = featureExtractor2D.fit_transform(datasetArrayObj)

  mel_basis = mel_filter(sr=sr, n_fft=n_fft, **kwargs)


In [14]:
tmp.X.shape

(11, 128, 948)

In [15]:
featureExtractor2D = FeatureExtractor2D(stft, n_fft=100, hop_length=10)
tmp1 = featureExtractor2D.fit_transform(datasetArrayObj)

In [16]:
tmp1.X.iloc[:, 0, 0]

       CO(GT)  (-203.00489003333593+0j)
  PT08.S1(CO)   (31352.259235229427+0j)
     NMHC(GT)    (2103.365810117917+0j)
     C6H6(GT)    (183.7569916168304+0j)
PT08.S2(NMHC)    (21178.65692255352+0j)
      NOx(GT)    (2830.009535623774+0j)
 PT08.S3(NOx)   (33108.013756721906+0j)
      NO2(GT)   (2096.0012268926475+0j)
 PT08.S4(NO2)   (37830.104266490875+0j)
  PT08.S5(O3)   (22635.975719316466+0j)
            T    (263.0239320519568+0j)

CaitsArray with shape (11,)

## FunctionTransformer2D

This transformer is mainly used to invert the `FeatureExtractor2D` process. So, if `DatasetArray.X` is a 3D `CaitsArray`, it is transformed back into a 2D `CaitsArray`.

To test this, we apply `caits.fe.istft` to the `DatasetArray` object that was previously transformed with `caits.fe.stft`.
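When checking the round trip, it helps to know how many samples to expect back. The sketch below assumes librosa-style behaviour for a centred inverse STFT called without an explicit target length, where the output has `hop_length * (n_frames - 1)` samples. Whether `caits.fe.istft` behaves this way is an assumption, and `istft_output_length` is a hypothetical helper.

```python
def istft_output_length(n_frames, hop_length):
    """Expected length of a centred inverse STFT with no explicit target length.

    Assumes librosa-style conventions; caits may differ.
    """
    return hop_length * (n_frames - 1)


n_samples = 9471
hop_length = 10
n_frames = 1 + n_samples // hop_length  # frames produced by the forward STFT
reconstructed_length = istft_output_length(n_frames, hop_length)
print(n_samples, reconstructed_length)  # 9471 9470
```

Under these assumptions the reconstruction is one sample shorter than the original signal, so a comparison with `datasetArrayObj.X` should be made on the overlapping rows and with a floating-point tolerance rather than exact equality.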

In [17]:
functionTransformer = FunctionTransformer2D(istft, n_fft=100, hop_length=10)
tmp2 = functionTransformer.fit_transform(tmp1)

In [18]:
tmp2.X

                  CO(GT)         PT08.S1(CO)             NMHC(GT)            C6H6(GT)       PT08.S2(NMHC)             NOx(GT)  \
   0  2.5999999999999996  1360.0000000000002   150.00000000000003  11.900000000000002  1046.0000000000002               166.0  
   1                 2.0  1291.9999999999998   111.99999999999999                 9.4   954.9999999999998               103.0  
   2   2.199999999999998              1402.0                 88.0   9.000000000000004               939.0               131.0  
   3  2.2000000000000033  1376.0000000000005                 80.0                 9.2   948.0000000000001  172.00000000000006  
   4   1.600000000000001              1272.0    51.00000000000002   6.500000000000001   836.0000000000001               131.0  
 ...                 ...                 ...                  ...                 ...                 ...                 ...  
9465  -34.20752377898901  1048.9900609169608  -159.09009297851878  1.8656834455487872   894.59527626375

## SlidingWindow

This transformer applies a sliding window to each instance of the `DatasetArray` object.

The resulting windows are collected in a single `DatasetList` object.
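The number of windows can be checked by hand. The sketch below assumes that `overlap` is the number of samples shared by consecutive windows (so the step is `window_size - overlap`) and that a trailing incomplete window is dropped. Both are assumptions about `SlidingWindow`, and `sliding_window_bounds` is a hypothetical helper.

```python
def sliding_window_bounds(n_samples, window_size, overlap):
    """Return (start, stop) index pairs of all complete windows."""
    step = window_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window_size")
    return [
        (start, start + window_size)
        for start in range(0, n_samples - window_size + 1, step)
    ]


windows = sliding_window_bounds(n_samples=9471, window_size=10, overlap=5)
print(len(windows))   # 1893
print(windows[:2])    # [(0, 10), (5, 15)]
```

With 9471 rows, a window of 10 and an overlap of 5, this gives 1893 windows, which agrees with the instance count reported below.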

In [19]:
slidingWindow = SlidingWindow(window_size=10, overlap=5)
tmp = slidingWindow.fit_transform(datasetArrayObj)

In [20]:
tmp

DatasetList object with 1893 instances.

In [21]:
tmp.X[0]

   CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  PT08.S2(NMHC)  NOx(GT)  \
0     2.6       1360.0     150.0      11.9         1046.0    166.0  
1     2.0       1292.0     112.0       9.4          955.0    103.0  
2     2.2       1402.0      88.0       9.0          939.0    131.0  
3     2.2       1376.0      80.0       9.2          948.0    172.0  
4     1.6       1272.0      51.0       6.5          836.0    131.0  
5     1.2       1197.0      38.0       4.7          750.0     89.0  
6     1.2       1185.0      31.0       3.6          690.0     62.0  
7     1.0       1136.0      31.0       3.3          672.0     62.0  
8     0.9       1094.0      24.0       2.3          609.0     45.0  
9     0.6       1010.0      19.0       1.7          561.0   -200.0  

   PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  PT08.S5(O3)     T  
0        1056.0    113.0        1692.0       1268.0  13.6  
1        1174.0     92.0        1559.0        972.0  13.3  
2        1140.0    114.0        1555.0       1074.0  11.9  

In [22]:
tmp.y[0]

     RH      AH  
0  48.9  0.7578  
1  47.7  0.7255  
2  54.0  0.7502  
3  60.0  0.7867  
4  59.6  0.7888  
5  59.2  0.7848  
6  56.8  0.7603  
7  60.0  0.7702  
8  59.7  0.7648  
9  60.2  0.7517  

CaitsArray with shape (10, 2)