# DatasetList test

In this notebook we test the functionality of the `DatasetList` class.

## Libraries import

In [1]:
from caits.dataset._dataset3 import CoreArray, DatasetList
from caits.loading import csv_loader
from caits.filtering import filter_butterworth
from caits.properties import magnitude_signal


## Dataset loading

We load the `data/GestureSet_small` dataset for this notebook.

In [2]:
data = csv_loader("data/GestureSet_small")
X, y, id = data["X"], data["y"], data["id"]
caitsX = [CoreArray(values=x.values, axis_names={
    "axis_1": {
        col: i for i, col in enumerate(x.columns)
    }
}) for x in X]
type(caitsX[0]), type(y[0]), type(id[0])

Loading CSV files: 100%|██████████| 924/924 [00:00<00:00, 2018.18it/s]


(caits.dataset._dataset3.CaitsArray, str, str)

In [3]:
datasetListObj = DatasetList(caitsX, y, id)
datasetListObj

DatasetList object with 924 instances.

In [4]:
len(datasetListObj)

924

## Indexing

In this subsection we test the various indexing methods that can be used.

### Indexing using an integer

This returns a `DatasetList` object, consisting of a single instance `X[int], y[int], _id[int]`.

In [None]:
datasetListObj[3]

### Indexing using a slice

This returns a `DatasetList` object, consisting of instances `X[slice], y[slice], _id[slice]`.

In [None]:
datasetListObj[3:15]

### Indexing using a list of indices

This returns a `DatasetList` object, consisting of the instances at the indices in the list: `X[indices], y[indices], _id[indices]`.

In [None]:
datasetListObj[[3,8,16,107]]
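
The three row-indexing forms above can be mimicked with a small stand-in class. This is only a sketch of the behavior described in this notebook, not the `caits` implementation: `ToyDataset` and its fields are hypothetical, and plain Python lists stand in for `CoreArray` instances.

```python
class ToyDataset:
    """Hypothetical stand-in for a DatasetList-like container (not the caits class)."""

    def __init__(self, X, y, ids):
        self.X, self.y, self._id = list(X), list(y), list(ids)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, key):
        # Normalise every supported key to a list of row positions,
        # so the result is always a container, even for a single integer.
        if isinstance(key, int):
            rows = [key]
        elif isinstance(key, slice):
            rows = list(range(*key.indices(len(self))))
        elif isinstance(key, list):
            rows = key
        else:
            raise TypeError(f"unsupported index type: {type(key).__name__}")
        return ToyDataset(
            [self.X[r] for r in rows],
            [self.y[r] for r in rows],
            [self._id[r] for r in rows],
        )


toy = ToyDataset(
    X=[[i, i + 1] for i in range(20)],
    y=[f"label_{i % 3}" for i in range(20)],
    ids=[f"id_{i}" for i in range(20)],
)

single = toy[3]            # one instance, still wrapped in a container
window = toy[3:15]         # twelve consecutive instances
picked = toy[[3, 8, 16]]   # three hand-picked instances
```

Mapping every key to a list of positions first keeps `X`, `y` and `_id` aligned, whichever indexing form is used.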

### Indexing using a tuple of indices

This returns a `DatasetList` object, consisting of a single instance `X[int1][..., int2], y[int1], _id[int1]`.

In [None]:
datasetListObj[1, 4]

### Indexing using a tuple consisting of an integer and a slice

This returns a `DatasetList` object, consisting of a single instance `X[int][:, slice], y[int], _id[int]`.

In [None]:
tmp = datasetListObj[1, 2:5]
tmp, tmp.X[0].shape

### Indexing using a tuple consisting of an integer and a list of integers

This returns a `DatasetList` object, consisting of a single instance `X[int][:, list], y[int], _id[int]`.

In [None]:
tmp = datasetListObj[1, [3,4]]
tmp, tmp.X[0].shape

### Indexing using column names

In this part we index by column name. The available names and their positions are stored in `axis_names["axis_1"]` of each instance.

In [None]:
datasetListObj.X[0].axis_names["axis_1"]
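
To make the name-based lookup concrete, the sketch below rebuilds the name-to-position mapping from the loading cell and resolves names to column positions. The helper `resolve_columns` is hypothetical and only illustrates the idea; how `CoreArray` resolves names internally is not shown in this notebook.

```python
# Column names as they appear in the printed outputs of this notebook.
columns = [
    "acc_x_axis_g", "acc_y_axis_g", "acc_z_axis_g",
    "gyr_x_axis_deg/s", "gyr_y_axis_deg/s", "gyr_z_axis_deg/s",
]

# Same construction as in the loading cell: name -> column position.
axis_1 = {col: i for i, col in enumerate(columns)}


def resolve_columns(key, name_to_pos):
    """Hypothetical helper: turn a name or a list of names into positions."""
    if isinstance(key, str):
        return [name_to_pos[key]]
    return [name_to_pos[name] for name in key]


# First sample of the first instance, copied from the printed outputs.
row = [1.332, 0.356, 0.156, -74.207, -43.476, -101.098]

positions = resolve_columns(["acc_x_axis_g", "acc_z_axis_g"], axis_1)
selected = [row[p] for p in positions]
```

Here `positions` holds the integer columns behind the two names, so a name-based index reduces to the positional indexing shown earlier.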

#### Indexing using a tuple, consisting of an integer and a column name

This will return a single `DatasetList` object, consisting of a single instance `X[int][..., col], y[int], _id[int]`.

In [None]:
tmp = datasetListObj[1, "acc_x_axis_g"]
tmp, tmp.X[0].shape, tmp.X[0], tmp.y, tmp._id

#### Indexing using a tuple, consisting of an integer and a list of column names

This will return a single `DatasetList` object, consisting of a single instance `X[int][..., columns], y[int], _id[int]`.

In [None]:
tmp = datasetListObj[1, ["acc_x_axis_g", "acc_z_axis_g"]]
tmp, tmp.X[0].shape

#### Indexing using a tuple, consisting of an integer and a slice of column names

This will return a single `DatasetList` object, consisting of a single instance `X[int][..., slice], y[int], _id[int]`.

In [None]:
tmp = datasetListObj[1, "acc_x_axis_g":"gyr_x_axis_deg/s"]
tmp, tmp.X[0].shape, tmp.X[0]

### Indexing using a tuple whose first item is a slice

#### Indexing using a tuple consisting of a slice and an integer

This will return a `DatasetList` object, consisting of multiple instances `X[slice][..., int], y[slice], _id[slice]`.

In [None]:
datasetListObj[1:4, 1]

#### Indexing using a tuple consisting of two slices

This will return a `DatasetList` object, consisting of multiple instances `X[slice1][..., slice2], y[slice1], _id[slice1]`.

In [None]:
datasetListObj[1:4, 3:5]

#### Indexing using a tuple consisting of a slice and a list of integers

This will return a `DatasetList` object, consisting of multiple instances `X[slice][..., list], y[slice], _id[slice]`.

In [None]:
datasetListObj[1:4, [1,5]]

#### Indexing using a slice and a column name

This will return a `DatasetList` object, consisting of multiple instances `X[slice][..., col], y[slice], _id[slice]`.

In [None]:
datasetListObj[1:4, "acc_x_axis_g"]

#### Indexing using a slice and a list of column names

This will return a `DatasetList` object, consisting of multiple instances `X[slice][..., list], y[slice], _id[slice]`.

In [None]:
datasetListObj[1:4, ["acc_z_axis_g", "gyr_z_axis_deg/s"]]

#### Indexing using a slice of integers and a slice of column names

This will return a `DatasetList` object, consisting of multiple instances `X[slice1][..., slice2], y[slice1], _id[slice1]`.

In [None]:
tmp = datasetListObj[1:4, "acc_x_axis_g":"gyr_x_axis_deg/s"]
tmp, tmp.X[0].shape, tmp.X[0]

In [None]:
tmp1 = datasetListObj[:100, "acc_x_axis_g":"acc_z_axis_g"]
tmp2 = datasetListObj[:100, "gyr_x_axis_deg/s":"gyr_y_axis_deg/s"]
len(tmp1), len(tmp2), tmp1.X[0].shape, tmp2.X[0].shape

## Unify

In this subsection we test the `unify` method, which merges `DatasetList` objects row-wise (`axis=0`) or column-wise (`axis=1`).

In [None]:
axis_names = {**tmp1.X[0].axis_names["axis_1"], **tmp2.X[0].axis_names["axis_1"]}
axis_names

In [None]:
tmp = tmp1.unify([tmp2], axis=1)
tmp, tmp.X[0].shape, tmp.X[0]

In [None]:
tmp1 = datasetListObj[:100, ["acc_x_axis_g"]]
tmp2 = datasetListObj[:100, ["acc_y_axis_g"]]
tmp3 = datasetListObj[:100, ["acc_z_axis_g", "gyr_z_axis_deg/s"]]
tmp1.X[0], tmp2.X[0], tmp3.X[0]

In [None]:
tmp = tmp1.unify([tmp2, tmp3], axis_names={"axis_1": {"col1": 0, "col2": 1, "col3": 2, "col4": 3}}, axis=1)
tmp, tmp.X[0].shape, tmp.X[0].axis_names

In [None]:
tmp[:, ["col1", "col3"]].X
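
The difference between the two merge directions can be sketched with nested lists, where each instance is a list of samples and each sample is a list of column values. The functions `unify_rows` and `unify_columns` are hypothetical illustrations, not the `caits` implementation, and the column-wise variant assumes that matching instances have the same number of samples.

```python
def unify_rows(datasets):
    """Row-wise merge: append the instances of every dataset, in order."""
    merged = []
    for instances in datasets:
        merged.extend(instances)
    return merged


def unify_columns(datasets):
    """Column-wise merge: widen each instance with the columns of its counterparts."""
    n = len(datasets[0])
    if any(len(instances) != n for instances in datasets):
        raise ValueError("all datasets need the same number of instances")
    merged = []
    for parts in zip(*datasets):  # the i-th instance of every dataset
        # zip(*parts) pairs up the samples; each merged sample is the
        # concatenation of the matching samples from every part.
        merged.append([sum(samples, []) for samples in zip(*parts)])
    return merged


acc = [[[1, 2], [3, 4]]]   # one instance, 2 samples x 2 columns
gyr = [[[10], [20]]]       # one instance, 2 samples x 1 column

wide = unify_columns([acc, gyr])   # one instance, 2 samples x 3 columns
long = unify_rows([acc, acc])      # two instances, each 2 samples x 2 columns
```

A column-wise merge keeps the number of instances and grows the number of columns, while a row-wise merge does the opposite.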

## Replace

In [None]:
import numpy as np

new_data_vals = [
    np.ones(shape=datasetListObj.X[i].iloc[:, [1,3,4]].shape)
    for i in range(len(datasetListObj))
]

axis_names_list = list(datasetListObj.X[0].axis_names["axis_1"].keys())
axis_names = {"axis_1": {name: i for i, name in enumerate(axis_names_list) if i in {1,3,4}}}

new_data_caits = [CoreArray(arr, axis_names=axis_names) for arr in new_data_vals]
new_dataset_list_obj = DatasetList(new_data_caits, datasetListObj.y, datasetListObj._id)
new_dataset_list_obj

In [None]:
datasetListObj.replace(new_dataset_list_obj)
datasetListObj.X[1]

## Loops

In this subsection we test the looping capabilities of a `DatasetList` object.

### For loop

In [None]:
for i, row in enumerate(datasetListObj):
    print(i)

### For loop in batches

In [5]:
for i, batch in enumerate(datasetListObj.batch(10)):
    print(batch)

([     acc_x_axis_g  acc_y_axis_g  acc_z_axis_g  gyr_x_axis_deg/s  gyr_y_axis_deg/s  gyr_z_axis_deg/s  
  0         1.332         0.356         0.156           -74.207           -43.476          -101.098  
  1         1.751         0.146         0.178           -29.695           -16.098           -61.524  
  2          1.45        -0.049         0.252            -2.805              3.11            37.134  
  3         0.688         0.136         0.447            -3.902            -2.317            87.134  
  4         0.182         0.533         0.552             -10.0           -12.012            71.098  
...           ...           ...           ...               ...               ...               ...  
123         0.259         0.614         0.423            22.378           131.463            40.488  
124         0.259         0.614         0.459            13.293           123.537            23.293  
125         0.383         0.635         0.447             0.793           109.08
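
As a rough illustration of batched iteration, the sketch below computes the index bounds of consecutive batches for 924 instances and a batch size of 10. The generator `iter_batches` is hypothetical, and keeping a shorter final batch is an assumption; the actual behavior of `DatasetList.batch` may differ.

```python
def iter_batches(n_instances, batch_size):
    """Yield (start, stop) bounds of consecutive batches; the last may be shorter."""
    for start in range(0, n_instances, batch_size):
        yield start, min(start + batch_size, n_instances)


bounds = list(iter_batches(924, 10))
n_batches = len(bounds)
last_batch_size = bounds[-1][1] - bounds[-1][0]
```

Under this assumption, 924 instances yield 93 batches, the last of which holds the 4 leftover instances.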

## Train-test split

In this subsection we check the `train_test_split` method.

### Non-random split

This splits the `DatasetList` object into:
- train: first `Nx` instances
- test: last `N-Nx` instances

where `N` is the number of all instances and `Nx = int(N * (1 - test_size))`.

In [4]:
train_obj, test_obj = datasetListObj.train_test_split()

In [5]:
len(train_obj), len(test_obj)

(739, 185)
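
The sizes above follow directly from the formula `Nx = int(N * (1 - test_size))`. The short sketch below reproduces the arithmetic. It assumes that the default `test_size` is 0.2, which is consistent with the printed sizes but is not stated in this notebook, and `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_instances, test_size):
    """Sizes of the train and test parts for Nx = int(N * (1 - test_size))."""
    n_train = int(n_instances * (1 - test_size))
    return n_train, n_instances - n_train


# Assumption: the default test_size is 0.2, which matches the output above.
n_train, n_test = split_sizes(924, 0.2)

# The non-random split then amounts to plain slicing of the instance order.
indices = list(range(924))
train_idx, test_idx = indices[:n_train], indices[n_train:]
```

Because `int` truncates, any fractional instance (here 0.2 of one) ends up in the test part.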

In [6]:
train_obj.X

[     acc_x_axis_g  acc_y_axis_g  acc_z_axis_g  gyr_x_axis_deg/s  gyr_y_axis_deg/s  gyr_z_axis_deg/s  
   0         1.332         0.356         0.156           -74.207           -43.476          -101.098  
   1         1.751         0.146         0.178           -29.695           -16.098           -61.524  
   2          1.45        -0.049         0.252            -2.805              3.11            37.134  
   3         0.688         0.136         0.447            -3.902            -2.317            87.134  
   4         0.182         0.533         0.552             -10.0           -12.012            71.098  
 ...           ...           ...           ...               ...               ...               ...  
 123         0.259         0.614         0.423            22.378           131.463            40.488  
 124         0.259         0.614         0.459            13.293           123.537            23.293  
 125         0.383         0.635         0.447             0.793         

### Random split

This splits the `DatasetList` object into:
- train: `Nx` random instances
- test: the remaining `N-Nx` instances

where `N` is the number of all instances and `Nx = int(N * (1 - test_size))`.

In [7]:
train_obj, test_obj = datasetListObj.train_test_split(random_state=42)
len(train_obj), len(test_obj)

(739, 185)

In [9]:
len(train_obj.y), len(test_obj.y)

(739, 185)
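
One common way to obtain such a split is to permute the instance indices with a seeded generator and then cut at `Nx`. The sketch below shows this with the standard library only. `random_split_indices` is a hypothetical helper, and whether `train_test_split` works this way internally is an assumption.

```python
import random


def random_split_indices(n_instances, test_size=0.2, random_state=None):
    """Hypothetical helper: permute the indices, then cut at Nx."""
    indices = list(range(n_instances))
    random.Random(random_state).shuffle(indices)  # seeded, hence repeatable
    n_train = int(n_instances * (1 - test_size))
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = random_split_indices(924, test_size=0.2, random_state=42)
again_train, _ = random_split_indices(924, test_size=0.2, random_state=42)
```

Fixing `random_state` makes the permutation, and therefore the split, reproducible across runs.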

### Stratify

In [10]:
train_obj, test_obj = datasetListObj.train_test_split(random_state=42, test_size=0.2, stratified=True)

In [11]:
train_obj

DatasetList object with 734 instances.

In [12]:
test_obj

DatasetList object with 186 instances.
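
A stratified split aims to keep the class proportions of `y` similar in both parts. A minimal sketch of the idea is to split every class separately and then merge the parts. The helper `stratified_split_indices` and the labels below are made up for illustration, and the rounding used by `train_test_split` may differ, so the sizes need not match the output above.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size=0.2, random_state=None):
    """Hypothetical helper: split every class separately, then merge the parts."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = int(len(members) * (1 - test_size))
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return train_idx, test_idx


labels = ["circle"] * 50 + ["swipe"] * 30 + ["tap"] * 20  # made-up labels
train_idx, test_idx = stratified_split_indices(labels, test_size=0.2, random_state=42)
test_counts = {name: sum(labels[i] == name for i in test_idx)
               for name in ("circle", "swipe", "tap")}
```

Each class contributes the same fraction of its members to the test part, so rare classes are not left out by chance.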

## Adding two DatasetList objects

In this section we check the addition of two `DatasetList` objects. This is equivalent to:

`obj1.unify([obj2], axis=0)`

This way, `obj2` is appended to `obj1` row-wise.

In [None]:
newDatasetListObj = train_obj + test_obj
len(newDatasetListObj)

In [None]:
len(newDatasetListObj.y)
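
The equivalence between `+` and a row-wise merge can be sketched with a small stand-in class. `ToyDataset` and its `unify_rows` method are hypothetical and only mirror the behavior described in this section; they are not the `caits` implementation.

```python
class ToyDataset:
    """Hypothetical stand-in for a DatasetList-like container (not the caits class)."""

    def __init__(self, X, y, ids):
        self.X, self.y, self._id = list(X), list(y), list(ids)

    def __len__(self):
        return len(self.X)

    def unify_rows(self, others):
        """Row-wise merge: append the instances of each other dataset."""
        X, y, ids = list(self.X), list(self.y), list(self._id)
        for other in others:
            X += other.X
            y += other.y
            ids += other._id
        return ToyDataset(X, y, ids)

    def __add__(self, other):
        # `a + b` is shorthand for the row-wise merge of a with [b].
        return self.unify_rows([other])


train = ToyDataset([[0.1], [0.2], [0.3]], ["a", "b", "a"], ["s0", "s1", "s2"])
test = ToyDataset([[0.4]], ["b"], ["s3"])
both = train + test
```

In this sketch the operands are copied rather than modified, so `train` and `test` remain usable after the addition.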

## Apply method

In this subsection we test applying a function to a `DatasetList` object.

When `DatasetList.apply` is called, the given callable is applied to the instances of `DatasetList.X`, one at a time.

We test `DatasetList.apply` using `filter_butterworth` (from `caits.filtering`) and `magnitude_signal` (from `caits.properties`), as imported at the top of this notebook.

In [None]:
datasetListObj.apply(filter_butterworth, fs=200, filter_type='lowpass', cutoff_freq=50)

In [None]:
datasetListObj.apply(magnitude_signal, axis=0)
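
The per-instance behavior can be sketched in a few lines. `apply_to_instances` and the toy `magnitude` transform are hypothetical stand-ins for `DatasetList.apply` and the `caits` functions, and returning a new list rather than modifying the input is an assumption.

```python
import math


def apply_to_instances(instances, func, *args, **kwargs):
    """Hypothetical helper: call func on each instance, forwarding extra arguments."""
    return [func(instance, *args, **kwargs) for instance in instances]


def magnitude(instance, scale=1.0):
    """Toy transform: Euclidean norm of every sample (row) of one instance."""
    return [scale * math.sqrt(sum(v * v for v in sample)) for sample in instance]


instances = [
    [[3.0, 4.0], [6.0, 8.0]],   # instance with two samples
    [[5.0, 12.0]],              # instance with one sample
]
result = apply_to_instances(instances, magnitude, scale=1.0)
```

Keyword arguments such as `fs` or `cutoff_freq` in the cells above would be forwarded to the callable in the same way as `scale` is here.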

## Shuffling

In this subsection we test shuffling a `DatasetList` object.

In [None]:
shuffled_dataset = datasetListObj.shuffle()

In [None]:
datasetListObj.X, datasetListObj

In [None]:
datasetListObj.y

In [None]:
shuffled_dataset.X, shuffled_dataset

In [None]:
shuffled_dataset.y
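
When comparing the original and shuffled objects, the property to check is that `X`, `y` and `_id` are reordered by the same permutation. The sketch below illustrates this with plain lists. `shuffle_aligned` is a hypothetical helper, and leaving the inputs untouched is an assumption about how `shuffle` behaves.

```python
import random


def shuffle_aligned(X, y, ids, random_state=None):
    """Hypothetical helper: apply one permutation to X, y and ids alike."""
    order = list(range(len(X)))
    random.Random(random_state).shuffle(order)
    return ([X[i] for i in order],
            [y[i] for i in order],
            [ids[i] for i in order])


X = [[0], [1], [2], [3], [4]]
y = ["a", "b", "c", "d", "e"]
ids = ["s0", "s1", "s2", "s3", "s4"]
X_s, y_s, ids_s = shuffle_aligned(X, y, ids, random_state=0)

# Every shuffled instance still carries its own label and id.
aligned = all(ids_s[k] == f"s{X_s[k][0]}" and y_s[k] == y[X_s[k][0]]
              for k in range(len(X_s)))
```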

## Flatten

In this subsection, we test the `DatasetList.flatten` method. By default, it flattens each instance and then stacks the flattened instances into a single 2D array.

Note that this method works only when all instances have the same shape.

In [None]:
reshaped_datasetListObj_vals = [x.iloc[:20, ...] for x in datasetListObj.X]
reshaped_datasetListObj = DatasetList(reshaped_datasetListObj_vals, y=datasetListObj.y, id=datasetListObj._id)

reshaped_datasetListObj_flat = reshaped_datasetListObj.flatten()
reshaped_datasetListObj_flat

In [None]:
len(reshaped_datasetListObj_flat.y)

In [None]:
reshaped_datasetListObj_flat.X
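
The flatten-then-stack step, and the reason for the same-shape requirement, can be sketched with nested lists. `flatten_instances` is a hypothetical helper, and the row-major flattening order is an assumption about the default behavior.

```python
def flatten_instances(instances):
    """Hypothetical helper: one flat row per instance, stacked into a 2D list."""
    rows = [[value for sample in instance for value in sample]
            for instance in instances]
    # Stacking needs rows of equal length, which fails for differing shapes.
    if len({len(row) for row in rows}) > 1:
        raise ValueError("instances must share the same shape to be stacked")
    return rows


same_shape = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
flat = flatten_instances(same_shape)

ragged = [[[1, 2], [3, 4]], [[5, 6]]]
try:
    flatten_instances(ragged)
    ragged_ok = True
except ValueError:
    ragged_ok = False
```

This is why the cell above first truncates every instance to its first 20 samples: afterwards all instances have the same shape and can be stacked.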

## Conversions

In this subsection we test various conversion methods of the `DatasetList` object.

### to_dict

This converts a `DatasetList` object to a dictionary with keys "X", "y" and "_id", where each value is the corresponding attribute of the `DatasetList` object.

In [None]:
datasetDict = datasetListObj.to_dict()
datasetDict.keys()

### dict_to_dataset

This converts a dictionary to a `DatasetList` object.

In [None]:
tmpToDataset = datasetListObj.dict_to_dataset(datasetDict)
tmpToDataset
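
The two conversions are inverses of each other, which the following round-trip sketch illustrates. `ToyDataset` is a hypothetical stand-in, and implementing `dict_to_dataset` as a class method is an assumption; the notebook only shows that it can be called on an instance.

```python
class ToyDataset:
    """Hypothetical stand-in for a DatasetList-like container (not the caits class)."""

    def __init__(self, X, y, ids):
        self.X, self.y, self._id = list(X), list(y), list(ids)

    def to_dict(self):
        # Keys follow the description above: "X", "y" and "_id".
        return {"X": self.X, "y": self.y, "_id": self._id}

    @classmethod
    def dict_to_dataset(cls, data):
        # A class method can be called on the class or on an instance.
        return cls(data["X"], data["y"], data["_id"])


original = ToyDataset([[1, 2], [3, 4]], ["a", "b"], ["s0", "s1"])
as_dict = original.to_dict()
restored = original.dict_to_dataset(as_dict)
```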


### to_numpy

This converts a `DatasetList` object to a list of `numpy.ndarray` objects.

In [None]:
datasetNumpy = tmp.to_numpy()
type(datasetNumpy[0]), type(datasetNumpy[1]), type(datasetNumpy[2])