# DatasetList test

This notebook tests the functionality of the `DatasetList` class.

## Libraries import

In [1]:
from caits.dataset import CoreArray, DatasetList
from caits.loading import csv_loader
from caits.filtering import filter_butterworth
from caits.properties import magnitude_signal


## Dataset loading

We load the `data/GestureSet_small` dataset for this notebook.

In [2]:
data = csv_loader("data/GestureSet_small")
X, y, id = data["X"], data["y"], data["id"]
caitsX = [CoreArray(values=x.values, axis_names={"axis_1": x.columns}) for x in X]

type(caitsX[0]), type(y[0]), type(id[0])

Loading CSV files: 100%|██████████| 924/924 [00:00<00:00, 2250.58it/s]


(caits.dataset._coreArray.CoreArray, str, str)

In [3]:
datasetListObj = DatasetList(caitsX, y, id)
datasetListObj

DatasetList object with 924 instances.

In [4]:
datasetListObj.X[0]

     acc_x_axis_g  acc_y_axis_g  acc_z_axis_g  gyr_x_axis_deg/s  gyr_y_axis_deg/s  gyr_z_axis_deg/s  
  0         1.332         0.356         0.156           -74.207           -43.476          -101.098  
  1         1.751         0.146         0.178           -29.695           -16.098           -61.524  
  2          1.45        -0.049         0.252            -2.805              3.11            37.134  
  3         0.688         0.136         0.447            -3.902            -2.317            87.134  
  4         0.182         0.533         0.552             -10.0           -12.012            71.098  
...           ...           ...           ...               ...               ...               ...  
123         0.259         0.614         0.423            22.378           131.463            40.488  
124         0.259         0.614         0.459            13.293           123.537            23.293  
125         0.383         0.635         0.447             0.793           109.085 

In [5]:
len(datasetListObj)

924

## Indexing

In this section we test the indexing forms that a `DatasetList` object supports.

### Indexing using integer

This returns a `DatasetList` object, consisting of a single instance `X[int], y[int], _id[int]`.

In [6]:
datasetListObj[3]

DatasetList object with 1 instances.

### Indexing using a slice

This returns a `DatasetList` object, consisting of instances `X[slice], y[slice], _id[slice]`.

In [7]:
datasetListObj[3:15]

DatasetList object with 12 instances.

### Indexing using list of indices

This returns a `DatasetList` object, consisting of the instances at the indices in the list: `X[indices], y[indices], _id[indices]`.

In [8]:
datasetListObj[[3,8,16,107]]

DatasetList object with 4 instances.

### Indexing using a tuple of indices

This returns a `DatasetList` object, consisting of a single instance `X[int1][..., int2], y[int1], _id[int1]`.

In [9]:
datasetListObj[1, 4]

DatasetList object with 1 instances.

### Indexing using a tuple consisting of an integer and a slice

This returns a `DatasetList` object, consisting of a single instance `X[int][:, slice], y[int], _id[int]`.

In [10]:
tmp = datasetListObj[1, 2:5]
tmp, tmp.X[0].shape

(DatasetList object with 1 instances., (91, 3))

### Indexing using a tuple consisting of an integer and a list of integers

This returns a single `DatasetList` object, consisting of a single instance `X[int][:, list], y[int], _id[int]`.

In [11]:
tmp = datasetListObj[1, [3,4]]
tmp, tmp.X[0].shape

(DatasetList object with 1 instances., (91, 2))
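
The tuple forms above all follow one pattern: the first item selects instances and the second item selects columns inside each selected instance. The sketch below illustrates that pattern with plain NumPy arrays. `ToyDatasetList` is a hypothetical stand-in written for this example, not the `caits` implementation, and the random data only mimics the instance lengths seen in this notebook.

```python
import numpy as np


class ToyDatasetList:
    """Hypothetical stand-in for DatasetList, backed by plain NumPy arrays."""

    def __init__(self, X, y, ids):
        self.X, self.y, self.ids = list(X), list(y), list(ids)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, key):
        # Split a tuple key into an instance selector and a column selector.
        rows, cols = key if isinstance(key, tuple) else (key, slice(None))
        if isinstance(rows, int):
            picked = [rows]
        elif isinstance(rows, slice):
            picked = list(range(len(self)))[rows]
        else:
            picked = list(rows)
        # The column selector is applied to the last axis of each picked instance.
        return ToyDatasetList(
            [self.X[i][:, cols] for i in picked],
            [self.y[i] for i in picked],
            [self.ids[i] for i in picked],
        )


rng = np.random.default_rng(0)
toy = ToyDatasetList(
    X=[rng.normal(size=(n, 6)) for n in (128, 91, 141, 120)],
    y=["a", "b", "a", "b"],
    ids=["f0", "f1", "f2", "f3"],
)
single = toy[1, 2:5]
several = toy[1:4, [1, 5]]
print(len(single), single.X[0].shape)    # 1 (91, 3)
print(len(several), several.X[0].shape)  # 3 (91, 2)
```

Whatever the selector, the result is again a dataset object, which matches the behaviour shown by the cells above.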

### Indexing using column names

In this part, we will investigate indexing using column names.

In [12]:
datasetListObj.X[0].axis_names["axis_1"]

{'acc_x_axis_g': 0,
 'acc_y_axis_g': 1,
 'acc_z_axis_g': 2,
 'gyr_x_axis_deg/s': 3,
 'gyr_y_axis_deg/s': 4,
 'gyr_z_axis_deg/s': 5}

In [13]:
datasetListObj.X[0].keys()["axis_1"]

['acc_x_axis_g',
 'acc_y_axis_g',
 'acc_z_axis_g',
 'gyr_x_axis_deg/s',
 'gyr_y_axis_deg/s',
 'gyr_z_axis_deg/s']

#### Indexing using a tuple, consisting of an integer and a column name

This will return a single `DatasetList` object, consisting of a single instance `X[int][..., col], y[int], _id[int]`.

In [14]:
tmp = datasetListObj[1, "acc_x_axis_g"]
tmp, tmp.X[0].shape, tmp.X[0], tmp.y, tmp._id

(DatasetList object with 1 instances.,
 (91,),
   0  -0.395
   1  -0.577
   2  -0.732
   3  -0.643
   4  -0.372
 ...     ...
  86  -0.014
  87  -0.064
  88  -0.074
  89  -0.089
  90  -0.116
 
 CoreArray with shape (91,),
 ['02a'],
 ['02a_0_100_AccGyr_1_0_1_01_90_5e79d60d02105cee3431d1ab.csv'])

#### Indexing using a tuple, consisting of an integer and a list of column names

This will return a single `DatasetList` object, consisting of a single instance `X[int][..., columns], y[int], _id[int]`.

In [15]:
tmp = datasetListObj[1, ["acc_x_axis_g", "acc_z_axis_g"]]
tmp, tmp.X[0].shape

(DatasetList object with 1 instances., (91, 2))

#### Indexing using a tuple, consisting of an integer and a slice of column names

This will return a single `DatasetList` object, consisting of a single instance `X[int][..., slice], y[int], _id[int]`.

In [16]:
tmp = datasetListObj[1, "acc_x_axis_g":"gyr_x_axis_deg/s"]
tmp, tmp.X[0].shape, tmp.X[0]

(DatasetList object with 1 instances.,
 (91, 4),
     acc_x_axis_g  acc_y_axis_g  acc_z_axis_g  gyr_x_axis_deg/s  
  0        -0.395         0.751         0.432           -44.878  
  1        -0.577         0.752         0.292           -52.988  
  2        -0.732         0.719         0.173           -59.024  
  3        -0.643         0.715         0.124            -64.39  
  4        -0.372         0.708         0.146           -62.073  
 ...           ...           ...           ...               ...  
 86        -0.014         0.693          0.72             11.22  
 87        -0.064         0.727         0.763             13.78  
 88        -0.074         0.751         0.727            18.293  
 89        -0.089         0.751         0.685            23.841  
 90        -0.116         0.755           0.7             24.39  
 
 CoreArray with shape (91, 4))
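
The output above has four columns, so the slice of column names includes its end label, unlike an integer slice. A minimal sketch of how such a label range could be translated into positions is shown below. The helper `label_slice_to_positions` is hypothetical, and the mapping is copied from the `axis_names["axis_1"]` output shown earlier.

```python
def label_slice_to_positions(name_to_index, start, stop):
    """Turn an inclusive label range into an ordinary half-open integer slice."""
    return slice(name_to_index[start], name_to_index[stop] + 1)


axis_1 = {
    "acc_x_axis_g": 0,
    "acc_y_axis_g": 1,
    "acc_z_axis_g": 2,
    "gyr_x_axis_deg/s": 3,
    "gyr_y_axis_deg/s": 4,
    "gyr_z_axis_deg/s": 5,
}
cols = label_slice_to_positions(axis_1, "acc_x_axis_g", "gyr_x_axis_deg/s")
print(cols)  # slice(0, 4, None)
```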

### Indexing using tuple with first item a slice

#### Indexing using a tuple consisting of a slice and an integer

This will return a `DatasetList` object, consisting of multiple instances `X[slice][..., int], y[slice], _id[slice]`.

In [17]:
datasetListObj[1:4, 1]

DatasetList object with 3 instances.

#### Indexing using a tuple consisting of two slices

This will return a `DatasetList` object, consisting of multiple instances `X[slice1][..., slice2], y[slice1], _id[slice1]`.

In [18]:
datasetListObj[1:4, 3:5]

DatasetList object with 3 instances.

#### Indexing using a tuple consisting of a slice and a list of integers

This will return a `DatasetList` object, consisting of multiple instances `X[slice][..., list], y[slice], _id[slice]`.

In [19]:
datasetListObj[1:4, [1,5]]

DatasetList object with 3 instances.

#### Indexing using a slice and a column name

This will return a `DatasetList` object, consisting of multiple instances `X[slice][..., col], y[slice], _id[slice]`.

In [20]:
datasetListObj[1:4, "acc_x_axis_g"]

DatasetList object with 3 instances.

#### Indexing using a slice and a list of column names

This will return a `DatasetList` object, consisting of multiple instances `X[slice][..., list], y[slice], _id[slice]`.

In [21]:
datasetListObj[1:4, ["acc_z_axis_g", "gyr_z_axis_deg/s"]]

DatasetList object with 3 instances.

#### Indexing using a slice of integers and a slice of column names

This will return a `DatasetList` object, consisting of multiple instances `X[slice1][..., slice2], y[slice1], _id[slice1]`.

In [22]:
tmp = datasetListObj[1:4, "acc_x_axis_g":"gyr_x_axis_deg/s"]
tmp, tmp.X[0].shape, tmp.X[0]

(DatasetList object with 3 instances.,
 (91, 4),
     acc_x_axis_g  acc_y_axis_g  acc_z_axis_g  gyr_x_axis_deg/s  
  0        -0.395         0.751         0.432           -44.878  
  1        -0.577         0.752         0.292           -52.988  
  2        -0.732         0.719         0.173           -59.024  
  3        -0.643         0.715         0.124            -64.39  
  4        -0.372         0.708         0.146           -62.073  
 ...           ...           ...           ...               ...  
 86        -0.014         0.693          0.72             11.22  
 87        -0.064         0.727         0.763             13.78  
 88        -0.074         0.751         0.727            18.293  
 89        -0.089         0.751         0.685            23.841  
 90        -0.116         0.755           0.7             24.39  
 
 CoreArray with shape (91, 4))

In [23]:
tmp1 = datasetListObj[:100, "acc_x_axis_g":"acc_z_axis_g"]
tmp2 = datasetListObj[:100, "gyr_x_axis_deg/s":"gyr_y_axis_deg/s"]
len(tmp1), len(tmp2), tmp1.X[0].shape, tmp2.X[0].shape

(100, 100, (128, 3), (128, 2))

## Unify

In this section we test the `unify` method, which merges `DatasetList` objects row-wise or column-wise.

In [24]:
axis_names = {**tmp1.X[0].axis_names["axis_1"], **tmp2.X[0].axis_names["axis_1"]}
axis_names

{'acc_x_axis_g': 0,
 'acc_y_axis_g': 1,
 'acc_z_axis_g': 2,
 'gyr_x_axis_deg/s': 0,
 'gyr_y_axis_deg/s': 1}
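
The plain dictionary merge above keeps each object's own positions, so the gyroscope columns are mapped to 0 and 1 again and clash with the accelerometer columns. A column-wise merge therefore has to renumber the names. The sketch below shows one way to do that with NumPy. `concat_columns` is a hypothetical helper for illustration and makes no claim about how `unify` is implemented.

```python
import numpy as np


def concat_columns(arrays, names_per_array):
    """Concatenate 2D arrays column-wise and renumber the column names."""
    merged = np.concatenate(arrays, axis=1)
    ordered_names = [name for names in names_per_array for name in names]
    # Rebuild the mapping so that positions run 0..n-1 over the merged array.
    return merged, {name: i for i, name in enumerate(ordered_names)}


acc = np.zeros((128, 3))
gyr = np.ones((128, 2))
merged, names = concat_columns(
    [acc, gyr],
    [
        ["acc_x_axis_g", "acc_y_axis_g", "acc_z_axis_g"],
        ["gyr_x_axis_deg/s", "gyr_y_axis_deg/s"],
    ],
)
print(merged.shape)               # (128, 5)
print(names["gyr_x_axis_deg/s"])  # 3
```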

In [25]:
tmp = tmp1.unify([tmp2], axis=1)
tmp, tmp.X[0].shape, tmp.X[0]

(DatasetList object with 100 instances.,
 (128, 5),
      acc_x_axis_g  acc_y_axis_g  acc_z_axis_g  gyr_x_axis_deg/s  gyr_y_axis_deg/s  
   0         1.332         0.356         0.156           -74.207           -43.476  
   1         1.751         0.146         0.178           -29.695           -16.098  
   2          1.45        -0.049         0.252            -2.805              3.11  
   3         0.688         0.136         0.447            -3.902            -2.317  
   4         0.182         0.533         0.552             -10.0           -12.012  
 ...           ...           ...           ...               ...               ...  
 123         0.259         0.614         0.423            22.378           131.463  
 124         0.259         0.614         0.459            13.293           123.537  
 125         0.383         0.635         0.447             0.793           109.085  
 126         0.397         0.636         0.445            -5.854            79.207  
 127         

In [26]:
tmp1 = datasetListObj[:100, ["acc_x_axis_g"]]
tmp2 = datasetListObj[:100, ["acc_y_axis_g"]]
tmp3 = datasetListObj[:100, ["acc_z_axis_g", "gyr_z_axis_deg/s"]]
tmp1.X[0], tmp2.X[0], tmp3.X[0]

(     acc_x_axis_g  
   0         1.332  
   1         1.751  
   2          1.45  
   3         0.688  
   4         0.182  
 ...           ...  
 123         0.259  
 124         0.259  
 125         0.383  
 126         0.397  
 127         0.361  
 
 CoreArray with shape (128, 1),
      acc_y_axis_g  
   0         0.356  
   1         0.146  
   2        -0.049  
   3         0.136  
   4         0.533  
 ...           ...  
 123         0.614  
 124         0.614  
 125         0.635  
 126         0.636  
 127         0.632  
 
 CoreArray with shape (128, 1),
      acc_z_axis_g  gyr_z_axis_deg/s  
   0         0.156          -101.098  
   1         0.178           -61.524  
   2         0.252            37.134  
   3         0.447            87.134  
   4         0.552            71.098  
 ...           ...               ...  
 123         0.423            40.488  
 124         0.459            23.293  
 125         0.447            10.793  
 126         0.445             6.159  

In [27]:
# tmp = tmp1.unify([tmp2, tmp3], axis_names={"axis_1": {"col1": 0, "col2": 1, "col3": 2, "col4": 3}}, axis=1)
tmp = tmp1.unify([tmp2, tmp3], axis_names={"axis_1": ["col1", "col2", "col3", "col4"]}, axis=1)
tmp, tmp.X[0].shape, tmp.X[0].axis_names["axis_1"]

(DatasetList object with 100 instances.,
 (128, 4),
 {'col1': 0, 'col2': 1, 'col3': 2, 'col4': 3})

In [28]:
tmp[:, ["col1", "col3"]].X

[      col1   col3  
   0  1.332  0.156  
   1  1.751  0.178  
   2   1.45  0.252  
   3  0.688  0.447  
   4  0.182  0.552  
 ...    ...    ...  
 123  0.259  0.423  
 124  0.259  0.459  
 125  0.383  0.447  
 126  0.397  0.445  
 127  0.361  0.437  
 
 CoreArray with shape (128, 2),
       col1   col3  
  0  -0.395  0.432  
  1  -0.577  0.292  
  2  -0.732  0.173  
  3  -0.643  0.124  
  4  -0.372  0.146  
 ...     ...    ...  
 86  -0.014   0.72  
 87  -0.064  0.763  
 88  -0.074  0.727  
 89  -0.089  0.685  
 90  -0.116    0.7  
 
 CoreArray with shape (91, 2),
       col1   col3  
   0  0.036   0.41  
   1  0.042  0.434  
   2  0.073  0.481  
   3  0.122  0.543  
   4  0.168  0.582  
 ...    ...    ...  
 136  0.291  0.442  
 137  0.363  0.443  
 138  0.401  0.439  
 139   0.39  0.417  
 140  0.364  0.425  
 
 CoreArray with shape (141, 2),
       col1   col3  
   0   0.43  0.431  
   1  0.246  0.442  
   2  0.176  0.438  
   3  0.188  0.402  
   4  0.261  0.373  
 ...    ...    .

## Replace

In [29]:
import numpy as np

new_data_vals = [
    np.ones(shape=datasetListObj.X[i].iloc[:, [1,3,4]].shape)
    for i in range(len(datasetListObj))
]

# axis_names_list = list(datasetListObj.X[0].axis_names["axis_1"].keys())
axis_names_list = datasetListObj.X[0].keys()["axis_1"]
axis_names = {"axis_1": [name for i, name in enumerate(axis_names_list) if i in {1,3,4}]}

new_data_caits = [CoreArray(arr, axis_names=axis_names) for arr in new_data_vals]
new_dataset_list_obj = DatasetList(new_data_caits, datasetListObj.y, datasetListObj._id)
new_dataset_list_obj

DatasetList object with 924 instances.

In [30]:
new_dataset_list_obj.X[0]

     acc_y_axis_g  gyr_x_axis_deg/s  gyr_y_axis_deg/s  
  0           1.0               1.0               1.0  
  1           1.0               1.0               1.0  
  2           1.0               1.0               1.0  
  3           1.0               1.0               1.0  
  4           1.0               1.0               1.0  
...           ...               ...               ...  
123           1.0               1.0               1.0  
124           1.0               1.0               1.0  
125           1.0               1.0               1.0  
126           1.0               1.0               1.0  
127           1.0               1.0               1.0  

CoreArray with shape (128, 3)

In [31]:
datasetListObj.replace(new_dataset_list_obj)
datasetListObj.X[1]

    acc_x_axis_g  acc_y_axis_g  acc_z_axis_g  gyr_x_axis_deg/s  gyr_y_axis_deg/s  gyr_z_axis_deg/s  
 0        -0.395           1.0         0.432               1.0               1.0            22.683  
 1        -0.577           1.0         0.292               1.0               1.0            27.927  
 2        -0.732           1.0         0.173               1.0               1.0             31.28  
 3        -0.643           1.0         0.124               1.0               1.0            29.695  
 4        -0.372           1.0         0.146               1.0               1.0            26.159  
...           ...           ...           ...               ...               ...               ...  
86        -0.014           1.0          0.72               1.0               1.0            14.695  
87        -0.064           1.0         0.763               1.0               1.0            17.866  
88        -0.074           1.0         0.727               1.0               1.0          
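
In the output above, only the columns whose names appear in the replacement object (`acc_y_axis_g`, `gyr_x_axis_deg/s`, `gyr_y_axis_deg/s`) were overwritten. The sketch below reproduces that name-based matching on a single NumPy array. `replace_columns` and the shortened column names are hypothetical, and unlike the `replace` call above, which changes `datasetListObj` itself, this helper returns a copy.

```python
import numpy as np


def replace_columns(target, target_names, source, source_names):
    """Overwrite the columns of `target` whose names also appear in `source`."""
    out = target.copy()
    for name, src_pos in source_names.items():
        out[:, target_names[name]] = source[:, src_pos]
    return out


target_names = {"acc_x": 0, "acc_y": 1, "acc_z": 2, "gyr_x": 3, "gyr_y": 4, "gyr_z": 5}
source_names = {"acc_y": 0, "gyr_x": 1, "gyr_y": 2}
target = np.arange(24, dtype=float).reshape(4, 6)
source = np.ones((4, 3))

result = replace_columns(target, target_names, source, source_names)
print(result[0])  # [0. 1. 2. 1. 1. 5.]
```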

## Loops

In this section we test the looping capabilities of a `DatasetList` object.

### For loop

In [32]:
for i, row in enumerate(datasetListObj):
    print(i)

0
1
2
...

### For loop in batches

In [33]:
for i, batch in enumerate(datasetListObj.batch(10)):
    print(batch)

([     acc_x_axis_g  acc_y_axis_g  acc_z_axis_g  gyr_x_axis_deg/s  gyr_y_axis_deg/s  gyr_z_axis_deg/s  
  0         1.332           1.0         0.156               1.0               1.0          -101.098  
  1         1.751           1.0         0.178               1.0               1.0           -61.524  
  2          1.45           1.0         0.252               1.0               1.0            37.134  
  3         0.688           1.0         0.447               1.0               1.0            87.134  
  4         0.182           1.0         0.552               1.0               1.0            71.098  
...           ...           ...           ...               ...               ...               ...  
123         0.259           1.0         0.423               1.0               1.0            40.488  
124         0.259           1.0         0.459               1.0               1.0            23.293  
125         0.383           1.0         0.447               1.0               1.
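
The printed batch above starts with `([`, which suggests that each batch is a tuple whose first element is a list of instances. A minimal sketch of such a batching iterator over plain lists follows. `iter_batches` is a hypothetical helper, and the exact tuple layout that `DatasetList.batch` yields is an assumption.

```python
def iter_batches(X, y, ids, batch_size):
    """Yield consecutive (X, y, ids) chunks of at most `batch_size` instances."""
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        yield X[start:stop], y[start:stop], ids[start:stop]


X = list(range(25))
y = [i % 2 for i in X]
ids = [f"file_{i}" for i in X]

sizes = [len(batch_X) for batch_X, _, _ in iter_batches(X, y, ids, 10)]
print(sizes)  # [10, 10, 5]
```

The last batch is simply shorter when the number of instances is not a multiple of the batch size.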

## Train-test split

In this subsection we check the `train_test_split` method.

### Non-random split

This splits the `DatasetList` object into:
- train: first `Nx` instances
- test: last `N-Nx` instances

where `N` is the number of all instances and `Nx = int(N * (1 - test_size))`.
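
The rule above can be sketched on a plain list as follows. `ordered_split` is a hypothetical helper that applies the stated formula literally, with `test_size=0.2` chosen only for the example. The notebook does not state the default `test_size`, and the library may round differently from this sketch, so the sizes reported by the notebook cells need not match it exactly.

```python
def ordered_split(items, test_size=0.2):
    """Keep the first Nx items for training and the remaining ones for testing."""
    n_train = int(len(items) * (1 - test_size))
    return items[:n_train], items[n_train:]


train, test = ordered_split(list(range(10)), test_size=0.2)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```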

In [34]:
train_obj, test_obj = datasetListObj.train_test_split()

In [35]:
len(train_obj), len(test_obj)

(740, 184)

In [36]:
train_obj.X

[     acc_x_axis_g  acc_y_axis_g  acc_z_axis_g  gyr_x_axis_deg/s  gyr_y_axis_deg/s  gyr_z_axis_deg/s  
   0         1.332           1.0         0.156               1.0               1.0          -101.098  
   1         1.751           1.0         0.178               1.0               1.0           -61.524  
   2          1.45           1.0         0.252               1.0               1.0            37.134  
   3         0.688           1.0         0.447               1.0               1.0            87.134  
   4         0.182           1.0         0.552               1.0               1.0            71.098  
 ...           ...           ...           ...               ...               ...               ...  
 123         0.259           1.0         0.423               1.0               1.0            40.488  
 124         0.259           1.0         0.459               1.0               1.0            23.293  
 125         0.383           1.0         0.447               1.0         

### Random split

This splits the `DatasetList` object into:
- train: `Nx` random instances
- test: the remaining `N-Nx` instances

where `N` is the number of all instances and `Nx = int(N * (1 - test_size))`.
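
A random split differs from the ordered one only in that the indices are permuted first. The sketch below uses the standard library's `random.Random` for a seeded permutation. `random_split` is a hypothetical helper, and the random generator that `caits` uses for `random_state` is not known from this notebook.

```python
import random


def random_split(items, test_size=0.2, random_state=None):
    """Permute the indices with a seeded generator, then cut at Nx."""
    indices = list(range(len(items)))
    random.Random(random_state).shuffle(indices)
    n_train = int(len(items) * (1 - test_size))
    train = [items[i] for i in indices[:n_train]]
    test = [items[i] for i in indices[n_train:]]
    return train, test


train, test = random_split(list(range(10)), test_size=0.2, random_state=42)
print(len(train), len(test))  # 8 2
```

Passing the same `random_state` again reproduces the same partition, which is the usual reason for exposing the seed.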

In [37]:
train_obj, test_obj = datasetListObj.train_test_split(random_state=42)
len(train_obj), len(test_obj)

(740, 184)

In [38]:
len(train_obj.y), len(test_obj.y)

(740, 184)

### Stratify

In [39]:
train_obj, test_obj = datasetListObj.train_test_split(random_state=42, test_size=0.2, stratified=True)


In [40]:
train_obj

DatasetList object with 741 instances.

In [41]:
test_obj

DatasetList object with 183 instances.
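
A stratified split is commonly done by splitting each class separately, so that both parts keep roughly the original label proportions. The sketch below follows that idea on index lists. `stratified_split_indices` is a hypothetical helper, and the per-class rounding it uses is an assumption rather than the `caits` behaviour. Because the rounding happens once per class, the totals of a stratified split can differ by a few instances from those of an unstratified split with the same `test_size`.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size=0.2, random_state=None):
    """Split the indices of each class separately, then pool the results."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = int(len(members) * (1 - test_size))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


labels = ["a"] * 10 + ["b"] * 5
train_idx, test_idx = stratified_split_indices(labels, test_size=0.2, random_state=42)
print(len(train_idx), len(test_idx))  # 12 3
```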

## Adding two DatasetList objects

In this section we check the addition of two `DatasetList` objects. This is equivalent to:

`obj1.unify([obj2], axis=0)`

This appends `obj2` to `obj1` row-wise.
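
The equivalence can be sketched with a toy class whose `__add__` delegates to a row-wise merge. `ToyDataset` and `unify_rows` are hypothetical names for this example and are not part of `caits`.

```python
class ToyDataset:
    """Hypothetical container holding parallel lists X, y and ids."""

    def __init__(self, X, y, ids):
        self.X, self.y, self.ids = list(X), list(y), list(ids)

    def __len__(self):
        return len(self.X)

    def unify_rows(self, others):
        # Append the instances of every other dataset after our own.
        X, y, ids = list(self.X), list(self.y), list(self.ids)
        for other in others:
            X += other.X
            y += other.y
            ids += other.ids
        return ToyDataset(X, y, ids)

    def __add__(self, other):
        # `a + b` is shorthand for a row-wise merge of a with [b].
        return self.unify_rows([other])


a = ToyDataset(["x0", "x1"], ["up", "down"], ["f0", "f1"])
b = ToyDataset(["x2"], ["up"], ["f2"])
combined = a + b
print(len(combined), combined.y)  # 3 ['up', 'down', 'up']
```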

In [42]:
newDatasetListObj = train_obj + test_obj
len(newDatasetListObj)

924

In [None]:
len(newDatasetListObj.y)

## Apply method

In this subsection we test applying a method on a `DatasetList` object.

When `DatasetList.apply` is called, the callable method is applied to the instances of `DatasetList.X`, one at a time.

We test `DatasetList.apply` using `caits.filtering.filter_butterworth` and `caits.properties.magnitude_signal`, both imported at the top of the notebook.
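
The per-instance behaviour described above amounts to mapping a callable over the list of instances and forwarding any keyword arguments. A minimal sketch with NumPy arrays follows. `apply_to_instances` and `magnitude` are hypothetical stand-ins, and `magnitude` is a plain Euclidean norm, not the `caits` function.

```python
import numpy as np


def apply_to_instances(X, func, **kwargs):
    """Call `func` on every instance, forwarding the keyword arguments."""
    return [func(x, **kwargs) for x in X]


def magnitude(x, axis):
    """Euclidean norm along `axis`; a stand-in, not the caits implementation."""
    return np.sqrt(np.sum(np.square(x), axis=axis))


X = [np.full((4, 3), 2.0), np.full((6, 3), 1.0)]
out = apply_to_instances(X, magnitude, axis=1)
print([o.shape for o in out])  # [(4,), (6,)]
```

Because the callable sees one instance at a time, instances of different lengths pose no problem.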

In [None]:
datasetListObj.apply(filter_butterworth, fs=200, filter_type='lowpass', cutoff_freq=50)

In [None]:
datasetListObj.apply(magnitude_signal, axis=0)

## Shuffling

In this subsection we test shuffling a `DatasetList` object.

In [None]:
shuffled_dataset = datasetListObj.shuffle()

In [None]:
datasetListObj.X, datasetListObj

In [None]:
datasetListObj.y

In [None]:
shuffled_dataset.X, shuffled_dataset

In [None]:
shuffled_dataset.y
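
When shuffling a dataset, `X`, `y` and the ids have to be reordered with the same permutation so that each instance keeps its label. The sketch below shows this on plain lists. `shuffle_together` is a hypothetical helper, and like the `shuffle` call above it returns new lists instead of modifying its inputs.

```python
import random


def shuffle_together(X, y, ids, random_state=None):
    """Reorder three parallel lists with one shared permutation."""
    order = list(range(len(X)))
    random.Random(random_state).shuffle(order)
    return [X[i] for i in order], [y[i] for i in order], [ids[i] for i in order]


X = ["x0", "x1", "x2", "x3"]
y = [0, 1, 2, 3]
ids = ["f0", "f1", "f2", "f3"]

sX, sy, sids = shuffle_together(X, y, ids, random_state=0)
# Every shuffled instance still carries its own label and id.
print(all(x == f"x{label}" and i == f"f{label}" for x, label, i in zip(sX, sy, sids)))  # True
```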

## Flatten

In this section, we test the `DatasetList.flatten` method. By default, it flattens each instance and then stacks the flattened instances into a single 2D array.

Note that this method works only when all instances have the same shape.
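
The same-shape requirement follows from the stacking step: rows of a 2D array must all have the same length. A minimal NumPy sketch of the idea is given below. `flatten_instances` is a hypothetical helper, not the `caits` method.

```python
import numpy as np


def flatten_instances(X):
    """Flatten every instance to 1D and stack the results row by row."""
    shapes = {x.shape for x in X}
    if len(shapes) != 1:
        raise ValueError(f"instances must share one shape, got {sorted(shapes)}")
    return np.stack([x.ravel() for x in X])


X = [np.zeros((20, 6)) for _ in range(5)]
print(flatten_instances(X).shape)  # (5, 120)
```

Truncating every instance to its first 20 rows, as the next cell does, is one way to satisfy the requirement.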

In [None]:
reshaped_datasetListObj_vals = [x.iloc[:20, ...] for x in datasetListObj.X]
reshaped_datasetListObj = DatasetList(reshaped_datasetListObj_vals, y=datasetListObj.y, id=datasetListObj._id)

reshaped_datasetListObj_flat = reshaped_datasetListObj.flatten()
reshaped_datasetListObj_flat

In [None]:
len(reshaped_datasetListObj_flat.y)

In [None]:
reshaped_datasetListObj_flat.X

## Conversions

In this subsection we test various conversion methods of the `DatasetList` object.

### to_dict

This converts a `DatasetList` object to a dictionary with keys "X", "y" and "_id", where each value is the corresponding attribute of the `DatasetList` object.

In [None]:
datasetDict = datasetListObj.to_dict()
datasetDict.keys()

### dict_to_dataset

This converts a dictionary to a `DatasetList` object.

In [None]:
tmpToDataset = datasetListObj.dict_to_dataset(datasetDict)
tmpToDataset


### to_numpy()

This converts a `DatasetList` object to NumPy arrays, returned as `X`, `y` and `id`.

In [None]:
from caits.windowing import sliding_window_arr
sw_tmp = tmp.stack(tmp.apply(sliding_window_arr, window_size=10, overlap=1))
sw_tmp.X

In [None]:
datasetNumpy_X, datasetNumpy_y, datasetNumpy_id = sw_tmp.to_numpy()
datasetNumpy_X.shape
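
Windowing is what makes a single array possible here: the original instances have different lengths, but every window has the same shape and can be stacked. The sketch below illustrates this with a hypothetical `sliding_windows` helper that takes an explicit `step`. How the `overlap` argument of `sliding_window_arr` translates into a step is not stated in this notebook, so no such mapping is assumed.

```python
import numpy as np


def sliding_windows(x, window_size, step):
    """Cut a (time, channels) array into equally sized windows."""
    starts = range(0, x.shape[0] - window_size + 1, step)
    return np.stack([x[s:s + window_size] for s in starts])


x = np.arange(30 * 4, dtype=float).reshape(30, 4)
windows = sliding_windows(x, window_size=10, step=5)
print(windows.shape)  # (5, 10, 4)
```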
