# Dataset
We will explore this dataset: https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State#

> All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video frames. '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological order with the first measured value at the top of the data.

In [6]:
import tensorflow as tf
data_dir = "../../data/raw"
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00264/EEG%20Eye%20State.arff"
datapath = tf.keras.utils.get_file(
        "eeg", origin=url, untar=False, cache_dir=data_dir
    )

2022-06-05 10:24:40.876486: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-06-05 10:24:40.876525: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


You can load the arff file with scipy

In [7]:
from scipy.io import arff
data = arff.loadarff(datapath)

In [8]:
datapath

'../../data/raw/datasets/eeg'

The data is a tuple of observations (`data[0]`) and a description (`data[1]`)

In [9]:
len(data), type(data)

(2, tuple)

Description

In [10]:
data[1]

Dataset: EEG_DATA
	AF3's type is numeric
	F7's type is numeric
	F3's type is numeric
	FC5's type is numeric
	T7's type is numeric
	P7's type is numeric
	O1's type is numeric
	O2's type is numeric
	P8's type is numeric
	T8's type is numeric
	FC6's type is numeric
	F4's type is numeric
	F8's type is numeric
	AF4's type is numeric
	eyeDetection's type is nominal, range is ('0', '1')

There are about 15k observations

In [91]:
import torch

line = data[0][0]
observation = []
for index, i in enumerate(line):
    # index 14 holds the label; keep only the 14 EEG channels
    if index != 14:
        observation.append(i)
observation = torch.Tensor(observation)
observation

tensor([4329.2300, 4009.2300, 4289.2300, 4148.2100, 4350.2598, 4586.1499,
        4096.9199, 4641.0298, 4222.0498, 4238.4600, 4211.2798, 4280.5098,
        4635.8999, 4393.8501])

Each observation is a record of 14 floats followed by a bytes label (`b'0'` or `b'1'`)

In [102]:
first_label = int(data[0][0][14])
label = first_label
chunck = []
chuncks = []
for line in data[0]:
    # the first 14 fields are the EEG channels; index 14 is the label
    observation = torch.Tensor([i for index, i in enumerate(line) if index != 14])
    if int(line[14]) != label:
        # the label changed: store the finished chunk and start a new one
        chuncks.append((label, torch.stack(chunck)))
        label = int(line[14])
        chunck = []
    chunck.append(observation)
# store the last chunk
chuncks.append((label, torch.stack(chunck)))


# for chunck in chuncks:
#     print(chunck[0], len(chunck[1]))

chuncks[1][1].shape

torch.Size([683, 14])

In [28]:
labels = []
for x in data[0]:
    labels.append(int(x[14]))

data[0][0][14]

b'0'

In [9]:
import numpy as np
np.array(labels).mean()

0.4487983978638184

About 45% of the observations have the eyes closed.

# Exercise 1

- create a `get_eeg` function that downloads the data to a given path
- build a Dataset that yields an ($X, y$) tuple of tensors. $X$ should be sequential in time. Remember: a dataset should implement `__getitem__` and `__len__`.
- note that you could model this as a classification task, but also as a sequence-to-sequence task! For this exercise, make it a classification task with consecutive 0s or 1s only.
- Note that, for a training task, a seq2seq model will probably be more realistic. However, the classification is a nice exercise because it is harder to set up.
- figure out the length distribution of your dataset: how many timestamps do you have for every consecutive sequence of 0s and 1s? What are the mean, median, min and max?
- create a dataloader that yields timeseries with shape (batch, sequence_length). You can implement: windowed, padded and batched.
    1. yielding a windowed item should be the easy level
    2. yielding windowed and padded is medium level
    3. yielding windowed, padded and batched is expert level, because the windowing will cause the timeseries to have different sizes. You will need to buffer before you can yield a batch.
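
The length distribution asked for above can be explored on a plain list of labels before touching any tensors. The following is a minimal sketch, assuming a toy `labels` list that stands in for the `eyeDetection` column; the variable names are illustrative and not part of the notebook.

```python
from itertools import groupby
from statistics import mean, median

# Toy label sequence standing in for the eyeDetection column (illustrative only).
labels = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]

# groupby yields one group per run of identical consecutive values.
runs = [(label, sum(1 for _ in group)) for label, group in groupby(labels)]
lengths = [n for _, n in runs]

summary = {
    "mean": mean(lengths),
    "median": median(lengths),
    "min": min(lengths),
    "max": max(lengths),
}
print(runs)
print(summary)
```

On the real data, the `labels` list built from `data[0]` could be passed in the same way.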

1. Upload this to GitHub.
2. Put your dev notebooks in a separate folder
3. Put all your functions in the src folder
4. Use a formatter & linter
5. Add a single notebook that sources the src folder. Indicate which level you got (1, 2 or 3)
6. Show in that notebook that your dataloader works:
    - it should not give errors because it runs out of data! Either let it stop by itself, or run forever.
    - batchsize should be consistent (in case 1 and 2, batchsize is 1)
    - sequence length is allowed to vary
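
One possible way to meet these requirements is a generator that buffers windows until a full batch is available and then runs forever. This is only a sketch under assumptions: `window_index`, `stream_batches` and the toy `chunks` are hypothetical names, and the chunks are assumed to be `(label, tensor)` tuples like the ones built earlier in the notebook.

```python
from typing import Iterator, List, Tuple

import torch
from torch.nn.utils.rnn import pad_sequence


def window_index(n_obs: int, n_time: int) -> torch.Tensor:
    """Index of shape (n_window, n_time) that selects overlapping windows."""
    n_window = n_obs - n_time + 1
    return torch.arange(n_time).reshape(1, -1) + torch.arange(n_window).reshape(-1, 1)


def stream_batches(
    chunks: List[Tuple[int, torch.Tensor]], n_time: int, batchsize: int
) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
    """Yield (X, y) forever; X has shape (batchsize, seq_len, channels)."""
    buffer: List[Tuple[torch.Tensor, int]] = []
    while True:
        # visit the chunks in a fresh random order on every pass
        for i in torch.randperm(len(chunks)).tolist():
            label, x = chunks[i]
            if len(x) < n_time:
                # chunk shorter than one window: keep it as a single short item
                buffer.append((x, label))
            else:
                buffer.extend((w, label) for w in x[window_index(len(x), n_time)])
            # emit full batches as soon as the buffer holds enough windows
            while len(buffer) >= batchsize:
                batch, buffer = buffer[:batchsize], buffer[batchsize:]
                xs, ys = zip(*batch)
                # pad_sequence zero-pads short items to the longest in the batch
                yield pad_sequence(list(xs), batch_first=True), torch.tensor(ys)


# Toy chunks with 14 channels and different lengths (illustrative only).
chunks = [(0, torch.zeros(5, 14)), (1, torch.ones(2, 14)), (0, torch.zeros(4, 14))]
stream = stream_batches(chunks, n_time=3, batchsize=4)
X, y = next(stream)
print(X.shape, y.shape)
```

The buffer is what keeps the batch size constant: windows left over from one chunk are carried over and combined with windows from the next.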

The first exercise is ex1, this one is ex2. You will get $max(ex1, average(ex1, ex2))$ as a final mark.
Level 3 can get you an 11, because it exceeds expectations.

# Exercise 2
- build a Dataset that yields sequences of X, y. This time, y is a sequence and can contain both 0s and 1s
- create a Dataloader with this
- Test appropriate architectures (RNN, Attention)
- for the loss, note that you will need a BCELoss instead of a CrossEntropyLoss
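
A dataset for the sequence-to-sequence variant could slice the features and the labels with the same window, so that every timestep keeps its own label. The sketch below assumes a hypothetical `WindowedSequenceDataset` class and random toy data; the labels are stored as floats because `BCELoss` expects float targets.

```python
from typing import Tuple

import torch


class WindowedSequenceDataset:
    """Yields (X, y) windows where y holds one label per timestep."""

    def __init__(self, features: torch.Tensor, labels: torch.Tensor, n_time: int) -> None:
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels
        self.n_time = n_time

    def __len__(self) -> int:
        # number of overlapping windows that fit in the series
        return len(self.features) - self.n_time + 1

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        stop = idx + self.n_time
        return self.features[idx:stop], self.labels[idx:stop]


# Toy data standing in for the EEG recording: 10 timesteps, 14 channels.
features = torch.randn(10, 14)
labels = torch.tensor([0, 0, 0, 1, 1, 1, 1, 0, 0, 0], dtype=torch.float32)
dataset = WindowedSequenceDataset(features, labels, n_time=4)

X, y = dataset[2]
print(len(dataset), X.shape, y)
```

Because every window has the same length, items from such a dataset can be batched without padding.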

In [67]:
import random


list_index

[1,
 3,
 6,
 4,
 7,
 22,
 20,
 21,
 10,
 16,
 2,
 5,
 9,
 15,
 8,
 0,
 18,
 11,
 17,
 19,
 14,
 12,
 13]

In [179]:
from __future__ import annotations
import random
import shutil
from datetime import datetime
from pathlib import Path
from typing import Callable, Dict, Iterator, List, Optional, Sequence, Tuple, Union
import numpy as np
import tensorflow as tf
import torch
from loguru import logger
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
from scipy.io import arff

Tensor = torch.Tensor

def get_eeg(data_dir: Union[Path, str] = "../../data/raw") -> Path:
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00264/EEG%20Eye%20State.arff"
    datapath = tf.keras.utils.get_file(
        "eeg", origin=url, untar=False, cache_dir=data_dir
    )
    datapath = Path(datapath)
    logger.info(f"Data is downloaded to {datapath}.")
    return datapath

class BaseDataset:
    def __init__(self, datapath: Path):
        self.path = datapath
        self.data = self.process_data()
        
    def process_data(self) -> List[Tuple[int, Tensor]]:
        data = arff.loadarff(self.path)
        first_label = int(data[0][0][14])
        label = first_label
        chunck = []
        chuncks = []
        for line in data[0]:
            # the first 14 fields are the EEG channels; index 14 is the label
            observation = torch.Tensor(
                [i for index, i in enumerate(line) if index != 14]
            )
            if int(line[14]) != label:
                # the label changed: store the finished chunk and start a new one
                chuncks.append((label, torch.stack(chunck)))
                label = int(line[14])
                chunck = []
            chunck.append(observation)
        # store the last chunk
        chuncks.append((label, torch.stack(chunck)))
        return chuncks

    def __getitem__(self, idx: Optional[int] = None) -> Tuple[int, Tensor]:
        # without an index, return a random chunk
        if idx is None:
            idx = random.randrange(len(self))
        return self.data[idx]

    def __len__(self):
        length = len(self.data)
        return length

dataloader = BaseDataset(datapath = get_eeg())
dataloader.__getitem__()[1].shape

2022-06-05 13:48:38.776 | INFO     | __main__:get_eeg:26 - Data is downloaded to ../../data/raw/datasets/eeg.


torch.Size([51, 14])

In [176]:
def window(x: Tensor, n_time: int) -> Tensor:
    """
    Generates an index that can be used to window a timeseries.
    E.g. the single series [0, 1, 2, 3, 4, 5] can be windowed into 4 timeseries with
    length 3 like this:

    [0, 1, 2]
    [1, 2, 3]
    [2, 3, 4]
    [3, 4, 5]

    We now can feed 4 different timeseries into the model, instead of 1, all
    with the same length.
    """
    n_window = len(x) - n_time + 1
    time = torch.arange(0, n_time).reshape(1, -1)
    window = torch.arange(0, n_window).reshape(-1, 1)
    idx = time + window
    return idx


class BaseDataIterator:
    def __init__(self, dataset: BaseDataset, batchsize: int) -> None:
        self.dataset = dataset
        self.batchsize = batchsize
        self.item = self.dataset.__getitem__()
        print(self.item)

    def __len__(self) -> int:
        return int(len(self.dataset) / self.batchsize)

    def __iter__(self) -> BaseDataIterator:
        self.index = 0
        self.index_list = torch.randperm(len(self.dataset))
        return self

    def batchloop(self) -> Tuple[List, List]:
        X = []  # noqa N806
        Y = []  # noqa N806
        for _ in range(self.batchsize):
            # the dataset yields (label, observations) tuples
            y, x = self.dataset[int(self.index_list[self.index])]
            X.append(x)
            Y.append(y)
            self.index += 1
        return X, Y

    def __next__(self) -> Tuple[List[Tensor], Tensor]:
        if self.index <= (len(self.dataset) - self.batchsize):
            X, Y = self.batchloop()  # noqa N806
            # the chunks differ in length, so X stays a list of tensors
            return X, torch.tensor(Y)
        else:
            raise StopIteration

class PaddedDatagenerator(BaseDataIterator):
    """Iterator with additional padding of X

    """

    def __init__(self, dataset: BaseDataset, batchsize: int) -> None:
        super().__init__(dataset, batchsize)

    def __next__(self):
        if self.index <= (len(self.dataset) - self.batchsize):
            X, Y = self.batchloop()  # noqa N806
            X_ = pad_sequence(X, batch_first=True, padding_value=0)  # noqa N806
            return X_, torch.tensor(Y)
        else:
            raise StopIteration

dataset = BaseDataset(datapath = get_eeg())
loader = BaseDataIterator(dataset = dataset, batchsize=2)

2022-06-05 13:33:33.336 | INFO     | __main__:get_eeg:26 - Data is downloaded to ../../data/raw/datasets/eeg.


(0, tensor([[4256.9199, 4010.7700, 4261.0298,  ..., 4271.2798, 4540.0000,
         4293.3301],
        [4259.4902, 4011.7900, 4261.0298,  ..., 4275.3799, 4543.5898,
         4298.4600],
        [4257.9502, 4020.5100, 4255.8999,  ..., 4275.3799, 4548.2100,
         4307.1802],
        ...,
        [4402.5601, 4005.1299, 4287.1802,  ..., 4323.0801, 4736.9199,
         4496.9199],
        [4386.6699, 3998.4600, 4276.4102,  ..., 4313.8501, 4727.6899,
         4486.1499],
        [4378.9702, 3990.7700, 4271.7900,  ..., 4314.8701, 4734.3599,
         4485.1299]]))
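
The index returned by `window` is meant to be used for advanced indexing on the series itself. A minimal sketch on a toy tensor, repeating the same index construction with `n_time = 3` (the tensor `x` and its values are chosen for illustration only):

```python
import torch

# A toy series of 6 timesteps with 2 channels.
x = torch.arange(12).reshape(6, 2)

# Same index construction as in `window`, with n_time = 3.
n_time = 3
n_window = len(x) - n_time + 1
idx = torch.arange(n_time).reshape(1, -1) + torch.arange(n_window).reshape(-1, 1)

# Advanced indexing gathers every window at once:
# the result has shape (n_window, n_time, channels).
windows = x[idx]
print(idx.shape, windows.shape)
print(windows[1])
```

Each row of `idx` lists the timesteps of one window, so `windows[1]` holds timesteps 1, 2 and 3 of the series.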


In [171]:
next(loader)

AttributeError: 'BaseDataIterator' object has no attribute 'index'
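
This error is consistent with how `BaseDataIterator` is written: `self.index` is only created inside `__iter__`, so calling `next()` on a fresh object fails. The sketch below reproduces the pattern with a hypothetical `Countdown` class that is not part of the notebook.

```python
class Countdown:
    """Toy iterator whose state is created in __iter__, like BaseDataIterator."""

    def __init__(self, start: int) -> None:
        self.start = start

    def __iter__(self) -> "Countdown":
        # the state lives here, so __next__ only works after iter() has run
        self.current = self.start
        return self

    def __next__(self) -> int:
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


counter = Countdown(3)
try:
    next(counter)  # no iter() call yet, so self.current does not exist
except AttributeError as e:
    print(e)

print(next(iter(counter)))  # iter() initialises the state first
print(list(counter))  # list() and for-loops call iter() implicitly
```

Calling `next(iter(loader))`, or looping with `for batch in loader`, should avoid the missing attribute in the same way.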

In [None]:

class PaddedDatagenerator(BaseDataIterator):
    """Iterator with additional padding of X"""

    def __init__(self, dataset: BaseDataset, batchsize: int) -> None:
        super().__init__(dataset, batchsize)

    def __next__(self) -> Tuple[Tensor, Tensor]:
        if self.index <= (len(self.dataset) - self.batchsize):
            X, Y = self.batchloop()  # noqa N806
            X_ = pad_sequence(X, batch_first=True, padding_value=0)  # noqa N806
            return X_, torch.tensor(Y)
        else:
            raise StopIteration


class BaseDatastreamer:
    """This datastreamer will never stop
    The dataset should have a:
        __len__ method
        __getitem__ method

    """

    def __init__(
        self,
        dataset: BaseDataset,
        batchsize: int,
        preprocessor: Optional[Callable] = None,
    ) -> None:
        self.dataset = dataset
        self.batchsize = batchsize
        self.preprocessor = preprocessor
        self.size = len(self.dataset)
        self.reset_index()

    def __len__(self) -> int:
        return int(len(self.dataset) / self.batchsize)

    def reset_index(self) -> None:
        self.index_list = np.random.permutation(self.size)
        self.index = 0

    def batchloop(self) -> Sequence[Tuple]:
        batch = []
        for _ in range(self.batchsize):
            x, y = self.dataset[int(self.index_list[self.index])]
            batch.append((x, y))
            self.index += 1
        return batch

    def stream(self) -> Iterator:
        while True:
            if self.index > (self.size - self.batchsize):
                self.reset_index()
            batch = self.batchloop()
            if self.preprocessor is not None:
                X, Y = self.preprocessor(batch)  # noqa N806
            else:
                X, Y = zip(*batch)  # noqa N806
            yield X, Y


In [None]:
from __future__ import annotations
import random
import shutil
from datetime import datetime
from pathlib import Path
from typing import Callable, Dict, Iterator, List, Optional, Sequence, Tuple, Union
import numpy as np
import tensorflow as tf
import torch
from loguru import logger
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
from src.data import data_tools
from src.data.data_tools import PaddedDatagenerator, TSDataset

def window(x: Tensor, n_time: int) -> Tensor:
    """
    Generates an index that can be used to window a timeseries.
    E.g. the single series [0, 1, 2, 3, 4, 5] can be windowed into 4 timeseries with
    length 3 like this:

    [0, 1, 2]
    [1, 2, 3]
    [2, 3, 4]
    [3, 4, 5]

    We now can feed 4 different timeseries into the model, instead of 1, all
    with the same length.
    """
    n_window = len(x) - n_time + 1
    time = torch.arange(0, n_time).reshape(1, -1)
    window = torch.arange(0, n_window).reshape(-1, 1)
    idx = time + window
    return idx



def get_eeg(data_dir: Union[Path, str] = "../../data/raw") -> Path:
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00264/EEG%20Eye%20State.arff"
    datapath = tf.keras.utils.get_file(
        "eeg", origin=url, untar=False, cache_dir=data_dir
    )
    datapath = Path(datapath)
    logger.info(f"Data is downloaded to {datapath}.")
    return datapath

class Dataloader:
    def __init__(self, datapath: Path):
        self.path = datapath
        self.dataset = []
        self.process_data()

    def __getitem__(self, idx: int) -> Tuple[Tensor, Tensor]:
        return self.dataset[idx]

    def __len__(self) -> int:
        return len(self.dataset)

    def process_data(self) -> None:
        file = self.path
        x_ = np.genfromtxt(file)[:, 3:]
        x = torch.tensor(x_).type(torch.float32)
        y = torch.tensor(int(file.parent.name) - 1)
        self.dataset.append((x, y))

class BaseDataset:
    def __init__(self, paths: List[Path]) -> None:
        self.paths = paths
        random.shuffle(self.paths)
        self.dataset = []
        self.process_data()

    def process_data(self) -> None:
        for file in tqdm(self.paths):
            x_ = np.genfromtxt(file)[:, 3:]
            x = torch.tensor(x_).type(torch.float32)
            y = torch.tensor(int(file.parent.name) - 1)
            self.dataset.append((x, y))

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, idx: int) -> Tuple[Tensor, int]:
        return self.dataset[idx]

class BaseDataIterator:
    def __init__(self, dataset: BaseDataset, batchsize: int):
        self.dataset = dataset
        self.batchsize = batchsize

    def __len__(self) -> int:
        # the length is the number of batches
        return int(len(self.dataset) / self.batchsize)

    def __iter__(self) -> BaseDataIterator:
        # initialize index
        self.index = 0
        self.index_list = torch.randperm(len(self.dataset))
        return self
    
    def batchloop(self) -> Tuple[List, List]:
        X = []  # noqa N806
        Y = []  # noqa N806
        # fill the batch
        for _ in range(self.batchsize):
            x, y = self.dataset[int(self.index_list[self.index])]
            X.append(x)
            Y.append(y)
            self.index += 1
        return X, Y

    def __next__(self) -> Tuple[Tensor, Tensor]:
        if self.index <= (len(self.dataset) - self.batchsize):
            X, Y = self.batchloop()
            return X, Y
        else:
            raise StopIteration

# BCELoss example
In this example, which input would you prefer for the given target?

In [25]:
import torch
import torch.nn as nn

input1 = torch.tensor([0.1, 0.1, 0.7, 0.9])
input2 = torch.tensor([0.1, 0.3, 0.6, 0.7])
target = torch.tensor([0., 0., 1., 1.])

So, which loss should you pick? CrossEntropyLoss won't work:

In [32]:
loss = nn.CrossEntropyLoss()
try:
    loss(input1, target)
except Exception as e:
    print(e)

Dimension out of range (expected to be in range of [-1, 0], but got 1)


You will need BCELoss for this.
Binary cross entropy loss works like this:
$$X = \{x_1, \dots, x_n\}$$

$$l_i = -(y_i \cdot \log(x_i) + (1-y_i) \cdot \log(1-x_i))$$
$$BCELoss = mean(l)$$

Note that the labels are assumed to be either 0 or 1 (hence, the binary part).
If a label is 0, only the second part is relevant. If the label is 1, only the first part is relevant. The default reduction is "mean", so the per-element losses below are averaged:

$$
l_i =
\begin{cases}
-\log(1 - x_i) & \text{if } y_i = 0\\
-\log(x_i) & \text{if } y_i = 1
\end{cases}
$$



We can see this works nicely for a sequence of 0s and 1s.
You can see that input1 is preferred (it gets the lower loss), because its predictions are closer to the targets.

In [34]:
loss = nn.BCELoss()
loss(input1, target), loss(input2, target)

(tensor(0.1682), tensor(0.3324))
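
The two values above can be reproduced by hand from the formula for $l_i$. This is a small sketch in plain Python, where `bce` is a helper name introduced here for illustration and is not a PyTorch function:

```python
import math


def bce(inputs, targets):
    """Mean binary cross entropy for probabilities strictly between 0 and 1."""
    losses = [
        -(y * math.log(x) + (1 - y) * math.log(1 - x))
        for x, y in zip(inputs, targets)
    ]
    return sum(losses) / len(losses)


target = [0.0, 0.0, 1.0, 1.0]
loss1 = bce([0.1, 0.1, 0.7, 0.9], target)
loss2 = bce([0.1, 0.3, 0.6, 0.7], target)
print(round(loss1, 4), round(loss2, 4))  # 0.1682 0.3324
```

For `input1` three of the four terms equal $-\log(0.9)$, which is why its mean is so much lower.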

Or a more generic example

In [35]:
m = nn.Sigmoid() # make sure outputs are between 0 and 1
X = torch.randn(100) # generate 100 random inputs
yhat = m(X) # our dummy model

p = torch.ones_like(yhat) / 2
y = torch.bernoulli(p) # we create a random label sequence of 0s and 1s
loss(yhat, y)

tensor(0.8594)