# Experimentation 2: Inspecting our Data Set
We will investigate and get a feel for EEG data.
The focus lies on accessing the file format,
the metadata and signal data it contains, and displaying them in a useful way.

These tasks will help you consolidate your Python skills.
We will also revisit the general ideas presented in **Experimentation 1**.
With these preparations in place,
we will be able to make an effective start on the actual contents of the course.

I will prepare an accompanying pdf document with additional information.


## Reporting
Not every plot or number generated in this template goes into the report.
Nonetheless, some intermediate steps are required to achieve the relevant results.
Cells that have a *direct* relation to a part of the report
are marked with **Report:**.
Be sure not to skip relevant precursor steps.
But when you are short on time, think about your priorities, and try to avoid spending time on decorative steps or cells which are not needed for the report.

The relevant information about what to include in the report is in the corresponding PDF document on itslearning.
Annotations in this file are just for orientation and might not be complete.


## What to do
In this notebook, you will
- load the data set
- optional: downsample for plotting<br/>
This step may not be that relevant when working locally, but could still improve navigation / zoom in the whole time series a lot, depending on your local resources (CPU/RAM).
Note, even if you experience problems plotting the whole time series, you can still use full-detail plots on smaller regions.
Advanced users could create an interactive zoom function, which will resample the data based on the number of visible points.
- filter the data
- investigate the autocorrelation function<br/>
Note: Using pandas' autocorrelation_plot can be rather time consuming for a signal that size.
A faster way is to use [scipy.signal's correlate](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html) to calculate the autocorrelation function and plot the result directly. 
- extract and display the seizure annotations
- determine the stationarity of the time series

## Discussion
Prepare for discussion in class, which will take place either in groups or with your neighbours:
- What is the meaning of the filter coefficients?
- What can you observe on the edges in the filter response?
- Compare the autocorrelation function of the selected EEG data with that of other data sets and discuss how they relate.
- Compare the filtered signal to the original signal.
- Is the available EEG data stationary? What do you observe?

Feel free to use the forum on itslearning.

## Good to Know
### Note on using Cells in Notebooks
When using Jupyter Notebooks, you can split your script into cells, which correspond to blocks of code which can be re-executed individually. Keep in mind though that the variables are shared, i.e., they are global variables.

At the same time, you will find that you will modify parts of your notebook several times before you go on to the next step.

Therefore, we recommend not overwriting your variables when processing the data; otherwise, re-execution will apply your algorithm repeatedly, which in most cases will not have the desired effect.

Example: (Good)
```
    # Cell 1:
    raw_data = load_from(file)

    # Cell 2:
    filtered_data = filter(raw_data)
```

Now imagine that Cell 2 read ```raw_data = filter(raw_data)``` instead.
What would happen when re-executing Cell 2 several times?

### Storing away pre-processed data
Please note, there are many cases where it makes sense to write pre-processed data or extracted features to disk and continue processing them in a new notebook/script. Whenever the pre-processing or feature extraction takes ages, and especially while developing the analysis (when redoing parts of the notebook again and again), you will want to store the results so that you do not have to re-run everything each time.

Pickle can be a simple solution for storing and then loading your temporary data.

https://docs.python.org/3/library/pickle.html
https://wiki.python.org/moin/UsingPickle

Short example for variable x:

```
import pickle

# storing x (the with-block closes the file afterwards)
with open('x-store.dat', 'wb') as f:
    pickle.dump(x, f)

# loading x
with open('x-store.dat', 'rb') as f:
    x = pickle.load(f)
```

Note: Pickle has some caveats; please search the internet for a thorough discussion.
From an accessibility perspective, using a general format like HDF5 or Arrow might be preferable.
Domain- and application-specific formats like EDF typically cannot hold your preprocessed data.
When working with Python, pickle is a fast and easy-to-use solution,
but be aware that you shouldn't load pickled data from untrusted sources,
as it could contain malicious code.

In [None]:
import numpy as np                            # Array library

import scipy                                  # Algorithms working on arrays
from scipy import signal                      # Signal processing (filters, correlation)


# Jupyter lab supports interactive plots      # Matplotlib for plotting
# using "widget"
%matplotlib widget

# Jupyter lab doesn't support notebook,
# which was the preferred method for jupyter notebooks.
#%matplotlib notebook
#%matplotlib inline


from matplotlib import pyplot as plt
from matplotlib import patches
import matplotlib.dates as mdates

import seaborn as sns                         # Advanced plotting, support for data frames

import pandas as pd                           # Advanced data frames & csv reading
from pandas.plotting import autocorrelation_plot

# Adjust plot size & resolution for inline display.
# Tune to your needs.
plt.rcParams['figure.figsize'] = [9, 5.56]
plt.rcParams['figure.dpi'] = 100

# Augmented Dickey-Fuller test
from statsmodels.tsa.stattools import adfuller, pacf, acf

In [None]:
# Defining base paths for read-only and read-write data
# will make it easy for us to switch between cloud
# and local environments by just adjusting the paths.
#
# Also, it will prevent accidental overwriting of read-only data.

from pathlib import Path         # OS agnostic path handling (/ vs \)

user = 'jb'                      # Per-user output directories

# Base directories
# DATA_DIR -- where the read-only sources are
DATA_DIR = Path('/work/data')

# OUTPUT_DIR -- where we will keep our data (read/write)
# We will make sure it exists!
OUTPUT_DIR = Path('/work/output')

# Now create our own output directory and change to it
OUTPUT_DIR /= user
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

import os
os.chdir(OUTPUT_DIR)             # Change working directory

# Data Access
Now, we read the actual data.

In [None]:
from pyedflib import highlevel

fn = DATA_DIR / \
    'module_3_-_time_series_analysis_for_eeg/seizure_eeg_data' / \
    '01.edf'
print('Reading EEG from: {}'.format(fn))

signals, signal_headers, header = \
    highlevel.read_edf(str(fn))

In [None]:
# Labels for all channels in the edf
for i, sh in enumerate(signal_headers):
    print('Channel {:2d}: {}'.format(i, sh['label']))

# Inspecting the data
Plot the data with ```plt.plot()```; use ```plt.xlabel()```, ```plt.ylabel()```, and ```plt.title()``` to provide names and units for the axes as well as the plot.
Make sure the right units are displayed, i.e., time in seconds, and use the appropriate physical unit for the y axis.
If you experience problems with display speed, use simple downsampling to achieve faster zooming at low zoom levels.

Inspect a single channel in detail.
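
The time axis and the simple downsampling mentioned above can be sketched as follows. This is only an illustration on synthetic data: the helper `make_plot_data`, the `max_points` parameter, and the sampling rate of 256 Hz are assumptions, not part of the template. The real sampling rate should be taken from the signal header of the channel you inspect.

```python
import numpy as np

def make_plot_data(channel, sampling_rate, max_points=20_000):
    """Return (times_in_seconds, values), thinned to at most max_points samples."""
    channel = np.asarray(channel)
    # Time axis in seconds: sample index divided by the sampling rate.
    times = np.arange(channel.size) / sampling_rate
    # Keep every step-th sample; step is 1 when the signal is already short.
    step = max(1, int(np.ceil(channel.size / max_points)))
    return times[::step], channel[::step]

# Synthetic stand-in for one EEG channel: 60 s sampled at an assumed 256 Hz.
fs = 256.0
demo_channel = np.sin(2 * np.pi * 1.0 * np.arange(int(60 * fs)) / fs)

plot_times, plot_channel = make_plot_data(demo_channel, fs, max_points=2_000)
print(len(demo_channel), '->', len(plot_channel), 'points')
```

The two returned arrays can be passed directly to ```plt.plot()```. Keeping every n-th sample does not low-pass filter the data first, so fast components can alias; this is acceptable for a quick overview, but the analysis itself should use the full-resolution signal.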

# Extracting Annotations
To be able to create a model, we will need to know when seizures happened. Thus, extract the seizure annotations and then use matplotlib patches to indicate seizure areas.

https://matplotlib.org/stable/api/_as_gen/matplotlib.patches.Rectangle.html

In [None]:
# Create a new plotting canvas for the downsampled plot
fig, ax = plt.subplots()
plt.title('Downsampled Signal with Annotations')
plt.plot(plot_times, plot_channel)

r = patches.Rectangle(
    (1000, -1000), 4000, 3000,
    linewidth=2, edgecolor='r', fill=False, zorder=100
)
ax.add_patch(r)
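
The rectangle in the cell above uses fixed example coordinates. A minimal sketch of how rectangle parameters could be derived from annotations is given below. It assumes that each annotation is an (onset in seconds, duration in seconds, description) triple; with pyedflib's high-level reader such entries are usually found in the file header under an `annotations` key, but check what your file actually contains. The helper `seizure_boxes`, its `margin` parameter, and the demo values are hypothetical.

```python
import numpy as np

def seizure_boxes(annotations, values, margin=0.05):
    """Turn (onset, duration, text) entries into Rectangle arguments.

    Returns a list of ((x, y), width, height) tuples, where x and width are
    in seconds and y and height are in the units of the plotted signal.
    """
    low, high = float(np.min(values)), float(np.max(values))
    pad = margin * (high - low)          # small vertical margin around the signal
    boxes = []
    for onset, duration, _text in annotations:
        boxes.append(((float(onset), low - pad),
                      float(duration),
                      (high - low) + 2 * pad))
    return boxes

# Hypothetical annotations and signal values, for illustration only.
demo_annotations = [(1000.0, 45.0, 'Seizure'), (2500.0, 60.0, 'Seizure')]
demo_values = np.array([-200.0, 150.0, 300.0, -100.0])

boxes = seizure_boxes(demo_annotations, demo_values)
print(boxes[0])
```

Each tuple can then be unpacked into ```patches.Rectangle(xy, width, height, ...)``` and added to the axes with ```ax.add_patch()```, one rectangle per annotated seizure.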

## Inspect all channels with seizure annotations
Are there channels where you can see clear effects of the seizure?
Try zooming.

**Report:** Parts 3.1, 3.2

# Pre-Processing
Whatever needs to be done!

Please note, there are many cases where it makes sense to write pre-processed data or extracted features to disk and continue processing them in a new notebook/script. Whenever the pre-processing or feature extraction takes ages, and especially while developing the analysis, you will want to store the results so that you do not have to re-run everything each time.

Another solution is to check in one notebook if a certain result has been stored and then load that automatically.

Pickle can be a simple solution for storing and then loading your temporary data.

https://docs.python.org/3/library/pickle.html
https://wiki.python.org/moin/UsingPickle

Short example for variable x:

```
import pickle

# storing x (the with-block closes the file afterwards)
with open('x-store.dat', 'wb') as f:
    pickle.dump(x, f)

# loading x
with open('x-store.dat', 'rb') as f:
    x = pickle.load(f)
```
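
The idea of checking whether a result has already been stored, and loading it automatically if so, could look roughly like this. The helper name `load_or_compute` and the demo function `expensive_step` are made up for this sketch; a temporary directory is used here only so that the example does not leave files behind.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_compute(cache_file, compute):
    """Load a pickled result if the file exists, otherwise compute and store it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(cache_file, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []                                # records how often the slow step runs

def expensive_step():
    """Stand-in for a slow pre-processing step."""
    calls.append(1)
    return {'filtered': [1.0, 2.0, 3.0]}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'x-store.dat'
    first = load_or_compute(cache, expensive_step)    # computes and stores
    second = load_or_compute(cache, expensive_step)   # loads from disk

print(first == second, 'computed', len(calls), 'time(s)')
```

Keep in mind that a stored result becomes stale when you change the pre-processing code; in that case, delete the file so that it is recomputed.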

## Here: Create a band-pass filter based on a FIR filter
Besides the desired EEG signal, which records brain activity, our recording equipment picks up a lot of other signals.
These include electronic noise (in the circuits), electrode movement, general muscle activity (from body movements),
and eye movements (the eyes are electrically charged!).
To remove the noise, we apply a band-pass filter, which will cut off very slowly and very quickly changing components.

Using the ```firwin``` function, we can generate a filter based on the given parameters,
and using the ```convolve``` function, we can apply the filter to a signal.
The possible parameters include the allowed frequencies, i.e., slowest and fastest allowed changes, as well as the filter length.
The latter describes how many input values the filter takes into account when calculating one output value.

Try different filter lengths. Start with 1 Hz as lower and 70 Hz as higher limit for the frequency.

Plot the filter coefficients (the resulting array).

```
coefficients = scipy.signal.firwin(filter_length, [low_cutoff, high_cutoff], fs=sampling_frequency, pass_zero=False)
filtered_signal = scipy.signal.convolve(raw_signal, coefficients, mode='same')
```
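
To get a feel for what the band-pass filter does before applying it to the EEG data, the two calls can be tried on a synthetic signal whose components are known. The sketch below assumes a sampling rate of 256 Hz and a filter length of 1025; both values, as well as the test signal itself, are chosen for illustration only.

```python
import numpy as np
from scipy import signal

fs = 256.0                                # assumed sampling rate in Hz
t = np.arange(int(20 * fs)) / fs          # 20 s of synthetic data

# Known components: slow drift, 10 Hz "activity", 100 Hz interference.
drift = 20.0 * np.sin(2 * np.pi * 0.1 * t)
activity = 10.0 * np.sin(2 * np.pi * 10.0 * t)
interference = 5.0 * np.sin(2 * np.pi * 100.0 * t)
raw = drift + activity + interference

numtaps = 1025                            # filter length (illustrative choice)
coefficients = signal.firwin(numtaps, [1.0, 70.0], fs=fs, pass_zero=False)
filtered = signal.convolve(raw, coefficients, mode='same')

# Near both ends the filter window reaches beyond the data, so compare
# only the part that is at least half a filter length away from the edges.
edge = numtaps // 2
error = np.max(np.abs(filtered[edge:-edge] - activity[edge:-edge]))
print('largest deviation from the 10 Hz component:', error)
```

Away from the edges, the filtered signal should follow the 10 Hz component closely, while the drift and the 100 Hz interference are strongly attenuated. Repeating this with shorter filters shows how the filter length affects the result.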

**References:**
  * [The scipy function reference](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.firwin.html)
  * [The scipy cookbook](https://scipy-cookbook.readthedocs.io/items/FIRFilter.html)
  * [Wikipedia on finite impulse response filters](https://en.wikipedia.org/wiki/Finite_impulse_response)

## Filter response
Use scipy.signal.freqz to get the filter response.
Use insets to magnify the edges of the band pass.
Use ```plt.xlim()``` and ```plt.ylim()``` to zoom in on the relevant parts.

Check https://scipy-cookbook.readthedocs.io/items/FIRFilter.html for details.

In [None]:
# Filter response

plt.figure()
plt.title('Filter response')
w, h = signal.freqz(_filter, worN=8192)
plt.plot((w/np.pi)*nyq_freq, np.absolute(h), linewidth=2)
plt.xlabel('Frequency (Hz)')
plt.ylabel('Gain')
plt.ylim(-0.05, 1.05)
plt.grid(True)

# Upper inset plot.
ax1 = plt.axes([0.42, 0.6, .45, .25])
plt.plot((w/np.pi)*nyq_freq, np.absolute(h), linewidth=2)
plt.grid(True)

# Lower inset plot
ax2 = plt.axes([0.42, 0.25, .45, .25])
plt.plot((w/np.pi)*nyq_freq, np.absolute(h), linewidth=2)
plt.grid(True)
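
The cell above converts the normalized frequencies returned by `freqz` to Hz by hand. As an alternative sketch, `freqz` also accepts the sampling frequency through its `fs` argument and then returns the frequency axis in Hz directly. The sampling rate of 256 Hz, the filter length, and the helper `gain_at` are assumptions made for this example.

```python
import numpy as np
from scipy import signal

fs = 256.0                                # assumed sampling rate in Hz
coefficients = signal.firwin(1025, [1.0, 70.0], fs=fs, pass_zero=False)

# With fs given, the first return value is the frequency axis in Hz.
freqs, response = signal.freqz(coefficients, worN=8192, fs=fs)
gain = np.abs(response)

def gain_at(frequency_hz):
    """Gain at the grid frequency closest to frequency_hz."""
    return gain[np.argmin(np.abs(freqs - frequency_hz))]

for f in (0.1, 10.0, 35.0, 100.0):
    print('{:6.1f} Hz: gain {:.4f}'.format(f, gain_at(f)))
```

Plotting `gain` over `freqs` and restricting the x axis with ```plt.xlim()``` to a few Hz around each cutoff frequency shows the band edges in detail.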

## Compare the Filtered Signal to the Original Signal
Plot both signals on top of each other.
How do the filter response and the filtered signal connect?
We will revisit this angle when doing Fourier decompositions in the course.

# Autocorrelation

Using pandas' autocorrelation_plot can be rather time consuming for a signal that size.

A faster way is to use scipy.signal's correlate to calculate the autocorrelation function and plot the result directly. 
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html

**Report:** Part 3.1

In [None]:
auto_corr = signal.correlate(filtered, filtered)
corr_lags = signal.correlation_lags(len(filtered), len(filtered))
plt.figure()
plt.plot(corr_lags, auto_corr)

In [None]:
#plt.figure()
#autocorrelation_plot(filtered)
#plt.ylim((-0.2, 0.2))
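
The raw output of `correlate` depends on the signal amplitude and covers negative as well as positive lags. For comparing different data sets it can help to remove the mean, normalize the result so that lag 0 equals 1, and keep only non-negative lags. The following is a minimal sketch on a synthetic periodic signal; the helper name `normalized_autocorrelation` is an assumption of this example.

```python
import numpy as np
from scipy import signal

def normalized_autocorrelation(x):
    """Autocorrelation for lags >= 0, scaled so that lag 0 equals 1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                              # remove the offset first
    full = signal.correlate(x, x, mode='full', method='fft')
    lags = signal.correlation_lags(x.size, x.size, mode='full')
    keep = lags >= 0                              # symmetric, so one half suffices
    return lags[keep], full[keep] / full[lags == 0][0]

# Synthetic signal with a period of 50 samples.
n = np.arange(2000)
demo = np.sin(2 * np.pi * n / 50)

lags, acf_values = normalized_autocorrelation(demo)
print(acf_values[0], acf_values[25], acf_values[50])
```

For the periodic test signal, the function is close to 1 at multiples of the period and close to -1 halfway in between. Dividing the lags by the sampling rate gives the lag in seconds.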

# Determine Stationarity
Can you determine the stationarity for the full time series?
The test requires a lot of memory to run, which laptops in particular often do not have.
In that case, work on a subset of the channel.

Even if you can determine the stationarity for the whole time series,
the question remains whether the result of the test depends on downsampling or on taking only a part of the time series.
So please try different amounts of data points.

**Report:** Part 3.1

# Convert to Meaningful Units
The data files contain binary values recorded directly by analogue to digital converters.
While we, of course, can just plot those values, the results will be hard to interpret.

Luckily, it is possible to use the metadata for the channels to convert the unitless numbers to the actual measured voltages.
To this end, the metadata contains the digital minimum and maximum values together with the corresponding physical minimum and maximum values.

Apply the correct transformation to get correct physical units and use them in your axes labels when plotting the signal.
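
The transformation is a linear map from the digital range onto the physical range. A minimal sketch, assuming a signal header with the entries `digital_min`, `digital_max`, `physical_min`, `physical_max`, and `dimension`; the numbers in `demo_header` are invented for illustration and do not come from the course data:

```python
import numpy as np

def digital_to_physical(digital, header):
    """Map raw converter values linearly onto the physical range."""
    d_min, d_max = header['digital_min'], header['digital_max']
    p_min, p_max = header['physical_min'], header['physical_max']
    gain = (p_max - p_min) / (d_max - d_min)      # physical units per digital step
    return (np.asarray(digital, dtype=float) - d_min) * gain + p_min

# Hypothetical header values, for illustration only.
demo_header = {
    'digital_min': -32768, 'digital_max': 32767,
    'physical_min': -3276.8, 'physical_max': 3276.7,
    'dimension': 'uV',
}

demo_digital = np.array([-32768, 0, 32767])
demo_physical = digital_to_physical(demo_digital, demo_header)
print(demo_physical, demo_header['dimension'])
```

Before applying this to your data, check the value range of the loaded arrays and the unit stated in the header: depending on how the file was read, the values may already be in physical units.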

**Short on time?** Skip this meaningful units block for now and focus on the other tasks first. Having useful units is desirable, but when pressed for time, the priority should be on completing the detector.

**Report:** Part 3.1

# The (Optional) Final Step

Dump your preprocessed data into a file, for example with pickle, so that you can reuse it in the detection template instead of copying over all the code.

For template 03, we will need all channels in filtered form.
Instead of copying and adjusting the code from this template,
you can also filter all channels here and pickle them away for later use.
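
A possible way to filter all channels and store them together with the information needed to interpret them later is sketched below. It uses random numbers instead of EEG data and assumes that all channels have the same length and sampling rate; the channel labels, the file name, and the dictionary layout are made up for this example.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from scipy import signal

fs = 256.0                                        # assumed common sampling rate in Hz
rng = np.random.default_rng(1)
demo_signals = rng.normal(size=(3, 2048))         # 3 channels of synthetic data
demo_labels = ['ch0', 'ch1', 'ch2']               # hypothetical channel labels

coefficients = signal.firwin(257, [1.0, 70.0], fs=fs, pass_zero=False)

# Apply the same filter to every channel; results go into new variables.
filtered_channels = np.array([
    signal.convolve(channel, coefficients, mode='same')
    for channel in demo_signals
])

store = {
    'labels': demo_labels,
    'sampling_rate': fs,
    'filtered': filtered_channels,
}

# A temporary directory keeps the example free of side effects;
# in the notebook, a path inside OUTPUT_DIR would be used instead.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'filtered-channels.pkl'
    with open(target, 'wb') as f:
        pickle.dump(store, f)
    with open(target, 'rb') as f:
        restored = pickle.load(f)

print(restored['labels'], restored['filtered'].shape)
```

Storing labels and sampling rate next to the data means the later template does not need to open the EDF file again just to look them up.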