# Investigate differences in the feature matrices
____

Current bug: the backend currently returns invalid predictions (every epoch is classified as Wake). While investigating, I found that the feature matrix sent to the classifier differs substantially from the one generated in our notebooks.

## Feature matrix differences
___

In [41]:
import numpy as np
import pandas as pd
import mne
from constants import EPOCH_DURATION

In [36]:
start = 1582418280
bed = 1582423980
wake = 1582452240

nb_epoch_bed = int((bed - start) // EPOCH_DURATION)
nb_epoch_wake = int((wake - start) // EPOCH_DURATION)

In [37]:
X_notebook = np.vstack(np.load("data/X_openbci_HP_PRIOR.npy", allow_pickle=True))[nb_epoch_bed:nb_epoch_wake]
X_backend = np.load("investigation_data/feature.npy", allow_pickle=True)

X_notebook.shape, X_backend.shape

((942, 48), (942, 48))

In [40]:
# Mean absolute difference between notebook and backend, per feature column
for feature_idx in range(X_notebook.shape[1]):
    difference = np.mean(np.abs(X_notebook[:, feature_idx] - X_backend[:, feature_idx]))
    print(f"Feature {feature_idx} diff: {difference:.2f}")

Feature 0 diff: 0.00
Feature 1 diff: 0.00
Feature 2 diff: 0.14
Feature 3 diff: 21.24
Feature 4 diff: 0.01
Feature 5 diff: 0.04
Feature 6 diff: 5.94
Feature 7 diff: 0.00
Feature 8 diff: 0.80
Feature 9 diff: 116.65
Feature 10 diff: 6.50
Feature 11 diff: 1.92
Feature 12 diff: 1.94
Feature 13 diff: 0.23
Feature 14 diff: 0.00
Feature 15 diff: 0.00
Feature 16 diff: 0.00
Feature 17 diff: 0.00
Feature 18 diff: 0.00
Feature 19 diff: 0.01
Feature 20 diff: 1486454042448.59
Feature 21 diff: 71806204238.74
Feature 22 diff: 16018476987.50
Feature 23 diff: 22727447431.86
Feature 24 diff: 9513626021.77
Feature 25 diff: 0.08
Feature 26 diff: 12.56
Feature 27 diff: 0.01
Feature 28 diff: 0.04
Feature 29 diff: 6.80
Feature 30 diff: 0.00
Feature 31 diff: 3.70
Feature 32 diff: 51.18
Feature 33 diff: 2.82
Feature 34 diff: 1.20
Feature 35 diff: 0.55
Feature 36 diff: 0.10
Feature 37 diff: 0.00
Feature 38 diff: 0.00
Feature 39 diff: 0.00
Feature 40 diff: 0.00
Feature 41 diff: 0.00
Feature 42 diff: 0.02
Feature 

The largest differences, on the order of 1e10 to 1e12, are in the time-domain subband features. These features are computed by applying a subband filter (e.g. delta) and taking the mean energy of the filtered signal in the time domain. The energy is computed from the sum of the squared samples.

Because the samples are squared, any difference in the signal is strongly amplified in the subband features.
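
To illustrate the amplification, here is a minimal sketch of a subband energy feature. It is not the code used in the backend or the notebooks: the FFT-based band-pass is a simple stand-in for the real subband filter, and the sampling rate (250 Hz), the epoch length (30 s) and the delta band limits (0.5 to 4.5 Hz) are assumptions made for the example.

```python
import numpy as np

def bandpass_fft(signal, sfreq, low, high):
    """Keep only the frequency components between low and high (in Hz).

    Simple FFT mask, used here as a stand-in for the real subband filter.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sfreq)
    spectrum[(freqs < low) | (freqs > high)] = 0
    return np.fft.irfft(spectrum, n=signal.size)

def mean_subband_energy(signal, sfreq, low, high):
    """Mean of the squared samples of the band-limited signal."""
    filtered = bandpass_fft(signal, sfreq, low, high)
    return np.mean(filtered ** 2)

sfreq = 250                                # assumed sampling rate, in Hz
t = np.arange(30 * sfreq) / sfreq          # one epoch, assumed to last 30 s
signal = 50 * np.sin(2 * np.pi * 2 * t)    # synthetic 2 Hz sine, inside the assumed delta band

energy = mean_subband_energy(signal, sfreq, 0.5, 4.5)
energy_scaled = mean_subband_energy(1000 * signal, sfreq, 0.5, 4.5)
ratio = energy_scaled / energy

# A factor of 1000 on the signal becomes a factor of about 1000**2 on the energy.
print(f"energy: {energy:.1f}, ratio after scaling the signal by 1000: {ratio:.3e}")
```

Since the feature is quadratic in the signal, a scale or offset mismatch between the two pipelines would show up far more strongly in these columns than in the others.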

### Known errors
___

We know that the signal we trained on was quantized differently from the original OpenBCI Cyton signal. The EDF specification encodes samples on 16 bits, whereas we currently keep the original 24-bit encoding in the backend.

## 1. Conversion from hexadecimal to decimal

We compare the same recording converted with the OpenBCI GUI and with our own code. The recording used is the mini file.

In [28]:
openbci_gui_data = pd.read_csv("investigation_data/SDconverted-2020-10-29_00-05-19_mini.csv", skiprows=7, usecols=[1,2], names=["Fpz-Cz", "Pz-Oz"])
openbci_gui_data

Unnamed: 0,Fpz-Cz,Pz-Oz
0,-11104.548,205.09960
1,-11065.186,191.77795
2,-11017.264,180.24446
3,-11041.448,194.37076
4,-11036.777,194.03549
...,...,...
899993,-17148.057,6566.42800
899994,-17136.926,6567.14360
899995,-17134.377,6574.34100
899996,-17127.582,6579.28030


In [7]:
script_data = pd.DataFrame(data=np.transpose(np.load('investigation_data/raw_converted.npy')), columns=["Fpz-Cz", "Pz-Oz"])
script_data

Unnamed: 0,Fpz-Cz,Pz-Oz
0,-11104.547811,205.099607
1,-11065.186389,191.777967
2,-11017.264249,180.244467
3,-11041.448836,194.370770
4,-11036.777322,194.035494
...,...,...
899993,-17148.057180,6566.428431
899994,-17136.926012,6567.143687
899995,-17134.377913,6574.340948
899996,-17127.582982,6579.280684


In [23]:
difference_df = abs(openbci_gui_data - script_data)
print(f'Mean difference between decimal raw data \n{difference_df.mean()}\n')
print(f'Median difference between decimal raw data \n{difference_df.median()}\n')
print(f'Min difference between decimal raw data \n{difference_df.min()}\n')
print(f'Max difference between decimal raw data \n{difference_df.max()}')

Mean difference between decimal raw data 
Fpz-Cz    0.000550
Pz-Oz     0.000263
dtype: float64

Median difference between decimal raw data 
Fpz-Cz    0.000503
Pz-Oz     0.000212
dtype: float64

Min difference between decimal raw data 
Fpz-Cz    4.529284e-09
Pz-Oz     1.073204e-10
dtype: float64

Max difference between decimal raw data 
Fpz-Cz    0.002508
Pz-Oz     0.001277
dtype: float64


As shown above, the OpenBCI GUI stores only the first five decimals, whereas our backend does not round the values. Let's check whether this is the cause of the problem.

In [33]:
# Round our converted data to five decimals, like the OpenBCI GUI export
script_data_chopped = script_data.round(decimals=5)

difference_df = abs(openbci_gui_data - script_data_chopped)
print(f'Mean difference between decimal raw data \n{difference_df.mean()}\n')
print(f'Median difference between decimal raw data \n{difference_df.median()}\n')
print(f'Min difference between decimal raw data \n{difference_df.min()}\n')
print(f'Max difference between decimal raw data \n{difference_df.max()}')

Mean difference between decimal raw data 
Fpz-Cz    0.000550
Pz-Oz     0.000263
dtype: float64

Median difference between decimal raw data 
Fpz-Cz    0.00050
Pz-Oz     0.00021
dtype: float64

Min difference between decimal raw data 
Fpz-Cz    0.0
Pz-Oz     0.0
dtype: float64

Max difference between decimal raw data 
Fpz-Cz    0.00251
Pz-Oz     0.00128
dtype: float64


The maximum difference is the same as before rounding, so the discrepancy is not caused by the number of decimals.
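
This result is expected: rounding to five decimals moves a value by at most 0.000005, which is several hundred times smaller than the maximum difference observed above (about 0.0025). A small sketch on synthetic values (random numbers in roughly the same range as the recording, not the real data) shows the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic samples, in roughly the same range as the values printed above
values = rng.uniform(-20000, 20000, size=100_000)

decimals = 5
rounded = np.round(values, decimals=decimals)

# Largest change introduced by the rounding itself
max_rounding_error = np.max(np.abs(values - rounded))
# Theoretical bound: half of the last kept decimal
bound = 0.5 * 10 ** -decimals

print(f"max rounding error: {max_rounding_error:.2e} (bound: {bound:.1e})")
```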

## 2. Quantization
____

In the notebooks, we first convert the decimal file to another format, EDF. Since this format enforces 16-bit quantization, the data is requantized from 24 to 16 bits. This is not the case in the backend. We will compare both results.
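
As a rough idea of the error that requantization can introduce, the sketch below maps a signal onto 16-bit integer levels between a physical minimum and maximum, then converts it back. EDF stores samples as 16-bit integers with physical and digital ranges in the header, but whether our EDF conversion uses exactly this mapping and these ranges is an assumption. The function name and the synthetic signal are illustrative only.

```python
import numpy as np

def requantize_16_bits(signal, physical_min, physical_max):
    """Map a signal onto 16-bit integer levels and back to physical units.

    Returns the restored signal and the quantization step.
    """
    digital_min, digital_max = -32768, 32767
    step = (physical_max - physical_min) / (digital_max - digital_min)
    # Physical value -> nearest 16-bit integer level
    digital = np.round((signal - physical_min) / step) + digital_min
    digital = np.clip(digital, digital_min, digital_max).astype(np.int16)
    # 16-bit integer level -> physical value
    restored = (digital.astype(np.float64) - digital_min) * step + physical_min
    return restored, step

rng = np.random.default_rng(0)
# Synthetic signal in µV, in roughly the range of the values printed in this notebook
signal = rng.uniform(-17000, 7000, size=10_000)

restored, step = requantize_16_bits(signal, signal.min(), signal.max())
max_error = np.max(np.abs(signal - restored))

# The error is bounded by half a quantization step
print(f"step: {step:.4f} µV, max error: {max_error:.4f} µV")
```

With a range of about 24000 µV, the step is a few tenths of a microvolt, so requantization alone changes each sample by a fraction of a microvolt at most under these assumptions.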

In [46]:
notebook_data = mne.io.read_raw_edf('investigation_data/william-recording-mini.edf', preload=True, stim_channel=None, verbose=False)
notebook_data.get_data(), notebook_data.get_data().shape

(array([[-0.01110396, -0.01106457, -0.01101625, ..., -0.01713194,
         -0.01713614, -0.01712879],
        [ 0.0002052 ,  0.00019168,  0.00018013, ...,  0.00657419,
          0.00657165,  0.00656827]]), (2, 899975))

In [47]:
openbci_gui_data = pd.read_csv("investigation_data/SDconverted-2020-10-29_00-05-19_mini.csv", skiprows=7, usecols=[1,2], names=["Fpz-Cz", "Pz-Oz"])
openbci_gui_data

Unnamed: 0,Fpz-Cz,Pz-Oz
0,-11104.548,205.09960
1,-11065.186,191.77795
2,-11017.264,180.24446
3,-11041.448,194.37076
4,-11036.777,194.03549
...,...,...
899993,-17148.057,6566.42800
899994,-17136.926,6567.14360
899995,-17134.377,6574.34100
899996,-17127.582,6579.28030


In [43]:
script_data = pd.DataFrame(data=np.transpose(np.load('investigation_data/raw_converted.npy')), columns=["Fpz-Cz", "Pz-Oz"])
script_data

Unnamed: 0,Fpz-Cz,Pz-Oz
0,-11104.547811,205.099607
1,-11065.186389,191.777967
2,-11017.264249,180.244467
3,-11041.448836,194.370770
4,-11036.777322,194.035494
...,...,...
899993,-17148.057180,6566.428431
899994,-17136.926012,6567.143687
899995,-17134.377913,6574.340948
899996,-17127.582982,6579.280684
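
One possible way to compare the EDF data with the raw converted data is sketched below. Two points need care: MNE returns EEG data in volts, whereas the converted values are in microvolts, and the outputs above show different sample counts (899975 for the EDF file, 899997 rows for the converted data). The helper name is hypothetical, and truncating to the shorter length assumes both recordings start at the same sample, which has not been verified. The example values are the first samples printed above, so they are limited to the displayed precision.

```python
import numpy as np

def compare_signals(edf_volts, raw_microvolts):
    """Absolute difference, in µV, between EDF data and raw converted data.

    Both arrays have shape (n_channels, n_samples).
    """
    edf_microvolts = edf_volts * 1e6  # volts -> microvolts
    # Assumption: both recordings start at the same sample
    n_samples = min(edf_microvolts.shape[1], raw_microvolts.shape[1])
    return np.abs(edf_microvolts[:, :n_samples] - raw_microvolts[:, :n_samples])

# First samples copied from the outputs above (display precision only)
raw = np.array([[-11104.547811, -11065.186389, -11017.264249],
                [205.099607, 191.777967, 180.244467]])
edf = np.array([[-0.01110396, -0.01106457],
                [0.0002052, 0.00019168]])

difference = compare_signals(edf, raw)
print(difference.max(axis=1))
```

In the notebook, this could be called as `compare_signals(notebook_data.get_data(), script_data.to_numpy().T)`.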
