
<img src="https://raw.githubusercontent.com/sdgroeve/Machine_Learning_course_UGent_D012554_kaggle/master/header.png" alt="drawing"/>


A multi-channel electroencephalography (EEG) system enables a broad range of applications, including neurotherapy, biofeedback, and brain-computer interfacing. The dataset you will analyse was created with the [Emotiv EPOC+](https://www.emotiv.com/product/emotiv-epoc-14-channel-mobile-eeg).

It has 14 EEG channels with names based on the International 10-20 locations: AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, AF4:

<br/>
<br/>
<center>
<img src="https://raw.githubusercontent.com/sdgroeve/Machine_Learning_course_UGent_D012554_kaggle/master/EEG.png" alt="drawing" width="200"/>
</center>
<br/>
<br/>


All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. 

The experiment was conducted on one person only. The duration of the measurement was around 117 seconds.

From the paper:

> *The experiment was carried out in a quiet room. During
the experiment, the proband was being videotaped. To prevent
artifacts, the proband was not aware of the exact start time
of the measurement. Instead, he was told to sit relaxed, look
straight to the camera, and change the eye state at free will.
Only additional constraint was that, accumulated over the
entire session, the duration of both eye states should be about
the same and that the individual intervals should vary greatly
in length (from eye blinking to longer stretches)...*

The eye state was detected via a camera during the EEG measurement and later added manually to the file after analyzing the video frames. 

A label '1' indicates the eye-closed and '0' the eye-open state.

(*Source: Oliver Roesler, Stuttgart, Germany*)

First I will load the train and test sets, as well as the modules I will use to visualize the data.

In [0]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


trainset = pd.read_csv("https://raw.githubusercontent.com/sdgroeve/Machine_Learning_course_UGent_D012554_kaggle/master/eeg_train.csv")

testset = pd.read_csv("https://raw.githubusercontent.com/sdgroeve/Machine_Learning_course_UGent_D012554_kaggle/master/eeg_test.csv")



First I want to take a look at the data.

In [3]:
print(trainset)
print(testset)

          AF3       F7       F3      FC5  ...       F4       F8      AF4  label
0     4299.49  3997.44  4277.95  4116.92  ...  4278.97  4600.00  4369.23      1
1     4302.05  3985.64  4261.03  4129.74  ...  4283.08  4607.18  4358.46      0
2     4321.03  4015.90  4265.13  4122.56  ...  4286.15  4608.21  4371.79      0
3     4408.21  4104.10  4380.00  4232.31  ...  4388.21  4715.90  4464.10      0
4     4347.18  3975.38  4266.67  4102.56  ...  4313.33  4664.10  4411.79      1
...       ...      ...      ...      ...  ...      ...      ...      ...    ...
1995  4211.79  4015.90  4230.26  4107.69  ...  4240.51  4544.62  4265.13      1
1996  4268.72  4035.38  4237.95  4112.82  ...  4250.77  4586.67  4321.54      0
1997  4287.69  4007.69  4267.18  4128.21  ...  4260.51  4597.44  4353.33      0
1998  4297.95  4031.79  4275.90  4147.69  ...  4279.49  4604.10  4340.51      0
1999  4303.08  4010.26  4270.77  4148.21  ...  4281.54  4626.67  4349.23      0

[2000 rows x 15 columns]
           AF3

In [7]:
print(trainset[trainset.label == 1].label.count())
print(trainset[trainset.label == 0].label.count())

901
1099


Here we see that 901 time points were captured while the subject's eyes were closed (label 1) and 1099 time points while the subject's eyes were open (label 0).
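
The same counts, together with the class proportions, can be obtained in one call with `value_counts`. This is a minimal sketch, assuming a stand-in Series named `labels` that is rebuilt from the two counts printed above rather than read from the real label column:

```python
import pandas as pd

# Stand-in for the label column: 901 closed-eye (1) and 1099 open-eye (0) rows,
# rebuilt from the counts printed above (not the real data; order is arbitrary).
labels = pd.Series([1] * 901 + [0] * 1099, name="label")

counts = labels.value_counts()                     # absolute count per class
proportions = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(proportions.round(4))
```

With roughly 45% closed-eye and 55% open-eye samples, the two classes are fairly balanced.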

Let's now remove the label column from the train set and the index column from the test set.

In [8]:
train_label = trainset.pop('label')
testset.pop('index')
print(trainset)
print(testset)

          AF3       F7       F3      FC5  ...      FC6       F4       F8      AF4
0     4299.49  3997.44  4277.95  4116.92  ...  4211.79  4278.97  4600.00  4369.23
1     4302.05  3985.64  4261.03  4129.74  ...  4195.90  4283.08  4607.18  4358.46
2     4321.03  4015.90  4265.13  4122.56  ...  4155.38  4286.15  4608.21  4371.79
3     4408.21  4104.10  4380.00  4232.31  ...  4319.49  4388.21  4715.90  4464.10
4     4347.18  3975.38  4266.67  4102.56  ...  4248.21  4313.33  4664.10  4411.79
...       ...      ...      ...      ...  ...      ...      ...      ...      ...
1995  4211.79  4015.90  4230.26  4107.69  ...  4168.72  4240.51  4544.62  4265.13
1996  4268.72  4035.38  4237.95  4112.82  ...  4189.74  4250.77  4586.67  4321.54
1997  4287.69  4007.69  4267.18  4128.21  ...  4192.82  4260.51  4597.44  4353.33
1998  4297.95  4031.79  4275.90  4147.69  ...  4210.77  4279.49  4604.10  4340.51
1999  4303.08  4010.26  4270.77  4148.21  ...  4217.44  4281.54  4626.67  4349.23

[2000 rows x 14

Now we can see that both the train and test sets contain only the feature columns.
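
Note that `pop` modifies the DataFrame in place and returns the removed column, which is why the labels remain available in `train_label` afterwards. A small sketch on a hypothetical two-row frame named `toy` (values copied from the first rows printed above, purely for illustration):

```python
import pandas as pd

# Hypothetical toy frame; the real trainset has 14 channels plus the label.
toy = pd.DataFrame({
    "AF3": [4299.49, 4302.05],
    "F7": [3997.44, 3985.64],
    "label": [1, 0],
})

# pop removes the column from `toy` in place and returns it as a Series.
toy_label = toy.pop("label")

print(list(toy.columns))   # remaining feature columns
print(toy_label.tolist())  # the popped labels
```

If the original frame should stay untouched, `toy.drop(columns="label")` returns a copy without the column instead.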



In [0]:
#visualize data before scaling ? 


The data is not centered around 0, so the next step is to center it. All features also need to be on the same scale, so each feature is standardized using statistics computed on the train set only.
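
Standardizing means subtracting each feature's mean and dividing by its standard deviation. The following is a by-hand sketch with NumPy on hypothetical toy arrays `train` and `test` (two channels, values borrowed from rows printed earlier); it illustrates the idea and is not a replacement for a library scaler:

```python
import numpy as np

# Hypothetical toy data: two channels (AF3, F7) from a few rows shown earlier.
train = np.array([[4299.49, 3997.44],
                  [4302.05, 3985.64],
                  [4321.03, 4015.90],
                  [4408.21, 4104.10]])
test = np.array([[4347.18, 3975.38]])

# The statistics are computed on the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(train_scaled.std(axis=0))   # approximately 1 per column
```

Reusing the training mean and standard deviation for the test rows keeps information from the test set out of the preprocessing step. The scaled test columns will therefore generally not have exactly zero mean or unit variance.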

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(trainset)

trainset_scaled = scaler.transform(trainset)

testset_scaled = scaler.transform(testset)


trainset_scaled = pd.DataFrame(trainset_scaled ,columns= trainset.columns )
testset_scaled = pd.DataFrame(testset_scaled ,columns= testset.columns )
print(trainset_scaled)
print(testset_scaled)


           AF3        F7        F3  ...        F4        F8       AF4
0    -0.018351 -0.396476  0.677912  ...  0.026714 -0.156365  0.252992
1     0.052070 -0.791842 -0.136210  ...  0.235973  0.060820 -0.037576
2     0.574178  0.222036  0.061066  ...  0.392280  0.091976  0.322059
3     2.972354  3.177226  5.588146  ...  5.588622  3.349453  2.812530
4     1.293521 -1.135609  0.135164  ...  1.776139  1.782573  1.401236
...        ...       ...       ...  ...       ...       ...       ...
1995 -2.430832  0.222036 -1.616738  ... -1.931461 -1.831536 -2.555566
1996 -0.864783  0.874724 -1.246726  ... -1.409078 -0.559580 -1.033657
1997 -0.342950 -0.053044  0.159704  ... -0.913170 -0.233802 -0.175981
1998 -0.060714  0.754439  0.579275  ...  0.053189 -0.032346 -0.521857
1999  0.080403  0.033065  0.332440  ...  0.157564  0.650366 -0.286596

[2000 rows x 14 columns]
            AF3        F7        F3  ...        F4        F8       AF4
0     -0.103077  1.046608 -0.506702  ... -0.495161 -0.202948 -0