### Kaggle Link: https://www.kaggle.com/wanghaohan/confused-eeg

We collected EEG signal data from 10 college students while they watched MOOC video clips. 

We extracted online education videos that are assumed not to be confusing for college students, such as introductions to basic algebra or geometry. We also prepared videos that are expected to confuse a typical college student who is not familiar with the topic, such as Quantum Mechanics and Stem Cell Research. We prepared 20 videos, 10 in each category.


Each video was about 2 minutes long. We cut each two-minute clip from the middle of a topic to make the videos more confusing.

The students wore a single-channel wireless MindSet that measured activity over the frontal lobe. The MindSet measures the voltage between an electrode resting on the forehead and two electrodes (one ground and one reference) each in contact with an ear.


After each session, the student rated his/her confusion level on a scale of 1-7, where one corresponded to the least confusing and seven corresponded to the most confusing. These ratings are further normalized into binary labels indicating whether the student is confused or not. This label is offered as the self-labelled confusion, in addition to our predefined label of confusion.

**Content**

These data were collected from ten students, each watching ten videos. The 12000+ rows can therefore be seen as only 100 data points. Viewed this way, each data point consists of 120+ rows, sampled every 0.5 seconds (so each data point is a one-minute video). Signals with a higher sampling frequency are reported as the mean value over each 0.5-second interval.

EEG_data.csv: Contains the EEG data recorded from 10 students

demographic.csv: Contains demographic information for each student

video data: Each video is roughly two minutes long. We removed the first 30 seconds and the last 30 seconds, and only collected the EEG data during the middle 1 minute.
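The "one data point per recording" view described above can be sketched as a group-by over `SubjectID` and `VideoID`. This is a minimal sketch, not part of the original notebook: the helper `summarize_recordings`, the constant `SAMPLE_PERIOD_S`, and the synthetic `toy` frame are all hypothetical, and the 0.5-second sampling period is taken from the description above.

```python
import numpy as np
import pandas as pd

SAMPLE_PERIOD_S = 0.5  # assumed: one row every 0.5 seconds, per the description


def summarize_recordings(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse row-level samples into one row per (SubjectID, VideoID)."""
    summary = (
        df.groupby(["SubjectID", "VideoID"])
          .agg(nrows=("Attention", "size"),
               mean_attention=("Attention", "mean"))
          .reset_index()
    )
    # Recording length implied by the row count and the sampling period.
    summary["duration_s"] = summary["nrows"] * SAMPLE_PERIOD_S
    return summary


# Synthetic stand-in for EEG_data.csv: 2 students x 2 videos, 120 rows each.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "SubjectID": np.repeat([0.0, 0.0, 1.0, 1.0], 120),
    "VideoID": np.repeat([0.0, 1.0, 0.0, 1.0], 120),
    "Attention": rng.integers(0, 101, size=480).astype(float),
})

summary = summarize_recordings(toy)
print(summary)
```

With 120 rows per recording, `duration_s` comes out at 60 seconds, which matches the "middle 1 minute" described above.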

In [104]:
import warnings
warnings.filterwarnings('ignore')

In [105]:
import pandas as pd
data = pd.read_csv('./EEG_data.csv')

## columns needed; .copy() so that later column assignments do not act on a view of `data`
attention = data[['SubjectID','VideoID', 'Attention', 'predefinedlabel', 'user-definedlabeln']].copy()

In [116]:
data.columns

Index(['SubjectID', 'VideoID', 'Attention', 'Mediation', 'Raw', 'Delta',
       'Theta', 'Alpha1', 'Alpha2', 'Beta1', 'Beta2', 'Gamma1', 'Gamma2',
       'predefinedlabel', 'user-definedlabeln'],
      dtype='object')

In [106]:
table = attention.groupby(by=['SubjectID','VideoID']).agg(
    nrows=pd.NamedAgg(column="Attention", aggfunc="count"),
    predefined=pd.NamedAgg(column="predefinedlabel", aggfunc="sum"),
    user_label=pd.NamedAgg(column="user-definedlabeln", aggfunc="sum")).reset_index()

table['predefined'] = table['predefined']!=0
table['user_label'] = table['user_label']!=0

table

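| | SubjectID | VideoID | nrows | predefined | user_label |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 144 | False | False |
| 1 | 0.0 | 1.0 | 140 | False | True |
| 2 | 0.0 | 2.0 | 142 | False | True |
| 3 | 0.0 | 3.0 | 122 | False | False |
| 4 | 0.0 | 4.0 | 116 | False | False |
| ... | ... | ... | ... | ... | ... |
| 95 | 9.0 | 5.0 | 123 | True | True |
| 96 | 9.0 | 6.0 | 116 | True | False |
| 97 | 9.0 | 7.0 | 112 | True | False |
| 98 | 9.0 | 8.0 | 124 | True | True |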


### How well is pre-defined label working?

In [107]:
confused_count = {'student':[], 'predefined':[], 'user_label':[]}
for student in range(10):
    confused_count['student'].append(student)
    confused_count['predefined'].append(sum(table[table['SubjectID']==student]['predefined']))
    confused_count['user_label'].append(sum(table[table['SubjectID']==student]['user_label']))

pd.DataFrame(confused_count)

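| | student | predefined | user_label |
|---|---|---|---|
| 0 | 0 | 5 | 5 |
| 1 | 1 | 5 | 4 |
| 2 | 2 | 5 | 5 |
| 3 | 3 | 5 | 5 |
| 4 | 4 | 5 | 6 |
| 5 | 5 | 5 | 6 |
| 6 | 6 | 5 | 5 |
| 7 | 7 | 5 | 6 |
| 8 | 8 | 5 | 4 |
| 9 | 9 | 5 | 5 |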


### Question: 

How did they normalize the user labels? One possibility is a z-score (rescaling to mean 0 and standard deviation 1).
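To make the z-score guess concrete, here is a minimal sketch of what such a normalization could look like. Everything in it is an assumption: the dataset description does not say how the 1-7 ratings were binarized, the per-student grouping and the `z > 0` cut-off are illustrative choices, and the ratings below are made up rather than taken from the data.

```python
import pandas as pd

# Hypothetical 1-7 self-ratings for two students (NOT taken from the dataset).
ratings = pd.DataFrame({
    "SubjectID": [0, 0, 0, 0, 1, 1, 1, 1],
    "rating":    [1, 2, 6, 7, 3, 3, 5, 5],
})


def zscore(s: pd.Series) -> pd.Series:
    # Population standard deviation (ddof=0); assumes the ratings are not all equal.
    return (s - s.mean()) / s.std(ddof=0)


# Assumed scheme: standardize within each student, then call z > 0 "confused".
ratings["z"] = ratings.groupby("SubjectID")["rating"].transform(zscore)
ratings["confused"] = (ratings["z"] > 0).astype(int)
print(ratings)
```

Under this assumed scheme the same raw rating can map to different labels for different students, because each student is compared against his/her own mean.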


### to do: 

* look into the large pool of zeros (make some graphs)
* pick an attention threshold - hypothesis testing
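As a possible starting point for the first to-do item, the share of zero `Attention` readings can be computed per student before plotting anything. This is a rough sketch on made-up values; the helper name `zero_share_by_subject` is hypothetical.

```python
import pandas as pd


def zero_share_by_subject(df: pd.DataFrame, column: str = "Attention") -> pd.Series:
    """Fraction of rows per SubjectID where `column` is exactly 0."""
    return df[column].eq(0).groupby(df["SubjectID"]).mean()


# Made-up values for illustration only.
toy = pd.DataFrame({
    "SubjectID": [0, 0, 0, 0, 1, 1, 1, 1],
    "Attention": [56, 40, 0, 47, 0, 0, 0, 44],
})

shares = zero_share_by_subject(toy)
print(shares)
```

On the real data this would be called as `zero_share_by_subject(attention)`; a student with a very high share of zeros would be worth inspecting separately.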

In [108]:
pd.options.plotting.backend = "plotly"
import numpy as np

In [112]:
attention['unique_id'] = attention['SubjectID'].astype(str) + "-" + attention['VideoID'].astype(str)

In [113]:
## number the rows within each (subject, video) recording: 1, 2, 3, ...
## cumcount() is 0-based, so add 1 to start the counter at 1
attention['AttentionID'] = attention.groupby('unique_id').cumcount() + 1

In [122]:
attention.head()

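| | SubjectID | VideoID | Attention | predefinedlabel | user-definedlabeln | unique_id | AttentionID |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 56.0 | 0.0 | 0.0 | 0.0-0.0 | 1 |
| 1 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 0.0-0.0 | 2 |
| 2 | 0.0 | 0.0 | 47.0 | 0.0 | 0.0 | 0.0-0.0 | 3 |
| 3 | 0.0 | 0.0 | 47.0 | 0.0 | 0.0 | 0.0-0.0 | 4 |
| 4 | 0.0 | 0.0 | 44.0 | 0.0 | 0.0 | 0.0-0.0 | 5 |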


In [125]:
attention['Attention'].plot.hist()

In [134]:
attention[attention['SubjectID']==2].plot(y='Attention', x='AttentionID', color='VideoID')

In [141]:
## drop subject 6
new = attention[attention['SubjectID'] != 6.0]

##new['Attention'].plot.hist()

### T test - (user identified)

In [149]:
## user- identified 
confused_user = np.array(new[new['user-definedlabeln']==1.0]['Attention'])
confusednt_user = np.array(new[new['user-definedlabeln']==0.0]['Attention'])

In [152]:
from scipy.stats import ttest_ind

t,p = ttest_ind(confused_user, confusednt_user)

print(p)

7.485033298105577e-69


### T test - (pre-defined)

In [154]:
new.columns

Index(['SubjectID', 'VideoID', 'Attention', 'predefinedlabel',
       'user-definedlabeln', 'unique_id', 'AttentionID'],
      dtype='object')

In [156]:
## predefined
confused_pre = np.array(new[new['predefinedlabel']==1.0]['Attention'])
confusednt_pre = np.array(new[new['predefinedlabel']==0.0]['Attention'])

In [157]:
from scipy.stats import ttest_ind

t,p = ttest_ind(confused_pre, confusednt_pre)

print(p)

0.8596275911364174
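Both tests above treat every 0.5-second row as an independent observation, while the Content section notes that the data are effectively only 100 data points. One hedged alternative is to average `Attention` per (SubjectID, VideoID) first and test on those means. The sketch below runs on synthetic data and says nothing about the real result; the helper `recording_level_ttest` is hypothetical, and the choice of Welch's test (`equal_var=False`) is an assumption rather than the notebook's method.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def recording_level_ttest(df, label_col, value_col="Attention"):
    """Welch t-test on per-recording means instead of on raw rows."""
    per_recording = (
        df.groupby(["SubjectID", "VideoID"])
          .agg(value=(value_col, "mean"), label=(label_col, "max"))
    )
    confused = per_recording.loc[per_recording["label"] == 1, "value"]
    not_confused = per_recording.loc[per_recording["label"] == 0, "value"]
    # equal_var=False selects Welch's t-test (no equal-variance assumption).
    return ttest_ind(confused, not_confused, equal_var=False)


# Synthetic stand-in: 4 students x 4 videos, 30 rows each, label alternates by video.
rng = np.random.default_rng(0)
rows = []
for subject in range(4):
    for video in range(4):
        label = video % 2
        for value in rng.normal(50, 10, size=30):
            rows.append((subject, video, value, label))
toy = pd.DataFrame(rows, columns=["SubjectID", "VideoID", "Attention", "predefinedlabel"])

t_stat, p_value = recording_level_ttest(toy, "predefinedlabel")
print(t_stat, p_value)
```

On the notebook's data the call would be `recording_level_ttest(new, 'user-definedlabeln')` or `recording_level_ttest(new, 'predefinedlabel')`. Each group then holds at most a few dozen values, so p-values are likely to be far less extreme than in the row-level test.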


In [165]:
## check
##(len(confused_user), len(confusednt_user))
##(len(confused_pre), len(confusednt_pre))