### Brief Note on State of Notebook  
  
It's a bit disorganized right now because I've been working out what is in each dataset. So far I know that there are about 17k unique EEG IDs. Each corresponds to a parquet file named 'idnumber.parquet' in the train_eegs folder. Each EEG has a number of sub IDs, which are 50-second subsamples of the whole recording. The number of subsamples varies from EEG to EEG, and they overlap with one another to varying degrees. The test EEG is a single 50-second recording, so I'm thinking that ultimately I'll want to pull a set of 50-second recordings to work with for the classification problem. Because of the overlap, I think I'll want to either combine the heavily overlapping recordings in some way or just choose one of them to work with.  
  
The EEG recordings contain a varying number of rows depending on the length of the recording (200 rows of data per second). Each contains 20 columns: 19 correspond to electrode locations and the last one is an EKG channel. So the features in the EEG data are the readings taken every 1/200th of a second at a specific electrode location, plus the EKG reading. This is all numerical data.  
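
Pulling one 50-second subsample out of a full recording could look roughly like the sketch below. It assumes `eeg_label_offset_seconds` marks where the window starts, measured from the beginning of the recording; I haven't confirmed that against the data description, but it fits the first recording loaded in this notebook (about 18,000 rows, so 90 seconds, with a largest offset of 40). The function name and the synthetic frame are made up for illustration.

```python
import numpy as np
import pandas as pd

SAMPLING_RATE_HZ = 200   # rows per second, as noted above
WINDOW_SECONDS = 50      # length of each labeled subsample

def extract_window(eeg, offset_seconds, fs=SAMPLING_RATE_HZ, window_seconds=WINDOW_SECONDS):
    """Return the rows of `eeg` covering one labeled window.

    Assumes the offset is the window start in seconds from the start of the recording.
    """
    start = int(offset_seconds * fs)
    stop = start + window_seconds * fs
    return eeg.iloc[start:stop]

# Synthetic stand-in for one recording: 90 seconds, two channels.
demo_eeg = pd.DataFrame(np.zeros((90 * SAMPLING_RATE_HZ, 2)), columns=["Fp1", "EKG"])
window = extract_window(demo_eeg, offset_seconds=6.0)
print(window.shape)
```

With the real data, `eeg` would be the frame returned by `pd.read_parquet` and the offset would come from the matching row of train.csv.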
  
The target variable is categorical and multiclass. There are 6 categories: Seizure, LPD, LRDA, GPD, GRDA, and Other. The linked paper seems to suggest that the most harmful activity being studied, the main activity, is seizure. The others are associated with seizures to varying degrees: LPD is most associated, GRDA is least associated, and LRDA and GPD are in the middle. Other covers everything else. I'm currently thinking that combining the middle two might make sense because they're both intermediate associations, but I'll need to do exploratory data analysis first.  
  
Understanding the 'other' category is also important. The paper lays out clearly the importance of the first 5 categories. 'Other' might show too much variance to be useful in identifying harmful brain activity, but that's something to figure out while doing exploratory data analysis.  
  
The last decision I can think of right now is how to encode the target variable. Label encoding may make some sense. There's the problem of models treating label-encoded variables as numerical data, but that may be fine in this instance since each category corresponds to some level of severity or association with the most harmful brain activity being studied (seizures). The categories would be ranked from least to most associated with seizures (0 - GRDA, 1 - combination of LRDA and GPD, 2 - LPD, 3 - Seizure). The problem here is that it isn't clear how to rank the 'Other' category. This would depend on the variance in that data. If there's a lot, maybe it makes sense to use a clustering algorithm and turn each cluster into its own category, then add each one to the ranking based on its association with seizures.
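
A minimal sketch of that ranking, assuming LRDA and GPD share one rank and 'Other' is left unranked until I understand it better. The mapping name and helper are hypothetical, not something from the competition.

```python
import pandas as pd

# Hypothetical ranking taken from the note above; 'Other' is deliberately left out.
SEVERITY_RANK = {"GRDA": 0, "LRDA": 1, "GPD": 1, "LPD": 2, "Seizure": 3}

def encode_consensus(labels):
    """Map consensus labels to ordinal ranks; unranked labels become <NA>."""
    return labels.map(SEVERITY_RANK).astype("Int64")

demo_labels = pd.Series(["Seizure", "GPD", "LRDA", "Other", "GRDA", "LPD"])
encoded = encode_consensus(demo_labels)
print(encoded)
```

Using the nullable `Int64` dtype keeps the ranks as integers while leaving the 'Other' rows visibly missing rather than silently assigning them a position.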

### Notes on Next Steps  
  
The first thing to do is check for missing data. I think this will be an ongoing process. Based on what I've looked at so far, each of the 17k EEG parquets has more than 10k rows, so working through all of them up front would probably take too long. Maybe there's an easy way of doing this that I'm not thinking of, though. I will check train.csv for missing data now, and check each parquet as I make use of it.  
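
One way to do the per-file check as I go is a small helper like the sketch below. The helper name and the synthetic frame are made up; with the real data the input would be train.csv or whichever parquet I've just loaded.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Count missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]

# Synthetic stand-in for one EEG parquet; in the notebook this would be
# the frame returned by pd.read_parquet for the file being used.
demo = pd.DataFrame({"Fp1": [1.0, np.nan, 3.0], "F3": [0.5, 0.2, 0.1]})
report = missing_report(demo)
print(report)
```

An empty result means the frame has no missing values, so the check can be run right after each load without cluttering the output.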
  
I also need to figure out the Spectrogram data. Based on the data overview from Kaggle, I would have thought there was one Spectrogram for every EEG. There are about 6k fewer Spectrograms than EEGs, however. The Spectrogram parquet for testing has 300 rows corresponding to time points. It has 401 columns, one of which is a time column that can be used as the index for the dataset.  
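
As a starting point for making sense of the 400 non-time columns, the sketch below indexes a spectrogram by its time column and groups the remaining columns by the prefix before the underscore (for example `LL` or `RP`). That the prefix identifies a group of related columns is my assumption from the column names; the helper name is hypothetical and the demo frame uses only a few columns.

```python
import pandas as pd

def split_spectrogram(spec):
    """Index by time and group the remaining columns by their name prefix."""
    indexed = spec.set_index("time")
    prefixes = indexed.columns.str.split("_").str[0]
    return {p: indexed.loc[:, prefixes == p] for p in prefixes.unique()}

# Tiny stand-in with the same column naming pattern as the test spectrogram.
demo_spec = pd.DataFrame({
    "time": [1, 3, 5],
    "LL_0.59": [14.91, 11.13, 10.88],
    "LL_0.78": [17.11, 10.95, 10.57],
    "RP_19.92": [0.05, 0.02, 0.06],
})
groups = split_spectrogram(demo_spec)
print({name: frame.shape for name, frame in groups.items()})
```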
  
After this, the plan is to take a set of EEGs for each target class and use them to do exploratory data analysis to get a better understanding of the EEG data. I'll also want to consider methods for scaling these features. They're all numerical and don't all appear to be on the same scale. Then I'll want to do the same with the Spectrogram data.  
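
One candidate for scaling is to standardize each channel within a recording. This is only a sketch of one option (z-scoring), not a decision. The helper name is made up and the data is synthetic.

```python
import numpy as np
import pandas as pd

def zscore_channels(eeg):
    """Standardize each column to zero mean and unit variance."""
    std = eeg.std(ddof=0).replace(0, 1.0)  # guard against flat channels
    return (eeg - eeg.mean()) / std

# Two synthetic channels on clearly different scales.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Fp1": rng.normal(-80, 10, 1000),
    "EKG": rng.normal(20, 60, 1000),
})
scaled = zscore_channels(demo)
print(scaled.describe().loc[["mean", "std"]])
```

Scaling within each recording removes differences in baseline between recordings, which may or may not be desirable if absolute magnitude turns out to matter for telling the classes apart.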
  
To obtain the sets of EEGs and Spectrograms I'll use for exploratory data analysis, the plan is to create two new columns in train.csv. The first will be for the variability in votes. A varying number of experts (anywhere from 1 to 28, based on the total votes computed below) voted for each row. They picked from the 6 categories mentioned. The consensus choice has its own column in train.csv. I will choose EEGs and Spectrograms from rows where there wasn't disagreement about the brain activity in that row. The second column will show how many experts voted. Because some rows only have 3 total votes, it's important to choose from rows with more votes so that the consensus is more meaningful, which should make the exploratory data analysis more informative. I will choose EEGs and Spectrograms from rows that have a higher number of votes (thinking 10 votes minimum).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
training = pd.read_csv('train.csv')

In [3]:
training

Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106795,351917269,6,12.0,2147388374,6,12.0,4195677307,10351,LRDA,0,0,0,3,0,0
106796,351917269,7,14.0,2147388374,7,14.0,290896675,10351,LRDA,0,0,0,3,0,0
106797,351917269,8,16.0,2147388374,8,16.0,461435451,10351,LRDA,0,0,0,3,0,0
106798,351917269,9,18.0,2147388374,9,18.0,3786213131,10351,LRDA,0,0,0,3,0,0


In [4]:
np.where(training.eeg_id == 1628180742)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),)

In [5]:
eeg0742 = training[training['eeg_id'] == 1628180742]

In [6]:
eeg0742

Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0
5,1628180742,5,26.0,353733,5,26.0,2413091605,42516,Seizure,3,0,0,0,0,0
6,1628180742,6,30.0,353733,6,30.0,364593930,42516,Seizure,3,0,0,0,0,0
7,1628180742,7,36.0,353733,7,36.0,3811483573,42516,Seizure,3,0,0,0,0,0
8,1628180742,8,40.0,353733,8,40.0,3388718494,42516,Seizure,3,0,0,0,0,0


In [7]:
np.where(training.eeg_id == 351917269)

(array([106789, 106790, 106791, 106792, 106793, 106794, 106795, 106796,
        106797, 106798, 106799]),)

In [8]:
train_eeg1 = pd.read_parquet('train_eegs/1628180742.parquet', engine = 'pyarrow')

In [9]:
train_eeg2 = pd.read_parquet('train_eegs/351917269.parquet', engine = 'pyarrow')

In [10]:
train_eeg1

Unnamed: 0,Fp1,F3,C3,P3,F7,T3,T5,O1,Fz,Cz,Pz,Fp2,F4,C4,P4,F8,T4,T6,O2,EKG
0,-80.519997,-70.540001,-80.110001,-108.750000,-120.330002,-88.620003,-101.750000,-104.489998,-99.129997,-90.389999,-97.040001,-77.989998,-88.830002,-112.120003,-108.110001,-95.949997,-98.360001,-121.730003,-106.449997,7.920000
1,-80.449997,-70.330002,-81.760002,-107.669998,-120.769997,-90.820000,-104.260002,-99.730003,-99.070000,-92.290001,-96.019997,-84.500000,-84.989998,-115.610001,-103.860001,-97.470001,-89.290001,-115.500000,-102.059998,29.219999
2,-80.209999,-75.870003,-82.050003,-106.010002,-117.500000,-87.489998,-99.589996,-96.820000,-119.680000,-99.360001,-91.110001,-99.440002,-104.589996,-127.529999,-113.349998,-95.870003,-96.019997,-123.879997,-105.790001,45.740002
3,-84.709999,-75.339996,-87.480003,-108.970001,-121.410004,-94.750000,-105.370003,-100.279999,-113.839996,-102.059998,-95.040001,-99.230003,-101.220001,-125.769997,-111.889999,-97.459999,-97.180000,-128.940002,-109.889999,83.870003
4,-90.570000,-80.790001,-93.000000,-113.870003,-129.960007,-102.860001,-118.599998,-101.099998,-107.660004,-102.339996,-98.510002,-95.300003,-88.930000,-115.639999,-99.800003,-97.500000,-88.730003,-114.849998,-100.250000,97.769997
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17995,-144.660004,-147.809998,-129.820007,-129.460007,-157.509995,-124.000000,-124.570000,-94.820000,-153.070007,-121.110001,-86.459999,-132.520004,-138.339996,-128.970001,-71.300003,-114.480003,-86.709999,-114.959999,-81.500000,-20.070000
17996,-140.880005,-153.000000,-129.529999,-129.020004,-154.059998,-131.220001,-128.380005,-95.000000,-140.820007,-114.639999,-84.379997,-115.339996,-119.230003,-114.709999,-70.989998,-92.129997,-79.639999,-116.139999,-81.879997,10.600000
17997,-133.729996,-141.770004,-121.900002,-122.370003,-158.750000,-123.550003,-127.730003,-93.089996,-125.230003,-106.489998,-83.419998,-112.720001,-103.209999,-107.629997,-61.869999,-97.910004,-77.150002,-106.500000,-75.339996,-2.060000
17998,-141.449997,-151.139999,-127.190002,-128.699997,-163.460007,-124.309998,-129.479996,-94.419998,-140.869995,-113.339996,-83.519997,-129.300003,-118.650002,-117.589996,-71.879997,-99.279999,-83.900002,-116.160004,-81.410004,2.820000


In [11]:
train_eeg2

Unnamed: 0,Fp1,F3,C3,P3,F7,T3,T5,O1,Fz,Cz,Pz,Fp2,F4,C4,P4,F8,T4,T6,O2,EKG
0,-38.000000,-8.800000,24.100000,11.690000,17.990000,-2.390000,17.200001,17.219999,-3.46,-25.950001,9.580000,31.17,-11.100000,-14.020000,-3.050000,15.62,21.629999,4.660000,36.209999,-29.480000
1,-51.529999,-11.350000,16.920000,3.610000,4.730000,-13.800000,8.150000,10.820000,-12.95,-34.349998,1.710000,24.07,-20.809999,-27.340000,-9.220000,11.79,20.219999,0.900000,28.660000,-15.290000
2,-32.509998,-5.500000,26.129999,11.380000,17.299999,-4.270000,16.370001,20.950001,-3.77,-26.650000,9.560000,31.00,-12.590000,-13.730000,1.550000,16.93,31.530001,14.190000,38.570000,-6.360000
3,-33.270000,-1.550000,29.180000,12.330000,18.080000,-5.700000,15.220000,16.760000,0.23,-24.469999,9.370000,29.66,-14.300000,-13.040000,-1.440000,18.75,28.129999,9.370000,38.840000,-26.020000
4,-47.459999,-14.740000,15.630000,-1.720000,2.270000,-19.090000,-2.430000,-0.940000,-10.85,-35.840000,-6.060000,18.01,-26.959999,-31.260000,-19.129999,-2.99,9.560000,-9.150000,17.070000,-24.980000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13995,-80.480003,-32.880001,1.500000,-20.059999,-31.540001,-37.029999,-25.830000,-39.220001,3.95,-31.219999,-29.379999,6.05,-14.700000,-38.049999,-47.720001,-5.20,-9.880000,-53.930000,-35.430000,-10.770000
13996,-87.220001,-44.430000,-10.560000,-32.639999,-36.230000,-47.389999,-39.779999,-51.470001,-0.52,-38.410000,-39.230000,-1.53,-21.820000,-44.450001,-55.279999,-9.50,-16.420000,-60.730000,-43.169998,-24.580000
13997,-96.750000,-54.709999,-21.059999,-42.419998,-46.310001,-58.959999,-49.599998,-60.509998,-7.53,-44.459999,-47.759998,-4.17,-26.530001,-48.189999,-60.580002,-16.32,-23.240000,-65.160004,-52.930000,-7.390000
13998,-83.529999,-40.419998,-7.320000,-30.000000,-34.970001,-45.669998,-36.279999,-47.360001,4.54,-35.889999,-36.770000,9.74,-15.960000,-43.250000,-49.930000,-3.53,-5.770000,-52.869999,-40.290001,-2.190000


In [12]:
np.unique(training.patient_id).shape

(1950,)

In [13]:
np.unique(training.expert_consensus)

array(['GPD', 'GRDA', 'LPD', 'LRDA', 'Other', 'Seizure'], dtype=object)

In [14]:
training.lpd_vote.value_counts()

lpd_vote
0     77675
1      9680
2      4618
3      4011
4      2290
5      1323
6      1065
7       863
13      769
14      739
10      629
8       616
12      589
9       574
15      557
11      545
17      120
16       92
18       45
Name: count, dtype: int64

New column for variability in votes to track where there's disagreement.

In [15]:
training.columns

Index(['eeg_id', 'eeg_sub_id', 'eeg_label_offset_seconds', 'spectrogram_id',
       'spectrogram_sub_id', 'spectrogram_label_offset_seconds', 'label_id',
       'patient_id', 'expert_consensus', 'seizure_vote', 'lpd_vote',
       'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote'],
      dtype='object')

In [16]:
np.unique(training.eeg_id).shape

(17089,)

In [17]:
np.unique(training.spectrogram_id).shape

(11138,)

In [18]:
import pyarrow, fastparquet

In [19]:
test_eeg_parquet = '3911565283.parquet'

In [20]:
test_eeg = pd.read_parquet('test_eegs/3911565283.parquet', engine = 'pyarrow')

In [21]:
test_eeg

Unnamed: 0,Fp1,F3,C3,P3,F7,T3,T5,O1,Fz,Cz,Pz,Fp2,F4,C4,P4,F8,T4,T6,O2,EKG
0,9.210000,-47.459999,15.100000,8.220000,-16.900000,-22.99,-25.820000,-10.090000,28.370001,-3.010000,-27.299999,101.040001,35.110001,14.540000,18.330000,28.540001,44.090000,69.650002,30.74,171.679993
1,-3.590000,-30.290001,32.380001,10.800000,-68.980003,-21.60,-15.080000,-9.210000,26.360001,-8.980000,-32.279999,95.800003,26.389999,4.820000,10.540000,20.559999,32.060001,59.439999,23.32,178.279999
2,-26.040001,-60.070000,2.370000,-10.150000,-34.689999,-31.40,-31.920000,-26.980000,-1.940000,-28.770000,-49.770000,73.449997,-3.680000,-17.320000,-16.150000,-8.270000,5.330000,45.180000,9.49,306.739990
3,-3.040000,-36.250000,29.559999,14.530000,-14.010000,-11.90,-14.230000,-6.310000,26.040001,-2.770000,-25.030001,91.010002,22.610001,6.900000,9.930000,15.480000,33.580002,69.620003,31.01,223.259995
4,-4.630000,-20.160000,25.190001,1.190000,-44.580002,-23.51,-30.709999,-17.600000,25.420000,-8.860000,-33.959999,89.449997,19.440001,-2.080000,6.110000,8.380000,24.180000,55.869999,19.91,170.759995
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,-26.889999,-45.480000,-17.250000,-23.570000,19.059999,-9.40,-27.120001,-21.580000,-75.760002,-65.800003,-88.790001,-30.090000,-49.830002,-75.339996,-61.139999,-71.889999,-53.299999,-8.130000,-12.38,-34.799999
9996,-24.049999,-41.689999,-13.450000,-26.219999,14.210000,0.02,-30.030001,-22.219999,-75.440002,-68.639999,-91.099998,-33.180000,-45.610001,-78.809998,-61.259998,-71.889999,-55.009998,-12.320000,-15.15,-27.799999
9997,-34.500000,-55.340000,-25.959999,-30.670000,8.890000,-9.74,-38.520000,-30.330000,-87.080002,-70.690002,-92.320000,-37.349998,-57.290001,-80.209999,-67.320000,-72.919998,-57.110001,-12.330000,-15.20,21.980000
9998,-16.110001,-35.980000,-8.570000,-12.020000,28.580000,5.45,-20.510000,-10.300000,-65.459999,-50.730000,-71.650002,-15.970000,-36.380001,-59.660000,-46.310001,-51.520000,-39.740002,6.770000,3.74,-5.800000


In [22]:
test_spectrogram = pd.read_parquet('test_spectrograms/853520.parquet', engine = 'pyarrow')

In [23]:
test_spectrogram

Unnamed: 0,time,LL_0.59,LL_0.78,LL_0.98,LL_1.17,LL_1.37,LL_1.56,LL_1.76,LL_1.95,LL_2.15,...,RP_18.16,RP_18.36,RP_18.55,RP_18.75,RP_18.95,RP_19.14,RP_19.34,RP_19.53,RP_19.73,RP_19.92
0,1,14.910000,17.110001,11.660000,11.73,6.08,4.54,4.31,3.38,2.05,...,0.07,0.06,0.05,0.06,0.05,0.05,0.06,0.05,0.04,0.05
1,3,11.130000,10.950000,10.770000,5.07,4.03,3.24,3.61,2.98,1.54,...,0.05,0.04,0.04,0.04,0.04,0.04,0.03,0.03,0.03,0.02
2,5,10.880000,10.570000,8.790000,5.33,2.44,1.48,1.83,0.99,0.89,...,0.04,0.04,0.04,0.03,0.03,0.04,0.04,0.05,0.06,0.06
3,7,19.450001,18.200001,17.719999,13.38,4.17,1.88,1.84,1.22,1.27,...,0.03,0.03,0.05,0.08,0.07,0.07,0.08,0.03,0.03,0.03
4,9,21.650000,22.530001,23.160000,17.00,7.19,3.89,3.65,2.72,2.35,...,0.04,0.04,0.05,0.05,0.06,0.05,0.05,0.05,0.04,0.03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,591,15.580000,18.209999,14.020000,15.96,4.36,4.98,2.68,2.22,2.03,...,0.48,0.59,0.59,0.73,0.44,0.41,0.56,0.60,0.61,0.60
296,593,17.209999,20.219999,20.889999,17.16,9.15,4.14,2.49,2.71,1.60,...,0.26,0.37,0.41,0.36,0.48,0.36,0.39,0.46,0.34,0.32
297,595,9.610000,13.320000,9.190000,11.50,8.11,5.53,5.57,3.69,3.19,...,0.58,0.37,0.17,0.14,0.13,0.30,0.36,0.39,0.56,0.29
298,597,8.430000,11.840000,13.640000,10.56,8.63,5.80,2.98,1.48,0.96,...,0.54,0.22,0.17,0.16,0.11,0.38,0.45,0.45,0.45,0.34


In [25]:
training

Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106795,351917269,6,12.0,2147388374,6,12.0,4195677307,10351,LRDA,0,0,0,3,0,0
106796,351917269,7,14.0,2147388374,7,14.0,290896675,10351,LRDA,0,0,0,3,0,0
106797,351917269,8,16.0,2147388374,8,16.0,461435451,10351,LRDA,0,0,0,3,0,0
106798,351917269,9,18.0,2147388374,9,18.0,3786213131,10351,LRDA,0,0,0,3,0,0


#### Creating Column for Total Number of Votes

In [26]:
training['total_votes'] = (training.seizure_vote + training.lpd_vote + training.gpd_vote 
                           + training.lrda_vote + training.grda_vote + training.other_vote)

In [27]:
training

Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote,total_votes
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0,3
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0,3
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0,3
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0,3
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106795,351917269,6,12.0,2147388374,6,12.0,4195677307,10351,LRDA,0,0,0,3,0,0,3
106796,351917269,7,14.0,2147388374,7,14.0,290896675,10351,LRDA,0,0,0,3,0,0,3
106797,351917269,8,16.0,2147388374,8,16.0,461435451,10351,LRDA,0,0,0,3,0,0,3
106798,351917269,9,18.0,2147388374,9,18.0,3786213131,10351,LRDA,0,0,0,3,0,0,3


In [28]:
np.unique(training.total_votes)

array([ 1,  2,  3,  4,  5,  6,  7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 24, 25, 26, 27, 28])

In [29]:
training.total_votes.value_counts()

total_votes
3     51867
15    10665
13     7525
16     5191
1      4360
12     4356
5      3974
14     3887
4      3451
11     2602
2      2316
18     1934
17     1445
10     1146
6       883
20      634
19      250
21      179
22       54
23       24
25       20
24       17
28        6
26        6
27        5
7         3
Name: count, dtype: int64

#### Creating Column for Variability in Voting

The category of activity associated with each row is determined by the votes. Not only is it important to emphasize rows where more experts voted so that the consensus is more meaningful, it's also important to emphasize rows where there was full agreement because that suggests less ambiguity about the nature of the activity shown in the EEG and Spectrogram. It's clearly more meaningful when 20 experts agree than when 3 do, and it's also probably more meaningful when 10 agree and 0 disagree than when 11 agree and 9 disagree.

These columns will allow me to select parquets of EEG and Spectrogram to do exploratory data analysis with in order to get a better understanding of the data.

In [30]:
training['vote_variability'] = 0

In [31]:
training

Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote,total_votes,vote_variability
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0,3,0
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0,3,0
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0,3,0
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0,3,0
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106795,351917269,6,12.0,2147388374,6,12.0,4195677307,10351,LRDA,0,0,0,3,0,0,3,0
106796,351917269,7,14.0,2147388374,7,14.0,290896675,10351,LRDA,0,0,0,3,0,0,3,0
106797,351917269,8,16.0,2147388374,8,16.0,461435451,10351,LRDA,0,0,0,3,0,0,3,0
106798,351917269,9,18.0,2147388374,9,18.0,3786213131,10351,LRDA,0,0,0,3,0,0,3,0


In [32]:
vote_df = training[['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']].copy()

In [33]:
vote_df

Unnamed: 0,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote
0,3,0,0,0,0,0
1,3,0,0,0,0,0
2,3,0,0,0,0,0
3,3,0,0,0,0,0
4,3,0,0,0,0,0
...,...,...,...,...,...,...
106795,0,0,0,3,0,0
106796,0,0,0,3,0,0
106797,0,0,0,3,0,0
106798,0,0,0,3,0,0


In [34]:
# One list of six vote counts per row, in the same column order as vote_df.
vote_df_rows = vote_df.to_numpy().tolist()

In [35]:
# Population variance of the six vote counts in each row, assigned as a whole
# column to avoid chained assignment (training.vote_variability[i] += ...).
training['vote_variability'] = np.var(vote_df.to_numpy(), axis=1)


In [36]:
training

Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote,total_votes,vote_variability
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0,3,1.25
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0,3,1.25
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0,3,1.25
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0,3,1.25
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0,3,1.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106795,351917269,6,12.0,2147388374,6,12.0,4195677307,10351,LRDA,0,0,0,3,0,0,3,1.25
106796,351917269,7,14.0,2147388374,7,14.0,290896675,10351,LRDA,0,0,0,3,0,0,3,1.25
106797,351917269,8,16.0,2147388374,8,16.0,461435451,10351,LRDA,0,0,0,3,0,0,3,1.25
106798,351917269,9,18.0,2147388374,9,18.0,3786213131,10351,LRDA,0,0,0,3,0,0,3,1.25


In [37]:
np.unique(training.vote_variability)

array([ 0.13888889,  0.13888889,  0.22222222,  0.22222222,  0.25      ,
        0.47222222,  0.55555556,  0.55555556,  0.55555556,  0.58333333,
        0.66666667,  0.80555556,  0.80555556,  0.88888889,  0.88888889,
        1.        ,  1.13888889,  1.13888889,  1.13888889,  1.22222222,
        1.22222222,  1.25      ,  1.33333333,  1.47222222,  1.47222222,
        1.47222222,  1.55555556,  1.55555556,  1.58333333,  1.66666667,
        1.80555556,  1.80555556,  1.80555556,  1.88888889,  1.91666667,
        2.        ,  2.13888889,  2.13888889,  2.22222222,  2.22222222,
        2.25      ,  2.33333333,  2.47222222,  2.47222222,  2.47222222,
        2.55555556,  2.55555556,  2.58333333,  2.66666667,  2.80555556,
        2.80555556,  2.80555556,  2.88888889,  2.88888889,  2.88888889,
        2.91666667,  3.        ,  3.13888889,  3.13888889,  3.13888889,
        3.22222222,  3.22222222,  3.22222222,  3.25      ,  3.33333333,
        3.47222222,  3.47222222,  3.47222222,  3.55555556,  3.55

In [38]:
test = [10, 0, 0, 0, 0, 0]
test2 = [3, 0, 0, 0, 0, 0]

In [39]:
np.var(test), np.var(test2)

(13.888888888888886, 1.25)

Variance isn't a good measure here. It's higher in the first case simply because 10 is a greater number than 3, even though in both cases there was absolute consensus among those who voted. I need to determine the best way to measure disagreement. One option is counting the non-zero vote columns in each row.
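
Another option that doesn't depend on how many experts voted is the share of votes that went to the top category. This is just a sketch of an alternative, not what the notebook uses; the function name is made up, and it assumes every row has at least one vote.

```python
import numpy as np

def agreement_fraction(votes):
    """Share of all votes that went to the most-voted category."""
    votes = np.asarray(votes, dtype=float)
    return votes.max() / votes.sum()  # assumes at least one vote was cast

unanimous_10 = [10, 0, 0, 0, 0, 0]
unanimous_3 = [3, 0, 0, 0, 0, 0]
split_11_9 = [11, 9, 0, 0, 0, 0]

for row in (unanimous_10, unanimous_3, split_11_9):
    print(row, agreement_fraction(row))
```

Both unanimous rows get the same score regardless of vote count, while the 11-to-9 split scores lower, which is the behavior the variance failed to give.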

In [40]:
training = training.drop(columns = 'vote_variability')

In [41]:
training

Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote,total_votes
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0,3
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0,3
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0,3
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0,3
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106795,351917269,6,12.0,2147388374,6,12.0,4195677307,10351,LRDA,0,0,0,3,0,0,3
106796,351917269,7,14.0,2147388374,7,14.0,290896675,10351,LRDA,0,0,0,3,0,0,3
106797,351917269,8,16.0,2147388374,8,16.0,461435451,10351,LRDA,0,0,0,3,0,0,3
106798,351917269,9,18.0,2147388374,9,18.0,3786213131,10351,LRDA,0,0,0,3,0,0,3


In [42]:
unique_votes = np.zeros(len(vote_df_rows))

In [43]:
len(unique_votes)

106800

In [44]:
# Count how many of the six categories received at least one vote in each row.
for i in range(len(vote_df_rows)):
    for j in range(len(vote_df_rows[0])):
        if vote_df_rows[i][j] > 0:
            unique_votes[i] += 1

In [45]:
len(unique_votes), max(unique_votes), min(unique_votes)

(106800, 6.0, 1.0)

In [46]:
training['unique_votes'] = unique_votes

In [47]:
training.unique_votes.value_counts()

unique_votes
1.0    51037
2.0    30288
3.0    17268
4.0     5938
5.0     2153
6.0      116
Name: count, dtype: int64

This is what I'm going with for now. It at least tells me that a little under half of the rows (51,037 of 106,800) had absolute consensus, and another 30,288 had votes split between only two categories. With the column for number of total votes I can narrow down this data further and select at random from that subset.
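
For reference, the same count can be computed without the nested loop. This is a sketch on a two-row made-up frame; `VOTE_COLS` and the function name are my own, and the result is an integer column rather than the floats produced by the loop above.

```python
import pandas as pd

VOTE_COLS = ["seizure_vote", "lpd_vote", "gpd_vote", "lrda_vote", "grda_vote", "other_vote"]

def count_voted_categories(df):
    """Number of categories that received at least one vote, per row."""
    return (df[VOTE_COLS] > 0).sum(axis=1)

demo = pd.DataFrame(
    [[3, 0, 0, 0, 0, 0], [0, 5, 2, 0, 0, 1]],
    columns=VOTE_COLS,
)
demo["unique_votes"] = count_voted_categories(demo)
demo["total_votes"] = demo[VOTE_COLS].sum(axis=1)
print(demo[["unique_votes", "total_votes"]])
```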

#### Exploratory Data Analysis

For the purpose of better understanding my features (the EEG features specifically for now), I'll choose 5 parquets for each brain activity. To narrow down the set of data being pulled from in order to better ensure that I'm doing exploratory data analysis with meaningful data (rows which are strongly associated with their assigned activity type so that there's less ambiguity about which category the row falls in), I'll pull only from rows with an absolute consensus (number of unique votes = 1) and at least 8 votes (number of total votes >= 8). Then I'll split that subset by category of brain activity. From those 6 remaining subsets, I'll pull 5 rows at random and use the EEG sub IDs to get the data for the corresponding 50 seconds of EEG recording.
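
The random pull per category could be sketched as below. It takes up to `n` rows from each `expert_consensus` group and falls back to the whole group when fewer rows are available. The function name, seed, and demo frame are made up for illustration.

```python
import pandas as pd

def sample_per_class(df, n=5, seed=0):
    """Draw up to n rows at random from each expert_consensus group."""
    parts = [
        group.sample(n=min(len(group), n), random_state=seed)
        for _, group in df.groupby("expert_consensus")
    ]
    return pd.concat(parts)

# Made-up frame: one class with plenty of rows, one with fewer than n.
demo = pd.DataFrame({
    "expert_consensus": ["Seizure"] * 8 + ["GRDA"] * 3,
    "eeg_sub_id": range(11),
})
picked = sample_per_class(demo)
print(picked["expert_consensus"].value_counts())
```

Because many rows in a group can come from the same EEG, sampling rows at random does not guarantee 5 different recordings per class; sampling distinct `eeg_id` values first would be one way around that.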

This should help me compare the resulting EEG data at each electrode location across levels of the target variable (the 6 categories of brain activity). I'll also want within-group correlations to see associations between electrode locations. Another thing to look for is how each location's activity is associated with each target class. Change in magnitude of activity will definitely be important for identifying types of brain activity, but location may be too. It's important to keep both in mind.
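
For the correlations between electrode locations, a first pass could be a plain Pearson correlation matrix over one 50-second window with the EKG column dropped. The sketch below uses synthetic channels and a made-up function name; it is one simple option rather than a settled method.

```python
import numpy as np
import pandas as pd

def electrode_correlations(eeg):
    """Pearson correlation between electrode channels, excluding EKG."""
    return eeg.drop(columns=["EKG"], errors="ignore").corr()

# Synthetic window: F3 is Fp1 plus a little noise, EKG is unrelated.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "Fp1": base,
    "F3": base + rng.normal(scale=0.1, size=500),
    "EKG": rng.normal(size=500),
})
corr = electrode_correlations(demo)
print(corr)
```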

### Splitting Data

In [50]:
initial_split = training[training['unique_votes'] == 1] # split by unique votes

In [52]:
initial_split.unique_votes.value_counts()

unique_votes
1.0    51037
Name: count, dtype: int64

##### Making Sure Each Target Class Has Enough To Pull From

In [55]:
initial_split.expert_consensus.value_counts()

expert_consensus
Seizure    18245
GRDA       11673
Other       7486
LRDA        5551
LPD         5062
GPD         3020
Name: count, dtype: int64

In [68]:
second_split = initial_split[initial_split['total_votes'] > 4] # split by total votes; looser than the >= 8 threshold described above

In [69]:
second_split

Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote,total_votes,unique_votes
83,1445780287,0,0.0,4004824,0,0.0,3042959589,22597,Other,0,0,0,0,0,17,17,1.0
84,1445780287,1,6.0,4004824,1,6.0,942569566,22597,Other,0,0,0,0,0,17,17,1.0
85,1445780287,2,16.0,4004824,2,16.0,3752799254,22597,Other,0,0,0,0,0,17,17,1.0
104,2559567335,0,0.0,5487370,0,0.0,1418039992,36059,Other,0,0,0,0,0,12,12,1.0
105,2559567335,1,4.0,5487370,1,4.0,2738835895,36059,Other,0,0,0,0,0,12,12,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106663,728496959,27,310.0,2145358771,31,590.0,1305786638,200,Seizure,5,0,0,0,0,0,5,1.0
106664,728496959,28,326.0,2145358771,32,606.0,853838430,200,Seizure,5,0,0,0,0,0,5,1.0
106776,2750557840,0,0.0,2146188334,0,0.0,83367578,21884,Other,0,0,0,0,0,13,13,1.0
106777,2750557840,1,4.0,2146188334,1,4.0,3168355352,21884,Other,0,0,0,0,0,13,13,1.0


##### Checking Target Class Value Counts

In [70]:
second_split.expert_consensus.value_counts()

expert_consensus
Other      1752
LPD        1544
GPD         678
Seizure     417
LRDA         68
GRDA         26
Name: count, dtype: int64

##### Quick Note

The first thing I notice about the value counts above is that LRDA, Seizure, and GRDA have very few rows that meet the split criteria I went with. The second thing I notice is that the category with the most rows meeting these criteria is Other, which is interesting.

### Getting Target Class Splits of Data

In [59]:
other_split = second_split[second_split['expert_consensus'] == 'Other']
lpd_split = second_split[second_split['expert_consensus'] == 'LPD']
gpd_split = second_split[second_split['expert_consensus'] == 'GPD']
lrda_split = second_split[second_split['expert_consensus'] == 'LRDA']
seizure_split = second_split[second_split['expert_consensus'] == 'Seizure']
grda_split = second_split[second_split['expert_consensus'] == 'GRDA']

In [64]:
lrda_split.eeg_id.value_counts()

eeg_id
2713014975    27
Name: count, dtype: int64

In [65]:
training.expert_consensus.value_counts()

expert_consensus
Seizure    20933
GRDA       18861
Other      18808
GPD        16702
LRDA       16640
LPD        14856
Name: count, dtype: int64

In [66]:
initial_split.expert_consensus.value_counts()

expert_consensus
Seizure    18245
GRDA       11673
Other       7486
LRDA        5551
LPD         5062
GPD         3020
Name: count, dtype: int64

In [67]:
second_split.expert_consensus.value_counts()

expert_consensus
Other      1698
LPD        1194
GPD         569
LRDA         27
Seizure      17
GRDA         14
Name: count, dtype: int64

##### Cosine Similarity