# Seizure Prediction Kaggle Competition

## Melbourne University, MathWorks, and NIH 

#### [ATOM] 30 May, 2017

# Epilepsy

Epilepsy is a neurological condition characterised by spontaneous seizures. It affects around 1% of the world's population.

Seizures are caused by abnormal synchronous activity in the brain.

Medication exists, but it causes uncomfortable side effects, and for 20 - 40% of individuals it is ineffective. Some people with epilepsy resort to surgery, though this does not always stop the seizures.

This leads to poor quality of life and anxiety. Driving, swimming, and many other everyday activities become difficult.

# Predicting epileptic seizures

If seizures could be predicted, quick-acting medication could be taken before a seizure, or dangerous activities could be avoided. Patients could be implanted with early warning systems.

Epilepsy is typically diagnosed using electroencephalography. EEG records electrical activity on the surface of the brain using electrodes placed on the scalp or directly on the cortex.

<img src="images/eeg.png" width="600px" />

# EEG data

<img src="images/epileptiform_eeg.png" width="800px" />

EEG is a great tool for seeing seizures as they happen. Abnormal activity begins long before any external symptoms appear.

# The competition

### Melbourne University AES / MathWorks / NIH Seizure Prediction

- Began 2 September 2016, ended 1 December 2016.
- Prizes : \$10k, \$6k, \$4k

<div align="center"><h3>Predict seizures in long-term human intracranial EEG recordings</h3></div>

# Technicalities

The **data** : ten-minute recordings of iEEG ( with a number of complications ).

The **task** : produce probabilities that these chunks are preictal or interictal.

The **metric** : area under the receiver operating characteristic curve ( AUC-ROC ).
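
As a rough illustration of the metric: AUC-ROC can be read as the probability that a randomly chosen preictal chunk gets a higher score than a randomly chosen interictal one. The sketch below computes it from ranks with NumPy. The function name `auc_roc` and the toy labels and scores are made up for illustration, not competition data.

```python
import numpy as np

def auc_roc(labels, scores):
    """AUC via the Mann-Whitney U statistic; tied scores share an average rank."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="stable")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=float)
    i = 0
    while i < len(scores):
        # find the end of the current run of tied scores
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # 1-based average rank
        i = j + 1
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

# 1 = preictal, 0 = interictal (toy values)
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Because only the ordering of the scores matters, the predicted probabilities do not need to be calibrated to do well on this metric.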

# The data

- three human participants
- 16 electrode intracranial EEG at 400 Hz
- recordings span months or years

- each hour is labelled *preictal*, *interictal*, *ictal*, or *postictal*
- preictal data come from the hour leading up to a seizure, ending five minutes before onset ( -1:05 to -0:05 )
- interictal data are at least four hours before or after any seizure
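
The timing rules above can be sketched as a small labelling function. This is a simplification under stated assumptions: times are in minutes, seizures are treated as instantaneous onsets, and ictal and postictal periods are lumped into `"unlabelled"`. The name `label_minute` is hypothetical.

```python
def label_minute(t_min, seizure_onsets_min):
    """Label a time point (in minutes) relative to a list of seizure onsets."""
    # preictal: between 65 and 5 minutes before some seizure onset
    for onset in seizure_onsets_min:
        if 5 <= onset - t_min <= 65:
            return "preictal"
    # interictal: at least four hours (240 minutes) away from every seizure
    if all(abs(onset - t_min) >= 240 for onset in seizure_onsets_min):
        return "interictal"
    return "unlabelled"

print(label_minute(970, [1000]))   # 30 minutes before onset
print(label_minute(500, [1000]))   # more than four hours before onset
```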

<img src="images/data_timing.png" width="600px" />

# Datapoints

- come in ten-minute chunks ( 240k points x 16 channels )
- full hours are given ( 6 x 10-minute chunks ) but only for training data
- are labelled by patient
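
The chunk sizes follow directly from the sampling rate. A quick check of the arithmetic (constant names are just for illustration):

```python
SAMPLE_RATE_HZ = 400
CHUNK_MINUTES = 10
N_CHANNELS = 16

samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_MINUTES * 60   # points per channel
values_per_chunk = samples_per_chunk * N_CHANNELS         # total values in one chunk
chunks_per_hour = 60 // CHUNK_MINUTES

print(samples_per_chunk, values_per_chunk, chunks_per_hour)
```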

### Problems with the data

- may contain *dropout*, times when the electrodes fail
- "may also contain artifacts such as large amplitude rapid signal transitions that can be removed from analysis."
- [data leak](https://www.kaggle.com/c/melbourne-university-seizure-prediction/discussion/24803) : "the signal data of 1_145_1.mat - 1_150_1.mat and 1_1129_0.mat - 1_1134_0.mat are virtually identical, there is just a little time shift. The problem is that label is different."
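
One way to quantify dropout is sketched below. It assumes that dropout shows up as time samples where every channel reads exactly zero, and that a chunk is a NumPy array of shape `(n_samples, n_channels)`; both are assumptions for this sketch, and the function names are hypothetical.

```python
import numpy as np

def dropout_fraction(chunk):
    """Fraction of time samples where all channels are exactly zero."""
    chunk = np.asarray(chunk)
    dropped = np.all(chunk == 0, axis=1)
    return float(dropped.mean())

def remove_dropout(chunk):
    """Return the chunk with all-zero time samples removed."""
    chunk = np.asarray(chunk)
    keep = ~np.all(chunk == 0, axis=1)
    return chunk[keep]

# toy chunk: 1000 samples, 16 channels, with a quarter of the samples zeroed
rng = np.random.default_rng(0)
toy = rng.normal(size=(1000, 16))
toy[100:350] = 0
print(dropout_fraction(toy), remove_dropout(toy).shape)
```

Removing samples breaks the continuity of the signal, so in practice one might instead skip segments whose dropout fraction is above some threshold.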

<img src="images/dropout.png" width="800px" />

# Data exploration

<img src="images/all_channels.png" width="800px" />

<img src="images/ps_mean_signals.png" width="800px" />

<img src="images/mean_across_channels.png" width="800px" />

# The winning team

### Team Not-So-Random-Anymore

This team consisted of four people. They submitted 260 entries, and finished with an AUC-ROC score of 0.80701.

See their [interview](http://blog.kaggle.com/2017/02/06/seizure-prediction-competition-first-place-winners-interview-team-not-so-random-anymore-andriy-alexandre-feng-gilberto/) and their [winning forum post](https://www.kaggle.com/c/melbourne-university-seizure-prediction/discussion/26310).

### For comparison...

My team came in 83rd ( top 20%, bronze ), with an AUC-ROC of 0.71441.

# Members of the team

**Alex**

- PhD in signal processing
- EEG / BCI specialist
- won 3 other EEG competitions on Kaggle...

**Gilberto**

- MSc Electrical Engineering and Telecoms

**Feng** 

- BSc Statistics
- at the time, doing MSc in Data Science

**Andriy**

- PhD in signal processing
- Electrical engineer, experience with biomedical signals

# The winning solution

Team members made their own models, with their own features. The winning solution is the ranked average of the predictions of eleven models, "to avoid overfitting". All models are patient-specific.
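
A minimal sketch of rank averaging, assuming "ranked average" means that each model's predictions are converted to ranks before being averaged. This is not the team's code; `rank_average` and the toy predictions are illustrative.

```python
import numpy as np

def rank_average(predictions):
    """predictions: (n_models, n_samples). Returns mean rank scaled to [0, 1]."""
    predictions = np.asarray(predictions, dtype=float)
    n_samples = predictions.shape[1]
    # argsort of argsort gives each sample's 0-based rank within its model
    # (ties are broken by position, which is a simplification)
    ranks = predictions.argsort(axis=1, kind="stable").argsort(axis=1, kind="stable")
    return ranks.mean(axis=0) / (n_samples - 1)

# two toy models whose outputs live on very different scales
model_a = [0.10, 0.90, 0.50]
model_b = [0.010, 0.020, 0.015]
print(rank_average([model_a, model_b]))
```

Since AUC depends only on ordering, averaging ranks keeps a model with badly scaled probabilities from dominating or vanishing in the ensemble.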

Their primary tools were Python, Matlab, and R. Key machine learning modules were scikit-learn and XGBoost.

They reviewed the top ten entries from the last two competitions for inspiration. They mostly trained a bunch of classifiers : XGBoost ( XGB ), support vector classifiers ( SVC ), *k*-nearest neighbours ( KNN ), and logistic regression ( LR ).

Their code is available [here](https://github.com/alexandrebarachant/kaggle-seizure-prediction-challenge-2016).

# Tackling the data

Their insight into the data :

> The dataset is small and noisy, and cross-validation is difficult. Diversity in the ensemble is the key. We opted for many simple and low performing models rather than hyperoptimising one good model.

This is classic ensembling theory.

# Models : Alex and Gilberto

- split data into 30 non-overlapping 20s segments
- for prediction, use the maximum probability of 20s segments
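
The segmenting and aggregation steps above can be sketched as follows, assuming a chunk is an array of shape `(n_samples, n_channels)` sampled at 400 Hz. The function names are hypothetical, and the per-segment probabilities would come from whichever classifier is used.

```python
import numpy as np

def split_into_segments(chunk, segment_seconds=20, sample_rate=400):
    """(n_samples, n_channels) -> (n_segments, segment_len, n_channels)."""
    chunk = np.asarray(chunk)
    seg_len = segment_seconds * sample_rate
    n_segments = chunk.shape[0] // seg_len
    trimmed = chunk[: n_segments * seg_len]  # drop any incomplete tail
    return trimmed.reshape(n_segments, seg_len, chunk.shape[1])

def chunk_probability(segment_probabilities):
    """Aggregate per-segment probabilities by taking the maximum."""
    return float(np.max(segment_probabilities))

chunk = np.zeros((240_000, 16))
print(split_into_segments(chunk).shape)      # 30 segments of 8000 samples
print(chunk_probability([0.1, 0.7, 0.3]))
```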

#### Model 1 : normalised log power in different frequency bands

- frequencies : 0.1 - 4 Hz; 4 - 8 Hz; 8 - 15 Hz; 15 - 30 Hz; 30 - 90 Hz; 90 - 170 Hz
- power spectral density was averaged within each band for each channel, then normalised by total power
- 6 frequency bands and 16 channels = 96 features
- XGBoost, ten bags
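
A rough sketch of this kind of feature, not the team's implementation. It assumes a plain FFT periodogram as the PSD estimate and takes "total power" to be the sum over the six bands; the real pipeline may differ on both points. `relative_log_band_power` is a hypothetical name.

```python
import numpy as np

BANDS_HZ = [(0.1, 4), (4, 8), (8, 15), (15, 30), (30, 90), (90, 170)]

def relative_log_band_power(segment, sample_rate=400, bands=BANDS_HZ):
    """segment: (n_samples, n_channels) -> vector of n_bands * n_channels features."""
    segment = np.asarray(segment, dtype=float)
    freqs = np.fft.rfftfreq(segment.shape[0], d=1.0 / sample_rate)
    psd = np.abs(np.fft.rfft(segment, axis=0)) ** 2  # unscaled periodogram
    # mean PSD inside each band, per channel -> (n_bands, n_channels)
    band_power = np.stack(
        [psd[(freqs >= lo) & (freqs < hi)].mean(axis=0) for lo, hi in bands]
    )
    relative = band_power / band_power.sum(axis=0, keepdims=True)
    return np.log10(relative).ravel()

# toy 20 s segment: a 10 Hz oscillation plus noise on 16 channels
rng = np.random.default_rng(0)
t = np.arange(8000) / 400.0
segment = np.sin(2 * np.pi * 10 * t)[:, None] + 0.1 * rng.normal(size=(8000, 16))
features = relative_log_band_power(segment)
print(features.shape)  # 6 bands x 16 channels
```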

#### Model 2 : relative log power, plus a bunch of signal statistics

- summary statistics : mean, min, max, var, percentiles
- autoregressive error coefficient, fractal dimensions, Hurst exponents
- 21 x 16 channels = 336 features
- XGBoost, five bags


#### Model 3 : autocorrelation in tangent space

- start with autocorrelation matrix for each channel
- project into Riemannian tangent space
- vectorise to produce a 36-long vector 
- 36 x 16 channels = 576 features
- XGBoost, four bags

#### Model 4 : cross-frequency coherence

- for the six frequency bands above, calculate cross-frequency coherence for all channels
- results in 6 x 6 coherence matrices
- project to tangent space and vectorise
- 21 x 16 = 336 features
- XGBoost, ten bags

# Models : Feng

- preprocessing : Butterworth bandpass filter 0.1 - 180 Hz
- split data into non-overlapping 30s windows
- arithmetic mean of individual windows used to aggregate into a probability score for 10-min segments

#### Features

- Features 1 : std and PSD averaged over 0.1 - 4 Hz, 4 - 8 Hz, 8 - 12 Hz, 12 - 30 Hz, 30 - 70 Hz, and 70 - 180 Hz
- Features 2 : time and frequency domain correlations and their eigenvalues.

#### Models

- Model 1 : XGBoost on F1
- Model 2 : *k*-nearest neighbours on F1
- Model 3 : *k*-nearest neighbours on F1 + F2
- Model 4 : logistic regression with L2 penalty on F1 + F2

A quick remark : a weighted blend of M1, M2, and M4 could reach an AUC above 0.8.

# Models : Andriy

- preprocessing : demeaning, bandpass filter 0.5 - 128 Hz, downsampling to 256 Hz
- split data into 30s segments
- max probability used for 10-min chunks

#### Features

- for each EEG channel, 111 features extracted from time, frequency, information-theoretic domains to capture energy, frequency, temporal and structural info as generic descriptions of the EEG signals
- peak frequency of spectrum, spectral edge frequency (80%, 90%, 95%), fine spectral log-filterbank energies in 2 Hz wide sub-bands (0-2 Hz, 1-3 Hz, ... 30-32 Hz), coarse log-filterbank energies in delta, theta, alpha, beta, gamma frequency bands, normalised FBE in those sub-bands, wavelet energy, curve length, number of maxima and minima, RMS amplitude, Hjorth parameters, zero crossings (raw epoch, Δ, ΔΔ), skewness, kurtosis, nonlinear energy, variance (Δ, ΔΔ), mean frequency, bandwidth, Shannon entropy, singular value decomposition entropy, Fisher information, spectral entropy, autoregressive modelling error (model order 1-9)
- these led to 111 x 16 = 1776 features in a concatenated feature vector
- additional multivariate features
- total features : 1956
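
A few of the simpler per-channel features named above can be sketched in NumPy using their textbook definitions. This is not Andriy's code, and the exact variants used in the winning solution may differ.

```python
import numpy as np

def hjorth_parameters(x):
    """Return (activity, mobility, complexity) of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)      # first difference (Δ)
    ddx = np.diff(dx)    # second difference (ΔΔ)
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / activity)
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

def zero_crossings(x):
    """Number of sign changes between consecutive samples."""
    signs = np.signbit(np.asarray(x, dtype=float))
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

def curve_length(x):
    """Sum of absolute differences between consecutive samples."""
    return float(np.sum(np.abs(np.diff(x))))

# toy channel: 10 s of a 5 Hz sine sampled at 400 Hz
t = np.arange(4000) / 400.0
x = np.sin(2 * np.pi * 5 * t)
print(hjorth_parameters(x))
print(zero_crossings([1, -1, 1, -1]), curve_length([0, 1, -1]))
```

For a pure sine the Hjorth complexity is close to 1, which makes it a handy sanity check.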

#### Models

- Model 1 : All features, bagged XGBoost
- Model 2 : Linear SVM with top 300 features
- Model 3 : GLM with top 200 features

<img src="images/model_correlations.png" width="600px" />

# Cross-validation

Ten-minute segments from the same hour are sequential, so they must be kept on the same side of any CV train-test split.
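
A minimal sketch of a grouped split that respects this constraint, assuming each chunk carries an identifier for the hour it came from. `hour_ids` and `grouped_folds` are hypothetical names, and this is not the team's CV procedure.

```python
import numpy as np

def grouped_folds(hour_ids, n_folds=2, seed=0):
    """Yield (train_idx, test_idx) with each hour entirely in train or in test."""
    hour_ids = np.asarray(hour_ids)
    rng = np.random.default_rng(seed)
    shuffled_hours = rng.permutation(np.unique(hour_ids))
    for test_hours in np.array_split(shuffled_hours, n_folds):
        test_mask = np.isin(hour_ids, test_hours)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# toy example: 10 hours, 6 ten-minute chunks each
hour_ids = np.repeat(np.arange(10), 6)
for train_idx, test_idx in grouped_folds(hour_ids, n_folds=2):
    print(len(train_idx), len(test_idx))
```

Splitting on hours rather than chunks stops near-identical neighbouring segments from leaking between train and test, which would otherwise inflate the CV score.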

Training and testing data were recorded at different times, and emulating this in CV is impossible. 

The leak ( and release of more data ) caused further problems with cross-validation.

60% of their time was spent on building a good CV procedure.

Because CV was hard, their approach was to extract as many features as possible, build as many models as possible, and pick the ones that were diverse and robust.

In the end, they had **2-fold** and **26-fold** cross-validation methods.

# Submission

> We believed the public leaderboard to be very overfitted (including our best public score) and that the private score will settle down around 0.8 AUC. We decided to minimize our risk by choosing as one of our final submissions a very conservative ensemble, tailored to be stable. It was a difficult choice, the public score of this ensemble was around 0.81 while our best score was 0.85. Luckily enough, our prediction was right and everyone went below 0.8 AUC on the private LB except for our stable ensemble that barely moved.

<img src="images/bios.png" width="900px" />

# Ensembling - a few resources

- [Ensembling](https://en.wikipedia.org/wiki/Ensemble_learning) on Wikipedia
- Excellent [Kaggle Ensembling Guide](https://mlwave.com/kaggle-ensembling-guide/)
- [Overview of ensemble methods](https://www.toptal.com/machine-learning/ensemble-methods-machine-learning)
- [Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) on Wikipedia
- [Boosting](https://en.wikipedia.org/wiki/Boosting_&#40;machine_learning&#41;) on Wikipedia