Submission for Kaggle's American Epilepsy Society Seizure Prediction Challenge

hail-seizure

http://www.kaggle.com/c/seizure-detection

This README and repository are modelled on https://www.kaggle.com/wiki/ModelSubmissionBestPractices

Hardware / OS platform used

  • Various servers owned by Edinburgh University Informatics Department:
    • 64 AMD Opteron cores, 256GB RAM, 4TB disk
    • Scientific Linux
  • Various mid-high end desktops and laptops:
    • Intel processors (i3 and Xeons), 8-64GB RAM, 0.5-8TB disk
    • Arch Linux

Dependencies

Required

  • MATLAB or Octave
  • Python 3.4.1
    • scikit_learn-0.15.2
    • numpy-1.8.1
    • scipy
    • h5py

Generate features

Add the path to the raw data, organised by subject as shown here, to the RAW_DATA_DIRS key of SETTINGS.json, and check the other values in SETTINGS.json:

RAW_DATA_DIR/
  Dog_1/
    Dog_1_ictal_segment_1.mat
    Dog_1_ictal_segment_2.mat
    ...
    Dog_1_interictal_segment_1.mat
    Dog_1_interictal_segment_2.mat
    ...
    Dog_1_test_segment_1.mat
    Dog_1_test_segment_2.mat
    ...

  Dog_2/
  ...
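
As a quick sanity check on the layout above, the file names can be parsed into subject, segment type and index. This is a minimal sketch, not part of the repository: the regular expression is inferred only from the example names in the tree, and `parse_segment_name` and `group_segments` are hypothetical helpers.

```python
import re
from collections import defaultdict

# Pattern inferred from the example tree: <subject>_<type>_segment_<n>.mat
# (an assumption; adjust it if the real file names differ).
SEGMENT_RE = re.compile(
    r"^(?P<subject>[A-Za-z]+_\d+)_(?P<type>[a-z]+)_segment_(?P<index>\d+)\.mat$"
)


def parse_segment_name(filename):
    """Return (subject, type, index), or None if the name does not match."""
    match = SEGMENT_RE.match(filename)
    if match is None:
        return None
    return match.group("subject"), match.group("type"), int(match.group("index"))


def group_segments(filenames):
    """Group file names as {subject: {type: [sorted segment indices]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for name in filenames:
        parsed = parse_segment_name(name)
        if parsed is None:
            continue  # ignore files that are not segments
        subject, seg_type, index = parsed
        grouped[subject][seg_type].append(index)
    return {s: {t: sorted(i) for t, i in types.items()}
            for s, types in grouped.items()}


example = [
    "Dog_1_ictal_segment_2.mat",
    "Dog_1_ictal_segment_1.mat",
    "Dog_1_test_segment_1.mat",
    "notes.txt",
]
layout = group_segments(example)
print(layout)
```

In practice the list of names would come from something like `os.listdir` on each subject directory.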

Then run ./preprocessing.m with:

matlab -nodisplay -nosplash -r "preprocessing"

or similar.

This will calculate features using the feature functions specified in the FEATURES field of SETTINGS.json and write them to the TRAIN_DATA_PATH directory as HDF5 files.

HDF5 structure:

$feature_name.h5 = {$subject: {$type : {$segment_file_name : $feature_vector } } }

  • $feature_name.h5: the file name, made up of the modification type, feature name and version number (e.g. raw_feat_var_v2.h5 or ica_feat_covar_v5.h5)
  • $type: data type e.g. 'preictal', 'interictal' or 'test'
  • $segment_file_name: the filename for the segment from which that vector was generated
  • $feature_vector: A 1xNxM feature vector for that segment using the specified feature function
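
The nesting described above can be walked with three loops. The sketch below uses a plain dictionary as a stand-in for an opened feature file, so it runs without any HDF5 library; `iter_feature_vectors`, `count_segments` and the toy values are illustrative, not code from the repository.

```python
def iter_feature_vectors(store):
    """Yield (subject, type, segment_file_name, vector) from a nested mapping."""
    for subject, types in store.items():
        for seg_type, segments in types.items():
            for segment_name, vector in segments.items():
                yield subject, seg_type, segment_name, vector


def count_segments(store):
    """Count segments per (subject, type) pair."""
    counts = {}
    for subject, seg_type, _name, _vector in iter_feature_vectors(store):
        key = (subject, seg_type)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Toy stand-in for one $feature_name.h5 file (values are made up).
toy_store = {
    "Dog_1": {
        "interictal": {
            "Dog_1_interictal_segment_1.mat": [[0.1, 0.2]],
            "Dog_1_interictal_segment_2.mat": [[0.3, 0.4]],
        },
        "test": {
            "Dog_1_test_segment_1.mat": [[0.5, 0.6]],
        },
    },
}

counts = count_segments(toy_store)
print(counts)
```

Because h5py `File` and `Group` objects also expose `.items()`, the same function should work on a file opened with `h5py.File(path, "r")`, assuming the file really has this layout; each dataset would then need `vector[()]` to be read into a NumPy array.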

Train classifier

One classifier is trained for each subject and serialised into the directory specified in SETTINGS.json under MODEL_PATH (default is model/).

This is achieved by running:

./train.py

The options for running alternative models are listed by the standard help interface:

./train.py -h

Cross validation

Cross validation is run as part of the train.py script. The AUC for each subject and over all subjects is calculated and saved. If the verbose option is set, the calculated values are also printed to the command line.
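
To make the two AUC figures concrete, here is a minimal pure-Python sketch. It is not the code in train.py: `roc_auc` is a hypothetical helper, the scores are made up, and pooling all subjects' predictions for the overall figure is an assumption about what "over all subjects" means.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as half)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up cross-validation outputs: subject -> (true labels, predicted scores).
by_subject = {
    "Dog_1": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "Dog_2": ([0, 1], [0.2, 0.9]),
}

per_subject_auc = {s: roc_auc(y, p) for s, (y, p) in by_subject.items()}

# Overall AUC, assuming predictions from all subjects are pooled together.
all_labels = [l for y, _ in by_subject.values() for l in y]
all_scores = [s for _, p in by_subject.values() for s in p]
overall_auc = roc_auc(all_labels, all_scores)

print(per_subject_auc, overall_auc)
```

The pairwise form is quadratic in the number of segments, which is fine for illustration; scikit-learn's `roc_auc_score` would be the usual choice in practice.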

Important note: cross validation folds are formed by splitting the data along the hours into which it is already divided. This matters because it respects the split between training and test data used for the leaderboard.
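
The idea of splitting by hour can be sketched as a grouped split in which every segment from a given hour lands on the same side of a fold. This is an illustration under stated assumptions, not the repository's implementation: `grouped_folds` is a hypothetical helper and `hour_of_segment` is a made-up mapping from segment name to an hour identifier.

```python
from collections import defaultdict


def grouped_folds(hour_of_segment, n_folds):
    """Yield (train, test) lists of segment names; no hour straddles a fold."""
    by_hour = defaultdict(list)
    for segment, hour in hour_of_segment.items():
        by_hour[hour].append(segment)
    hours = sorted(by_hour)
    if n_folds > len(hours):
        raise ValueError("need at least as many hours as folds")
    for fold in range(n_folds):
        # Every n_folds-th hour, starting at this fold's offset, is held out.
        test_hours = set(hours[fold::n_folds])
        test = [s for h in hours if h in test_hours for s in by_hour[h]]
        train = [s for h in hours if h not in test_hours for s in by_hour[h]]
        yield train, test


# Made-up example: eight segments recorded over four hours.
hour_of_segment = {
    "seg_1": 0, "seg_2": 0,
    "seg_3": 1, "seg_4": 1,
    "seg_5": 2, "seg_6": 2,
    "seg_7": 3, "seg_8": 3,
}

folds = list(grouped_folds(hour_of_segment, 2))
for train, test in folds:
    print("train:", train, "test:", test)
```

Splitting segments at random instead would let neighbouring segments from the same hour appear on both sides of a fold, which would likely make the cross-validated AUC look better than the leaderboard score.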

Make prediction

After running train.py, model files will have been generated in the default model directory (model). These are loaded automatically, along with the test data, to classify the test data points. The results are written to an output csv in the default output directory (output):

./predict.py

As above, options can be viewed with:

./predict.py -h

SETTINGS.json

{
    "TRAIN_DATA_PATH": "train", 
    "MODEL_PATH": "model", 
    "SUBJECTS": ["Dog_1",
                 "Dog_2",
                 "Dog_3",
                 "Dog_4",
                 "Dog_5",
                 "Patient_1",
                 "Patient_2"],
    "FEATURES": ["feat_var",
                 "feat_cov", 
                 "feat_corrcoef",
                 "feat_pib", 
                 "feat_xcorr", 
                 "feat_psd", 
                 "feat_psd_logf",
                 "feat_coher",
                 "feat_coher_logf"],
    "TEST_DATA_PATH": "test", 
    "SUBMISSION_PATH": "output",
    "VERSION": "_v1",
    "RAW_DATA_DIRS": ["/disk/data2/neuroglycerin/hail-seizure-data/",
                      "/media/SPARROWHAWK/neuroglycerin/hail-seizure-data/",
                      "/media/scott/SPARROWHAWK/neuroglycerin/hail-seizure-data/"]
}
  • SUBJECTS: list of which subjects to use in the current run
  • VERSION: string to indicate version number of this run
  • RAW_DATA_DIRS: list of directories that contain the raw .mat data organised by subject
  • FEATURES: list of features used in this run
  • TRAIN_DATA_PATH: directory holding the preprocessed extracted features from raw data in per-feature HDF5s
  • MODEL_PATH: directory containing the serialised models
  • TEST_DATA_PATH: directory containing all output related to model testing (CV etc).
  • SUBMISSION_PATH: directory containing the submission csv for the current run
  • THRESHOLD: if present will activate VarianceThreshold
  • PCA: if present will activate a Principal Component Analysis transform, options not implemented
  • SELECTION: if present will activate univariate feature selection. The dictionary stored under each of these optional keys is used as its options. Further optional keys are:
  • TREE_EMBEDDING: Random Tree Embedding transformation
  • BAGGING: meta-bagger using selected classifier as base, options are set as a dictionary at this key.
  • RFE: use recursive feature elimination, only works with linear SVC
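
A settings file along these lines can be checked before a run. The sketch below is illustrative only: treating the eight keys of the example as required is an assumption, `summarise_settings` is a hypothetical helper, and the `n_estimators` option under BAGGING is a made-up example value.

```python
import json

# Assumption: the keys shown in the example SETTINGS.json are all required.
REQUIRED_KEYS = ["TRAIN_DATA_PATH", "MODEL_PATH", "SUBJECTS", "FEATURES",
                 "TEST_DATA_PATH", "SUBMISSION_PATH", "VERSION", "RAW_DATA_DIRS"]
# Optional keys that switch on an extra processing step when present.
OPTIONAL_STEPS = ["THRESHOLD", "PCA", "SELECTION", "TREE_EMBEDDING",
                  "BAGGING", "RFE"]


def summarise_settings(settings):
    """Return {optional step: options} for the steps present in settings."""
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise KeyError("missing settings keys: " + ", ".join(missing))
    return {key: settings[key] for key in OPTIONAL_STEPS if key in settings}


# Cut-down settings text; the BAGGING options are illustrative only.
example_text = """
{
    "TRAIN_DATA_PATH": "train",
    "MODEL_PATH": "model",
    "SUBJECTS": ["Dog_1", "Patient_1"],
    "FEATURES": ["feat_var", "feat_cov"],
    "TEST_DATA_PATH": "test",
    "SUBMISSION_PATH": "output",
    "VERSION": "_v1",
    "RAW_DATA_DIRS": ["/path/to/raw/data/"],
    "BAGGING": {"n_estimators": 10}
}
"""

settings = json.loads(example_text)
active_steps = summarise_settings(settings)
print(active_steps)
```

For a real run the text would be read from SETTINGS.json, or from one of the files in the settings directory, instead of an inline string.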

Model documentation

Our final model was a combination of four models, all of which used a support vector machine classifier with feature selection. Notes on this, and the code actually used in the competition, can be found in the Comparing outputs IPython notebook. The settings for each of these models can be found in the settings directory of the repository.

The part of this code that combines the outputs to produce the final csv is in the average.py script. Calling it with the four csvs listed in merge.json will produce our final output csv:

merge.json
----------
["output/forestselection_gavin_submission_using__v2_feats.csv",
 "output/SVC_best_for_each_subject_in_batchall_with_FS_submission_using__v3_feats.csv",
 "output/stoch_opt_2nd_submission_using__v2_feats.csv",
 "output/bbsubj_pg_submission_using__v2_feats.csv"]

./average.py -s merge.json -o merged_many_v1.csv
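
For readers who want the gist without opening average.py, combining several submission files might look like the sketch below. It assumes a plain arithmetic mean and a csv with `clip` and `preictal` columns; both are assumptions, and the actual script may combine the predictions differently. `average_predictions` is a hypothetical helper and the csv contents are made up.

```python
import csv
import io


def average_predictions(csv_texts, key_column="clip", value_column="preictal"):
    """Average one numeric column across csv files that share the same keys."""
    tables = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        tables.append({row[key_column]: float(row[value_column])
                       for row in reader})
    keys = list(tables[0])  # keep the row order of the first file
    for table in tables[1:]:
        if set(table) != set(keys):
            raise ValueError("submission files do not cover the same segments")
    return [(key, sum(t[key] for t in tables) / len(tables)) for key in keys]


# Two made-up submissions in the assumed format.
first = "clip,preictal\nDog_1_test_segment_1.mat,0.25\nDog_1_test_segment_2.mat,0.5\n"
second = "clip,preictal\nDog_1_test_segment_1.mat,0.75\nDog_1_test_segment_2.mat,1.0\n"

merged = average_predictions([first, second])
for clip, score in merged:
    print(clip, score)
```

With files on disk, each entry of merge.json would be opened and read in place of the inline strings used here.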