# Epileptic Seizure Classification with Random Forest
This notebook classifies time-series EEG data to detect epileptic seizures, based on the preprocessed CHB-MIT Scalp EEG Database.
The code is structured as follows:
1. [Imports](#1-imports)
2. [Load Dataset](#2-load-dataset)
3. [Split Dataset](#3-split-dataset)
4. [Define Space & Optimization Function](#4-define-space--optimization-function)
5. [Optimize Classifier](#5-optimize-classifier)
6. Validate Results
7. Explain Classifier with SHAP
8. [Conclusion](#8-conclusion)

<a name="imports"></a>
## 1. Imports
Import the required libraries. <br>
External packages can be installed via the `pip install` command.

In [None]:
# Import built-in libraries
import time

# Import datascience libraries
import numpy as np

# Import preprocessing-libraries, classifier & metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score, make_scorer
from imblearn.metrics import geometric_mean_score

# Import optimization library
from hyperopt import fmin, hp, tpe, STATUS_OK, Trials
from hyperopt.pyll import scope

# Import explainability library
# import shap

## 2. Load Dataset
The preprocessed dataset, created with the notebook `00_Preprocessing.ipynb`, is loaded and the NumPy arrays for the features and labels are extracted. <br>
To check the class distribution of the dataset, the classes and their sample counts are printed.

In [None]:
dataset = np.load('../00_Data/Processed-Data/classification_dataset.npz')
X = dataset["features"]
y = dataset["labels"]

In [None]:
print("Shapes: ", X.shape, y.shape)
np.unique(y, return_counts=True)
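
The raw counts can be turned into class proportions, which makes any imbalance easier to read. A minimal sketch, assuming a small hypothetical label array `y_demo` in place of the real `y`:

```python
import numpy as np

# Hypothetical binary labels standing in for `y`
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Count the samples per class and convert the counts to proportions
classes, counts = np.unique(y_demo, return_counts=True)
proportions = counts / counts.sum()

for cls, cnt, prop in zip(classes, counts, proportions):
    print(f"class {cls}: {cnt} samples ({prop:.1%})")
```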

In [None]:
n_samples, n_timesteps, n_features = X.shape
X_reshaped = np.reshape(X, (n_samples, (n_timesteps * n_features)))
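
The reshape above flattens each sample from a `(n_timesteps, n_features)` matrix into a single row, because `RandomForestClassifier` expects a 2-D input. The following sketch, using a hypothetical toy array `X_demo`, illustrates where each value ends up after flattening:

```python
import numpy as np

# Hypothetical toy array: 2 samples, 3 timesteps, 4 features
X_demo = np.arange(2 * 3 * 4).reshape(2, 3, 4)
n_samples, n_timesteps, n_features = X_demo.shape
X_flat = np.reshape(X_demo, (n_samples, n_timesteps * n_features))

print(X_flat.shape)

# Each row holds the timesteps of one sample, concatenated in order
assert np.array_equal(X_flat[0], X_demo[0].ravel())

# Feature f at timestep t lands in column t * n_features + f
t, f = 2, 1
assert X_flat[1, t * n_features + f] == X_demo[1, t, f]
```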

## 3. Split Dataset
To validate and test the trained classifier, the dataset is split into `train`, `test`, and `validation` subsets (60 % / 20 % / 20 %). <br>
To preserve the class distribution within each split, the `stratify` option is enabled.

In [None]:
X_train, X_rest, y_train, y_rest = train_test_split(X_reshaped, y, test_size=0.4, shuffle=True, stratify=np.ravel(y), random_state=34)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, shuffle=True, stratify=np.ravel(y_rest), random_state=34)

In [None]:
print(np.unique(y_train, return_counts=True))
print(np.unique(y_test, return_counts=True))
print(np.unique(y_val, return_counts=True))
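
The effect of the two-step stratified split can be checked on synthetic data. This is a sketch under stated assumptions: `X_demo` and `y_demo` are hypothetical stand-ins with a 10 % positive class, not the EEG dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 1000 samples, 10 % positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 100 + [0] * 900)

# Step 1: 60 % train, 40 % rest
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.4, shuffle=True, stratify=y_demo, random_state=34
)
# Step 2: split the rest in half, giving 20 % test and 20 % validation
X_te, X_va, y_te, y_va = train_test_split(
    X_rest, y_rest, test_size=0.5, shuffle=True, stratify=y_rest, random_state=34
)

# The positive-class share should stay close to 10 % in every subset
for name, part in [("train", y_tr), ("test", y_te), ("val", y_va)]:
    print(name, len(part), part.mean())
```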

## 4. Define Space & Optimization Function
To get the best possible predictions, the hyperparameters of the classifier are optimized with the Bayesian optimization library `hyperopt`. <br>
First, the search space for each hyperparameter is defined and stored as a dictionary. <br>
The `objective()` function defines, trains, and evaluates the classifier using five-fold stratified cross-validation. <br>
Finally, the negated cross-validated macro F1 score is returned as the loss, because `hyperopt` minimizes its objective, together with the remaining metrics.
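
The cross-validation step can be sketched in isolation. The example below assumes synthetic data from `make_classification` and a reduced set of two metrics; it shows how `cross_validate` returns one score per fold under the key `test_<name>` and how a score becomes a loss.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical imbalanced toy data, not the EEG dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = {"f1_macro": "f1_macro", "recall": "recall_macro"}

results = cross_validate(clf, X_demo, y_demo, cv=cv, scoring=scoring)

# Average over the folds, ignoring folds that returned NaN
cv_f1_macro = np.nanmean(results["test_f1_macro"])

# An optimizer that minimizes needs the negated score as its loss
loss = -cv_f1_macro
print(sorted(results.keys()), loss)
```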

In [None]:
space = {
    'n_estimators': scope.int(hp.quniform('n_estimators', 100, 600, 10)),
    'max_depth': hp.quniform('max_depth', 100, 400, 10),
    'min_samples_split': hp.uniform('min_samples_split', 0, 0.5),
    'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.5),
    'max_features': hp.choice('max_features', ['sqrt', 'log2', None]),
}

gm_scorer = make_scorer(geometric_mean_score, greater_is_better=True, average='macro')  # Create scorer for G-Mean
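
For a binary problem, the G-mean is commonly defined as the square root of sensitivity times specificity. The sketch below computes it by hand with a hypothetical helper `binary_gmean`; it is meant to illustrate the metric, and is not claimed to reproduce `geometric_mean_score(..., average='macro')` in every case.

```python
import numpy as np

def binary_gmean(y_true, y_pred):
    """Hypothetical helper: sqrt(sensitivity * specificity) for labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)  # Recall of the positive class
    specificity = tn / (tn + fp)  # Recall of the negative class
    return float(np.sqrt(sensitivity * specificity))

# Sensitivity 1/2 and specificity 3/4
score = binary_gmean([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
print(score)

# A classifier that always predicts the majority class gets a G-mean of 0
majority_only = binary_gmean([0, 0, 0, 1], [0, 0, 0, 0])
print(majority_only)
```

Because the score collapses to zero when one class is never recognized, it is a useful complement to accuracy on imbalanced data.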

In [None]:
from hyperopt import STATUS_FAIL  # Lets a failed trial be reported instead of crashing

def objective(space):
    # Create classifier from the sampled hyperparameters
    rf_classifier = RandomForestClassifier(
        n_estimators=int(space["n_estimators"]),
        max_depth=int(space["max_depth"]),
        min_samples_split=space["min_samples_split"],
        min_samples_leaf=space["min_samples_leaf"],
        max_features=space["max_features"],
        random_state=456,
        n_jobs=-1,
        verbose=2
    )

    # Metrics evaluated on every cross-validation fold
    scoring = {
        'f1_macro': 'f1_macro', 'f1_weighted': 'f1_weighted', 'auc': 'roc_auc_ovr', 'gmean': gm_scorer,
        'precision': 'precision_macro', 'recall': 'recall_macro', 'waccuracy': 'balanced_accuracy'
    }

    try:
        # Train classifier on the full training set (needed for the test-set metrics below)
        rf_classifier.fit(X=X_train, y=np.ravel(y_train))

        # Cross validation (fits clones of the classifier on each fold);
        # the fixed random_state is an added assumption to make the folds reproducible
        splits = StratifiedKFold(n_splits=5, shuffle=True, random_state=456)
        cross_val = cross_validate(rf_classifier, X_train, np.ravel(y_train), cv=splits, scoring=scoring)

        # Average each metric over the folds, ignoring folds that returned NaN
        cv_f1_macro = np.nanmean(cross_val['test_f1_macro'])
        cv_f1_weighted = np.nanmean(cross_val['test_f1_weighted'])
        cv_auc = np.nanmean(cross_val['test_auc'])
        cv_gmean = np.nanmean(cross_val['test_gmean'])
        cv_precision = np.nanmean(cross_val['test_precision'])
        cv_recall = np.nanmean(cross_val['test_recall'])
        cv_acc_weighted = np.nanmean(cross_val['test_waccuracy'])

        pred = rf_classifier.predict(X_test)  # Predict X_test
        pred_proba = rf_classifier.predict_proba(X_test)  # Predict probabilities for X_test
        f1 = f1_score(np.ravel(y_test), pred, average="macro")  # Compute F1 score
        auc = roc_auc_score(np.ravel(y_test), pred_proba[:, 1], average="macro", multi_class="ovr")  # Compute AUC
        gmean = geometric_mean_score(np.ravel(y_test), pred, average="macro")  # Compute G-Mean
    except Exception as err:
        # Report the trial as failed; the original bare `except: pass` left the metrics undefined
        return {'status': STATUS_FAIL, 'error': str(err), 'eval_time': time.time()}

    return {
        'loss': -cv_f1_macro,  # hyperopt minimizes, so the score is negated
        'status': STATUS_OK,
        'metrics': {
            'f1': f1,
            'auc': auc,
            'gmean': gmean,
            'cv_f1_macro': cv_f1_macro,
            'cv_f1_weighted': cv_f1_weighted,
            'cv_auc': cv_auc,
            'cv_gmean': cv_gmean,
            'cv_precision': cv_precision,
            'cv_recall': cv_recall,
            'cv_acc_weighted': cv_acc_weighted
        },
        'eval_time': time.time()
    }

## 5. Optimize Classifier

In [None]:
trials = Trials()  # Stores the result of every evaluation (evaluations run sequentially)

best_param = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=10,
    trials=trials
)

print(best_param)

## 8. Conclusion