# Sleep stage classification: Random Forest & Hidden Markov Model
____

This model aims to classify sleep stages based on two EEG channels. The features extracted in the `pipeline.ipynb` notebook are the input to a Random Forest, and the output of that classifier is then used as the input of a Hidden Markov Model (HMM). The HMM is implemented as in Malafeev et al., "Automatic Human Sleep Stage Scoring Using Deep Neural Networks".

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys

# Ensure parent folder is in PYTHONPATH
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [18]:
%matplotlib inline

import sys
from itertools import groupby

import matplotlib.pyplot as plt
import numpy as np
import joblib

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV,
                                     RandomizedSearchCV,
                                     GroupKFold,
                                     cross_validate)
from sklearn.metrics import (accuracy_score,
                             confusion_matrix,
                             classification_report,
                             f1_score,
                             cohen_kappa_score,
                             make_scorer)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA

from scipy.signal import medfilt

from hmmlearn.hmm import MultinomialHMM
from constants import (SLEEP_STAGES_VALUES,
                       N_STAGES,
                       EPOCH_DURATION)
from model_utils import (print_hypnogram,
                         train_test_split_one_subject,
                         train_test_split_according_to_age)

## Load the features
___

In [3]:
# position of the subject information and night information in the X matrix
SUBJECT_IDX = 0 
NIGHT_IDX = 1
USE_CONTINUOUS_AGE = False
DOWNSIZE_SET = False
TEST_SET_SUBJECTS = [0.0, 24.0, 49.0, 71.0]

if USE_CONTINUOUS_AGE:
    X_file_name = "../data/x_features-age-continuous.npy"
    y_file_name = "../data/y_observations-age-continuous.npy"
else:
    X_file_name = "../data/x_features.npy"
    y_file_name = "../data/y_observations.npy"

In [4]:
X_init = np.load(X_file_name, allow_pickle=True)
y_init = np.load(y_file_name, allow_pickle=True)


In [5]:
X_init = np.vstack(X_init)
y_init = np.hstack(y_init)
print(X_init.shape)
print(y_init.shape)


(168954, 50)
(168954,)


In [6]:
print("Number of subjects: ", np.unique(X_init[:,SUBJECT_IDX]).shape[0]) # Some subject indexes are skipped, thus total number is below 83 (as we can see in https://physionet.org/content/sleep-edfx/1.0.0/)
print("Number of nights: ", len(np.unique([f"{int(x[0])}-{int(x[1])}" for x in X_init[:,SUBJECT_IDX:NIGHT_IDX+1]])))


Number of subjects:  78
Number of nights:  153


## Downsizing sets
___

We will use the same set for all experiments. When `DOWNSIZE_SET` is enabled, it includes the first 20 subjects (ids 0 to 19) and excludes subject 13, because that subject has only one night of recording.

The last subject is put in the test set.

In [11]:
if DOWNSIZE_SET:
    # Filtering to only keep first 20 subjects
    X_20 = X_init[np.isin(X_init[:,SUBJECT_IDX], range(20))]
    y_20 = y_init[np.isin(X_init[:,SUBJECT_IDX], range(20))]

    # Exclude the subject with only one night recording (13th)
    MISSING_NIGHT_SUBJECT = 13

    X = X_20[X_20[:,SUBJECT_IDX] != MISSING_NIGHT_SUBJECT]
    y = y_20[X_20[:,SUBJECT_IDX] != MISSING_NIGHT_SUBJECT]

    print(X.shape)
    print(y.shape)
else:
    X = X_init
    y = y_init

In [12]:
print("Number of subjects: ", np.unique(X[:,SUBJECT_IDX]).shape[0]) # Some subject indexes are skipped, thus total number is below 83 (as we can see in https://physionet.org/content/sleep-edfx/1.0.0/)
print("Subjects available: ", np.unique(X[:,SUBJECT_IDX]))
print("Number of nights: ", len(np.unique([f"{int(x[0])}-{int(x[1])}" for x in X[:,SUBJECT_IDX:NIGHT_IDX+1]])))

Number of subjects:  78
Subjects available:  [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.
 36. 37. 38. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54.
 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 70. 71. 72. 73. 74.
 75. 76. 77. 80. 81. 82.]
Number of nights:  153


## Train, validation and test sets
___

If we downsize the dataset, the test set contains only the two night recordings of the last subject (no. 19). The rest forms the train and validation sets.

If we do not downsize the dataset, one randomly picked subject from each age group is placed in the test set (the subjects listed in `TEST_SET_SUBJECTS`). Both nights (if there are two) go to the test set, so that the classifier never trains on a recording from a test subject.
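The subject-wise hold-out described above can be sketched with plain NumPy. This is a minimal illustration on toy data, not the implementation of `train_test_split_according_to_age` (its internals are not shown in this notebook); the helper name `split_by_subject` is hypothetical.

```python
import numpy as np

SUBJECT_IDX = 0  # column holding the subject id, as in this notebook


def split_by_subject(X, y, test_subjects):
    """Hypothetical helper: send every epoch of `test_subjects` to the test set."""
    is_test = np.isin(X[:, SUBJECT_IDX], test_subjects)
    return X[is_test], X[~is_test], y[is_test], y[~is_test]


# Toy data: 3 subjects, 2 epochs each; columns are [subject, night, feature]
X_toy = np.array([[0, 1, 0.1], [0, 2, 0.2],
                  [1, 1, 0.3], [1, 2, 0.4],
                  [2, 1, 0.5], [2, 2, 0.6]])
y_toy = np.array([0, 1, 2, 2, 3, 4])

X_te, X_tr, y_te, y_tr = split_by_subject(X_toy, y_toy, test_subjects=[1])

# No subject appears on both sides of the split
shared = np.intersect1d(X_te[:, SUBJECT_IDX], X_tr[:, SUBJECT_IDX])
print(X_te.shape, X_tr.shape, shared.size)
```

Splitting on the subject column rather than on individual epochs is what keeps both nights of a held-out subject away from the training data.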


In [13]:
if DOWNSIZE_SET:
    X_test, X_train_valid, y_test, y_train_valid = train_test_split_one_subject(X, y)
else:
    X_test, X_train_valid, y_test, y_train_valid = train_test_split_according_to_age(X,
                                                                                     y,
                                                                                     subjects_test=TEST_SET_SUBJECTS,
                                                                                     use_continuous_age=USE_CONTINUOUS_AGE)
    
print(X_test.shape, X_train_valid.shape, y_test.shape, y_train_valid.shape)

Selected subjects for the test set are:  [0.0, 24.0, 49.0, 71.0]
(8123, 50) (160831, 50) (8123,) (160831,)


## Random forest validation
___

In [14]:
NB_KFOLDS = 5
NB_CATEGORICAL_FEATURES = 2
NB_FEATURES = 48

CLASSIFIER_PIPELINE_KEY = 'classifier'

def get_random_forest_model():
    return Pipeline([
        ('scaling', ColumnTransformer([
            ('pass-through-categorical', 'passthrough', list(range(NB_CATEGORICAL_FEATURES))),
            ('scaling-continuous', StandardScaler(copy=False), list(range(NB_CATEGORICAL_FEATURES,NB_FEATURES)))
        ])),
        (CLASSIFIER_PIPELINE_KEY, RandomForestClassifier(
            n_estimators=100,
            random_state=42, # enables deterministic behaviour
            n_jobs=-1
        ))
    ])

For cross-validation, we use `GroupKFold` with the subject id as the group. Each fold therefore trains and validates on disjoint sets of subjects, so the validation scores are not inflated by the model having already seen recordings of the validation subjects.
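To see what grouping by subject guarantees, here is a small sketch on toy data (the subject ids and array sizes are illustrative, not taken from the dataset). It checks that no subject ends up in both the training and the validation indices of a fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 subjects with 4 epochs each
groups = np.repeat(np.arange(6), 4)
X_toy = np.random.default_rng(0).normal(size=(groups.size, 3))
y_toy = np.tile([0, 1, 2, 3], 6)

overlaps = []
for train_idx, valid_idx in GroupKFold(n_splits=3).split(X_toy, y_toy, groups=groups):
    train_subjects = set(groups[train_idx].tolist())
    valid_subjects = set(groups[valid_idx].tolist())
    # Subjects shared between the two sides of the fold (expected: none)
    overlaps.append(train_subjects & valid_subjects)
    print("validation subjects:", sorted(valid_subjects))
```

A plain `KFold` on the same data would split epochs of one subject across training and validation, which is exactly what the grouping avoids.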

In [16]:
%%time

def cross_validate_pipeline(pipeline):
    accuracies = []
    macro_f1_scores = []
    weighted_f1_scores = []
    kappa_agreements = []
    emission_matrix = np.zeros((N_STAGES,N_STAGES))

    for train_index, valid_index in GroupKFold(n_splits=NB_KFOLDS).split(X_train_valid, groups=X_train_valid[:,SUBJECT_IDX]):
        # We drop the subject and night indexes
        X_train, X_valid = X_train_valid[train_index, 2:], X_train_valid[valid_index, 2:]
        y_train, y_valid = y_train_valid[train_index], y_train_valid[valid_index]

        pipeline.fit(X_train, y_train)
        y_valid_pred = pipeline.predict(X_valid)

        print("----------------------------- FOLD RESULTS --------------------------------------\n")
        current_kappa = cohen_kappa_score(y_valid, y_valid_pred)

        print("TRAIN:", train_index, "VALID:", valid_index, "\n\n")
        print(confusion_matrix(y_valid, y_valid_pred), "\n")
        print(classification_report(y_valid, y_valid_pred, target_names=SLEEP_STAGES_VALUES.keys()), "\n")
        print("Agreement score (Cohen Kappa): ", current_kappa, "\n")

        # Keep full precision here; rounding happens only when printing the mean
        accuracies.append(accuracy_score(y_valid, y_valid_pred))
        macro_f1_scores.append(f1_score(y_valid, y_valid_pred, average="macro"))
        weighted_f1_scores.append(f1_score(y_valid, y_valid_pred, average="weighted"))
        kappa_agreements.append(current_kappa)

        for y_pred, y_true in zip(y_valid_pred, y_valid):
            emission_matrix[y_true, y_pred] += 1

    emission_matrix = emission_matrix / emission_matrix.sum(axis=1, keepdims=True)
    
    print(f"Mean accuracy          : {np.mean(accuracies):0.2f} ± {np.std(accuracies):0.3f}")
    print(f"Mean macro F1-score    : {np.mean(macro_f1_scores):0.2f} ± {np.std(macro_f1_scores):0.3f}")
    print(f"Mean weighted F1-score : {np.mean(weighted_f1_scores):0.2f} ± {np.std(weighted_f1_scores):0.3f}")
    print(f"Mean Kappa's agreement : {np.mean(kappa_agreements):0.2f} ± {np.std(kappa_agreements):0.3f}")

    return emission_matrix

CPU times: user 9 µs, sys: 1 µs, total: 10 µs
Wall time: 14.1 µs
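The element-wise loop that fills `emission_matrix` can also be written as a single vectorized update. A minimal sketch on toy labels, assuming integer-encoded stages as in the loop above; `N_STAGES = 5` is restated here only so that the snippet runs on its own.

```python
import numpy as np

N_STAGES = 5  # W, N1, N2, N3, REM

y_true = np.array([0, 0, 1, 2, 2, 2, 3, 4, 4])
y_pred = np.array([0, 1, 1, 2, 2, 4, 3, 4, 2])

# Reference: the explicit loop used in the notebook
counts_loop = np.zeros((N_STAGES, N_STAGES))
for pred, true in zip(y_pred, y_true):
    counts_loop[true, pred] += 1

# Vectorized equivalent: np.add.at accumulates repeated (row, col) pairs
counts_vec = np.zeros((N_STAGES, N_STAGES))
np.add.at(counts_vec, (y_true, y_pred), 1)

# Row-normalize so that row i holds P(predicted stage | true stage i)
emission = counts_vec / counts_vec.sum(axis=1, keepdims=True)
print(emission)
```

Note that a stage absent from `y_true` would produce an all-zero row and therefore a division by zero during normalization.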


In [29]:
validation_pipeline = get_random_forest_model()
validation_pipeline.set_params(
    classifier__max_depth=24,
    classifier__n_estimators=100,
)

cross_validate_pipeline(validation_pipeline)

----------------------------- FOLD RESULTS --------------------------------------

TRAIN: [  2137   2138   2139 ... 158843 158844 158845] VALID: [     0      1      2 ... 160828 160829 160830] 


[[ 7206   194   111     2   139]
 [ 1235   534  1404     1   543]
 [  993   439 10654   360   492]
 [  155     7   632  2132     5]
 [  842   907  1233     5  2579]] 

              precision    recall  f1-score   support

           W       0.69      0.94      0.80      7652
          N1       0.26      0.14      0.18      3717
          N2       0.76      0.82      0.79     12938
          N3       0.85      0.73      0.79      2931
         REM       0.69      0.46      0.55      5566

    accuracy                           0.70     32804
   macro avg       0.65      0.62      0.62     32804
weighted avg       0.68      0.70      0.68     32804
 

Agreement score (Cohen Kappa):  0.5914311657565539 

----------------------------- FOLD RESULTS --------------------------------------

TRAIN: [ 

array([[8.80220686e-01, 5.72434281e-02, 2.28674139e-02, 1.57275882e-03,
        3.80957136e-02],
       [2.25087390e-01, 1.86635595e-01, 3.28182785e-01, 1.97578398e-03,
        2.58118446e-01],
       [3.37108325e-02, 3.06349132e-02, 8.42475649e-01, 2.38306070e-02,
        6.93479983e-02],
       [2.45142248e-02, 5.73911618e-04, 2.22677708e-01, 7.51578257e-01,
        6.55898992e-04],
       [8.94674459e-02, 1.38022643e-01, 1.99288838e-01, 1.38962684e-03,
        5.71831446e-01]])
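The matrix returned above can serve as the emission matrix of the HMM, with the true stage as the hidden state and the Random Forest prediction as the observation. As an illustration only, the following NumPy sketch decodes the most likely stage sequence with the Viterbi algorithm. The start probabilities, transition matrix and emission matrix are made-up toy values (in practice they would be estimated from the data), and the function name `viterbi` is ours, not part of `hmmlearn`.

```python
import numpy as np


def viterbi(observations, start_prob, transition, emission):
    """Most likely hidden state sequence, computed in log space."""
    n_steps, n_states = len(observations), len(start_prob)
    with np.errstate(divide="ignore"):  # log(0) -> -inf is acceptable here
        log_start = np.log(start_prob)
        log_trans = np.log(transition)
        log_emit = np.log(emission)

    score = np.empty((n_steps, n_states))
    backpointer = np.zeros((n_steps, n_states), dtype=int)
    score[0] = log_start + log_emit[:, observations[0]]

    for t in range(1, n_steps):
        # candidate[i, j]: best score ending in state i at t-1, then moving to j
        candidate = score[t - 1][:, None] + log_trans
        backpointer[t] = candidate.argmax(axis=0)
        score[t] = candidate.max(axis=0) + log_emit[:, observations[t]]

    # Backtrack from the best final state
    path = np.empty(n_steps, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n_steps - 1, 0, -1):
        path[t - 1] = backpointer[t, path[t]]
    return path


# Toy 2-state example with "sticky" transitions (all values are illustrative)
start_prob = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.1],
                       [0.1, 0.9]])
emission = np.array([[0.8, 0.2],   # row i: P(observed label | hidden state i)
                     [0.2, 0.8]])

noisy_predictions = np.array([0, 0, 1, 0, 0])
decoded = viterbi(noisy_predictions, start_prob, transition, emission)
print(decoded)
```

In this toy case the isolated `1` in the middle of the sequence is smoothed away, because switching state twice is less likely than a single misclassification by the classifier.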

## Random forest training and testing
___

In [20]:
%%time

testing_pipeline = get_random_forest_model()
testing_pipeline.set_params(
    classifier__max_depth=24,
    classifier__n_estimators=100,
)

testing_pipeline.fit(X_train_valid[:, 2:], y_train_valid);

CPU times: user 3min 41s, sys: 2.12 s, total: 3min 43s
Wall time: 1min 13s


In [21]:
feature_importance_indexes = [
    (idx, round(importance,4))
    for idx, importance in enumerate(testing_pipeline.named_steps[CLASSIFIER_PIPELINE_KEY].feature_importances_)
]
feature_importance_indexes.sort(reverse=True, key=lambda x: x[1])

category_feature_range = np.array([2, 3]) - 2
time_domaine_feature_range = np.array([4, 5, 6, 7, 8, 9, 10, 27, 28, 29, 30, 31, 32, 33]) - 2
freq_domain_feature_range = np.array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44]) - 2
subband_domain_feature_range = np.array([22, 23, 24, 25, 26, 45, 46, 47, 48, 49]) - 2
fpz_cz_feature_range = np.array(range(2, 25))
pz_oz_feature_range = np.array(range(25, 48))

def get_feature_range_importance(indexes):
    return np.sum([feature[1] for feature in feature_importance_indexes if feature[0] in indexes])

print(f"Categorical features:         {category_feature_range}")
print(f"Time domain features:         {time_domaine_feature_range}")
print(f"Frequency domain features:    {freq_domain_feature_range}")
print(f"Subband time domain features: {subband_domain_feature_range}\n")

print(f"Top 5 features:    {[feature for feature in feature_importance_indexes[:5]]}")
print(f"Bottom 5 features: {[feature for feature in feature_importance_indexes[-5:]]}\n")

print(f"Fpz-Cz feature importances:   {get_feature_range_importance(fpz_cz_feature_range):.4f}")
print(f"Pz-Oz feature importances:    {get_feature_range_importance(pz_oz_feature_range):.4f}\n")

print(f"Category feature importances:            {get_feature_range_importance([0,1]):.4f}")
print(f"Time domain feature importances:         {get_feature_range_importance(time_domaine_feature_range):.4f}")
print(f"Frequency domain feature importances:    {get_feature_range_importance(freq_domain_feature_range):.4f}")
print(f"Subband time domain feature importances: {get_feature_range_importance(subband_domain_feature_range):.4f}")

Categorical features:         [0 1]
Time domain features:         [ 2  3  4  5  6  7  8 25 26 27 28 29 30 31]
Frequency domain features:    [ 9 10 11 12 13 14 15 16 17 18 19 32 33 34 35 36 37 38 39 40 41 42]
Subband time domain features: [20 21 22 23 24 43 44 45 46 47]

Top 5 features:    [(41, 0.0627), (29, 0.0487), (18, 0.0421), (20, 0.0411), (47, 0.0403)]
Bottom 5 features: [(11, 0.0108), (27, 0.0093), (4, 0.0091), (42, 0.0066), (0, 0.0031)]

Fpz-Cz feature importances:   0.4553
Pz-Oz feature importances:    0.5284

Category feature importances:            0.0162
Time domain feature importances:         0.2843
Frequency domain feature importances:    0.4711
Subband time domain feature importances: 0.2283


In [22]:
y_test_pred = testing_pipeline.predict(X_test[:,2:])

print(confusion_matrix(y_test, y_test_pred))

print(classification_report(y_test, y_test_pred, target_names=SLEEP_STAGES_VALUES.keys()))

print("Agreement score (Cohen Kappa): ", cohen_kappa_score(y_test, y_test_pred))

[[1512   65    3    3   41]
 [ 220  147  332    0  284]
 [  39   45 3212  194  113]
 [   4    0   32  575    0]
 [  49   81  284    0  888]]
              precision    recall  f1-score   support

           W       0.83      0.93      0.88      1624
          N1       0.43      0.15      0.22       983
          N2       0.83      0.89      0.86      3603
          N3       0.74      0.94      0.83       611
         REM       0.67      0.68      0.68      1302

    accuracy                           0.78      8123
   macro avg       0.70      0.72      0.69      8123
weighted avg       0.75      0.78      0.75      8123

Agreement score (Cohen Kappa):  0.6879671218212182
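
Cohen's kappa compares the observed agreement `p_o` (the accuracy) with the agreement `p_e` expected by chance from the row and column totals of the confusion matrix: `kappa = (p_o - p_e) / (1 - p_e)`. As a check, the sketch below recomputes it with NumPy from the test confusion matrix printed above; the variable names are ours.

```python
import numpy as np

# Test-set confusion matrix printed above (rows: true stage, columns: predicted)
cm = np.array([[1512,   65,    3,    3,   41],
               [ 220,  147,  332,    0,  284],
               [  39,   45, 3212,  194,  113],
               [   4,    0,   32,  575,    0],
               [  49,   81,  284,    0,  888]])

n = cm.sum()
p_observed = np.trace(cm) / n  # fraction of epochs on the diagonal (accuracy)
# Chance agreement: product of true and predicted marginals, summed over stages
p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"accuracy = {p_observed:.4f}, kappa = {kappa:.4f}")
```

The result should agree with the accuracy of 0.78 and the kappa of about 0.688 reported by `classification_report` and `cohen_kappa_score` above.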


## Saving trained model
___

We save the trained model together with its postprocessing step, the HMM. For the HMM, we only save the matrices that define it. The median filter postprocessing step does not need to be persisted, because it is stateless.
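
Since an HMM over discrete observations is fully described by its start probabilities, transition matrix and emission matrix, persisting it can be as simple as writing those arrays to disk. A minimal sketch, assuming toy matrices and a hypothetical file name `hmm_matrices.npz` (the file layout actually used for the HMM is not shown in this section):

```python
import os
import tempfile

import numpy as np

# Toy 2-state HMM parameters (illustrative values only)
start_prob = np.array([0.6, 0.4])
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
emission = np.array([[0.7, 0.3],
                     [0.1, 0.9]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "hmm_matrices.npz")  # hypothetical file name
    np.savez(path, start_prob=start_prob, transition=transition, emission=emission)

    # Reload the arrays while the temporary file still exists
    with np.load(path) as saved:
        restored = {name: saved[name] for name in saved.files}

# Check that the round trip is lossless
round_trip_ok = all(
    np.array_equal(restored[name], original)
    for name, original in [("start_prob", start_prob),
                           ("transition", transition),
                           ("emission", emission)]
)
print(round_trip_ok)
```

Storing plain arrays keeps the saved HMM independent of the version of any HMM library used to decode with it.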

In [27]:
SAVED_DIR = "trained_model"

# Create the output directory if it does not exist yet
os.makedirs(SAVED_DIR, exist_ok=True)

In [32]:
if USE_CONTINUOUS_AGE: 
    joblib.dump(testing_pipeline, f"{SAVED_DIR}/classifier_RF_continous_age.joblib")
else:
    joblib.dump(testing_pipeline, f"{SAVED_DIR}/classifier_RF_small.joblib")
    print(
        "Pipeline object size (Mbytes): ",
        os.path.getsize(f"{SAVED_DIR}/classifier_RF_small.joblib")/1e6
    )

Pipeline object size (Mbytes):  322.775421
