Script 4. Classification using temporal features extracted from 'tsfel'.

Possible improvements:
extract additional features available in the library;
retrieve the names of features from an improved version of the wrapper.

In [1]:
import pandas as pd

import tsfel # 0.1.6
from extract_tsfel_features import get_tsfel_features
# function of 'tsfel.time_series_features_extractor' wrapped in order to process signals in rows.

from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from target_to_binary import is_seizure
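
The wrapper `get_tsfel_features` is not listed in this script. As a rough sketch of the row-wise idea only, the function below treats each row of a matrix as one signal and returns a DataFrame with one named column per feature. The three features are hand-written NumPy stand-ins, not tsfel's implementations, and the name `rowwise_features` is hypothetical.

```python
import numpy as np
import pandas as pd

def rowwise_features(signals, fs):
    """Compute a few temporal features per row (one signal per row).

    Illustrative stand-in: these are NOT the features computed by tsfel.
    """
    signals = np.asarray(signals, dtype=float)
    diffs = np.diff(signals, axis=1)  # sample-to-sample differences
    sign_changes = np.diff(np.signbit(signals).astype(int), axis=1) != 0
    features = {
        "mean_abs_diff": np.abs(diffs).mean(axis=1),
        "zero_crossings": sign_changes.sum(axis=1),
        "mean_diff_per_s": diffs.mean(axis=1) * fs,  # average slope per second
    }
    return pd.DataFrame(features)

toy = np.array([[1.0, -1.0, 1.0, -1.0],
                [0.0, 1.0, 2.0, 3.0]])
feats = rowwise_features(toy, fs=4)
print(feats)
```

Returning a DataFrame with named columns is what makes the later selection steps traceable back to feature names.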

In [2]:
file_path  = '../dat/Epileptic_Seizure_Recognition.csv'
data = pd.read_csv(file_path)
print(data.shape)  # (11500, 180)
del file_path

# remove the 1st column (Unnamed)
data.drop(columns=[list(data)[0]], inplace=True)

(11500, 180)


In [3]:
# group all classes >1 (healthy) together into new class 0
target = list(data)[-1]  # "y"

features = list(data)[0:-1]
print(len(features))  # 178

data[target] = data[target].apply(is_seizure)

data.head()

178


Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,y
0,135,190,229,223,192,125,55,-9,-33,-38,...,-17,-15,-31,-77,-103,-127,-116,-83,-51,0
1,386,382,356,331,320,315,307,272,244,232,...,164,150,146,152,157,156,154,143,129,1
2,-32,-39,-47,-37,-32,-36,-57,-73,-85,-94,...,57,64,48,19,-12,-30,-35,-35,-36,0
3,-105,-101,-96,-92,-89,-95,-102,-100,-87,-79,...,-82,-81,-80,-77,-85,-77,-72,-69,-65,0
4,-9,-65,-98,-102,-78,-48,-16,0,-21,-59,...,4,2,-12,-32,-41,-65,-83,-89,-73,0
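
`is_seizure` is imported from a helper module that is not listed here. Judging from the comment in the cell above and the 0/1 values in the `y` column, it presumably maps class 1 to 1 and every other class to 0. A minimal sketch under that assumption (the name `is_seizure_sketch` is hypothetical, and integer labels with 1 = seizure are assumed):

```python
def is_seizure_sketch(label):
    """Return 1 for the seizure class (assumed to be label 1), else 0."""
    return 1 if label == 1 else 0

# Assumed original labels: 1 = seizure, anything greater = non-seizure.
labels = [1, 2, 3, 4, 5]
binary = [is_seizure_sketch(v) for v in labels]
print(binary)
```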


Signals, sampling rate.

In [4]:
mat_signals = data[features].to_numpy(dtype=float)
print(mat_signals.shape) # (11500, 178)

# dataset sampling frequency
fs = mat_signals.shape[1] # 178 data points for 1 second

# tsfel_features = get_tsfel_features(mat_train, fs=fs)

(11500, 178)
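
Since each row holds 178 samples recorded over one second, `fs` equals the row length. The short sketch below derives the time axis, the sampling interval and the Nyquist frequency from those two numbers; the variable names are illustrative.

```python
import numpy as np

n_samples = 178          # samples per signal (row length)
fs = 178                 # sampling frequency in Hz, as set in the cell above

t = np.arange(n_samples) / fs   # time stamp of each sample, in seconds
dt = 1.0 / fs                   # sampling interval, in seconds
nyquist = fs / 2                # highest frequency representable at this rate

print(t[0], t[-1], dt, nyquist)
```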


Train-test split. The split is done before feature extraction so that the later feature selection is fitted on the training data only, which prevents data leakage.

In [5]:
train_sig, test_sig, Y_train, Y_test = train_test_split(data[features], data[target], test_size=0.2, stratify=data[target], random_state=42)
# data_train, data_test = train_test_split(data, test_size=0.2, stratify=data[target], random_state=42)

print(type(train_sig))  # DataFrame
print(type(test_sig))   # DataFrame
print(train_sig.shape)  # (9200, 178)
print(test_sig.shape)   # (2300, 178)
print(Y_train.shape)  # (9200,)
print(Y_test.shape)   # (2300,)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
(9200, 178)
(2300, 178)
(9200,)
(2300,)
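
The `stratify` argument keeps the class ratio identical in both parts of the split. A small sketch on synthetic data with the same 80/20 class balance as this dataset (the values are random numbers, not the EEG recordings):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # synthetic features
y = np.array([1] * 200 + [0] * 800)     # 20 % positives, 80 % negatives

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fraction of positives in each part of the split.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of class 1 in the test set would vary from one random split to another.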


Extract tsfel features.

In [6]:
X_train = get_tsfel_features(train_sig.to_numpy(), fs=fs)
print('signal matrix', train_sig.shape)
# print('-> features of type:', type(X_train))  # DataFrame 
print(X_train.shape)  # (9200, 14)

X_test = get_tsfel_features(test_sig.to_numpy(), fs=fs)
print('signal matrix', test_sig.shape)
# print('-> features of type:', type(X_test))  # DataFrame
print(X_test.shape)  # (2300, 14)

signal matrix (9200, 178)
(9200, 14)


signal matrix (2300, 178)
(2300, 14)


In [7]:
# Highly correlated features are removed.

corr_features = tsfel.correlated_features(X_train)  # needs a DataFrame; a numpy array raises "'numpy.ndarray' object has no attribute 'corr'"
X_train.drop(corr_features, axis=1, inplace=True)
X_test.drop(corr_features, axis=1, inplace=True)

# Remove low variance features
selector = VarianceThreshold()
X_train = selector.fit_transform(X_train)
X_test = selector.transform(X_test)

X_train.shape, X_test.shape
# ((9200, 10), (2300, 10))
# 10 features remain.

((9200, 10), (2300, 10))
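
Both selection steps are fitted on the training features only and then applied unchanged to the test features. To make the mechanics concrete, here is a sketch on a toy DataFrame: a correlation filter written with pandas (an approximation of what a function such as `tsfel.correlated_features` does; the helper name `find_correlated` and the 0.95 threshold are assumptions), followed by `VarianceThreshold`, with `get_support()` used to recover the names of the surviving columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def find_correlated(df, threshold=0.95):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 1.0,      # linear copy of "a"
    "b": rng.normal(size=200),      # independent feature
    "const": np.ones(200),          # zero variance
})

to_drop = find_correlated(toy)
reduced = toy.drop(columns=to_drop)

selector = VarianceThreshold().fit(reduced)
kept = list(reduced.columns[selector.get_support()])
print(to_drop, kept)
```

Because `fit_transform` returns a NumPy array, the boolean mask from `get_support()` is one way to keep track of which named features remain.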

Train the classifier.

In [8]:
clf = RandomForestClassifier(n_estimators=50, max_depth=15, random_state=1)
clf.fit(X_train, Y_train)
print("Accuracy on training set is : {}".format(clf.score(X_train, Y_train)))
# 1.0
print("Accuracy on test set is : {}".format(clf.score(X_test, Y_test)))
# 0.9787

Accuracy on training set is : 1.0
Accuracy on test set is : 0.9786956521739131
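
The test accuracy can be translated into counts and compared against a trivial baseline using only numbers printed in this notebook (2300 test signals, of which 1840 belong to class 0). The arithmetic is sketched below; the baseline is a hypothetical classifier that always predicts class 0.

```python
n_test = 2300                      # size of the test set
n_class0 = 1840                    # test signals in the majority class
accuracy = 0.9786956521739131      # value printed above

n_correct = round(accuracy * n_test)   # correctly classified signals
n_errors = n_test - n_correct          # misclassified signals
baseline = n_class0 / n_test           # accuracy of always predicting class 0

print(n_correct, n_errors, baseline)
```

Given the class imbalance, the per-class f1-scores are more informative than accuracy alone.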


Check its performance on the training set.

In [9]:
preds_train = clf.predict(X_train)
print(classification_report(Y_train, preds_train))

# f1-score for class 1 on train data : 1.0


              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7360
           1       1.00      1.00      1.00      1840

    accuracy                           1.00      9200
   macro avg       1.00      1.00      1.00      9200
weighted avg       1.00      1.00      1.00      9200



Check its performance on the test set.

In [10]:
preds_test = clf.predict(X_test)
print(classification_report(Y_test, preds_test))
# f1-score on class 1 on test data : 0.95

              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1840
           1       0.96      0.93      0.95       460

    accuracy                           0.98      2300
   macro avg       0.97      0.96      0.97      2300
weighted avg       0.98      0.98      0.98      2300



This classifier reaches a performance similar to that of the wavelet-based approach with far fewer features (10 instead of 80). An improved wrapper could be used to retrieve the names of these features.