# Feature extraction with tsfresh transformer

In this tutorial, we show how to use sktime together with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, so that any scikit-learn estimator can then be trained on the extracted features.

## Preliminaries
Install tsfresh if you haven't already. To do so, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
from sktime.transformers.series_as_features.summarize import \
    TSFreshFeatureExtractor

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)
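
The shapes above follow from `train_test_split`'s default of holding out 25% of the samples. Because no `random_state` is passed, the split (and every score computed from it) changes from run to run. The following is a minimal sketch on placeholder arrays with the same number of samples (211); the values, the balanced labels, and `random_state=42` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of samples as above (158 + 53 = 211).
# Values and label proportions are invented for illustration only.
X_demo = np.arange(211).reshape(-1, 1)
y_demo = np.array(["0", "1", "2"] * 70 + ["0"])

# The default test_size is 0.25. random_state fixes the shuffle, and
# stratify keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)  # (158, 1) (53, 1)
```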


In [4]:
X_train.head()

|  | dim_0 |
| --- | --- |
| 31 | 0 -1.9608 1 -1.9304 2 -1.8643 3 ... |
| 42 | 0 -1.9921 1 -2.0144 2 -1.9611 3 ... |
| 106 | 0 -2.0000 1 -2.0029 2 -1.9696 3 ... |
| 137 | 0 -1.6130 1 -1.6113 2 -1.5963 3 ... |
| 72 | 0 -1.7289 1 -1.6855 2 -1.6324 3 ... |
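
Each cell of `dim_0` holds an entire time series as a `pandas.Series`, which is why the output shows index/value pairs inside a single column. Here is a minimal sketch of this nested layout, using made-up values rather than the actual data:

```python
import numpy as np
import pandas as pd

# Three made-up series of length 5; the series in the data above are longer.
rng = np.random.default_rng(0)
series = [pd.Series(rng.normal(size=5)) for _ in range(3)]

# One row per instance, one column per dimension, one Series per cell.
X_nested = pd.DataFrame({"dim_0": series})

first_cell = X_nested.iloc[0, 0]
print(X_nested.shape)      # number of instances x number of dimensions
print(type(first_cell))    # each cell is itself a pandas Series
print(len(first_cell))     # length of the first time series
```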


In [5]:
# multiclass classification task (three classes)
np.unique(y_train)

array(['0', '1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract tsfresh's "efficient" feature set from each series
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")



| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_0__symmetry_looking__r_0.9500000000000001` | `dim_0__time_reversal_asymmetry_statistic__lag_1` | `dim_0__time_reversal_asymmetry_statistic__lag_2` | `dim_0__time_reversal_asymmetry_statistic__lag_3` | `dim_0__value_count__value_-1` | `dim_0__value_count__value_0` | `dim_0__value_count__value_1` | `dim_0__variance` | `dim_0__variance_larger_than_standard_deviation` | `dim_0__variation_coefficient` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 249.999736 | 83.724552 | 0.015825 | 0.007144 | 0.190806 | 0.630121 | 0.103683 | -0.990237 | 0.243063 | 1.384002 | ... | 1.0 | 0.053414 | -0.010174 | -0.045888 | 0.0 | 0.0 | 0.0 | 0.996015 | 0.0 | 6958316.0 |
| 1 | 249.999669 | 92.154196 | 0.157192 | 0.182119 | 0.103257 | -0.109189 | -0.412927 | -1.369386 | 0.166426 | 0.884207 | ... | 1.0 | 0.066259 | 0.023101 | -0.001812 | 0.0 | 0.0 | 0.0 | 0.996015 | 0.0 | -613969.0 |
| 2 | 249.999639 | 85.158964 | 0.267632 | 0.28434 | 0.076466 | -0.251184 | -0.554734 | -1.2396 | 0.119013 | 0.709702 | ... | 1.0 | 0.055315 | 0.004909 | -0.019214 | 0.0 | 0.0 | 0.0 | 0.996014 | 0.0 | -2341115.0 |
| 3 | 249.99957 | 75.740292 | 0.370202 | 0.429531 | 0.085676 | -0.29049 | -0.68024 | -1.199385 | 0.099466 | 0.663122 | ... | 1.0 | 0.031119 | 0.005069 | -0.012406 | 0.0 | 0.0 | 0.0 | 0.996014 | 0.0 | -4554532.0 |
| 4 | 249.997573 | 78.139454 | 0.18919 | 0.244795 | 0.142444 | 0.024702 | -0.25029 | -0.833451 | 0.082725 | 0.88657 | ... | 1.0 | 0.039653 | -0.000263 | -0.012901 | 0.0 | 0.0 | 0.0 | 0.996006 | 0.0 | -517558.4 |
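
Each output column is one scalar summary of the corresponding input series. As a rough illustration, a few of the simpler features can be reproduced by hand with NumPy. The definitions below follow tsfresh's documented formulas to the best of our understanding, and the input series is made up.

```python
import numpy as np

# A made-up series standing in for the contents of one `dim_0` cell.
x = np.array([1.0, -2.0, 0.5, 3.0])

# abs_energy: sum of squared values
abs_energy = np.dot(x, x)
# absolute_sum_of_changes: sum of |x[i+1] - x[i]|
absolute_sum_of_changes = np.sum(np.abs(np.diff(x)))
# variance: population variance (ddof=0)
variance = np.var(x)

print(abs_energy, absolute_sum_of_changes, variance)  # 14.25 8.0 3.171875
```

For a zero-mean series, `variance` equals `abs_energy` divided by the series length. In the output above, `dim_0__abs_energy` is close to 250 and `dim_0__variance` close to 0.996 in every row, a ratio of about 251. This is consistent with standardized series of that length, although that is an inference from the numbers rather than something the notebook states.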


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


0.7735849056603774
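
Inside the pipeline, feature extraction runs once on the training data during `fit` and again on the test data during `score`. The same pattern can be sketched without tsfresh by swapping in a hand-written extractor. Note that `simple_features` and the synthetic data below are hypothetical stand-ins that only illustrate the flow; they are not part of sktime or tsfresh.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def simple_features(X):
    """Hypothetical stand-in extractor: mean, std and sum of absolute changes per row."""
    X = np.asarray(X)
    return np.column_stack(
        [X.mean(axis=1), X.std(axis=1), np.abs(np.diff(X, axis=1)).sum(axis=1)]
    )


# Synthetic data: 100 series of length 50; class 1 is noisier than class 0.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)
scale = np.where(y_demo == 1, 3.0, 1.0)[:, None]
X_demo = rng.normal(scale=scale, size=(100, 50))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf = make_pipeline(
    FunctionTransformer(simple_features),
    RandomForestClassifier(random_state=0),
)
clf.fit(X_tr, y_tr)                # extracts features from X_tr, then fits the forest
accuracy = clf.score(X_te, y_te)   # extracts features from X_te, then predicts
print(accuracy)
```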

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

|  | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
| --- | --- | --- | --- | --- | --- | --- |
| 25 | 0 -0.044205 1 -0.044205 2 -0.878387 3... | 0 -0.496912 1 -0.496912 2 -1.725143 3... | 0 -0.428723 1 -0.428723 2 1.558894 3... | 0 0.620566 1 0.620566 2 0.082565 3... | 0 0.229050 1 0.229050 2 0.098545 3... | 0 0.649863 1 0.649863 2 -0.191763 3... |
| 29 | 0 0.118553 1 0.118553 2 -0.545332 3... | 0 0.419456 1 0.419456 2 0.371223 3... | 0 -0.283447 1 -0.283447 2 0.707172 3... | 0 0.135832 1 0.135832 2 0.159802 3... | 0 -0.079901 1 -0.079901 2 -0.090555 3... | 0 0.050604 1 0.050604 2 0.474080 3... |
| 35 | 0 -0.040961 1 -0.040961 2 0.338414 3... | 0 -0.971100 1 -0.971100 2 -3.420216 3... | 0 0.203560 1 0.203560 2 -2.053446 3... | 0 0.061258 1 0.061258 2 0.250357 3... | 0 -0.047941 1 -0.047941 2 -0.639209 3... | 0 0.961478 1 0.961478 2 -0.298298 3... |
| 6 | 0 1.275129 1 1.275129 2 -0.273185 3... | 0 -1.024406 1 -1.024406 2 0.095152 3... | 0 -0.545722 1 -0.545722 2 0.023203 3... | 0 -0.463427 1 -0.463427 2 0.042614 3... | 0 -0.367545 1 -0.367545 2 -0.109198 3... | 0 -0.159802 1 -0.159802 2 0.183773 3... |
| 33 | 0 -0.166323 1 -0.166323 2 -0.227300 3... | 0 0.768712 1 0.768712 2 -0.677960 3... | 0 -0.344511 1 -0.344511 2 -1.58804... | 0 0.218397 1 0.218397 2 2.141352 3... | 0 0.135832 1 0.135832 2 0.250357 3... | 0 0.162466 1 0.162466 2 1.808430 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")



| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_5__symmetry_looking__r_0.9500000000000001` | `dim_5__time_reversal_asymmetry_statistic__lag_1` | `dim_5__time_reversal_asymmetry_statistic__lag_2` | `dim_5__time_reversal_asymmetry_statistic__lag_3` | `dim_5__value_count__value_-1` | `dim_5__value_count__value_0` | `dim_5__value_count__value_1` | `dim_5__variance` | `dim_5__variance_larger_than_standard_deviation` | `dim_5__variation_coefficient` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 193.996354 | 98.389577 | -0.00797 | -0.002092 | 0.044726 | 3.072738 | 0.655916 | -1.20338 | 2.053698 | 3.507223 | ... | 1.0 | 0.000822 | 0.254753 | 0.811217 | 0.0 | 0.0 | 0.0 | 1.781946 | 1.0 | 23.268575 |
| 1 | 232.319298 | 137.452006 | -0.012131 | -0.039673 | 0.023279 | 4.086147 | 0.998918 | -0.92573 | 2.506426 | 5.070925 | ... | 1.0 | 0.130831 | 0.451732 | 0.787399 | 0.0 | 0.0 | 0.0 | 2.031789 | 1.0 | 17.833684 |
| 2 | 6681.979256 | 441.093167 | -0.003917 | -0.024874 | 0.021016 | 18.693258 | 6.340561 | 0.353609 | 41.466413 | 29.321178 | ... | 1.0 | -28.438658 | 3.340718 | 15.100302 | 0.0 | 0.0 | 0.0 | 12.710682 | 1.0 | 31.065326 |
| 3 | 14.11858 | 16.614377 | -0.006352 | -0.018835 | 0.015877 | 0.594643 | -0.156639 | -0.556163 | 0.167986 | 1.275129 | ... | 1.0 | -0.000125 | -2e-05 | -9e-06 | 0.0 | 4.0 | 0.0 | 0.007111 | 0.0 | 7.4148 |
| 4 | 10991.779448 | 555.675893 | -0.000122 | -0.039183 | 0.035843 | 23.558182 | 5.495269 | -2.17036 | 58.349987 | 29.126005 | ... | 1.0 | 36.630331 | 20.753415 | -28.497307 | 0.0 | 0.0 | 0.0 | 30.200014 | 1.0 | -14.681544 |
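
For multivariate input, the output column names start with the source dimension followed by a double underscore and the feature name, running from `dim_0__...` to `dim_5__...` in the header above. Here is a small sketch, using a handful of column names copied from that header, of how the columns could be grouped by source dimension with plain string handling:

```python
from collections import defaultdict

# A few column names copied from the output above.
columns = [
    "dim_0__abs_energy",
    "dim_0__absolute_sum_of_changes",
    'dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40',
    "dim_5__variance",
    "dim_5__variation_coefficient",
]

# The text before the first double underscore is the source dimension;
# the remainder is the feature name plus any parameters.
by_dimension = defaultdict(list)
for name in columns:
    dimension, feature = name.split("__", 1)
    by_dimension[dimension].append(feature)

for dimension, features in by_dimension.items():
    print(dimension, len(features))
```

With the full feature matrix, the same split could be applied to `Xt.columns` to select, for example, all features derived from one dimension.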
