# Feature extraction with tsfresh transformer

In this tutorial, we show how to use `sktime` with [`tsfresh`](https://tsfresh.readthedocs.io) to extract features from time series, so that the resulting tabular features can be passed to any scikit-learn estimator.

## Preliminaries
This notebook requires `tsfresh`. If it is not installed yet, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_gunpoint
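# Import path for the sktime release this notebook was written against;
# newer releases expose the class from sktime.transformations.panel.tsfresh.
from sktime.transformers.series_as_features.summarize import (
    TSFreshFeatureExtractor,
)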

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/01_classification_univariate.ipynb).

In [3]:
X, y = load_gunpoint(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(150, 1) (150,) (50, 1) (50,)


In [4]:
X_train.head()

,dim_0
0,0 -0.64789 1 -0.64199 2 -0.63819 3...
26,0 -1.1042 1 -1.1044 2 -1.1028 3 ...
121,0 -1.2525 1 -1.2499 2 -1.2490 3 ...
19,0 -0.60183 1 -0.60023 2 -0.59905 3...
124,0 -1.2272 1 -1.2736 2 -1.2927 3 ...


In [5]:
# binary classification task
np.unique(y_train)

array(['1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# "efficient" selects tsfresh's feature set without the most computationally expensive calculators
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.67s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_0__symmetry_looking__r_0.9500000000000001,dim_0__time_reversal_asymmetry_statistic__lag_1,dim_0__time_reversal_asymmetry_statistic__lag_2,dim_0__time_reversal_asymmetry_statistic__lag_3,dim_0__value_count__value_-1,dim_0__value_count__value_0,dim_0__value_count__value_1,dim_0__variance,dim_0__variance_larger_than_standard_deviation,dim_0__variation_coefficient
0,148.99956,23.28697,0.447178,0.494247,0.090215,-0.947333,-1.108709,-0.877764,-0.028691,0.182893,...,1.0,0.005725,0.022142,0.050016,0.0,0.0,0.0,0.99333,0.0,-1115664.0
1,149.000439,36.39218,0.556778,0.600458,0.038777,-0.938455,-1.210859,-1.291641,0.011716,0.543392,...,1.0,0.003384,0.004215,-0.002037,0.0,0.0,0.0,0.993336,0.0,1541231.0
2,148.999447,39.505674,0.376047,0.367909,0.054839,-0.407206,-0.875695,-1.495618,0.296523,0.929742,...,1.0,0.000972,-0.006026,-0.018425,0.0,0.0,0.0,0.99333,0.0,633469.9
3,149.00056,24.04257,0.470649,0.522128,0.093297,-1.171806,-1.172813,-0.798975,-0.107503,-0.46967,...,1.0,-0.000547,-0.003227,-0.004482,0.0,0.0,0.0,0.993337,0.0,1141217.0
4,148.999466,39.74396,0.414706,0.437921,0.05437,-0.414197,-0.85189,-1.504428,0.284091,1.063817,...,1.0,0.002523,-0.004027,-0.009736,0.0,0.0,0.0,0.99333,0.0,-458585.6
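To make the column names above more concrete, here is a minimal sketch of how a few of these features can be computed by hand for a single series. The function bodies follow the tsfresh definitions as we understand them (sum of squares, sum of absolute consecutive differences, population variance, standard deviation over mean); they are illustrative re-implementations, not the tsfresh code, and the sample values are made up.

```python
import math


def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variance(x):
    """Population variance (divides by n, not n - 1)."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)


def variation_coefficient(x):
    """Standard deviation divided by the mean; unstable when the mean is near zero."""
    m = sum(x) / len(x)
    return math.sqrt(variance(x)) / m


series = [1.0, 3.0, 2.0, 6.0]  # made-up values, not taken from GunPoint

# Keys mimic the "<column>__<feature>" naming seen in the table above.
features = {
    "dim_0__abs_energy": abs_energy(series),
    "dim_0__absolute_sum_of_changes": absolute_sum_of_changes(series),
    "dim_0__variance": variance(series),
    "dim_0__variation_coefficient": variation_coefficient(series),
}
print(features)
```

These definitions appear consistent with the table: for a series with a mean near zero, the variance is roughly `abs_energy / n`, and 149 / 150 ≈ 0.9933 for GunPoint series of length 150. A mean near zero would also explain the very large `variation_coefficient` values, since that feature divides by the mean.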


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.66s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:06<00:00,  1.22s/it]




0.96
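The output above shows two feature-extraction passes: one when the pipeline is fitted on the training series, and one when it is scored on the test series. The toy sketch below mirrors that flow using only the standard library. `summarize`, `CentroidClassifier` and `FeaturePipeline` are hypothetical stand-ins invented for illustration; they are not part of sktime, tsfresh or scikit-learn, and the data is made up.

```python
from statistics import mean, pvariance


def summarize(series):
    """Map one series to a fixed-length feature vector (mean, variance)."""
    return [mean(series), pvariance(series)]


class CentroidClassifier:
    """Toy tabular classifier: predict the label of the nearest class centroid."""

    def fit(self, rows, labels):
        grouped = {}
        for row, label in zip(rows, labels):
            grouped.setdefault(label, []).append(row)
        # One centroid per class: the column-wise mean of its rows.
        self.centroids_ = {
            label: [mean(col) for col in zip(*members)]
            for label, members in grouped.items()
        }
        return self

    def predict(self, rows):
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [
            min(self.centroids_, key=lambda lab: sq_dist(row, self.centroids_[lab]))
            for row in rows
        ]


class FeaturePipeline:
    """Hypothetical two-step pipeline: feature extraction, then classification."""

    def __init__(self, extractor, classifier):
        self.extractor = extractor
        self.classifier = classifier

    def fit(self, series_list, labels):
        features = [self.extractor(s) for s in series_list]  # pass 1: training data
        self.classifier.fit(features, labels)
        return self

    def score(self, series_list, labels):
        features = [self.extractor(s) for s in series_list]  # pass 2: test data
        predictions = self.classifier.predict(features)
        return mean(int(p == t) for p, t in zip(predictions, labels))


# Made-up series: class "1" is nearly flat, class "2" oscillates strongly.
train_series = [[0.0, 0.1, -0.1, 0.0], [0.1, 0.0, 0.0, -0.1],
                [2.0, -2.0, 2.0, -2.0], [-1.9, 2.1, -2.0, 1.8]]
train_labels = ["1", "1", "2", "2"]
test_series = [[0.05, -0.05, 0.0, 0.0], [1.5, -1.5, 1.6, -1.4]]
test_labels = ["1", "2"]

pipeline = FeaturePipeline(summarize, CentroidClassifier()).fit(train_series, train_labels)
accuracy = pipeline.score(test_series, test_labels)
print(accuracy)
```

The point of the sketch is the order of operations: the extractor runs on whatever series are passed in, and only the resulting feature rows reach the classifier.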

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
21,0 0.648833 1 0.648833 2 0.076985 3...,0 -0.996722 1 -0.996722 2 -0.897264 3...,0 -0.644136 1 -0.644136 2 0.970515 3...,0 -0.101208 1 -0.101208 2 -0.407496 3...,0 0.055931 1 0.055931 2 -0.157139 3...,0 -0.031960 1 -0.031960 2 -0.343575 3...
13,0 2.580342 1 2.580342 2 -7.26891...,0 -0.850954 1 -0.850954 2 -6.06223...,0 -0.150030 1 -0.150030 2 0.96421...,0 -0.005327 1 -0.005327 2 0.002663 3...,0 0.050604 1 0.050604 2 -0.364882 3...,0 0.311615 1 0.311615 2 -0.772378 3...
7,0 -0.352746 1 -0.352746 2 -1.354561 3...,0 0.316845 1 0.316845 2 0.490525 3...,0 -0.473779 1 -0.473779 2 1.454261 3...,0 -0.327595 1 -0.327595 2 -0.269001 3...,0 0.106535 1 0.106535 2 0.021307 3...,0 0.197090 1 0.197090 2 0.460763 3...
35,0 1.102297 1 1.102297 2 0.73238...,0 -1.790773 1 -1.790773 2 0.661191 3...,0 0.001413 1 0.001413 2 -1.57956...,0 0.258347 1 0.258347 2 -0.127842 3...,0 -0.165129 1 -0.165129 2 -0.16779...,0 0.516694 1 0.516694 2 -0.58860...
28,0 -0.373788 1 -0.373788 2 0.076140 3...,0 0.248056 1 0.248056 2 -1.703104 3...,0 0.164594 1 0.164594 2 0.803796 3...,0 -0.143822 1 -0.143822 2 0.026634 3...,0 -0.183773 1 -0.183773 2 -0.620566 3...,0 -0.015980 1 -0.015980 2 -3.941791 3...


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:39<00:00,  7.90s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_5__symmetry_looking__r_0.9500000000000001,dim_5__time_reversal_asymmetry_statistic__lag_1,dim_5__time_reversal_asymmetry_statistic__lag_2,dim_5__time_reversal_asymmetry_statistic__lag_3,dim_5__value_count__value_-1,dim_5__value_count__value_0,dim_5__value_count__value_1,dim_5__variance,dim_5__variance_larger_than_standard_deviation,dim_5__variation_coefficient
0,172.027276,79.981285,0.018738,-0.010114,0.05157,3.290444,1.030049,-0.639805,1.932756,4.425835,...,1.0,-0.045246,0.165261,0.447651,0.0,0.0,0.0,1.461253,1.0,-43.809709
1,10764.169856,671.27214,0.016019,-0.037014,0.161924,10.704797,1.946699,-12.205448,65.296221,15.367535,...,1.0,4.948201,5.331894,16.386624,0.0,0.0,0.0,19.585905,1.0,196.180932
2,7.940863,17.538085,-0.012544,-0.03333,0.011363,0.308947,-0.199023,-0.815698,0.100413,0.590843,...,1.0,0.002756,0.004311,0.000921,0.0,0.0,0.0,0.078089,0.0,8.320448
3,8664.94077,324.813455,-0.034966,-0.112034,0.071626,20.912103,5.560128,-1.942906,63.322971,28.026657,...,1.0,32.364692,48.229669,93.801281,0.0,1.0,0.0,19.9436,1.0,-14.591905
4,193.578195,99.122458,-0.026095,-0.032476,0.025073,2.9846,0.760303,-0.864634,1.540174,5.334237,...,1.0,-0.15158,-0.110877,-0.458514,0.0,0.0,0.0,2.194275,1.0,-27.767232
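In the multivariate case, the table above contains one block of features per dimension, with each column name prefixed by the dimension it was computed from (`dim_0__...` through `dim_5__...`). The following sketch illustrates that naming scheme with a hand-written extractor. `extract_features` and the three calculators are hypothetical helpers chosen for illustration, not the tsfresh implementation, and the instance values are made up.

```python
def extract_features(instance, calculators):
    """Apply every calculator to every dimension of one multivariate instance."""
    row = {}
    for dim_name, series in instance.items():
        for feature_name, func in calculators.items():
            # Column name follows the "<dimension>__<feature>" pattern.
            row[f"{dim_name}__{feature_name}"] = func(series)
    return row


# Illustrative calculators only; the real feature set is much larger.
calculators = {
    "abs_energy": lambda x: sum(v * v for v in x),
    "maximum": max,
    "minimum": min,
}

# One instance with two dimensions (made-up values).
instance = {"dim_0": [0.5, -1.0, 2.0], "dim_1": [1.0, 1.0, 3.0]}

row = extract_features(instance, calculators)
print(row)
```

With this scheme the number of feature columns grows linearly with the number of dimensions: two dimensions and three calculators give six columns here.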
