# Feature extraction with tsfresh transformer

In this tutorial, we show how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, so that any scikit-learn estimator can then be applied to the resulting feature table.

## Preliminaries
tsfresh must be installed. If you haven't installed it yet, uncomment the line in the cell below and run it:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
from sktime.transformers.series_as_features.summarize import \
    TSFreshFeatureExtractor

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)
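The shapes printed above are consistent with `train_test_split` holding out a quarter of the 211 series by default, with the test count rounded up. The sketch below reproduces that arithmetic with the standard library only; `default_split_sizes` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import math


def default_split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test size, rounding the test count up.

    Hypothetical helper: it mirrors the sizes printed above, not scikit-learn's code.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


print(default_split_sizes(211))  # (158, 53)
```

Because no `random_state` is passed to `train_test_split` in the cell above, the rows that end up in each part differ from run to run, so the outputs shown in this tutorial will not be reproduced exactly.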


In [4]:
X_train.head()

| | dim_0 |
|---|---|
| 40 | 0 -2.0324 1 -2.0386 2 -2.0182 3 ... |
| 172 | 0 -1.6033 1 -1.5874 2 -1.5774 3 ... |
| 30 | 0 -2.1310 1 -2.1095 2 -2.0842 3 ... |
| 55 | 0 -1.9399 1 -1.9377 2 -1.9112 3 ... |
| 7 | 0 -1.6336 1 -1.6432 2 -1.6137 3 ... |
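Each cell of `dim_0` above holds a whole time series rather than a single number. Feature extraction libraries such as tsfresh accept a "long" table instead, with one row per observation and columns identifying the instance and the time point. The following is a minimal sketch of that reshaping in plain Python, using the first three values of two rows shown above. The dictionary layout and the `nested_to_long` helper are assumptions made for illustration and are not sktime's actual conversion code.

```python
# Toy data: the first three values of instances 40 and 172 from the output above.
nested = {
    "dim_0": {
        40: [-2.0324, -2.0386, -2.0182],
        172: [-1.6033, -1.5874, -1.5774],
    }
}


def nested_to_long(nested):
    """Flatten {column: {instance_id: [values]}} into (id, column, time, value) rows."""
    rows = []
    for column, instances in nested.items():
        for instance_id, values in instances.items():
            for time_index, value in enumerate(values):
                rows.append((instance_id, column, time_index, value))
    return rows


long_rows = nested_to_long(nested)
for row in long_rows:
    print(row)
```

In the long layout, grouping by the instance id recovers one series per instance, which is the unit over which features are then computed.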


In [5]:
# multiclass classification task (three classes)
np.unique(y_train)

array(['0', '1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract tsfresh's "efficient" feature set from each training series
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_0__symmetry_looking__r_0.9500000000000001` | `dim_0__time_reversal_asymmetry_statistic__lag_1` | `dim_0__time_reversal_asymmetry_statistic__lag_2` | `dim_0__time_reversal_asymmetry_statistic__lag_3` | `dim_0__value_count__value_-1` | `dim_0__value_count__value_0` | `dim_0__value_count__value_1` | `dim_0__variance` | `dim_0__variance_larger_than_standard_deviation` | `dim_0__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 249.999518 | 85.066186 | 0.177512 | 0.213324 | 0.112479 | -0.186523 | -0.430731 | -1.135556 | 0.081582 | 0.826341 | ... | 1.0 | 0.066283 | 0.001462 | -0.013138 | 0.0 | 0.0 | 0.0 | 0.996014 | 0.0 | -2197362.0 |
| 1 | 250.000719 | 76.40354 | 0.352691 | 0.400189 | 0.084685 | -0.301243 | -0.665539 | -1.161841 | 0.088764 | 0.604482 | ... | 1.0 | 0.030505 | 0.004144 | -0.013691 | 0.0 | 0.0 | 0.0 | 0.996019 | 0.0 | -1616128.0 |
| 2 | 250.001068 | 92.131462 | 0.162395 | 0.18134 | 0.095384 | 0.047701 | -0.306166 | -1.178158 | 0.149008 | 0.991258 | ... | 1.0 | 0.068152 | 0.007123 | -0.017772 | 0.0 | 0.0 | 0.0 | 0.99602 | 0.0 | 526260.6 |
| 3 | 249.998689 | 83.050412 | 0.295672 | 0.311059 | 0.069416 | -0.299377 | -0.592225 | -1.143836 | 0.090197 | 0.663791 | ... | 1.0 | 0.049578 | 0.006796 | -0.022519 | 0.0 | 0.0 | 0.0 | 0.996011 | 0.0 | -6958301.0 |
| 4 | 249.999602 | 76.20934 | 0.306337 | 0.365832 | 0.107011 | -0.179761 | -0.470085 | -0.992107 | 0.07893 | 0.748175 | ... | 1.0 | 0.038551 | 0.010539 | 0.004764 | 0.0 | 0.0 | 0.0 | 0.996014 | 0.0 | 15656210.0 |
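To make a few of the column names above concrete, here is a small standard-library sketch of four of the simpler features, written from their usual definitions: sum of squares, sum of absolute first differences, population variance, and standard deviation divided by mean. The function bodies are illustrative re-implementations under those assumptions, not tsfresh's own code, and the input series is made up.

```python
import math


def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variance(x):
    """Population variance (divides by n, not n - 1)."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def variation_coefficient(x):
    """Standard deviation divided by the mean; undefined (NaN) for a zero mean."""
    mean = sum(x) / len(x)
    return math.sqrt(variance(x)) / mean if mean != 0 else float("nan")


series = [1.0, 2.0, 4.0, 7.0]  # made-up example series
features = {
    "abs_energy": abs_energy(series),
    "absolute_sum_of_changes": absolute_sum_of_changes(series),
    "variance": variance(series),
    "variation_coefficient": variation_coefficient(series),
}
print(features)
```

These definitions also suggest a reading of the table above. An `abs_energy` close to 250 together with a `variance` close to 0.996 (about 250/251) is what one would expect from series of 251 points that were standardised to zero mean and unit sample variance. That is an inference from the numbers rather than something stated in this tutorial. A mean very close to zero would likewise explain the extreme `variation_coefficient` values.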


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

0.8113207547169812
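The pipeline above chains two steps: on `fit`, the extractor turns the training series into a feature table and the classifier is trained on it; on `score`, the test series go through the same extractor before prediction, and the fraction of correct predictions is returned. The sketch below imitates that flow in plain Python. `ToyFeatureExtractor`, `NearestCentroid` and `ToyPipeline` are hypothetical stand-ins written for this illustration and are far simpler than tsfresh, a random forest, or scikit-learn's `Pipeline`; the data is made up.

```python
class ToyFeatureExtractor:
    """Hypothetical extractor: maps each series to [mean, sum of squares]."""

    def fit(self, X, y=None):
        return self  # nothing to learn for these two features

    def transform(self, X):
        return [[sum(x) / len(x), sum(v * v for v in x)] for x in X]


class NearestCentroid:
    """Hypothetical tabular classifier: predicts the class with the closest mean row."""

    def fit(self, X, y):
        grouped = {}
        for row, label in zip(X, y):
            grouped.setdefault(label, []).append(row)
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, X):
        def squared_distance(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids_, key=lambda c: squared_distance(row, self.centroids_[c]))
            for row in X
        ]


class ToyPipeline:
    """Hypothetical two-step pipeline: transformer first, then estimator."""

    def __init__(self, transformer, estimator):
        self.transformer = transformer
        self.estimator = estimator

    def fit(self, X, y):
        Xt = self.transformer.fit(X, y).transform(X)
        self.estimator.fit(Xt, y)
        return self

    def score(self, X, y):
        predictions = self.estimator.predict(self.transformer.transform(X))
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Made-up series: class "0" stays near zero, class "1" stays near five.
toy_X_train = [[0.0, 0.1, -0.1], [0.2, 0.0, 0.1], [5.0, 5.5, 4.5], [6.0, 5.0, 5.5]]
toy_y_train = ["0", "0", "1", "1"]
toy_X_test = [[0.1, 0.1, 0.0], [5.2, 5.1, 4.9]]
toy_y_test = ["0", "1"]

toy_pipeline = ToyPipeline(ToyFeatureExtractor(), NearestCentroid())
toy_pipeline.fit(toy_X_train, toy_y_train)
toy_score = toy_pipeline.score(toy_X_test, toy_y_test)
print(toy_score)
```

Read the same way, the score of 0.8113 reported above corresponds to 43 of the 53 test series being classified correctly. Since neither the split nor the random forest is seeded, that figure will vary between runs.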

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

| | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|---|---|---|---|---|---|---|
| 2 | 0 -0.663284 1 -0.663284 2 5.393924 3... | 0 0.273010 1 0.273010 2 -3.079673 3... | 0 -0.160963 1 -0.160963 2 -3.175911 3... | 0 -0.245030 1 -0.245030 2 -6.408074 3... | 0 -0.077238 1 -0.077238 2 0.471417 3... | 0 -0.018644 1 -0.018644 2 -3.592890 3... |
| 10 | 0 0.300413 1 0.300413 2 -1.96499... | 0 0.727580 1 0.727580 2 -0.30055... | 0 0.878731 1 0.878731 2 -1.226914 3... | 0 -0.082565 1 -0.082565 2 -0.631219 3... | 0 -0.055931 1 -0.055931 2 0.039951 3... | 0 0.668507 1 0.668507 2 0.130505 3... |
| 3 | 0 -1.088052 1 -1.088052 2 -0.683620 3... | 0 0.183832 1 0.183832 2 -2.909047 3... | 0 -0.260871 1 -0.260871 2 1.507042 3... | 0 -0.284981 1 -0.284981 2 0.415486 3... | 0 0.487397 1 0.487397 2 0.013317 3... | 0 1.081329 1 1.081329 2 0.820319 3... |
| 22 | 0 0.175924 1 0.175924 2 0.194403 3... | 0 0.548757 1 0.548757 2 -3.699192 3... | 0 -1.191314 1 -1.191314 2 -0.554051 3... | 0 0.039951 1 0.039951 2 0.042614 3... | 0 0.263674 1 0.263674 2 -0.178446 3... | 0 0.937507 1 0.937507 2 0.071911 3... |
| 17 | 0 0.324449 1 0.324449 2 9.29442... | 0 -0.977516 1 -0.977516 2 -6.96322... | 0 -1.260218 1 -1.260218 2 -2.498493 3... | 0 -0.788358 1 -0.788358 2 2.434323 3... | 0 0.316941 1 0.316941 2 -0.079901 3... | 0 0.588605 1 0.588605 2 6.535916 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_5__symmetry_looking__r_0.9500000000000001` | `dim_5__time_reversal_asymmetry_statistic__lag_1` | `dim_5__time_reversal_asymmetry_statistic__lag_2` | `dim_5__time_reversal_asymmetry_statistic__lag_3` | `dim_5__value_count__value_-1` | `dim_5__value_count__value_0` | `dim_5__value_count__value_1` | `dim_5__variance` | `dim_5__variance_larger_than_standard_deviation` | `dim_5__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.035273 | 42.580855 | -0.010558 | -0.020379 | 0.007655 | 2.764465 | 0.148212 | -0.546997 | 1.128408 | 5.393924 | ... | 1.0 | 0.033806 | 0.188196 | 0.241035 | 0.0 | 1.0 | 0.0 | 0.399832 | 0.0 | -11.120113 |
| 1 | 15733.291175 | 830.921324 | 0.007213 | -0.050807 | 0.148321 | 17.528861 | 4.097916 | -16.694289 | 175.191225 | 20.613626 | ... | 1.0 | 16.575667 | 53.551991 | 65.844355 | 0.0 | 0.0 | 0.0 | 26.867367 | 1.0 | -51.801132 |
| 2 | 9.885558 | 18.719357 | -0.015923 | -0.009914 | 0.009444 | 0.416747 | -0.155594 | -0.789909 | 0.16028 | 1.732735 | ... | 1.0 | 0.014714 | 0.009011 | -0.016646 | 0.0 | 0.0 | 0.0 | 0.217038 | 0.0 | 19.242999 |
| 3 | 383.560959 | 127.01821 | -0.000228 | 0.001823 | 0.125313 | 3.757798 | 1.359549 | -0.84666 | 2.566076 | 4.442557 | ... | 1.0 | 0.2535 | 0.439563 | 1.125258 | 0.0 | 0.0 | 0.0 | 2.834777 | 1.0 | -56.442891 |
| 4 | 13876.020277 | 736.256653 | 0.005391 | -0.171305 | 0.18451 | 14.628194 | 5.399811 | -12.349274 | 75.352924 | 18.280817 | ... | 1.0 | 2.145671 | 14.394118 | 24.538999 | 0.0 | 0.0 | 0.0 | 26.76324 | 1.0 | -23.953564 |
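In the multivariate output above, every feature column starts with the name of the dimension it was computed from (`dim_0` through `dim_5`), followed by a double underscore and the feature name. A minimal sketch of how that prefix could be used to group columns by dimension is shown below. The list contains a handful of column names copied from the output, and `source_dimension` is a hypothetical helper.

```python
from collections import Counter

# A few column names copied from the feature table above.
columns = [
    "dim_0__abs_energy",
    "dim_0__absolute_sum_of_changes",
    'dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40',
    "dim_5__value_count__value_-1",
    "dim_5__variance",
    "dim_5__variation_coefficient",
]


def source_dimension(column_name):
    """Return the part of the column name before the first double underscore."""
    return column_name.split("__", 1)[0]


columns_per_dimension = Counter(source_dimension(name) for name in columns)
print(dict(columns_per_dimension))
```

Applied to the full list of columns, the same grouping makes it easy to select or compare the features of a single dimension.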
