# Feature extraction with tsfresh transformer

In this tutorial, we show how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series and then pass those features to any scikit-learn estimator.

## Preliminaries
If you have not installed tsfresh yet, uncomment and run the line in the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
from sktime.transformers.series_as_features.summarize import \
    TSFreshFeatureExtractor

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)


In [4]:
X_train.head()

,dim_0
114,0 -2.1830 1 -2.1270 2 -2.1029 3 ...
14,0 -2.1888 1 -2.1855 2 -2.1765 3 ...
19,0 -1.9190 1 -1.8946 2 -1.8713 3 ...
37,0 -2.0220 1 -2.0166 2 -2.0074 3 ...
0,0 -1.9078 1 -1.9049 2 -1.8886 3 ...


In [5]:
# multiclass classification task (three classes)
np.unique(y_train)

array(['0', '1', '2'], dtype=object)

## Using tsfresh to extract features
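
As a rough intuition for what the extracted columns contain, the sketch below computes three of the features that appear in the tsfresh output (`abs_energy`, `absolute_sum_of_changes` and `variance`) by hand for two toy series. It is a plain-Python illustration, assuming these features follow their usual definitions (sum of squares, sum of absolute consecutive differences, population variance). The helper `extract_features` and the toy data are hypothetical and not part of tsfresh or sktime.

```python
def abs_energy(series):
    """Sum of squared values."""
    return sum(v * v for v in series)


def absolute_sum_of_changes(series):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(series, series[1:]))


def variance(series):
    """Population variance (divides by n, not n - 1)."""
    mean = sum(series) / len(series)
    return sum((v - mean) ** 2 for v in series) / len(series)


def extract_features(collection):
    """Hypothetical helper: map each series to one row of scalar features."""
    return [
        {
            "abs_energy": abs_energy(series),
            "absolute_sum_of_changes": absolute_sum_of_changes(series),
            "variance": variance(series),
        }
        for series in collection
    ]


# Two toy series standing in for the rows of X_train.
toy_series = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]]
feature_table = extract_features(toy_series)
for row in feature_table:
    print(row)
```

Each series becomes one row of scalar features, which is the tabular shape that scikit-learn estimators expect.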

In [6]:
# extract the "efficient" tsfresh feature set for every series
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_0__symmetry_looking__r_0.9500000000000001,dim_0__time_reversal_asymmetry_statistic__lag_1,dim_0__time_reversal_asymmetry_statistic__lag_2,dim_0__time_reversal_asymmetry_statistic__lag_3,dim_0__value_count__value_-1,dim_0__value_count__value_0,dim_0__value_count__value_1,dim_0__variance,dim_0__variance_larger_than_standard_deviation,dim_0__variation_coefficient
0,250.000476,81.492342,0.28821,0.303148,0.066675,-0.279789,-0.579471,-0.977028,0.063121,0.487878,...,1.0,0.05824,0.004186,-0.034151,0.0,0.0,0.0,0.996018,0.0,-2105040.0
1,249.999702,93.62735,0.215607,0.181405,0.065005,-0.096802,-0.532722,-1.274629,0.158609,0.711313,...,1.0,0.059727,0.000547,-0.043075,0.0,0.0,0.0,0.996015,0.0,1240096.0
2,249.998983,87.13823,0.024528,0.042379,0.18068,0.413121,-0.022015,-1.236532,0.281757,1.123744,...,1.0,0.060646,0.006645,-0.030327,0.0,0.0,0.0,0.996012,0.0,-7157114.0
3,250.000079,88.776582,0.118339,0.141308,0.122485,0.169655,-0.222628,-1.009743,0.140384,1.192656,...,1.0,0.061157,-0.006034,-0.022523,0.0,0.0,0.0,0.996016,0.0,-5825571.0
4,249.998915,85.371492,0.189196,0.221467,0.109071,-0.122742,-0.404003,-1.165574,0.126268,0.835701,...,1.0,0.056683,0.003702,-0.024615,0.0,0.0,0.0,0.996012,0.0,-416111.2
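
The column names in the table above encode the dimension, the feature and any feature parameters, separated by double underscores. The sketch below splits a few of these names back into their parts. It is a simple heuristic that assumes each parameter has the form `name_value` with no underscore inside the value, which holds for the columns shown here but may not hold for every tsfresh feature. The function `parse_feature_name` is hypothetical and not part of tsfresh or sktime.

```python
def parse_feature_name(column):
    """Split a column name into (dimension, feature, parameters)."""
    dimension, feature, *raw_params = column.split("__")
    params = {}
    for raw in raw_params:
        # The value is assumed to follow the last underscore.
        name, value = raw.rsplit("_", 1)
        params[name] = value.strip('"')
    return dimension, feature, params


columns = [
    "dim_0__abs_energy",
    'dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40',
    "dim_0__value_count__value_-1",
]
parsed_columns = [parse_feature_name(c) for c in columns]
for parsed in parsed_columns:
    print(parsed)
```

Parameter values are kept as strings here; converting them to numbers is left out to keep the sketch short.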


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


0.9056603773584906
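
For a scikit-learn classifier, `score` returns the mean accuracy on the given test data, that is, the fraction of test series whose predicted label matches the true label. The sketch below spells out that calculation on made-up labels; the `accuracy` function and the toy labels are illustrative only and are not taken from the run above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels using the same three class names as the data set.
toy_true = ["0", "1", "2", "1"]
toy_pred = ["0", "1", "1", "1"]
toy_score = accuracy(toy_true, toy_pred)
print(toy_score)
```

With 53 test series, the score reported above is consistent with 48 correct predictions, since 48 / 53 ≈ 0.9057.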

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
7,0 -0.366403 1 -0.366403 2 0.126830 3...,0 0.331289 1 0.331289 2 1.060159 3...,0 -0.817845 1 -0.817845 2 0.285714 3...,0 -0.167792 1 -0.167792 2 0.170456 3...,0 -0.093218 1 -0.093218 2 -0.117188 3...,0 0.157139 1 0.157139 2 0.719111 3...
22,0 -0.697643 1 -0.697643 2 -0.199924 3...,0 -0.561693 1 -0.561693 2 -0.820724 3...,0 -0.950458 1 -0.950458 2 1.146612 3...,0 -1.158567 1 -1.158567 2 -0.479407 3...,0 0.727101 1 0.727101 2 -0.410159 3...,0 -1.376964 1 -1.376964 2 0.130505 3...
24,0 0.279069 1 0.279069 2 0.707552 3...,0 -0.110199 1 -0.110199 2 -0.031651 3...,0 0.142979 1 0.142979 2 0.919129 3...,0 -0.026634 1 -0.026634 2 -0.631219 3...,0 -0.010653 1 -0.010653 2 0.290308 3...,0 0.077238 1 0.077238 2 0.122515 3...
5,0 -0.357300 1 -0.357300 2 -0.005055 3...,0 -0.584885 1 -0.584885 2 0.295037 3...,0 -0.792751 1 -0.792751 2 0.213664 3...,0 0.074574 1 0.074574 2 -0.157139 3...,0 0.159802 1 0.159802 2 -0.306288 3...,0 0.023970 1 0.023970 2 1.230478 3...
15,0 -0.359319 1 -0.359319 2 4.011746 3...,0 0.152819 1 0.152819 2 1.04881...,0 -0.064578 1 -0.064578 2 -1.804903 3...,0 0.039951 1 0.039951 2 -1.347667 3...,0 -0.042614 1 -0.042614 2 0.572625 3...,0 0.125179 1 0.125179 2 -1.861697 3...


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_5__symmetry_looking__r_0.9500000000000001,dim_5__time_reversal_asymmetry_statistic__lag_1,dim_5__time_reversal_asymmetry_statistic__lag_2,dim_5__time_reversal_asymmetry_statistic__lag_3,dim_5__value_count__value_-1,dim_5__value_count__value_0,dim_5__value_count__value_1,dim_5__variance,dim_5__variance_larger_than_standard_deviation,dim_5__variation_coefficient
0,3.691654,14.661708,-0.00083,0.003338,0.008945,0.196186,-0.098368,-0.380317,0.033669,0.301496,...,1.0,-0.009847,-0.015825,-0.003979,0.0,0.0,0.0,0.166792,0.0,-79.451065
1,185.780037,96.462935,-0.003155,-0.014877,0.02083,3.955252,0.940692,-0.925605,2.686945,4.321427,...,1.0,0.02483,0.211492,0.586751,0.0,0.0,0.0,1.53744,1.0,-27.827278
2,344.052224,119.475536,0.002197,0.007196,0.05898,2.568986,0.562201,-1.055251,1.247625,3.955685,...,1.0,0.016007,0.265807,0.968304,0.0,0.0,0.0,2.364335,1.0,-27.716209
3,3.700726,10.71741,-0.043919,-0.046901,0.011366,0.056305,-0.146979,-0.344786,0.021389,0.188406,...,1.0,-0.000431,-0.000598,-0.000364,0.0,1.0,0.0,0.043535,0.0,6.718766
4,18144.435446,998.086426,0.005495,-0.061116,0.163137,21.954887,5.377214,-18.450373,190.873929,23.318779,...,1.0,20.476114,23.430968,52.614071,0.0,0.0,0.0,32.054682,1.0,1742.432343
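
The output above has columns running from `dim_0__...` to `dim_5__...`, which suggests that features are computed separately for each dimension and then placed side by side in one row per instance. The sketch below mimics that layout in plain Python with two simple features. The helper `extract_row`, the `FEATURES` mapping and the toy instance are hypothetical, and the sketch makes no claim about how tsfresh or sktime implement this internally.

```python
def mean(series):
    """Arithmetic mean of the series."""
    return sum(series) / len(series)


# Hypothetical feature set: feature name -> function of one series.
FEATURES = {"maximum": max, "mean": mean}


def extract_row(instance):
    """Compute every feature for every dimension of one instance.

    `instance` maps a dimension name to that dimension's list of values.
    """
    row = {}
    for dimension, series in instance.items():
        for name, func in FEATURES.items():
            row[f"{dimension}__{name}"] = func(series)
    return row


# One toy instance with two dimensions.
toy_instance = {"dim_0": [0.0, 1.0, 2.0], "dim_1": [2.0, 2.0, 5.0]}
toy_row = extract_row(toy_instance)
print(toy_row)
```

The number of columns grows with the number of dimensions times the number of features, so a multivariate data set yields a much wider feature table than a univariate one.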
