# Feature extraction with tsfresh transformer

In this tutorial, we show how to use `sktime` with [`tsfresh`](https://tsfresh.readthedocs.io) to extract features from time series. The extracted features form an ordinary table, so they can be passed to any scikit-learn estimator.

## Preliminaries
If you have not installed `tsfresh` yet, uncomment the line in the cell below and run it:

In [1]:
# !pip install --upgrade tsfresh
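
Before running the rest of the notebook, it can help to confirm that the required packages are importable. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is our own and is not part of `sktime` or `tsfresh`.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    return find_spec(package_name) is not None


# Report the status of the two packages this notebook depends on.
for name in ("tsfresh", "sktime"):
    status = "available" if is_installed(name) else "missing, install it with pip"
    print(f"{name}: {status}")
```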

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_gunpoint
from sktime.transformers.series_as_features.summarize import (
    TSFreshFeatureExtractor,
)

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/01_classification_univariate.ipynb).

In [3]:
X, y = load_gunpoint(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(150, 1) (150,) (50, 1) (50,)
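
`train_test_split` shuffles the data, so the rows shown in this notebook and the final score will differ from run to run. The following is a minimal sketch of a reproducible, class-balanced split on toy arrays of the same size; `random_state=42`, `stratify=y`, and the `*_demo` names are arbitrary choices for illustration and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the data: 200 instances, two balanced classes.
X_demo = np.arange(200).reshape(-1, 1)
y_demo = np.array(["1", "2"] * 100)

# Fixing random_state makes the shuffle repeatable; stratify keeps class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)
print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)

# Repeating the call with the same random_state gives the same partition.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)
print(np.array_equal(X_te, X_te2))
```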


In [4]:
X_train.head()

| | dim_0 |
|---|---|
| 11 | 0 -1.1368 1 -1.1407 2 -1.1395 3 ... |
| 52 | 0 -0.71492 1 -0.71512 2 -0.71631 3... |
| 36 | 0 -0.71406 1 -0.71635 2 -0.71743 3... |
| 40 | 0 -0.74412 1 -0.73426 2 -0.73051 3... |
| 100 | 0 -1.02240 1 -1.02210 2 -1.02400 3... |
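
The output above shows the nested data format used throughout this notebook: one row per instance, one column per dimension, and each cell holding an entire time series as a `pandas.Series`. The following is a minimal sketch of building such a frame by hand; the random values, the three instances, and the series length of 5 are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_instances, series_length = 3, 5

# Each cell of the single column "dim_0" stores a whole time series.
X_nested = pd.DataFrame(
    {"dim_0": [pd.Series(rng.normal(size=series_length)) for _ in range(n_instances)]}
)

print(X_nested.shape)             # (n_instances, n_dimensions)
print(type(X_nested.iloc[0, 0]))  # type of a single cell
print(len(X_nested.iloc[0, 0]))   # number of time points in the first series
```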


In [5]:
# binary classification task
np.unique(y_train)

array(['1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract the "efficient" tsfresh feature set from each series
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:17<00:00,  3.56s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_0__symmetry_looking__r_0.9500000000000001` | `dim_0__time_reversal_asymmetry_statistic__lag_1` | `dim_0__time_reversal_asymmetry_statistic__lag_2` | `dim_0__time_reversal_asymmetry_statistic__lag_3` | `dim_0__value_count__value_-1` | `dim_0__value_count__value_0` | `dim_0__value_count__value_1` | `dim_0__variance` | `dim_0__variance_larger_than_standard_deviation` | `dim_0__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 148.999163 | 36.16706 | 0.498699 | 0.534719 | 0.042941 | -0.656536 | -1.05401 | -1.380399 | 0.158766 | 1.064133 | ... | 1.0 | 0.000591 | 0.000292 | -1e-05 | 0.0 | 0.0 | 0.0 | 0.993328 | 0.0 | -1451444.0 |
| 1 | 148.999513 | 26.201922 | 0.503264 | 0.558623 | 0.076979 | -1.163242 | -1.169487 | -0.81163 | -0.103877 | -0.220482 | ... | 1.0 | 0.004402 | 0.006452 | 0.005367 | 0.0 | 0.0 | 0.0 | 0.99333 | 0.0 | -874262.7 |
| 2 | 148.999088 | 26.73083 | 0.477126 | 0.532071 | 0.081839 | -1.115962 | -1.132453 | -0.797908 | -0.105883 | -0.305682 | ... | 1.0 | 0.006002 | 0.011062 | 0.009728 | 0.0 | 0.0 | 0.0 | 0.993327 | 0.0 | -526403.9 |
| 3 | 148.999113 | 25.84131 | 0.516237 | 0.567352 | 0.07405 | -1.122074 | -1.199491 | -0.887698 | -0.075442 | 0.108838 | ... | 1.0 | -0.00103 | -0.002834 | -0.002939 | 0.0 | 0.0 | 0.0 | 0.993327 | 0.0 | -625517.7 |
| 4 | 148.999976 | 34.84268 | 0.456459 | 0.470207 | 0.051558 | -0.9243 | -0.937981 | -1.036303 | 0.020113 | -0.555833 | ... | 1.0 | 0.002921 | 0.007479 | 0.01324 | 0.0 | 0.0 | 0.0 | 0.993333 | 0.0 | -922834.3 |
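
Each output column is one scalar summary of a series, named after the input column, the feature, and any feature parameters. As a rough illustration, the sketch below recomputes three of the simpler features with NumPy on a made-up four-point series. It assumes the usual definitions (sum of squares, sum of absolute first differences, population variance) and is meant to convey the idea, not to replace the `tsfresh` implementations.

```python
import numpy as np

# A made-up series, chosen so the results are easy to check by hand.
x = np.array([1.0, 3.0, 2.0, 5.0])

abs_energy = np.sum(x ** 2)                           # 1 + 9 + 4 + 25
absolute_sum_of_changes = np.sum(np.abs(np.diff(x)))  # |2| + |-1| + |3|
variance = np.var(x)                                  # population variance (ddof=0)

print({
    "abs_energy": abs_energy,
    "absolute_sum_of_changes": absolute_sum_of_changes,
    "variance": variance,
})
```

For a series with mean close to zero, `abs_energy` is approximately the series length times the variance. That is consistent with the table above, where an energy near 149 and a variance near 0.9933 point to series of roughly 150 time points.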


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:17<00:00,  3.57s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:05<00:00,  1.18s/it]




0.98
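
The pipeline above follows the standard scikit-learn pattern: a transformer turns each series into a fixed-length feature vector, and a tabular classifier is fitted on the result. The sketch below reproduces that pattern without `tsfresh`, using a hand-written feature function and synthetic data so that it runs in well under a second. The function `summary_features`, the two synthetic classes, and the choice of three features are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def summary_features(X):
    """Map an (n_instances, n_timepoints) array to three features per series."""
    return np.column_stack([X.mean(axis=1), X.std(axis=1), (X ** 2).sum(axis=1)])


rng = np.random.default_rng(42)
n_per_class, n_timepoints = 40, 50

# Class "1": low-variance noise; class "2": high-variance noise.
X_demo = np.vstack([
    rng.normal(scale=1.0, size=(n_per_class, n_timepoints)),
    rng.normal(scale=3.0, size=(n_per_class, n_timepoints)),
])
y_demo = np.array(["1"] * n_per_class + ["2"] * n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# Feature extraction and classification are chained into one estimator.
clf = make_pipeline(
    FunctionTransformer(summary_features),
    RandomForestClassifier(random_state=0),
)
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(accuracy)
```

Because feature extraction is a pipeline step, it runs once on the training data during `fit` and again on the test data during `score`. This is why the cell above reports feature extraction twice.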

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

| | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|---|---|---|---|---|---|---|
| 3 | 0 -1.088052 1 -1.088052 2 -0.683620 3... | 0 0.183832 1 0.183832 2 -2.909047 3... | 0 -0.260871 1 -0.260871 2 1.507042 3... | 0 -0.284981 1 -0.284981 2 0.415486 3... | 0 0.487397 1 0.487397 2 0.013317 3... | 0 1.081329 1 1.081329 2 0.820319 3... |
| 12 | 0 2.221946 1 2.221946 2 -7.70417... | 0 -0.783638 1 -0.783638 2 -4.56992... | 0 0.142401 1 0.142401 2 2.447367 3... | 0 0.055931 1 0.055931 2 -0.442120 3... | 0 0.071911 1 0.071911 2 0.010653 3... | 0 0.226387 1 0.226387 2 -1.978886 3... |
| 30 | 0 -0.623875 1 -0.623875 2 -1.081529 3... | 0 -2.123436 1 -2.123436 2 -0.121519 3... | 0 -0.513654 1 -0.513654 2 0.809464 3... | 0 -0.143822 1 -0.143822 2 -1.081329 3... | 0 0.058594 1 0.058594 2 -0.127842 3... | 0 1.086656 1 1.086656 2 0.066584 3... |
| 39 | 0 1.211973 1 1.211973 2 -0.605948 3... | 0 -0.247107 1 -0.247107 2 -3.855673 3... | 0 0.327837 1 0.327837 2 7.113185 3... | 0 0.058594 1 0.058594 2 0.900220 3... | 0 -0.527348 1 -0.527348 2 -1.326360 3... | 0 -0.042614 1 -0.042614 2 -0.095881 3... |
| 8 | 0 -0.342233 1 -0.342233 2 -0.298542 3... | 0 0.327415 1 0.327415 2 -0.527154 3... | 0 0.157229 1 0.157229 2 0.248585 3... | 0 0.394179 1 0.394179 2 -0.037287 3... | 0 0.074574 1 0.074574 2 -0.087891 3... | 0 -0.037287 1 -0.037287 2 -0.050604 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:40<00:00,  8.03s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_5__symmetry_looking__r_0.9500000000000001` | `dim_5__time_reversal_asymmetry_statistic__lag_1` | `dim_5__time_reversal_asymmetry_statistic__lag_2` | `dim_5__time_reversal_asymmetry_statistic__lag_3` | `dim_5__value_count__value_-1` | `dim_5__value_count__value_0` | `dim_5__value_count__value_1` | `dim_5__variance` | `dim_5__variance_larger_than_standard_deviation` | `dim_5__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.885558 | 18.719357 | -0.015923 | -0.009914 | 0.009444 | 0.416747 | -0.155594 | -0.789909 | 0.16028 | 1.732735 | ... | 1.0 | 0.014714 | 0.009011 | -0.016646 | 0.0 | 0.0 | 0.0 | 0.217038 | 0.0 | 19.242999 |
| 1 | 10701.446629 | 758.995994 | -0.001368 | -0.138129 | 0.176665 | 13.975994 | 1.681442 | -14.370603 | 98.81245 | 17.69389 | ... | 1.0 | 6.485339 | 24.752491 | 49.024603 | 0.0 | 0.0 | 0.0 | 20.123652 | 1.0 | 29.251607 |
| 2 | 4402.264342 | 368.61827 | -0.02578 | -0.068173 | 0.027653 | 14.404859 | 3.550912 | -2.012951 | 33.151709 | 28.820866 | ... | 1.0 | -81.172944 | 21.164744 | 10.776898 | 0.0 | 0.0 | 0.0 | 15.749446 | 1.0 | -112.371986 |
| 3 | 8508.951625 | 459.048651 | -0.004866 | -0.047343 | 0.044864 | 19.405911 | 7.093272 | -1.529865 | 54.095927 | 29.210932 | ... | 1.0 | -41.246265 | 26.299019 | 59.697679 | 0.0 | 0.0 | 0.0 | 21.230098 | 1.0 | 58.366824 |
| 4 | 5.56889 | 14.259636 | -0.004129 | -0.011817 | 0.01239 | 0.0842 | -0.217573 | -0.366173 | 0.022779 | 0.437525 | ... | 1.0 | -9.6e-05 | -0.00075 | 0.001248 | 0.0 | 4.0 | 0.0 | 0.027556 | 0.0 | 11.332234 |
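
For multivariate input, the output above suggests that features are extracted from each dimension separately and that every column name carries its dimension as a prefix (`dim_0__...` through `dim_5__...`). The sketch below mimics that layout on a small synthetic nested frame with two dimensions. The helper `extract_per_dimension` and its two features are hypothetical and only illustrate the naming scheme, not how `tsfresh` computes its features.

```python
import numpy as np
import pandas as pd


def extract_per_dimension(X_nested):
    """Compute two summary features per cell, prefixed with the column name."""
    features = {}
    for dim in X_nested.columns:
        # Convert every nested series in this dimension to a NumPy array.
        arrays = [cell.to_numpy() for cell in X_nested[dim]]
        features[f"{dim}__mean"] = [values.mean() for values in arrays]
        features[f"{dim}__variance"] = [values.var() for values in arrays]
    return pd.DataFrame(features, index=X_nested.index)


rng = np.random.default_rng(0)
# Synthetic nested frame: 4 instances, 2 dimensions, 10 time points per series.
X_nested = pd.DataFrame({
    "dim_0": [pd.Series(rng.normal(size=10)) for _ in range(4)],
    "dim_1": [pd.Series(rng.normal(size=10)) for _ in range(4)],
})

Xt_demo = extract_per_dimension(X_nested)
print(list(Xt_demo.columns))
print(Xt_demo.shape)
```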
