# Feature extraction with tsfresh transformer

In this tutorial, we show how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series. The extracted features can then be passed to any scikit-learn estimator.

## Preliminaries
Install tsfresh if you haven't already. To do so, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
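# Import path for the sktime version used in this notebook; newer releases
# expose the class as sktime.transformations.panel.tsfresh.TSFreshFeatureExtractor.
from sktime.transformers.series_as_features.summarize import (
    TSFreshFeatureExtractor,
)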

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)
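
The shapes above follow from the defaults of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the samples go to the test set, with the test count rounded up. A minimal sketch of that arithmetic, where `split_sizes` is a hypothetical helper (not part of scikit-learn) and 211 is the total implied by 158 + 53:

```python
import math


def split_sizes(n_samples, test_size=0.25):
    """Hypothetical helper: derive train/test counts the way the default split does."""
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    n_train = n_samples - n_test               # the remainder goes to training
    return n_train, n_test


print(split_sizes(211))  # (158, 53)
```

Because the split is random, the rows that end up in each set change from run to run unless `random_state` is passed to `train_test_split`.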


In [4]:
X_train.head()

|     | dim_0 |
| --- | --- |
| 112 | 0 -1.8515 1 -1.8436 2 -1.8209 3 ... |
| 72 | 0 -1.7289 1 -1.6855 2 -1.6324 3 ... |
| 28 | 0 -2.0043 1 -1.9856 2 -1.9533 3 ... |
| 26 | 0 -1.9412 1 -1.9387 2 -1.9353 3 ... |
| 170 | 0 -1.6251 1 -1.6230 2 -1.6261 3 ... |


In [5]:
# multiclass classification task (three classes)
np.unique(y_train)

array(['0', '1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract the "efficient" tsfresh feature set for each series
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:22<00:00,  4.52s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_0__symmetry_looking__r_0.9500000000000001` | `dim_0__time_reversal_asymmetry_statistic__lag_1` | `dim_0__time_reversal_asymmetry_statistic__lag_2` | `dim_0__time_reversal_asymmetry_statistic__lag_3` | `dim_0__value_count__value_-1` | `dim_0__value_count__value_0` | `dim_0__value_count__value_1` | `dim_0__variance` | `dim_0__variance_larger_than_standard_deviation` | `dim_0__variation_coefficient` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 249.998715 | 81.240312 | 0.244798 | 0.29451 | 0.107511 | -0.251815 | -0.510598 | -1.123656 | 0.088376 | 0.709372 | ... | 1.0 | 0.060355 | 0.015805 | 0.001493 | 0.0 | 0.0 | 0.0 | 0.996011 | 0.0 | 1001995.0 |
| 1 | 249.997573 | 78.139454 | 0.18919 | 0.244795 | 0.142444 | 0.024702 | -0.25029 | -0.833451 | 0.082725 | 0.88657 | ... | 1.0 | 0.039653 | -0.000263 | -0.012901 | 0.0 | 0.0 | 0.0 | 0.996006 | 0.0 | -517558.4 |
| 2 | 250.000022 | 84.41367 | 0.269949 | 0.282165 | 0.075803 | -0.197516 | -0.547911 | -1.120524 | 0.095315 | 0.674956 | ... | 1.0 | 0.055973 | 0.009375 | -0.022453 | 0.0 | 0.0 | 0.0 | 0.996016 | 0.0 | 920954.1 |
| 3 | 249.999306 | 86.218978 | 0.090826 | 0.112809 | 0.145072 | 0.247431 | -0.157703 | -1.0929 | 0.185859 | 1.281764 | ... | 1.0 | 0.059999 | -0.00695 | -0.026415 | 0.0 | 0.0 | 0.0 | 0.996013 | 0.0 | -1727580.0 |
| 4 | 249.999532 | 78.012346 | 0.35069 | 0.396303 | 0.081213 | -0.294125 | -0.681645 | -1.209748 | 0.105016 | 0.681067 | ... | 1.0 | 0.031419 | 0.002625 | -0.017964 | 0.0 | 0.0 | 0.0 | 0.996014 | 0.0 | -1304684.0 |
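
To make a few of these columns concrete, here is a minimal sketch that recomputes three of the simpler features by hand, using the standard definitions: `abs_energy` is the sum of squared values, `absolute_sum_of_changes` is the sum of absolute consecutive differences, and `variance` is the population variance. The functions are plain-Python stand-ins, not tsfresh's own code, and the input series is synthetic. The values in the table (`abs_energy` of about 250, `variance` of about 0.996) are consistent with series of length 251 that were standardized with the sample standard deviation; that is an inference from the numbers, not something the data loader states.

```python
import math
import statistics


def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variance(x):
    """Population variance (divides by n, not n - 1)."""
    return statistics.pvariance(x)


# Synthetic series of length 251 (an assumption for illustration),
# standardized with the sample standard deviation.
raw = [math.sin(0.1 * i) + 0.01 * i for i in range(251)]
mu, sd = statistics.fmean(raw), statistics.stdev(raw)
z = [(v - mu) / sd for v in raw]

print(round(abs_energy(z), 3))  # 250.0, i.e. n - 1
print(round(variance(z), 6))    # 0.996016, i.e. (n - 1) / n
print(absolute_sum_of_changes(z))
```

The same reasoning suggests why `variation_coefficient` is so large in magnitude: it divides the standard deviation by the mean, and the mean of a standardized series is very close to zero.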


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:23<00:00,  4.64s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:07<00:00,  1.51s/it]




0.8113207547169812
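
Two feature-extraction runs appear in the output above because the pipeline extracts features once from the training set during `fit` and once from the test set during `score`. The sketch below shows the same pattern with scikit-learn only. It rests on assumptions made for illustration: `summary_features` is a hypothetical hand-written extractor standing in for `TSFreshFeatureExtractor`, and the data are synthetic sine waves stored as a plain 2D array rather than sktime's nested format.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def summary_features(X):
    """Hypothetical extractor: map each row (one series) to three summary features."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([
        X.mean(axis=1),                          # mean
        X.var(axis=1),                           # population variance
        np.abs(np.diff(X, axis=1)).sum(axis=1),  # absolute sum of changes
    ])


# Synthetic data (an assumption): noisy sine waves at two frequencies.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)
freqs = np.repeat([2.0, 8.0], 40)
X_demo = np.sin(2 * np.pi * freqs[:, None] * t) + 0.1 * rng.standard_normal((80, 100))
y_demo = np.repeat(["slow", "fast"], 40)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

demo_pipeline = make_pipeline(
    FunctionTransformer(summary_features),
    RandomForestClassifier(random_state=0),
)
demo_pipeline.fit(Xd_train, yd_train)               # features from train, then fit the forest
demo_score = demo_pipeline.score(Xd_test, yd_test)  # features from test, then predict
print(demo_score)
```

Keeping the extractor inside the pipeline means the same transformation is applied to training and test data without any manual bookkeeping.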

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

|     | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
| --- | --- | --- | --- | --- | --- | --- |
| 11 | 0 -0.193013 1 -0.193013 2 2.40398... | 0 -0.106266 1 -0.106266 2 0.52392... | 0 -0.636563 1 -0.636563 2 -1.166243 3... | 0 -0.087891 1 -0.087891 2 -2.716640 3... | 0 0.010653 1 0.010653 2 1.297062 3... | 0 0.205080 1 0.205080 2 -0.609912 3... |
| 7 | 0 -0.366403 1 -0.366403 2 0.126830 3... | 0 0.331289 1 0.331289 2 1.060159 3... | 0 -0.817845 1 -0.817845 2 0.285714 3... | 0 -0.167792 1 -0.167792 2 0.170456 3... | 0 -0.093218 1 -0.093218 2 -0.117188 3... | 0 0.157139 1 0.157139 2 0.719111 3... |
| 31 | 0 0.036607 1 0.036607 2 0.265778 3... | 0 0.341686 1 0.341686 2 -0.164943 3... | 0 -0.694948 1 -0.694948 2 -0.635560 3... | 0 -0.253020 1 -0.253020 2 -0.354229 3... | 0 -0.082565 1 -0.082565 2 -0.516694 3... | 0 -0.090555 1 -0.090555 2 1.470182 3... |
| 17 | 0 0.324449 1 0.324449 2 9.29442... | 0 -0.977516 1 -0.977516 2 -6.96322... | 0 -1.260218 1 -1.260218 2 -2.498493 3... | 0 -0.788358 1 -0.788358 2 2.434323 3... | 0 0.316941 1 0.316941 2 -0.079901 3... | 0 0.588605 1 0.588605 2 6.535916 3... |
| 37 | 0 -0.046089 1 -0.283051 2 -0.587748 3... | 0 -0.738026 1 -0.314572 2 3.388108 3... | 0 0.179667 1 -0.724257 2 -0.223563 3... | 0 0.364882 1 -1.163894 2 -2.543521 3... | 0 -0.237040 1 -0.101208 2 0.402169 3... | 0 0.386189 1 -0.165129 2 -0.897557 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:41<00:00,  8.24s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_5__symmetry_looking__r_0.9500000000000001` | `dim_5__time_reversal_asymmetry_statistic__lag_1` | `dim_5__time_reversal_asymmetry_statistic__lag_2` | `dim_5__time_reversal_asymmetry_statistic__lag_3` | `dim_5__value_count__value_-1` | `dim_5__value_count__value_0` | `dim_5__value_count__value_1` | `dim_5__variance` | `dim_5__variance_larger_than_standard_deviation` | `dim_5__variation_coefficient` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 13280.972257 | 726.74499 | 0.005215 | -0.029945 | 0.123133 | 18.288784 | 6.971683 | -11.289033 | 100.790019 | 19.279316 | ... | 1.0 | 6.521279 | 12.335605 | 25.936007 | 0.0 | 0.0 | 0.0 | 22.05006 | 1.0 | -68.020217 |
| 1 | 3.691654 | 14.661708 | -0.00083 | 0.003338 | 0.008945 | 0.196186 | -0.098368 | -0.380317 | 0.033669 | 0.301496 | ... | 1.0 | -0.009847 | -0.015825 | -0.003979 | 0.0 | 0.0 | 0.0 | 0.166792 | 0.0 | -79.451065 |
| 2 | 5923.622075 | 351.902197 | -0.022711 | -0.054728 | 0.045949 | 18.952323 | 4.7868 | -1.586263 | 44.834159 | 28.00206 | ... | 1.0 | -1.76202 | 4.734516 | 13.574916 | 0.0 | 0.0 | 0.0 | 12.647317 | 1.0 | -97.393541 |
| 3 | 13876.020277 | 736.256653 | 0.005391 | -0.171305 | 0.18451 | 14.628194 | 5.399811 | -12.349274 | 75.352924 | 18.280817 | ... | 1.0 | 2.145671 | 14.394118 | 24.538999 | 0.0 | 0.0 | 0.0 | 26.76324 | 1.0 | -23.953564 |
| 4 | 4714.701692 | 322.21622 | -0.022158 | -0.084826 | 0.029915 | 14.48791 | 4.354974 | -0.810924 | 29.428908 | 24.485662 | ... | 1.0 | -23.548452 | -10.957499 | 3.340874 | 0.0 | 0.0 | 0.0 | 11.9764 | 1.0 | 10.264358 |
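
The column names above run from `dim_0__...` to `dim_5__...`, which suggests that each feature is computed separately for every dimension and prefixed with the dimension name. A minimal sketch of that layout follows; `extract_features_per_dim` is a hypothetical helper and the two feature functions are plain-Python stand-ins, so this illustrates the naming scheme rather than how tsfresh or sktime implement it.

```python
def extract_features_per_dim(series_by_dim, feature_funcs):
    """Hypothetical helper: one flat entry per (dimension, feature) pair."""
    return {
        f"{dim}__{name}": func(values)
        for dim, values in series_by_dim.items()
        for name, func in feature_funcs.items()
    }


feature_funcs = {
    "abs_energy": lambda x: sum(v * v for v in x),
    "absolute_sum_of_changes": lambda x: sum(abs(b - a) for a, b in zip(x, x[1:])),
}

# One toy instance with six dimensions, mirroring dim_0 ... dim_5 (values are made up).
instance = {f"dim_{k}": [float(k), 1.0, -1.0, 2.0] for k in range(6)}

row = extract_features_per_dim(instance, feature_funcs)
print(len(row))       # 12 = 6 dimensions x 2 features
print(list(row)[:2])  # ['dim_0__abs_energy', 'dim_0__absolute_sum_of_changes']
```

With this layout the width of the feature table grows linearly with the number of dimensions, which fits the longer extraction time seen for the multivariate data.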
