# Feature extraction with tsfresh transformer

In this tutorial, we show how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, which can then be passed to any scikit-learn estimator.

## Preliminaries
Install tsfresh if you haven't already by uncommenting and running the cell below:

In [1]:
# !pip install --upgrade tsfresh
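
If you are unsure whether tsfresh is already available in your environment, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name `is_installed` is our own and not part of sktime or tsfresh.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if a top-level package can be found without importing it."""
    return find_spec(package_name) is not None


# `is_installed` is a hypothetical helper written for this tutorial.
if is_installed("tsfresh"):
    print("tsfresh is available")
else:
    print("tsfresh is missing; run the pip cell above")
```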

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
from sktime.transformers.series_as_features.summarize import \
    TSFreshFeatureExtractor

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)


In [4]:
X_train.head()

|    | dim_0 |
|----|-------|
| 18 | 0 -1.9501 1 -1.9645 2 -1.9495 3 ... |
| 17 | 0 -1.6537 1 -1.6510 2 -1.6319 3 ... |
| 9  | 0 -1.4921 1 -1.4536 2 -1.4228 3 ... |
| 13 | 0 -2.1395 1 -2.1189 2 -2.1044 3 ... |
| 62 | 0 -1.9471 1 -1.9405 2 -1.9224 3 ... |
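
The output above is in sktime's nested data format: each row is one instance, and each cell of `dim_0` holds an entire time series rather than a single number. As a rough sketch of that layout, assuming each cell is a `pandas.Series`, a toy nested frame can be built by hand. The values below are made up and `X_nested` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Two toy series of length 4; the values are invented for illustration.
raw = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 0.5, 0.0, -0.5],
])

# One row per instance, one column per dimension, one pd.Series per cell.
X_nested = pd.DataFrame({"dim_0": [pd.Series(row) for row in raw]})

# Pulling a single cell back out yields the full series for that instance.
first_series = X_nested.loc[0, "dim_0"]
print(X_nested.shape, len(first_series))
```

This is why `X_train.shape` reports one column for the univariate data set: the time points live inside the cells, not in separate columns.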


In [5]:
# multiclass classification task (three classes)
np.unique(y_train)

array(['0', '1', '2'], dtype=object)
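
`np.unique` can also report how many instances fall into each class, which is a quick way to spot class imbalance before training. A small sketch on made-up labels (these are not the ArrowHead labels):

```python
import numpy as np

# Toy labels for illustration only; these are not the ArrowHead labels.
toy_labels = np.array(["0", "1", "2", "1", "0", "1"], dtype=object)

# return_counts=True returns the number of occurrences of each unique value.
classes, counts = np.unique(toy_labels, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(class_counts)
```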

## Using tsfresh to extract features

In [6]:
# extract features using tsfresh's "efficient" parameter set
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:22<00:00,  4.42s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_0__symmetry_looking__r_0.9500000000000001` | `dim_0__time_reversal_asymmetry_statistic__lag_1` | `dim_0__time_reversal_asymmetry_statistic__lag_2` | `dim_0__time_reversal_asymmetry_statistic__lag_3` | `dim_0__value_count__value_-1` | `dim_0__value_count__value_0` | `dim_0__value_count__value_1` | `dim_0__variance` | `dim_0__variance_larger_than_standard_deviation` | `dim_0__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 250.000417 | 87.342758 | 0.205766 | 0.223395 | 0.098595 | -0.170609 | -0.461739 | -1.268942 | 0.126348 | 0.80362 | ... | 1.0 | 0.067745 | 0.018134 | -0.00187 | 0.0 | 0.0 | 0.0 | 0.996018 | 0.0 | 515431.5 |
| 1 | 249.999406 | 76.562689 | 0.367765 | 0.423947 | 0.082711 | -0.311368 | -0.688561 | -1.204716 | 0.098933 | 0.698685 | ... | 1.0 | 0.035214 | 0.005789 | -0.008803 | 0.0 | 0.0 | 0.0 | 0.996014 | 0.0 | -319514.3 |
| 2 | 250.001518 | 68.936816 | 0.251116 | 0.330111 | 0.16159 | 0.171556 | -0.212736 | -0.503936 | 0.049149 | 1.01712 | ... | 1.0 | 0.012871 | -0.010108 | -0.018167 | 0.0 | 0.0 | 0.0 | 0.996022 | 0.0 | -3976195.0 |
| 3 | 249.999228 | 82.949022 | 0.138073 | 0.088238 | 0.096341 | 0.200089 | -0.231736 | -0.835281 | 0.102181 | 1.059081 | ... | 1.0 | 0.045324 | -0.010596 | -0.035836 | 0.0 | 0.0 | 0.0 | 0.996013 | 0.0 | -31312390.0 |
| 4 | 249.999405 | 85.201376 | 0.193592 | 0.230073 | 0.111014 | -0.149566 | -0.412193 | -1.162203 | 0.120004 | 0.839482 | ... | 1.0 | 0.060527 | 0.009508 | -0.009285 | 0.0 | 0.0 | 0.0 | 0.996014 | 0.0 | 821308.9 |
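
Each output column is named `<dimension>__<feature>`, optionally followed by the feature's parameters. To give a feel for what a few of the simpler columns measure, the sketch below recomputes three of them with NumPy on a toy series. The formulas reflect our understanding of tsfresh's definitions rather than its actual implementation, and `simple_features` is a hypothetical helper.

```python
import numpy as np


def simple_features(x):
    """Compute three simple summary features for a single series."""
    x = np.asarray(x, dtype=float)
    return {
        # sum of squared values
        "abs_energy": float(np.sum(x ** 2)),
        # sum of absolute differences between consecutive values
        "absolute_sum_of_changes": float(np.sum(np.abs(np.diff(x)))),
        # population variance (divides by n, not n - 1)
        "variance": float(np.var(x)),
    }


# Toy series for illustration only.
series = [1.0, 2.0, 4.0, 3.0]
features = simple_features(series)
print(features)
```

Whatever the length of the input series, the result is a fixed-length set of numbers, which is what allows a tabular scikit-learn estimator to consume it.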


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:21<00:00,  4.39s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:07<00:00,  1.50s/it]




0.9245283018867925
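
`make_pipeline` chains the feature extractor and the classifier so that `fit` and `score` apply the transformation automatically, which is why feature extraction runs twice above: once on the training data and once on the test data. The following sketch shows the same pattern on synthetic data, with a scikit-learn `FunctionTransformer` standing in for `TSFreshFeatureExtractor` so that it runs without tsfresh. The `summarize` function and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def summarize(X):
    """Map each row (one series) to three summary features."""
    return np.column_stack([
        X.mean(axis=1),
        X.std(axis=1),
        np.abs(np.diff(X, axis=1)).sum(axis=1),
    ])


rng = np.random.default_rng(0)
# Synthetic data: class "a" fluctuates around 0, class "b" around 5.
X_toy = np.vstack([
    rng.normal(0, 1, size=(20, 30)),
    rng.normal(5, 1, size=(20, 30)),
])
y_toy = np.array(["a"] * 20 + ["b"] * 20)

toy_classifier = make_pipeline(
    FunctionTransformer(summarize),  # stand-in for TSFreshFeatureExtractor
    RandomForestClassifier(n_estimators=50, random_state=0),
)
# Even rows are used for training, odd rows for testing.
toy_classifier.fit(X_toy[::2], y_toy[::2])
toy_score = toy_classifier.score(X_toy[1::2], y_toy[1::2])
print(toy_score)
```

Keeping the extractor inside the pipeline means the test data goes through exactly the same transformation as the training data, without a separate manual `transform` call.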

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

|    | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|----|-------|-------|-------|-------|-------|-------|
| 36 | 0 -1.801504 1 -1.801504 2 -0.480725 3... | 0 2.344990 1 2.344990 2 -0.994385 3... | 0 0.281253 1 0.281253 2 0.378807 3... | 0 0.716447 1 0.716447 2 -0.870923 3... | 0 0.162466 1 0.162466 2 0.095881 3... | 0 0.921527 1 0.921527 2 -0.474080 3... |
| 33 | 0 -2.488524 1 -2.488524 2 -3.298341 3... | 0 0.360353 1 0.360353 2 -4.123512 3... | 0 0.362847 1 0.362847 2 1.901927 3... | 0 0.007990 1 0.007990 2 -0.234377 3... | 0 -0.093218 1 -0.093218 2 0.346238 3... | 0 0.844289 1 0.844289 2 0.439456 3... |
| 12 | 0 2.221946 1 2.221946 2 -7.70417... | 0 -0.783638 1 -0.783638 2 -4.56992... | 0 0.142401 1 0.142401 2 2.447367 3... | 0 0.055931 1 0.055931 2 -0.442120 3... | 0 0.071911 1 0.071911 2 0.010653 3... | 0 0.226387 1 0.226387 2 -1.978886 3... |
| 21 | 0 -0.171905 1 -0.171905 2 -0.397472 3... | 0 0.206276 1 0.206276 2 -3.217950 3... | 0 -0.308410 1 -0.308410 2 -0.035401 3... | 0 -0.189099 1 -0.189099 2 0.857606 3... | 0 0.079901 1 0.079901 2 0.135832 3... | 0 0.055931 1 0.055931 2 0.391516 3... |
| 0  | 0 -0.740653 1 -0.740653 2 10.20844... | 0 0.756509 1 0.756509 2 -9.216970 3... | 0 -0.275809 1 -0.275809 2 -12.37890... | 0 -0.423476 1 -0.423476 2 -14.69915... | 0 0.013317 1 0.013317 2 4.578337 3... | 0 0.013317 1 0.013317 2 -5.055081 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:39<00:00,  7.89s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_5__symmetry_looking__r_0.9500000000000001` | `dim_5__time_reversal_asymmetry_statistic__lag_1` | `dim_5__time_reversal_asymmetry_statistic__lag_2` | `dim_5__time_reversal_asymmetry_statistic__lag_3` | `dim_5__value_count__value_-1` | `dim_5__value_count__value_0` | `dim_5__value_count__value_1` | `dim_5__variance` | `dim_5__variance_larger_than_standard_deviation` | `dim_5__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5716.535296 | 375.788586 | -0.010798 | -0.08417 | 0.054152 | 19.190912 | 5.136329 | -1.663644 | 38.880657 | 28.832684 | ... | 1.0 | -22.050941 | -18.937862 | 3.202976 | 0.0 | 0.0 | 0.0 | 12.342621 | 1.0 | -42.197112 |
| 1 | 5499.612539 | 453.956021 | -0.015376 | -0.10052 | 0.046386 | 12.972145 | 2.15503 | -4.856715 | 40.469936 | 29.266172 | ... | 1.0 | 3.943785 | 74.559369 | 81.181261 | 0.0 | 0.0 | 0.0 | 17.3554 | 1.0 | -165.171798 |
| 2 | 10701.446629 | 758.995994 | -0.001368 | -0.138129 | 0.176665 | 13.975994 | 1.681442 | -14.370603 | 98.81245 | 17.69389 | ... | 1.0 | 6.485339 | 24.752491 | 49.024603 | 0.0 | 0.0 | 0.0 | 20.123652 | 1.0 | 29.251607 |
| 3 | 535.495127 | 135.580907 | -0.027563 | -0.045144 | 0.078813 | 4.071817 | 1.703098 | -0.768076 | 2.699804 | 5.046151 | ... | 1.0 | 0.196546 | -0.351586 | 0.113937 | 0.0 | 0.0 | 0.0 | 3.208048 | 1.0 | 59.407542 |
| 4 | 117.736948 | 38.507459 | -0.001277 | -0.009787 | 0.001696 | 3.78114 | 0.250217 | -0.449319 | 3.06543 | 10.208449 | ... | 1.0 | 0.005487 | 0.00896 | 0.018841 | 0.0 | 8.0 | 0.0 | 0.287096 | 0.0 | -11.170363 |
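
For multivariate input, the column names above suggest that features are computed separately for each dimension and then placed side by side (`dim_0__...` through `dim_5__...`). A minimal NumPy sketch of that layout follows, assuming a plain 3D array rather than sktime's nested format and using only two hand-written features; the array shape and feature choice are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy panel: 4 instances, 2 dimensions, 10 time points (random values).
panel = rng.normal(size=(4, 2, 10))

columns = {}
for dim in range(panel.shape[1]):
    series = panel[:, dim, :]  # all instances for this dimension
    # Prefix every feature with its dimension, mirroring the names above.
    columns[f"dim_{dim}__abs_energy"] = (series ** 2).sum(axis=1)
    columns[f"dim_{dim}__variance"] = series.var(axis=1)

feature_names = list(columns)
feature_matrix = np.column_stack([columns[name] for name in feature_names])
print(feature_names, feature_matrix.shape)
```

The number of feature columns therefore grows linearly with the number of dimensions, which is consistent with the longer extraction time reported for the six-dimensional data set.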
