# Feature extraction with tsfresh transformer

In this tutorial, we show how to use `sktime` with [`tsfresh`](https://tsfresh.readthedocs.io) to extract features from time series first, so that any scikit-learn estimator can then be trained on the extracted features.

## Preliminaries
This tutorial requires `tsfresh`. If it is not installed yet, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_gunpoint
from sktime.transformers.series_as_features.summarize import (
    TSFreshFeatureExtractor,
)

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/01_classification_univariate.ipynb).

In [3]:
X, y = load_gunpoint(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(150, 1) (150,) (50, 1) (50,)


In [4]:
X_train.head()

Unnamed: 0,dim_0
110,0 -0.61669 1 -0.61457 2 -0.61434 3...
44,0 -1.2848 1 -1.2887 2 -1.2782 3 ...
82,0 -1.6433 1 -1.6408 2 -1.6362 3 ...
11,0 -1.1730 1 -1.1721 2 -1.1659 3 ...
147,0 -0.73801 1 -0.73630 2 -0.73123 3...
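
The output above shows that each cell of `X_train` holds an entire time series rather than a single number. The following is a minimal sketch of that nested layout, assuming only `numpy` and `pandas`; the toy values and the name `X_toy` are made up for illustration and are not the GunPoint data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 3 instances, 1 dimension, 5 time points each.
rng = np.random.default_rng(0)
cells = np.empty(3, dtype=object)
for i in range(3):
    cells[i] = pd.Series(rng.normal(size=5))  # one whole series per cell

X_toy = pd.DataFrame({"dim_0": cells})

print(X_toy.shape)        # rows are instances, columns are dimensions
first = X_toy.iloc[0, 0]  # a single cell is itself a pd.Series
print(type(first).__name__, len(first))
```

This is why the shape printed earlier reports one column for the univariate data set, regardless of how long each series is.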


In [5]:
# binary classification task
np.unique(y_train)

array(['1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract tsfresh's "efficient" feature set from each series
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")



variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_0__symmetry_looking__r_0.9500000000000001,dim_0__time_reversal_asymmetry_statistic__lag_1,dim_0__time_reversal_asymmetry_statistic__lag_2,dim_0__time_reversal_asymmetry_statistic__lag_3,dim_0__value_count__value_-1,dim_0__value_count__value_0,dim_0__value_count__value_1,dim_0__variance,dim_0__variance_larger_than_standard_deviation,dim_0__variation_coefficient
0,148.999048,25.83833,0.458544,0.510811,0.095597,-1.131448,-1.087768,-0.729832,-0.097042,-0.460083,...,1.0,0.001321,0.002268,0.007223,0.0,0.0,0.0,0.993327,0.0,-543631.6
1,149.000754,43.75439,0.287364,0.303604,0.092835,-0.173903,-0.684838,-1.605855,0.352471,0.89123,...,1.0,0.001098,-0.005136,-0.022212,0.0,0.0,0.0,0.993338,0.0,-667408.7
2,149.000761,51.55763,0.11427,0.107295,0.116319,0.118686,-0.395685,-1.905906,0.562767,0.694912,...,1.0,0.015231,0.004972,-0.031301,0.0,0.0,0.0,0.993338,0.0,-808105.7
3,149.000228,37.330848,0.396208,0.404982,0.059399,-0.316638,-0.755987,-1.409873,0.313829,1.0818,...,1.0,0.001597,0.00346,0.00366,0.0,0.0,0.0,0.993335,0.0,676467.3
4,148.999953,25.562862,0.478498,0.530811,0.086274,-1.102797,-1.159772,-0.840441,-0.086918,-0.037036,...,1.0,0.001593,0.000867,-0.001391,0.0,0.0,0.0,0.993333,0.0,-4671848.0
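
To make a few of the column names above more concrete, here is a rough sketch that computes four of the simpler features by hand. It assumes that these definitions match the ones used by tsfresh (sum of squares, sum of absolute consecutive differences, population variance, and standard deviation divided by mean); the helper functions and the toy series `x` are illustrative and not part of either library.

```python
import math
import statistics


def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variation_coefficient(x):
    """Population standard deviation divided by the mean (assumed definition)."""
    return statistics.pstdev(x) / statistics.fmean(x)


x = [1.0, 2.0, 4.0, 3.0]  # toy series, not taken from the data set
print(abs_energy(x))
print(absolute_sum_of_changes(x))
print(statistics.pvariance(x))
print(variation_coefficient(x))
```

Under these assumptions, a series whose mean is very close to zero yields a very large variation coefficient, which would be consistent with the extreme values in the last column of the table above.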


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")



0.98
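
The pipeline above follows the usual scikit-learn pattern: a transformer turns each series into a fixed-length feature vector, and a standard estimator is trained on those vectors. The sketch below reproduces that pattern without `sktime` or `tsfresh`, using a hand-written `summarize` function and synthetic data. Everything in it (the function, the three features, the toy classes) is an illustrative assumption and not what `TSFreshFeatureExtractor` does internally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def summarize(X):
    """Map an array of shape (n_series, n_timepoints) to 3 features per series."""
    return np.column_stack(
        [
            X.mean(axis=1),
            X.std(axis=1),
            np.abs(np.diff(X, axis=1)).sum(axis=1),
        ]
    )


# Synthetic data (hypothetical): class "1" is low-noise, class "2" is high-noise.
rng = np.random.default_rng(0)
X_toy = np.vstack(
    [
        rng.normal(scale=0.5, size=(40, 30)),
        rng.normal(scale=2.0, size=(40, 30)),
    ]
)
y_toy = np.array(["1"] * 40 + ["2"] * 40)

toy_clf = make_pipeline(
    FunctionTransformer(summarize),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
toy_clf.fit(X_toy, y_toy)
print(toy_clf.score(X_toy, y_toy))  # accuracy on the training data
```

Wrapping both steps in one pipeline means the same feature extraction is applied automatically whenever `fit`, `predict` or `score` is called.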

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

Unnamed: 0,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5
36,0 -1.801504 1 -1.801504 2 -0.480725 3...,0 2.344990 1 2.344990 2 -0.994385 3...,0 0.281253 1 0.281253 2 0.378807 3...,0 0.716447 1 0.716447 2 -0.870923 3...,0 0.162466 1 0.162466 2 0.095881 3...,0 0.921527 1 0.921527 2 -0.474080 3...
23,0 -0.647511 1 -0.647511 2 -0.156391 3...,0 -0.111979 1 -0.111979 2 -0.159968 3...,0 -0.739682 1 -0.739682 2 0.441646 3...,0 0.202416 1 0.202416 2 -0.615239 3...,0 0.165129 1 0.165129 2 0.007990 3...,0 0.074574 1 0.074574 2 0.127842 3...
6,0 1.275129 1 1.275129 2 -0.273185 3...,0 -1.024406 1 -1.024406 2 0.095152 3...,0 -0.545722 1 -0.545722 2 0.023203 3...,0 -0.463427 1 -0.463427 2 0.042614 3...,0 -0.367545 1 -0.367545 2 -0.109198 3...,0 -0.159802 1 -0.159802 2 0.183773 3...
7,0 -0.352746 1 -0.352746 2 -1.354561 3...,0 0.316845 1 0.316845 2 0.490525 3...,0 -0.473779 1 -0.473779 2 1.454261 3...,0 -0.327595 1 -0.327595 2 -0.269001 3...,0 0.106535 1 0.106535 2 0.021307 3...,0 0.197090 1 0.197090 2 0.460763 3...
25,0 -0.044205 1 -0.044205 2 -0.878387 3...,0 -0.496912 1 -0.496912 2 -1.725143 3...,0 -0.428723 1 -0.428723 2 1.558894 3...,0 0.620566 1 0.620566 2 0.082565 3...,0 0.229050 1 0.229050 2 0.098545 3...,0 0.649863 1 0.649863 2 -0.191763 3...


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")



variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_5__symmetry_looking__r_0.9500000000000001,dim_5__time_reversal_asymmetry_statistic__lag_1,dim_5__time_reversal_asymmetry_statistic__lag_2,dim_5__time_reversal_asymmetry_statistic__lag_3,dim_5__value_count__value_-1,dim_5__value_count__value_0,dim_5__value_count__value_1,dim_5__variance,dim_5__variance_larger_than_standard_deviation,dim_5__variation_coefficient
0,5716.535296,375.788586,-0.010798,-0.08417,0.054152,19.190912,5.136329,-1.663644,38.880657,28.832684,...,1.0,-22.050941,-18.937862,3.202976,0.0,0.0,0.0,12.342621,1.0,-42.197112
1,380.452882,130.339733,-0.008718,-0.041681,0.048424,5.132551,1.462494,-0.902601,3.773765,7.581589,...,1.0,0.16112,0.378758,1.095259,0.0,0.0,0.0,2.663761,1.0,35.197925
2,14.11858,16.614377,-0.006352,-0.018835,0.015877,0.594643,-0.156639,-0.556163,0.167986,1.275129,...,1.0,-0.000125,-2e-05,-9e-06,0.0,4.0,0.0,0.007111,0.0,7.4148
3,7.940863,17.538085,-0.012544,-0.03333,0.011363,0.308947,-0.199023,-0.815698,0.100413,0.590843,...,1.0,0.002756,0.004311,0.000921,0.0,0.0,0.0,0.078089,0.0,8.320448
4,193.996354,98.389577,-0.00797,-0.002092,0.044726,3.072738,0.655916,-1.20338,2.053698,3.507223,...,1.0,0.000822,0.254753,0.811217,0.0,0.0,0.0,1.781946,1.0,23.268575
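
For multivariate input, the column names in the table above start with the dimension they were computed from (`dim_0` to `dim_5`), followed by the feature name and its parameters, separated by double underscores. The helper below is a small sketch for splitting such names; `parse_feature_name` is a hypothetical function, and it assumes that dimension names never contain a double underscore themselves.

```python
from collections import Counter


def parse_feature_name(column):
    """Split 'dimension__feature__param_value__...' into its three parts."""
    dimension, feature, *params = column.split("__")
    return dimension, feature, params


# A few column names copied from the output above.
columns = [
    "dim_0__abs_energy",
    'dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"',
    "dim_5__value_count__value_-1",
    "dim_5__variance",
]

for column in columns:
    print(parse_feature_name(column))

# Count how many of the listed features belong to each dimension.
features_per_dimension = Counter(parse_feature_name(c)[0] for c in columns)
print(features_per_dimension)
```

The same function could be applied to `Xt.columns` to group or filter the extracted features by dimension.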
