# Feature extraction with tsfresh transformer

In this tutorial, we show how to use sktime with [tsfresh](https://tsfresh.readthedocs.io) to extract features from time series, which can then be passed to any scikit-learn estimator.

## Preliminaries
This tutorial requires tsfresh. If it is not installed yet, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_arrow_head
from sktime.transformers.series_as_features.summarize import \
    TSFreshFeatureExtractor

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/02_classification_univariate.ipynb).

In [3]:
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(158, 1) (158,) (53, 1) (53,)


In [4]:
X_train.head()

| | dim_0 |
| --- | --- |
| 23 | 0 -2.1597 1 -2.1756 2 -2.1656 3 ... |
| 168 | 0 -1.5317 1 -1.5413 2 -1.5150 3 ... |
| 35 | 0 -1.8010 1 -1.7989 2 -1.7784 3 ... |
| 7 | 0 -1.8234 1 -1.8055 2 -1.7932 3 ... |
| 79 | 0 -2.0399 1 -2.0382 2 -2.0384 3 ... |


In [5]:
# multiclass classification task (three classes)
np.unique(y_train)

array(['0', '1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract features from each series using tsfresh's "efficient" feature set
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_0__symmetry_looking__r_0.9500000000000001` | `dim_0__time_reversal_asymmetry_statistic__lag_1` | `dim_0__time_reversal_asymmetry_statistic__lag_2` | `dim_0__time_reversal_asymmetry_statistic__lag_3` | `dim_0__value_count__value_-1` | `dim_0__value_count__value_0` | `dim_0__value_count__value_1` | `dim_0__variance` | `dim_0__variance_larger_than_standard_deviation` | `dim_0__variation_coefficient` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 249.999359 | 100.008588 | 0.166741 | 0.109416 | 0.06625 | 0.442363 | -0.11707 | -1.566576 | 0.370308 | 1.481666 | ... | 1.0 | 0.065662 | 0.024915 | 0.002205 | 0.0 | 0.0 | 0.0 | 0.996013 | 0.0 | -4106544.0 |
| 1 | 250.000137 | 74.267442 | 0.377671 | 0.444606 | 0.092169 | -0.226793 | -0.669387 | -1.186474 | 0.095314 | 0.709578 | ... | 1.0 | 0.027791 | 0.00408 | -0.010069 | 0.0 | 0.0 | 0.0 | 0.996016 | 0.0 | 1278059.0 |
| 2 | 249.999836 | 77.691848 | 0.319336 | 0.359802 | 0.082387 | -0.35562 | -0.623975 | -1.075446 | 0.080361 | 0.64364 | ... | 1.0 | 0.048628 | 0.007067 | -0.012443 | 0.0 | 0.0 | 0.0 | 0.996015 | 0.0 | 691987.3 |
| 3 | 249.999829 | 83.40715 | 0.25983 | 0.293994 | 0.094073 | -0.267192 | -0.53962 | -1.232441 | 0.11196 | 0.690742 | ... | 1.0 | 0.050867 | 0.012181 | 0.001511 | 0.0 | 0.0 | 0.0 | 0.996015 | 0.0 | -574539.9 |
| 4 | 250.000586 | 97.160702 | 0.180348 | 0.185114 | 0.08357 | -0.096237 | -0.43569 | -1.472035 | 0.178845 | 0.781362 | ... | 1.0 | 0.064194 | 0.040646 | 0.010912 | 0.0 | 0.0 | 0.0 | 0.996018 | 0.0 | -1828466.0 |
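
To make the column names above more concrete, here is a minimal pure-Python sketch of what three of the simpler features compute, following their definitions in the tsfresh documentation. The helper functions and the toy `series` are illustrative assumptions and not part of sktime or tsfresh; tsfresh's own implementations are NumPy-based and may differ in detail.

```python
def abs_energy(x):
    """Sum of squared values of the series."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def variance(x):
    """Population variance (divides by n, not n - 1)."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


# toy series, made up for illustration
series = [1.0, 3.0, 2.0, 5.0]

print(abs_energy(series))               # 39.0
print(absolute_sum_of_changes(series))  # 6.0
print(variance(series))                 # 2.1875
```

Each feature maps one whole series to a single number, which is why the transformed data has one row per series and one column per feature.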


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


0.8867924528301887

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

| | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
| --- | --- | --- | --- | --- | --- | --- |
| 33 | 0 -0.166323 1 -0.166323 2 -0.227300 3... | 0 0.768712 1 0.768712 2 -0.677960 3... | 0 -0.344511 1 -0.344511 2 -1.58804... | 0 0.218397 1 0.218397 2 2.141352 3... | 0 0.135832 1 0.135832 2 0.250357 3... | 0 0.162466 1 0.162466 2 1.808430 3... |
| 20 | 0 -0.294498 1 -0.294498 2 -0.050044 3... | 0 0.540218 1 0.540218 2 -0.515245 3... | 0 0.218114 1 0.218114 2 -0.301108 3... | 0 -0.045277 1 -0.045277 2 0.103872 3... | 0 -0.002663 1 -0.002663 2 -0.183773 3... | 0 0.031960 1 0.031960 2 0.037287 3... |
| 35 | 0 -0.040961 1 -0.040961 2 0.338414 3... | 0 -0.971100 1 -0.971100 2 -3.420216 3... | 0 0.203560 1 0.203560 2 -2.053446 3... | 0 0.061258 1 0.061258 2 0.250357 3... | 0 -0.047941 1 -0.047941 2 -0.639209 3... | 0 0.961478 1 0.961478 2 -0.298298 3... |
| 31 | 0 0.130669 1 0.130669 2 0.06882... | 0 -0.119724 1 -0.119724 2 -4.08360... | 0 -1.019916 1 -1.019916 2 5.39025... | 0 0.684487 1 0.684487 2 0.394179 3... | 0 0.290308 1 0.290308 2 0.617902 3... | 0 0.679160 1 0.679160 2 1.595360 3... |
| 24 | 0 0.279069 1 0.279069 2 0.707552 3... | 0 -0.110199 1 -0.110199 2 -0.031651 3... | 0 0.142979 1 0.142979 2 0.919129 3... | 0 -0.026634 1 -0.026634 2 -0.631219 3... | 0 -0.010653 1 -0.010653 2 0.290308 3... | 0 0.077238 1 0.077238 2 0.122515 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_5__symmetry_looking__r_0.9500000000000001` | `dim_5__time_reversal_asymmetry_statistic__lag_1` | `dim_5__time_reversal_asymmetry_statistic__lag_2` | `dim_5__time_reversal_asymmetry_statistic__lag_3` | `dim_5__value_count__value_-1` | `dim_5__value_count__value_0` | `dim_5__value_count__value_1` | `dim_5__variance` | `dim_5__variance_larger_than_standard_deviation` | `dim_5__variation_coefficient` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10991.779448 | 555.675893 | -0.000122 | -0.039183 | 0.035843 | 23.558182 | 5.495269 | -2.17036 | 58.349987 | 29.126005 | ... | 1.0 | 36.630331 | 20.753415 | -28.497307 | 0.0 | 0.0 | 0.0 | 30.200014 | 1.0 | -14.681544 |
| 1 | 110.735119 | 85.854825 | -0.018927 | -0.03565 | 0.027377 | 2.309696 | 0.354539 | -0.770196 | 0.975286 | 3.041947 | ... | 1.0 | -0.025912 | 0.047528 | 0.317551 | 0.0 | 0.0 | 0.0 | 1.089646 | 1.0 | 184.873936 |
| 2 | 6681.979256 | 441.093167 | -0.003917 | -0.024874 | 0.021016 | 18.693258 | 6.340561 | 0.353609 | 41.466413 | 29.321178 | ... | 1.0 | -28.438658 | 3.340718 | 15.100302 | 0.0 | 0.0 | 0.0 | 12.710682 | 1.0 | 31.065326 |
| 3 | 7638.280878 | 494.592749 | -0.037506 | -0.046871 | 0.015511 | 13.940877 | 5.248257 | -0.141364 | 21.7896 | 27.20738 | ... | 1.0 | -11.139032 | 8.109147 | -0.337394 | 0.0 | 0.0 | 0.0 | 13.456041 | 1.0 | -11.257922 |
| 4 | 344.052224 | 119.475536 | 0.002197 | 0.007196 | 0.05898 | 2.568986 | 0.562201 | -1.055251 | 1.247625 | 3.955685 | ... | 1.0 | 0.016007 | 0.265807 | 0.968304 | 0.0 | 0.0 | 0.0 | 2.364335 | 1.0 | -27.716209 |
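
The column names follow the pattern `<column>__<feature>__<parameter>_<value>`, so the prefix (`dim_0` to `dim_5`) tells us which input dimension a feature was computed from. As a rough sketch, a name can be split back into its parts as shown below. `parse_feature_name` is a hypothetical helper, not an sktime or tsfresh function, and splitting each parameter on its last underscore is a heuristic that assumes parameter values contain no underscores.

```python
def parse_feature_name(name):
    """Split a tsfresh-style column name into (column, feature, parameters)."""
    # parts are separated by double underscores
    column, feature, *params = name.split("__")
    parsed = {}
    for param in params:
        # heuristic: the value is whatever follows the last underscore
        key, value = param.rsplit("_", 1)
        # string-valued parameters are quoted in the name, e.g. f_agg_"max"
        parsed[key] = value.strip('"')
    return column, feature, parsed


print(parse_feature_name("dim_0__abs_energy"))
# ('dim_0', 'abs_energy', {})

print(parse_feature_name("dim_5__value_count__value_-1"))
# ('dim_5', 'value_count', {'value': '-1'})

print(parse_feature_name(
    'dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"'
))
# ('dim_0', 'agg_linear_trend', {'attr': 'intercept', 'chunk_len': '10', 'f_agg': 'max'})
```

Grouping the parsed names by their first element is one way to check that every input dimension contributed features to the transformed data.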
