# Feature extraction with the tsfresh transformer

In this tutorial, we show how to use `sktime` with [`tsfresh`](https://tsfresh.readthedocs.io) to extract features from time series, so that any scikit-learn estimator can then be applied to the resulting feature table.

## Preliminaries
This tutorial requires tsfresh. If it is not installed yet, uncomment and run the cell below:

In [1]:
# !pip install --upgrade tsfresh

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_gunpoint
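# Import path for the sktime version used in this notebook; newer sktime
# releases expose the class in sktime.transformations.panel.tsfresh.
from sktime.transformers.series_as_features.summarize import (
    TSFreshFeatureExtractor,
)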

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/01_classification_univariate.ipynb).

In [3]:
X, y = load_gunpoint(return_X_y=True)
# default split: 75% of the instances for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(150, 1) (150,) (50, 1) (50,)


In [4]:
X_train.head()

| | dim_0 |
|---|---|
| 96 | 0 -0.67835 1 -0.67799 2 -0.67523 3... |
| 10 | 0 -0.81632 1 -0.81410 2 -0.81589 3... |
| 17 | 0 -0.62213 1 -0.61916 2 -0.61331 3... |
| 65 | 0 -0.72544 1 -0.73082 2 -0.73484 3... |
| 6 | 0 -1.1063 1 -1.1656 2 -1.2526 3 ... |


In [5]:
# binary classification task
np.unique(y_train)

array(['1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# "efficient" selects tsfresh's feature set without the most
# computationally expensive feature calculators
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.60s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_0__symmetry_looking__r_0.9500000000000001` | `dim_0__time_reversal_asymmetry_statistic__lag_1` | `dim_0__time_reversal_asymmetry_statistic__lag_2` | `dim_0__time_reversal_asymmetry_statistic__lag_3` | `dim_0__value_count__value_-1` | `dim_0__value_count__value_0` | `dim_0__value_count__value_1` | `dim_0__variance` | `dim_0__variance_larger_than_standard_deviation` | `dim_0__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 149.000664 | 25.362668 | 0.429878 | 0.482114 | 0.10277 | -1.213345 | -1.08557 | -0.700132 | -0.117202 | -0.532308 | ... | 1.0 | -0.001681 | -0.001042 | 0.001164 | 0.0 | 0.0 | 0.0 | 0.993338 | 0.0 | 1010132.0 |
| 1 | 149.000967 | 31.54209 | 0.59185 | 0.643672 | 0.045542 | -1.068173 | -1.230264 | -1.040828 | -0.068216 | -0.056917 | ... | 1.0 | 0.005803 | 0.005564 | 0.006756 | 0.0 | 0.0 | 0.0 | 0.99334 | 0.0 | 357654.7 |
| 2 | 149.00068 | 24.96496 | 0.481046 | 0.53234 | 0.089584 | -1.076461 | -1.210659 | -0.894465 | -0.085382 | -0.158798 | ... | 1.0 | 0.00165 | 0.005504 | 0.008483 | 0.0 | 0.0 | 0.0 | 0.993338 | 0.0 | 612702.9 |
| 3 | 149.00049 | 26.699018 | 0.505264 | 0.565465 | 0.075689 | -1.122256 | -1.167389 | -0.861415 | -0.103517 | -0.204129 | ... | 1.0 | -0.000167 | -0.004344 | -0.003713 | 0.0 | 0.0 | 0.0 | 0.993337 | 0.0 | 1397191.0 |
| 4 | 148.999681 | 41.38986 | 0.399061 | 0.413676 | 0.050069 | -0.385935 | -0.833533 | -1.520563 | 0.279347 | 1.084767 | ... | 1.0 | -0.003086 | -0.004844 | -0.006565 | 0.0 | 0.0 | 0.0 | 0.993331 | 0.0 | -524557.9 |
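
To give a feel for what the extracted columns contain, the sketch below computes three of the simpler features by hand for a toy series, using only the Python standard library. The formulas are the usual definitions (sum of squares, sum of absolute consecutive differences, population variance). The helper names and the toy series are made up for illustration, and tsfresh's own vectorized implementation may differ in detail.

```python
from statistics import pvariance


def abs_energy(series):
    """Sum of the squared values of the series."""
    return sum(v * v for v in series)


def absolute_sum_of_changes(series):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(series, series[1:]))


def extract_toy_features(series, prefix="dim_0"):
    """Map one series to a flat dict of named features (one row of a feature table)."""
    return {
        f"{prefix}__abs_energy": abs_energy(series),
        f"{prefix}__absolute_sum_of_changes": absolute_sum_of_changes(series),
        f"{prefix}__variance": pvariance(series),  # population variance (divide by n)
    }


# Hypothetical toy series, not taken from the GunPoint data
toy_series = [1.0, 3.0, 2.0, 5.0]
features = extract_toy_features(toy_series)
print(features)
```

Each time series, whatever its length, becomes a single row of scalar features, which is the tabular shape scikit-learn estimators expect. In the table above, `dim_0__abs_energy` divided by `dim_0__variance` is close to 150 in every row, which is what these definitions would give for series of about 150 values with a mean near zero.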


## Using tsfresh with sktime

In [7]:
# chain feature extraction and a scikit-learn classifier into a single estimator
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.60s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:06<00:00,  1.20s/it]




1.0
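
The output above shows two feature extraction runs: `fit` transforms the training series before fitting the classifier, and `score` transforms the test series before predicting. The following standard-library sketch mimics that flow. `ToyFeatureExtractor`, `ToyCentroidClassifier`, and `ToyPipeline` are hypothetical stand-ins written for this illustration; they are not sktime or scikit-learn classes, and the data is invented.

```python
from statistics import mean, pstdev


class ToyFeatureExtractor:
    """Stand-in for a feature extractor: each series -> [mean, std]."""

    def __init__(self):
        self.n_transform_calls = 0

    def fit(self, X, y=None):
        return self  # this toy transformer has nothing to learn

    def transform(self, X):
        self.n_transform_calls += 1
        return [[mean(s), pstdev(s)] for s in X]


class ToyCentroidClassifier:
    """Predicts the label whose mean feature vector is closest."""

    def fit(self, Xt, y):
        self.centroids_ = {}
        for label in set(y):
            rows = [row for row, lab in zip(Xt, y) if lab == label]
            self.centroids_[label] = [mean(col) for col in zip(*rows)]
        return self

    def predict(self, Xt):
        def sq_dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        return [
            min(self.centroids_, key=lambda lab: sq_dist(row, self.centroids_[lab]))
            for row in Xt
        ]


class ToyPipeline:
    """Transform first, then hand the feature table to the classifier."""

    def __init__(self, transformer, classifier):
        self.transformer = transformer
        self.classifier = classifier

    def fit(self, X, y):
        Xt = self.transformer.fit(X, y).transform(X)  # first extraction run
        self.classifier.fit(Xt, y)
        return self

    def score(self, X, y):
        Xt = self.transformer.transform(X)  # second extraction run
        predictions = self.classifier.predict(Xt)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Invented toy data: two well-separated classes of short series
X_train_toy = [[0.0, 0.1, -0.1], [0.2, 0.0, 0.1], [5.0, 5.2, 4.9], [5.1, 4.8, 5.0]]
y_train_toy = ["1", "1", "2", "2"]
X_test_toy = [[0.1, 0.1, 0.0], [4.9, 5.1, 5.0]]
y_test_toy = ["1", "2"]

pipeline = ToyPipeline(ToyFeatureExtractor(), ToyCentroidClassifier())
pipeline.fit(X_train_toy, y_train_toy)
accuracy = pipeline.score(X_test_toy, y_test_toy)
print(accuracy, pipeline.transformer.n_transform_calls)
```

Bundling both steps in one estimator means the test series always pass through the same extraction step as the training series, without any manual bookkeeping.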

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
# default split: 75% of the instances for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

| | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
|---|---|---|---|---|---|---|
| 7 | 0 -0.352746 1 -0.352746 2 -1.354561 3... | 0 0.316845 1 0.316845 2 0.490525 3... | 0 -0.473779 1 -0.473779 2 1.454261 3... | 0 -0.327595 1 -0.327595 2 -0.269001 3... | 0 0.106535 1 0.106535 2 0.021307 3... | 0 0.197090 1 0.197090 2 0.460763 3... |
| 15 | 0 -0.159076 1 -0.159076 2 -0.97770... | 0 0.376722 1 0.376722 2 0.38349... | 0 -0.445368 1 -0.445368 2 1.695360 3... | 0 -0.029297 1 -0.029297 2 -0.255684 3... | 0 0.029297 1 0.029297 2 0.375536 3... | 0 -0.047941 1 -0.047941 2 0.516694 3... |
| 34 | 0 0.052231 1 0.052231 2 -0.54804... | 0 -0.730486 1 -0.730486 2 0.70700... | 0 -0.518104 1 -0.518104 2 -1.179430 3... | 0 -0.159802 1 -0.159802 2 -0.239704 3... | 0 -0.045277 1 -0.045277 2 0.023970 3... | 0 -0.029297 1 -0.029297 2 0.29829... |
| 36 | 0 1.686827 1 1.686827 2 0.88247... | 0 -3.375054 1 -3.375054 2 1.149305 3... | 0 -1.295042 1 -1.295042 2 -0.97372... | 0 -0.711121 1 -0.711121 2 -1.861697 3... | 0 0.013317 1 0.013317 2 0.20508... | 0 -0.207743 1 -0.207743 2 0.114525 3... |
| 12 | 0 2.221946 1 2.221946 2 -7.70417... | 0 -0.783638 1 -0.783638 2 -4.56992... | 0 0.142401 1 0.142401 2 2.447367 3... | 0 0.055931 1 0.055931 2 -0.442120 3... | 0 0.071911 1 0.071911 2 0.010653 3... | 0 0.226387 1 0.226387 2 -1.978886 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:39<00:00,  7.87s/it]




| variable | `dim_0__abs_energy` | `dim_0__absolute_sum_of_changes` | `dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"median"__maxlag_40` | `dim_0__agg_autocorrelation__f_agg_"var"__maxlag_40` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"mean"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"min"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"` | `dim_0__agg_linear_trend__attr_"intercept"__chunk_len_50__f_agg_"max"` | ... | `dim_5__symmetry_looking__r_0.9500000000000001` | `dim_5__time_reversal_asymmetry_statistic__lag_1` | `dim_5__time_reversal_asymmetry_statistic__lag_2` | `dim_5__time_reversal_asymmetry_statistic__lag_3` | `dim_5__value_count__value_-1` | `dim_5__value_count__value_0` | `dim_5__value_count__value_1` | `dim_5__variance` | `dim_5__variance_larger_than_standard_deviation` | `dim_5__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.940863 | 17.538085 | -0.012544 | -0.03333 | 0.011363 | 0.308947 | -0.199023 | -0.815698 | 0.100413 | 0.590843 | ... | 1.0 | 0.002756 | 0.004311 | 0.000921 | 0.0 | 0.0 | 0.0 | 0.078089 | 0.0 | 8.320448 |
| 1 | 20089.782616 | 936.012458 | -0.031604 | -0.070448 | 0.144797 | 24.032611 | 6.174375 | -16.526685 | 200.755496 | 27.548164 | ... | 1.0 | 5.090285 | 19.718272 | 76.965414 | 0.0 | 1.0 | 0.0 | 34.822337 | 1.0 | -24.347572 |
| 2 | 5948.915061 | 501.410345 | -0.010562 | -0.000552 | 0.03138 | 18.482604 | 2.516002 | -2.336228 | 46.38186 | 28.609226 | ... | 1.0 | 9.110184 | 51.616973 | 17.067363 | 0.0 | 0.0 | 0.0 | 16.337173 | 1.0 | -151.306128 |
| 3 | 8211.351444 | 469.750718 | -0.008743 | -0.060022 | 0.032161 | 23.713041 | 6.369427 | -4.062657 | 77.811984 | 27.615267 | ... | 1.0 | -8.932064 | 8.306429 | 50.532681 | 0.0 | 0.0 | 0.0 | 22.928426 | 1.0 | -8.784175 |
| 4 | 10701.446629 | 758.995994 | -0.001368 | -0.138129 | 0.176665 | 13.975994 | 1.681442 | -14.370603 | 98.81245 | 17.69389 | ... | 1.0 | 6.485339 | 24.752491 | 49.024603 | 0.0 | 0.0 | 0.0 | 20.123652 | 1.0 | 29.251607 |
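
For multivariate input, every feature is computed per dimension, so the column names carry the dimension as a prefix (`dim_0` to `dim_5` above). Below is a small sketch for splitting such a name into its parts. It assumes the pattern visible in the table header, namely `<dimension>__<feature>__<param>_<value>__...`, with each parameter value following the last underscore of its part. `parse_feature_name` is a hypothetical helper, not part of tsfresh or sktime.

```python
def parse_feature_name(column):
    """Split '<dimension>__<feature>__<param>_<value>__...' into its parts."""
    dimension, feature, *param_parts = column.split("__")
    params = {}
    for part in param_parts:
        # assumption: the value is whatever follows the last underscore
        name, value = part.rsplit("_", 1)
        params[name] = value.strip('"')  # string values appear in double quotes
    return dimension, feature, params


# Column names copied from the table header above
columns = [
    "dim_0__abs_energy",
    'dim_0__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"max"',
    "dim_5__value_count__value_-1",
]
parsed = [parse_feature_name(column) for column in columns]
for row in parsed:
    print(row)
```

Parsed this way, the columns can be grouped by dimension or by feature type, for example to inspect which dimensions contribute the features a fitted model relies on. Parameter values are kept as strings here; converting them to numbers is left out of the sketch.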
