# Feature Selection in a sklearn pipeline

This notebook is quite similar to [the first example](./01%20Feature%20Extraction%20and%20Selection.ipynb).
This time however, we use the `sklearn` pipeline API of `tsfresh`.
If you want to learn more, have a look at [the documentation](https://tsfresh.readthedocs.io/en/latest/text/sklearn_transformers.html).

In [1]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

from tsfresh.examples import load_robot_execution_failures
from tsfresh.transformers import RelevantFeatureAugmenter
from tsfresh.utilities.dataframe_functions import impute

## Load and Prepare the Data

Check out the first example notebook to learn more about the data and format.

In [2]:
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures
download_robot_execution_failures() 
df_ts, y = load_robot_execution_failures()

We want to use the extracted features to predict, for each robot execution, whether it was a failure or not.
Therefore our basic "entity" is a single robot execution given by a distinct `id`.

A dataframe with these identifiers as index needs to be prepared for the pipeline.

In [3]:
X = pd.DataFrame(index=y.index)

# Split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y)
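
As a toy illustration of why a frame without columns is enough, the sketch below splits a made-up target (the ids and labels are invented, not taken from the robot data). The index is what carries the ids through `train_test_split`, so `X` and `y` stay aligned. Passing `random_state` is optional and only makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up target: one label per entity id (illustrative values only)
y_toy = pd.Series(
    [True, False, True, False, False, True, False, False],
    index=pd.Index(range(1, 9), name="id"),
)

# A frame with no columns; only the ids in the index matter
X_toy = pd.DataFrame(index=y_toy.index)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)          # rows are split, columns stay at zero
print(X_tr.index.equals(y_tr.index))   # ids of X and y remain aligned
```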

## Build the pipeline

We build a sklearn pipeline that consists of a feature extraction step (`RelevantFeatureAugmenter`) with a subsequent `RandomForestClassifier`.

The `RelevantFeatureAugmenter` takes roughly the same arguments as `extract_features` and `select_features` do.

In [5]:
ppl = Pipeline([
        ('augmenter', RelevantFeatureAugmenter(column_id='id', column_sort='time')),
        ('classifier', RandomForestClassifier())
      ])

<div class="alert alert-warning">
    
Here comes the tricky part!
    
The input to the pipeline will be our dataframe `X`, which has one row per identifier.
It is currently empty.
But which time series data should the `RelevantFeatureAugmenter` actually extract the features from?

We need to pass the time series data (stored in `df_ts`) to the transformer.
    
</div>

In this case, `df_ts` contains the time series of both the train and the test set.
If you have separate dataframes for the train and test set, you have to call `set_params` twice
(see further below for how to deal with two independent data sets).

In [6]:
ppl.set_params(augmenter__timeseries_container=df_ts);
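
The `augmenter__timeseries_container` name follows the general sklearn convention `<step name>__<parameter name>`. A minimal sketch of that convention, using a stand-in pipeline made of plain sklearn estimators (the `demo` pipeline and its steps are only for illustration and are not part of this notebook's workflow):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline, used only to show the parameter naming scheme
demo = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# "<step name>__<parameter name>" addresses a parameter of a single step
demo.set_params(scaler__with_mean=False, classifier__C=0.5)

print(demo.named_steps["scaler"].with_mean)
print(demo.get_params()["classifier__C"])
```

The same mechanism is what lets the time series container be swapped on an existing pipeline without rebuilding it.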

We are now ready to fit the pipeline.

In [7]:
ppl.fit(X_train, y_train)

Feature Extraction: 100%|██████████| 20/20 [00:06<00:00,  3.19it/s]
 'F_x__partial_autocorrelation__lag_8'
 'F_x__partial_autocorrelation__lag_9' ...
 'T_z__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"mean"'
 'T_z__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"var"'
 'T_z__query_similarity_count__query_None__threshold_0.0'] did not have any finite values. Filling with zeros.


The augmenter has used the input time series data to extract time series features for each of the identifiers in `X_train` and has selected only the relevant ones, using the passed `y_train` as target.
These features have been added to `X_train` as new columns.
The classifier can now use these features during training.

## Prediction

During inference, the augmenter extracts only the relevant features it identified in the training phase, and the classifier predicts the target using these features.

In [8]:
y_pred = ppl.predict(X_test)

Feature Extraction: 100%|██████████| 19/19 [00:02<00:00,  8.15it/s]


So, finally we inspect the performance:

In [9]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00        16
        True       1.00      1.00      1.00         6

    accuracy                           1.00        22
   macro avg       1.00      1.00      1.00        22
weighted avg       1.00      1.00      1.00        22



You can also find out which columns the augmenter has selected:

In [10]:
ppl.named_steps["augmenter"].feature_selector.relevant_features

['F_x__value_count__value_-1',
 'F_x__range_count__max_1__min_-1',
 'F_x__mean_n_absolute_max__number_of_maxima_7',
 'F_x__variance',
 'F_x__abs_energy',
 'F_x__root_mean_square',
 'F_x__standard_deviation',
 'F_y__abs_energy',
 'F_y__root_mean_square',
 'F_x__fft_coefficient__attr_"abs"__coeff_1',
 'T_y__absolute_maximum',
 'F_y__mean_n_absolute_max__number_of_maxima_7',
 'F_x__absolute_maximum',
 'T_y__variance',
 'T_y__standard_deviation',
 'F_x__ratio_value_number_to_time_series_length',
 'T_x__variance',
 'T_x__standard_deviation',
 'T_x__absolute_maximum',
 'F_y__absolute_maximum',
 'T_y__fft_coefficient__attr_"abs"__coeff_1',
 'T_x__ratio_value_number_to_time_series_length',
 'F_z__agg_linear_trend__attr_"intercept"__chunk_len_10__f_agg_"var"',
 'T_x__fft_coefficient__attr_"abs"__coeff_2',
 'F_z__standard_deviation',
 'T_y__root_mean_square',
 'T_y__fft_coefficient__attr_"abs"__coeff_2',
 'F_z__variance',
 'F_x__fft_coefficient__attr_"abs"__coeff_4',
 'T_y__abs_energy',
 'T_y__m
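
The names in this list start with the time series column, followed by `__` and the feature name and its parameters. Under that assumption, a short sketch can count how many selected features come from each signal. The `selected` list below is a hand-copied subset of the output above, used only for illustration:

```python
from collections import Counter

# Hand-copied subset of the selected feature names shown above
selected = [
    "F_x__value_count__value_-1",
    "F_x__variance",
    "F_y__abs_energy",
    "T_y__absolute_maximum",
    "T_x__variance",
    "F_z__standard_deviation",
]

# Everything before the first "__" is the name of the time series column
per_signal = Counter(name.split("__", 1)[0] for name in selected)

for signal, count in sorted(per_signal.items()):
    print(signal, count)
```

With a fitted pipeline, the same counting could be applied to `ppl.named_steps["augmenter"].feature_selector.relevant_features`.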

<div class="alert alert-info">
    
In this example, the `X_train` and `X_test` we passed into the pipeline were empty except for their index.
However, you can also fill the input with other features you have (e.g. features extracted from the metadata)
or even put other pipeline components before the augmenter.
    
</div>
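
As a sketch of what such a pre-filled input could look like, the snippet below joins a metadata column onto the id index. The metadata table and the `payload_kg` column are hypothetical; the robot data set used in this notebook does not ship with them.

```python
import pandas as pd

# Made-up target with one label per entity id
y_toy = pd.Series([True, False, True], index=pd.Index([1, 2, 3], name="id"))

# Hypothetical per-entity metadata, deliberately in a different row order
meta = pd.DataFrame(
    {"id": [3, 1, 2], "payload_kg": [1.5, 2.0, 0.5]}
).set_index("id")

# Start from the id index and join the extra feature column onto it
X_meta = pd.DataFrame(index=y_toy.index).join(meta[["payload_kg"]])

print(X_meta)
```

A frame like `X_meta` would then take the place of the empty `X`, with the extracted time series features added next to the existing column.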

## Separating the time series data containers

In the example above, we passed a single `df_ts` into the `RelevantFeatureAugmenter`, which was used both for training and for prediction.
During training, only the data with the `id`s from `X_train` were extracted, and during prediction the rest.

However, it is perfectly fine to call `set_params` twice: once before training and once before prediction.
This can be handy if, for example, you dump the trained pipeline to disk and reuse it later for prediction.
You only need to make sure that the `id`s of the entities you use during training/prediction are actually present in the passed time series data.

In [11]:
df_ts_train = df_ts[df_ts["id"].isin(y_train.index)]
df_ts_test = df_ts[df_ts["id"].isin(y_test.index)]
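
One way to verify that the ids are present before fitting or predicting is a small check like the one sketched below. `missing_ids` is a hypothetical helper written for this example, not a `tsfresh` function, and the toy frames are made up.

```python
import pandas as pd


def missing_ids(X, timeseries, column_id="id"):
    """Return the ids in X.index that have no rows in the time series data."""
    present = set(timeseries[column_id].unique().tolist())
    return sorted(set(X.index.tolist()) - present)


# Toy time series container holding data for ids 1 and 2 only
ts = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "time": [0, 1, 0, 1],
    "F_x": [0.1, 0.2, 0.3, 0.4],
})

X_ok = pd.DataFrame(index=[1, 2])
X_bad = pd.DataFrame(index=[2, 3])

print(missing_ids(X_ok, ts))
print(missing_ids(X_bad, ts))
```

In this notebook, the corresponding calls would be `missing_ids(X_train, df_ts_train)` and `missing_ids(X_test, df_ts_test)`, both of which should come back empty.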

In [12]:
ppl.set_params(augmenter__timeseries_container=df_ts_train);
ppl.fit(X_train, y_train);

Feature Extraction: 100%|██████████| 20/20 [00:05<00:00,  3.94it/s]
 'F_x__partial_autocorrelation__lag_8'
 'F_x__partial_autocorrelation__lag_9' ...
 'T_z__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"mean"'
 'T_z__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"var"'
 'T_z__query_similarity_count__query_None__threshold_0.0'] did not have any finite values. Filling with zeros.


In [None]:
import pickle
with open("pipeline.pkl", "wb") as f:
    pickle.dump(ppl, f)

Later: load the fitted model and do predictions on new, unseen data

In [None]:
import pickle
with open("pipeline.pkl", "rb") as f:
    ppl = pickle.load(f)

In [None]:
ppl.set_params(augmenter__timeseries_container=df_ts_test);
y_pred = ppl.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))