**Feature Selection in a sklearn pipeline:**

In [1]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

from tsfresh.examples import load_robot_execution_failures
from tsfresh.transformers import RelevantFeatureAugmenter
from tsfresh.utilities.dataframe_functions import impute

**Load and Prepare the Data:**

In [2]:
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures
download_robot_execution_failures() 
df_ts, y = load_robot_execution_failures()

We want to use the extracted features to predict, for each robot execution, whether it was a failure or not. Our basic entity is therefore a single robot execution, identified by a distinct id. The pipeline needs a dataframe that has these identifiers as its index.

In [5]:
X = pd.DataFrame(index = y.index)
# split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y)

**In retrospect**
- To avoid data leakage, the time series dataframe passed to the augmenter for learning relevant features should contain only the ids in the train set (X_train). Filter that dataframe by the X_train ids so that the relevant features are learned from the training set alone.
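
The filtering described above can be tried out on a small made-up frame. This is only a sketch: the `toy_*` names and all values are invented for illustration, and the column names `id`, `time` and `F_x` merely mirror the layout of the robot data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy long-format time series: 6 ids with 3 time steps each (invented values).
toy_ts = pd.DataFrame({
    "id": [i for i in range(1, 7) for _ in range(3)],
    "time": [t for _ in range(1, 7) for t in range(3)],
    "F_x": list(range(18)),
})
toy_y = pd.Series([True, False, True, False, False, True], index=range(1, 7))

# Empty frame that only carries the ids in its index, as in the notebook.
toy_X = pd.DataFrame(index=toy_y.index)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=2, random_state=0
)

# Keep only the time series rows whose id belongs to the respective split.
toy_ts_train = toy_ts[toy_ts["id"].isin(toy_X_train.index)]
toy_ts_test = toy_ts[toy_ts["id"].isin(toy_X_test.index)]

# Sanity check: no id may appear in both containers.
overlap = set(toy_ts_train["id"]) & set(toy_ts_test["id"])
print(len(toy_ts_train), len(toy_ts_test), overlap)
```

The overlap check is cheap and worth keeping in real code, since it fails loudly if the split and the time series container ever get out of sync.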

**Build the pipeline:**
- We now build a sklearn pipeline that consists of a feature extraction step and a classifier. The RelevantFeatureAugmenter takes roughly the same arguments as extract_features and select_features.

In [10]:
ppl = Pipeline([
    ('augmenter', RelevantFeatureAugmenter(column_id = 'id', column_sort='time')),
    ('classifier', RandomForestClassifier())
])

The input to the pipeline will be our dataframe X, with one row per identifier. It is currently empty, so which time series data should the RelevantFeatureAugmenter actually extract features from?
- We need to pass the time series data stored in df_ts to the transformer.

In this case, df_ts contains the time series of both the train and test sets. If you have separate dataframes for the train and test sets, you have to call set_params twice.

In [11]:
ppl.set_params(augmenter__timeseries_container = df_ts);
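
The `augmenter__timeseries_container` key in the call above uses sklearn's `<step name>__<parameter>` convention for addressing a parameter of one pipeline step. A minimal sketch of that convention with two standard sklearn estimators (the `demo` pipeline and its step names are chosen only for illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# "<step name>__<parameter>" routes the value to the estimator in that step.
demo.set_params(scaler__with_mean=False, classifier__C=0.5)

# The values are now set on the underlying estimators.
scaler_with_mean = demo.named_steps["scaler"].with_mean
classifier_C = demo.get_params()["classifier__C"]
print(scaler_with_mean, classifier_C)
```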

Now fit the pipeline:

In [12]:
ppl.fit(X_train, y_train)

Feature Extraction: 100%|██████████| 40/40 [00:08<00:00,  4.75it/s]
 'F_z__partial_autocorrelation__lag_8'
 'F_z__partial_autocorrelation__lag_9' ...
 'F_y__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"mean"'
 'F_y__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"var"'
 'F_y__query_similarity_count__query_None__threshold_0.0'] did not have any finite values. Filling with zeros.


There is a data leakage risk in this example, because the df_ts passed to the augmenter contains the test set as well.
Resolve this for production code.
- The relevant features must not be learned from the test set, since that would let information from the test set spill over into training.
- However, here is what happens during fit: the augmenter uses the input time series data to extract time series features for each of the identifiers in X_train and selects only the relevant features, using the passed y_train as target. These features are then added to X_train as new columns, and the classifier uses them during training.
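
The fit and predict behaviour described above can be sketched with plain pandas. This is a simplified stand-in and not tsfresh's implementation: `extract_toy_features` and `select_toy_features` are hypothetical helpers, the features are just per-id mean/min/max, and a correlation threshold replaces tsfresh's relevance tests. The toy data is invented.

```python
import pandas as pd

def extract_toy_features(ts, ids, column_id="id", column_sort="time"):
    """Per-id mean/min/max of every value column, only for the requested ids."""
    restricted = ts[ts[column_id].isin(ids)]
    values = restricted.drop(columns=[column_sort])
    feats = values.groupby(column_id).agg(["mean", "min", "max"])
    feats.columns = [f"{col}__{agg}" for col, agg in feats.columns]
    return feats.reindex(ids)

def select_toy_features(features, y, threshold=0.5):
    """Stand-in relevance check: keep columns with high |correlation| to y."""
    corr = features.corrwith(y.astype(float)).abs()
    return list(corr[corr >= threshold].index)

toy_ts = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6],
    "time": [0, 1, 2] * 6,
    "F_x":  [10, 11, 12, 0, 1, 2, 9, 10, 11, 1, 2, 3, 10, 12, 11, 0, 2, 1],
    "F_y":  [5, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
})
toy_y = pd.Series([True, False, True, False, True, False], index=range(1, 7))

train_ids = pd.Index([1, 2, 3, 4], name="id")
test_ids = pd.Index([5, 6], name="id")

# "fit": extract features for the training ids only, then pick the relevant ones.
train_feats = extract_toy_features(toy_ts, train_ids)
relevant = select_toy_features(train_feats, toy_y.loc[train_ids])

# "predict": extract for the test ids and keep only the columns chosen during fit.
test_feats = extract_toy_features(toy_ts, test_ids)[relevant]
print(relevant)
```

Even though `toy_ts` holds all six ids, the selection step only ever sees rows for `train_ids`, which is the property the bullet above relies on.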

Prediction: during inference, the augmenter extracts only those features it found relevant in the training phase. The classifier predicts the target using these features:

In [13]:
y_pred = ppl.predict(X_test)

Feature Extraction: 100%|██████████| 33/33 [00:03<00:00,  8.50it/s]


In [14]:
# inspect the performance:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00        17
        True       1.00      1.00      1.00         5

    accuracy                           1.00        22
   macro avg       1.00      1.00      1.00        22
weighted avg       1.00      1.00      1.00        22



**Prediction:**
- In this example, the X_train and X_test passed into the pipeline were empty except for their index. However, you can also fill the input with other features you have (e.g. features extracted from the metadata) or even put other pipeline components before the augmenter.

**Separating the time series data containers:**
- In the above example, we passed a single df_ts into the RelevantFeatureAugmenter, which was then used for both training and prediction. During training, features were extracted only for the ids in X_train. Features for the remaining ids were extracted during prediction.
- However, it is perfectly fine to call set_params twice: once before training and once before prediction. This can be handy if, for example, you dump the trained pipeline to disk and reuse it later for prediction.
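
Dumping a fitted pipeline to disk and reloading it can be sketched with the standard library's `pickle`. The pipeline here is a plain sklearn stand-in fitted on synthetic data (`demo_ppl`, `X_demo` and the file name are made up for illustration); whether a given tsfresh pipeline pickles cleanly in your environment is an assumption you should verify.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only so that the stand-in pipeline can be fitted.
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)

demo_ppl = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
demo_ppl.fit(X_demo[:40], y_demo[:40])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "pipeline.pkl"
    # Dump the fitted pipeline to disk ...
    with open(path, "wb") as f:
        pickle.dump(demo_ppl, f)
    # ... and load it again, e.g. later in a separate prediction job.
    with open(path, "rb") as f:
        restored_ppl = pickle.load(f)

# The restored pipeline predicts exactly like the original one.
same_predictions = bool(
    (restored_ppl.predict(X_demo[40:]) == demo_ppl.predict(X_demo[40:])).all()
)
print(same_predictions)
```

With the tsfresh pipeline from this notebook, the restored object would additionally need `set_params(augmenter__timeseries_container=...)` pointing at the new time series data before calling `predict`.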

In [15]:
df_ts_train = df_ts[df_ts["id"].isin(y_train.index)]
df_ts_test = df_ts[df_ts["id"].isin(y_test.index)]

In [16]:
ppl = Pipeline([
    ('augmenter', RelevantFeatureAugmenter(column_id = 'id', column_sort='time')),
    ('classifier', RandomForestClassifier())
])

In [17]:
ppl.set_params(augmenter__timeseries_container = df_ts_train);
ppl.fit(X_train, y_train)

Feature Extraction: 100%|██████████| 40/40 [00:08<00:00,  4.75it/s]
 'F_z__partial_autocorrelation__lag_8'
 'F_z__partial_autocorrelation__lag_9' ...
 'F_y__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"mean"'
 'F_y__agg_linear_trend__attr_"stderr"__chunk_len_50__f_agg_"var"'
 'F_y__query_similarity_count__query_None__threshold_0.0'] did not have any finite values. Filling with zeros.


In [19]:
ppl.set_params(augmenter__timeseries_container=df_ts_test);
y_pred = ppl.predict(X_test)

Feature Extraction: 100%|██████████| 33/33 [00:03<00:00,  9.09it/s]


In [20]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00        17
        True       1.00      1.00      1.00         5

    accuracy                           1.00        22
   macro avg       1.00      1.00      1.00        22
weighted avg       1.00      1.00      1.00        22



In [6]:
X_train

4
33
26
34
36
...
48
65
56
51
37
