# Quickstart: Timeseries Feature Extraction
Some datasets are large enough and structurally suitable for machine learning methods. Timeseries data, however, cannot be fed directly into most machine learning algorithms.

With `bletl_analysis.features`, you can apply a mix of biologically inspired and statistical methods to extract hundreds of features from timeseries of backscatter, pH and DO.
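To illustrate the general idea of condensing a timeseries into a fixed-length set of scalar features, here is a minimal sketch in plain NumPy. The helper `summary_features` and the synthetic curve are hypothetical and not part of `bletl_analysis`; the feature names only mirror some of the statistical column names that appear in this notebook's output, and the library's exact definitions may differ.

```python
import numpy as np


def summary_features(y):
    """Condense one timeseries into a flat dict of scalars (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    return {
        "max": float(y.max()),
        "mean": float(y.mean()),
        "median": float(np.median(y)),
        "min": float(y.min()),
        "span": float(y.max() - y.min()),
        "stan_dev": float(y.std()),
    }


# Synthetic, made-up signal: a logistic curve standing in for a backscatter trace.
t = np.linspace(0, 20, 101)
y = 10 + 100 / (1 + np.exp(-(t - 10)))

features_row = summary_features(y)
print(features_row)
```

Each well then contributes one such row, which is the tabular shape that most machine learning algorithms expect.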

Under the hood, `bletl_analysis.features` uses [`tsfresh`](https://tsfresh.readthedocs.io) and combines it with an extensible API that you can use to provide additional, custom-designed feature extraction methods.

In [1]:
import pandas
import pathlib
from IPython.display import display

import bletl
import bletl_analysis
from bletl_analysis import features

## Parse the raw data file

In [2]:
# Build the path from its parts so that it works on Windows, Linux and macOS alike
filepath = pathlib.Path("..", "bletl", "tests", "data", "BL1", "NT_1200rpm_30C_DO-GFP75-pH-BS10_12min_20171221_121339.csv")
bldata = bletl.parse(filepath, lot_number=1515, temp=30)

## Feature Extraction
You'll need to provide a dictionary that maps each filterset you want to extract from to a list of `Extractor` objects.

Additionally, you can pass `last_cycles`, a dictionary that maps well IDs to the cycle after which the timeseries of that well is ignored, for example because of sacrifice sampling.

In [3]:
extractors = {
    "BS10": [features.BSFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
    "pH": [features.pHFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
    "DO": [features.DOFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
}
last_cycles = {
    "A01": 20,
    "B01": 50,
}
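
The effect of a per-well last cycle can be pictured with a small pandas sketch. This is not the library's implementation: the toy table, the `truncate` helper, the inclusive cutoff, and the rule that unlisted wells keep all cycles are assumptions made for illustration.

```python
import pandas as pd

# Toy long-format data (made up): one row per well and measurement cycle.
data = pd.DataFrame({
    "well": ["A01"] * 5 + ["B01"] * 5,
    "cycle": [1, 2, 3, 4, 5] * 2,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})
last_cycles_demo = {"A01": 3}  # B01 is not listed, so it keeps all cycles


def truncate(data, last_cycles):
    """Keep rows up to and including the per-well last cycle (assumed inclusive)."""
    limit = data["well"].map(last_cycles)  # NaN for wells without a limit
    keep = limit.isna() | (data["cycle"] <= limit)
    return data[keep]


truncated = truncate(data, last_cycles_demo)
print(truncated.groupby("well")["cycle"].max())
```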

The feature extraction itself takes a while: in this case, roughly 3 minutes for all 48 wells.

In [4]:
extracted_features = features.from_bldata(
    bldata=bldata,
    extractors=extractors,
    last_cycles=last_cycles
)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done   1 tasks      | elapsed:    1.4s
[Parallel(n_jobs=6)]: Done   1 out of   1 | elapsed:    1.5s finished

### Show extracted data
The extracted data is a large `DataFrame`, indexed by well ID.
Each column name starts with the name of the filterset from which the feature was extracted, followed by a double underscore and the name of the extracted feature.

For `tsfresh`-derived features, you'll have to look up the meaning of the features in their documentation.
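
Because of this naming convention, columns can be grouped by their prefix. The sketch below uses a made-up miniature table with column names taken from this notebook's output (where the `tsfresh`-derived DO columns carry the prefix `DO_x`); the values are invented, and splitting at the first double underscore is an assumption about how you may want to group the columns, not a feature of `bletl_analysis`.

```python
import pandas as pd

# Made-up miniature of the result table; only the column names follow the notebook.
df = pd.DataFrame(
    {
        "BS10__max": [15.6, 114.0],
        "BS10__mean": [13.2, 43.8],
        "DO_x__variance": [0.0, 0.0],
        "DO_x__time_reversal_asymmetry_statistic__lag_1": [0.0, 0.0],
    },
    index=["A01", "A02"],
)

# Split only at the first double underscore: prefix vs. remaining feature name.
parts = [tuple(col.split("__", 1)) for col in df.columns]
df.columns = pd.MultiIndex.from_tuples(parts, names=["source", "feature"])

# Select every feature that belongs to one prefix.
bs_features = df["BS10"]
print(bs_features.columns.tolist())
```

For `tsfresh`-style names, the remainder after the prefix still contains double underscores that separate the feature calculator from its parameters, which helps when looking the feature up in the `tsfresh` documentation.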

In [5]:
extracted_features.head()

| | `BS10__inflection_point_t` | `BS10__inflection_point_y` | `BS10__mue_median` | `BS10__max` | `BS10__mean` | `BS10__median` | `BS10__min` | `BS10__span` | `BS10__stan_dev` | `BS10__time_max` | ... | `DO_x__symmetry_looking__r_0.9500000000000001` | `DO_x__time_reversal_asymmetry_statistic__lag_1` | `DO_x__time_reversal_asymmetry_statistic__lag_2` | `DO_x__time_reversal_asymmetry_statistic__lag_3` | `DO_x__value_count__value_-1` | `DO_x__value_count__value_0` | `DO_x__value_count__value_1` | `DO_x__variance` | `DO_x__variance_larger_than_standard_deviation` | `DO_x__variation_coefficient` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A01 | 15.58934 | 3.93831 | 1.53398 | 15.58934 | 13.236223 | 12.942862 | 11.786574 | 3.802765 | 1.13533 | 3.93831 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| A02 | 114.049172 | 22.02844 | 0.512498 | 114.049172 | 43.78911 | 37.938058 | 10.662952 | 103.38622 | 27.608253 | 22.02844 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| A03 | 158.104604 | 22.02891 | 0.013796 | 158.104604 | 52.730195 | 42.885822 | 11.25916 | 146.845445 | 39.335875 | 22.02891 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| A04 | 148.610349 | 21.13505 | 0.096564 | 169.490066 | 52.94719 | 40.735298 | 11.072086 | 158.41798 | 41.251284 | 22.02937 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| A05 | 129.941607 | 16.78825 | 0.367736 | 181.684883 | 68.253222 | 43.571497 | 11.010531 | 170.674352 | 59.208052 | 22.02988 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### What's next?
Many of the features are redundant.
At the same time, many features have `NaN` values, so consider applying `.dropna()` before continuing.
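
A minimal sketch of such a cleanup step on a made-up feature table is shown below. Note that `DataFrame.dropna()` drops rows (wells) by default; passing `axis="columns"` drops incomplete feature columns instead. Removing constant columns as well is an additional assumption about what is useful here, not something the library prescribes.

```python
import numpy as np
import pandas as pd

# Made-up feature table with one incomplete and one constant column.
df = pd.DataFrame(
    {
        "BS10__max": [15.6, 114.0, 158.1],
        "BS10__mue_median": [1.53, np.nan, 0.01],
        "DO_x__variance": [0.0, 0.0, 0.0],
    },
    index=["A01", "A02", "A03"],
)

# axis="columns" drops features with missing values; the default would drop wells.
complete = df.dropna(axis="columns")

# Constant columns carry no information for most downstream methods.
informative = complete.loc[:, complete.nunique() > 1]
print(informative.columns.tolist())
```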

Because many feature columns are highly redundant, consider applying dimensionality reduction techniques such as `PCA` to continue working with just a small set of non-redundant features.
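
As a rough sketch of what PCA does with redundant columns, the example below computes principal components with a plain NumPy singular value decomposition instead of a dedicated library class. The feature matrix is synthetic, and the choice of three components is an assumption that only fits this toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a cleaned feature matrix: 48 wells x 6 features,
# where the last 3 columns are linear combinations of the first 3 (redundant).
base = rng.normal(size=(48, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 3))])

# Standardize, because features can live on very different scales.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardized matrix.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)  # fraction of variance per component

n_components = 3
scores = U[:, :n_components] * S[:n_components]  # wells projected onto the components
print(explained.round(3))
```

In this toy case the first three components capture essentially all of the variance, because the remaining columns add no independent information.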

Depending on your dataset, advanced high-dimensional visualization techniques such as t-SNE or UMAP are worth exploring.