# Feature extraction & t-SNE embedding of BioLector data
This notebook runs time series feature extraction on a BioLector dataset.
The results are saved to Excel spreadsheets; the visualizations are done in the next notebook.

In [1]:
import pandas
import pathlib
import sklearn.preprocessing
import sklearn.manifold

import bletl
import bletl.features as feat

DP_DATA = pathlib.Path("..", "data")
DP_RESULTS = pathlib.Path("..", "results")

## Loading the Data

In [2]:
bldata = bletl.parse(DP_DATA / "8X4PF4.csv")
df_inductions = pandas.read_excel(
    DP_DATA / "8X4PF4_eventlog.xlsx",
    index_col=0, sheet_name="inductions"
).set_index("well").sort_index()
df_samplings = pandas.read_excel(
    DP_DATA / "8X4PF4_eventlog.xlsx",
    index_col=0, sheet_name="samplings"
).set_index("well").sort_index()

Here's a glimpse of a small part of the dataset:

In [3]:
bldata.get_unified_narrow_data()

Unnamed: 0,well,cycle,time,BS3,pH,DO
0,A01,1,0.013333,1.86,6.70,91.55
1,A01,2,0.229444,1.68,6.54,95.15
2,A01,3,0.446111,1.78,6.46,96.48
3,A01,4,0.662778,2.16,6.40,96.88
4,A01,5,0.879444,1.91,6.36,96.99
...,...,...,...,...,...,...
5371,F08,108,23.196111,4.21,6.28,100.44
5372,F08,109,23.412778,3.39,6.28,100.77
5373,F08,110,23.629444,3.25,6.28,101.03
5374,F08,111,23.846111,3.65,6.27,100.83


## Extract Timeseries Features
The time series feature extraction must be configured with a mapping of filterset names to `Extractor`s.

Here we'll extract statistical time series features with `tsfresh` and some handcrafted features with filterset-specific extractors.

In [4]:
df_features = feat.from_bldata(
    bldata=bldata,
    extractors={
        "BS3": [feat.TSFreshExtractor(), feat.BSFeatureExtractor()],
        "DO": [feat.TSFreshExtractor(), feat.DOFeatureExtractor()],
        "pH": [feat.TSFreshExtractor(), feat.pHFeatureExtractor()],
    },
    last_cycles=df_samplings.cycle.to_dict()
)

Feature Extraction: 100%|██████████| 12/12 [00:03<00:00,  3.37it/s]
Feature Extraction: 100%|██████████| 12/12 [00:04<00:00,  2.77it/s]
Feature Extraction: 100%|██████████| 12/12 [00:03<00:00,  3.45it/s]
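
The `last_cycles` argument above is built with `df_samplings.cycle.to_dict()`. As a minimal sketch of what that expression produces, the following uses a made-up sampling log (the wells and cycle numbers in `df_samplings_toy` are hypothetical, not taken from the dataset). The result is a plain dictionary mapping each well ID to a cycle number, which presumably tells the extraction where each well's time series should end.

```python
import pandas

# Hypothetical sampling log: one row per well with the cycle at which it was sampled.
df_samplings_toy = (
    pandas.DataFrame({"well": ["B01", "A01", "A02"], "cycle": [95, 111, 87]})
    .set_index("well")
    .sort_index()
)

# Attribute access on the "cycle" column gives a Series indexed by well;
# to_dict() turns it into a {well: cycle} mapping.
last_cycles_toy = df_samplings_toy.cycle.to_dict()
print(last_cycles_toy)
```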


## Clean Extracted Features
This step removes features that are not available for all wells (columns containing missing values) and features that take the same value in every well.

In [5]:
df_features_clean = df_features.dropna(axis="columns")
n_unique = df_features_clean.apply(pandas.Series.nunique)
cols_to_drop = n_unique[n_unique == 1].index
df_features_clean = df_features_clean.drop(cols_to_drop, axis=1)
df_features_clean

Unnamed: 0,BS3__inflection_point_t,BS3__inflection_point_y,BS3__mue_median,DO__peak,pH__sum_of_increase,pH__sum_of_reduction,BS3_x__has_duplicate_max,BS3_x__has_duplicate_min,BS3_x__has_duplicate,BS3_x__sum_values,...,pH_x__permutation_entropy__dimension_4__tau_1,pH_x__permutation_entropy__dimension_5__tau_1,pH_x__permutation_entropy__dimension_6__tau_1,pH_x__permutation_entropy__dimension_7__tau_1,"pH_x__matrix_profile__feature_""min""__threshold_0.98","pH_x__matrix_profile__feature_""max""__threshold_0.98","pH_x__matrix_profile__feature_""mean""__threshold_0.98","pH_x__matrix_profile__feature_""median""__threshold_0.98","pH_x__matrix_profile__feature_""25""__threshold_0.98","pH_x__matrix_profile__feature_""75""__threshold_0.98"
A01,17.85,11.496111,0.495787,0.0,0.524566,-0.490245,0.0,0.0,1.0,397.43,...,0.626937,0.690861,0.754997,0.819097,0.475635,8.645548,2.001659,0.894559,0.663311,1.587994
A02,9.32,8.463333,0.131177,0.0,0.586165,-0.395291,0.0,0.0,0.0,325.01,...,0.989098,1.062023,1.135197,1.208268,0.904067,8.578908,2.556828,1.435769,1.115151,2.66158
A03,10.14,10.196944,0.743782,0.0,0.779759,-0.365782,0.0,0.0,1.0,356.95,...,0.67556,0.742903,0.810924,0.879488,0.789715,8.350765,2.561316,1.493815,1.075941,2.859836
A04,13.38,13.013889,0.403213,0.0,0.603041,-0.426502,0.0,0.0,1.0,326.94,...,0.817194,0.885017,0.953361,1.022048,0.973627,8.38361,2.762543,1.633358,1.203547,3.168017
A05,10.08,8.030833,0.662264,0.0,0.863229,-0.140606,0.0,0.0,1.0,303.2,...,0.292884,0.297586,0.302455,0.307502,0.387117,2.241657,0.691394,0.513123,0.431758,0.758468
A06,1.41,0.015,0.86655,0.0,0.73496,-0.289212,0.0,0.0,1.0,454.37,...,1.177105,1.338234,1.464286,1.532229,0.52916,6.254409,2.261133,2.027962,1.178704,3.062448
A07,9.34,8.031667,0.624908,0.0,0.527528,-0.137426,0.0,0.0,1.0,301.27,...,0.292884,0.297586,0.302455,0.307502,0.480282,3.122913,0.703871,0.562041,0.520249,0.677159
A08,12.08,8.465278,0.626441,0.0,0.592128,-0.367008,0.0,0.0,1.0,347.5,...,0.650841,0.721734,0.792848,0.863815,0.766995,7.886994,2.164387,1.281558,0.932588,2.2925
B01,1.66,0.018333,0.35781,0.0,0.513921,-0.187477,0.0,0.0,1.0,282.11,...,1.135117,1.362043,1.446492,1.487422,1.309531,7.538303,2.8589,2.356083,1.722689,3.414224
B02,11.58,9.117778,0.557006,0.0,0.394071,-0.431162,0.0,1.0,1.0,319.9,...,0.704274,0.779427,0.855459,0.932186,1.098079,8.168846,3.18436,1.617887,1.433289,5.003883
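
To make the effect of the cleaning step easier to see, here is a small sketch on a toy table. The column names and values in `df_toy` are invented for illustration; only the two filtering operations mirror the cell above. `DataFrame.nunique()` is used here as an equivalent spelling of `apply(pandas.Series.nunique)`.

```python
import numpy
import pandas

# Toy feature table: one informative column, one with a missing value,
# and one that is constant across all wells.
df_toy = pandas.DataFrame(
    {
        "feat_ok": [0.5, 0.1, 0.7],
        "feat_missing": [1.0, numpy.nan, 3.0],
        "feat_constant": [0.0, 0.0, 0.0],
    },
    index=["A01", "A02", "A03"],
)

# 1. Drop every column that has at least one missing value.
df_toy_clean = df_toy.dropna(axis="columns")

# 2. Keep only columns with more than one distinct value.
n_unique_toy = df_toy_clean.nunique()
df_toy_clean = df_toy_clean.loc[:, n_unique_toy > 1]

print(list(df_toy_clean.columns))
```

Only `feat_ok` survives both filters: `feat_missing` is removed by `dropna` and `feat_constant` by the uniqueness check.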


## Export to XLSX

In [6]:
with pandas.ExcelWriter(DP_RESULTS / "8X4PF4_extracted_features.xlsx") as writer:
    df_features.to_excel(writer, sheet_name="raw")
    df_features_clean.to_excel(writer, sheet_name="clean")

## t-SNE Embedding
The t-SNE embedding is calculated from the cleaned features after standardizing each feature to zero mean and unit standard deviation.

In [7]:
X = df_features_clean.values
X_scaled = sklearn.preprocessing.StandardScaler().fit_transform(X)
tsne = sklearn.manifold.TSNE(
    perplexity=10,
    init="pca",
    verbose=1,
    n_iter=10_000,
    random_state=20210302
)
X_tsne = tsne.fit_transform(X_scaled)

[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 48 samples in 0.000s...
[t-SNE] Computed neighbors for 48 samples in 0.004s...
[t-SNE] Computed conditional probabilities for sample 48 / 48
[t-SNE] Mean sigma: 15.628556
[t-SNE] KL divergence after 250 iterations with early exaggeration: 63.352573
[t-SNE] KL divergence after 1600 iterations: 0.253640
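
As a rough illustration of the standardization applied before the embedding, the sketch below reproduces with plain NumPy what `StandardScaler` does under its default settings: subtract each column's mean and divide by its population standard deviation. The array `X_toy` is made up for the example. A constant column would give a zero standard deviation here, so this sketch assumes such columns have already been removed.

```python
import numpy

# Toy feature matrix: 3 wells (rows) x 2 features (columns) on very different scales.
X_toy = numpy.array(
    [
        [1.0, 10.0],
        [2.0, 20.0],
        [3.0, 60.0],
    ]
)

# Column-wise mean and population standard deviation (ddof=0).
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)

# After scaling, every column has mean 0 and standard deviation 1,
# so no single feature dominates the distances that t-SNE works with.
X_toy_scaled = (X_toy - col_mean) / col_std
print(X_toy_scaled.mean(axis=0), X_toy_scaled.std(axis=0))
```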


In [8]:
# Save the embedding to XLSX for plotting in a separate notebook
df_embedding = pandas.DataFrame(index=df_features_clean.index, columns=["tsne_1", "tsne_2"], data=X_tsne)
df_embedding.index.name = "well"
df_embedding.to_excel(DP_RESULTS / "8X4PF4_embedding.xlsx")
df_embedding

well,tsne_1,tsne_2
A01,89.923141,12.997733
A02,-31.580858,-36.853664
A03,21.383499,75.04245
A04,11.664825,92.818901
A05,-99.802055,-96.073647
A06,-89.291649,6.818706
A07,-83.33979,-90.518616
A08,-2.714135,-103.425026
B01,-74.092125,-10.023693
B02,9.355048,-16.564596


In [9]:
%load_ext watermark
%watermark

Last updated: 2021-07-30T21:44:58.997483+02:00

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.25.0

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
CPU cores   : 6
Architecture: 64bit

