This notebook handles feature extraction for the intrinsic and task data. Both data types are processed with the tsfresh library, which extracts features from time series data automatically.

In [1]:
import os
import pandas as pd
import numpy as np
from tsfresh import extract_features, select_features
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters
from tsfresh.utilities.dataframe_functions import impute, make_forecasting_frame

The values were stored in an HDF5 file to simplify storage, so this file has to be loaded first. HDF5 can store multiple datasets within one file, so a key specifies which dataset to read. This file should not hold any datasets other than the one we need, but the previous project specifies that the key 'dataset' should be used.

In [2]:
# Read the DataFrame from the HDF5 file
df = pd.read_hdf(r"C:\Users\kaspe\Documents\GitHub\AAU-IoT-Solution-AI-REDGIO\data_ozren\Dataset.h5", key='dataset')

In [3]:
print(df)

                  Source  Time (ms)          Type  Value
0          i030520235006      0.000  Nset (1/min)  0.000
1          i030520235068      0.000  Nset (1/min)  0.000
2          i050520238018      0.000   Angle (deg)  0.000
3          i030520237070      0.000   Angle (deg)  0.000
4          i280420232085      0.000   Angle (deg)  0.000
...                  ...        ...           ...    ...
33375222  t1005202314051   5001.519    TCP_y (mm)  2.600
33375223  t1005202314051   5001.519    TCP_x (mm)  0.000
33375224  t1005202314051   5001.519  TCP_rz (rad) -1.204
33375225  t1005202314051   5001.519    TCP_z (mm) -4.600
33375226  t1005202314051   5001.519  TCP_rx (rad)  1.234

[33375227 rows x 4 columns]


We start by creating the working dataframe from which the features will be extracted. Initially it refers to the full dataset loaded from the HDF5 file.

In [4]:
df_intrinsic = df

Next, we can choose whether to extract features from the intrinsic data only, from the task data only, or from both at once.

In [5]:
# Optional: uncomment to extract features from one part of the data only.
# Use 'i' for intrinsic robot data or 't' for screwdriver task data.
#df_intrinsic = df[df['Source'].str.startswith('i')]

print(df_intrinsic)

                  Source  Time (ms)          Type  Value
0          i030520235006      0.000  Nset (1/min)  0.000
1          i030520235068      0.000  Nset (1/min)  0.000
2          i050520238018      0.000   Angle (deg)  0.000
3          i030520237070      0.000   Angle (deg)  0.000
4          i280420232085      0.000   Angle (deg)  0.000
...                  ...        ...           ...    ...
33375222  t1005202314051   5001.519    TCP_y (mm)  2.600
33375223  t1005202314051   5001.519    TCP_x (mm)  0.000
33375224  t1005202314051   5001.519  TCP_rz (rad) -1.204
33375225  t1005202314051   5001.519    TCP_z (mm) -4.600
33375226  t1005202314051   5001.519  TCP_rx (rad)  1.234

[33375227 rows x 4 columns]
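
The optional filter above can be tried on a small frame first. This is a minimal sketch on made-up data: the frame `toy` and its values are hypothetical, and only the column names and the 'i'/'t' prefixes come from the dataset shown above.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as the dataset (values are made up)
toy = pd.DataFrame({
    "Source": ["i0001", "i0001", "t0001", "t0001"],
    "Time (ms)": [0.0, 1.0, 0.0, 1.0],
    "Type": ["Angle (deg)", "Angle (deg)", "TCP_x (mm)", "TCP_x (mm)"],
    "Value": [0.0, 0.5, 0.0, 0.1],
})

# Keep only intrinsic ('i') or only task ('t') recordings.
# .copy() gives an independent frame, which avoids chained-assignment
# warnings when the subset is modified in later cells.
intrinsic_only = toy[toy["Source"].str.startswith("i")].copy()
task_only = toy[toy["Source"].str.startswith("t")].copy()

print(len(intrinsic_only), len(task_only))
```

Taking a copy is a design choice rather than a requirement; it matters here because a later cell assigns to the id column of the working dataframe.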


tsfresh expects its input in a specific column layout to extract features automatically: one column each for the series id, the time, the kind of measurement, and the value. We therefore rename the columns accordingly, for example Source to id.

In [6]:
# Rename columns to meet tsfresh requirements
df_intrinsic = df_intrinsic.rename(columns={'Source': 'id', 'Time (ms)': 'time', 'Type': 'kind', 'Value': 'value'})

print(df_intrinsic)


                      id      time          kind  value
0          i030520235006     0.000  Nset (1/min)  0.000
1          i030520235068     0.000  Nset (1/min)  0.000
2          i050520238018     0.000   Angle (deg)  0.000
3          i030520237070     0.000   Angle (deg)  0.000
4          i280420232085     0.000   Angle (deg)  0.000
...                  ...       ...           ...    ...
33375222  t1005202314051  5001.519    TCP_y (mm)  2.600
33375223  t1005202314051  5001.519    TCP_x (mm)  0.000
33375224  t1005202314051  5001.519  TCP_rz (rad) -1.204
33375225  t1005202314051  5001.519    TCP_z (mm) -4.600
33375226  t1005202314051  5001.519  TCP_rx (rad)  1.234

[33375227 rows x 4 columns]


Because these features are later used for training together with the extrinsic features, the leading "i" or "t" of each identifier is replaced with "id" so that entries from the extrinsic, task, and intrinsic data can be matched. Note that this code uses the name "intrinsic" for both the intrinsic and the task data.

In [7]:
# Replace 't' or 'i' at the start of the id column with 'id'
df_intrinsic['id'] = df_intrinsic['id'].str.replace(r'^(t|i)', 'id', regex=True).str.replace(r'^id+', 'id', regex=True)

print(df_intrinsic)

                       id      time          kind  value
0          id030520235006     0.000  Nset (1/min)  0.000
1          id030520235068     0.000  Nset (1/min)  0.000
2          id050520238018     0.000   Angle (deg)  0.000
3          id030520237070     0.000   Angle (deg)  0.000
4          id280420232085     0.000   Angle (deg)  0.000
...                   ...       ...           ...    ...
33375222  id1005202314051  5001.519    TCP_y (mm)  2.600
33375223  id1005202314051  5001.519    TCP_x (mm)  0.000
33375224  id1005202314051  5001.519  TCP_rz (rad) -1.204
33375225  id1005202314051  5001.519    TCP_z (mm) -4.600
33375226  id1005202314051  5001.519  TCP_rx (rad)  1.234

[33375227 rows x 4 columns]
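
The two chained replacements can be checked on a few identifiers. The sketch below reuses the patterns from the cell above on three sample ids; the third one already starts with "id" and is a hypothetical input added to show why the second replacement is there.

```python
import pandas as pd

# Two ids in the dataset's format plus one hypothetical id that is already normalized
ids = pd.Series(["i030520235006", "t1005202314051", "id280420232085"])

normalized = (
    ids.str.replace(r"^(t|i)", "id", regex=True)  # 'i...' -> 'id...', but 'id...' -> 'idd...'
       .str.replace(r"^id+", "id", regex=True)    # collapse 'idd...' back to 'id...'
)

print(normalized.tolist())
# ['id030520235006', 'id1005202314051', 'id280420232085']
```

The second replacement makes the step safe to run more than once: re-running the cell on ids that already start with "id" leaves them unchanged.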


Now the feature extraction can start. We chose the EfficientFCParameters setting for this.

In [8]:
# Define feature extraction settings (switch to MinimalFCParameters for a quick test run)
#settings = MinimalFCParameters()
settings = EfficientFCParameters()

# Extract features for every (id, kind) series, sorted by time
extracted_features = extract_features(df_intrinsic, column_id="id", column_kind="kind", column_sort="time", column_value="value", default_fc_parameters=settings)

Feature Extraction: 100%|██████████| 40/40 [13:33<00:00, 20.35s/it]  
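
To make the shape of the result easier to picture, here is a hand-rolled sketch in plain pandas. It is not how tsfresh is implemented and it computes only two simple statistics on made-up data; it just mirrors the idea of one row per id and one column per `<kind>__<feature>` pair, as seen in the printed feature names further down.

```python
import pandas as pd

# Hypothetical long-format toy data in the renamed column layout
toy = pd.DataFrame({
    "id": ["id1", "id1", "id1", "id2", "id2", "id2"],
    "time": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "kind": ["Angle (deg)"] * 6,
    "value": [0.0, 1.0, 2.0, 3.0, 3.0, 3.0],
})

# Sort each series by time, then aggregate the values per (id, kind)
stats = (
    toy.sort_values(["id", "kind", "time"])
       .groupby(["id", "kind"])["value"]
       .agg(["mean", "max"])
)

# Pivot the kinds into columns so that each id becomes a single row
wide = stats.unstack("kind")
wide.columns = [f"{kind}__{stat}" for stat, kind in wide.columns]

print(wide)
```

Each additional measurement kind would add its own set of columns, which is why the real feature table is very wide.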


The extraction can produce infinitely large or small feature values for several reasons, and these cause errors later on. They are therefore converted to NaN (not a number). Next, we drop the feature columns that consist only of NaN values. The remaining columns may still contain individual NaN values, which would also cause errors in later stages of the training process, so we impute them: tsfresh's impute replaces each NaN with the median of its column.

In [9]:
# Replace infinite values with NaN
imputed_features = extracted_features.replace([np.inf, -np.inf], np.nan)

# Drop columns with only NaN values
imputed_features = imputed_features.dropna(axis=1, how="all")

# Impute the remaining missing (NaN) values column by column
imputed_features = impute(imputed_features)
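
The three cleaning steps can be followed on a tiny table. This is a pandas-only sketch on made-up values; filling with the column median is used here as an assumption standing in for `impute`, so that the example runs without tsfresh.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: 'a' has an infinite value, 'b' is all NaN, 'c' has one NaN
features = pd.DataFrame(
    {
        "a": [1.0, np.inf, 3.0],
        "b": [np.nan, np.nan, np.nan],
        "c": [2.0, np.nan, 4.0],
    },
    index=["id1", "id2", "id3"],
)

# Step 1: infinite values become NaN
cleaned = features.replace([np.inf, -np.inf], np.nan)

# Step 2: columns without any usable value are dropped ('b')
cleaned = cleaned.dropna(axis=1, how="all")

# Step 3: remaining NaN values are filled with the column median (stand-in for impute)
cleaned = cleaned.fillna(cleaned.median())

print(cleaned)
```

Dropping the all-NaN columns before imputing matters: otherwise those columns would survive as constant filler values and carry no information into training.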


In [10]:
print(imputed_features)

                TCP_rz (rad)__variance_larger_than_standard_deviation  \
id030520234000                                                0.0       
id030520234001                                                0.0       
id030520234002                                                0.0       
id030520234003                                                0.0       
id030520234004                                                0.0       
...                                                           ...       
id280420233082                                                0.0       
id280420233083                                                0.0       
id280420233084                                                0.0       
id280420233085                                                0.0       
id280420233086                                                0.0       

                TCP_rz (rad)__has_duplicate_max  \
id030520234000                              1.0   
id030520234001       

Lastly, we export the feature dataframe to a new CSV file.

In [11]:
imputed_features.to_csv("tsfresh_efficient_features.csv")
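
When the exported file is read back later, the ids are stored in the first column and need to be restored as the index. A minimal sketch, using an in-memory buffer and a made-up one-column table instead of the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for the feature table, indexed by id
features = pd.DataFrame({"Angle (deg)__mean": [1.0, 3.0]}, index=["id1", "id2"])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
features.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first CSV column (the ids) back into the index
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

With the real export, the buffer would be replaced by the file name passed to `to_csv` above.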