First, go through the example_run_me.ipynb notebook for an overview of how to use the pre-built datasets. To add your own dataset, start by loading it from its source. Here we use an artificial dataset with one time series and one feature for illustration purposes. Your dataset must have a column that holds the timestamp of each record. If your original dataset has no timestamp column, you can add an artificial one, as illustrated below.

In [7]:
import pandas as pd
import numpy as np

feature = np.random.normal(loc=0, scale=1, size=10000)

start_datetime = pd.Timestamp("2000-01-01 00:01:40")
datetime_column = pd.date_range(start=start_datetime, periods=feature.shape[0], freq='D')

df = pd.DataFrame({
    'date': datetime_column,
    'feature': feature
})

print(df.shape)
df.head()

(10000, 2)


                 date   feature
0 2000-01-01 00:01:40  0.473861
1 2000-01-02 00:01:40  1.368450
2 2000-01-03 00:01:40 -0.916827
3 2000-01-04 00:01:40 -0.124147
4 2000-01-05 00:01:40 -2.010963
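If your data already exists as a DataFrame without timestamps, the same idea applies: generate a regular date range of matching length and attach it as a column. The following is a minimal sketch, assuming a hypothetical frame `raw_df`; the start date and the daily frequency are arbitrary choices, not framework requirements.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with no timestamp column (values are random placeholders).
rng = np.random.default_rng(seed=0)
raw_df = pd.DataFrame({"feature": rng.normal(loc=0, scale=1, size=5)})

# Attach an artificial, evenly spaced timestamp as the first column.
# Start date and frequency are arbitrary; pick whatever suits your domain.
artificial_dates = pd.date_range(start="2000-01-01", periods=len(raw_df), freq="D")
raw_df.insert(0, "date", artificial_dates)

print(raw_df.dtypes)
```

Any strictly increasing, regularly spaced timestamp should work for this purpose, as long as the column name matches the one you later register in the dataset dictionary.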


After loading the time series of your dataset, create four lists: target_data, target_sources, historic_data and historic_sources. The variable names do not matter here, but the keys in the resulting dictionary that contains the dataset **do matter**.

* 'target_data' contains the time-series data we want to evaluate on
* 'target_sources' contains an identifier for each time series, which can be artificial or real (for example, the unique identifier of the asset the time series originates from, such as the sensors of a specific vehicle)
* 'historic_data' contains healthy (or clean) historic data for each source, which we know or assume to be free of failures
* 'historic_sources' maps each healthy historic time series to its source, one-to-one

Here we have no historic data available, so we create two empty lists. Note that without historic data the semi-supervised flavor that relies on historic data cannot be executed.

In [8]:
target_data = [df]
target_sources = ['source_1']
historic_data = []
historic_sources = []
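When several sources and some healthy data are available, the four lists grow in parallel. The sketch below is only an illustration: the helper names `make_series` and `split_sources`, the source names, and the choice to treat the first 100 records of each source as healthy are all hypothetical. Only the four-list layout and the one-to-one pairing come from the description above.

```python
import numpy as np
import pandas as pd

def make_series(n_records, seed):
    """Hypothetical helper: toy single-feature time series with daily timestamps."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "date": pd.date_range("2000-01-01", periods=n_records, freq="D"),
        "feature": rng.normal(size=n_records),
    })

def split_sources(frames_by_source, n_healthy):
    """Hypothetical helper: build the four lists from a {source: DataFrame} mapping.

    Assumption for illustration: the first `n_healthy` records of each source
    are known to be failure-free and can serve as historic data.
    """
    target_data, target_sources = [], []
    historic_data, historic_sources = [], []
    for source, frame in frames_by_source.items():
        historic_data.append(frame.iloc[:n_healthy].reset_index(drop=True))
        historic_sources.append(str(source))
        target_data.append(frame.iloc[n_healthy:].reset_index(drop=True))
        target_sources.append(str(source))
    return target_data, target_sources, historic_data, historic_sources

# Two made-up sources of different lengths.
frames = {"vehicle_1": make_series(500, seed=1), "vehicle_2": make_series(400, seed=2)}
multi_target, multi_target_src, multi_historic, multi_historic_src = split_sources(frames, 100)

print(multi_target_src, [len(f) for f in multi_target])
```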

After creating these four lists, define where the failures occur and which failures you are interested in for your evaluation. In a run-to-failure scenario (as in our case), where the failure occurs at the end of each time series, you can pass an empty dataframe for 'event_data' and empty lists in 'event_preferences'.

In [9]:
from pdm_evaluation_types.types import EventPreferences, EventPreferencesTuple

event_data = pd.DataFrame(columns=["date", "type", "source", "description"])

event_preferences: EventPreferences = {
    'failure': [],
    'reset': []
}

If you have multiple failures per source, you can create event_preferences as shown below. '*' means that all available events will be matched, and '=' means that the rules apply only to the source in which the event occurred. This functionality exists because in some cases it is beneficial to apply rules to multiple sources regardless of the source in which the event occurred, for example hard disk drives with different ids but the same manufacturer and model.

In event_data, the columns 'description', 'type' and 'source' should be of type str (even for sources that could be cast to int, for example '1'), and 'date' should be of type datetime, like the timestamp we created for the artificial dataset in this example.

In [10]:
# event_preferences: EventPreferences = {
#     'failure': [
#         EventPreferencesTuple(description='*', type='fail', source='*', target_sources='=')
#     ],
#     'reset': [
#         EventPreferencesTuple(description='*', type='reset', source='*', target_sources='='),
#         EventPreferencesTuple(description='*', type='fail', source='*', target_sources='=')
#     ]
# }
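To make the dtype requirements for event_data concrete, here is a minimal sketch with made-up events. The dates, descriptions and source ids are hypothetical; the column names and the 'fail'/'reset' type values mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical events: two failures and one reset across two sources.
event_data_example = pd.DataFrame({
    "date": pd.to_datetime(["2000-03-01", "2000-06-15", "2000-06-20"]),
    "type": ["fail", "reset", "fail"],
    "source": [1, 1, 2],  # integer ids, as they might appear in a raw log
    "description": ["bearing failure", "scheduled maintenance", "overheating"],
})

# 'type', 'source' and 'description' should hold str values,
# even when the source ids look like integers.
for column in ["type", "source", "description"]:
    event_data_example[column] = event_data_example[column].astype(str)

print(event_data_example.dtypes)
```

Casting the source ids once, right after loading, avoids mismatches with the str identifiers stored in target_sources.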

After defining the event-related variables, you can create a dictionary that represents your dataset and use it as in the example_run_me.ipynb notebook.

Information about the other keys:
* predictive_horizon is the period before a failure in which we want an alarm to be raised
* slide is the sliding window used by VUS (Volume Under the Surface)
* lead is the period right before a failure in which an alarm comes too late to prevent it
* beta is the beta of the F-score that is computed, for example beta=1 gives the F1-score
* min_historic_scenario_len is the minimum time-series length in the historic data
* min_target_scenario_len is the minimum time-series length in the target data
* max_wait_time is the maximum period we can tolerate without an alarm being raised

All of the above values are dataset/domain dependent and are usually determined by domain experts such as engineers.
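As a reminder of what beta controls, the standard F-beta score treats recall as beta times as important as precision. The sketch below implements the textbook formula only; it is not the framework's evaluator, and the function name `f_beta` is hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """Textbook F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero; define the score as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

# With beta=1 precision and recall count equally (F1-score);
# with beta=2 the lower recall pulls the score down further.
print(f_beta(1.0, 0.5, beta=1))
print(f_beta(1.0, 0.5, beta=2))
```

A beta above 1 favors catching failures over avoiding false alarms, and a beta below 1 does the opposite.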

In [11]:
import math
import sys

dataset = {}
# Name of the timestamp column in every dataframe
dataset["dates"] = "date"
dataset["event_preferences"] = event_preferences
dataset["event_data"] = event_data
dataset["target_data"] = target_data
dataset["target_sources"] = target_sources
dataset["historic_data"] = historic_data
dataset["historic_sources"] = historic_sources
dataset["predictive_horizon"] = '100 days'
dataset["slide"] = 50
dataset["lead"] = '2 days'
dataset["beta"] = 1
# No historic data here, so a placeholder value is used
dataset["min_historic_scenario_len"] = sys.maxsize
# Length of the shortest target time series
dataset["min_target_scenario_len"] = min(frame.shape[0] for frame in target_data)
# One third of the shortest target time series, rounded up
dataset["max_wait_time"] = math.ceil((1/3) * dataset["min_target_scenario_len"])
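Because the dictionary keys matter, it can help to check the dictionary before starting a long experiment. The helper below is a hypothetical sanity check, not part of the framework. It only assumes the keys used in this notebook, and it verifies that each data list lines up with its list of sources and that every dataframe has the registered timestamp column.

```python
import pandas as pd

# Keys used in this notebook's dataset dictionary.
REQUIRED_KEYS = {
    "dates", "event_preferences", "event_data",
    "target_data", "target_sources", "historic_data", "historic_sources",
    "predictive_horizon", "slide", "lead", "beta",
    "min_historic_scenario_len", "min_target_scenario_len", "max_wait_time",
}

def check_dataset(dataset):
    """Hypothetical sanity check; raises on the first problem found."""
    missing = REQUIRED_KEYS - dataset.keys()
    if missing:
        raise KeyError(f"missing dataset keys: {sorted(missing)}")
    date_column = dataset["dates"]
    for kind in ("target", "historic"):
        frames = dataset[f"{kind}_data"]
        sources = dataset[f"{kind}_sources"]
        if len(frames) != len(sources):
            raise ValueError(f"{kind}_data and {kind}_sources differ in length")
        for frame in frames:
            if date_column not in frame.columns:
                raise ValueError(f"a {kind} dataframe lacks the {date_column!r} column")
    return True

# Toy dictionary with placeholder values, just to exercise the check.
toy_frame = pd.DataFrame({
    "date": pd.date_range("2000-01-01", periods=3, freq="D"),
    "feature": [0.1, 0.2, 0.3],
})
toy_dataset = {key: None for key in REQUIRED_KEYS}
toy_dataset.update({
    "dates": "date",
    "target_data": [toy_frame], "target_sources": ["source_1"],
    "historic_data": [], "historic_sources": [],
})
print(check_dataset(toy_dataset))
```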

In [12]:
from experiment.batch.auto_profile_semi_supervised_experiment import AutoProfileSemiSupervisedPdMExperiment
from constraint_functions.constraint import auto_profile_max_wait_time_constraint
from pipeline.pipeline import PdMPipeline
from method.ocsvm import OneClassSVM

from preprocessing.record_level.default import DefaultPreProcessor
from postprocessing.default import DefaultPostProcessor
from thresholding.constant import ConstantThresholder

my_pipeline = PdMPipeline(
    steps={
        'preprocessor': DefaultPreProcessor,
        'method': OneClassSVM,
        'postprocessor': DefaultPostProcessor,
        'thresholder': ConstantThresholder,
    },
    dataset=dataset,
    auc_resolution=100
)

param_space = {
    'method_kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'method_nu': [0.01, 0.05, 0.1, 0.15, 0.2, 0.5],
    'method_gamma': ['scale', 'auto'],
    'method_degree': [2, 3, 4, 5],
}

param_space['profile_size'] = [50, 100, 150, 200]

my_experiment = AutoProfileSemiSupervisedPdMExperiment(
    experiment_name='my first experiment with a custom dataset',
    pipeline=my_pipeline,
    param_space=param_space,
    num_iteration=4,
    n_jobs=4,
    initial_random=4,
    constraint_function=auto_profile_max_wait_time_constraint(my_pipeline),
    debug=True,
    optimization_param='VUS_AUC_PR'
)

best_params = my_experiment.execute()
print(best_params)

rm -f main.o evaluator.o evaluate
c++ -fPIC -Wall -std=c++11 -O2 -g   -c -o main.o main.cpp
c++ -fPIC -Wall -std=c++11 -O2 -g   -c -o evaluator.o evaluator.cpp
c++ -fPIC -Wall -std=c++11 -O2 -g   -o evaluate main.o evaluator.o
{'nu': 0.05, 'kernel': 'linear', 'gamma': 'auto', 'degree': 2, 'profile_size': 150}
{}
{'nu': 0.01, 'kernel': 'poly', 'gamma': 'scale', 'degree': 4, 'profile_size': 200}
{'nu': 0.1, 'kernel': 'linear', 'gamma': 'auto', 'degree': 2, 'profile_size': 150}
{'nu': 0.05, 'kernel': 'sigmoid', 'gamma': 'scale', 'degree': 3, 'profile_size': 150}
{}{}

{}


  0%|          | 0/4 [00:00<?, ?it/s]

{'degree': 2, 'gamma': 'scale', 'kernel': 'poly', 'nu': 0.5, 'profile_size': 200}
{'degree': 5, 'gamma': 'auto', 'kernel': 'rbf', 'nu': 0.05, 'profile_size': 50}
{'degree': 2, 'gamma': 'scale', 'kernel': 'linear', 'nu': 0.5, 'profile_size': 50}
{}
{}
{}
{'degree': 5, 'gamma': 'scale', 'kernel': 'rbf', 'nu': 0.15, 'profile_size': 150}
{}


Best score: 0.012224784455107704:  25%|██▌       | 1/4 [00:07<00:22,  7.42s/it]

{'degree': 2, 'gamma': 'auto', 'kernel': 'rbf', 'nu': 0.01, 'profile_size': 100}
{'degree': 4, 'gamma': 'scale', 'kernel': 'sigmoid', 'nu': 0.1, 'profile_size': 150}
{'degree': 3, 'gamma': 'scale', 'kernel': 'linear', 'nu': 0.5, 'profile_size': 150}
{'degree': 3, 'gamma': 'auto', 'kernel': 'rbf', 'nu': 0.2, 'profile_size': 150}
{}
{}
{}
{}


Best score: 0.012224784455107704:  50%|█████     | 2/4 [00:14<00:14,  7.43s/it]

{'degree': 4, 'gamma': 'scale', 'kernel': 'sigmoid', 'nu': 0.01, 'profile_size': 150}
{'degree': 3, 'gamma': 'auto', 'kernel': 'sigmoid', 'nu': 0.15, 'profile_size': 50}
{'degree': 2, 'gamma': 'scale', 'kernel': 'sigmoid', 'nu': 0.15, 'profile_size': 200}
{'degree': 2, 'gamma': 'auto', 'kernel': 'linear', 'nu': 0.01, 'profile_size': 200}
{}
{}
{}
{}


Best score: 0.012224784455107704:  75%|███████▌  | 3/4 [00:22<00:07,  7.56s/it]

{'degree': 5, 'gamma': 'scale', 'kernel': 'sigmoid', 'nu': 0.15, 'profile_size': 200}
{'degree': 2, 'gamma': 'auto', 'kernel': 'rbf', 'nu': 0.1, 'profile_size': 150}
{'degree': 3, 'gamma': 'auto', 'kernel': 'linear', 'nu': 0.5, 'profile_size': 50}
{'degree': 5, 'gamma': 'scale', 'kernel': 'sigmoid', 'nu': 0.15, 'profile_size': 100}
{}
{}
{}
{}


Best score: 0.012224784455107704: 100%|██████████| 4/4 [00:30<00:00,  7.75s/it]

{'profile_size': 150, 'method_nu': 0.05, 'method_kernel': 'sigmoid', 'method_gamma': 'scale', 'method_degree': 3}
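To put the result in context, the sketch below counts how many configurations the param_space above contains and separates the returned parameters by prefix. Reading the `method_` prefix as "belongs to the method step" is an assumption inferred from the naming in this notebook, not a documented rule. The variable names ending in `_example` are hypothetical copies of the values shown above.

```python
import math

# Copy of the search space defined in the experiment cell.
param_space_example = {
    'method_kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'method_nu': [0.01, 0.05, 0.1, 0.15, 0.2, 0.5],
    'method_gamma': ['scale', 'auto'],
    'method_degree': [2, 3, 4, 5],
    'profile_size': [50, 100, 150, 200],
}

# Size of the full grid: product of the number of options per parameter.
n_configurations = math.prod(len(values) for values in param_space_example.values())
print(n_configurations)  # 4 * 6 * 2 * 4 * 4 = 768

# Copy of the best parameters printed at the end of the run.
best_params_example = {
    'profile_size': 150, 'method_nu': 0.05, 'method_kernel': 'sigmoid',
    'method_gamma': 'scale', 'method_degree': 3,
}

# Assumption: 'method_'-prefixed keys target the method step; the rest do not.
prefix = 'method_'
method_kwargs = {k[len(prefix):]: v for k, v in best_params_example.items() if k.startswith(prefix)}
other_kwargs = {k: v for k, v in best_params_example.items() if not k.startswith(prefix)}
print(method_kwargs)
print(other_kwargs)
```

The debug output above lists about 20 sampled configurations, a small fraction of the 768 possible ones, so the reported best parameters are the best among those tried rather than over the whole grid.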



