## General Notes:

**Online setting:** Our framework is designed for methods that work in online settings, specifically streaming applications. Methods can run in a real-time streaming environment or in batch mode that follows streaming logic. **The key principle is that when calculating anomaly scores (or features, or any other data associated with a timestamp t2), we only use data with timestamps earlier than t2.** For experimental purposes (and easier implementation), we use batch processing without violating this principle.
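
The principle above can be illustrated with a small pandas sketch. It computes a rolling-mean feature in batch mode while making sure the value at each timestamp depends only on earlier rows. The function name `causal_rolling_mean` and the sample series are illustrative assumptions, not part of the framework.

```python
import pandas as pd


def causal_rolling_mean(values: pd.Series, window: int) -> pd.Series:
    """Rolling mean over the `window` rows strictly before each timestamp (hypothetical helper)."""
    # shift(1) removes the current row from its own window,
    # so the feature at time t only sees rows earlier than t.
    return values.shift(1).rolling(window, min_periods=1).mean()


ts = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2024-01-01", periods=4, freq="h"),
)
feat = causal_rolling_mean(ts, window=2)
# The first row has no earlier data, so its feature is NaN.
print(feat)
```

Without the `shift(1)`, the window at each timestamp would include the current sample, which is exactly the kind of leakage the streaming logic is meant to avoid.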

**Multimodality:** Our framework handles both continuous and event (discrete) data, for example sensor values (continuous) and automated alarms (events). This lets users build solutions that combine information from both data types. Events are also used to evaluate solutions, for example events that signal the start of an anomaly period or a failure.
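
As a rough illustration of combining the two modalities, the sketch below attaches to each sensor reading the most recent event that happened strictly before it. The column names (`timestamp`, `vibration`, `description`) and the sample values are assumptions made for this example; the framework's actual event schema may differ.

```python
import pandas as pd

# Continuous data: one row per sensor reading (hypothetical columns).
sensors = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"]
    ),
    "vibration": [0.1, 0.4, 0.9],
})

# Event data: one row per discrete event (hypothetical columns).
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:30"]),
    "description": ["alarm"],
})

# For each reading, look backward in time for the latest earlier event.
# allow_exact_matches=False keeps the "strictly earlier" rule of the online setting.
merged = pd.merge_asof(
    sensors,
    events,
    on="timestamp",
    direction="backward",
    allow_exact_matches=False,
)
print(merged)
```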

## Implementing a Semi-Supervised Method:


In our framework, semi-supervised methods are used by the semi-supervised flavors (online, incremental, and semi-supervised with historical data). All of these flavors expect methods that inherit from the `SemiSupervisedMethodInterface` class (`method.semi_supervised_method.SemiSupervisedMethodInterface`), which in turn inherits from `method.method.MethodInterface` (the interface every method must respect to be compatible with our framework). For simplicity we follow the fit-predict scheme: `fit` is called on a sample of data and builds a model, which `predict` then uses to generate anomaly scores for new data.

Implementation of a semi-supervised model:
* Each method should accept an `event_preferences: EventPreferences` parameter in its constructor, because the framework supports multimodal data (both continuous and discrete). This parameter essentially lets users pass metadata to a method, for example when a method should perform a particular action only after certain events (it can be ignored otherwise).
* The methods of `method.method.MethodInterface` should be implemented: they are essential for logging parameters and models (using MLflow).
* Implementation of the `fit(historic_data: list[pd.DataFrame], historic_sources: list[str], event_data: pd.DataFrame)` method: this method implements the logic of fitting or training a model on relatively normal data before the method starts producing anomaly scores for new data. Note that data from multiple sources is passed (i.e. `historic_data` is a list of DataFrames). You can fit a single model for all sources (by combining their data) or handle each source separately.
* Implementation of the `predict(self, target_data: pd.DataFrame, source: str, event_data: pd.DataFrame)` method: this method accepts a single DataFrame (along with the source it originates from) and event data. The return value must always be a sequence of anomaly scores (a greater value means more anomalous data) with length equal to `target_data.shape[0]`.
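
The choice between one shared model and one model per source, and the shape of the returned scores, can be sketched outside the framework with plain pandas. The "model" here is just a per-column mean and standard deviation, and the names `fit_profiles` and `score` are hypothetical helpers for illustration, not framework API.

```python
import pandas as pd


def fit_profiles(historic_data, historic_sources, shared=False):
    """Return {source: (mean, std)} reference statistics (hypothetical helper)."""
    if shared:
        # One model for all sources: pool the data before computing statistics.
        combined = pd.concat(historic_data, ignore_index=True)
        stats = (combined.mean(), combined.std(ddof=0))
        return {source: stats for source in historic_sources}
    # One model per source: each source keeps its own statistics.
    return {
        source: (df.mean(), df.std(ddof=0))
        for df, source in zip(historic_data, historic_sources)
    }


def score(target_data, stats):
    """One score per row; a larger value means the row is further from the profile."""
    mean, std = stats
    # Avoid division by zero for constant columns.
    z = (target_data - mean) / std.replace(0, 1.0)
    return z.abs().max(axis=1).tolist()


a = pd.DataFrame({"x": [0.0, 2.0]})
b = pd.DataFrame({"x": [10.0, 12.0]})
per_source = fit_profiles([a, b], ["a", "b"])
shared = fit_profiles([a, b], ["a", "b"], shared=True)

target = pd.DataFrame({"x": [1.0, 5.0]})
scores = score(target, per_source["a"])
print(scores)
```

Whichever variant is used, the returned list has one entry per row of the target data, which is the contract `predict` is expected to respect.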


Let's implement our own Semi-Supervised method for the Framework. 

**KNN method**: This method scores each new sample by its distance to its k-th nearest neighbor in a reference dataset.
* \__init__: takes the parameter k.
* fit: stores the data in a KDTree index so that nearest-neighbor lookups are faster.
* predict: for each sample, returns the distance to its k-th nearest neighbor in the reference set.
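
Before turning to the KDTree-based implementation, the scoring rule itself can be checked with a brute-force NumPy sketch. The function name `kth_neighbor_distance` and the toy arrays are assumptions for illustration. A KDTree index should give the same distances, only faster on large reference sets.

```python
import numpy as np


def kth_neighbor_distance(reference: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    """Euclidean distance from each query row to its k-th nearest reference row."""
    # Pairwise differences with shape (n_queries, n_reference, n_features).
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Sort each row and pick the k-th smallest distance (k is 1-based).
    return np.sort(dists, axis=1)[:, k - 1]


reference = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
queries = np.array([[0.0, 0.0], [3.0, 4.0]])

scores = kth_neighbor_distance(reference, queries, k=1)
# The first query coincides with a reference point, so its score is 0;
# the second lies far from every reference point and scores higher.
print(scores)
```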


In [5]:
from method.semi_supervised_method import SemiSupervisedMethodInterface 
from pdm_evaluation_types.types import EventPreferences  # used to integrate event data; can be ignored for now
from exceptions.exception import NotFitForSourceException  # raised when predict is called for a source that was never fitted
import pandas as pd
from sklearn.neighbors import KDTree
class my_method_knn(SemiSupervisedMethodInterface):
    def __init__(self, event_preferences: EventPreferences,k=40, *args, **kwargs):
        super().__init__(event_preferences=event_preferences) 
        self.k = k
        # Use dictionaries to keep indexers of different sources
        self.index_per_source={}
    def fit(self, historic_data: list[pd.DataFrame], historic_sources: list[str], event_data: pd.DataFrame) -> None:
        """
        Use the data of fit as reference data for each source.
        """
        for df,source in zip(historic_data,historic_sources):
            self.index_per_source[source]= KDTree(df.values, leaf_size=2) 

    def predict(self, target_data: pd.DataFrame, source: str, event_data: pd.DataFrame) -> list[float]:
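        if source in self.index_per_source:
            # dist has shape (n_samples, k), sorted by increasing distance
            dist, ind = self.index_per_source[source].query(target_data.values, k=self.k)
            # The last column is the distance to the k-th nearest neighbor
            scores = [d[-1] for d in dist]
            return scores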
        else:
            raise NotFitForSourceException()

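    def predict_one(self, new_sample: pd.Series, source: str, is_event: bool) -> float:
        # Only non-event samples are scored
        if not is_event:
            if source in self.index_per_source:
                # KDTree.query expects a 2D array, so reshape the single sample to one row
                dist, ind = self.index_per_source[source].query(new_sample.values.reshape(1, -1), k=self.k)
                # dist has shape (1, k); the last column is the k-th nearest neighbor
                return float(dist[0, -1])
            else:
                raise NotFitForSourceException()
        return None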

    def get_library(self) -> str:
        return 'no_save'

    def __str__(self) -> str:
        return 'K nearest neighbor'

    def get_params(self) -> dict:
        return {
            'k': self.k,
        }

    def get_all_models(self):
        pass

Now run the same code as in the getting-started example, this time with our own method and parameters.

In [6]:
from experiment.batch.auto_profile_semi_supervised_experiment import AutoProfileSemiSupervisedPdMExperiment
from constraint_functions.constraint import auto_profile_max_wait_time_constraint
from pipeline.pipeline import PdMPipeline
from utils import loadDataset

dataset = loadDataset.get_dataset("ims")

from preprocessing.record_level.default import DefaultPreProcessor
from postprocessing.default import DefaultPostProcessor
from thresholding.constant import ConstantThresholder

my_pipeline = PdMPipeline(
    steps={
        'preprocessor': DefaultPreProcessor,
        'method': my_method_knn,
        'postprocessor': DefaultPostProcessor,
        'thresholder': ConstantThresholder,
    },
    dataset=dataset,
    auc_resolution=100
)

param_space = {
    'method_k': [1,2,3,5,8,10,15,20,27,40,50],
}

param_space['profile_size'] = [60, 100, 150, 200]


my_experiment = AutoProfileSemiSupervisedPdMExperiment(
    experiment_name='my first experiment',
    pipeline=my_pipeline,
    param_space=param_space,
    num_iteration=4,
    n_jobs=4,
    initial_random=4,
    constraint_function=auto_profile_max_wait_time_constraint(my_pipeline),
    debug=True,
    optimization_param='VUS_AUC_PR'
)

best_params = my_experiment.execute()
print(best_params)

Best score: 0.28324032008941913: 100%|██████████| 4/4 [00:56<00:00, 14.17s/it]

{'k': 1, 'profile_size': 200}



