Fix the evaluation strategy #57
Merged
sarusso merged 18 commits into ICSC-Spoke3:develop on Dec 9, 2025
Conversation
The names "series1" and "series2" have been changed to "series_1" and "series_2" inside the test functions, to distinguish them from the series generated in the `setUp()` method
… "_calculate_model_scores()"
The new evaluation strategy has been used
The new version of the function no longer has this argument
About the evaluation on a single series: before, calculating the 'anomalies_ratio' value caused a zero division for non-anomalous series. Now, the 'anomalies_ratio' value is `None` for non-anomalous series. About the evaluation on a dataset: the 'anomalies_ratio' value is now calculated by averaging only over the anomalous series
When the series is not anomalous and when the dataset does not contain anomalous series
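The guard described above can be sketched as follows (the function name and arguments are illustrative, not the project's actual API):

```python
# Sketch of the fixed anomalies_ratio behavior described above.
# The helper name and signature are illustrative, not the project's API.

def anomalies_ratio(detected, inserted):
    """Ratio of correctly detected anomalies to inserted anomalies.

    Previously this was a zero division for non-anomalous series;
    now it returns None when there are no inserted anomalies.
    """
    if inserted == 0:
        return None  # non-anomalous series: no meaningful ratio
    return detected / inserted
```

With this guard, the dataset-level average can simply skip the `None` entries produced by non-anomalous series.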
This argument sets the evaluation strategy. Supported values are "flags" (default) and "events" (not yet implemented)
Collaborator
Just tried, it seems to work great! Just one thing: I investigated common terminology, and we should really change some naming in order to adopt more common nomenclature. The main reason behind this is that it is quite unclear what … Then, when finishing implementing the breakdown: … Also, can I make a few edits/fixes on the text of the PR?
Collaborator
Author
Yes!
Collaborator
Can you please change the branch name from … ?
Collaborator
Author
done, but the PR has been closed
Collaborator
Right, my bad, sorry. Let's stick with …
Collaborator
Author
ok
The new parameter `strategy` sets the evaluation strategy. Supported values are:

- `strategy = 'flags'`: detected anomalies are counted by matching the anomaly labels in the series against the anomaly flags of the model, one by one. This is the default evaluation strategy.
- `strategy = 'events'`: not yet implemented; will group anomaly flags into events.
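The one-by-one flag matching behind the 'flags' strategy can be sketched roughly as follows (a simplified stand-in, not the project's implementation):

```python
# Illustrative sketch of the 'flags' evaluation strategy: anomaly labels
# in the series are matched one by one against the model's anomaly flags.

def count_flag_matches(labels, flags):
    """Count detected anomalies and false positives, point by point.

    labels: ground-truth anomaly labels (1 = anomalous, 0 = normal)
    flags:  anomaly flags produced by the model, same length as labels
    """
    detected = sum(1 for label, flag in zip(labels, flags)
                   if label == 1 and flag == 1)
    false_positives = sum(1 for label, flag in zip(labels, flags)
                          if label == 0 and flag == 1)
    return detected, false_positives
```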
The parameter `granularity` sets the "finesse" of the evaluation and affects how detected anomalies, inserted anomalies and false positives are counted. The evaluation logic has been fixed and improved. The available granularities are:

- `granularity = 'variable'`: anomalies and false positives are counted for each variable of the series
- `granularity = 'point'`: anomalies and false positives are counted for each timestamp
- `granularity = 'series'`: anomalies and false positives are counted for each series, so a series can be either anomalous or not

In the module "evaluators.py" there are three private functions aimed at the evaluation of the model on a single series: `_variable_granularity_evaluation()`, `_point_granularity_evaluation()`, `_series_granularity_evaluation()`. As the names suggest, each function evaluates with a specific granularity.

Given:
- $N_d$ = number of correctly detected anomalies
- $N_{tot}$ = number of inserted anomalies
- $K$ = normalization factor
- $N_{FP}$ = number of false positives
- $N_p$ = series length
- $n$ = series dimension
The output of the above functions is a dictionary with four key-value pairs; the computed values depend on the chosen granularity ('variable', 'point' or 'series'). If there are no anomalies, the value is `None`.

The evaluation of the model on the dataset is assembled in the private function `_calculate_model_scores()`. This function takes as input the evaluations on the single series (each done with a fixed granularity) and returns a dictionary with the structure already discussed: what changes are the values, where $N$ is the number of series in the dataset. If there are no anomalies, the value is `None`.
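The averaging rule for 'anomalies_ratio' might look like this (a simplified sketch, not the actual `_calculate_model_scores()` body):

```python
# Sketch of the averaging rule described above: anomalies_ratio is
# averaged only over anomalous series (ratio is not None), and is None
# when the dataset contains no anomalous series at all.

def average_anomalies_ratio(per_series_ratios):
    ratios = [ratio for ratio in per_series_ratios if ratio is not None]
    if not ratios:
        return None  # no anomalous series in the dataset
    return sum(ratios) / len(ratios)
```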
In the following, an example of evaluation using the `MinMaxAnomalyDetector()` and series generated with `generate_timeseries_df()`:

- series1
- series2
- series3
The columns 'value_1_anomaly' and 'value_2_anomaly' are the model's output: 1 means anomalous, 0 means not anomalous.
Here is the result of the evaluation on this dataset, using the three different granularities:
Notes:
- Some commits are about generating series inside the `setUp()` method in "test_evaluators.py", useful for testing the evaluation now and in the future.
- For now, the evaluation is based on the following assumption: if the timestamp $t$ is marked as anomalous, the anomaly is present in all variables making up the series.
- This PR partially addresses #56: the evaluation does not yet distinguish between different types of anomalies. This will be done in the future by adding the `breakdown` switch.
- The `evaluate()` function takes as input a dictionary `{'model_name': model_instance}`, so it can evaluate more than one model on the same dataset. The output for multiple models is a nested dictionary: its primary keys are the model names and the associated values are dictionaries like the ones shown above.
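The nested output structure for multiple models can be sketched as follows (both functions here are hypothetical stand-ins; the real `evaluate()` lives in `evaluators.py` and computes the score dictionaries discussed above):

```python
# Sketch of evaluating several models on the same dataset. The per-model
# evaluation below is a trivial stand-in for the real score computation.

def evaluate_one_model(model, dataset, granularity):
    # Stand-in: a real implementation would count detected anomalies,
    # false positives, etc. according to the chosen granularity.
    return {'granularity': granularity, 'score': model(dataset)}

def evaluate(models, dataset, granularity='point'):
    """Return a nested dict: {model_name: per-model scores dict}."""
    return {name: evaluate_one_model(model, dataset, granularity)
            for name, model in models.items()}
```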