7 changes: 7 additions & 0 deletions sphinx/source/api/api_code.rst
@@ -156,6 +156,13 @@ ParamInterface
:project: api
:path: ../../../include/chimbuko/param.hpp

CopodParam
----------

.. doxygenfile:: copod_param.hpp
:project: api
:path: ../../../include/chimbuko/param/copod_param.hpp

HbosParam
---------

4 changes: 2 additions & 2 deletions sphinx/source/appendix/appendix_usage.rst
@@ -25,7 +25,7 @@ Options for the provenance database:
Options for the parameter server:

- **ad_win_size** : Number of events around an anomaly to store; provDB entry size is proportional to this
- **ad_alg** : AD algorithm to use. "sstd" or "hbos"
- **ad_alg** : AD algorithm to use: "sstd", "hbos", or "copod"
- **ad_outlier_sstd_sigma** : number of standard deviations that defines an outlier.
- **ad_outlier_hbos_threshold** : The percentile threshold; events outside this percentile are considered anomalies by the HBOS algorithm.
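
As a sketch, the options above can be collected into a single option set; the keys are those documented here, while the container, value types, and example values are assumptions for illustration only:

```python
# Illustrative parameter-server option set. The keys come from the
# documentation above; the dict container and the values are invented.
ps_options = {
    "ad_win_size": 5,                    # events kept around each anomaly
    "ad_alg": "copod",                   # "sstd", "hbos", or "copod"
    "ad_outlier_sstd_sigma": 6.0,        # used by the "sstd" algorithm
    "ad_outlier_hbos_threshold": 0.99,   # percentile used by HBOS
}

assert ps_options["ad_alg"] in ("sstd", "hbos", "copod")
assert 0.0 < ps_options["ad_outlier_hbos_threshold"] <= 1.0
```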

@@ -172,7 +172,7 @@ Additional AD Variables
- **-program_idx** : For workflows with multiple component programs, a "program index" must be supplied to the AD instances attached to those processes.
- **-rank** : By default the data rank assigned to an AD instance is taken from its MPI rank in MPI_COMM_WORLD. This rank is used to verify the incoming trace data. This option allows the user to manually set the rank index.
- **-override_rank** : This option disables the data rank verification and instead overwrites the data rank of the incoming trace data with the data rank stored in the AD instance. The value supplied must be the original data rank (this is used to generate the correct trace filename).
- **-ad_algorithm** : This sets the AD algorithm to use for online analysis: "sstd" or "hbos". Default value is "hbos".
- **-ad_algorithm** : This sets the AD algorithm to use for online analysis: "sstd", "hbos", or "copod". The default value is "hbos".
- **-hbos_threshold** : This sets the threshold controlling the density of anomalies detected by the HBOS algorithm. Its value ranges between 0 and 1. The default value is 0.99.
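
As an illustration of how these variables combine, a command line for an AD instance might be assembled as follows; the executable name `ad_driver` is hypothetical, and only the flags come from the list above:

```python
# Hypothetical assembly of an online-analysis command line. The binary
# name "ad_driver" is illustrative; the flags are documented above.
argv = [
    "ad_driver",
    "-ad_algorithm", "hbos",    # "sstd", "hbos", or "copod"
    "-hbos_threshold", "0.99",  # flag the top 1% of HBOS scores
    "-program_idx", "0",        # component program index in the workflow
]
cmd = " ".join(argv)

assert "-ad_algorithm hbos" in cmd
```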


8 changes: 5 additions & 3 deletions sphinx/source/introduction/ad.rst
@@ -39,14 +39,16 @@ of a function :math:`i`, respectively, and :math:`\alpha` is a control parameter

Advanced anomaly analysis
~~~~~~~~~~~~~~~~~~~~~~~~~
A determistic and non-parametric statistical anomaly detection algorithm called Histogram Based Outilier Scoring (HBOS) is implemented as part of Chimbuko's anomaly analysis module. HBOS is an unsupervised anomaly detection algorithm which scores data in linear time. It supports dynamic bin widths which ensures long-tail distributions of function executions are captured and global anomalies are detected better. HBOS normalizes the histogram and calculates the anomaly scores by taking inverse of estimated densities of function executions. The score is a multiplication of the inverse of the estimated densities given by the following Equation
1. Histogram Based Outlier Score (HBOS) is a deterministic, non-parametric, unsupervised statistical anomaly detection algorithm implemented as part of Chimbuko's anomaly analysis module; it scores data in linear time. It supports dynamic bin widths, which ensures that long-tail distributions of function executions are captured and global anomalies are detected more reliably. HBOS normalizes the histogram and calculates anomaly scores from the inverse of the estimated densities of function executions, as given by the following equation

.. math::
HBOS_{i} = \log_{2} (1 / density_{i})

where :math:`i` is a function execution and :math:`density_{i}` is function execution probability. HBOS works in :math:`O(nlogn)` using dynamic bin-width or in linear time :math:`O(n)` using fixed bin width. After scoring, the top 1% of scores are filtered as anomalous function executions. This filter value can be set at runtime to adjust the density of detected anomalies.
where :math:`i` is a function execution and :math:`density_{i}` is the function execution probability. HBOS runs in :math:`O(n \log n)` time using dynamic bin widths or in linear time :math:`O(n)` using fixed bin widths. After scoring, the top 1% of scores are flagged as anomalous function executions. This filter value can be set at runtime to adjust the density of detected anomalies.
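
The scoring step above can be sketched as follows, assuming fixed-width bins and a histogram normalized to its tallest bin; the bin count and the timing data are invented for illustration:

```python
import math

def hbos_scores(durations, nbins=20):
    """Sketch of HBOS scoring with fixed-width bins (an assumption for
    illustration; dynamic bin widths are also supported)."""
    lo, hi = min(durations), max(durations)
    width = (hi - lo) / nbins or 1.0
    counts = [0] * nbins
    for t in durations:
        counts[min(int((t - lo) / width), nbins - 1)] += 1
    peak = max(counts)
    scores = []
    for t in durations:
        # Estimated density of the sample's bin, normalized to the tallest bin.
        density = counts[min(int((t - lo) / width), nbins - 1)] / peak
        # HBOS_i = log2(1 / density_i): rarer bins get larger scores.
        scores.append(math.log2(1.0 / density))
    return scores

# Executions clustered near 100us plus one extreme outlier.
scores = hbos_scores([100.0, 101.0, 99.0, 100.5, 98.0, 1000.0])
assert scores.index(max(scores)) == 5  # the 1000us execution scores highest
```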

(See `ADOutlier <../api/api_code.html#adoutlier>`__ and `HbosParam <../api/api_code.html#hbosparam>`__).
2. A second algorithm, COPula-based Outlier Detection (COPOD), has been added to Chimbuko's advanced anomaly analysis. COPOD is a deterministic, parameter-free anomaly detection algorithm that computes empirical copulas for each sample in the dataset; a copula describes the dependence structure between random variables. For each sample, COPOD computes a left-tail empirical copula from the left-tail empirical cumulative distribution function (ECDF), a right-tail copula from the right-tail ECDF, and a skewness-corrected empirical copula using a skewness coefficient calculated from the left- and right-tail ECDFs. These three values are interpreted as left-tail, right-tail, and skewness-corrected probabilities, respectively. The lowest of these probabilities yields the largest negative-log value, which is the score assigned to the sample. Samples with the highest scores are tagged as anomalous.
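
The COPOD recipe described above can be sketched for one-dimensional timing data as follows; this is an illustration of the published COPOD construction, not Chimbuko's implementation, and the data values are invented:

```python
import math

def copod_scores(xs):
    """Sketch of 1-D COPOD scoring following the recipe above."""
    n = len(xs)
    # Sample skewness decides which tail the corrected probability uses.
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    skew = m3 / (m2 ** 1.5) if m2 else 0.0
    scores = []
    for x in xs:
        left = sum(1 for s in xs if s <= x) / n   # left-tail ECDF
        right = sum(1 for s in xs if s >= x) / n  # right-tail ECDF
        nl_left, nl_right = -math.log(left), -math.log(right)
        # Skewness-corrected probability: the tail on the skewed side.
        nl_skew = nl_left if skew < 0 else nl_right
        # Lowest tail probability -> largest negative log -> the score.
        scores.append(max(nl_left, nl_right, nl_skew))
    return scores

# A run of short executions with one long outlier.
scores = copod_scores([10.0, 10.0, 11.0, 12.0, 11.0, 500.0])
assert scores.index(max(scores)) == 5  # the 500us execution scores highest
```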

(See `ADOutlier <../api/api_code.html#adoutlier>`__, `HbosParam <../api/api_code.html#hbosparam>`__ and `CopodParam <../api/api_code.html#copodparam>`__).

Provenance data collection
--------------------------
13 changes: 10 additions & 3 deletions sphinx/source/introduction/ps.rst
@@ -15,7 +15,7 @@ Design
:scale: 50 %
:alt: Simple parameter server architecture

Parameter server architecture
Parameter server architecture

(**C**)lients (i.e. on-node AD modules) send requests with their locally-computed anomaly detection algorithm parameters; these are aggregated with the global parameters, and the updated parameters are returned to the client. Network communication is performed using the `ZeroMQ <https://zeromq.org>`_ library, with `Cereal <https://uscilab.github.io/cereal/>`_ used for data serialization.

@@ -24,11 +24,18 @@ via the **Backend** router in round-robin fashion. For the task of updating para

A dedicated (**S**)treaming thread (cf. :ref:`api/api_code:PSstatSender`) is maintained that periodically sends the latest global statistics to the visualization server.

Anomaly ranking metrics
-----------------------

Two metrics are developed and assigned to each outlier to allow the user to focus on the subset of anomalies that are most important:
the anomaly score reflects how statistically unlikely an anomaly is, and the anomaly severity reflects how much the anomaly contributes to the runtime of the application.
The PS includes these values in the provenance information, allowing for convenient sorting and filtering
of the anomalies in post-analysis. The online visualization module can present either of these
metrics individually.
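
A minimal sketch of how these two metrics support sorting and filtering in post-analysis; the record layout, function names, and values are invented for illustration:

```python
# Hypothetical provenance records carrying the two ranking metrics.
anomalies = [
    {"fname": "MPI_Allreduce",    "score": 12.4, "severity": 3500.0},
    {"fname": "compute_step",     "score": 4.1,  "severity": 90000.0},
    {"fname": "write_checkpoint", "score": 9.8,  "severity": 120.0},
]

# Most statistically unlikely anomalies first.
by_score = sorted(anomalies, key=lambda a: a["score"], reverse=True)
# Anomalies with the largest runtime impact first.
by_severity = sorted(anomalies, key=lambda a: a["severity"], reverse=True)

assert by_score[0]["fname"] == "MPI_Allreduce"
assert by_severity[0]["fname"] == "compute_step"
```

The two orderings differ: a rare anomaly need not be the one that costs the most runtime, which is why both metrics are stored.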

..
Testing has demonstrated that this simple parameter server becomes a bottleneck as the number of requests (or clients) increases.
In the following subsection, we will describe the scalable parameter server.
Scalable Parameter Server
-------------------------
TBD

32 changes: 31 additions & 1 deletion sphinx/source/io_schema/pserver_schema.rst
@@ -51,7 +51,7 @@ The schema for the **'anomaly_stats'** object is as follows:
| [
| {
| **'data'**: *Number of anomalies and anomaly time window for process/rank broken down by io step (array)*
| [
| [
| {
| **'app'**: *Program index*,
| **'max_timestamp'**: *Latest time of anomaly in io step*,
@@ -89,6 +89,36 @@ The schema for the **'anomaly_stats'** object is as follows:
| },
| ...
| ], *end of* **anomaly** *array*
| **'anomaly_metrics'**:
| [
| {
| **'app'**: *Application*,
| **'rank'**: *Program rank*,
| **'fid'**: *function ID*,
| **'fname'**: *function name*,
| **'_id'**: *a global index to track each (app, rank, func), for internal use*,
| **'new_data'**: *Statistics of anomaly metrics aggregated over multiple IO steps since the last pserver->viz send*
| {
| **'first_io_step'**: *first io step in sum*,
| **'last_io_step'**: *last io step in sum*,
| **'max_timestamp'**: *max timestamp of last IO step of this period*,
| **'min_timestamp'**: *min timestamp of first IO step of this period*,
| **'severity'**: *RunStats assigned severity*,
| **'score'**: *RunStats assigned score*,
| **'count'**: *RunStats count*
| },
| **'all_data'**: *Statistics of anomaly metrics aggregated since the beginning of the run*
| {
| **'first_io_step'**: *first io step in sum*,
| **'last_io_step'**: *last io step in sum*,
| **'max_timestamp'**: *max timestamp of last IO step since start of run*,
| **'min_timestamp'**: *min timestamp of first IO step since start of run*,
| **'severity'**: *RunStats assigned severity*,
| **'score'**: *RunStats score*,
| **'count'**: *RunStats count*
| }
| }
| ], *end of* **anomaly_metrics**
| **'func'**: *Statistics on anomalies broken down by function, collected over entire run to-date (array)*
| [
| {
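
A sketch of a single **'anomaly_metrics'** entry conforming to the schema added above; all field values, and the integer type chosen for **'_id'**, are invented for illustration:

```python
# Illustrative 'anomaly_metrics' entry; field names follow the schema,
# values are invented.
entry = {
    "app": 0,
    "rank": 3,
    "fid": 42,
    "fname": "compute_step",
    "_id": 7,  # global (app, rank, func) index, internal use
    "new_data": {   # aggregated since the last pserver->viz send
        "first_io_step": 10, "last_io_step": 12,
        "max_timestamp": 1650000000123, "min_timestamp": 1650000000001,
        "severity": 1234.5, "score": 7.9, "count": 6,
    },
    "all_data": {   # aggregated since the beginning of the run
        "first_io_step": 0, "last_io_step": 12,
        "max_timestamp": 1650000000123, "min_timestamp": 1649999990000,
        "severity": 9876.5, "score": 15.2, "count": 57,
    },
}

# Both aggregation windows expose the same statistic fields, and the
# whole-run count can never be smaller than the recent-window count.
assert set(entry["new_data"]) == set(entry["all_data"])
assert entry["new_data"]["count"] <= entry["all_data"]["count"]
```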