Rename Classifer for Drift Detection to Domain Classifer (#368)
* fix multiv-why placement
* renaming CDD to DC
* small fixes
* update DC plot title
nikml committed Feb 27, 2024
1 parent c99b37b commit be87811
Showing 16 changed files with 195 additions and 188 deletions.
Original file line number Diff line number Diff line change
@@ -195,7 +195,7 @@
"outputs": [],
"source": [
"# Let's compute multivariate drift\n",
"drift_classifier = nml.DriftDetectionClassifierCalculator(\n",
"drift_classifier = nml.DomainClassifierCalculator(\n",
" feature_column_names=feature_column_names,\n",
" timestamp_column_name='ordered',\n",
" chunk_size=DPP\n",

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions docs/glossary.rst
Original file line number Diff line number Diff line change
@@ -110,6 +110,12 @@ Glossary

You can read more about Data Periods in the :ref:`relevant data requirements section<data-drift-periods>`.

Domain Classifier
A domain classifier is a machine learning classification model trained to identify whether a given data point
belongs to one or another dataset. NannyML uses domain classifiers as a multivariate drift detection method.
You can read more about them in :ref:`How it works: Domain Classifier<how-multiv-drift-dc>` and see how to use
them in :ref:`Tutorial: Domain Classifier<multivariate_drift_detection_dc>`.

Error
The error of a statistic on a sample is defined as the difference between the value of the observation and the true value.
The sample size can sometimes be 1 but it is usually bigger. When the error consists only of the effects
31 changes: 14 additions & 17 deletions docs/how_it_works/multivariate_drift.rst
Original file line number Diff line number Diff line change
@@ -149,18 +149,17 @@ For more information on using Reconstruction Error with PCA check
the :ref:`Multivariate Drift - Data Reconstruction with PCA<multivariate_drift_detection_pca>`
tutorial.

.. _how-multiv-drift-cdd:
.. _how-multiv-drift-dc:

Classifier for Drift Detection
------------------------------
Domain Classifier
-----------------

Classifier for drift detection provides a measure of how easy it is to discriminate
the reference data from the examined chunk data. It is an implementation of domain classifiers, as
they are called in `relevant literature`_, using a LightGBM classifier.
A :term:`Domain Classifier` allows us to create a measure of how easy it is to discriminate
the reference data from the examined chunk data. NannyML uses a LightGBM classifier.
As a measure of discrimination performance NannyML uses the cross-validated AUROC score.
Similar to data reconstruction with PCA this method is also able to capture complex changes in our data.

The algorithm implementing Classifier for Drift Detection follows the steps described below.
The algorithm implementing Domain Classifier follows the steps described below.
Please note that the process described below is repeated for each :term:`Data Chunk`.
The process consists of two basic parts, data preprocessing and classifier cross validation.

@@ -187,10 +186,10 @@ The higher the AUROC score the easier it is to distinguish the datasets, hence t
more different they are.
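
The idea described in this section can be sketched in a few lines. This is a minimal, illustrative sketch, not NannyML's implementation: to stay dependency-free it swaps the LightGBM classifier for a simple nearest-centroid scorer, and the helper names (``auroc``, ``domain_classifier_auroc``) are hypothetical. Reference rows are labeled 0 and chunk rows 1, every row is scored by a model that did not see it during fitting, and the AUROC of the pooled out-of-fold scores is reported.

```python
import random


def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def domain_classifier_auroc(reference, chunk, folds=5, seed=0):
    """Cross-validated AUROC of a stand-in classifier separating two datasets."""
    rows = reference + chunk
    labels = [0] * len(reference) + [1] * len(chunk)
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    scores = [0.0] * len(rows)
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        c_ref = centroid([rows[i] for i in train if labels[i] == 0])
        c_chunk = centroid([rows[i] for i in train if labels[i] == 1])
        for i in test:
            # Higher score = the row looks more like the chunk than the reference.
            scores[i] = sq_dist(rows[i], c_ref) - sq_dist(rows[i], c_chunk)
    return auroc(labels, scores)


# Synthetic data: one chunk from the reference distribution, one with a shifted mean.
rng = random.Random(42)
reference = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
same = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
shifted = [[rng.gauss(3, 1), rng.gauss(0, 1)] for _ in range(500)]

auc_same = domain_classifier_auroc(reference, same)
auc_shifted = domain_classifier_auroc(reference, shifted)
print(f"no drift: {auc_same:.2f}, shifted: {auc_shifted:.2f}")
```

With these synthetic inputs the undrifted chunk should score close to 0.5 and the shifted chunk close to 1. A nearest-centroid scorer only picks up shifts in the mean, which is one reason a more flexible model such as LightGBM is the better choice in practice.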


Understanding Classifier for Drift Detection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Understanding Domain Classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Classifier for Drift Detection method relines on a machine learning
The Domain Classifier method relies on a machine learning
algorithm to distinguish between the reference and the chunk data.
We are using a LightGBM Classifier. Because of the versatility
of this approach the classifier is quite sensitive to shifts in the data.
@@ -199,10 +198,10 @@ directly translate classifier AUROC values to possible performance impact.
It is better to rely on :ref:`performance estimation<performance-estimation>`
methods for that.

Classifier for Drift Detection on the butterfly dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Domain Classifier on the butterfly dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now that we have a better understanding of Classifier for Drift Detection, let's see
Now that we have a better understanding of Domain Classifier, let's see
how it performs on the butterfly dataset.

.. nbimport::
@@ -214,8 +213,6 @@ how it performs on the butterfly dataset.
The change in the butterfly dataset is now clearly visible through the change in the
classifier's AUROC, while our earlier univariate approach detected no change.

For more information on using Classifier for Drift Detection check
the :ref:`Multivariate Drift - Classifier for Drift Detection<multivariate_drift_detection_cdd>`
For more information on using Domain Classifier check
the :ref:`Multivariate Drift - Domain Classifier<multivariate_drift_detection_dc>`
tutorial.

.. _`relevant literature`: https://arxiv.org/abs/1810.11953
Original file line number Diff line number Diff line change
@@ -4,10 +4,13 @@
Multivariate Drift Detection
============================

Multivariate data drift detection complements :ref:`univariate data drift detection methods<univariate_drift_detection>`.
It provides one summary number, reducing the risk of false alerts, and detects more subtle changes
in the data structure that cannot be detected with univariate approaches. The trade-off is that
multivariate drift results are less explainable compared to univariate drift results.

.. toctree::
:maxdepth: 2

multivariate_drift_detection/multiv_why
multivariate_drift_detection/pca
multivariate_drift_detection/cdd
multivariate_drift_detection/dc
Original file line number Diff line number Diff line change
@@ -1,20 +1,20 @@
.. _multivariate_drift_detection_cdd:
.. _multivariate_drift_detection_dc:

==============================
Classifier for Drift Detection
==============================
=================
Domain Classifier
=================

The second multivariate drift detection method of NannyML is Classifier for Drift Detection.
The second multivariate drift detection method of NannyML is Domain Classifier.
It provides a measure of how easy it is to discriminate the reference data from the examined chunk data.
You can read more about on the :ref:`How it works: Classifier for Drift Detection<how-multiv-drift-cdd>` section.
You can read more about it in the :ref:`How it works: Domain Classifier<how-multiv-drift-dc>` section.
When there is no data drift the datasets can't be discerned and we get a value of 0.5.
The more drift there is, the higher the returned measure will be, up to a value of 1.
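
How these values turn into alerts depends on the threshold. As a rough illustration (not NannyML code; ``is_alert`` is a hypothetical helper and the AUROC values are made up), the sketch below applies the comparison the calculator performs in this commit, using the bounds of its default ``ConstantThreshold(lower=0.45, upper=0.65)``. A chunk is flagged when its AUROC falls outside that band.

```python
def is_alert(auroc_value, lower=0.45, upper=0.65):
    """Flag a chunk whose AUROC lies strictly outside the [lower, upper] band."""
    return auroc_value > upper or auroc_value < lower


# Made-up per-chunk AUROC values, for illustration only.
chunk_aurocs = [0.50, 0.52, 0.64, 0.71, 0.93]
alerts = [is_alert(value) for value in chunk_aurocs]
print(alerts)
```

Values near 0.5 stay inside the band, while chunks that the classifier separates easily from the reference data exceed the upper bound and raise an alert.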

Just The Code
-------------

.. nbimport::
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
:cells: 1 3 4 6 8

.. admonition:: **Advanced configuration**
@@ -43,14 +43,14 @@ Let's start by loading some synthetic data provided by the NannyML package set i
This synthetic data is for a binary classification model, but multi-class classification can be handled in the same way.

.. nbimport::
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
:cells: 1

.. nbtable::
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
:cell: 2

The :class:`~nannyml.drift.multivariate.classifier_for_drift_detection.calculator.DriftDetectionClassifierCalculator`
The :class:`~nannyml.drift.multivariate.domain_classifier.calculator.DomainClassifierCalculator`
module implements this functionality. We need to instantiate it with appropriate parameters:

- **feature_column_names:** A list with the column names of the features we want to run drift detection on.
@@ -67,7 +67,7 @@ module implements this functionality. We need to instantiate it with appropriate
order to create chunks.
- **chunker (Optional):** A NannyML :class:`~nannyml.chunk.Chunker` object that will handle the aggregation
of the provided data in order to create chunks.
- **cv_folds_num (Optional):** Number of cross-validation folds to use when calculating CDD discrimination value.
- **cv_folds_num (Optional):** Number of cross-validation folds to use when calculating DC discrimination value.
- **hyperparameters (Optional):** A dictionary used to provide your own custom hyperparameters when training the
discrimination model. Check out the available hyperparameter options in the `LightGBM docs`_.
- **tune_hyperparameters (Optional):** A boolean controlling whether hypertuning should be performed on the internal
@@ -84,29 +84,29 @@ which the results will be based on. Then the
calculate the multivariate drift results on the provided data.

.. nbimport::
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
:cells: 3

We can see these results of the data provided to the
:meth:`~nannyml.base.AbstractCalculator.calculate`
method as a dataframe.

.. nbimport::
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
:cells: 4

.. nbtable::
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
:cell: 5

The drift results from the reference data are accessible from the properties of the results object:

.. nbimport::
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
:cells: 6

.. nbtable::
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
:cell: 7


@@ -119,7 +119,7 @@ NannyML can also visualize the multivariate drift results in a plot. Our plot co
A red, diamond-shaped point marker additionally indicates this in the middle of the chunk.

.. nbimport::
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
:path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
:cells: 8

.. image:: /_static/tutorials/detecting_data_drift/multivariate_drift_detection/classifier-for-drift-detection.svg

This file was deleted.

2 changes: 1 addition & 1 deletion nannyml/__init__.py
Original file line number Diff line number Diff line change
@@ -55,7 +55,7 @@
AlertCountRanker,
CorrelationRanker,
DataReconstructionDriftCalculator,
DriftDetectionClassifierCalculator,
DomainClassifierCalculator,
UnivariateDriftCalculator,
)
from .exceptions import ChunkerException, InvalidArgumentsException, MissingMetadataException
8 changes: 4 additions & 4 deletions nannyml/drift/__init__.py
Original file line number Diff line number Diff line change
@@ -19,11 +19,11 @@
The multivariate drift detection methods include:
- Data reconstruction error: detects drift by performing dimensionality reduction on the model
inputs and then applying the inverse transformation on the latent (reduced) space.
inputs using PCA and then applying the inverse transformation on the latent (reduced) space.
- Domain Classifier: detects drift by looking at how well a domain classifier performs at distinguishing
between the reference and the chunk datasets.
"""
from .multivariate.classifier_for_drift_detection import DriftDetectionClassifierCalculator
from .multivariate.domain_classifier import DomainClassifierCalculator
from .multivariate.data_reconstruction import DataReconstructionDriftCalculator
from .ranker import AlertCountRanker, CorrelationRanker
from .univariate import FeatureType, Method, MethodFactory, UnivariateDriftCalculator
Original file line number Diff line number Diff line change
@@ -25,5 +25,5 @@
"""

from .calculator import DriftDetectionClassifierCalculator
from .calculator import DomainClassifierCalculator
from .result import Result
Original file line number Diff line number Diff line change
@@ -28,7 +28,7 @@

from nannyml.base import AbstractCalculator, _list_missing, _split_features_by_type
from nannyml.chunk import Chunker
from nannyml.drift.multivariate.classifier_for_drift_detection.result import Result
from nannyml.drift.multivariate.domain_classifier.result import Result
from nannyml.exceptions import InvalidArgumentsException

# from nannyml.sampling_error import SAMPLING_ERROR_RANGE
@@ -71,8 +71,8 @@
}


class DriftDetectionClassifierCalculator(AbstractCalculator):
"""DriftDetectionClassifierCalculator implementation.
class DomainClassifierCalculator(AbstractCalculator):
"""DomainClassifierCalculator implementation.
Uses the Domain Classifier's cross-validated performance as a measure of drift.
"""
@@ -92,7 +92,7 @@ def __init__(
hyperparameter_tuning_config: Optional[Dict[str, Any]] = DEFAULT_LGBM_HYPERPARAM_TUNING_CONFIG,
threshold: Threshold = ConstantThreshold(lower=0.45, upper=0.65),
):
"""Create a new DriftDetectionClassifierCalculator instance.
"""Create a new DomainClassifierCalculator instance.
Parameters:
-----------
@@ -116,7 +116,7 @@ def __init__(
chunker : Chunker, default=None
The `Chunker` used to split the data sets into a list of chunks.
cv_folds_num: Optional[int]
Number of cross-validation folds to use when calculating CDD discrimination value.
Number of cross-validation folds to use when calculating DC discrimination value.
hyperparameters : Dict[str, Any], default = None
A dictionary used to provide your own custom hyperparameters when training the discrimination model.
Check out the available hyperparameter options in the
@@ -159,7 +159,7 @@ def __init__(
... col for col in reference_df.columns
... if col not in non_feature_columns
>>> ]
>>> calc = nml.DriftDetectionClassifierCalculator(
>>> calc = nml.DomainClassifierCalculator(
... feature_column_names=feature_column_names,
... timestamp_column_name='timestamp',
... chunk_size=5000
@@ -169,7 +169,7 @@ def __init__(
>>> figure = results.plot()
>>> figure.show()
"""
super(DriftDetectionClassifierCalculator, self).__init__(
super(DomainClassifierCalculator, self).__init__(
chunk_size, chunk_number, chunk_period, chunker, timestamp_column_name
)
if isinstance(feature_column_names, str):
@@ -201,9 +201,9 @@ def __init__(
# self._sampling_error_components: Tuple = ()
self.result: Optional[Result] = None

@log_usage(UsageEvent.CDD_CALC_FIT)
@log_usage(UsageEvent.DC_CALC_FIT)
def _fit(self, reference_data: pd.DataFrame, *args, **kwargs):
"""Fits the CDD calculator to a set of reference data."""
"""Fits the DC calculator to a set of reference data."""
if reference_data.empty:
raise InvalidArgumentsException('data contains no rows. Please provide a valid data set.')

@@ -232,9 +232,9 @@ def _fit(self, reference_data: pd.DataFrame, *args, **kwargs):

return self

@log_usage(UsageEvent.CDD_CALC_RUN)
@log_usage(UsageEvent.DC_CALC_RUN)
def _calculate(self, data: pd.DataFrame, *args, **kwargs) -> Result:
"""Calculate the data CDD calculator metric for a given data set."""
"""Calculate the DC calculator metric for a given data set."""
if data.empty:
raise InvalidArgumentsException('data contains no rows. Please provide a valid data set.')

@@ -330,20 +330,20 @@ def _calculate_chunk(self, data: pd.DataFrame):
def _set_metric_thresholds(self, result_data: pd.DataFrame):
self.lower_threshold_value, self.upper_threshold_value = calculate_threshold_values(
threshold=self.threshold,
data=result_data.loc[:, ('classifier_auroc', 'value')],
data=result_data.loc[:, ('domain_classifier_auroc', 'value')],
lower_threshold_value_limit=self._lower_threshold_value_limit,
upper_threshold_value_limit=self._upper_threshold_value_limit,
logger=self._logger,
)

def _populate_alert_thresholds(self, result_data: pd.DataFrame) -> pd.DataFrame:
result_data[('classifier_auroc', 'upper_threshold')] = self.upper_threshold_value
result_data[('classifier_auroc', 'lower_threshold')] = self.lower_threshold_value
result_data[('classifier_auroc', 'alert')] = result_data.apply(
result_data[('domain_classifier_auroc', 'upper_threshold')] = self.upper_threshold_value
result_data[('domain_classifier_auroc', 'lower_threshold')] = self.lower_threshold_value
result_data[('domain_classifier_auroc', 'alert')] = result_data.apply(
lambda row: True
if (
row[('classifier_auroc', 'value')] > row[('classifier_auroc', 'upper_threshold')]
or row[('classifier_auroc', 'value')] < row[('classifier_auroc', 'lower_threshold')]
row[('domain_classifier_auroc', 'value')] > row[('domain_classifier_auroc', 'upper_threshold')]
or row[('domain_classifier_auroc', 'value')] < row[('domain_classifier_auroc', 'lower_threshold')]
)
else False,
axis=1,
@@ -401,7 +401,7 @@ def _create_multilevel_index(include_thresholds: bool = False):
'alert',
]
chunk_tuples = [('chunk', chunk_column_name) for chunk_column_name in chunk_column_names]
reconstruction_tuples = [('classifier_auroc', column_name) for column_name in results_column_names]
reconstruction_tuples = [('domain_classifier_auroc', column_name) for column_name in results_column_names]

tuples = chunk_tuples + reconstruction_tuples

