Rename Classifier for Drift Detection to Domain Classifier (#368)

* fix multiv-why placement * renaming CDD to DC * small fixes * update DC plot title
NannyML · Feb 27, 2024 · be87811
1 parent c99b37b
commit be87811

Showing 16 changed files with 195 additions and 188 deletions.
diff --git a/docs/_static/how-it-works/butterfly-multivariate-drift-cdd.svg b/docs/_static/how-it-works/butterfly-multivariate-drift-cdd.svg
diff --git a/...ting_data_drift/multivariate_drift_detection/classifier-for-drift-detection.svg b/...ting_data_drift/multivariate_drift_detection/classifier-for-drift-detection.svg
diff --git a/docs/example_notebooks/How It Works - Multivariate Drift.ipynb b/docs/example_notebooks/How It Works - Multivariate Drift.ipynb
@@ -195,7 +195,7 @@
    "outputs": [],
    "source": [
     "# Let's compute multivariate drift\n",
-    "drift_classifier = nml.DriftDetectionClassifierCalculator(\n",
+    "drift_classifier = nml.DomainClassifierCalculator(\n",
     "    feature_column_names=feature_column_names,\n",
     "    timestamp_column_name='ordered',\n",
     "    chunk_size=DPP\n",

diff --git a/...Multivariate - Classifier for Drift.ipynb → ... - Multivariate - Domain Classifier.ipynb b/...Multivariate - Classifier for Drift.ipynb → ... - Multivariate - Domain Classifier.ipynb
diff --git a/docs/glossary.rst b/docs/glossary.rst
@@ -110,6 +110,12 @@ Glossary
 
         You can read more about Data Periods in the :ref:`relevant data requirements section<data-drift-periods>`.
 
+    Domain Classifier
+        A domain classifier is a machine learning classification model trained to identify whether a given data point
+        belongs to one or another dataset. NannyML uses domain classifiers as a multivariate drift detection method.
+        You can read more about them in :ref:`How it works: Domain Classifier<how-multiv-drift-dc>` and see how to use
+        them in :ref:`Tutorial: Domain Classifier<multivariate_drift_detection_dc>`.
+
     Error
         The error of a statistic on a sample is defined as the difference between the value of the observation and the true value.
         The sample size can sometimes be 1 but it is usually bigger. When the error consists only of the effects

diff --git a/docs/how_it_works/multivariate_drift.rst b/docs/how_it_works/multivariate_drift.rst
@@ -149,18 +149,17 @@ For more information on using Reconstruction Error with PCA check
 the :ref:`Multivariate Drift - Data Reconstruction with PCA<multivariate_drift_detection_pca>`
 tutorial.
 
-.. _how-multiv-drift-cdd:
+.. _how-multiv-drift-dc:
 
-Classifier for Drift Detection
-------------------------------
+Domain Classifier
+-----------------
 
-Classifier for drift detection provides a measure of how easy it is to discriminate
-the reference data from the examined chunk data. It is an implementation of domain classifiers, as
-they are called in `relevant literature`_, using a LightGBM classifier.
+A :term:`Domain Classifier` allows us to create a measure of how easy it is to discriminate
+the reference data from the examined chunk data. NannyML uses a LightGBM classifier.
 As a measure of discrimination performance NannyML uses the cross-validated AUROC score.
 Similar to data reconstruction with PCA this method is also able to capture complex changes in our data.
 
-The algorithm implementing Classifier for Drift Detection follows the steps described below.
+The algorithm implementing Domain Classifier follows the steps described below.
 Please note that the process described below is repeated for each :term:`Data Chunk`.
 The process consists of two basic parts, data preprocessing and classifier cross validation.
 
@@ -187,10 +186,10 @@ The higher the AUROC score the easier it is to distinguish the datasets, hence t
 more different they are.
 
 
-Understanding Classifier for Drift Detection
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Understanding Domain Classifier
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The Classifier for Drift Detection method relines on a machine learning
+The Domain Classifier method relies on a machine learning
 algorithm to distinguish between the reference and the chunk data.
 We are using a LightGBM Classifier. Because of the versatility
 of this approach the classifier is quite sensitive to shifts in the data.
@@ -199,10 +198,10 @@ directly translate classifier AUROC values to possible performance impact.
 It is better to rely on :ref:`performance estimation<performance-estimation>`
 methods for that.
 
-Classifier for Drift Detection on the butterfly dataset
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Domain Classifier on the butterfly dataset
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Now that we have a better understanding of Classifier for Drift Detection, let's see
+Now that we have a better understanding of Domain Classifier, let's see
 how it performs on the butterfly dataset.
 
 .. nbimport::
@@ -214,8 +213,6 @@ how it performs on the butterfly dataset.
 The change in the butterfly dataset is now clearly visible through the change in the
 classifier's AUROC, while our earlier univariate approach detected no change.
 
-For more information on using Classifier for Drift Detection check
-the :ref:`Multivariate Drift - Classifier for Drift Detection<multivariate_drift_detection_cdd>`
+For more information on using Domain Classifier check
+the :ref:`Multivariate Drift - Domain Classifier<multivariate_drift_detection_dc>`
 tutorial.
-
-.. _`relevant literature`: https://arxiv.org/abs/1810.11953
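Note on the section above: the domain classifier idea can be illustrated without NannyML or LightGBM. The sketch below is only an illustration under stated assumptions. It swaps in a toy nearest-centroid scorer instead of LightGBM, averages the AUROC over folds, and uses hypothetical names (`domain_classifier_auroc`, `sample`) that are not part of the library.

```python
import random
from statistics import mean


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid(rows):
    """Column-wise mean of a list of feature rows."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def domain_classifier_auroc(reference, chunk, folds=5, seed=0):
    """Cross-validated AUROC of a toy classifier separating reference (0) from chunk (1)."""
    data = [(row, 0) for row in reference] + [(row, 1) for row in chunk]
    random.Random(seed).shuffle(data)
    fold_scores = []
    for k in range(folds):
        test = data[k::folds]
        train = [d for i, d in enumerate(data) if i % folds != k]
        ref_c = centroid([row for row, y in train if y == 0])
        chunk_c = centroid([row for row, y in train if y == 1])
        # Higher score means the row is closer to the chunk centroid than to the reference one.
        scores = [sq_dist(row, ref_c) - sq_dist(row, chunk_c) for row, _ in test]
        fold_scores.append(auroc([y for _, y in test], scores))
    return mean(fold_scores)


rng = random.Random(42)


def sample(n, shift=0.0):
    """Two Gaussian features; `shift` moves the mean of the first one."""
    return [[rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]


reference = sample(400)
auc_same = domain_classifier_auroc(reference, sample(400))              # no drift
auc_drift = domain_classifier_auroc(reference, sample(400, shift=3.0))  # mean shift
print(f"no drift: {auc_same:.2f}, drift: {auc_drift:.2f}")
```

With no drift the two datasets cannot be told apart and the score stays near 0.5; with a strong shift it approaches 1. A centroid-based scorer only reacts to mean shifts, so it would likely miss changes such as the one in the butterfly dataset, which is one reason a more flexible model like LightGBM is used in practice.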
diff --git a/docs/tutorials/detecting_data_drift/multivariate_drift_detection.rst b/docs/tutorials/detecting_data_drift/multivariate_drift_detection.rst
@@ -4,10 +4,13 @@
 Multivariate Drift Detection
 ============================
 
+Multivariate data drift detection complements :ref:`univariate data drift detection methods<univariate_drift_detection>`.
+It provides one summary number, reducing the risk of false alerts, and detects more subtle changes
+in the data structure that cannot be detected with univariate approaches. The trade-off is that
+multivariate drift results are less explainable compared to univariate drift results.
 
 .. toctree::
    :maxdepth: 2
 
-   multivariate_drift_detection/multiv_why
    multivariate_drift_detection/pca
-   multivariate_drift_detection/cdd
+   multivariate_drift_detection/dc
diff --git a/...rift/multivariate_drift_detection/cdd.rst → ...drift/multivariate_drift_detection/dc.rst b/...rift/multivariate_drift_detection/cdd.rst → ...drift/multivariate_drift_detection/dc.rst
@@ -1,20 +1,20 @@
-.. _multivariate_drift_detection_cdd:
+.. _multivariate_drift_detection_dc:
 
-==============================
-Classifier for Drift Detection
-==============================
+=================
+Domain Classifier
+=================
 
-The second multivariate drift detection method of NannyML is Classifier for Drift Detection.
+The second multivariate drift detection method of NannyML is Domain Classifier.
 It provides a measure of how easy it is to discriminate the reference data from the examined chunk data.
-You can read more about on the :ref:`How it works: Classifier for Drift Detection<how-multiv-drift-cdd>` section.
+You can read more about it in the :ref:`How it works: Domain Classifier<how-multiv-drift-dc>` section.
 When there is no data drift the datasets can't be discerned and we get a value of 0.5.
 The more drift there is, the higher the returned measure will be, up to a value of 1.
 
 Just The Code
 -------------
 
 .. nbimport::
-    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
+    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
     :cells: 1 3 4 6 8
 
 .. admonition:: **Advanced configuration**
@@ -43,14 +43,14 @@ Let's start by loading some synthetic data provided by the NannyML package set i
 This synthetic data is for a binary classification model, but multi-class classification can be handled in the same way.
 
 .. nbimport::
-    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
+    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
     :cells: 1
 
 .. nbtable::
-    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
+    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
     :cell: 2
 
-The :class:`~nannyml.drift.multivariate.classifier_for_drift_detection.calculator.DriftDetectionClassifierCalculator`
+The :class:`~nannyml.drift.multivariate.domain_classifier.calculator.DomainClassifierCalculator`
 module implements this functionality. We need to instantiate it with appropriate parameters:
 
 - **feature_column_names:** A list with the column names of the features we want to run drift detection on.
@@ -67,7 +67,7 @@ module implements this functionality. We need to instantiate it with appropriate
   order to create chunks.
 - **chunker (Optional):** A NannyML :class:`~nannyml.chunk.Chunker` object that will handle the aggregation
   of the provided data in order to create chunks.
-- **cv_folds_num (Optional):** Number of cross-validation folds to use when calculating CDD discrimination value.
+- **cv_folds_num (Optional):** Number of cross-validation folds to use when calculating DC discrimination value.
 - **hyperparameters (Optional):** A dictionary used to provide your own custom hyperparameters when training the
   discrimination model. Check out the available hyperparameter options in the `LightGBM docs`_.
 - **tune_hyperparameters (Optional):** A boolean controlling whether hypertuning should be performed on the internal
@@ -84,29 +84,29 @@ which the results will be based on. Then the
 calculate the multivariate drift results on the provided data.
 
 .. nbimport::
-    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
+    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
     :cells: 3
 
 We can see these results of the data provided to the
 :meth:`~nannyml.base.AbstractCalculator.calculate`
 method as a dataframe.
 
 .. nbimport::
-    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
+    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
     :cells: 4
 
 .. nbtable::
-    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
+    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
     :cell: 5
 
 The drift results from the reference data are accessible from the properties of the results object:
 
 .. nbimport::
-    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
+    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
     :cells: 6
 
 .. nbtable::
-    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
+    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
     :cell: 7
 
 
@@ -119,7 +119,7 @@ NannyML can also visualize the multivariate drift results in a plot. Our plot co
   A red, diamond-shaped point marker additionally indicates this in the middle of the chunk.
 
 .. nbimport::
-    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Classifier for Drift.ipynb
+    :path: ./example_notebooks/Tutorial - Drift - Multivariate - Domain Classifier.ipynb
     :cells: 8
 
 .. image:: /_static/tutorials/detecting_data_drift/multivariate_drift_detection/classifier-for-drift-detection.svg

diff --git a/docs/tutorials/detecting_data_drift/multivariate_drift_detection/multiv_why.rst b/docs/tutorials/detecting_data_drift/multivariate_drift_detection/multiv_why.rst
diff --git a/nannyml/__init__.py b/nannyml/__init__.py
@@ -55,7 +55,7 @@
     AlertCountRanker,
     CorrelationRanker,
     DataReconstructionDriftCalculator,
-    DriftDetectionClassifierCalculator,
+    DomainClassifierCalculator,
     UnivariateDriftCalculator,
 )
 from .exceptions import ChunkerException, InvalidArgumentsException, MissingMetadataException
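Note on the rename above: existing code that uses `nml.DriftDetectionClassifierCalculator` will stop working once the old name is removed from the package namespace. This commit does not add a compatibility alias. If one were wanted, a module-level `__getattr__` (PEP 562) is one possible approach. The sketch below is hypothetical: `pkg`, `_RENAMED` and `_module_getattr` are illustrative names, and the class is a stand-in rather than the real calculator.

```python
import types
import warnings


class DomainClassifierCalculator:
    """Stand-in for the renamed calculator class (illustration only)."""


# Hypothetical mapping of removed names to their replacements.
_RENAMED = {"DriftDetectionClassifierCalculator": "DomainClassifierCalculator"}

# Stand-in for a package module. In a real package, __getattr__ would be
# defined directly at module level in __init__.py.
pkg = types.ModuleType("pkg")
pkg.DomainClassifierCalculator = DomainClassifierCalculator


def _module_getattr(name):
    """PEP 562 hook: called only when normal attribute lookup on the module fails."""
    if name in _RENAMED:
        new_name = _RENAMED[name]
        warnings.warn(
            f"{name} was renamed to {new_name}",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(pkg, new_name)
    raise AttributeError(f"module {pkg.__name__!r} has no attribute {name!r}")


pkg.__getattr__ = _module_getattr

# The old name still resolves, but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_class = pkg.DriftDetectionClassifierCalculator
```

The hook fires only for names that are missing from the module, so lookups of the new name are unaffected.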

diff --git a/nannyml/drift/__init__.py b/nannyml/drift/__init__.py
@@ -19,11 +19,11 @@
 The multivariate drift detection methods include:
 
 - Data reconstruction error: detects drift by performing dimensionality reduction on the model
-  inputs and then applying the inverse transformation on the latent (reduced) space.
-
-
+  inputs using PCA and then applying the inverse transformation on the latent (reduced) space.
+- Domain Classifier: detects drift by looking at how well a domain classifier performs at distinguishing
+  between the reference and the chunk datasets.
 """
-from .multivariate.classifier_for_drift_detection import DriftDetectionClassifierCalculator
+from .multivariate.domain_classifier import DomainClassifierCalculator
 from .multivariate.data_reconstruction import DataReconstructionDriftCalculator
 from .ranker import AlertCountRanker, CorrelationRanker
 from .univariate import FeatureType, Method, MethodFactory, UnivariateDriftCalculator
diff --git a/...lassifier_for_drift_detection/__init__.py → ...ultivariate/domain_classifier/__init__.py b/...lassifier_for_drift_detection/__init__.py → ...ultivariate/domain_classifier/__init__.py
@@ -25,5 +25,5 @@
 
 """
 
-from .calculator import DriftDetectionClassifierCalculator
+from .calculator import DomainClassifierCalculator
 from .result import Result
diff --git a/...ssifier_for_drift_detection/calculator.py → ...tivariate/domain_classifier/calculator.py b/...ssifier_for_drift_detection/calculator.py → ...tivariate/domain_classifier/calculator.py
@@ -28,7 +28,7 @@
 
 from nannyml.base import AbstractCalculator, _list_missing, _split_features_by_type
 from nannyml.chunk import Chunker
-from nannyml.drift.multivariate.classifier_for_drift_detection.result import Result
+from nannyml.drift.multivariate.domain_classifier.result import Result
 from nannyml.exceptions import InvalidArgumentsException
 
 # from nannyml.sampling_error import SAMPLING_ERROR_RANGE
@@ -71,8 +71,8 @@
 }
 
 
-class DriftDetectionClassifierCalculator(AbstractCalculator):
-    """DriftDetectionClassifierCalculator implementation.
+class DomainClassifierCalculator(AbstractCalculator):
+    """DomainClassifierCalculator implementation.
 
     Uses the Domain Classifier's cross-validated performance as a measure of drift.
     """
@@ -92,7 +92,7 @@ def __init__(
         hyperparameter_tuning_config: Optional[Dict[str, Any]] = DEFAULT_LGBM_HYPERPARAM_TUNING_CONFIG,
         threshold: Threshold = ConstantThreshold(lower=0.45, upper=0.65),
     ):
-        """Create a new DriftDetectionClassifierCalculator instance.
+        """Create a new DomainClassifierCalculator instance.
 
         Parameters:
         -----------
@@ -116,7 +116,7 @@ def __init__(
         chunker : Chunker, default=None
             The `Chunker` used to split the data sets into a list of chunks.
         cv_folds_num: Optional[int]
-            Number of cross-validation folds to use when calculating CDD discrimination value.
+            Number of cross-validation folds to use when calculating DC discrimination value.
         hyperparameters : Dict[str, Any], default = None
             A dictionary used to provide your own custom hyperparameters when training the discrimination model.
             Check out the available hyperparameter options in the
@@ -159,7 +159,7 @@ def __init__(
         ...     col for col in reference_df.columns
         ...     if col not in non_feature_columns
         >>> ]
-        >>> calc = nml.DriftDetectionClassifierCalculator(
+        >>> calc = nml.DomainClassifierCalculator(
         ...     feature_column_names=feature_column_names,
         ...     timestamp_column_name='timestamp',
         ...     chunk_size=5000
@@ -169,7 +169,7 @@ def __init__(
         >>> figure = results.plot()
         >>> figure.show()
         """
-        super(DriftDetectionClassifierCalculator, self).__init__(
+        super(DomainClassifierCalculator, self).__init__(
             chunk_size, chunk_number, chunk_period, chunker, timestamp_column_name
         )
         if isinstance(feature_column_names, str):
@@ -201,9 +201,9 @@ def __init__(
         # self._sampling_error_components: Tuple = ()
         self.result: Optional[Result] = None
 
-    @log_usage(UsageEvent.CDD_CALC_FIT)
+    @log_usage(UsageEvent.DC_CALC_FIT)
     def _fit(self, reference_data: pd.DataFrame, *args, **kwargs):
-        """Fits the CDD calculator to a set of reference data."""
+        """Fits the DC calculator to a set of reference data."""
         if reference_data.empty:
             raise InvalidArgumentsException('data contains no rows. Please provide a valid data set.')
 
@@ -232,9 +232,9 @@ def _fit(self, reference_data: pd.DataFrame, *args, **kwargs):
 
         return self
 
-    @log_usage(UsageEvent.CDD_CALC_RUN)
+    @log_usage(UsageEvent.DC_CALC_RUN)
     def _calculate(self, data: pd.DataFrame, *args, **kwargs) -> Result:
-        """Calculate the data CDD calculator metric for a given data set."""
+        """Calculate the DC calculator metric for a given data set."""
         if data.empty:
             raise InvalidArgumentsException('data contains no rows. Please provide a valid data set.')
 
@@ -330,20 +330,20 @@ def _calculate_chunk(self, data: pd.DataFrame):
     def _set_metric_thresholds(self, result_data: pd.DataFrame):
         self.lower_threshold_value, self.upper_threshold_value = calculate_threshold_values(
             threshold=self.threshold,
-            data=result_data.loc[:, ('classifier_auroc', 'value')],
+            data=result_data.loc[:, ('domain_classifier_auroc', 'value')],
             lower_threshold_value_limit=self._lower_threshold_value_limit,
             upper_threshold_value_limit=self._upper_threshold_value_limit,
             logger=self._logger,
         )
 
     def _populate_alert_thresholds(self, result_data: pd.DataFrame) -> pd.DataFrame:
-        result_data[('classifier_auroc', 'upper_threshold')] = self.upper_threshold_value
-        result_data[('classifier_auroc', 'lower_threshold')] = self.lower_threshold_value
-        result_data[('classifier_auroc', 'alert')] = result_data.apply(
+        result_data[('domain_classifier_auroc', 'upper_threshold')] = self.upper_threshold_value
+        result_data[('domain_classifier_auroc', 'lower_threshold')] = self.lower_threshold_value
+        result_data[('domain_classifier_auroc', 'alert')] = result_data.apply(
             lambda row: True
             if (
-                row[('classifier_auroc', 'value')] > row[('classifier_auroc', 'upper_threshold')]
-                or row[('classifier_auroc', 'value')] < row[('classifier_auroc', 'lower_threshold')]
+                row[('domain_classifier_auroc', 'value')] > row[('domain_classifier_auroc', 'upper_threshold')]
+                or row[('domain_classifier_auroc', 'value')] < row[('domain_classifier_auroc', 'lower_threshold')]
             )
             else False,
             axis=1,
@@ -401,7 +401,7 @@ def _create_multilevel_index(include_thresholds: bool = False):
             'alert',
         ]
     chunk_tuples = [('chunk', chunk_column_name) for chunk_column_name in chunk_column_names]
-    reconstruction_tuples = [('classifier_auroc', column_name) for column_name in results_column_names]
+    reconstruction_tuples = [('domain_classifier_auroc', column_name) for column_name in results_column_names]
 
     tuples = chunk_tuples + reconstruction_tuples
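Note on the alert logic in `_populate_alert_thresholds` above: a chunk is flagged when its AUROC is strictly above the upper threshold or strictly below the lower one. A minimal standalone sketch of that rule follows, using the default `ConstantThreshold(lower=0.45, upper=0.65)` values from the constructor signature. The `is_alert` helper and the sample values are hypothetical and not part of the library.

```python
def is_alert(value, lower=0.45, upper=0.65):
    """Flag a chunk when its AUROC falls strictly outside the [lower, upper] band."""
    # The comparison already yields a bool, so no `True if ... else False` is needed.
    return value > upper or value < lower


# Made-up AUROC values for six chunks, for illustration only.
aurocs = [0.50, 0.52, 0.66, 0.91, 0.44, 0.65]
alerts = [is_alert(v) for v in aurocs]
print(alerts)  # [False, False, True, True, True, False]
```

Values exactly equal to a threshold do not raise an alert, which matches the strict comparisons in the diff.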