Add zero_division for single class prediction in MCC #28982

cynddl · 2024-05-08T22:57:25Z

Describe the bug

I have found a potential edge case issue with sklearn.metrics.matthews_corrcoef. The example provided in the documentation works as expected:

from sklearn.metrics import matthews_corrcoef
matthews_corrcoef([1, 1, 1, -1], [1, -1, 1, 1])  # returns -1/3 OK

However, edge cases appear when either y_true or y_pred contains only a single label.

Steps/Code to Reproduce

from sklearn.metrics import matthews_corrcoef
matthews_corrcoef([1, 1, 1, 1], [1, 1, 1, 1])  # returns 0 instead of 1
matthews_corrcoef([0, 0, 0, 0], [0, 0, 0, 0])  # returns 0 instead of 1

Expected Results

Outputs should be 1, not 0.

Actual Results

Outputs are 0.

Versions

System:
    python: 3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 16.0.6 ]
executable: /tmp/sklearn-test/.venv/bin/python
   machine: macOS-14.2.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.4.2
          pip: 24.0
   setuptools: 69.5.1
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /tmp/sklearn-test/.venv/lib/python3.11/site-packages/torch/lib/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /tmp/sklearn-test/.venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /tmp/sklearn-test/.venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: neoversen1

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /tmp/sklearn-test/.venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

hammad7 · 2024-05-09T11:54:53Z

From the definition, it's actually falling into the final if branch, which returns 0.0 to avoid division by zero:

def matthews_corrcoef(y_true, y_pred, *, sample_weight=None):
    """Compute the Matthews correlation coefficient (MCC).

    The Matthews correlation coefficient is used in machine learning as a
    measure of the quality of binary and multiclass classifications. It takes
    into account true and false positives and negatives and is generally
    regarded as a balanced measure which can be used even if the classes are of
    very different sizes. The MCC is in essence a correlation coefficient value
    between -1 and +1. A coefficient of +1 represents a perfect prediction, 0
    an average random prediction and -1 an inverse prediction.  The statistic
    is also known as the phi coefficient. [source: Wikipedia]

    Binary and multiclass labels are supported.  Only in the binary case does
    this relate to information about true and false positives and negatives.
    See references below.

    Read more in the :ref:`User Guide <matthews_corrcoef>`.

    Parameters
    ----------
    y_true : array, shape = [n_samples]
        Ground truth (correct) target values.

    y_pred : array, shape = [n_samples]
        Estimated targets as returned by a classifier.

    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights.

        .. versionadded:: 0.18

    Returns
    -------
    mcc : float
        The Matthews correlation coefficient (+1 represents a perfect
        prediction, 0 an average random prediction and -1 an inverse
        prediction).

    References
    ----------
    .. [1] :doi:`Baldi, Brunak, Chauvin, Andersen and Nielsen, (2000). Assessing the
       accuracy of prediction algorithms for classification: an overview.
       <10.1093/bioinformatics/16.5.412>`

    .. [2] `Wikipedia entry for the Matthews Correlation Coefficient
       <https://en.wikipedia.org/wiki/Matthews_correlation_coefficient>`_.

    .. [3] `Gorodkin, (2004). Comparing two K-category assignments by a
        K-category correlation coefficient
        <https://www.sciencedirect.com/science/article/pii/S1476927104000799>`_.

    .. [4] `Jurman, Riccadonna, Furlanello, (2012). A Comparison of MCC and CEN
        Error Measures in MultiClass Prediction
        <https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0041882>`_.

    Examples
    --------
    >>> from sklearn.metrics import matthews_corrcoef
    >>> y_true = [+1, +1, +1, -1]
    >>> y_pred = [+1, -1, +1, +1]
    >>> matthews_corrcoef(y_true, y_pred)
    -0.33...
    """
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    check_consistent_length(y_true, y_pred, sample_weight)
    if y_type not in {"binary", "multiclass"}:
        raise ValueError("%s is not supported" % y_type)

    lb = LabelEncoder()
    lb.fit(np.hstack([y_true, y_pred]))
    y_true = lb.transform(y_true)
    y_pred = lb.transform(y_pred)

    C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
    t_sum = C.sum(axis=1, dtype=np.float64)
    p_sum = C.sum(axis=0, dtype=np.float64)
    n_correct = np.trace(C, dtype=np.float64)
    n_samples = p_sum.sum()
    cov_ytyp = n_correct * n_samples - np.dot(t_sum, p_sum)
    cov_ypyp = n_samples**2 - np.dot(p_sum, p_sum)
    cov_ytyt = n_samples**2 - np.dot(t_sum, t_sum)

    if cov_ypyp * cov_ytyt == 0:
        return 0.0
    else:
        return cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
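
To see why the reported inputs hit that branch, the three covariance terms can be recomputed by hand. The following is a minimal pure-Python sketch, not scikit-learn code: the helper name mcc_terms is hypothetical, it ignores sample_weight, and it derives the row and column sums of the confusion matrix from label counts instead of calling confusion_matrix.

```python
from collections import Counter


def mcc_terms(y_true, y_pred):
    """Return (cov_ytyp, cov_ytyt, cov_ypyp) following the quoted formulas.

    Hypothetical helper for illustration only; unweighted samples assumed.
    """
    labels = sorted(set(y_true) | set(y_pred))
    n_samples = len(y_true)
    t_counts = Counter(y_true)  # row sums of the confusion matrix
    p_counts = Counter(y_pred)  # column sums of the confusion matrix
    t_sum = [t_counts[label] for label in labels]
    p_sum = [p_counts[label] for label in labels]
    # Trace of the confusion matrix: number of correct predictions.
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))

    cov_ytyp = n_correct * n_samples - sum(t * p for t, p in zip(t_sum, p_sum))
    cov_ypyp = n_samples**2 - sum(p * p for p in p_sum)
    cov_ytyt = n_samples**2 - sum(t * t for t in t_sum)
    return cov_ytyp, cov_ytyt, cov_ypyp


# Single-label input from the report: every term is zero, so the ratio is 0/0.
print(mcc_terms([1, 1, 1, 1], [1, 1, 1, 1]))    # (0, 0, 0)
# Documentation example: -2 / sqrt(6 * 6) = -1/3.
print(mcc_terms([1, 1, 1, -1], [1, -1, 1, 1]))  # (-2, 6, 6)
```

With a single label, both the numerator and the denominator vanish, so the coefficient is undefined rather than 0 or 1, and the function falls back to 0.0.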

glemaitre · 2024-05-13T10:41:20Z

Basically, the solution is to provide a zero_division parameter so that this corner case can be handled explicitly: #28509
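
As a rough illustration of what such a parameter could look like, here is a minimal pure-Python sketch. It is an assumption-laden stand-in, not the scikit-learn implementation: the function name mcc_with_zero_division is hypothetical, sample_weight is ignored, and the zero_division argument simply mirrors the parameter of the same name that metrics such as precision_score already accept.

```python
import math
from collections import Counter


def mcc_with_zero_division(y_true, y_pred, zero_division=0.0):
    """Unweighted MCC with a caller-chosen value for the undefined 0/0 case.

    Hypothetical sketch: the name and signature are not scikit-learn API.
    """
    labels = sorted(set(y_true) | set(y_pred))
    n_samples = len(y_true)
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)
    t_sum = [t_counts[label] for label in labels]
    p_sum = [p_counts[label] for label in labels]
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))

    cov_ytyp = n_correct * n_samples - sum(t * p for t, p in zip(t_sum, p_sum))
    cov_ypyp = n_samples**2 - sum(p * p for p in p_sum)
    cov_ytyt = n_samples**2 - sum(t * t for t in t_sum)

    denominator = cov_ytyt * cov_ypyp
    if denominator == 0:
        # Undefined coefficient: let the caller decide what to report.
        return float(zero_division)
    return cov_ytyp / math.sqrt(denominator)


# Default keeps the current behaviour for the reported input.
print(mcc_with_zero_division([1, 1, 1, 1], [1, 1, 1, 1]))                     # 0.0
# The caller can opt into the value the reporter expected.
print(mcc_with_zero_division([1, 1, 1, 1], [1, 1, 1, 1], zero_division=1.0))  # 1.0
```

Defaulting to 0.0 keeps existing results unchanged, while callers who consider a single-class perfect match a success can pass 1.0, or float("nan") to flag the case as undefined.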

cynddl added the Bug and Needs Triage labels May 8, 2024

glemaitre added the Enhancement label and removed the Bug and Needs Triage labels May 13, 2024

glemaitre changed the title ~~sklearn.metrics.matthews_corrcoef returns incorrect values~~ Add zero_division for single class prediction in MCC May 13, 2024
