
[REVIEW] Test FIL probabilities with absolute error thresholds in python #3582

Merged

Conversation

@levsnv (Contributor) commented Mar 4, 2021

Probabilities are bounded to [0.0, 1.0], and we generally care more about large probabilities, which are O(1/n_classes).
The largest relative probability errors are usually caused by a small ground-truth probability (e.g. 1e-3) rather than by a large absolute error.
Hence, relative probability error is not the best metric; absolute probability error is more relevant.
Absolute probability error is also more stable, since relative errors have a long tail. When training or inferring on many rows, the chance of encountering a ground-truth probability around 1e-3 or 1e-4 grows, and in some cases there is no reasonable and reliable relative threshold. Finally, as the number of predicted probabilities (clipped values) per input row grows, so does the long tail of relative probability errors, because it is undersampled less. This makes comparisons unfair between binary classification and regression, and between multiclass and binary classification.

The changes below are based on absolute errors collected under --run_unit, --run_quality and --run_stress; the new thresholds are violated at most a couple of times per million samples, and in most cases never.
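
For illustration, a minimal sketch of the kind of check this implies is below. This is not the actual cuML test code: the helper name, the reference probabilities, and the 1e-3 tolerance are placeholders, not the thresholds added in this PR.

```python
# Sketch only: compare FIL probabilities to a reference model with an
# absolute threshold, ignoring relative error entirely.
import numpy as np


def check_proba_absolute(fil_proba, ref_proba, atol=1e-3):
    """Assert every predicted probability is within `atol` of the reference."""
    fil_proba = np.asarray(fil_proba, dtype=np.float64)
    ref_proba = np.asarray(ref_proba, dtype=np.float64)
    # rtol=0 disables the relative component, so only |fil - ref| <= atol matters.
    np.testing.assert_allclose(fil_proba, ref_proba, rtol=0, atol=atol)


# Hypothetical usage with precomputed probability arrays:
# check_proba_absolute(fil_model.predict_proba(X), ref_model.predict_proba(X))
```

With a relative check, a reference probability of 1e-3 and a prediction of 2e-3 would count as a 100% error even though the absolute error is only 1e-3; an absolute threshold treats that case the same as an equally sized error near a large probability.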

@levsnv requested a review from a team as a code owner March 4, 2021 01:19
@github-actions bot added the "Cython / Python" label Mar 4, 2021
@levsnv requested a review from JohnZed March 4, 2021 01:19
@levsnv (Contributor, Author) commented Mar 4, 2021

Below are the absolute error distributions collected in the process
https://gist.github.com/levsnv/9af496ca27778724191d6b7658eddfd7
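
For context, one way such a distribution could be gathered is sketched below. This is an illustration only, under the assumption of precomputed FIL and reference probability arrays of equal shape; the actual collection code is in the gist above and may differ.

```python
# Sketch: summarize the distribution of absolute probability errors between
# FIL output and a reference model.
import numpy as np


def abs_error_quantiles(fil_proba, ref_proba, quantiles=(0.5, 0.99, 0.999, 1.0)):
    """Return selected quantiles of |fil_proba - ref_proba|."""
    errors = np.abs(np.asarray(fil_proba, dtype=np.float64)
                    - np.asarray(ref_proba, dtype=np.float64)).ravel()
    return {q: float(np.quantile(errors, q)) for q in quantiles}


# Hypothetical usage:
# print(abs_error_quantiles(fil_model.predict_proba(X), ref_model.predict_proba(X)))
```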

@levsnv added the "improvement" and "non-breaking" labels Mar 4, 2021
@levsnv requested a review from canonizer March 4, 2021 04:46
@levsnv added the "4 - Waiting on Reviewer" label Mar 4, 2021
@JohnZed (Contributor) left a comment

I think this is a clear win for clarity of the testing guarantees. The old rel thresholds were hard to reason about. Here, we are clarifying that we will test proba to a 3e-7 threshold or better in all cases. So I like it ;) Thanks!

@JohnZed (Contributor) commented Mar 5, 2021

I will wait for @canonizer's comment before merging, though, in case he has additional thoughts.

@levsnv (Contributor, Author) commented Mar 5, 2021

Thanks John! Could you also review #2894? Just don't merge it yet, so that we have proper documentation of where the changes came from.

@levsnv (Contributor, Author) commented Mar 5, 2021

rerun tests
./test/ml: symbol lookup error: ./test/ml: undefined symbol: _ZN5faiss3gpu20StandardGpuResources20setCudaMallocWarningEb

@codecov-io

Codecov Report

Merging #3582 (4e61c31) into branch-0.19 (9fa6e17) will increase coverage by 45.58%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-0.19    #3582       +/-   ##
================================================
+ Coverage        35.24%   80.83%   +45.58%     
================================================
  Files              378      227      -151     
  Lines            26910    17737     -9173     
================================================
+ Hits              9485    14337     +4852     
+ Misses           17425     3400    -14025     
Flag       Coverage Δ
dask       45.30% <ø> (+0.30%) ⬆️
non-dask   73.09% <ø> (?)

Flags with carried forward coverage won't be shown.

Impacted Files                                 Coverage Δ
python/cuml/dask/solvers/cd.py                 100.00% <0.00%> (ø)
python/cuml/common/numba_utils.py              0.00% <0.00%> (ø)
python/cuml/neighbors/__init__.py              100.00% <0.00%> (ø)
python/cuml/internals/global_settings.py       100.00% <0.00%> (ø)
python/cuml/dask/preprocessing/encoders.py     100.00% <0.00%> (ø)
cuml/dask/solvers/__init__.py
cuml/dask/naive_bayes/naive_bayes.py
cuml/datasets/__init__.py
cuml/experimental/explainer/__init__.py
cuml/dask/linear_model/linear_regression.py
... and 285 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9fa6e17...4e61c31.

@levsnv assigned JohnZed, levsnv and dantegd and unassigned JohnZed Mar 9, 2021
@dantegd (Member) commented Mar 9, 2021

@gpucibot merge

@rapids-bot bot merged commit cd220fc into rapidsai:branch-0.19 Mar 9, 2021