Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Store data frequencies in tree nodes of RF #3647

Merged
merged 12 commits into from
Mar 25, 2021

Conversation

hcho3
Copy link
Contributor

@hcho3 hcho3 commented Mar 23, 2021

The RF model should store the number of data points associated with each tree node. This information is useful in many ways, including:

  • Visualizing the trees
  • Debugging performance problem
  • Computing SHAP values using the TreeSHAP algorithm

To that end, this PR does the following:

  • Add the instance_count field in the SparseTreeNode structure
  • Expose the instance_count field in the JSON dump
  • Add a unit test to ensure that the counts in the JSON dump are correct.

Note that this feature will work with the new backend only. If the old backend is used, instance_count field will be absent in the JSON dump.

Closes #3131

@hcho3 hcho3 requested review from a team as code owners March 23, 2021 05:28
@github-actions github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels Mar 23, 2021
@hcho3 hcho3 added 3 - Ready for Review Ready for review by team CUDA / C++ CUDA issue Dask / cuml.dask Issue/PR related to Python level dask or cuml.dask features. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change 2 - In Progress Currenty a work in progress and removed CUDA/C++ 3 - Ready for Review Ready for review by team labels Mar 23, 2021
@hcho3 hcho3 marked this pull request as draft March 23, 2021 05:34
@hcho3 hcho3 marked this pull request as ready for review March 23, 2021 07:17
@hcho3 hcho3 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currenty a work in progress labels Mar 23, 2021
@hcho3 hcho3 changed the title Store data frequencies in tree nodes of RF [REVIEW] Store data frequencies in tree nodes of RF Mar 23, 2021
@hcho3
Copy link
Contributor Author

hcho3 commented Mar 23, 2021

rerun tests

@JohnZed JohnZed added this to PR-WIP in v0.19 Release via automation Mar 24, 2021
@hcho3
Copy link
Contributor Author

hcho3 commented Mar 24, 2021

Rerun tests

@hcho3
Copy link
Contributor Author

hcho3 commented Mar 24, 2021

The Naive Bayes tests keep failing because the dataset download times out.

Copy link
Contributor

@JohnZed JohnZed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I just had tiny comments.

python/cuml/test/dask/test_random_forest.py Outdated Show resolved Hide resolved
check_instance_count_for_non_leaf(tree['children'][0])
check_instance_count_for_non_leaf(tree['children'][1])

for tree in json_obj:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good but seems like it might be slow for deep trees. We are currently testing a lot of parameter variants here. How slow is the test overall? Should we reduce number of param variants or only do this test for some combos?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It completes in about a minute on my machine.

python/cuml/test/test_random_forest.py Outdated Show resolved Hide resolved
python/cuml/test/test_random_forest.py Show resolved Hide resolved
v0.19 Release automation moved this from PR-WIP to PR-Reviewer approved Mar 24, 2021
@JohnZed
Copy link
Contributor

JohnZed commented Mar 24, 2021

@gpucibot merge

@dantegd dantegd added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Mar 25, 2021
@hcho3
Copy link
Contributor Author

hcho3 commented Mar 25, 2021

rerun tests

@codecov-io
Copy link

Codecov Report

Merging #3647 (e1ffc7b) into branch-0.19 (c2f246a) will increase coverage by 0.28%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #3647      +/-   ##
===============================================
+ Coverage        80.87%   81.15%   +0.28%     
===============================================
  Files              228      228              
  Lines            17630    17813     +183     
===============================================
+ Hits             14258    14457     +199     
+ Misses            3372     3356      -16     
Flag Coverage Δ
dask 45.12% <ø> (+0.16%) ⬆️
non-dask 73.47% <ø> (+0.37%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/cuml/solvers/qn.pyx 97.20% <0.00%> (-0.43%) ⬇️
python/cuml/dask/solvers/cd.py 100.00% <0.00%> (ø)
python/cuml/pipeline/__init__.py 100.00% <0.00%> (ø)
python/cuml/common/numba_utils.py 0.00% <0.00%> (ø)
python/cuml/internals/global_settings.py 100.00% <0.00%> (ø)
python/cuml/dask/preprocessing/encoders.py 100.00% <0.00%> (ø)
python/cuml/metrics/_ranking.py 98.57% <0.00%> (+0.02%) ⬆️
python/cuml/preprocessing/encoders.py 95.10% <0.00%> (+0.02%) ⬆️
python/cuml/common/array_descriptor.py 98.24% <0.00%> (+0.03%) ⬆️
python/cuml/common/array.py 96.99% <0.00%> (+0.04%) ⬆️
... and 38 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2f246a...e1ffc7b. Read the comment docs.

@rapids-bot rapids-bot bot merged commit c48c081 into rapidsai:branch-0.19 Mar 25, 2021
v0.19 Release automation moved this from PR-Reviewer approved to Done Mar 25, 2021
@hcho3 hcho3 deleted the save_instance_count branch March 25, 2021 07:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CUDA / C++ CUDA issue Cython / Python Cython or Python issue Dask / cuml.dask Issue/PR related to Python level dask or cuml.dask features. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
No open projects
v0.19 Release
  
Done
Development

Successfully merging this pull request may close these issues.

[FEA] Store data frequencies in tree nodes of RF
4 participants