
Fix #168. Enforce float32 type for split condition values for GBT models created using XGBoost #188

Merged
merged 3 commits into master from iaroslav/issue-168 on Apr 5, 2020

Conversation

@izeigerman (Member) commented Mar 30, 2020

As it turns out, the issue reported in #168 is not unique to the "hist" tree construction algorithm. It seems that with the "hist" method the likelihood of reproducing it is much higher due to its reliance on feature histograms. I was able to reproduce the same discrepancy with non-hist methods on a larger sample of test data.

The issue occurs due to a double precision error and reproduces every time the feature value matches the split condition in one of the tree's nodes.

Example: feature value = 0.671, split condition = 0.671000004. When we hit this condition in the generated code, the outcome of 0.671 < 0.671000004 is "true" (the "yes" branch), while in XGBoost the same condition leads to the "no" branch.
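A minimal sketch of the mismatch (the values are the ones quoted above; the float32 casts mirror what XGBoost does internally with its DMatrix and thresholds):

```python
import numpy as np

feature = 0.671              # float64, as used by the generated code
threshold = 0.671000004      # float64 threshold parsed from the model dump

print(feature < threshold)                          # True  -> "yes" branch
print(np.float32(feature) < np.float32(threshold))  # False -> "no" branch, as in XGBoost
```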

After some investigation I noticed that XGBoost's DMatrix forces all values to be float32 (https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L565). At the same time, in our assemblers we rely on default 64-bit floats. Forcing the split condition to be float32 seems to address the issue; at least I couldn't reproduce it so far.
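The change itself is essentially a one-liner in the XGBoost assembler; here is a sketch of its effect on the embedded literal (the node dict below is a trimmed, hypothetical copy of a dump node):

```python
import numpy as np

# Trimmed, hypothetical node from an XGBoost JSON dump.
node = {"nodeid": 4, "split": "f4", "split_condition": 0.671000004}

old_threshold = node["split_condition"]              # kept as float64 before this change
new_threshold = np.float32(node["split_condition"])  # snapped back onto the float32 grid

print(float(old_threshold))  # 0.671000004
print(float(new_threshold))  # 0.6710000038146973, the exact value XGBoost compares against
```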

@izeigerman (Member, Author) commented Mar 30, 2020

@StrikerRUS In case you're curious and want to play with this issue, as well as for historical purposes, I'm attaching a serialized (pickled) trained XGBoost regression model (trained using the "hist" method): round_error_xgboost.bin.gz.
The feature vector on which the issue reproduces:

[9.82349, 0.0, 18.1, 0.0, 0.671, 6.794, 98.8, 1.358, 24.0, 666.0, 20.2, 396.9, 10.17]

Note the 0.671 value (feature f4). Because of this value the discrepancy occurs in the estimator with index 7, node ID 4:

"nodeid": 4, "depth": 2, "split": "f4", "split_condition": 0.671000004, "yes": 9, "no": 10, "missing": 10

This is where the generated code follows the "yes" path while XGBoost does the opposite.
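A reproduction sketch, assuming the pickled object is a scikit-learn style XGBRegressor (adjust the predict call if it is a raw Booster):

```python
import gzip
import pickle

import numpy as np

with gzip.open("round_error_xgboost.bin.gz", "rb") as f:
    model = pickle.load(f)

x = np.array([[9.82349, 0.0, 18.1, 0.0, 0.671, 6.794, 98.8, 1.358,
               24.0, 666.0, 20.2, 396.9, 10.17]])

# Compare this prediction with the output of the code generated for the same
# vector; before the fix they diverge at estimator 7, node 4.
print(model.predict(x))
```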

@coveralls commented Mar 30, 2020

Coverage increased (+0.005%) to 95.754% when pulling 645a49d on iaroslav/issue-168 into 52c601b on master.

@StrikerRUS (Member) commented Mar 31, 2020

Ah, brilliant investigation!
I completely forgot that we can encounter the same issue as dmlc/treelite#95, dmlc/treelite#55.

BTW, doesn't this code contradict the following from one of treelite's issues?

# sklearn's trees internally work with float32 numbers, so in order
# to have consistent results across all supported languages, we convert
# all thresholds into float32.
threshold = threshold.astype(np.float32)

One of the model builder tests failed because scikit-learn uses double-precision floating-point to store split points, whereas Treelite was using single-precision.

Also refer to
https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/tree/_tree.pyx#L544-L545
and
https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/tree/_tree.pyx#L727
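A quick illustrative check on a toy tree (arbitrary random data, only to show the stored dtype):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(100, 3)
y = np.random.rand(100)

est = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(est.tree_.threshold.dtype)  # float64, even though features are cast to float32 for splitting
```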

@StrikerRUS (Member) left a review:

Just a few minor questions.

@@ -134,7 +134,7 @@ def _assemble_tree(self, tree):
         if "leaf" in tree:
             return ast.NumVal(tree["leaf"])

-        threshold = ast.NumVal(tree["split_condition"])
+        threshold = ast.NumVal(np.float32(tree["split_condition"]))
@StrikerRUS (Member) commented on this change:

Maybe a more general solution would be to add an optional dtype constructor argument? I mean,

class NumVal(NumExpr):
    def __init__(self, value, dtype=np.float64):
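Roughly what I have in mind (a sketch only; the constructor body and the assembler call below are assumptions, not a final implementation):

```python
import numpy as np


class NumExpr:
    pass


class NumVal(NumExpr):
    def __init__(self, value, dtype=np.float64):
        # Coerce the literal to the requested dtype so that individual
        # assemblers can opt into float32 thresholds.
        self.value = dtype(value)


# The XGBoost assembler could then do something like:
# threshold = ast.NumVal(tree["split_condition"], dtype=np.float32)
```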

@izeigerman (Member, Author) replied:

That's a good idea 👍

     CLASSIFICATION,
 )


-def regression_random(model):
+def regression_random(model, test_fraction=0.02):
@StrikerRUS (Member) commented on this change:

Does increasing test_fraction for random datasets allow reproducing the original issue?

@izeigerman (Member, Author) replied:

Yes, unfortunately the default fraction produced far too few samples to reproduce the issue reliably.

@StrikerRUS (Member) replied:

Got it! We definitely need to refactor the testing routines to be more tunable, e.g. allow adjusting test_fraction according to the programming language. Refer to #114 (comment).

@izeigerman (Member, Author) commented:

@StrikerRUS Thanks for all the additional context!

scikit-learn uses double-precision floating-point to store split points

That's rather weird. I clearly remember that scikit-learn used float32 in its tree implementation as well. Perhaps this is a more recent change?

@StrikerRUS (Member) replied:

I clearly remember that scikit-learn used float32 in its tree implementation as well.

I have no idea...
Even in version 0.18 (2016), thresholds were double:

https://github.com/scikit-learn/scikit-learn/blob/38030a00a7f72a3528bd17f2345f34d1344d6d45/sklearn/tree/_tree.pyx#L186
(was later removed as unused in scikit-learn/scikit-learn#13230)

https://github.com/scikit-learn/scikit-learn/blob/38030a00a7f72a3528bd17f2345f34d1344d6d45/sklearn/tree/_tree.pyx#L527-L528

@izeigerman (Member, Author) commented:

@StrikerRUS Got a chance to address your comment only now. Sorry about the delay.

@StrikerRUS (Member) replied:

No problem at all! 🙂

Seems that the linter is unhappy with the imported numpy now. Also, I think we can add a new API test to ensure that passing different dtypes really takes effect.
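Something along these lines, for example (a hypothetical test; the exact module path and the final NumVal signature are assumptions):

```python
import numpy as np

from m2cgen import ast


def test_num_val_dtype_is_applied():
    assert isinstance(ast.NumVal(1, dtype=np.float32).value, np.float32)
    assert isinstance(ast.NumVal(1, dtype=np.float64).value, np.float64)
```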

For the scikit-learn issue (#188 (comment)) I believe it is better to have a separate PR.

@izeigerman (Member, Author) replied:

Fixed the linter error and added a test. Thanks 👍

@StrikerRUS (Member) left a review:

Thanks a lot for fixing this issue!
Everything looks OK to me.
