handle floating point values more accurately #277

Merged
merged 3 commits on Aug 5, 2020
4 changes: 4 additions & 0 deletions README.md
@@ -131,3 +131,7 @@ A: If this error occurs while generating code using an ensemble model, try to re
**Q: Generation fails with `ImportError: No module named <module_name_here>` error while transpiling model from a serialized model object.**

A: This error indicates that the pickle protocol cannot deserialize the model object. To unpickle a serialized model object, its class must be defined at the top level of an importable module in the unpickling environment. Installing the package that provides the model's class definition should solve the problem.

**Q: Code generated by m2cgen produces different results for some inputs compared to the original Python model from which it was obtained.**

A: Some models cast input data to a particular type during the prediction phase in their native Python libraries. Currently, m2cgen works only with the ``float64`` (``double``) data type. You can try casting your input data to a different type manually and check the results again. Also, some small differences can arise from the specific implementation of floating-point arithmetic in a target language.
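To make the FAQ answer concrete, here is a minimal sketch (not from the project itself) showing that widening ``float32`` inputs to the ``float64`` type m2cgen works with is a value-preserving cast, so any remaining prediction differences come from the model's internal arithmetic rather than the cast:

```python
import numpy as np

# Hypothetical input batch stored as float32, as some pipelines do.
X = np.array([[1.5, 2.25, 3.1]], dtype=np.float32)

# Widen to float64, the only dtype m2cgen currently supports.
X64 = X.astype(np.float64)

# float32 -> float64 is exact: every float32 value is representable
# as a float64, so the cast introduces no additional error.
assert X64.dtype == np.float64
assert np.all(X64 == X)
```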
2 changes: 1 addition & 1 deletion m2cgen/assemblers/boosting.py
@@ -151,7 +151,7 @@ def __init__(self, model):

def _assemble_tree(self, tree):
if "leaf" in tree:
return ast.NumVal(tree["leaf"])
return ast.NumVal(tree["leaf"], dtype=np.float32)

threshold = ast.NumVal(tree["split_condition"], dtype=np.float32)
split = tree["split"]
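The diff above wraps XGBoost leaf values and split conditions in ``NumVal(..., dtype=np.float32)``. A small illustrative sketch (my own, not from the PR) of why the dtype matters: XGBoost stores these values as 32-bit floats, and the float32 and float64 renderings of the same decimal literal are different numbers, so comparisons in generated code must use the float32 value to match the library:

```python
import numpy as np

# A split condition as it might appear in an XGBoost model dump.
threshold = 0.1

# Interpreting it as float32 (as XGBoost stores it) yields a value
# that differs from the float64 reading of the same literal.
as_float32 = np.float32(threshold)

assert np.float64(as_float32) != np.float64(threshold)
```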
10 changes: 1 addition & 9 deletions m2cgen/assemblers/tree.py
@@ -1,5 +1,3 @@
import numpy as np

from m2cgen import ast
from m2cgen.assemblers import utils
from m2cgen.assemblers.base import ModelAssembler
@@ -49,11 +47,5 @@ def _assemble_leaf(self, node_id):

def _assemble_cond(self, node_id):
feature_idx = self._tree.feature[node_id]
threshold = self._tree.threshold[node_id]

# sklearn's trees internally work with float32 numbers, so in order
# to have consistent results across all supported languages, we convert
# all thresholds into float32.
threshold_num_val = ast.NumVal(threshold, dtype=np.float32)

threshold_num_val = ast.NumVal(self._tree.threshold[node_id])
Member Author:
Refer to #190 (review).

Now the threshold matches the original type in scikit-learn (double).

return utils.lte(ast.FeatureRef(feature_idx), threshold_num_val)
16 changes: 11 additions & 5 deletions m2cgen/interpreters/code_generator.py
@@ -1,6 +1,10 @@
from io import StringIO
from weakref import finalize

import numpy as np

from m2cgen.interpreters.utils import format_float


class CodeTemplate:

@@ -11,12 +15,14 @@ def __str__(self):
return self.str_template

def __call__(self, *args, **kwargs):
# Force calling str() representation
# because without it numpy gives the same output
# for different float types

def _is_float(value):
return isinstance(value, (float, np.floating))

return self.str_template.format(
*[str(i) for i in args],
**{k: str(v) for k, v in kwargs.items()})
*[format_float(i) if _is_float(i) else i for i in args],
**{k: format_float(v) if _is_float(v) else v
for k, v in kwargs.items()})


class BaseCodeGenerator:
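The ``CodeTemplate.__call__`` change above routes every float argument through ``format_float`` instead of ``str``. A self-contained sketch of that formatting path (reimplemented here for illustration; the real code lives in ``m2cgen/interpreters``):

```python
import numpy as np

def _is_float(value):
    # Matches both Python floats and numpy floating scalars.
    return isinstance(value, (float, np.floating))

def format_float(value):
    return np.format_float_positional(value, unique=True, trim="0")

# A hypothetical code template: floats are rendered positionally,
# non-float arguments pass through format() unchanged.
template = "return {0} * input[{1}];"
rendered = template.format(
    *[format_float(a) if _is_float(a) else a for a in (1e-07, 3)])

assert rendered == "return 0.0000001 * input[3];"
```

Rendering positionally avoids emitting scientific notation like ``1e-07``, which not every target language parses as a literal.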
6 changes: 6 additions & 0 deletions m2cgen/interpreters/utils.py
@@ -1,5 +1,7 @@
import re

import numpy as np

from collections import namedtuple
from functools import lru_cache
from math import ceil, log
@@ -22,3 +24,7 @@ def _get_handler_name(expr_tpe):

def _normalize_expr_name(name):
return re.sub("(?!^)([A-Z]+)", r"_\1", name).lower()


def format_float(value):
return np.format_float_positional(value, unique=True, trim="0")
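A quick sketch of what ``np.format_float_positional`` with these arguments produces, compared with ``str``:

```python
import numpy as np

# str() switches to scientific notation for small magnitudes, which
# some target languages may fail to parse as a numeric literal.
assert str(1e-08) == "1e-08"

# format_float_positional always stays positional; unique=True emits
# the shortest digit string that round-trips the value, and trim="0"
# keeps exactly one trailing zero after the decimal point.
assert np.format_float_positional(1e-08, unique=True, trim="0") == "0.00000001"
assert np.format_float_positional(2.0, unique=True, trim="0") == "2.0"
```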
Member Author:
Maybe format_float_scientific would be better: https://numpy.org/doc/stable/reference/generated/numpy.format_float_scientific.html. But I'm not sure how many languages support scientific notation.

2 changes: 1 addition & 1 deletion tests/e2e/executors/c.py
@@ -54,7 +54,7 @@ def __init__(self, model):
def predict(self, X):

exec_args = [os.path.join(self._resource_tmp_dir, self.model_name)]
exec_args.extend(map(str, X))
exec_args.extend(map(interpreters.utils.format_float, X))
return utils.predict_from_commandline(exec_args)

def prepare(self):
2 changes: 1 addition & 1 deletion tests/e2e/executors/c_sharp.py
@@ -49,7 +49,7 @@ def __init__(self, model):

def predict(self, X):
exec_args = [os.path.join(self.target_exec_dir, self.project_name)]
exec_args.extend(map(str, X))
exec_args.extend(map(interpreters.utils.format_float, X))
return utils.predict_from_commandline(exec_args)

@classmethod
2 changes: 1 addition & 1 deletion tests/e2e/executors/dart.py
@@ -42,7 +42,7 @@ def predict(self, X):
f"{self.executor_name}.dart")
exec_args = [self._dart,
file_name,
*map(str, X)]
*map(interpreters.utils.format_float, X)]
return utils.predict_from_commandline(exec_args)

def prepare(self):
2 changes: 1 addition & 1 deletion tests/e2e/executors/f_sharp.py
@@ -35,7 +35,7 @@ def __init__(self, model):

def predict(self, X):
exec_args = [os.path.join(self.target_exec_dir, self.project_name)]
exec_args.extend(map(str, X))
exec_args.extend(map(interpreters.utils.format_float, X))
return utils.predict_from_commandline(exec_args)

@classmethod
2 changes: 1 addition & 1 deletion tests/e2e/executors/go.py
@@ -55,7 +55,7 @@ def __init__(self, model):
def predict(self, X):

exec_args = [os.path.join(self._resource_tmp_dir, self.model_name)]
exec_args.extend(map(str, X))
exec_args.extend(map(interpreters.utils.format_float, X))
return utils.predict_from_commandline(exec_args)

def prepare(self):
3 changes: 2 additions & 1 deletion tests/e2e/executors/haskell.py
@@ -39,7 +39,8 @@ def __init__(self, model):
def predict(self, X):
app_name = os.path.join(self._resource_tmp_dir,
self.executor_name)
exec_args = [app_name, *map(str, X)]
exec_args = [app_name,
*map(interpreters.utils.format_float, X)]
return utils.predict_from_commandline(exec_args)

def prepare(self):
2 changes: 1 addition & 1 deletion tests/e2e/executors/java.py
@@ -24,7 +24,7 @@ def predict(self, X):
self._java_bin, "-cp", self._resource_tmp_dir,
"Executor", "Model", "score"
]
exec_args.extend(map(str, X))
exec_args.extend(map(m2c.interpreters.utils.format_float, X))
return utils.predict_from_commandline(exec_args)

def prepare(self):
6 changes: 4 additions & 2 deletions tests/e2e/executors/javascript.py
@@ -1,4 +1,5 @@
import os

from py_mini_racer import py_mini_racer

import m2cgen as m2c
@@ -16,10 +17,11 @@ def predict(self, X):
with open(file_name, 'r') as myfile:
code = myfile.read()

caller = f"score([{','.join(map(str, X))}]);\n"
args = ",".join(map(m2c.interpreters.utils.format_float, X))
caller = f"score([{args}]);\n"

ctx = py_mini_racer.MiniRacer()
result = ctx.eval(caller + code)
result = ctx.eval(f"{caller}{code}")

return result

2 changes: 1 addition & 1 deletion tests/e2e/executors/php.py
@@ -47,7 +47,7 @@ def predict(self, X):
exec_args = [self._php,
"-f",
file_name,
*map(str, X)]
*map(interpreters.utils.format_float, X)]
return utils.predict_from_commandline(exec_args)

def prepare(self):
2 changes: 1 addition & 1 deletion tests/e2e/executors/powershell.py
@@ -40,7 +40,7 @@ def predict(self, X):
"-File",
file_name,
"-InputArray",
",".join(map(str, X))]
",".join(map(interpreters.utils.format_float, X))]
return utils.predict_from_commandline(exec_args)

def prepare(self):
2 changes: 1 addition & 1 deletion tests/e2e/executors/r.py
@@ -34,7 +34,7 @@ def predict(self, X):
exec_args = [self._r,
"--vanilla",
file_name,
*map(str, X)]
*map(interpreters.utils.format_float, X)]
return utils.predict_from_commandline(exec_args)

def prepare(self):
4 changes: 3 additions & 1 deletion tests/e2e/executors/ruby.py
@@ -38,7 +38,9 @@ def __init__(self, model):
def predict(self, X):
file_name = os.path.join(self._resource_tmp_dir,
f"{self.model_name}.rb")
exec_args = [self._ruby, file_name, *map(str, X)]
exec_args = [self._ruby,
file_name,
*map(interpreters.utils.format_float, X)]
return utils.predict_from_commandline(exec_args)

def prepare(self):
2 changes: 1 addition & 1 deletion tests/e2e/executors/visual_basic.py
@@ -51,7 +51,7 @@ def __init__(self, model):

def predict(self, X):
exec_args = [os.path.join(self.target_exec_dir, self.project_name)]
exec_args.extend(map(str, X))
exec_args.extend(map(interpreters.utils.format_float, X))
return utils.predict_from_commandline(exec_args)

@classmethod
2 changes: 2 additions & 0 deletions tests/e2e/test_e2e.py
@@ -568,6 +568,8 @@ def test_e2e(estimator, executor_cls, model_trainer,
with executor.prepare_then_cleanup():
for idx in idxs_to_test:
y_pred_executed = executor.predict(X_test[idx])
y_pred_executed = np.array(
y_pred_executed, dtype=y_pred_true.dtype, copy=False)
Comment on lines +571 to +572 (Member Author):
Quite often different packages not only cast input values in the predict method, but also return results with a different type. For instance, XGBoost always returns float: https://github.com/dmlc/xgboost/blob/12110c900eff0aaa06045ecf717e6c5a36a164d5/python-package/xgboost/core.py#L1373

print(f"expected={y_pred_true[idx]}, actual={y_pred_executed}")
res = np.isclose(y_pred_true[idx], y_pred_executed, atol=ATOL)
assert res if isinstance(res, bool) else res.all()
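The comparison in the test above can be sketched in isolation. This is an illustrative reconstruction, not the actual test; the ``ATOL`` value and the arrays are hypothetical:

```python
import numpy as np

ATOL = 1e-6  # hypothetical tolerance, standing in for the test's ATOL

y_true = np.array([0.25, 0.5], dtype=np.float64)

# Executor output parsed from stdout, cast to the expected dtype so
# scalar and array results compare uniformly.
y_exec = np.array([0.2500004, 0.5], dtype=y_true.dtype, copy=False)

# np.isclose returns an array for array inputs and a bool for scalars,
# hence the two-branch assertion in the test.
res = np.isclose(y_true, y_exec, atol=ATOL)
assert res if isinstance(res, bool) else res.all()
```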
23 changes: 16 additions & 7 deletions tests/utils.py
@@ -13,11 +13,13 @@
from lightning.impl.base import BaseClassifier as LightBaseClassifier
from sklearn import datasets
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble._forest import ForestClassifier
from sklearn.ensemble._forest import ForestClassifier, BaseForest
from sklearn.utils import shuffle
from sklearn.linear_model._base import LinearClassifierMixin
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._classes import BaseDecisionTree
from sklearn.svm import SVC, NuSVC
from sklearn.svm._base import BaseLibSVM
from xgboost import XGBClassifier

from m2cgen import ast
@@ -125,15 +127,22 @@ def __call__(self, estimator):
if isinstance(estimator, (LinearClassifierMixin, SVC, NuSVC,
LightBaseClassifier)):
y_pred = estimator.decision_function(self.X_test)
elif isinstance(estimator, DecisionTreeClassifier):
y_pred = estimator.predict_proba(self.X_test.astype(np.float32))
elif isinstance(
estimator,
(ForestClassifier, XGBClassifier, LGBMClassifier)):
(ForestClassifier, DecisionTreeClassifier,
XGBClassifier, LGBMClassifier)):
y_pred = estimator.predict_proba(self.X_test)
else:
y_pred = estimator.predict(self.X_test)

# Some models force input data to be particular type
# during prediction phase in their native Python libraries.
# For correct comparison of testing results we mimic the same behavior
if isinstance(estimator, (BaseDecisionTree, BaseForest)):
self.X_test = self.X_test.astype(np.float32, copy=False)
Comment on lines +141 to +142 (Member):
I'm a little concerned that here we treat a symptom, not the cause. By casting the input vector to float32 and then passing it as strings into estimators, we don't exactly reproduce the actual environment, where the casted values will be transformed back to doubles because of the score function signature. What do you think? Am I overthinking this?

Member Author:
I'm a little concerned that here we treat a symptom not the cause.

Yeah, you're absolutely right! To fix the root cause we should support multiple floating-point types in target languages. In this PR I just propose to make our tests a little fairer. Native libraries do a double -> float conversion, while in our tests we perform double -> float -> double. Unfortunately, I'm not a numerical expert, but it seems that float -> double is a safe conversion: https://stackoverflow.com/questions/29648271/convert-float-double-float.

elif isinstance(estimator, BaseLibSVM):
self.X_test = self.X_test.astype(np.float64, copy=False)

return self.X_test, y_pred, fitted_estimator
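The claim in the review thread above — that the float -> double half of a double -> float -> double round trip is lossless — can be checked directly. A small sketch of my own, not part of the PR:

```python
import numpy as np

x = np.float64(0.1)

f32 = np.float32(x)      # lossy: double -> float drops precision
back = np.float64(f32)   # exact: every float32 is representable as float64

# Widening preserved the float32 value exactly...
assert back == f32
# ...but the information lost going down to float32 stays lost.
assert back != x
```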


@@ -238,9 +247,9 @@ def predict_from_commandline(exec_args):
items = stdout.decode("utf-8").strip().split(" ")

if len(items) == 1:
return float(items[0])
return np.float64(items[0])
else:
return [float(i) for i in items]
return [np.float64(i) for i in items]
Comment on lines +250 to +252 (Member Author):
Let's use numpy types across the whole codebase for consistency.



def cartesian_e2e_params(executors_with_marks, models_with_trainers_with_marks,
@@ -284,4 +293,4 @@ def inner(*args, **kwarg):


def _is_float(value):
return isinstance(value, (float, np.float16, np.float32, np.float64))
return isinstance(value, (float, np.floating))