Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added Visual Basic code generator #109

Merged
merged 11 commits into from Nov 22, 2019
Merged

added Visual Basic code generator #109

merged 11 commits into from Nov 22, 2019

Conversation

StrikerRUS
Copy link
Member

The motivation behind this PR is allowing users with poor programming skills access to strong ML models inside Office applications (mainly in Excel).

Also, if I'm not mistaken, VBA projects can be used in SOLIDWORKS.

After merging this PR users will be able to use ML models inside Excel in the following way.

Usage Example

As usual, generate a model via supported ML algorithm:

from sklearn.datasets import load_boston
from sklearn.svm import SVR

import m2cgen as m2c

X, y = load_boston(True)
X = X[:4, :2]
y = y[:4]

reg = SVR()
reg.fit(X, y)

After that output VBA code representation of the model via the m2cgen Python package:

print(m2c.export_to_vba(reg))
Function score(ByRef input_vector() As Double) As Double
    Dim var0 As Double
    var0 = (0) - (0.3333333333333333)
    score = ((((28.70000000001455) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.00632) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((18.0) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((2.31) - (input_vector(2)), 2))))) * (-1.0))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.02731) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0.0) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((7.07) - (input_vector(2)), 2))))) * (-1.0))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.02729) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0.0) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((7.07) - (input_vector(2)), 2))))) * (1.0))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.03237) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0.0) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((2.18) - (input_vector(2)), 2))))) * (1.0))
End Function

Create empty Visual Basic file example_module.bas and paste the copied output there.

Now open Excel, enable Developer tab and click Developer -> Visual Basic (Alt + F11). In VBA editor click File -> Import File and choose previously created example_module.bas file.

After doing that, one more required action is writing a proxy function that will convert Excel Range object to Array and call the model. For instance, such function for regression, for row-based features placed inside Excel can be:

Function SCOREROW(features As Range) As Double
    Dim arr() As Double
    ReDim Preserve arr(features.Columns.Count - 1)
    Dim i As Integer
    For i = 0 To UBound(arr)
        arr(i) = features(1, i + 1)
    Next i
    SCOREROW = score(arr)
End Function

Now this proxy function can be used on Excel sheet as any built-in Excel functions:

image

Let's compare Excel predictions with ones from the native Python model:

reg.predict(X)
array([27.7       , 28.70034543, 28.70034543, 29.7       ])

Seems that everything is fine!


def __init__(self, indent=4, *args, **kwargs):
cg = VbaCodeGenerator(indent=indent)
kwargs["feature_array_name"] = "input_vector"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to strongly recommend changing default value to something else, because input is a reserved word in many languages. For instance, here, for VBA this arg name results in invalid syntax.

@coveralls
Copy link

Coverage Status

Coverage decreased (-3.3%) to 91.89% when pulling d1cbea9 on StrikerRUS:vba into e33fe2d on BayesWitnesses:master.

@coveralls
Copy link

coveralls commented Oct 18, 2019

Coverage Status

Coverage increased (+0.3%) to 95.601% when pulling d53e3d7 on StrikerRUS:vba into e0bae16 on BayesWitnesses:master.

@StrikerRUS
Copy link
Member Author

I'm stuck with two things.

1

Is there any functionality for breaking long code lines in m2cgen? Seems that VBA editor limits the length of lines to ~1k.
Steps to reproduce:

  • change X = X[:4, :3] to X = X[:6, :3] and y = y[:4] to y = y[:6] in the example in my previous comment and get the following error:
    image
  • apply some manual code breakage with _ syntax around any infix operator:
Function score(ByRef input_vector() As Double) As Double
    Dim var0 As Double
    var0 = (0) - (0.333333333333333)
    score = ((((((30.0504802048497) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.00632) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((18#) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((2.31) - (input_vector(2)), 2))))) * (-1#))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.02731) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0#) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((7.07) _
    - (input_vector(2)), 2))))) * (-1#))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.02729) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0#) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((7.07) - (input_vector(2)), 2))))) * (1#))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.03237) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0#) - (input_vector(1)), 2))) _
    + (Application.WorksheetFunction.Power((2.18) - (input_vector(2)), 2))))) * (1#))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.06905) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0#) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((2.18) - (input_vector(2)), 2))))) * (1#))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.02985) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0#) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((2.18) - (input_vector(2)), 2))))) * (-1#))
End Function
  • profit!
    image

2

Where this function is used? Is it used only once for returning vectors from the score function?

def interpret_vector_val(self, expr, **kwargs):
self.with_vectors = True
nested = [self._do_interpret(expr, **kwargs) for expr in expr.exprs]
return self._cg.vector_init(nested)

The thing is that there is no actually initialization syntax for typed vectors in VBA. And raising NotImplementedError results in NotImplemented error (quite obvious, yeah, but I hoped 😄 ). The closest construction will be

Dim return_arr(2) As Double
return_arr(0) = 42.42
return_arr(1) = 42.42
return_arr(2) = 42.42

https://github.com/BayesWitnesses/m2cgen/pull/109/files#diff-2df27f221106e3ee330fc9622e90b525R65-R75

However, this causes invalid syntax for return statement in classification.

score = Dim return_arr(3) As Double
return_arr(0) = (((0.269347108900547) + ((var1) * (-1.0))) + ((var2) * (1.0))) + ((var3) * (0.0))
return_arr(1) = (((((-0.019536450313180197) + ((var4) * (-0.40685972890332345))) + ((var5) * (-0.6179503667256067))) + ((var6) * (-0.033853208957492116))) + ((var2) * (0.6805633764047522))) + ((var3) * (0.3780999281816701))
return_arr(2) = ((((-0.5235401093959808) + ((var4) * (-0.0))) + ((var5) * (-1.0))) + ((var6) * (-0.0))) + ((var1) * (1.0))

Looking forward for any help with the issues above. And many thanks in advance!

@krinart
Copy link
Member

krinart commented Oct 19, 2019

For the first issue you can try to play with BinExpressionDepthTrackingMixin, it is used for python interpreter to solve a seemingly similar issue. This is the best tool we have right now to address such issues.

For the second issue, what exactly causes syntax error? Would it be solved by the following approach (I am not familiar with VBA syntax, so think about it as a pseudo code):

var0 = ...
var1 = ...
var2 = ...
return_arr(0) = var0
return_arr(1) = var1
return_arr(2) = var2

If that's the case, then here you can find an example of using a temporary variable. This example might not be sufficient as it only creates a single variable whereas you might need to create multiple. Again, this is the closest thing we have right now.

@StrikerRUS
Copy link
Member Author

@krinart

For the second issue, what exactly causes syntax error?

The problem is in return statement if we speak about abstract language. Let me demonstrate it with Python-like syntax, for example. Imagine, we must declare variables in Python:

...
declare return_array
return_array = [1, 2, 3]
return return_array

If we have no init mechanism for lists, then it should be:

...
declare return_array
return_array[0] = 1
return_array[1] = 2
return_array[2] = 3
return return_array

But now I get the following construction:

return declare return_array
return_array[0] = 1
return_array[1] = 2
return_array[2] = 3

@StrikerRUS
Copy link
Member Author

StrikerRUS commented Oct 20, 2019

I tried to play around with bin_depth_threshold and the only value which helps to not exceed allowed code line length is 2-3 on boston dataset with SVM model. However, it leads to new error: Compile error: Too many local, nonstatic variables. Am I right that at present we cannot see whether assigned to a variable expression contains input features or not to make some variables static as it's suggested?

Also, I tried to combine existing depth threshold and inserting code line breakage symbol around every binary operator. Unfortunately, it leads to Too many line continuations error. If I understood correctly, it's not allowed to use more that 24 code line breakage symbols into one expression, because I managed to use 25th and 26th symbols successfully in another variable assignment expression. I set bin_depth_threshold = 24 and code breakage symbol after each binary operator, but it created expressions with significantly more than 24 binary operators, which led to Too many line continuations error again.

I'm very frustrated. I see only one solution for now: to track code line length somehow and insert the breakage symbol around the nearest binary operator when hit a threshold.

@StrikerRUS StrikerRUS changed the title [WIP] added VBA code generator added Visual Basic code generator Nov 13, 2019
@StrikerRUS
Copy link
Member Author

I think that without any help and answers to my previous questions this PR in its' initial form is stuck for a very long time. So, I decided to change direction and make Visual Basic generator and provide a guide about how to manually convert it to VBA. Also, I'm not sure that it's possible in theory to auto-generate VBA code with all its' limitations. In addition I had no idea how to write unit tests for VBA code.

Now I guess this PR is ready for review.

@izeigerman
Copy link
Member

@StrikerRUS I'm sorry about the lack of traction on this topic. It just seems like the impact of having VBA in our catalog doesn't seem to match the complexity associated with adding it.

Thank you for adding the VB support though! Please give us some time to review it.

@StrikerRUS
Copy link
Member Author

Happy to help! 🎉

Sorry for the mess in the latest commits: I'm not familiar enough with CLI interface yet. 🙁

Copy link
Member

@izeigerman izeigerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, a few questions in comments. Thank you!

with some small manual changes, see a note below)
code representation of the given model.

.. note::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether this note should belong to the FAQ section of the readme

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe it should be in both places: here and there?.. I think some users do not tend to read any sections of READMEs except API reference, they finish reading on docstrings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just seems unlikely that anyone will look up a guide on how to generate code for VBA in API docs to VB generator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe! But I thought this note here can be useful in following scenarios:

  • help to discover that it possible to generate VBA for those who use VB;
  • some users might come here after failed search for export_to_vba as VBA is a dialect of VB;
  • in case someone in the future modify/update this code, the note helps to indicate that code should be compatible with both VB and VBA.

def __init__(self, module_name="Model", indent=4, *args, **kwargs):
self.module_name = module_name
cg = VisualBasicCodeGenerator(indent=indent)
kwargs["feature_array_name"] = "input_vector"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering is there any reason why we can't use the existing default name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see this #109 (comment).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, missed that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem! Some languages can successfully distinguish when input is a built-in function and when it isn't, e.g. Python, but VBA cannot do that 🙁 .
But I think that the best practice is to not name variables with reserved keywords (print, input, file, etc.) in all languages.

@@ -0,0 +1,26 @@
Function addVectors(ByRef v1() As Double, ByRef v2() As Double) As Double()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no knowledge about VB whatsoever so forgive me my ignorance: why do static files have extension bas but generated ones - vb?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bas is a file extension of pure BASIC language. I used it here to show that these files are both VBA and VB compatible. vb is a file extension for modern VB language, for which examples are generated.

def prepare_global(cls):
if cls._global_resource_tmp_dir is None:
with utils.tmp_dir() as tmp_dirpath:
cls._global_resource_tmp_dir = tmp_dirpath
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The directory will be cleaned up after you leave the scope of with which is not exactly what you want, do you? You probably want to clean it up once executer has done its job. Otherwise I don't see any logic associated with purging this directory anywhere else.

I also don't see much value having the prepare_global logic in the base class. Why can't this logic be a part of prepare of the VisualBasicExecutor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to clean it up once executer has done its job.

Why can't this logic be a part of prepare of the VisualBasicExecutor?

I need to create this directory only once for the whole test session. It allows to decrease testing time for ~5 minutes. prepare is called for every executor which is not the thing I need. Please share more appropriate solution to achieve this goal.

Let me show what I need by example:

diff --git a/.travis.yml b/.travis.yml
index e5e2203..51e50e3 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -30,4 +30,4 @@ script:
   # Then we install it as a library, remove source code and run e2e tests.
   - python setup.py install
   - rm -rfd m2cgen/
-  - pytest -v tests/e2e/
+  - pytest -v --capture=no tests/e2e/
diff --git a/tests/e2e/executors/visual_basic.py b/tests/e2e/executors/visual_basic.py
index dbcaacb..f055488 100644
--- a/tests/e2e/executors/visual_basic.py
+++ b/tests/e2e/executors/visual_basic.py
@@ -51,6 +51,7 @@ class VisualBasicExecutor(base.BaseExecutor):
         self.model_ast = assembler_cls(model).assemble()

     def predict(self, X):
+        print("!!!{}!!!".format(self.target_exec_dir))
         exec_args = [os.path.join(self.target_exec_dir, self.project_name)]
         exec_args.extend(map(str, X))
         return utils.predict_from_commandline(exec_args)

Please notice that the directory is the same during the whole session. That is what I need to speed up tests.

Link to huge part of a huge log, because GitHub doesn't allow to paste it here in comment: https://pastebin.com/mZ65Unti.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you perhaps use a module-scoped pytest fixture for this?

@pytest.fixture(scope="module", autouse=True)
def my_fixture():
    print ('Creating a VB executor dir')
    # Create dir
    yield vb_dir
    # Remove dir
    print ('TEAR DOWN')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was my first attempt to achieve the needed behavior. Unfortunately, I'm not familiar with pytest enough and didn't get it how to use a created dir inside executor code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should put the fixture into conftest.py or into the test_e2e.py itself. Then pass it as an argument into the test_e2e test like:

def test_e2e(estimator, executor_cls, model_trainer, is_fast, my_fixture):

my_fixture here would be a path to a directory that you yielded in the fixture's body. From there you might want to pass the value into the executor's constructor. This way you'd have to make other constructors accept **kwargs.

Were there any particular difficulties that you encountered? We can discuss them here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the guide! I did it. However, actually I need to set up VB project in that temporary dir. Is it acceptable to do executor-dependent stuff inside global fixture? If you are OK with it, I'll replace class method with such fixture.

https://github.com/StrikerRUS/m2cgen/blob/2b4454fada26d4e54a1bab6300310748ede44da1/tests/e2e/executors/visual_basic.py#L58-L73

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StrikerRUS In that fixture I only suggest to setup and cleanup the temporary directory that can be passed into the executor itself where VB-specific preparations can be made. I'd rather avoid having VB specific stuff in that fixture though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it! Thanks again for the explanation! Will make changes in a few hours.

@izeigerman izeigerman merged commit 6b277dd into BayesWitnesses:master Nov 22, 2019
@izeigerman
Copy link
Member

Thank you 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants