Exporting pipelines to PMML/PFA #152

jln-ho · 2016-05-20T09:07:24Z

I'm currently doing research in the area of model/pipeline persistence and came across PMML. It's basically an XML schema that lets you define all sorts of data mining/machine learning processes for both persistence and interoperability. It was specifically designed to decouple the tools that are used to generate pipelines from the tools that are used to apply them.

There are several libraries by openscoring.io that can be used to export pipelines from popular environments such as sklearn, Apache Spark MLlib, R or XGBoost. Their counterparts allow for evaluation (i.e. execution) of said exported pipelines in e.g. plain Java, in a Spark context, an Android context or even in a database context such as PostgreSQL. The most interesting library for TPOT in particular should be sklearn2pmml, a Python wrapper around jpmml-sklearn. The latter converts pickled pipelines to PMML and is written in Java (talk about dependency hell).

I think that PMML (or its successor in the making, PFA) would be a great format to use for persisting pipelines generated with TPOT, as most people using TPOT will want to deploy the models "found" by it to some other platform. At least there seems to be some demand for persisting models in general, according to these issues: #2, #11, #51, #65. Some of them suggest using Python's pickle format, but I think a dedicated, platform-independent solution should always be preferred. Not to mention the security issues that come with pickle, but that's a whole other story.
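To make the pickle concern concrete, here is a minimal sketch showing that loading a pickle can call a function chosen by whoever produced the bytes. The class name is made up for illustration, and a harmless built-in stands in for what could be any importable callable.

```python
import pickle


class NotReallyAModel:
    """Hypothetical stand-in for a 'model' file from an untrusted source."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair, and pickle.loads() calls it.
        # A harmless built-in is used here, but any importable function would do.
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(NotReallyAModel())
# Loading does not rebuild NotReallyAModel; it runs sorted([3, 1, 2]) instead.
result = pickle.loads(payload)
print(result)
```

A declarative format such as PMML avoids this class of problem because the file only describes the model and is not executed on load.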

Excited to hear your thoughts on this!

rhiever · 2016-05-21T02:41:54Z

Interesting idea. How popular is this format? XML seems a bit outdated. (I thought JSON replaced XML everywhere.)

jln-ho · 2016-05-21T18:58:59Z

The format was first specified in 1998, hence the use of XML. PFA, PMML's successor, will be based on JSON and will provide quite a lot more flexibility. It's almost a high-level programming language actually. However, it is still in the making (IIRC the first draft of the specification was published towards the end of last year) and therefore not really being used in production by more than a handful of people.

PMML, on the other hand, is being used quite a lot, and I don't think it's going away even if PFA picks up speed over the next couple years. Here is a list of projects/frameworks that use/support PMML (the list can't be complete, though, because there is no mention of Pattern).

rhiever · 2016-05-22T00:54:22Z

That makes sense about XML then. :-)

This is something that we'll have to explore more before committing to. I'm primarily interested in finding out:

1. Can it support the arbitrary pipeline structures that TPOT may create?
2. Does it support all operators in sklearn?
3. Is there any nice visualization software that can read and visualize ML pipelines in PMML? (Related to Export TPOT pipelines to Orange file format #51)

jln-ho · 2016-05-24T11:53:23Z

I guess 1) and 2) can be answered with a straight no for PMML, although I think it would still be beneficial to support it with only a subset of TPOT pipelines being actually exportable.
PFA should meet the requirements, though, at least on paper. I can't say how easy or hard it will be to come up with a generic way of exporting an intricate TPOT pipeline to PFA without manual work. As of now the only open source PFA tool I could find, namely Hadrian/Titus, doesn't really focus on converting models from existing frameworks, but rather on standalone model construction (example).

As for 3), I haven't been able to find any visualization tools so far. There are a few mentions of a tool that was developed during a research project, but it seems like it was never made available to the public...

Seems to me that we may have to wait a while for better PFA tooling to emerge in order to meet all the requirements.

rhiever · 2016-05-24T11:59:42Z

Agreed. In the meantime, we're still interested in #51 and being able to export TPOT pipelines to Orange. Seems like that would be quite useful.

vruusmann · 2016-05-24T12:19:12Z

As a contributor to the JPMML family of projects, here's my perspective.

Fundamentally, PMML is about capturing the final state of a model development workflow (not the workflow itself). In other words, the final state is the winning solution, which is appropriate for scoring data in a production environment (where the goal is not to replay the model development procedure, but to apply the model to new data records).

I think PFA is also more about capturing the state rather than the process. So, you should be focusing more on domain-specific languages that address workflow markup.

JPMML-SkLearn supports the conversion of over 50 Scikit-Learn transformers and estimators. The only class that couldn't be represented in "native" PMML (but could be put into action using a Java user-defined function) has been sklearn.decomposition.NMF. So, if TPOT is relying on regular Scikit-Learn classes, then it's probably fairly straightforward to implement a converter for it.

Anyway, if you have any pointers for getting started with TPOT (something to do with the Iris dataset?), then I wouldn't mind giving a shot at implementing some PMML interoperability.

rhiever · 2016-05-24T12:34:29Z

Thank you @vruusmann! TPOT is built almost entirely on top of sklearn classes, with the exception of XGBoost and a couple custom feature preprocessors. However, we are dropping XGBoost due to installation issues for many users, so that will not be an issue in the next version.

Here are some resources that will help you get started with TPOT:

jln-ho · 2016-05-24T13:37:23Z

giving a shot at implementing some PMML interoperability.

@vruusmann that would be great! Thanks in advance for the effort.

vruusmann · 2016-05-24T18:56:13Z

You could check out WhizzML, which is oriented towards representing ML workflows. Of course, WhizzML is a very new thing, and by supporting it you would become interoperable only with BigML's platform.

But it should be interesting reference material nonetheless.

jln-ho · 2016-09-12T20:50:47Z

@rhiever are you sure that commit fixes this issue or was that a typo?

rhiever · 2016-09-13T05:36:02Z

That did appear to be a typo.

esanchezSavvyds · 2017-11-02T14:43:40Z

Hi @rhiever @vruusmann @jln-ho, I'm trying to export my TPOT model to a PMML XML file using jpmml-sklearn, but I'm going crazy trying to do it. Is there any way to do it? If not, how can I export my TPOT model to use it later in Java? Thank you all, I'm looking forward to your response.

rhiever · 2017-11-02T14:46:18Z

If you use TPOT's export function, you can export the code for a scikit-learn pipeline. From there, whatever process you use to convert scikit-learn pipelines to PMML XML should work fine (as long as it supports the scikit-learn Pipeline object).

esanchezSavvyds · 2017-11-02T15:27:28Z

@rhiever Thank you very much for your time. I'm new to this world. How can I do that? With the export function I create a .py file. In that file there is a read_csv function and I don't understand what it does. How can I get the scikit-learn pipeline from that file?

weixuanfu · 2017-11-02T15:35:03Z

@esanchezSavvyds The read_csv function reads the input dataset. You need to change the file path string 'PATH/TO/DATA/FILE' in that call to the path of your dataset, and also change COLUMN_SEPARATOR to match the dataset.

You may find a line in the .py file that starts with exported_pipeline, which is the scikit-learn pipeline. For example:

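from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

exported_pipeline = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=65),
    DecisionTreeClassifier(criterion="gini", max_depth=7, min_samples_leaf=4, min_samples_split=18)
)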

esanchezSavvyds · 2017-11-02T15:44:26Z

Thank you for your fast answer @weixuanfu. My aim is to automate model generation with TPOT and export the result to PMML automatically. For that reason, I need to access the scikit-learn pipeline at execution time in my code. Is that possible, or is the only option to access it manually through that file?

weixuanfu · 2017-11-02T15:58:05Z

@esanchezSavvyds Please check the TPOT API. I think the fitted_pipeline_ attribute (e.g. tpot_object.fitted_pipeline_) is what you need.
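A minimal sketch of how the fitted pipeline could be handed to sklearn2pmml at run time. It assumes a sklearn2pmml release that provides make_pmml_pipeline, plus the Java runtime that the converter needs. The helper name export_fitted_pipeline and the output path are made up for illustration.

```python
def export_fitted_pipeline(fitted_pipeline, pmml_path, active_fields=None, target_fields=None):
    """Convert an already fitted scikit-learn pipeline to a PMML file."""
    # Imported lazily so this helper can be defined even where sklearn2pmml is missing.
    from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

    # Wrap the plain scikit-learn pipeline into a PMMLPipeline.
    pmml_pipeline = make_pmml_pipeline(
        fitted_pipeline,
        active_fields=active_fields,
        target_fields=target_fields,
    )
    # Runs the Java converter and writes the PMML document to pmml_path.
    sklearn2pmml(pmml_pipeline, pmml_path)
    return pmml_path
```

After fitting, the call would look like export_fitted_pipeline(tpot_object.fitted_pipeline_, "pipeline.pmml"). Whether the conversion succeeds still depends on every step of the pipeline being supported by the converter.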

esanchezSavvyds · 2017-11-03T08:32:12Z

@weixuanfu I apologize for not seeing that earlier. That's what I need. Thank you very much for your work.

weixuanfu · 2017-11-03T10:49:17Z

@esanchezSavvyds Cool, good to know it solved the issue. No need to apologize.

esanchezSavvyds · 2017-11-03T12:44:36Z

@weixuanfu The problems have returned. Exporting the pipeline to PMML fails when TPOT generates a model that uses StackingEstimator; the converter says it's not a supported transformation. What can I do? Is it possible to avoid using it? Thank you

rhiever · 2017-11-03T12:49:13Z

It's probable that the StackingEstimator is not supported in PMML, as it's a custom class outside of scikit-learn that's implemented by @rasbt. PMML would need to be extended to support estimator stacking.

You can look at alternatives to the best-scoring pipeline in the pareto_front_fitted_pipelines_ attribute of TPOT, accessed with tpot_object.pareto_front_fitted_pipelines_. That attribute should contain multiple possible solution pipelines for your problem, ranging from more complex and high-scoring to less complex and slightly-lower-scoring. Perhaps one of the less complex pipelines won't have the StackingEstimator.

rasbt · 2017-11-03T17:37:28Z

I am happy about PRs to the StackingClassifier & StackingCVClassifier fixing this; however, it sounds like it's rather due to PMML? In that case, I am curious, does the VotingClassifier from scikit-learn work? I implemented the VotingClassifier quite similarly (however, it does not have the level-2/meta estimator).

weixuanfu · 2017-11-03T17:52:19Z

Old versions of TPOT (< v0.7.3) used VotingClassifier, but that caused issue #457 for stacking regressors in TPOTRegressor, so we added StackingEstimator to solve it.

@esanchezSavvyds could you please try to export the pipeline below to PMML?

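from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

exported_pipeline = make_pipeline(
    make_union(
        make_union(VotingClassifier([('branch',
            DecisionTreeClassifier(criterion="gini", max_depth=8, min_samples_leaf=5, min_samples_split=5)
        )]), FunctionTransformer(lambda X: X)),
        SelectKBest(score_func=f_classif, k=20)
    ),
    KNeighborsClassifier(n_neighbors=10, p=1, weights="uniform")
)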

esanchezSavvyds · 2017-11-06T12:38:32Z

@weixuanfu @rasbt I have tried to export that pipeline to PMML. I removed the FunctionTransformer because I can't pickle it. The error I get is that FeatureUnion is not supported. The documentation of the Java API I'm using to export to PMML says VotingClassifier is supported. Here it is: https://github.com/jpmml/jpmml-sklearn. Thank you all

rhiever · 2017-11-06T13:48:00Z

That's too bad. Perhaps we can add an option (or series of flags) to disable features such as stacking and pipeline splitting (i.e., CombineDFs). Disabling those features should then make the TPOT pipelines PMML compliant.

vruusmann · 2017-11-06T13:59:45Z

First, the FeatureUnion meta-transformation has been supported for over a year or so. I'm afraid that @esanchezSavvyds is simply using an outdated sklearn2pmml package version.

Second, I intend to catch up with TPOT-specific estimator types sometime in December. You can track my progress by subscribing to this issue: jpmml/jpmml-sklearn#54

As a workaround, it might be worthwhile to define a PMML-specific configuration dictionary (argument config_dict = "TPOT pmml" to TPOT estimator types), which restricts the use of estimator and transformer types that are currently not convertible to PMML. However, this configuration dictionary should be maintained by the SkLearn2PMML/JPMML-SkLearn projects, because it should not be TPOT's concern long-term?

rhiever · 2017-11-06T14:12:29Z

Yes, that was my thinking as well, @vruusmann. The issue is that stacking and pipeline splitting are not currently configurable in TPOT configuration dictionaries; they are always on by default. Hence my suggestion to add options to turn them off.

I agree that it would be wise for SkLearn2PMML/JPMML-SkLearn to maintain a TPOT configuration dictionary that is 100% PMML compliant. We could document the location of that configuration dictionary in the TPOT docs and point to the corresponding SkLearn2PMML/JPMML-SkLearn docs page.

esanchezSavvyds · 2017-11-06T14:46:25Z

@vruusmann I am using the latest version of the Java command-line application and the following versions:

python: 3.6.2
sklearn: 0.18.2
sklearn.externals.joblib: 0.10.3
pandas: 0.21.0
sklearn_pandas: 1.6.0
sklearn2pmml: 0.26.0

And we are told that FeatureUnion is not supported, with this error:

Failed to convert
java.lang.IllegalArgumentException: The estimator object (Python class sklearn.pipeline.FeatureUnion) is not an Estimator or is not a supported Estimator subclass
at sklearn.EstimatorUtil$1.apply(EstimatorUtil.java:90)
at sklearn.EstimatorUtil$1.apply(EstimatorUtil.java:78)
at sklearn.EstimatorUtil.asEstimator(EstimatorUtil.java:42)
at sklearn2pmml.PMMLPipeline.getEstimator(PMMLPipeline.java:216)
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:73)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)
Caused by: java.lang.ClassCastException: sklearn.pipeline.FeatureUnion cannot be cast to sklearn.Estimator
at sklearn.EstimatorUtil$1.apply(EstimatorUtil.java:88)

vruusmann · 2017-11-06T14:52:14Z

@rhiever @weixuanfu Here's a possible workflow for automatically generating PMML-compatible TPOT configuration dictionaries: jpmml/jpmml-sklearn#55

vruusmann · 2017-11-06T14:54:12Z

@esanchezSavvyds You're using the FeatureUnion transformation type in a context which requires an estimator type. Specifically, feature union cannot be the last step of a pipeline.
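A quick pre-flight check along those lines could look like the sketch below. Testing for a predict method is only a heuristic chosen for illustration, not how JPMML-SkLearn itself decides, and the helper name is made up.

```python
def ends_with_estimator(pipeline):
    """Return True if the last pipeline step can predict, i.e. looks like an estimator."""
    # Pipeline.steps is a list of (name, step) pairs; only the final step matters here.
    _, last_step = pipeline.steps[-1]
    return hasattr(last_step, "predict")
```

A pipeline whose final step is a FeatureUnion or any other pure transformer would fail this check before the Java converter is ever invoked.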

weixuanfu · 2017-11-06T15:01:59Z

@esanchezSavvyds For the pickling issue, you may try using 'copy.copy' instead of the lambda function used in old versions of TPOT.

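from copy import copy

from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

exported_pipeline = make_pipeline(
    make_union(
        make_union(VotingClassifier([('branch',
            DecisionTreeClassifier(criterion="gini", max_depth=8, min_samples_leaf=5, min_samples_split=5)
        )]), FunctionTransformer(copy)),
        SelectKBest(score_func=f_classif, k=20)
    ),
    KNeighborsClassifier(n_neighbors=10, p=1, weights="uniform")
)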
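The reason the swap helps can be checked with the standard library alone: pickle stores functions by their importable name, which copy.copy has and a lambda does not. A minimal sketch (the helper name is made up for illustration):

```python
import pickle
from copy import copy


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


print(is_picklable(copy))         # module-level function, stored by reference
print(is_picklable(lambda X: X))  # anonymous function, has no importable name
```

The same reasoning applies to any pipeline step that holds a reference to a function, such as FunctionTransformer.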

esanchezSavvyds · 2017-11-06T15:20:37Z

@vruusmann I think the problem is not creating a compatible configuration dictionary, which I agree has to be done. The problem is that, as far as I can understand, StackingEstimator cannot be disabled in TPOT and it's not supported by sklearn2pmml.

weixuanfu · 2017-11-06T17:11:51Z

@esanchezSavvyds One of my TPOT dev branches, noCDF_noStacking, has an option named simple_pipeline that disables both StackingEstimator and CombineDFs when simple_pipeline=True (e.g. TPOTClassifier(simple_pipeline=True)). Note that this dev branch is not fully tested yet. If you want to try TPOT without StackingEstimator and FeatureUnion, you can install this branch in your test environment with the command below:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/weixuanfu/tpot.git@noCDF_noStacking

esanchezSavvyds · 2017-11-07T13:59:31Z

Hi @weixuanfu, first of all, thank you very much for your effort. We are testing this feature and we have found that the ZeroCount transformation is not supported either. This is the error:

java.lang.IllegalArgumentException: The transformer object (Python class tpot.builtins.zero_count.ZeroCount) is not a Transformer or is not a supported Transformer subclass

vruusmann · 2017-11-07T14:06:44Z

@esanchezSavvyds You should list all "problematic" TPOT estimator and transformer types here: jpmml/jpmml-sklearn#54

weixuanfu · 2017-11-07T14:07:05Z

@esanchezSavvyds you may want to use a configuration dictionary that excludes ZeroCount and XGBClassifier from the default dictionary and pass it to the config_dict parameter in TPOT, in order to avoid the operators that PMML does not support so far. Please check the example in this link
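A minimal sketch of that filtering step, assuming the configuration is a plain dictionary keyed by the operator's import path. The example dictionary is a tiny made-up excerpt, not TPOT's real default configuration, and the helper name is hypothetical.

```python
def without_operators(config, excluded):
    """Return a copy of a TPOT-style config dict minus the excluded operators."""
    excluded = set(excluded)
    return {name: params for name, params in config.items() if name not in excluded}


# Tiny made-up excerpt; a real TPOT configuration has many more entries.
example_config = {
    "sklearn.tree.DecisionTreeClassifier": {"max_depth": range(1, 11)},
    "sklearn.feature_selection.SelectPercentile": {"percentile": range(1, 100)},
    "tpot.builtins.ZeroCount": {},
    "xgboost.XGBClassifier": {"n_estimators": [100]},
}

pmml_friendly = without_operators(
    example_config, ["tpot.builtins.ZeroCount", "xgboost.XGBClassifier"]
)
print(sorted(pmml_friendly))
```

The filtered dictionary would then be passed as TPOTClassifier(config_dict=pmml_friendly). The original dictionary is left untouched, so the same base configuration can be reused with different exclusion lists.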

esanchezSavvyds · 2017-11-07T15:19:09Z

@weixuanfu we have been testing your feature and we think it works correctly; at least it works correctly for us. We have been able to export every pipeline without any problem (using an appropriate configuration dictionary, as you suggested).

weixuanfu · 2017-11-07T16:48:03Z

@esanchezSavvyds OK, sounds good!

esanchezSavvyds · 2017-12-05T11:11:07Z

Hello @weixuanfu ,
We have been using your "noCDF_noStacking" feature during this time and we haven't had any problems. We would like to know if you plan to finish this branch and merge it into the master branch.

Thank you very much.

weixuanfu · 2017-12-05T14:14:19Z

@esanchezSavvyds It is good to know that it works for you. That branch is one of my test branches for comparing the performance of simple linear pipelines in TPOT against the tree-based ones. We need more tests and discussion before deciding to merge this branch into the master branch.

esanchezSavvyds · 2018-02-01T09:32:02Z

Hello @weixuanfu ,
Sorry for asking again, but do you know roughly when your "noCDF_noStacking" branch will be merged into the master branch? It fits our work perfectly, but it is now many commits behind the master branch.
Thank you very much :) .

vruusmann · 2019-06-10T11:31:38Z

I've refactored TPOT support in SkLearn2PMML package version 0.46.0 (available in PyPI), and explained some technical details/gotchas in the following technical article:
https://openscoring.io/blog/2019/06/10/converting_sklearn_tpot_pipeline_pmml/

TLDR: TPOT fitted pipelines convert very nicely into PMML data format.

MONTYYUAN · 2020-08-31T11:08:05Z

Hi @weixuanfu @rhiever @rasbt @vruusmann,
When the exported pipeline contains "from sklearn.preprocessing import Normalizer", the TPOT fitted pipeline cannot be converted into PMML format, because the sklearn2pmml package (SkLearn2PMML/JPMML-SkLearn) does not support it.
How can I solve this? Is it possible to remove or restrict operators such as sklearn.preprocessing.Normalizer?

weixuanfu · 2020-08-31T12:28:13Z

@MONTYYUAN Please check TPOT's feature for customizing its operators.

rhiever added the question label May 21, 2016

rhiever closed this as completed in 29e3788 Sep 2, 2016

rhiever reopened this Sep 13, 2016

AIAdventures mentioned this issue Jun 6, 2017

Titanic example -problem with 2nd last cell. #492

Closed

ktran9891 mentioned this issue Jun 30, 2017

Pickling TPOT Objects #520

Closed

vruusmann mentioned this issue Nov 5, 2017

Add support for TPOT estimator types jpmml/jpmml-sklearn#54

Closed

rasbt mentioned this issue Nov 8, 2017

doubt: mlxtend on top of tpot rasbt/mlxtend#282

Closed

weixuanfu mentioned this issue Dec 7, 2017

CombineDFs(input_matrix, input_matrix) operator #636

Closed

weixuanfu mentioned this issue May 7, 2018

Questions about the StackingEstimator #690

Open

weixuanfu mentioned this issue Dec 20, 2018

Visualize constructed features and get best pipeline found. #459

Closed
