Exporting pipelines to PMML/PFA #152
Comments
Interesting idea. How popular is this format? XML seems a bit outdated. (I thought JSON replaced XML everywhere.) |
The format was first specified in 1998, hence the use of XML. PFA, PMML's successor, will be based on JSON and will provide quite a lot more flexibility. It's almost a high-level programming language actually. However, it is still in the making (IIRC the first draft of the specification was published towards the end of last year) and therefore not really being used in production by more than a handful of people. PMML, on the other hand, is being used quite a lot, and I don't think it's going away even if PFA picks up speed over the next couple years. Here is a list of projects/frameworks that use/support PMML (the list can't be complete, though, because there is no mention of Pattern). |
That makes sense about XML then. :-) This is something that we'll have to explore more before committing to. I'm primarily interested in finding out:
|
I guess 1) and 2) can be answered with a straight no for PMML, although I think it would still be beneficial to support it with only a subset of TPOT pipelines being actually exportable. As for 3), I haven't been able to find any visualization tools so far. There are a few mentions of a tool that was developed during a research project, but it seems like it was never made available to the public... Seems to me that we may have to wait a while for better PFA tooling to emerge in order to meet all the requirements. |
Agreed. In the meantime, we're still interested in #51 and being able to export TPOT pipelines to Orange. Seems like that would be quite useful. |
As a contributor to the JPMML family of projects, here's my perspective. Fundamentally, PMML is about capturing the final state of a model development workflow (not the workflow itself). In other words, the final state is the winning solution, which is appropriate for scoring data in a production environment (where the goal is not to replay the model development procedure, but to apply the model to new data records). I think PFA is also more about capturing the state rather than the process. So, you should be focusing more on domain-specific languages that address workflow markup. JPMML-SkLearn supports the conversion of over 50 Scikit-Learn transformers and estimators. The only class that couldn't be represented in "native" PMML (but could be put in action using a Java user-defined function) has been Anyway, if you have any pointers for getting started with TPOT (something to do with the Iris dataset?), then I wouldn't mind giving a shot at implementing some PMML interoperability. |
Thank you @vruusmann! TPOT is built almost entirely on top of sklearn classes, with the exception of XGBoost and a couple custom feature preprocessors. However, we are dropping XGBoost due to installation issues for many users, so that will not be an issue in the next version. Here are some resources that will help you get started with TPOT: |
@vruusmann that would be great! Thanks in advance for the effort. |
You could check out WhizzML, which is oriented towards representing ML workflows. Of course, WhizzML is a very new thing, and by supporting it you would become interoperable only with BigML's platform. But it should be interesting reference material nonetheless. |
@rhiever are you sure that commit fixes this issue or was that a typo? |
That did appear to be a typo. |
Hi @rhiever @vruusmann @jln-ho , I'm trying to export my TPOT model to a PMML XML file using jpmml-sklearn, but I'm going crazy trying to do it. Is there any way to do it? If not, how can I export my TPOT model to use it later in Java? Thank you all, I'm looking forward to your response. |
If you use TPOT's |
@rhiever Thank you very much for your time. I'm new to this world. How can I do that? With the export function I create a .py file. That file contains a read_csv call, and I don't understand what it does. How can I get the scikit-learn pipeline from that file? |
@esanchezSavvyds You may find a line in the .py file that starts with
|
Thank you for your fast answer @weixuanfu . My aim is to automate model generation with TPOT and export the result to PMML automatically. For that reason, I need access to the scikit-learn pipeline at execution time in my code. Is that possible, or is the only option to access it manually through that file? |
@esanchezSavvyds Please check the TPOT API. I think the |
@weixuanfu I apologize for not seeing that earlier. That's what I need. Thank you very much for your work. |
@esanchezSavvyds Cool, good to know it solved the issue. No need to apologize. |
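For readers who want to automate the same step, here is a minimal sketch of pulling the pipeline out of a fitted TPOT object and handing it to sklearn2pmml. It assumes the attribute discussed above is `fitted_pipeline_` (documented in the TPOT API) and that the installed sklearn2pmml version provides `make_pmml_pipeline`; the function name `export_best_pipeline` is hypothetical, and the conversion still needs a Java runtime and a pipeline made only of supported steps.

```python
def export_best_pipeline(fitted_tpot, pmml_path):
    """Convert the best pipeline found by a fitted TPOT estimator to PMML.

    `fitted_tpot` is assumed to be a TPOTClassifier/TPOTRegressor on which
    fit() has already been called, so that `fitted_pipeline_` exists.
    """
    # Imported lazily so the module loads even where sklearn2pmml is absent.
    from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

    # fitted_pipeline_ is a plain scikit-learn Pipeline; wrap it for conversion.
    pmml_pipeline = make_pmml_pipeline(fitted_tpot.fitted_pipeline_)

    # Delegates to the Java converter; fails if a step is unsupported.
    sklearn2pmml(pmml_pipeline, pmml_path)
    return pmml_path
```

Typical use would be `export_best_pipeline(tpot, "best.pmml")` right after `tpot.fit(X_train, y_train)`, with no need to go through the exported .py file.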
@weixuanfu The problems have returned. Exporting the pipeline to PMML fails when TPOT generates a model that uses StackingEstimator, with a message saying it's not a supported transformation. What can I do? Is it possible to not use it? Thank you |
It's probable that the StackingEstimator is not supported in PMML, as it's a custom class outside of scikit-learn that's implemented by @rasbt. PMML would need to be extended to support estimator stacking. You can look at alternatives to the best-scoring pipeline in the |
I am happy about PRs to the StackingClassifier & StackingCVClassifier fixing this; however, it sounds like it's rather due to PMML? In that case, I am curious, does the |
Old version (< v0.7.3) of TPOT used @esanchezSavvyds could you please try to export the pipeline below to PMML?
|
@weixuanfu @rasbt I have tried to export that pipeline to PMML. I removed the FunctionTransformer because I can't pickle it. The issue I'm having now is that FeatureUnion is not supported. The documentation of the Java API I'm using to export to PMML says VotingClassifier is supported. Here it is: https://github.com/jpmml/jpmml-sklearn. Thank you all |
That's too bad. Perhaps we can add an option (or series of flags) to disable features such as stacking and pipeline splitting (i.e., CombineDFs). Disabling those features should then make the TPOT pipelines PMML compliant. |
First, the Second, I intend to catch up with TPOT-specific estimator types sometime in December. You can track my progress by subscribing to this issue: jpmml/jpmml-sklearn#54 As a workaround, it might be worthwhile to define a PMML-specific configuration dictionary (argument |
Yes, that was my thinking as well, @vruusmann. The issue is that stacking and pipeline splitting are not currently configurable in TPOT configuration dictionaries; they are always on by default. Hence my suggestion to add options to turn them off. I agree that it would be wise for |
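To make the configuration-dictionary idea concrete, here is a small sketch of a restricted search space. The operator selection and parameter grids are illustrative assumptions, not a vetted list of what JPMML-SkLearn supports, and the names `PMML_FRIENDLY_CONFIG` and `build_tpot` are hypothetical. As noted above, a dictionary like this only limits which operators TPOT may pick; it does not by itself turn off stacking or pipeline splitting.

```python
# Hypothetical search space limited to plain scikit-learn classes.
# Keys are import paths; values map parameter names to candidate values.
PMML_FRIENDLY_CONFIG = {
    "sklearn.preprocessing.StandardScaler": {},
    "sklearn.linear_model.LogisticRegression": {
        "penalty": ["l2"],
        "C": [0.01, 0.1, 1.0, 10.0],
    },
    "sklearn.tree.DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
        "max_depth": range(1, 11),
    },
}


def build_tpot(config=PMML_FRIENDLY_CONFIG):
    """Create a TPOTClassifier restricted to the operators in `config`."""
    # Imported lazily so the dictionary can be inspected without TPOT installed.
    from tpot import TPOTClassifier

    return TPOTClassifier(
        generations=5,
        population_size=20,
        config_dict=config,
        random_state=42,
    )
```

Keeping the dictionary as a module-level constant makes it easy to extend as more operators become convertible.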
@vruusmann I am using the latest version of the Java command-line application and these versions:
And we are getting that FeatureUnion is not supported with this error: Failed to convert |
@rhiever @weixuanfu Here's a possible workflow for automatically generating PMML-compatible TPOT configuration dictionaries: jpmml/jpmml-sklearn#55 |
@esanchezSavvyds You're using the |
@esanchezSavvyds For the pickling issue, you may try to use 'copy.copy' instead of the lambda function used in old versions of TPOT.
|
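The reason the swap helps can be shown with the standard library alone: pickle stores functions by qualified name, and a lambda has no importable name, whereas `copy.copy` does. The helper `can_pickle` below is a hypothetical name introduced only for this illustration.

```python
import copy
import pickle


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A lambda is looked up under the name "<lambda>", which cannot be imported.
identity_lambda = lambda x: x

# copy.copy is a module-level function, so pickle stores a reference to it.
print("lambda picklable:   ", can_pickle(identity_lambda))
print("copy.copy picklable:", can_pickle(copy.copy))
```

The same reasoning applies when the function is wrapped in a FunctionTransformer, since pickling the transformer pickles the function it holds.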
@vruusmann I think the problem is not creating a compatible config dictionary, which I agree has to be done. The problem is that, as far as I can tell, StackingEstimator cannot be disabled in TPOT and is not supported by sklearn2pmml. |
@esanchezSavvyds One of my dev branches of TPOT, called noCDF_noStacking, has an option named
|
Hi @weixuanfu , first of all, thank you very much for your effort. We are testing this feature and we have found that the ZeroCount transformation is not supported either. This is the error: java.lang.IllegalArgumentException: The transformer object (Python class tpot.builtins.zero_count.ZeroCount) is not a Transformer or is not a supported Transformer subclass |
@esanchezSavvyds You should list all "problematic" TPOT estimator and transformer types here: jpmml/jpmml-sklearn#54 |
@esanchezSavvyds you may want to use a |
@weixuanfu we have been testing your feature and it works correctly for us. We have been able to export every pipeline without any problem (using an appropriate configuration dictionary, as you said). |
@esanchezSavvyds OK, sounds good! |
Hello @weixuanfu , Thank you very much. |
@esanchezSavvyds It is good to know that it works for you. That branch is one of my test branches for comparing the performance of simple linear pipelines in TPOT against the tree-based ones. We need more tests and discussion before deciding whether to merge this branch into the master branch. |
Hello @weixuanfu , |
I've refactored TPOT support in SkLearn2PMML package version 0.46.0 (available in PyPI), and explained some technical details/gotchas in the following technical article: TLDR: TPOT fitted pipelines convert very nicely into PMML data format. |
Hi @weixuanfu @rhiever @rasbt @vruusmann, |
@MONTYYUAN Please check the function about customizing TPOT's operators. |
I'm currently doing research in the area of model/pipeline persistence and came across PMML. It's basically an XML schema that lets you define all sorts of data mining/machine learning processes for both persistence and interoperability. It was specifically designed to decouple the tools that are used to generate pipelines from the tools that are used to apply them.
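To give a feel for the format, here is a hand-written, deliberately tiny PMML document for a one-variable linear regression, together with a toy scorer built on the standard library. The field names, coefficients, and the `score` helper are made up for illustration; this is not a conforming PMML engine, only a sketch of how the model lives entirely in declarative XML that any tool can read.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"

# Illustrative document: y = 1.5 + 2.0 * x
PMML_DOC = f"""
<PMML xmlns="{PMML_NS}" version="4.4">
  <Header description="Hand-written illustration"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
""".strip()


def score(pmml_text, inputs):
    """Evaluate the first RegressionTable of a PMML regression model."""
    ns = {"p": PMML_NS}
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", ns)
    result = float(table.get("intercept"))
    # Add coefficient * value for every numeric predictor in the table.
    for predictor in table.findall("p:NumericPredictor", ns):
        result += float(predictor.get("coefficient")) * inputs[predictor.get("name")]
    return result


print(score(PMML_DOC, {"x": 3.0}))
```

The producer of the file and the consumer share nothing except the schema, which is the decoupling described above.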
There are several libraries by openscoring.io that can be used to export pipelines from popular environments such as sklearn, Apache Spark MLlib, R or XGBoost. Their counterparts allow for evaluation (i.e. execution) of said exported pipelines in e.g. plain Java, in a Spark context, an Android context or even in a database context such as PostgreSQL. The most interesting library for TPOT in particular should be sklearn2pmml, a Python wrapper around jpmml-sklearn, the Java library that converts pickled pipelines to PMML (talk about dependency hell).
I think that PMML (or its successor in the making, PFA) would be a great format for persisting pipelines generated with TPOT, as most people using TPOT will want to deploy the models "found" by it to some other platform. At least there seems to be some demand for persisting models in general, according to these issues: #2, #11, #51, #65. Some of them suggest using Python's pickle format, but I think a dedicated, platform-independent solution should always be preferred. Not to mention the security issues that come with pickle; that's a whole other story.
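As a brief illustration of the pickle concern, the sketch below shows that loading a pickle can call an arbitrary function chosen by whoever wrote the file. The `Payload` class is hypothetical and uses a harmless addition; a malicious file could name any importable callable instead.

```python
import operator
import pickle


class Payload:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call operator.add(40, 2)".
        return (operator.add, (40, 2))


blob = pickle.dumps(Payload())

# Unpickling runs the call recorded in the byte stream.
restored = pickle.loads(blob)
print(type(restored).__name__, restored)
```

What comes back is the result of the recorded call rather than a `Payload` instance, which is why pickled models should only be loaded from trusted sources.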
Excited to hear your thoughts on this!