New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Operator.inheritors() returns a list rather than a set. #350
Conversation
…ed and this made that even with the same seed, different operators were chosen.
Will look into the failed build soon. |
… did not seem to always keep order anyway on a different setup.
Great find. We have some upcoming changes on the dev branch that will fix this issue, but I'd like to get this out as a hotfix on the current |
I have actually found that after running a few hundred more experiments, it is still not actually always consistent, though it is generally much more consistent than it was before. I am also not sure if it is now always inconsistent, or only under certain environments. That considered, it still seems like an improvement regardless (I'd classify having unordered sets when the randomness should be determined by the choice of the index as a bug), so I will rebase the PR. |
It seems that it still wants to also commit the doc-changes. I am not sure how to fix this properly (still learning git). I can also commit the changes on a new branch which is based off of master, if this is a problem. |
Further randomness observed is probably explained by issue #353. |
The reproducibly-related issue maybe fixed with this PR #351. |
Any idea on when it will be merged with the dev branch? I am looking to extend TPOT, and while I can use my own version now (with a few alternations to make results reproducible), it's nicer to go with the solution that would be implemented later. Will save a headache in the end, hopefully. |
@PG-TUe Thank @rhiever for merging it with dev branch last night. Please let us know if the reproducibility issue still exist in this branch. |
Great! I'll give it a go soon and post back with results. |
Okay, so. As far as I can tell, results are consistent now! (Well, they were with my other fixes too but this is much better) |
Thank you for the feedback! I think we may need add a exception for |
@weixuanfu2016, I had a question for you; since you reworked some of how the operators work. I was unsure of how to contact you directly, so I am asking it here. I hope that's ok. Currently, I set out to create such a method manually, but I figured I might be missing some useful tricks already in TPOT that I just overlooked. I find myself using string parsing a lot. The goal currently is provided the desired pipeline (right now provided as a row from a pandas dataframe, specifying in string the classifier and parameters), to create the respective individual.
Do you think this is generally the correct approach, or can you think of a better way? I am sorry if this is a lot to ask, then please feel free to just say so, and I'll keep hacking stuff together on my own :) |
I am happy to answer all these good ideas in your comments. Do you think this is generally the correct approach, or can you think of a better way? I quickly checked the function you posed herein and thought it is a right way for building deap individual from string. Maybe you could make a PR for this. I feel it is very useful for unit tests or many future functions in TPOT Is any terminal declared as default? Eg. if I do not have information about the LinearSVC__tol setting, can I revert to a default terminal? Right now, we don't have default value for most terminals. But it is easy to set a customized operator dictionary. For example, The And if I wish to use a value that is not declared in any terminal, should I just create one in a similar fashion to it being done in You can also use a customized operator dictionary to add value that is not listed in default operator dictionary. These terminals are flexible for taking mixed type. For example: |
Thank you for taking the time to answer those questions, it's very helpful! I will be working on it again tomorrow. I'll post back if I have any other remarks/questions :) |
Okay, so I have been experimenting for a bit, and I seem to be able to add my own terminals for missing values in the function itself (as an alternative to first declaring all needed terminals). Declaring which terminals are valid beforehand in a dictionary is good, but when you want more sophisticated search techniques, such as Bayesian Optimization, you want to be able to define parameter values on the fly. This left me still with the lack of possible default values for terminals. However, I noticed that if I do not provide a terminal for the operator, then it doesn't tend to mind, as long as I give a valid tree to However, the expr_to_tree function currently doesn't really allow for missing terminals. I added the following lines of code to the bottom of this function, which will allow the topmost primitive to not be provided with its arity amount of terminals:
This means that inner operators will not be able to have less terminals than their arity, but the main one does. For example, it will now make a proper tree out of the individual |
Thank you for the feedback. The BTW, could you please make a PR for the nice |
Thank you. Yes, that makes sense, leaving it out is not the way to go, we have to have explicitly state missing (default) values. One remark; the method currently does not take a stringified individual, but more something like "LinearSVC(input_matrix, C=0.9, penalty=l1, dual=True)" (a stringified individual looks differently). I will do a PR of this soon. I actually might be able to work out how to properly stringify an individual as well, so I want to give that a go first (that would be much nicer to work with). |
I looked more into this, and from what I can tell, we can just use the from_string method, given that we take care that our terminals have unique names. This can be done rather easily (I think), rather than having the terminals be named by a stringified version of their value, as is default, we can give them an explicit unique name by combining classifier, parameter name and value in the string (code is modification of this):
From what I tested so far, this makes sure that an individual can be reconstructed from its string. |
@PG-TUe I got a lot of stderr |
Can you provide more details? An example of how I can reproduce the issue? I'm fairly sure I worked a little bit on this afterwards, but my memory is a bit fuzzy about what I changed exactly. Maybe I can see if I had added changes which fixed this (though it would have been accidental, as I do not believe I saw such errors come by). |
I only make 2 changes based on lastest dev branch in TPOT.
Or you can merge the PR #367 for fixing another bug in You could use the codes below to reproduce the issue:
|
Thank you, I will check this tomorrow. |
Alright, I looked into this, and a couple of observations:
Lastly, for the matter at hand of making terminal names unique, there was a problem with the changes I suggested above. Giving names to terminals means they will only be used as symbolic links, in order to get the actual value, such as needed in After all above changes, from what I can tell, the only exceptions thrown in pre-test are now actually because certain parameter settings are illegal - not because of coding errors. I also worked on accepting missing values. The only changes needed are in this commit. I am thinking of adding an option so that 'MISSING' will not be used during the generation of individuals (as this means the default values of scikit-learn will be used more often in the generation process if it is also represented by its explicit value). |
Thanks you for looking into it. I will check your demo. |
Could you please provide an example about using |
No need. I think I figure it out. Need more tests Here is a demo for
Thank you again! |
No problem! And those are indeed the type of strings I meant. Closing this issue now as the discussion regarding stringifying individuals should now be more or less settled for now (at least there is a working version into the dev branch) and the original issue has also been solved through earlier reworks. |
What does this PR do?
In the current version of TPOT, different pipelines are generated, even if the same random state is given. This PR aims to remedy that.
When picking a new primitive when generating a new individual, this is done by calling
np.random.choice(pset.terminals[type_])
. While numpy properly had its seed set,pset.terminals
was not ordered - resulting in a different primitive picked anyway.The reason
pset.terminals
was unordered, is that it is constructed in the same order asOperator.inheritors()
returns the available operators. However, this function internally stored the operators in aset
, which is unordered in Python. Hence, multiple calls ofOperator.inheritors()
could result in differently ordered sets, meaning the random choice, while always picking the same index, would pick different operators.This problem is fixed by storing the operators in a
list
, since that remembers the order elements were added to it.Edit: Apparently on my windows system, it was not enough to just use a list, but the list had to be explicitly sorted as well, this change has been made in this PR. On my Linux (Ubuntu) systems it worked fine without.
Where should the reviewer start?
All actual code changes are here.
The new unit test is here.
How should this PR be tested?
To test if
Operator.inheritors()
now always returns the same list, simply call it a few times and see if it indeed works now. This is also done in the unit test. Unfortunately I don't know of a way to better test this change (note that the old code could also return the same order).Any background context you want to provide?
I feel enough background was given.
What are the relevant issues?
#349
Questions:
I feel that the docs do not need to be updated, as to the best of my knowledge it is intended that the results should always be reproducible, this is a bugfix.
It adds no new dependencies.