
Operator.inheritors() returns a list rather than a set. #350

Closed
wants to merge 8 commits into from

Conversation

PGijsbers
Contributor

@PGijsbers PGijsbers commented Feb 10, 2017

What does this PR do?

In the current version of TPOT, different pipelines are generated even if the same random state is given. This PR aims to remedy that.

When generating a new individual, a new primitive is picked by calling np.random.choice(pset.terminals[type_]). While numpy's seed was properly set, pset.terminals was not ordered, so a different primitive could be picked anyway.

The reason pset.terminals was unordered is that it is constructed in the same order in which Operator.inheritors() returns the available operators. However, that function internally stored the operators in a set, which is unordered in Python. Hence, multiple calls to Operator.inheritors() could return the operators in different orders, meaning the random choice, while always picking the same index, could pick different operators.

This problem is fixed by storing the operators in a list, since a list preserves the order in which elements were added.

Edit: On my Windows system it was apparently not enough to just use a list; the list had to be explicitly sorted as well, and this change has been made in this PR. On my Linux (Ubuntu) systems it worked fine without sorting.
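Conceptually, the change amounts to something like the sketch below (a minimal sketch assuming operators are discovered via subclassing; the actual TPOT code differs in detail):

# Minimal sketch, not the actual TPOT code: return the available operators
# in a deterministic, sorted order instead of relying on set iteration order.
class Operator(object):
    """Base class whose subclasses represent pipeline operators."""

    @classmethod
    def inheritors(cls):
        # A set would make the iteration order (and thus pset.terminals)
        # vary between runs; a sorted list is reproducible.
        return sorted(cls.__subclasses__(), key=lambda op: op.__name__)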

Where should the reviewer start?

All actual code changes are here.

The new unit test is here.

How should this PR be tested?

To test whether Operator.inheritors() now always returns the same list, simply call it a few times and check that the results are identical. This is also done in the unit test. Unfortunately I don't know of a better way to test this change (note that the old code could also happen to return the same order).
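A minimal sketch of such a check (the import path and test name here are assumptions for illustration, not the exact unit test added in this PR):

from tpot.operators import Operator  # assumed import path, for illustration only

def test_inheritors_order_is_stable():
    # Repeated calls should return the operators in exactly the same order.
    first = Operator.inheritors()
    for _ in range(10):
        assert Operator.inheritors() == first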

Any background context you want to provide?

I feel enough background was given.

What are the relevant issues?

#349

Questions:

I feel that the docs do not need to be updated: to the best of my knowledge, results are intended to always be reproducible, so this is a bugfix.

It adds no new dependencies.

@coveralls

Coverage Status

Coverage increased (+0.08%) to 83.914% when pulling 59e18ae on PG-TUe:ordered-operators into aa6f673 on rhiever:development.

@PGijsbers
Contributor Author

Will look into the failed build soon.

@coveralls

Coverage Status

Coverage increased (+0.3%) to 84.086% when pulling 36d6a73 on PG-TUe:ordered-operators into aa6f673 on rhiever:development.

@rhiever
Contributor

rhiever commented Feb 13, 2017

Great find. We have some upcoming changes on the dev branch that will fix this issue, but I'd like to get this out as a hotfix on the current master branch release. Can you please rebase this PR and send it to the master branch so we can send out a hotfix for TPOT 0.6?

@PGijsbers
Contributor Author

After running a few hundred more experiments, I have found that results are still not always consistent, though they are generally much more consistent than before. I am also not sure whether the remaining inconsistency occurs everywhere or only in certain environments. That said, this still seems like an improvement regardless (I'd classify having unordered sets, when the randomness should be determined by the chosen index, as a bug), so I will rebase the PR.

@PGijsbers PGijsbers changed the base branch from development to master February 14, 2017 09:23
@PGijsbers
Contributor Author

It seems that it still wants to commit the doc changes as well. I am not sure how to fix this properly (still learning git). I can also commit the changes on a new branch based off master, if this is a problem.

@PGijsbers
Contributor Author

The further randomness observed is probably explained by issue #353.
Whenever the evaluation of an individual does not complete because of that bug, there is one fewer call to np.random, so the random number sequence diverges, which in turn changes mutation and mating.

@weixuanfu
Contributor

The reproducibility-related issue may be fixed by PR #351.

@PGijsbers
Contributor Author

Any idea when it will be merged into the dev branch? I am looking to extend TPOT, and while I can use my own version for now (with a few alterations to make results reproducible), it's nicer to go with the solution that will eventually be implemented. Hopefully that will save a headache in the end.

@weixuanfu
Contributor

@PG-TUe Thanks to @rhiever for merging it into the dev branch last night. Please let us know if the reproducibility issue still exists in that branch.

@PGijsbers
Contributor Author

Great! I'll give it a go soon and post back with results.

@PGijsbers
Contributor Author

PGijsbers commented Feb 17, 2017

Okay, as far as I can tell, results are consistent now! (Well, they were with my other fixes too, but this is much better.)
One little thing though: the default operator config includes xgboost, while in the documentation it is still listed as optional. I would personally prefer the default operator dict not to include xgboost (or, perhaps better, just print a notice for the libraries that could not be imported and continue).
Currently, if xgboost is not installed and all default settings are used, TPOT crashes upon initialization while trying to import xgboost.

@weixuanfu
Contributor

Thank you for the feedback! I think we may need to add an exception for xgboost when it is not installed.
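A minimal sketch of the kind of guard being discussed (illustrative only; not the actual TPOT implementation):

import warnings

# Sketch: warn and skip the xgboost operator instead of crashing when the
# library is not installed.
try:
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None
    warnings.warn("xgboost is not installed; XGBClassifier will not be "
                  "available in the default operator configuration.")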

@PGijsbers
Contributor Author

PGijsbers commented Feb 23, 2017

@weixuanfu2016, I have a question for you, since you reworked some of how the operators work. I was unsure how to contact you directly, so I am asking it here. I hope that's ok.
I am looking to create an individual given a specific pipeline. For now, the pipeline only consists of an algorithm with set parameters. Later, I will probably want to specify more operators as well.
I tried to construct pipelines with their from_string method, but this doesn't work so well since it does not know how to interpret the terminals properly (it sees a float value that is used for multiple terminals, e.g. LinearSVC__C and also LogisticRegression__C, and confuses which one to use).

Currently, I have set out to create such a method manually, but I figured I might be missing some useful tricks already in TPOT that I just overlooked. I find myself using string parsing a lot.

The current goal is, given the desired pipeline (right now provided as a row from a pandas dataframe that specifies the classifier and its parameters as strings), to create the corresponding individual.

def build_gp(row):
    """Build a deap individual for the classifier and parameters described in `row`."""
    expr = []
    # Find the primitive matching the requested classifier.
    classifiers = tpot._pset.primitives[gp_types.Output_DF]
    candidates = [clf for clf in classifiers if clf.name == row.classifier]
    if len(candidates) != 1:
        raise ValueError("Should be exactly one candidate, found {}".format(len(candidates)))
    clf_prim = candidates[0]
    expr.append(clf_prim)

    # Parse the parameter string ("name1=value1,name2=value2,...") into a dict.
    parameters = {param.split('=')[0]: param.split('=')[1] for param in row.parameters.split(',')}

    # Pick a terminal for each of the primitive's arguments.
    for arg in clf_prim.args:
        terms = tpot._pset.terminals[arg]
        if len(terms) == 1:
            expr.append(terms[0])
        else:
            param_name = arg.__name__.split('__')[1]
            if param_name in parameters:
                param_value = parameters[param_name]
                term_names = [t.name for t in terms]
                if param_value in term_names:
                    expr.append(terms[term_names.index(param_value)])
                else:
                    print("Could not find param value for {}".format(arg.__name__))
            else:
                print("Could not find listing for {}".format(param_name))

    return creator.Individual(expr)

ind = build_gp(benchmark_data.loc()[1015848])

Do you think this is generally the correct approach, or can you think of a better way?
Is any terminal declared as a default? E.g. if I do not have information about the LinearSVC__tol setting, can I revert to a default terminal? And if I wish to use a value that is not declared in any terminal, should I just create one in a similar fashion to how it is done in TPOTBase._setup_pset?

I am sorry if this is a lot to ask; if it is, please feel free to just say so and I'll keep hacking stuff together on my own :)

@weixuanfu
Contributor

weixuanfu commented Feb 23, 2017

I am happy to answer these questions; your comments contain good ideas.
I ran into this from_string issue before and posted about it on one of our closed PRs, #343. I think it is caused by a bug in the mapping in deap.gp.

Do you think this is generally the correct approach, or can you think of a better way?

I quickly checked the function you posted here and I think it is the right way to build a deap individual from a string. Maybe you could make a PR for this; I feel it would be very useful for unit tests and many future functions in TPOT.

Is any terminal declared as default? Eg. if I do not have information about the LinearSVC__tol setting, can I revert to a default terminal?

Right now, we don't have default values for most terminals, but it is easy to set up a customized operator dictionary. For example, PolynomialFeatures in the default operator dictionary has three default terminals. We can then pass an operator_dict to TPOT (-operator for the command line, with an input dictionary file). Please check the manual under dev for more details.

And if I wish to use a value that is not declared in any terminal, should I just create one in a similar fashion to it being done in TPOTBase._setup_pset?

You can also use a customized operator dictionary to add values that are not listed in the default operator dictionary. These terminals are flexible enough to take mixed types. For example, for max_features in RandomForestClassifier you can set 'max_features': list(np.arange(0, 1.01, 0.05)) + ['auto', None] in the customized operator dictionary. Just make sure it is a 1-dimensional list.
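A hedged sketch of what such a customized operator dictionary might look like (the keys, extra entries, and the operator_dict parameter name are assumptions based on the dev-branch manual mentioned above):

import numpy as np

# Illustrative customized operator dictionary entry; see the dev-branch manual
# for the exact expected format.
custom_operator_dict = {
    'sklearn.ensemble.RandomForestClassifier': {
        # mixed-type, 1-dimensional list of candidate values, as described above
        'max_features': list(np.arange(0, 1.01, 0.05)) + ['auto', None],
        'n_estimators': [100],
    },
}

# Passing it to TPOT (parameter name as described in the dev docs):
# tpot = TPOTClassifier(operator_dict=custom_operator_dict)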

@PGijsbers
Contributor Author

Thank you for taking the time to answer those questions; it's very helpful! I will be working on it again tomorrow. I'll post back if I have any other remarks/questions :)

@rhiever rhiever mentioned this pull request Feb 23, 2017
@PGijsbers
Contributor Author

Okay, so I have been experimenting for a bit, and it seems I can add my own terminals for missing values inside the function itself (as an alternative to first declaring all needed terminals). Declaring beforehand which terminals are valid in a dictionary is good, but when you want more sophisticated search techniques, such as Bayesian optimization, you want to be able to define parameter values on the fly.
I have currently implemented this in a slightly hacky manner, but it works for now.

This still left me without default values for terminals. However, I noticed that if I do not provide a terminal for the operator, it doesn't tend to mind: as long as I give a valid tree to generate_pipeline_code, it seems to use the scikit-learn defaults when no terminal is provided.

However, the expr_to_tree function currently doesn't really allow for missing terminals. I added the following lines of code to the bottom of that function, which allow the topmost primitive to be provided with fewer terminals than its arity:

    if len(stack) != 0:
        prim, args = stack[0][0], stack[0][1]
        tree = prim_to_list(prim, args)

    return tree

This means that inner operators cannot have fewer terminals than their arity, but the topmost one can. For example, it will now make a proper tree out of the individual ['LinearSVC', 'ARG0', '0.9', 'False', 'squared_hinge', 'l1'], even though the primitive 'LinearSVC' has arity 6 and only 5 terminals are provided. Do you think it would be a good idea to make some kind of NotCare terminal, which would be replaced with the scikit-learn default value for the respective parameter when generating the actual code? This would allow you to explicitly not specify terminals, which is nicer, and would also allow inner operators to use default values.

@weixuanfu
Contributor

weixuanfu commented Feb 24, 2017

Thank you for the feedback. expr_to_tree does not allow missing terminals because some terminal values may be shared among different terminals inside an operator. For example, min_samples_split and min_samples_leaf in RandomForestClassifier share the terminal values from 2 to 20, so it is hard to tell which one is missing from the individual. I think putting something like "" or "MISSING" into the ordered individual list to specify which terminal is missing would be nicer.

BTW, could you please make a PR for the nice from_string function (maybe put it into export_utils.py)? I think it would be very useful for TPOT.

@PGijsbers
Contributor Author

Thank you. Yes, that makes sense; leaving it out is not the way to go, we have to explicitly state missing (default) values.

One remark: the method currently does not take a stringified individual, but rather something like "LinearSVC(input_matrix, C=0.9, penalty=l1, dual=True)" (a stringified individual looks different). I will make a PR for this soon. I might actually be able to work out how to properly stringify an individual as well, so I want to give that a go first (that would be much nicer to work with).

@PGijsbers
Contributor Author

I looked into this some more, and from what I can tell we can just use the from_string method, provided we take care that our terminals have unique names. This can be done rather easily (I think): rather than having the terminals be named by a stringified version of their value, as is the default, we can give them an explicit unique name by combining the classifier, parameter name, and value into one string (the code is a modification of this):

        # Terminals
        for _type in self.arguments:
            for val in _type.values:
                terminal_name = _type.__name__ + "=" + str(val)
                self._pset.addTerminal(val, _type, name=terminal_name)

From what I have tested so far, this makes sure that an individual can be reconstructed from its string.
One caveat, however: this still only works when all terminals are specified for each primitive and those values exist. So I think I could modify the default from_string function of PrimitiveTree to also allow new values to be specified in the primitive tree and/or missing terminals. For this, I think we would indeed introduce a "missing"/"default" terminal for each parameter.
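A hedged sketch of such a "missing"/"default" terminal, following the registration pattern above (the function and sentinel names are illustrative, not the actual TPOT code):

# Sketch: register one extra sentinel terminal per argument type so that a
# parameter can explicitly say "use the scikit-learn default".
def add_parameter_terminals(pset, arg_type, default_name="DEFAULT"):
    for val in arg_type.values:
        pset.addTerminal(val, arg_type, name=arg_type.__name__ + "=" + str(val))
    # Sentinel terminal; code generation would later omit this keyword argument
    # entirely so that scikit-learn's own default is used.
    pset.addTerminal(default_name, arg_type, name=arg_type.__name__ + "=" + default_name)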

@weixuanfu
Contributor

weixuanfu commented Mar 15, 2017

@PG-TUe I got a lot of stderr output saying 'Individual' object has no attribute 'arity' when testing the code from your last comment. I think it is because the connection between an operator (Primitive) and its parameters (Terminals) is broken when a unique name is used instead of a unique class. Maybe modifying the from_string function is better.

@PGijsbers
Contributor Author

Can you provide more details, e.g. an example of how I can reproduce the issue? I'm fairly sure I worked on this a little bit afterwards, but my memory is a bit fuzzy about what exactly I changed. Maybe I can check whether I added changes that fixed this (though that would have been accidental, as I do not believe I saw such errors come up).

@weixuanfu
Contributor

I only made 2 changes based on the latest dev branch of TPOT.

  1. Based on the code below from your comment:

        # Terminals
        for _type in self.arguments:
            for val in _type.values:
                terminal_name = _type.__name__ + "=" + str(val)
                self._pset.addTerminal(val, _type, name=terminal_name)

  2. For printing out the error, change this line in the _pre_test decorator to:

            except Exception as e:
                print(e)

Or you can merge PR #367, which fixes another bug in _pre_test, into your dev branch for the test.

You can use the code below to reproduce the issue:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.25, test_size=0.75)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs = 2, random_state = 44)
tpot.fit(X_train, y_train)

@PGijsbers
Contributor Author

Thank you, I will check this tomorrow.

@PGijsbers
Contributor Author

PGijsbers commented Mar 16, 2017

Alright, I looked into this, and here are a couple of observations:

  • This bug exists in the current dev branch; it is not specific to my addition.
    The bug is caused because the _pre_test wrapper expects a single return value from the function it wraps, as seen in line 147: expr = func(self, *args, **kwargs). However, the wrapping is also applied to the mutate and mate operators, both of which return tuples (of the form ind1, for the former and ind1, ind2 for the latter). Because of this, during the expr_to_tree conversion in _pre_test, the line while len(stack[-1][1]) == stack[-1][0].arity: tries to get the arity attribute of the individual (the first element of the tuple) rather than of the primitive (the first element of the individual). This raises the exception above. By creating separate wrapper functions, or by first inspecting the return value so that only the individual, and not the tuple, is forwarded to expr_to_tree, this exception is no longer raised (example; a rough sketch of this unwrapping idea follows after this list).
  • After those changes, an exception is raised instead saying "'Individual' object has no attribute 'operators'". This is because both the mutate and mate operators are wrapped with _pre_test inline, rather than through the @-syntax. It is not exactly clear to me why (for now, still reading up), but when the decoration is done inline, the self passed to _pre_test will be an individual, not the TPOT object. Adding @_pre_test to the _random_mutation_operator method, for example, would be a way to avoid that error for the mutate operator (example).
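A hedged sketch of the unwrapping idea from the first observation above (illustrative only; the real _pre_test decorator in TPOT does more than this, and expr_to_tree is the TPOT helper discussed in this thread):

import functools

def _pre_test(func):
    @functools.wraps(func)
    def check_pipeline(self, *args, **kwargs):
        result = func(self, *args, **kwargs)
        # Generation functions return a single expression, while mutate and
        # mate return tuples (ind1,) / (ind1, ind2): normalize to a tuple so
        # that only individual expressions, never the whole tuple, get checked.
        expressions = result if isinstance(result, tuple) else (result,)
        for expr in expressions:
            # validate each expr here, e.g. via expr_to_tree(expr) as in TPOT
            pass
        return result
    return check_pipeline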

Lastly, on the matter at hand of making terminal names unique: there was a problem with the changes I suggested above. Giving names to terminals means they are only used as symbolic links; to get the actual value, as needed in prim_to_list, we need access to the primitive set the terminals originate from so we can look them up in its context. This change is also relatively straightforward (example).

After all the above changes, from what I can tell, the only exceptions thrown in the pre-test are now actually caused by certain parameter settings being illegal, not by coding errors.

I also worked on accepting missing values. The only changes needed are in this commit. I am thinking of adding an option so that 'MISSING' will not be used during the generation of individuals (since a default that is also represented by its explicit value would otherwise be picked more often in the generation process).

@weixuanfu
Contributor

Thank you for looking into it. I will check your demo.

@weixuanfu
Contributor

weixuanfu commented Mar 16, 2017

Could you please provide an example of using from_string to build a deap individual object?

@weixuanfu
Contributor

weixuanfu commented Mar 16, 2017

No need, I think I figured it out. It needs more testing.

Here is a demo for from_string:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from deap import creator
import numpy as np

# Set up the digits data set for testing
mnist_data = load_digits()
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(mnist_data.data.astype(np.float64), mnist_data.target.astype(np.float64), random_state=42)


tpot_obj = TPOTClassifier()
pipeline_string = 'KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=10, KNeighborsClassifier__p=1,KNeighborsClassifier__weights=uniform)'
tpot_obj._optimized_pipeline = creator.Individual.from_string(pipeline_string, tpot_obj._pset)
tpot_obj._fitted_pipeline = tpot_obj._toolbox.compile(expr=tpot_obj._optimized_pipeline)
tpot_obj._fitted_pipeline.fit(training_features, training_classes)
score = tpot_obj.score(testing_features, testing_classes)

print(score)

Thank you again!

@PGijsbers
Contributor Author

No problem! And those are indeed the kind of strings I meant. Closing this now, as the discussion regarding stringifying individuals is more or less settled for now (at least there is a working version in the dev branch) and the original issue has also been solved through earlier reworks.
