Feature: Algorithms, Dimensions, Space and Embeddings #36
Conversation
- Missing transformers, but it should be functional without them for now. Transformers will later be compositor objects. Check the module documentation. In general, I think this refactoring will be able to support arbitrary transformations and distributions on dimensions.
- Fix shape to use scipy internal function
Also, silence a new 'wrong-import-order' error from pylint, since it conflicts with the current 'google' style from the flake8-import-order plugin.
- Change how Categorical is repr-ed so that probabilities are printed with .2f precision
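For illustration only, a minimal sketch of what such a repr could look like; the class body and attribute names here are assumptions, not the actual metaopt code:

```python
# Hypothetical sketch: probabilities rendered with two decimal places in __repr__.
class Categorical(object):
    def __init__(self, name, categories, probs):
        self.name = name
        self.categories = tuple(categories)
        self._probs = tuple(probs)

    def __repr__(self):
        cats = ', '.join('{0}: {1:.2f}'.format(c, p)
                         for c, p in zip(self.categories, self._probs))
        return 'Categorical(name={0}, prior={{{1}}})'.format(self.name, cats)


print(Categorical('optimizer', ('adam', 'sgd'), (0.7, 0.3)))
# Categorical(name=optimizer, prior={adam: 0.70, sgd: 0.30})
```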
Codecov Report
@@ Coverage Diff @@
## master #36 +/- ##
===========================================
+ Coverage 86.95% 97.54% +10.59%
===========================================
Files 16 34 +18
Lines 1012 2691 +1679
Branches 102 223 +121
===========================================
+ Hits 880 2625 +1745
+ Misses 132 42 -90
- Partials 0 24 +24
Continue to review full report at Codecov.
- Highlight Scipy's bug with an xfail test
1. Declare namespace packages: ``metaopt`` and ``metaopt.algo``
2. Rename the distribution of this package to ``metaopt.core``
3. Make the `Factory` metaclass search for and register available entry points within a group named after the class itself (see the sketch below)
4. Add a base class for optimization algorithms
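As a rough illustration of point 3 (a sketch only; the group derivation and names are assumptions, not the actual metaopt code), a Factory metaclass can scan setuptools entry points and register whatever implementations are installed:

```python
from abc import ABCMeta

import pkg_resources


class Factory(ABCMeta):
    """Metaclass sketch: register implementations exposed as entry points."""

    def __init__(cls, name, bases, namespace):
        super(Factory, cls).__init__(name, bases, namespace)
        cls.types = {}
        # Assumption: the entry-point group is named after the class being created
        for entry_point in pkg_resources.iter_entry_points(name):
            cls.types[entry_point.name] = entry_point.load()

    def __call__(cls, of_type, *args, **kwargs):
        # Instantiate the registered implementation requested by name
        return cls.types[of_type](*args, **kwargs)
```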
🎉 😄 We are rushing for ICML, so I might not have enough time to review it, unfortunately. Worst case scenario would be February 12th. I'll do my best; I can hardly wait to review and have this PR merged. 🙂
a4414af - Space: Add _Discrete class and ban interval from Categorical
@@ -283,6 +283,9 @@ def configure(self, config):
         if self._init_done:
             raise RuntimeError("Configuration is done; cannot reset an Experiment.")

+        # Sanitize and replace some sections with objects
+        candidate_algorithms = Experiment._sanitize_and_instantiate_config(config)
That is weird. Why would _sanitize_and_instantiate_config() return optimization algorithms? I understand the convenience, but it's weird; it smells like there is bad coupling.
I would at least split that line and fetch candidate_algorithms from config after instantiation.
Ok this is a little bit nasty, but I will explain:
What we 'want':
1. The most recent metaopt configuration has priority over the existing one in the database. In addition, having the same experiment name means that we run the same algorithms.
2. Configurations should match except for allowed exceptions. Currently those are: non_forking_attrs = ('status', 'pool_size', 'max_trials')
3. A user can specify an algorithm to run with its default parameters just by giving the algorithm's name. So a user is not forced to write, for instance, algorithms: random: {} instead of algorithms: random in the moptconfig. Note that in the former case the value of algorithms is a dictionary, while in the latter it is a string.
4. We would like to save in the database the complete specification of an algorithm, with all of its parameters specified. That is a measure to ensure consistency; consider the case where the default values of an algorithm's parameters change.
What's the problem and the solution:
Comparing the configurations requires them to have a similar form. For our convenience, we can proactively instantiate algorithms from the new configuration (so that (3) can be a feature) and sanitize it by executing (code from Experiment._sanitize_and_instantiate_config):

    algorithms = PrimaryAlgo(space, config['algorithms'])
    # Sanitize 'algorithms' section in given configuration
    config['algorithms'] = algorithms.configuration

since BaseAlgorithm.configuration always returns the complete specification.
So now we can compare them safely, because we did not mess with the non-forking attributes (2) and we brought the configuration to a consistent form (4). We are able to proactively instantiate the algorithms because of (1).
Can you elaborate on:
"I would at least split that line and fetch candidate_algorithms from config after instantiation."
I did not quite get it.
As I said, I understand the convenience, but thanks for the explanation anyway. 🙂
Is Experiment._sanitize_and_instantiate_config() expected to affect the state of the experiment object at some point? I know it does not in the current implementation, but maybe you have some plan in mind where this would happen? If that's not the case, then I would make the function edit a copy of config and return the modified version, leaving the passed one intact.

    instantiated_config = Experiment._sanitize_and_instantiate_config(config)
    candidate_algorithms = instantiated_config['algorithms']

Or, if we keep the current in-place implementation, my two-line split proposition would look like this:

    Experiment._sanitize_and_instantiate_config(config)
    candidate_algorithms = config['algorithms']

I'd prefer the former one.
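For what it's worth, the difference between the two options is only whether the caller's dict is mutated; a toy illustration with plain dictionaries (not the real SpaceBuilder/PrimaryAlgo machinery):

```python
import copy


def sanitize_inplace(config):
    # Mutates the caller's dictionary, as the current implementation is described to do
    if isinstance(config['algorithms'], str):
        config['algorithms'] = {config['algorithms']: {}}


def sanitize_copy(config):
    # Works on a deep copy and returns it; the caller's dictionary stays intact
    config = copy.deepcopy(config)
    if isinstance(config['algorithms'], str):
        config['algorithms'] = {config['algorithms']: {}}
    return config


user_config = {'algorithms': 'random'}
sanitized = sanitize_copy(user_config)
assert user_config['algorithms'] == 'random'       # original left untouched
assert sanitized['algorithms'] == {'random': {}}   # normalized copy
```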
@tsirif Can we settle on this one too?
I don't think it can be changed this way, or at least I am not able to see how. I am going to go through the code with a particular example in mind (I will remove some code to focus on what's necessary):
def configure(self, config):
    # ``config['algorithms'] == 'random'``; this was set from the user's configuration
    candidate_algorithms = Experiment._sanitize_and_instantiate_config(config)
    # After this call, ``config['algorithms'] == {'random': {}}``

    if self.status is None:  # Suppose that an experiment with the same name already exists
        is_new = True
    else:  # So we will be here
        # Now we have to check against what already exists in the database.
        # If we were not to run `_sanitize_and_instantiate_config` as it is,
        # then ``config['algorithms'] == 'random'``.
        is_new = self._is_different_from(config)
        # ``is_new is True`` if ``config['algorithms'] == 'random'``,
        # but ``is_new is False`` if ``config['algorithms'] == {'random': {}}``.
        # That is because a complete (not partial) specification of the experiment
        # and its algorithms is saved in the database.

    if is_new:
        self._fork_config(config)

    for section, value in config.items():
        setattr(self, section, value)

    self._init_done = True
    # In any case, the current config from the user has top precedence in defining
    # which algorithms are going to be run.
    self.algorithms = candidate_algorithms
    self.status = 'pending'

    # Get the complete specification of the experiment
    final_config = self.configuration
    # Push it to the database
    if is_new:
        self._db.write('experiments', final_config)
        self._id = final_config['_id']
    else:
        self._db.write('experiments', final_config, {'_id': self._id})


def _sanitize_and_instantiate_config(config):
    space = SpaceBuilder().build_from(config['metadata']['user_args'])
    if not space:
        raise ValueError("Parameter space is empty. There is nothing to optimize.")

    # Instantiate algorithms
    algorithms = PrimaryAlgo(space, config['algorithms'])
    # Get back a complete specification of the algorithms' configuration!
    # Basically this step converts ``config['algorithms'] == 'random'`` into
    # ``config['algorithms'] == {'random': {}}``.
    config['algorithms'] = algorithms.configuration
    return algorithms


def _is_different_from(self, config):
    is_diff = False
    for section, value in config.items():
        item = getattr(self, section)
        # Note that in the comparison below no special case is considered for
        # ``section == 'algorithms'``. Because of this, ``is_new is True`` would be
        # returned falsely if the previous steps in ``_sanitize_and_instantiate_config``
        # did not happen. Special treatment of ``section == 'algorithms'`` would
        # effectively do twice what `BaseAlgorithm.__init__` does through `PrimaryAlgo`;
        # it would break delegation of responsibility.
        if item != value:
            is_diff = True
            break
What is being proposed above does not resolve this case. The reason is that config['algorithms'] is being replaced by an actual PrimaryAlgo instance, which makes this key not immediately comparable with what (possibly) already exists in the database.
I am not very fond of this particular part of the current implementation, but I think it is nonetheless a sufficient measure.
OK! There was an important detail I missed. The method _sanitize_and_instantiate_config() will
- instantiate the space and algorithm(s), and set the algorithm's config inside the experiment's config (sanitize),
- but return the instantiated algorithm, not its config (instantiate).
This is because
- we want a unique form of config to compare (e.g. algo: random and algo: {random: {}} map to the same config {random: {}}) (sanitize), and
- we want to instantiate the actual algorithm to set it as an attribute of the experiment (instantiate).
The problem is that we need the instantiated config in order to get the sanitized config; that's why both operations are currently coupled in a single method.
I think we could get rid of the sanitize part by doing the tests for differences and the forking on a temporary copy of the experiment. This way we would still avoid changes on the experiment itself during the tests. Something like this:
def configure(self, config):
    experiment = copy.deepcopy(self)
    experiment._instantiate_config(config)

    if self.status is None:
        is_new = True
    else:
        is_new = self._is_different_from(experiment.configuration)

    if is_new:
        experiment._fork_config(config)

    self._instantiate_config(experiment.configuration)
    self._init_done = True
    self.status = 'Pending'
    # ...
The method _instantiate_config() would set up the experiment with instantiated objects from the config:
def _instantiate_config(self, config):
    space = SpaceBuilder().build_from(config['metadata']['user_args'])
    if not space:
        raise ValueError("Parameter space is empty. There is nothing to optimize.")

    # Instantiate algorithms
    self.algorithms = PrimaryAlgo(space, config['algorithms'])

    for section, value in config.items():
        setattr(self, section, value)
Nice idea! Much cleaner!!
    algorithms = PrimaryAlgo(space, config['algorithms'])
    # Sanitize 'algorithms' section in given configuration
    config['algorithms'] = algorithms.configuration
    return algorithms
See previous comment
src/metaopt/algo/base.py
Outdated
@@ -90,8 +90,8 @@ def suggest(self, num=1):

     @abstractmethod
     def observe(self, points, results):
-        """Observe evaluation `results` corresponding to list of `points` in
-        space.
+        """Observe the `results` of the evaluation of the `points` in the
I would add a similar note as for judge, and make the difference between the two methods clear.
src/metaopt/algo/space.py
Outdated
         """
-        return self.categories
+        raise RuntimeError("Categories are not ordered. Use ``self.categories`` instead")
I find ordered confusing in this context. I would simply say that Categories have no interval.
Also, elaborate more on its usage in the docs.
Also, use 'norm' distributions instead of 'alpha' in tests as they are more popular.
- Use the trial for debugging the reason why it failed to execute the script.
- Add explanatory comment on ``point`` argument on `BaseAlgorithm.judge`.
- Rename some of Consumer's attributes to facilitate understanding the code.
Good! Let me recap this tomorrow morning. It seems we are close to a merge. 🎉
@tsirif Sorry I spent too much time on the survey and I'm lacking time now for the PR. I'll recap tomorrow afternoon (Friday).
@tsirif By the way, would be nice if you could make the change for
Good overall. I opened issues for things I believe we can discuss and fix outside this PR. There are 3 things I'd like us to settle on before the PR is merged.
- The _Discrete class
- The way Categorical samples points (I find it overcomplicated)
- Whether Experiment._sanitize_and_instantiate_config should be in-place or not
src/metaopt/algo/space.py
Outdated
        return size


class _Discrete(Dimension):
I believe the _Discrete class is useless. The sample method could reside solely inside the Integer class. Categorical's sample method is straightforward using numpy.random.choice. Also, the interval method is overridden anyway by Categorical's to raise RuntimeError.
Although I can easily see how Integer and Categorical are related in theory, combining them in our context gives no advantage in return; it only complexifies Categorical.
😓
I am not able to somehow pass a separate "sandboxed" seed without affecting the global numpy RNG; numpy.random.choice does not offer such an option. I could temporarily save the random state:

    orig_state = numpy.random.get_state()
    numpy.random.seed(seed)
    samples = ...  # Do stuff with `numpy.random` here
    numpy.random.set_state(orig_state)
    return samples

but I am not sure whether affecting the global numpy RNG is safe, if that is not your intention.
What about setting a self.rng = numpy.random.RandomState() object inside the Dimension? We could then use rng.choice().
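For illustration, a minimal stand-alone sketch of that idea (not the metaopt class itself): the dimension owns a numpy RandomState, so sampling never touches the global numpy RNG:

```python
import numpy


class ToyCategorical(object):
    """Illustrative stand-in for a categorical dimension."""

    def __init__(self, categories, probs=None, seed=None):
        self.categories = tuple(categories)
        self._probs = probs  # None means uniform for numpy's choice
        self.rng = numpy.random.RandomState(seed)

    def sample(self, n_samples=1):
        # Drawing through the owned RandomState leaves numpy.random's global state alone
        return list(self.rng.choice(self.categories, size=n_samples, p=self._probs))


dim = ToyCategorical(['adam', 'sgd', 'rmsprop'], probs=[0.5, 0.3, 0.2], seed=123)
print(dim.sample(2))
```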
I will use it only inside Categorical. The rest of the Dimension classes use scipy.stats, which does this internally.
I patched Categorical to use numpy.random.RandomState, via the check_random_state helper function.
However, I found a difficulty in removing the _Discrete subclass. Remember that the _Discrete class was introduced as an alternative to removing the discrete keyword argument from Dimension. The programming flow is currently as follows, and it takes advantage of Python's mechanics regarding class inheritance.
Currently:
Suppose a call to Integer(...).sample(...). In general, calling a method on an Integer object will first search within Integer, and then super calls from there will try to find the corresponding methods first in Real, then _Discrete, and then Dimension.
class Integer(Real, _Discrete):
    def interval(self, alpha=1.0):
        low, high = super(Integer, self).interval(alpha)  # Call to Real.interval
        # Make them integers
        return int_low, int_high

    def __contains__(self, point):
        # Check if they are integers...
        return super(Integer, self).__contains__(point)  # Call to Dimension.__contains__

    # Calling Integer(...).sample(...) will enter Real.sample first,
    # due to the order of the base classes.


class Real(Dimension):
    def interval(self, alpha=1.0):
        prior_low, prior_high = super(Real, self).interval(alpha)  # Call to Dimension.interval
        return (max(prior_low, self._low), min(prior_high, self._high))  # Add constraints to the interval

    def sample(self, n_samples=1, seed=None):
        # Checks if constraints are met!
        samples = []
        for _ in range(n_samples):
            for _ in range(4):
                sample = super(Real, self).sample(1, seed)  # Calls _Discrete.sample
                if sample[0] not in self:
                    # Calls Integer.__contains__
                    # ...
                # ...
            # ...
        return samples


class _Discrete(Dimension):
    def sample(self, n_samples=1, seed=None):
        samples = super(_Discrete, self).sample(n_samples, seed)  # Calls Dimension.sample
        # Make samples discrete ourselves, because scipy does not apply **floor**
        return list(map(lambda x: numpy.floor(x).astype(int), samples))  # Converts to integers
So the sequence of calls goes like this:
1. [Integer.sample (redirects)] Real.sample (fulfills extra constraints)
2. _Discrete.sample (converts to integers)
3. Dimension.sample (samples from distribution)
4. Integer.__contains__ (checks if they are integers)
5. Dimension.__contains__ (checks if they are within bounds)
6. Integer.interval (makes it int)
7. Real.interval (extra constraints)
8. Dimension.interval (asks the distribution for its support)
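For reference, the lookup order above is just Python's method resolution order for this hierarchy; a tiny stand-alone check (class bodies elided) confirms it:

```python
# Simplified stand-ins, just to show the method resolution order
class Dimension(object):
    pass


class _Discrete(Dimension):
    pass


class Real(Dimension):
    pass


class Integer(Real, _Discrete):
    pass


print([klass.__name__ for klass in Integer.__mro__])
# ['Integer', 'Real', '_Discrete', 'Dimension', 'object']
```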
If we remove _Discrete and move all of its methods into class Integer(Real), the sequence of calls becomes:
1. Integer.sample (converts to integers)
2. Real.sample (fulfills extra constraints)
3. Dimension.sample (samples from distribution)
4. Integer.__contains__ (checks if they are integers)
5. Dimension.__contains__ (checks if they are within bounds)
6. Integer.interval (makes it int)
7. Real.interval (extra constraints)
8. Dimension.interval (asks the distribution for its support)
Poor Dimension.sample will return whatever the scipy distribution returns, and the check from Integer.__contains__ will fail if the underlying scipy distribution corresponds to a real one... However, I believe we should allow discretizing real distributions, so that, for instance, someone could have approximately normally distributed integer stochastic variables. What do you think about this?
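To make the "approximately normally distributed integers" idea concrete, here is a small illustrative snippet (not part of the PR) that floors draws from scipy's normal distribution:

```python
import numpy
from scipy import stats

rng = numpy.random.RandomState(42)

# Draw real-valued samples from a normal prior, then discretize them with floor
real_samples = stats.norm.rvs(loc=10, scale=2, size=5, random_state=rng)
int_samples = numpy.floor(real_samples).astype(int)
print(int_samples)  # five integers clustered around 10
```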
I agree with the idea of discretizing Real distributions. However, I don't see the problem here; can't we just move _Discrete.sample() inside Integer.sample(), rather than completely removing this method and relying solely on Real.sample()?
Sorry for the confusion. I meant that if we removed class _Discrete(Dimension) and moved all of its methods to class Integer(Real) (including sample), then the sequence of calls would go like this:
1. Integer.sample (converts to integers)
2. Real.sample (fulfills extra constraints)
3. Dimension.sample (samples from distribution) -> this will possibly return samples in real numbers
4. Integer.__contains__ (checks if they are integers) -> it will fail because it checks if they are integers
5. Dimension.__contains__ (checks if they are within bounds)
6. Integer.interval (makes it int)
7. Real.interval (extra constraints)
8. Dimension.interval (asks the distribution for its support)
Oh yes indeed, I was confused. 😅
My proposition then would be to replace this confusing inheritance path with a simple separation of sample() and _sample(), where sample() performs verifications over _sample(), which does the drawing blindly.
So this line:

    sample = super(Real, self).sample(1, seed)

becomes

    sample = self._sample(1, seed)
where
class Real(Dimension):
    # ...
    def _sample(self, n_samples, seed=None):
        return super(Real, self).sample(1, seed)
    # ...


class Integer(Real):
    # ...
    def _sample(self, n_samples, seed=None):
        samples = super(Integer, self)._sample(n_samples, seed)
        return list(map(lambda x: numpy.floor(x).astype(int), samples))
    # ...
I think it is easier this way to track the path of execution, in order to understand, modify, and maintain the code.
We would now have:
1. Integer.sample (converts to integers)
2. Real.sample (fulfills extra constraints)
3. Integer._sample
4. Real._sample
5. Dimension._sample (samples from distribution)
6. Integer.__contains__
7. Dimension.__contains__ (checks if they are within bounds)
8. Integer.interval (makes it int)
9. Real.interval (extra constraints)
10. Dimension.interval (asks the distribution for its support)
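A self-contained toy version of that sample()/_sample() split (with a made-up uniform "distribution", not the metaopt classes) shows the call path end to end:

```python
import numpy


class Dimension(object):
    def __init__(self, low, high):
        self._low, self._high = low, high
        self.rng = numpy.random.RandomState(0)

    def _sample(self, n_samples, seed=None):
        # Blind draw from the underlying (here: uniform) distribution
        return list(self.rng.uniform(self._low, self._high, size=n_samples))

    def sample(self, n_samples=1, seed=None):
        # Public entry point; verifications would live here
        return self._sample(n_samples, seed)


class Real(Dimension):
    pass


class Integer(Real):
    def _sample(self, n_samples, seed=None):
        # Discretize whatever the parent chain produced
        samples = super(Integer, self)._sample(n_samples, seed)
        return [int(numpy.floor(x)) for x in samples]


print(Integer(0, 10).sample(3))  # three integers in [0, 10)
```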
src/metaopt/algo/space.py
Outdated
        self.categories = tuple(categories)
        self._probs = tuple(numpy.tile(1. / len(categories), len(categories)))

        prior = distributions.rv_discrete(values=(list(range(len(self.categories))),
I stand by my point here. I don't see the advantage of going through rv_discrete. Using numpy.random.choice would be simpler, without the need for any _inverse and _check.
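For comparison, roughly what the two approaches look like for drawing categories (illustrative only; the real code also maps indices back to categories):

```python
import numpy
from scipy.stats import rv_discrete

categories = ('a', 'b', 'c')
probs = (0.2, 0.3, 0.5)
rng = numpy.random.RandomState(0)

# Via rv_discrete: sample integer indices, then map them back to the categories
prior = rv_discrete(values=(list(range(len(categories))), probs))
indices = prior.rvs(size=4, random_state=rng)
print([categories[i] for i in indices])

# Via numpy's choice: a single call, no index round-trip needed
print(list(rng.choice(categories, size=4, p=probs)))
```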
@@ -283,6 +283,9 @@ def configure(self, config):
         if self._init_done:
             raise RuntimeError("Configuration is done; cannot reset an Experiment.")

+        # Sanitize and replace some sections with objects
+        candidate_algorithms = Experiment._sanitize_and_instantiate_config(config)
@tsirif Can we settle on this one too?
Good! I think we can remove commit 54e9c54 given that we will keep _Discrete. I'd rather leave both sample() and interval() inside it then.
tests/unittests/algo/test_space.py
Outdated
        points = space.sample(2, seed=seed)
        print(points)
Print?
tests/unittests/algo/test_space.py
Outdated
                       (points1[1], points2[1], points3[1])]
        test_points = [(points1[0], points2[0], points3[0]),
                       (points1[1], points2[1], points3[1])]
        print(test_points)
Print?
b881215 to e91d4c9 (Compare)
@bouthilx All is ok!
src/metaopt/algo/space.py
Outdated
    def __contains__(self, point):
        """Check if constraints hold for this `point` of `Dimension`.

        :param x: a parameter corresponding to this `Dimension`.
Missed those points