
Feature: Algorithms, Dimensions, Space and Embeddings #36

Merged
merged 68 commits into master from feature/algorithm
Mar 19, 2018

Conversation

tsirif
Member

@tsirif tsirif commented Dec 19, 2017

No description provided.

- Missing transformers, but it should be functional without them for now.
  Transformers will later be compositor objects. Check the module
  documentation. In general, I think this refactoring will be able to support
  arbitrary transformations and distributions on dimensions.
- Fix shape to use scipy internal function.
  Also, silence a new 'wrong-import-order' error from pylint since it
  conflicts with the current 'google' style from the flake8-import-order plugin.
- Change how Categorical is repr-ed so that probabilities are printed with .2f precision.
@codecov-io

codecov-io commented Dec 22, 2017

Codecov Report

Merging #36 into master will increase coverage by 10.59%.
The diff coverage is 97.27%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #36       +/-   ##
===========================================
+ Coverage   86.95%   97.54%   +10.59%     
===========================================
  Files          16       34       +18     
  Lines        1012     2691     +1679     
  Branches      102      223      +121     
===========================================
+ Hits          880     2625     +1745     
+ Misses        132       42       -90     
- Partials        0       24       +24
Impacted Files Coverage Δ
src/metaopt/core/__init__.py 100% <ø> (ø) ⬆️
tests/unittests/core/mongodb_test.py 100% <ø> (ø)
src/metaopt/core/worker/trial.py 100% <100%> (ø) ⬆️
src/metaopt/core/io/database/__init__.py 100% <100%> (ø) ⬆️
src/metaopt/algo/space.py 100% <100%> (ø)
src/metaopt/algo/random.py 100% <100%> (ø)
src/metaopt/core/io/convert.py 100% <100%> (ø)
tests/unittests/core/test_trial.py 100% <100%> (ø)
tests/unittests/client/test_client.py 100% <100%> (ø)
src/metaopt/core/io/database/mongodb.py 100% <100%> (ø) ⬆️
... and 44 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f4db650...00496fc.

@bouthilx
Member

bouthilx commented Feb 4, 2018

🎉 😄

We are rushing for ICML, so I might not have enough time to review it, unfortunately. Worst-case scenario would be February 12th. I'll do my best; I can hardly wait to review and have this PR merged. 🙂

@bouthilx bouthilx self-requested a review February 12, 2018 18:02
Member

@bouthilx bouthilx left a comment

Review commits up to 55 / 55

a4414af - Space: Add _Discrete class and ban interval from Categorical

@@ -283,6 +283,9 @@ def configure(self, config):
if self._init_done:
raise RuntimeError("Configuration is done; cannot reset an Experiment.")

# Sanitize and replace some sections with objects
candidate_algorithms = Experiment._sanitize_and_instantiate_config(config)
Member

That is weird. Why would _sanitize_and_instantiate_config() return optimization algorithms? I understand the convenience, but it's weird. It smells like there is bad coupling.

I would at least split that line and fetch candidate_algorithms from config after instantiation.

Member Author

Ok this is a little bit nasty, but I will explain:

What we 'want':

  1. The most recent metaopt configuration has priority over the one existing in the database. In addition, having the same experiment name means that we run the same algorithms.
  2. Configurations should match except for allowed exceptions. Currently those are: non_forking_attrs = ('status', 'pool_size', 'max_trials')
  3. A user can specify an algorithm to run with its default parameters just by giving the algo's name, so they will not be forced to write, for instance,
    algorithms: random: {} instead of algorithms: random in moptconfig. Note that in the former case the value of algorithms is a dictionary, while in the latter case it is a string.
  4. We would like to save in the database the complete specification of an algorithm, with all of its parameters specified. That's a measure to ensure consistency; consider the case where the default values for an algorithm's parameters change.

What's the problem and solution:
Comparing the configurations requires them to have a similar form. For our convenience, we can proactively instantiate the algorithms from the new configuration (so that (3) can be a feature) and sanitize it by executing (code from Experiment._sanitize_and_instantiate_config):

        algorithms = PrimaryAlgo(space, config['algorithms'])
        # Sanitize 'algorithms' section in given configuration
        config['algorithms'] = algorithms.configuration

since BaseAlgorithm.configuration always returns the complete specification.
So now we can compare them safely, because we didn't touch the non-forking attributes (2) and we brought the configuration to a consistent form (4). We are able to proactively instantiate the algorithms because of (1).
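
For illustration only, the normalization in (3)/(4) boils down to something like the following toy helper (hypothetical code, not the actual PrimaryAlgo):

# Hypothetical sketch: normalize the 'algorithms' section so that the shorthand
# string form and the explicit dict form compare equal.
def normalize_algorithms(section):
    """Map 'random' -> {'random': {}} and leave dict specifications untouched."""
    if isinstance(section, str):
        return {section: {}}
    return dict(section)

assert normalize_algorithms('random') == normalize_algorithms({'random': {}})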

Member Author

Can you elaborate on:

I would at least split that line and fetch candidate_algorithms from config after instantiation.

I did not quite get it.

Member

As I said, I understand the convenience, but thanks for the explanation anyway. 🙂

Is Experiment._sanitize_and_instantiate_config() expected to affect the state of the experiment object at some point? I know it does not in the current implementation but maybe you have some plan in mind where this would happen? If it's not the case, then I would make the function edit a copy of config and return the modified version, leaving the passed one intact.

instantiated_config = Experiment._sanitize_and_instantiate_config(config)
candidate_algorithms = instantiated_config['algorithms']

Or, if we keep the current in-place implementation, my two-line split proposition would look like this.

Experiment._sanitize_and_instantiate_config(config)
candidate_algorithms = config['algorithms']

I'd prefer the former one.

Member

@tsirif Can we settle on this one too?

Member Author

I don't think it can be changed this way, or at least I am not able to see how. I am going to go through the code with a particular example in mind (I will remove some code to focus on what's necessary):

    def configure(self, config):
        # ``config['algorithms'] == 'random'`` ; this was set from user's configuration
        candidate_algorithms = Experiment._sanitize_and_instantiate_config(config)
        # after this call variable config will be: ``config['algorithms'] == {'random': {}}``

        if self.status is None:  # Suppose that experiment with same name already exists
            is_new = True
        else:  # So we will be here
            # Now we have to check against what already exists in database
            # If we were not to do `_sanitize_and_instantiate_config` as it is,
            # then ``config['algorithms'] == 'random'``.
            is_new = self._is_different_from(config)
            # ``is_new is True``, if ``config['algorithms'] == 'random'``.
            # But ``is_new is False``, if ``config['algorithms'] == {'random': {}}``
            # That is because in database, a complete (not partial) specification of the experiment
            # and its algorithms is being saved.
            if is_new:
                self._fork_config(config)

        for section, value in config.items():
            setattr(self, section, value)

        self._init_done = True
        # In any case, current config from user has the top precedence in defining
        # which algorithms are going to be run.
        self.algorithms = candidate_algorithms
        self.status = 'pending'
        # Get complete specification of experiment
        final_config = self.configuration

        # Push it to the database
        if is_new:
            self._db.write('experiments', final_config)
            self._id = final_config['_id']
        else:
            self._db.write('experiments', final_config, {'_id': self._id})
    def _sanitize_and_instantiate_config(config):
        space = SpaceBuilder().build_from(config['metadata']['user_args'])
        if not space:
            raise ValueError("Parameter space is empty. There is nothing to optimize.")
        # Instantiate algorithms
        algorithms = PrimaryAlgo(space, config['algorithms'])
        # Get back a complete specification of algorithms configuration!
        # Basically this step converts, ``config['algorithms'] == 'random'`` to
        # ``config['algorithms'] == {'random': {}}``.
        config['algorithms'] = algorithms.configuration
        return algorithms
    def _is_different_from(self, config):
        is_diff = False
        for section, value in config.items():
            item = getattr(self, section)
            # See below in the comparison that no special case is considered for
            # ``section == 'algorithms'``. Because of this, ``is_new is True`` would be returned
            # falsely if the previous steps in ``_sanitize_and_instantiate_config`` did not happen.
            # Special treatment of ``section == 'algorithms'`` would effectively do twice what
            # `BaseAlgorithm.__init__` does through `PrimaryAlgo`.
            # It breaks delegation of responsibility.
            if item != value:
                is_diff = True
                break

What is being proposed above does not resolve this case. The reason is that config['algorithms'] is being replaced by an actual PrimaryAlgo instance, which makes this key not immediately comparable with what (possibly) already exists in the database.

I am not very fond of this particular part of the current implementation, but I think it is nonetheless a sufficient measure.

Member

@bouthilx bouthilx Mar 18, 2018

OK! There was an important detail I missed. The method _sanitize_and_instantiate_config() will

  1. instantiate the space and algorithm(s) and will set the algorithm's config inside the experiment's config (sanitize),
  2. but it will return the instantiated algorithm, not its config (instantiate)

This is because

  1. We want a unique form of config to compare (ex: algo: random and algo: {random: {}} map to the same config {random: {}}) (sanitize)
  2. We want to instantiate the actual algorithm to set it as an attribute of the experiment (instantiate)

The problem is that we need the instantiation in order to get the sanitized config; that's why both operations are currently coupled in a single method.

I think we could get rid of the sanitize part by doing the tests for differences and forking on a temporary copy of the experiment. This way we would still avoid changes on the experiment itself during the tests. Something like this:

def configure(self, config):
    experiment = copy.deepcopy(self)
    experiment._instantiate_config(config)

    if self.status is None:
        is_new = True
    else:
        is_new = self._is_different_from(experiment.configuration)
        if is_new:
            experiment._fork_config(config)

    self._instantiate_config(experiment.configuration)

    self._init_done = True
    self.status = 'pending'

    # ...

The method _instantiate_config() would setup the experiment with instantiated objects of the config.

def _instantiate_config(self, config):
    space = SpaceBuilder().build_from(config['metadata']['user_args'])
    if not space:
        raise ValueError("Parameter space is empty. There is nothing to optimize.")

    # Instantiate algorithms
    self.algorithms = PrimaryAlgo(space, config['algorithms'])

    for section, value in config.items():
        setattr(self, section, value)

Member Author

Nice idea! Much cleaner!!

algorithms = PrimaryAlgo(space, config['algorithms'])
# Sanitize 'algorithms' section in given configuration
config['algorithms'] = algorithms.configuration
return algorithms
Member

See previous comment

@@ -90,8 +90,8 @@ def suggest(self, num=1):

@abstractmethod
def observe(self, points, results):
"""Observe evaluation `results` corresponding to list of `points` in
space.
"""Observe the `results` of the evaluation of the `points` in the
Member

I would add a note similar to the one for judge and make the difference between the two methods clear.


"""
return self.categories
raise RuntimeError("Categories are not ordered. Use ``self.categories`` instead")
Member

I find ordered confusing in this context. I would simply say that Categories have no interval.

Also, elaborate more on its usage in docs.
Also, use 'norm' distributions instead of 'alpha' in tests as they are more popular.
Use the trial for debugging the reason why it failed to execute the script.

- Add explanatory comment on ``point`` argument on `BaseAlgorithm.judge`.
- Rename some Consumer's attributes to facilitate understanding the code.
@bouthilx
Member

bouthilx commented Mar 7, 2018

Good! Let me recap this tomorrow morning. It seems we are close to a merge. 🎉

@bouthilx
Member

bouthilx commented Mar 8, 2018

@tsirif Sorry I spent too much time on the survey and I'm lacking time now for the PR. I'll recap tomorrow afternoon (Friday).

@bouthilx
Member

bouthilx commented Mar 8, 2018

@tsirif By the way, it would be nice if you could make the change for random() -> choices() before tomorrow so the PR might be ready for merge at that point.

Member

@bouthilx bouthilx left a comment

Good overall. I opened issues for things I believe we can discuss and fix outside this PR. There are three things I'd like us to settle on before the PR is merged.

  1. The _Discrete class
  2. The way Categorical samples points (I find it overcomplicated)
  3. Whether Experiment._sanitize_and_instantiate_config should be in-place or not.

return size


class _Discrete(Dimension):
Member

I believe the _Discrete class is useless. The sample method could reside solely inside the Integer class. Categorical's sample method is straightforward using numpy.random.choice. Also, the interval method is overridden anyway by Categorical's to raise RuntimeError.

Although I can easily see how Integer and Categorical are related in theory, combining them gives no advantage in our context. It complicates Categorical without giving anything in return.

Member Author

😓

Member Author

I am not able to somehow pass a separate "sandboxed" seed without affecting the global numpy RNG; numpy.random.choice does not offer such an option. Temporarily saving the random state like this:

orig_state = numpy.random.get_state()
numpy.random.seed(seed)
# Do stuff with `numpy.random` here, e.g.:
samples = numpy.random.choice(self.categories, size=n_samples, p=self._probs)
numpy.random.set_state(orig_state)
return samples

I am not sure whether affecting the global numpy RNG is safe, if this is not your intention.

Member

What about setting a self.rng = numpy.random.RandomState() object inside the Dimension? We could then use rng.choice().
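
For illustration, a minimal sketch of this suggestion (a toy Categorical reusing the categories/_probs attributes shown in the diff; names here are assumptions, not the PR's actual implementation):

import numpy

class Categorical(object):
    """Toy stand-in for the real Categorical dimension (illustrative only)."""

    def __init__(self, categories, seed=None):
        self.categories = tuple(categories)
        self._probs = tuple(numpy.tile(1. / len(categories), len(categories)))
        # Per-dimension RNG: sampling never touches numpy's global random state.
        self.rng = numpy.random.RandomState(seed)

    def sample(self, n_samples=1):
        return list(self.rng.choice(self.categories, size=n_samples, p=self._probs))

Seeding the RandomState once per dimension keeps sampling reproducible without mutating the global RNG.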

Member Author

I will use it only inside Categorical. The rest of the Dimension classes use scipy.stats, which does this internally.

Member Author

I patched Categorical to use numpy.random.RandomState, via the check_random_state helper function.
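
For reference, a minimal sketch of what such a helper typically does (modeled on scikit-learn's check_random_state; the PR's actual helper may differ):

import numpy

def check_random_state(seed):
    """Return a numpy.random.RandomState built from `seed`.

    Accepts None (fresh RNG), an integer seed, or an existing RandomState,
    so callers can thread a seed through without touching the global RNG.
    """
    if isinstance(seed, numpy.random.RandomState):
        return seed
    return numpy.random.RandomState(seed)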

However, I found a difficulty in removing the _Discrete subclass. Remember that the _Discrete class was introduced as an alternative that removes the discrete keyword argument from Dimension. So the programming flow is currently as follows, and it takes advantage of Python's mechanics regarding class inheritance:

Currently:
Suppose a call to Integer(...).sample(...). In general, calling a method on an Integer object will first search within Integer, and super calls from there will look for the corresponding methods first in Real, then _Discrete, and then Dimension.

class Integer(Real, _Discrete):
    def interval(self, alpha=1.0):
        low, high = super(Integer, self).interval(alpha)  # Call to Real.interval
        # make them integers
        return int_low, int_high

    def __contains__(self, point):
        # Check if they are integers...
        return super(Integer, self).__contains__(point)  # Call to Dimension.__contains__


    # calling Integer(...).sample(...) will enter first Real.sample;
    # due to the order of subclasses
class Real(Dimension):
    def interval(self, alpha=1.0):
        prior_low, prior_high = super(Real, self).interval(alpha)  # Call to Dimension.interval
        return (max(prior_low, self._low), min(prior_high, self._high))  # Add constraints to interval  

    def sample(self, n_samples=1, seed=None):
        # Checks if constraint are met!
        samples = []
        for _ in range(n_samples):
            for _ in range(4):
                sample = super(Real, self).sample(1, seed)  # Calls _Discrete.sample
                if sample[0] not in self:  # Calls Integer.__contains__
                    # ...
            # ...
        # ...

        return samples
class _Discrete(Dimension):
    def sample(self, n_samples=1, seed=None):
        samples = super(_Discrete, self).sample(n_samples, seed)  # Calls Dimension.sample
        # Making discrete by ourselves because scipy does not use **floor**
        return list(map(lambda x: numpy.floor(x).astype(int), samples))  # Converts to integers

So the sequence of calls goes like this:

  • [Integer.sample (redirects)]
  • Real.sample (fulfills extra constraints)
    • _Discrete.sample (converts to integers)
      • Dimension.sample (samples from distribution)
    • Integer.__contains__ (checks if they are integers)
      • Dimension.__contains__ (check if they are within bounds)
        • Integer.interval (makes it int)
          • Real.interval (extra constraints)
            • Dimension.interval (asks the distribution for its support)

Member Author

If we remove _Discrete and move all of its methods into class Integer(Real), the flow will become:

So the sequence of calls goes like this:

  • Integer.sample (converts to integers)
    • Real.sample (fulfills extra constraints)
      • Dimension.sample (samples from distribution)
      • Integer.__contains__ (checks if they are integers)
        • Dimension.__contains__ (check if they are within bounds)
          • Integer.interval (makes it int)
            • Real.interval (extra constraints)
              • Dimension.interval (asks the distribution for its support)

Poor Dimension.sample will return whatever the scipy distribution returns, and the check in Integer.__contains__ will fail if the underlying scipy distribution corresponds to a real one... However, we should allow (I believe) discretizing real distributions, so that, for instance, someone could have approximately normally distributed integer stochastic variables. What do you think about this?
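
To make the discretization idea concrete, here is a self-contained sketch (plain scipy.stats, not the PR's code) of flooring samples from a real distribution to get approximately normally distributed integers:

import numpy
from scipy.stats import distributions

# Draw from a real-valued prior, then floor to integers, as _Discrete.sample does.
rng = numpy.random.RandomState(42)
real_samples = distributions.norm.rvs(loc=10, scale=2, size=5, random_state=rng)
int_samples = list(map(lambda x: numpy.floor(x).astype(int), real_samples))
print(int_samples)  # integer samples clustered around loc=10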

Member

I agree with the idea of discretizing Real distributions. However, I don't see the problem here: can't we just move _Discrete.sample() inside Integer.sample() rather than completely removing this method and relying solely on Real.sample()?

Member Author

Sorry for the confusion. I meant if we removed class _Discrete(Dimension) and moved all of its methods to class Integer(Real) (including sample), then:

... the sequence of calls goes like this:

  • Integer.sample (converts to integers)
    • Real.sample (fulfills extra constraints)
      • Dimension.sample (samples from distribution) -> this will possibly return samples in real numbers
      • Integer.__contains__ (checks if they are integers) -> it will fail because it checks if they are integers
        • Dimension.__contains__ (check if they are within bounds)
          • Integer.interval (makes it int)
            • Real.interval (extra constraints)
              • Dimension.interval (asks the distribution for its support)

Member

Oh yes indeed, I was confused. 😅

My proposition, then, would be to replace this confusing inheritance path with a simple separation of sample() and _sample(), where sample() performs the verifications over _sample(), which samples blindly.

So this line:

sample = super(Real, self).sample(1, seed)

becomes

sample = self._sample(1, seed)

where

class Real(Dimension):
    # ...
    def _sample(self, n_samples, seed=None):
        return super(Real, self)._sample(n_samples, seed)  # Calls Dimension._sample
    # ...

class Integer(Real):
    # ...
    def _sample(self, n_samples, seed=None):
        samples = super(Integer, self)._sample(n_samples, seed)  # Calls Real._sample
        return list(map(lambda x: numpy.floor(x).astype(int), samples))
    # ...

I think it is easier this way to track the path of execution in order to understand, modify and maintain the code.

We would now have:

  • Integer.sample (converts to integers)
    • Real.sample (fulfills extra constraints)
      • Integer._sample
        • Real._sample
          • Dimension._sample (samples from distribution)
      • Integer.__contains__
        • Dimension.__contains__ (check if they are within bounds)
          • Integer.interval (makes it int)
            • Real.interval (extra constraints)
              • Dimension.interval (asks the distribution for its support)

self.categories = tuple(categories)
self._probs = tuple(numpy.tile(1. / len(categories), len(categories)))

prior = distributions.rv_discrete(values=(list(range(len(self.categories))),
Member

I stand by my point here. I don't see the advantage of going through rv_discrete. Using numpy.random.choice would be simpler, without a need for any _inverse and _check.
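
A rough side-by-side of the two approaches (hypothetical snippet, reusing the categories/_probs construction from the diff above): rv_discrete only samples integer indices, which then have to be mapped back to category objects (presumably what the _inverse mentioned above is for), whereas RandomState.choice samples the categories directly.

import numpy
from scipy.stats import distributions

categories = ('adam', 'sgd', 'rmsprop')
probs = numpy.tile(1. / len(categories), len(categories))
rng = numpy.random.RandomState(0)

# Via rv_discrete: sample indices, then map them back to the category objects.
prior = distributions.rv_discrete(values=(list(range(len(categories))), probs))
indices = prior.rvs(size=4, random_state=rng)
samples_from_rv_discrete = [categories[i] for i in indices]

# Via RandomState.choice: sample the category objects directly, no index mapping.
samples_from_choice = list(rng.choice(categories, size=4, p=probs))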

@@ -283,6 +283,9 @@ def configure(self, config):
if self._init_done:
raise RuntimeError("Configuration is done; cannot reset an Experiment.")

# Sanitize and replace some sections with objects
candidate_algorithms = Experiment._sanitize_and_instantiate_config(config)
Member

@tsirif Can we settle on this one too?

Member

@bouthilx bouthilx left a comment

Good! I think we can remove commit 54e9c54 given that we will keep _Discrete. I'd rather leave both sample() and interval() inside it, then.


points = space.sample(2, seed=seed)
print(points)
Member

Print?

(points1[1], points2[1], points3[1])]
test_points = [(points1[0], points2[0], points3[0]),
(points1[1], points2[1], points3[1])]
print(test_points)
Member

Print?

@tsirif
Member Author

tsirif commented Mar 19, 2018

@bouthilx All is ok!

def __contains__(self, point):
"""Check if constraints hold for this `point` of `Dimension`.

:param x: a parameter corresponding to this `Dimension`.
Member

Missed those points

@bouthilx bouthilx merged commit e6e62d6 into Epistimio:master Mar 19, 2018
@tsirif tsirif deleted the feature/algorithm branch March 20, 2018 16:37