Avoid numerical errors and overflow. #34
Hello! I would like to merge this PR so that this library and Optuna work together better. May I ask a question about CI? The kurobako test seems to be failing on the step
I leave a concrete example, in which the current
# You can skip the following 2 lines if you have already installed the hpobench dataset.
$ wget http://ml4aad.org/wp-content/uploads/2019/01/fcnet_tabular_benchmarks.tar.gz
$ tar xf fcnet_tabular_benchmarks.tar.gz
$ dataset=fcnet_tabular_benchmarks/fcnet_protein_structure_data.hdf5
$ MIN_RESOURCE=1
$ N_BRACKETS=4
$ REDUCTION_FACTOR=3
$ N_RUN=1
$ BUDGET=80
$ echo -n >| ./solvers.json
$ echo -n >| ./problems.json
$ echo -n >| ./studies.json
$ kurobako problem hpobench "${dataset}" | tee -a ./problems.json
$ kurobako solver --name cma-es-median optuna \
--loglevel debug \
--sampler cma-es \
--pruner median \
--hyperband-min-resource ${MIN_RESOURCE} \
--hyperband-n-brackets ${N_BRACKETS} \
--hyperband-reduction-factor ${REDUCTION_FACTOR} \
| tee -a ./solvers.json
$ kurobako studies \
--solvers $(cat ./solvers.json) \
--problems $(cat ./problems.json) \
--repeats ${N_RUN} \
--budget ${BUDGET} \
| tee -a ./studies.json
$ cat ./studies.json | kurobako run --parallelism 10 > ./results/result.json
By executing these commands in the shell, I get the following error with some logging messages.
Thank you! I'll reproduce the problem.
GitHub Actions uploads the kurobako image to my Google Cloud Storage via the gsutil CLI, so a Google service account is required for authentication. However, due to a limitation of GitHub Actions, forked repositories seem unable to access GitHub secrets (refs: https://github.community/t/allow-secrets-to-be-shared-with-forks-from-trusted-actions/16525). Sorry, I have no idea how to resolve this issue for now. Please ignore the failures of the kurobako benchmark. I'll run kurobako on my laptop and paste the results here.
memo: I tried to reproduce the error with the following seed numbers (cmaes revision: 751a9bf).
script to reproduce
#!/bin/sh
set -ex
DATASET=fcnet_tabular_benchmarks/fcnet_protein_structure_data.hdf5
MIN_RESOURCE=1
N_BRACKETS=4
REDUCTION_FACTOR=3
N_RUN=1
BUDGET=80
SEED=${SEED:-0}
kurobako problem hpobench "${DATASET}" | tee -a ./problems.json
kurobako solver --name cma-es-median optuna \
--loglevel error \
--sampler cma-es \
--pruner median \
--hyperband-min-resource ${MIN_RESOURCE} \
--hyperband-n-brackets ${N_BRACKETS} \
--hyperband-reduction-factor ${REDUCTION_FACTOR} \
| tee -a ./solvers.json
kurobako studies \
--solvers $(cat ./solvers.json) \
--problems $(cat ./problems.json) \
--repeats ${N_RUN} \
--budget ${BUDGET} \
--seed ${SEED} \
| tee -a ./studies.json
cat ./studies.json | kurobako run --parallelism 10 > ./results.json
Hi @c-bata! I provide an example in which the current And, currently, script:
#!/bin/sh
echo -n >| ./solvers.json
echo -n >| ./problems.json
echo -n >| ./studies.json
set -ex
DATASET=fcnet_tabular_benchmarks/fcnet_protein_structure_data.hdf5
MIN_RESOURCE=1
N_BRACKETS=4
REDUCTION_FACTOR=3
N_RUN=1
BUDGET=80
SEED=${SEED:-0}
kurobako problem hpobench "${DATASET}" | tee -a ./problems.json
kurobako solver --name cma-es-median optuna \
--loglevel error \
--sampler cma-es \
--pruner median \
--hyperband-min-resource ${MIN_RESOURCE} \
--hyperband-n-brackets ${N_BRACKETS} \
--hyperband-reduction-factor ${REDUCTION_FACTOR} \
| tee -a ./solvers.json
kurobako studies \
--solvers $(cat ./solvers.json) \
--problems $(cat ./problems.json) \
--repeats ${N_RUN} \
--budget ${BUDGET} \
--seed ${SEED} \
| tee -a ./studies.json
cat ./studies.json | kurobako run --parallelism 10 > ./results.json
Thank you! I have reproduced the error now.
error log
Hi @HideakiImamura! I successfully reproduced the error, and this PR does not affect optimization efficiency. Basically LGTM, but I have a question about a magic number.
cmaes/cma.py (outdated)
if _log_sigma > 10 ** 2.8:
    self._sigma = np.exp(10 ** 2.8)
Could you tell me the meaning of this magic number?
Thanks for the comment. I just picked this magic number to avoid the overflow. It would be more reasonable to set it to the maximum float value. What do you think?
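For reference, a quick check of how the magic number relates to the float limit. This is only an illustrative sketch: the names magic_log_sigma and max_log_sigma are made up here and do not appear in the PR.

```python
import math
import sys

# Threshold used in the outdated patch (about 631).
magic_log_sigma = 10 ** 2.8

# Natural log of the largest finite float64 (about 709.78); exp() of
# anything clearly above this overflows.
max_log_sigma = math.log(sys.float_info.max)

print(magic_log_sigma)
print(max_log_sigma)

# exp(10 ** 2.8) is still a finite float, so the magic number is a
# conservative cap rather than the exact limit.
print(math.isfinite(math.exp(magic_log_sigma)))
```

So the magic number leaves some headroom below the point where the exponential actually overflows.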
Agree 👍
And I guess you can refactor these changes like self._sigma = np.exp(min(_log_sigma, ...)).
To avoid overflow, it seems this needs to be refactored like:
self._sigma = min(np.exp(_log_sigma), sys.float_info.max)
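A small sketch comparing the two clamping styles discussed in this thread may help. It is not the code from the PR: log_sigma is a hypothetical value chosen to trigger the overflow, and the log-space cap math.log(sys.float_info.max) is an assumption about what the "..." in the earlier suggestion could stand for.

```python
import math
import sys
import warnings

import numpy as np

# Hypothetical value, large enough that exp() exceeds the float64 range.
log_sigma = 1000.0

# Option A: clamp after exponentiating. np.exp overflows to inf and emits
# a RuntimeWarning; min() then replaces inf with the largest finite float.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    sigma_a = min(np.exp(log_sigma), sys.float_info.max)

# Option B: clamp in log space (assumed cap), so exp() only ever receives
# an argument at or below log(float max).
max_log_sigma = math.log(sys.float_info.max)
sigma_b = np.exp(min(log_sigma, max_log_sigma))

print(sigma_a, sigma_b)
```

Option A still performs the overflowing exponentiation and relies on min() to discard the inf, while option B avoids computing it in the first place. Both end up with a very large sigma.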
Thanks for the suggestion! I addressed it.
I'm sorry. With my suggestion, the error is raised again.
...
ValueError: Nan is detected: [nan nan nan nan nan nan]
/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/cmaes/cma.py:222: RuntimeWarning: invalid value encountered in sqrt
D = np.sqrt(D2)
[E 2020-06-22 05:06:10,985] Nan is detected: [nan nan nan nan nan nan]
Traceback (most recent call last):
File "/var/folders/9q/c1wp98sd4110kvnb89ycs7vj6xh7ks/T/.tmpLv0wtD", line 89, in <module>
runner.run()
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 172, in run
while self._run_once():
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 186, in _run_once
self._handle_ask_call(message)
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 214, in _handle_ask_call
trial = solver.ask(idg)
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/optuna.py", line 86, in ask
trial = self._create_new_trial()
(ALL) [00:01:37] [STUDIES 1787/1785 100%] [ETA 0s] canceled
(STUDY) [00:01:35] [STEPS 4033/8000 50%] [ETA 1m] "cma-es-median" "HPO-Bench-Protein"
(ALL) [00:01:37] [STUDIES 1788/1785 100%] [ETA 0s] canceled
...
Hmm...
Hmm... I changed the revision to 58e67ec,
$ git checkout 58e67ec
$ git status
HEAD detached at 58e67ec
Untracked files:
(use "git add <file>..." to include in what will be committed)
fcnet_tabular_benchmarks/
problems.json
reproduce.sh
results.json
solvers.json
studies.json
nothing added to commit but untracked files present (use "git add" to track)
$ python -m pip freeze | grep kurobako
kurobako==0.1.7
$ ./reproduce-error.sh
...
ValueError: Nan is detected: [nan nan nan nan nan nan]
/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/cmaes/cma.py:222: RuntimeWarning: invalid value encountered in sqrt
D = np.sqrt(D2)
[E 2020-06-22 05:11:06,425] Nan is detected: [nan nan nan nan nan nan]
Traceback (most recent call last):
File "/var/folders/9q/c1wp98sd4110kvnb89ycs7vj6xh7ks/T/.tmpMl1HVg", line 89, in <module>
runner.run()
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 172, in run
while self._run_once():
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 186, in _run_once
self._handle_ask_call(message)
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 214, in _handle_ask_call
trial = solver.ask(idg)
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/optuna.py", line 86, in ask
trial = self._create_new_trial()
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/optuna.py", line 172, in _create_new_trial
return optuna.trial.Trial(self._study, trial_id)
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/optuna/trial/_trial.py", line 67, in __init__
self._init_relative_params()
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/optuna/trial/_trial.py", line 77, in _init_relative_params
self.relative_params = self.study.sampler.sample_relative(
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/optuna/samplers/_cmaes.py", line 245, in sample_relative
raise ValueError("Nan is detected: {}".format(params))
ValueError: Nan is detected: [nan nan nan nan nan nan]
/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/cmaes/cma.py:222: RuntimeWarning: invalid value encountered in sqrt
D = np.sqrt(D2)
[E 2020-06-22 05:11:06,613] Nan is detected: [nan nan nan nan nan nan]
Traceback (most recent call last):
File "/var/folders/9q/c1wp98sd4110kvnb89ycs7vj6xh7ks/T/.tmpMl1HVg", line 89, in <module>
runner.run()
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 172, in run
while self._run_once():
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 186, in _run_once
self._handle_ask_call(message)
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 214, in _handle_ask_call
trial = solver.ask(idg)
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/optuna.py", line 86, in ask
(ALL) [00:01:43] [STUDIES 2117/2109 100%] [ETA 0s] canceled
(STUDY) [00:01:41] [STEPS 4097/8000 51%] [ETA 1m] "cma-es-median" "HPO-Bench-Protein"
/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/cmaes/cma.py:222: RuntimeWarning: invalid value encountered in sqrt
D = np.sqrt(D2)
[E 2020-06-22 05:11:07,187] Nan is detected: [nan nan nan nan nan nan]
Traceback (most recent call last):
File "/var/folders/9q/c1wp98sd4110kvnb89ycs7vj6xh7ks/T/.tmpMl1HVg", line 89, in <module>
runner.run()
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 172, in run
while self._run_once():
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 186, in _run_once
self._handle_ask_call(message)
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/__init__.py", line 214, in _handle_ask_call
trial = solver.ask(idg)
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/optuna.py", line 86, in ask
trial = self._create_new_trial()
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/kurobako/solver/optuna.py", line 172, in _create_new_trial
return optuna.trial.Trial(self._study, trial_id)
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/optuna/trial/_trial.py", line 67, in __init__
self._init_relative_params()
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/optuna/trial/_trial.py", line 77, in _init_relative_params
self.relative_params = self.study.sampler.sample_relative(
File "/Users/a14737/src/github.com/CyberAgent/cmaes/venv/lib/python3.8/site-packages/optuna/samplers/_cmaes.py", line 245, in sample_relative
(ALL) [00:01:43] [STUDIES 2117/2109 100%] [ETA 0s] canceled
Error: InvalidInput (cause; EOF while parsing a value at line 1 column 0)
HISTORY:
[0] at kurobako_core/src/epi/channel.rs:62 -- line=""
[1] at kurobako_core/src/epi/solver/external_program.rs:164
[2] at kurobako_core/src/epi/solver/embedded_script.rs:94
[3] at kurobako_solvers/src/optuna.rs:276
[4] at kurobako_core/src/solver.rs:176
[5] at kurobako_core/src/solver.rs:176
[6] at src/runner.rs:314
[7] at src/runner.rs:271
[8] at src/runner.rs:357
[9] at src/runner.rs:136
[10] at src/runner.rs:145
[11] at src/main.rs:85
Hmm... It seems the version of cmaes is wrong, because the current cmaes in this PR does not contain D = np.sqrt(D2) but contains D = np.sqrt(np.where(D2 < 0, 0, D2)). Could you re-install cmaes?
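To illustrate the difference between the two lines, here is a minimal sketch. The eigenvalue array D2 is made up for the example; in the library it would come from the eigendecomposition of the covariance matrix.

```python
import numpy as np

# Hypothetical eigenvalues: the tiny negative entry stands in for the
# rounding noise that an eigendecomposition can produce.
D2 = np.array([-1e-17, 0.5, 2.0])

# Old line: the square root of a negative entry is nan (NumPy would also
# emit "invalid value encountered in sqrt", silenced here).
with np.errstate(invalid="ignore"):
    D_unclipped = np.sqrt(D2)

# Line in this PR: negative entries are replaced by 0 before the sqrt.
D_clipped = np.sqrt(np.where(D2 < 0, 0, D2))

print(D_unclipped)
print(D_clipped)
```

If the RuntimeWarning from cma.py still points at D = np.sqrt(D2), the installed package predates the clipping change.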
Oh, thanks. I'll re-try it.
Thank you. The error no longer seems to be raised.
Now I'm checking a more simplified patch at #41.
Fix numerical overflow errors (simplified version of #34).
Motivation
In the current implementation of the CMA class, numpy.linalg.eigh is called, the square root of the eigenvalues is taken, and numpy.exp is called each time the tell function is called. For the first, due to numerical error in numpy.linalg.eigh, an eigenvalue may be negative and the result of the square root may be NaN. As for the second, numpy.exp may overflow when the variance of the objective function is very large. This PR aims to resolve the above problems.
Description of the changes