Multidimensional vw hyperparameter optimization with hyperopt #867

Merged
merged 1 commit into from Nov 30, 2015
Conversation

kirillfish

All functionality (and even a bit more) and almost all of the syntax is in line with @martinpopel 's suggestion here: https://github.com/martinpopel/vowpal_wabbit/wiki/vw-hyperopt-plans.
See also SO questions:
http://stackoverflow.com/questions/33262598/get-holdout-loss-in-vowpal-wabbit
http://stackoverflow.com/questions/33242742/multidimensional-hyperparameter-search-with-vw-hypersearch-in-vowpal-wabbit

Thank you @martinpopel @arielf for inspiration! I did it!

Unlike vw-hypersearch, vw-hyperopt.py can be multidimensional. It implements the Tree of Parzen Estimators (TPE) and random search algorithms from hyperopt. TPE uses an adaptive sampling strategy and addresses the explore-exploit dilemma. There is some research (http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf) showing that TPE may return a better hyperparameter configuration than manual expert choice or grid search.

Here is an example of using vw-hyperopt:

python vw-hyperopt.py --train train.dat --holdout holdout.dat --max_evals 200 --outer_loss_function logistic --vw_space '--algorithms=[ftrl,sgd] --l2=[1e-8..1e-4]LO --l1=[1e-8..1e-4]LO -l=[0.01..10]L --ftrl_alpha=[5e-5..5e-1]L --ftrl_beta=[0.01..1] --passes=[1..10]I -q=[SE+SZ+DR,SE]O --ignore=[T]O --loss_function=[logistic] -b=[29]'

The quoted part after the --vw_space flag means literally the same thing as @martinpopel described. It is converted to a hyperopt tree-like search space. There is even some additional functionality: you can list different combinations of quadratic features by separating namespace combinations with the "+" symbol (see the example).

Another new feature is that you can optimize hyperparameters with respect to a custom metric, such as ROC-AUC, which can differ from the inner vw loss function. This can be useful sometimes. The corresponding flag is --outer_loss_function. So far, only logistic (the default) and roc-auc are implemented, but this list can easily be expanded.
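Since hyperopt minimizes its objective, an outer ROC-AUC metric presumably has to be negated; a minimal sketch with scikit-learn (the function name is hypothetical, not taken from the module):

```python
# Sketch of an "outer" ROC-AUC loss computed from vw predictions
# (hypothetical helper, not the module's actual code).
from sklearn.metrics import roc_auc_score

def outer_roc_auc(y_true, raw_predictions):
    # AUC is rank-based, so raw vw margins can be scored directly
    # without applying a sigmoid first. Negate because hyperopt minimizes.
    return -roc_auc_score(y_true, raw_predictions)
```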

Yet another additional feature is that you can specify several algorithms at once. They will be converted to a hyperopt.hp.choice() object. Currently SGD and FTRL-Proximal are supported. Prohibited flags, such as --ftrl_alpha for SGD, are simply excluded from the hyperopt search space for that particular algorithm.

You can also specify the maximum number of hyperparameter combinations to explore with --max_evals (default=100).

The possible (non-critical, I hope) problems and ways to improve my module, as I see them now:

  1. Some difficulties with exploiting VW's native holdout error-handling functionality (see issue #859, "Model is not saved when --passes exceeds 2") motivated me to use separate train and holdout sets. But, if I understood correctly, the same approach was proposed here. So the dataset should be split manually before using vw-hyperopt.
  2. All training is now done with the --holdout_off flag in order to make use of all the training data, and also to make sure that the model is always saved (see the mentioned issue).
  3. In light of (1), supplementary files "current.model" and "holdout.pred" are created in the course of optimization (and deleted at the end). The validation metrics are computed from these files with scikit-learn methods.
  4. Regression problems are not yet supported. In the near future I'm going to add everything needed for them.
  5. You need to write "python" at the beginning of the command and to quote the search space part. I hope this doesn't complicate things too much.
  6. If the code needs to be better documented, I can do it.
  7. Plotting and visualization are still to be done. I'm going to implement them with matplotlib+seaborn.
  8. Maybe there is a better format for the final output. Now it writes something of this kind to stdout (and to a log file):
2015-11-25 00:06:20,853 INFO     [root/vw-hyperopt:290]:

A FULL TRAINING COMMAND WITH THE BEST HYPERPARAMETERS:
vw -d train.dat -f current.model --holdout_off -c  --passes 5 --ftrl_alpha 0.418819407315 -l 1.43698068436 -b 29 -q SE -q SZ -q DR --loss_function logistic --ftrl_beta 0.428875395523 --ftrl

2015-11-25 00:06:20,853 INFO     [root/vw-hyperopt:291]:

THE BEST LOSS VALUE:
0.0207917909491

JohnLangford added a commit that referenced this pull request Nov 30, 2015
Multidimensional vw hyperparameter optimization with hyperopt
@JohnLangford JohnLangford merged commit b18a81b into VowpalWabbit:master Nov 30, 2015
@JohnLangford
Member

This seems like an obviously good idea and it's entirely isolated so I
merged it. Thanks much.

Ariel has a number of good suggestions. I'm sure working through them
would help improve usability.

-John

P.S. If you happen to be going to NIPS, tell me and we'll squeeze you
into the VW tutorial (http://hunch.net/).


@kirillfish
Author

Hi Ariel, thank you for the response! Here are the answers to your comments:

  • Yes, I'll make it executable and commit again, thanks.
  • Logging vs printing. The main reason for logging is that hyperparameter tuning typically takes at least dozens of iterations and is a rather lengthy process, even for VW (when it comes to millions of observations). It can take hours or even days. So I thought it might be better if messages were stored somewhere on disk so that the user could access them later, in addition to being displayed on the screen (now they are displayed, too). I can remove this and leave only printing, if needed.
  • Ok, I'll make notifications lowercase.
  • As yet, too much information is printed to stdout to also display dots or a progress bar. Unless all the messages are suppressed, or the --quiet vw flag is added by default, progress dots will be lost in the clutter.
  • Default values. For the time being, only --max_evals, --searcher and --outer_loss_function have default values (100, 'tpe' and 'logistic', respectively). Maybe it's worth making them match the vw core defaults (that is, squared loss)? The search space has no default ranges now.
  • Search space definition syntax. Initially, I quoted the --vw_space value because it inevitably contains spaces, and quoting is the only way to make the python argparse module treat it as a single argument. The syntax with square brackets is taken from @martinpopel 's forked Vowpal Wabbit repository: https://github.com/martinpopel/vowpal_wabbit/wiki/vw-hyperopt-plans
    The purpose of [] is to separate the range of possible values from the distribution options (such as L, I, O). For instance, -q=[SE+SZ+DR,SE]O translates as "take the SE quadratic feature, or the SE, SZ and DR quadratic features, and also try omitting this flag altogether". The letters L and I denote log-uniform and integer-valued uniform prior distributions, respectively (if there is a .., of course). An absence of letters means a real-valued uniform prior distribution. If there were no "]", the letters would fuse with the range of values and become difficult to parse. Maybe I'd better try other separators, such as a tilde, so ranges look like -l=0.01..10~L ?
  • Treatment of the --algorithms option. The --algorithms flag really is parsed differently. The reason is that I wanted to make completely separate search spaces for the different algorithms. What would happen if --algorithms were treated as an ordinary discrete-valued hyperparameter? On the one hand, discrete values are treated as completely uncorrelated by hyperopt, so in principle an independent sub-space should be created for each such value. On the other hand, points would be drawn from one single pool of hyperparameters on each iteration, regardless of the algorithm. So for a given algorithm there can be "prohibited" dimensions in the search space. hyperopt would be misled and would try to elicit the contribution of --ftrl_beta to SGD performance on previous steps when deciding which point to sample next. This may reduce efficiency and increase the number of steps required to discover a good configuration. That's why I didn't treat the --algorithms option this way.
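The bracket syntax described above could be parsed with something along these lines (an illustrative sketch of the assumed grammar, not the module's actual parser):

```python
# Illustrative parser for specs like "-l=[0.01..10]L" or "-q=[SE+SZ+DR,SE]O"
# (assumed grammar, not vw-hyperopt's actual parsing code).
import re

SPEC = re.compile(r"^(?P<flag>-{1,2}\S+?)=\[(?P<body>[^\]]+)\](?P<opts>[LIO]*)$")

def parse_spec(spec):
    m = SPEC.match(spec)
    if not m:
        raise ValueError("bad spec: %s" % spec)
    body, opts = m.group("body"), set(m.group("opts"))
    if ".." in body:                 # numeric range, e.g. 0.01..10
        low, high = body.split("..")
        value = (float(low), float(high))
    else:                            # comma-separated discrete choices
        value = body.split(",")
    return {"flag": m.group("flag"),
            "value": value,
            "log": "L" in opts,       # log-uniform prior
            "integer": "I" in opts,   # integer-valued uniform prior
            "omittable": "O" in opts} # also try omitting the flag
```

With a tilde separator as proposed, only the regex would change; the suffix letters would still be unambiguous because they could no longer fuse with the range.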

Nevertheless, it seems like I can unify the command-line syntax. Intuitively, the expression --algorithms=,--ftrl should be written as --algorithms=[--ftrl]O following the logic just described, that is, "try --ftrl or the default algorithm". But now it is written as --algorithms=[ftrl,sgd] (sgd converts not to --sgd, but to the default algorithm with AdaGrad etc.). Omitting --algorithms is equivalent to writing --algorithms=[sgd].

So, should I: (a) change the separator between the range and the distribution options? (b) make the --algorithms usage syntax the same as for other flags? (c) something else?

On top of this, hyperopt allows saving any useful information in the course of optimization. For instance, it may be useful to make diagnostic plots by default, such as (iteration number) vs. (the best score obtained by that iteration). Train loss could also be tracked and plotted on the same axes to evaluate overall variance. Should I add such functionality?

@kirillfish
Author

@JohnLangford @arielf Thank you for merging!

I'll work through your suggestions and ideas about improving it.
