-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up the use of an external folds file #367
Conversation
1. Fix bug that led to using only a single job when CV folds were explicitly specified in local mode. 2. Change `MAX_CONCURRENT_PROCESSES` to 3 since that's what we use as the default number of grid jobs. This would also fix the issue when we were requesting more slots than necessary on the grid.
- Add a new `use_folds_file_for_grid_search` option that lets cross-validation experiments with specified folds bypass using the specified folds for the inner grid-search loop. - Refactor and rename some variables to make things cleaner.
- Pass through `use_folds_file_for_grid_search` in case we need to use it. - Reindent some long lines.
- Both `grid_search_folds` and `cv_folds` are now used as passed in from the config file. - Reindent and refactor some long lines.
- Test both `grid_search_folds` and `cv_folds` for both the `train` and `cross_validate` tasks. - Required some minor changes to `utils.py` and a template file.
- Only print out grid search related warnings when the user actually wants to do grid search.
- Update documentation for `folds_file` (formerly `cv_folds_file`). - Add documentation for new parameter `use_folds_file_for_grid_search`. - Add missing documentation for `save_cv_folds`.
- We want to hit the deprecation warning.
- This flag by default sets `grid_search_folds` to be the same as `cv_folds` if custom folds are specified. It can be set to False to prevent this.
@aoifecahill @dan-blanchard @Lguyogiro @aloukina this is now ready for review! The reason you can't see the checks is because I pushed a docstring change in the last commit and didn't want to build the whole thing again because of that. The green check mark in the last commit and the coverage increase should be all you need :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
skll/config.py
Outdated
@@ -455,7 +455,7 @@ def _parse_config_file(config_path): | |||
random_folds = config.getboolean("Input", "random_folds") | |||
if random_folds: | |||
if folds_file: | |||
logger.warning('Specifying folds_file overrides random_folds') | |||
logger.warning('Specifying \"folds_file\" overrides \"random_folds\".') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: No need for the slashes on these quotes, since it's "
inside '
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mainly have documentation comments and I highlighted a few situations where we should print a warning.
doc/run_experiment.rst
Outdated
keep the same folds for both the outer and the inner cross-validation loops. | ||
|
||
However, sometimes the goal of specifying the folds file is simply for the | ||
purposes of comparison to another existing experiment or another context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
purpose?
doc/run_experiment.rst
Outdated
|
||
Whether to use the specified :ref:`folds_file` for the inner grid-search | ||
cross-validation loop when :ref:`task` is set to ``cross_validate``. Ignored | ||
for all other tasks. Defaults to ``True``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add a note that this will always be set to True
for train
?
|
||
save_cv_folds *(Optional)* | ||
"""""""""""""""""""""""""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which loop folds would these be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whatever folds end up being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I have a cross-validate
task, would these be outer or inner loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is always the outer loop - the ones referred to by cv_folds
.
# ('train', None, 7, None) -> (None, 7) | ||
# ('train', None, 7, True) -> (None, 7) | ||
# ('train', None, 7, False) -> (None, 7) | ||
# ('train', 5, None, None) -> (None, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a warning in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would there need to be a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You ask for 5 folds but your request is ignored since you asked in the wrong place and you get 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is an issue of not reading the documentation so I personally think this is a user problem. PEBKAC :-p
tests/test_input.py
Outdated
# ('train', None, 7, True) -> (None, 7) | ||
# ('train', None, 7, False) -> (None, 7) | ||
# ('train', 5, None, None) -> (None, 3) | ||
# ('train', 5, None, True) -> (None. 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'.' -> ','
tests/test_input.py
Outdated
# ('train', 5, None, None) -> (None, 3) | ||
# ('train', 5, None, True) -> (None. 3) | ||
# ('train', 5, None, False) -> (None, 3) | ||
# ('train', 5, 7, None) -> (None, 7) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a warning too, I suppose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, not sure why we need a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because your request for 5 is silently ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 5 doesn't matter here because that's the value for cv_folds
which is not used for train
at all.
# ('train', 'train/folds_file_test.csv', None, None) -> (None, fold_mapping) | ||
# ('train', 'train/folds_file_test.csv', None, True) -> (None, fold_mapping) | ||
# ('train', 'train/folds_file_test.csv', None, False) -> (None, fold_mapping) | ||
# ('train', 'train/folds_file_test.csv', 7, None) -> (None, fold_mapping) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, your request for 7 is ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there's a warning for this already.
# ('cross_validate', None, None, True) -> (10, 3) | ||
# ('cross_validate', None, None, False) -> (10, 3) | ||
# ('cross_validate', None, 7, None) -> (10, 7) | ||
# ('cross_validate', None, 7, True) -> (10, 7) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warning?
# ('cross_validate', None, 7, True) -> (10, 7) | ||
# ('cross_validate', None, 7, False) -> (10, 7) | ||
# ('cross_validate', 5, None, None) -> (5, 3) | ||
# ('cross_validate', 5, None, True) -> (5, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think in this case the user will expect the same folds used for both loops so we should warn them?
# ('cross_validate', 'train/folds_file_test.csv', None, None) -> (fold_mapping, fold_mapping) | ||
# ('cross_validate', 'train/folds_file_test.csv', None, True) -> (fold_mapping, fold_mapping) | ||
# ('cross_validate', 'train/folds_file_test.csv', None, False) -> (fold_mapping, 3) | ||
# ('cross_validate', 'train/folds_file_test.csv', 7, None) -> (fold_mapping, fold_mapping) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aloukina if you look in the code, I had already added some new warnings in config.py
. This is a test, so not sure why we need warnings here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need them in test of course - it's just easier to see all places with warnings when you have all combinations written out with expected behaviour.
Since you know the code better, you know where the warnings are already in place. For others we could just add an issue to add them in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and I think I already have the warnings where I think they are really important.
- Add note to new flag about train task. - Improve references to other options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
skll/config.py
Outdated
# then use that mapping for grid search instead of the value | ||
# contained in `grid_search_folds`. | ||
# (b) if the task is `cross_validate` and an external fold mapping is specified and the | ||
# then use that mapping for the outer CV loop. Depending on the value of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
- We were not actually passing in the flag to `cross_validate()` as we should be.
- And reorder the condition so that reading it makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this on one of the datasets that prompted me to file the original issue and observed a noticeable decrease in the time taken to run an experiment. Thanks!
Fix the bug that was causing the inner grid-search loop of a cross-validation experiment to use a single job instead of the number specified via
grid_search_jobs
.Rename the
cv_folds_file
config file option tofolds_file
and update the documentation to reflect that it is applicable to both the training and the cross-validation experiments..Add a new config file option called
use_folds_file_for_grid_search
which controls whether the inner-loop grid-search uses the folds specified viafolds_file
. It's set toTrue
by default which means. Setting it toFalse
means that the inner loop uses the value ofgrid_search_folds
(3 by default). This option is only used in case of cross-validation experiments and is ignored otherwise. Add documentation for this option.Also add a new keyword argument called
use_custom_folds_for_grid_search
to thelearner.cross_validate()
method. This flag by default setsgrid_search_folds
to be the same ascv_folds
if custom folds are specified. It can be set toFalse
to prevent this. This is required to handle the case whencross_validate()
is directly called via the API, by passing the configuration parser.Add a new comprehensive test to test the setting the values of
cv_folds
andgrid_search_folds
in the configuration file based on the above options.Refactor tests to deal with the renaming of
cv_folds_file
tofolds_file
.Add documentation for the
save_cv_folds
option which we had forgotten to do when the option was added back in the day.Minor re-indentation of some part of the code for readability.
(Note: See #363 for detailed timing experiments.)