Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up the use of an external folds file #367

Merged
merged 31 commits into from
Oct 17, 2017

Conversation

desilinguist
Copy link
Member

@desilinguist desilinguist commented Oct 12, 2017

  1. Fix the bug that was causing the inner grid-search loop of a cross-validation experiment to use a single job instead of the number specified via grid_search_jobs.

  2. Rename the cv_folds_file config file option to folds_file and update the documentation to reflect that it is applicable to both the training and the cross-validation experiments..

  3. Add a new config file option called use_folds_file_for_grid_search which controls whether the inner-loop grid-search uses the folds specified via folds_file. It's set to True by default which means. Setting it to False means that the inner loop uses the value of grid_search_folds (3 by default). This option is only used in case of cross-validation experiments and is ignored otherwise. Add documentation for this option.

  4. Also add a new keyword argument called use_custom_folds_for_grid_search to the learner.cross_validate() method. This flag by default sets grid_search_folds to be the same as cv_folds if custom folds are specified. It can be set to False to prevent this. This is required to handle the case when cross_validate() is directly called via the API, by passing the configuration parser.

  5. Add a new comprehensive test to test the setting the values of cv_folds and grid_search_folds in the configuration file based on the above options.

  6. Refactor tests to deal with the renaming of cv_folds_file to folds_file.

  7. Add documentation for the save_cv_folds option which we had forgotten to do when the option was added back in the day.

  8. Minor re-indentation of some part of the code for readability.

(Note: See #363 for detailed timing experiments.)

1. Fix bug that led to using only a single job when CV folds were explicitly specified in local mode.
2. Change `MAX_CONCURRENT_PROCESSES` to 3 since that's what we use as the default number of grid jobs. This would also fix the issue when we were requesting more slots than necessary on the grid.
- Add a new `use_folds_file_for_grid_search` option that lets cross-validation experiments with specified folds bypass using the specified folds for the inner grid-search loop.
- Refactor and rename some variables to make things cleaner.
- Pass through `use_folds_file_for_grid_search` in case we need to use it.
- Reindent some long lines.
- Both `grid_search_folds` and `cv_folds` are now used as passed in from the config file.
- Reindent and refactor some long lines.
- Test both `grid_search_folds` and `cv_folds` for both the `train` and `cross_validate` tasks.
- Required some minor changes to `utils.py` and a template file.
- Only print out grid search related warnings when the user actually wants to do grid search.
 - Update documentation for `folds_file` (formerly `cv_folds_file`).
- Add documentation for new parameter `use_folds_file_for_grid_search`.
- Add missing documentation for `save_cv_folds`.
@coveralls
Copy link

coveralls commented Oct 12, 2017

Coverage Status

Coverage decreased (-0.2%) to 91.629% when pulling d466115 on fix/cv-folds-file-is-slower into 13e4982 on master.

- We want to hit the deprecation warning.
- This flag by default sets `grid_search_folds` to be the same as `cv_folds` if custom folds are specified. It can be set to False to prevent this.
@coveralls
Copy link

coveralls commented Oct 13, 2017

Coverage Status

Coverage increased (+0.1%) to 91.977% when pulling 3c30141 on fix/cv-folds-file-is-slower into 13e4982 on master.

@desilinguist
Copy link
Member Author

desilinguist commented Oct 13, 2017

@aoifecahill @dan-blanchard @Lguyogiro @aloukina this is now ready for review! The reason you can't see the checks is because I pushed a docstring change in the last commit and didn't want to build the whole thing again because of that. The green check mark in the last commit and the coverage increase should be all you need :)

Copy link
Contributor

@dan-blanchard dan-blanchard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

skll/config.py Outdated
@@ -455,7 +455,7 @@ def _parse_config_file(config_path):
random_folds = config.getboolean("Input", "random_folds")
if random_folds:
if folds_file:
logger.warning('Specifying folds_file overrides random_folds')
logger.warning('Specifying \"folds_file\" overrides \"random_folds\".')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: No need for the slashes on these quotes, since it's " inside '.

Copy link
Collaborator

@aloukina aloukina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly have documentation comments and I highlighted a few situations where we should print a warning.

keep the same folds for both the outer and the inner cross-validation loops.

However, sometimes the goal of specifying the folds file is simply for the
purposes of comparison to another existing experiment or another context
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

purpose?


Whether to use the specified :ref:`folds_file` for the inner grid-search
cross-validation loop when :ref:`task` is set to ``cross_validate``. Ignored
for all other tasks. Defaults to ``True``.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add a note that this will always be set to True for train?


save_cv_folds *(Optional)*
""""""""""""""""""""""""""

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which loop folds would these be?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever folds end up being used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I have a cross-validate task, would these be outer or inner loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is always the outer loop - the ones referred to by cv_folds.

# ('train', None, 7, None) -> (None, 7)
# ('train', None, 7, True) -> (None, 7)
# ('train', None, 7, False) -> (None, 7)
# ('train', 5, None, None) -> (None, 3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a warning in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would there need to be a warning?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You ask for 5 folds but your request is ignored since you asked in the wrong place and you get 3.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is an issue of not reading the documentation so I personally think this is a user problem. PEBKAC :-p

# ('train', None, 7, True) -> (None, 7)
# ('train', None, 7, False) -> (None, 7)
# ('train', 5, None, None) -> (None, 3)
# ('train', 5, None, True) -> (None. 3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'.' -> ','

# ('train', 5, None, None) -> (None, 3)
# ('train', 5, None, True) -> (None. 3)
# ('train', 5, None, False) -> (None, 3)
# ('train', 5, 7, None) -> (None, 7)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a warning too, I suppose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, not sure why we need a warning.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because your request for 5 is silently ignored.

Copy link
Member Author

@desilinguist desilinguist Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 5 doesn't matter here because that's the value for cv_folds which is not used for train at all.

# ('train', 'train/folds_file_test.csv', None, None) -> (None, fold_mapping)
# ('train', 'train/folds_file_test.csv', None, True) -> (None, fold_mapping)
# ('train', 'train/folds_file_test.csv', None, False) -> (None, fold_mapping)
# ('train', 'train/folds_file_test.csv', 7, None) -> (None, fold_mapping)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, your request for 7 is ignored.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there's a warning for this already.

# ('cross_validate', None, None, True) -> (10, 3)
# ('cross_validate', None, None, False) -> (10, 3)
# ('cross_validate', None, 7, None) -> (10, 7)
# ('cross_validate', None, 7, True) -> (10, 7)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning?

# ('cross_validate', None, 7, True) -> (10, 7)
# ('cross_validate', None, 7, False) -> (10, 7)
# ('cross_validate', 5, None, None) -> (5, 3)
# ('cross_validate', 5, None, True) -> (5, 3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think in this case the user will expect the same folds used for both loops so we should warn them?

# ('cross_validate', 'train/folds_file_test.csv', None, None) -> (fold_mapping, fold_mapping)
# ('cross_validate', 'train/folds_file_test.csv', None, True) -> (fold_mapping, fold_mapping)
# ('cross_validate', 'train/folds_file_test.csv', None, False) -> (fold_mapping, 3)
# ('cross_validate', 'train/folds_file_test.csv', 7, None) -> (fold_mapping, fold_mapping)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning?

Copy link
Member Author

@desilinguist desilinguist Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aloukina if you look in the code, I had already added some new warnings in config.py. This is a test, so not sure why we need warnings here?

Copy link
Member Author

@desilinguist desilinguist Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aloukina here are the new warnings that are already in the PR (each word is a link).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need them in test of course - it's just easier to see all places with warnings when you have all combinations written out with expected behaviour.
Since you know the code better, you know where the warnings are already in place. For others we could just add an issue to add them in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and I think I already have the warnings where I think they are really important.

Copy link
Contributor

@Lguyogiro Lguyogiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

skll/config.py Outdated
# then use that mapping for grid search instead of the value
# contained in `grid_search_folds`.
# (b) if the task is `cross_validate` and an external fold mapping is specified and the
# then use that mapping for the outer CV loop. Depending on the value of
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

@coveralls
Copy link

coveralls commented Oct 17, 2017

Coverage Status

Coverage increased (+0.1%) to 91.98% when pulling 36f156b on fix/cv-folds-file-is-slower into 13e4982 on master.

Copy link
Collaborator

@aoifecahill aoifecahill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this on one of the datasets that prompted me to file the original issue and observed a noticeable decrease in the time taken to run an experiment. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants