Add learning curves #332
Conversation
- Needed for learning curve generation.
- instead of a method of `Learner` because it needs to be picklable.
Reviewers, please actually test this PR on your machines (on multiple datasets, if possible) to make sure things work as expected. Thanks! I have added an example config file for the Titanic example to the repository if you need a place to start and play around.
- and coverage hopefully goes up.
Thanks @bndgyawali! Did you use the default values for the learning curve CV folds? Do you mind sharing your config file here? Also, do you expect the …
Here it is. I did use all the default values.
From @bndgyawali's report, it looks like this works pretty well on other datasets and with larger numbers of learners and objectives. I am currently trying to get the coverage issue resolved, which is a little annoying because we don't want to require `pandas` and `seaborn`.
You had mentioned that grid search is disallowed for this task; would it be better to show a warning message if …
Okay @bndgyawali @aloukina @dan-blanchard @benbuleong @cml54 @mheilman @aoifecahill @bwriordan @mulhod @dmnapolitano, this PR can now be considered complete, including changes to the documentation. You can now start reviewing and testing it on your own datasets to make sure things work as expected. Don't worry about the coverage decrease; it's only 0.003% compared to the last build and compared to …
@dan-blanchard will you have time to look at this PR this week?
Unfortunately, if the training set sizes are a bit larger, the numbers get squished together on the same line in the graph. Unrelated to this example, I noticed that the graph key appears only on the first graph if you provide multiple learners and/or multiple objectives. This might be intentional, obviously, but I was thinking that maybe it would be better if it were not part of the first graph and were instead placed at the top/sides/bottom when there are multiple graphs, if possible. This is the config I used in the example above:
@mulhod thanks for looking at the PR!
The system I used to conduct the experiment reported above was Ubuntu Linux:
I looked over all the changes and don't have much to say other than the comment I left separately (re: X axis labels). Looks good!
tests/test_output.py (outdated):

```diff
-# We itereate over each model with an expected
-# accuracy score. T est proves that the report
+# We iterate over each model with an expected
+# accuracy score. Test proves that the report
 # written out at least as a correct format for
```
"as" is meant to be "has", right?
It looks good to me. I could not find anything to say.
Some metrics like `r2` can be negative when model performance is really bad. The plots seem to set the minimum of the y-axis to 0.
If one supplies `learning_curve_train_sizes = [1.0, 100.0, 200.0]`, this raises an error since the numbers are interpreted as floats but are not within the (0, 1] range. This is how scikit-learn handles it, so we could either leave it as is or be user-friendly and add a check that converts such values to integers before passing them to the learning-curve functions.
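A minimal sketch of what such a conversion could look like, purely for illustration (the helper name `_normalize_train_sizes` is hypothetical and not part of SKLL; scikit-learn interprets floats as fractions in (0, 1] and integers as absolute training-set sizes):

```python
def _normalize_train_sizes(train_sizes):
    """Coerce float sizes > 1 to integers so scikit-learn treats them
    as absolute training-set sizes instead of rejecting them as
    out-of-range fractions."""
    return [int(size) if isinstance(size, float) and size > 1 else size
            for size in train_sizes]

print(_normalize_train_sizes([1.0, 100.0, 200.0]))  # [1.0, 100, 200]
```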
Thanks @aloukina! Very useful comments. Will incorporate.
The only problems that need fixing are:
- negative metrics (y-axis minimum)
- x-axis labels

The floats vs. ints issue is just a suggestion.
```python
                     .format(grid_objectives))

    # check whether the right things are set for the given task
    if (task == 'evaluate' or task == 'predict') and not test_path:
        raise ValueError('The test set must be set when task is evaluate or '
                         'predict.')
    if (task == 'cross_validate' or task == 'train') and test_path:
```
Why should these unused fields trigger a ValueError rather than a Warning?
Because if the user set `test_path`, maybe they wanted to do `evaluate` rather than `cross_validate` or `train` and forgot to change the task. The SKLL philosophy has always been to make no assumptions and let the user fix the config file, lest they run a really long experiment they didn't really want to.
My latest commits (a) automatically rotate the x-tick labels if any of the sizes is >= 1000, and (b) automatically generate the correct y-limits for the learning curves based on the metrics. Note that if you use the …

@aloukina and @mulhod, can you please re-run your respective tests to make sure that the labels and y-limits are okay?
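For reference, here is a minimal matplotlib sketch of the kind of adjustments described above, assuming a per-facet `Axes` object; this is illustrative only and not the actual SKLL plotting code (the function name `adjust_axes` is made up):

```python
import matplotlib.pyplot as plt

def adjust_axes(ax, train_sizes, scores):
    # (a) rotate x-tick labels when large sizes would otherwise overlap
    if any(size >= 1000 for size in train_sizes):
        for label in ax.get_xticklabels():
            label.set_rotation(45)
    # (b) extend the lower y-limit for metrics that can go negative (e.g. r2)
    ax.set_ylim(min(0, min(scores)), 1.0)

fig, ax = plt.subplots()
sizes, scores = [100, 1000, 5000], [-0.2, 0.4, 0.6]
ax.plot(sizes, scores, marker='o')
adjust_axes(ax, sizes, scores)
```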
Hmm, yeah I think we can do that. Let me see.
- and update its test.
Okay, I tweaked the plot generation code to hide unnecessary areas of the plot. Here's the new plot of your data, @mulhod. What do you think?
This PR adds a new task called `learning_curve`. This essentially ties in to a new `learning_curve()` method in the `Learner` class, which is adapted from the scikit-learn function `sklearn.model_selection.learning_curve()`. The reason I didn't just call the scikit-learn function directly is that it works with estimator objects and raw feature arrays; we want to apply the whole SKLL pipeline (feature selection, transformation, hashing, etc.) that the user has specified when computing the learning curve results, and so we need to use the SKLL API.

The output of the `learning_curve` task is a TSV file containing the training set size and the averaged scores for all combinations of featuresets, learners, and objectives. If `pandas` and `seaborn` are available, actual learning curves are generated as PNG files, one for each featureset. Each PNG file contains a faceted plot with objective functions on rows and learners on columns. Here's an example plot.

(Note: since grid search is disallowed, we don't really need to train the learner for each objective separately; we could simply train the learner once and then compute the scores using multiple functions. However, this doesn't fit into the current parallelization scheme that SKLL follows, so I didn't feel like changing that. The training jobs are run in parallel, so it's not that big a deal anyway.)
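For context, the underlying scikit-learn call being adapted looks roughly like this; it operates on a bare estimator and raw arrays, which is exactly why it can't apply the SKLL pipeline (the estimator choice and the random data below are just placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# placeholder data standing in for a real SKLL featureset
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

# returns absolute training-set sizes plus per-fold train/test scores
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(), X, y, cv=10,
    train_sizes=np.linspace(0.1, 1.0, 5))
```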