automatic learning curves #221

Closed
mheilman opened this issue Dec 30, 2014 · 4 comments

@mheilman
Contributor

It would be nice to have functionality to automatically run experiments with training sets of different sizes, so that performance can be plotted as a function of training sample size.

Perhaps this could be a separate experiment type (e.g., like "evaluate"). This probably does not make sense for cross-validation.

Perhaps there should be an option not to save all the models, etc., since there could be hundreds.

Possible configuration options:

  • a list of sample sizes to consider (the default could be powers of 2 starting at 32 or 64)
  • the number of replications per sample size (this could default to 1, but higher values would produce smoother learning curves)
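
As a rough illustration of the two options above, here is a minimal sketch using only the standard library. The function names (`default_sample_sizes`, `sample_training_subsets`) and the `seed` parameter are hypothetical and not part of any existing API; the sketch assumes the default sizes are powers of 2 starting at 32, capped at the number of available training examples.

```python
import random


def default_sample_sizes(n_examples, start=32):
    """Return powers of 2 from `start` up to `n_examples` (hypothetical default)."""
    sizes = []
    size = start
    while size <= n_examples:
        sizes.append(size)
        size *= 2
    return sizes


def sample_training_subsets(n_examples, sample_sizes=None, replications=1, seed=0):
    """Yield (size, replication, indices) for every sample size and replication.

    Each replication draws a fresh random subset of training indices without
    replacement, so averaging over replications smooths the resulting curve.
    """
    if sample_sizes is None:
        sample_sizes = default_sample_sizes(n_examples)
    rng = random.Random(seed)
    for size in sample_sizes:
        for rep in range(replications):
            yield size, rep, rng.sample(range(n_examples), size)


if __name__ == "__main__":
    for size, rep, indices in sample_training_subsets(300, replications=2):
        print(size, rep, len(indices))
```

Each yielded index list would then be used to train one model and evaluate it on a fixed test set. A sample size larger than the training set raises a `ValueError` from `random.sample`, so user-supplied sizes would need validating first.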

(This was also briefly mentioned in the discussion of #212.)

@desilinguist desilinguist added this to the 1.2 milestone Jul 18, 2015
@aoifecahill aoifecahill modified the milestones: 2.0, 1.2 Feb 12, 2016
@desilinguist
Member

This should use the learning curves feature from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.learning_curve.learning_curve.html
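
For reference, a minimal sketch of calling scikit-learn's `learning_curve` directly. Note that the link above points at the old `sklearn.learning_curve` module; in more recent scikit-learn releases the function is imported from `sklearn.model_selection`, which is what this sketch assumes. The synthetic dataset, the `LogisticRegression` estimator, and the specific training sizes are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data, used here only so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=[32, 64, 128, 256],  # absolute numbers of training examples
    cv=10,                           # folds used for averaging at each size
    scoring="accuracy",
)

# Both score arrays have shape (n_train_sizes, n_cv_folds).
for size, mean_score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{int(size):4d} training examples: mean CV accuracy {mean_score:.3f}")
```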

@desilinguist
Member

@langep, has there been any progress on this issue?

@desilinguist desilinguist self-assigned this Dec 2, 2016
@desilinguist desilinguist modified the milestones: 1.3, 2.0 Dec 2, 2016
@desilinguist
Member

desilinguist commented Jan 13, 2017

I'd like to include this in the upcoming v1.3.

Here's how I am thinking about this:

  1. Have a new experiment type called learning_curves which will compute the learning curve using scikit-learn's built-in method and save the output as a CSV in the results directory. It will also save the actual learning curve plot as a PNG in the results directory if matplotlib is available.
  2. The models for the various training sizes will not be saved.
  3. Users will be able to specify the various training sizes and the number of cross-validation iterations to be used for averaging.
  4. We generally want at least 10 folds of cross-validation to get a smooth learning curve, so grid search will not be allowed within each fold, as that would make it too slow.
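
To make point 1 concrete, here is a minimal sketch of the output step, assuming the per-fold scores have already been computed (for example by scikit-learn's `learning_curve`). The function name `save_learning_curve`, the CSV column names, and the file names are all hypothetical and do not describe the actual implementation. The plot is written only if matplotlib can be imported.

```python
import csv
import os
import statistics


def save_learning_curve(train_sizes, train_scores, test_scores, results_dir,
                        name="learning_curve"):
    """Write per-size mean scores to a CSV and, if matplotlib is available, a PNG.

    `train_scores` and `test_scores` hold one sequence of per-fold scores for
    each entry in `train_sizes`. Returns (csv_path, png_path or None).
    """
    os.makedirs(results_dir, exist_ok=True)
    rows = [
        {
            "training_size": size,
            "train_score_mean": statistics.mean(train),
            "test_score_mean": statistics.mean(test),
            "test_score_std": statistics.pstdev(test),
        }
        for size, train, test in zip(train_sizes, train_scores, test_scores)
    ]

    csv_path = os.path.join(results_dir, name + ".csv")
    with open(csv_path, "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    try:
        import matplotlib
        matplotlib.use("Agg")  # non-interactive backend; no display required
        import matplotlib.pyplot as plt
    except ImportError:
        return csv_path, None  # plotting is optional

    png_path = os.path.join(results_dir, name + ".png")
    sizes = [row["training_size"] for row in rows]
    fig, ax = plt.subplots()
    ax.plot(sizes, [row["train_score_mean"] for row in rows], marker="o", label="train")
    ax.plot(sizes, [row["test_score_mean"] for row in rows], marker="o",
            label="cross-validation")
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Score")
    ax.legend()
    fig.savefig(png_path)
    plt.close(fig)
    return csv_path, png_path
```

No models are written to disk here, which matches point 2: only the aggregated scores and the plot are kept.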

Since this will be a new feature, I'd like to solicit input from all of you: @dan-blanchard @dmnapolitano @mheilman @bndgyawali @mulhod @benbuleong @aoifecahill @aloukina @cml54 @bwriordan.

Thanks!

@desilinguist
Member

Addressed by #332.

4 participants