
Add learning curves #332

Merged: 23 commits merged from feature/add-learning-curves into master on Feb 8, 2017
Conversation

desilinguist (Member)

  • Addresses automatic learning curves #221.
  • The way that this works is by having a new task type called learning_curve. This ties in to a new learning_curve() method in the Learner class, which is adapted from the scikit-learn function sklearn.model_selection.learning_curve(). The reason I didn't just call the scikit-learn function directly is that it works with estimator objects and raw feature arrays. We want to apply the whole SKLL pipeline (feature selection, transformation, hashing, etc.) that the user has specified when computing the learning curve results, so we need to use the SKLL API.
  • The process of computing the curve is as follows: only a training set is required. For each point on the learning curve, the training set is split into two partitions, 80/20. The learner is trained on the subset of the 80% partition corresponding to that point of the curve and then evaluated on the 20% partition. This is repeated multiple times (using multiple different 80/20 partitions) and the results are averaged, which gives us the score for that point. The whole process is then repeated for each point on the curve. (A rough sketch of this procedure appears after the note below.)
  • I consider the learning curve task to be orthogonal to ablation and finding the right hyper-parameters. Therefore, ablation and grid search are not allowed. Just like for the cross-validation task, no models are saved for this task.
  • Users can specify the various training set sizes and the number of 80/20 partitions for each point in the curve (if they don't, there are reasonable defaults for both).
  • Users can also specify the number of cross-validation iterations to be used for averaging the results for a given training set size.
  • The output of the learning_curve task is a TSV file containing the training set size and the averaged scores for all combinations of featuresets, learners, and objectives. If pandas and seaborn are available, actual learning curves are generated as PNG files - one for each feature set. Each PNG file contains a faceted plot with objective functions on rows and learners on columns. Here's an example plot.
    [example faceted learning curve plot: foo_example_iris]

(Note: since grid search is disallowed, we don't really need to train the learner for each objective separately; we could simply train the learner once and then compute the scores using multiple functions. However, this doesn't fit into the current parallelization scheme that SKLL follows and so I didn't feel like changing that. The training jobs are run in parallel so it's not that big a deal anyway.)
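
For reviewers who want a quick mental model, here is a minimal sketch of the per-point computation using plain scikit-learn objects and a toy accuracy metric. This is not the actual SKLL implementation (which operates on FeatureSet objects and applies the configured featurization pipeline); the function name and default sizes below are purely illustrative.

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit

def sketch_learning_curve(estimator, X, y,
                          train_sizes=(0.1, 0.325, 0.55, 0.775, 1.0),
                          n_splits=10, random_state=42):
    # Repeatedly split the data 80/20; for each point on the curve, train on
    # the requested fraction of the 80% partition, score on the 20% partition,
    # and average the scores across the repeated splits.
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.2,
                            random_state=random_state)
    mean_scores = []
    for fraction in train_sizes:
        scores = []
        for train_idx, test_idx in splitter.split(X):
            n_train = max(1, int(len(train_idx) * fraction))
            subset = train_idx[:n_train]
            model = clone(estimator).fit(X[subset], y[subset])
            scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
        mean_scores.append(np.mean(scores))
    return mean_scores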

desilinguist (Member, Author) commented Jan 23, 2017

Reviewers, please actually test this PR on your machines (on multiple datasets, if possible) to make sure things work as expected. Thanks! I have added an example config file for the Titanic example to the repository if you need a place to start and play around.

coveralls commented Jan 23, 2017

Coverage decreased (-1.5%) to 89.977% when pulling 8a86262 on feature/add-learning-curves into b3228e2 on master.

coveralls commented Jan 23, 2017

Coverage decreased (-1.8%) to 89.631% when pulling e7ea809 on feature/add-learning-curves into b3228e2 on master.

ghost commented Jan 23, 2017

I ran on a dataset and got this nice plot.
[screenshot: learning curve plot]

desilinguist (Member, Author) commented Jan 23, 2017

Thanks @bndgyawali! Did you use the default values for the learning curve CV folds? Do you mind sharing your config file here? Also, do you expect the Lasso curves to look like that?

ghost commented Jan 23, 2017

Here it is. I did use all the default values.

[General]
experiment_name=exp_name
task=learning_curve

[Input]
train_directory=path_to_feature_directory
featuresets = [['abc_feature'], ['def_feature']]
learners=['LinearRegression', 'LinearSVR', 'RandomForestRegressor', 'Lasso', 'Ridge', 'ElasticNet']
suffix=.csv
feature_scaling=both

[Tuning]
grid_search=true
objectives=[quadratic_weighted_kappa, linear_weighted_kappa, qwk_off_by_one]

[Output]
log=/tmp/skll_output
results=/tmp/skll_output

desilinguist (Member, Author)

From @bndgyawali's results, it looks like this works pretty well on other datasets and for a larger number of learners and objectives. I am currently trying to get the coverage issue resolved, which is a little annoying because we don't want to require matplotlib.

ghost commented Jan 23, 2017

You had mentioned that grid_search is disallowed; would it be better to show a warning message if grid_search=true?

coveralls commented Jan 24, 2017

Coverage increased (+0.2%) to 91.725% when pulling 389777b on feature/add-learning-curves into b3228e2 on master.

desilinguist (Member, Author) commented Jan 24, 2017

Okay @bndgyawali @aloukina @dan-blanchard @benbuleong @cml54 @mheilman @aoifecahill @bwriordan @mulhod @dmnapolitano , this PR can now be considered complete including changes to the documentation. You guys can now start reviewing and testing it on your own datasets to make sure things work as expected.

Don't worry about the coverage decrease. It's only 0.003% compared to the last build and compared to master, it's actually 0.2% higher :)

desilinguist (Member, Author)

@dan-blanchard will you have time to look at this PR this week?

mulhod (Contributor) commented Jan 30, 2017

[learning curve plot: call_henry_wer_bias (bias, length, wer)]

Unfortunately, if the training set sizes are a bit larger, the numbers get squished together on the same line in the graph.

Unrelated to this example, I noticed that the graph key is only on the first graph if you provide multiple learners and/or multiple objectives. This might be intentional, but I was thinking it might be better if the key were not part of the first graph and instead sat at the top/sides/bottom when there are multiple graphs, if possible.

This is the config I used in the example above:

[General]
experiment_name = call_henry_wer_bias
task = learning_curve

[Input]
train_directory = ~/skll_learning_curves/call_henry_wer_bias/features
id_col = id
label_col = y
featuresets = [["length", "bias", "wer"]]
learners = ['RescaledSVR']
ids_to_floats = False
suffix = .jsonlines
feature_scaling = both
fixed_parameters = [{"C": 100.0, "gamma": 0.001}]

[Tuning]
min_feature_count = 1
feature_scaling = none
objectives = ["unweighted_kappa"]

[Output]
log = ~/skll_learning_curves/call_henry_wer_bias/log
results = ~/skll_learning_curves/call_henry_wer_bias/results

desilinguist (Member, Author) commented Jan 30, 2017

@mulhod thanks for looking at the PR!

  1. Good call about the x-tick labels. I can rotate them so that they are easily visible.

  2. Yes, the legend is only on the first graph intentionally. I did try creating the legend outside of the plotting area, but it never really worked that well for me. I'll look at it again, but if it doesn't work, I think putting it in the first plot is a decent compromise. (See the sketch below.)
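
(For reference, the generic matplotlib approach for an outside legend looks roughly like the snippet below: take the handles from one subplot and attach a single figure-level legend. This is an illustrative sketch, not the SKLL plotting code, and the data and layout numbers are placeholders.)

import matplotlib
matplotlib.use("Agg")  # render to PNG without a display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 6), sharex=True)
for ax in axes.flat:
    ax.plot([100, 200, 300], [0.50, 0.60, 0.65], label="train")
    ax.plot([100, 200, 300], [0.40, 0.50, 0.55], label="test")

# Grab the handles/labels from any one subplot and place a single legend
# below all of the facets instead of inside the first one.
handles, labels = axes.flat[0].get_legend_handles_labels()
fig.legend(handles, labels, loc="lower center", ncol=2)
fig.tight_layout(rect=[0, 0.06, 1, 1])  # leave room at the bottom for the legend
fig.savefig("example_learning_curve.png")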

mulhod (Contributor) commented Jan 30, 2017

The system I used to conduct the experiment I reported on above was Linux Ubuntu:

$ uname -a
Linux frigga 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

mulhod (Contributor) left a comment

I looked over all the changes and don't have much to say other than the comment I left separately (re: X axis labels). Looks good!

# We itereate over each model with an expected
# accuracy score. T est proves that the report
# We iterate over each model with an expected
# accuracy score. Test proves that the report
# written out at least as a correct format for

Contributor (inline comment on the diff above)

"as" is meant to be "has", right?

ghost left a comment

It looks good to me. I could not find anything to say.

aloukina (Collaborator) commented Feb 2, 2017

Some metrics, like r2, can be negative when model performance is really bad. The plots seem to set the minimum of the y-axis to 0.

aloukina (Collaborator) commented Feb 2, 2017

If one supplies learning_curve_train_sizes = [1.0, 100.0, 200.0], this raises an error since the numbers are interpreted as floats but are not within the (0, 1] range. This is how scikit-learn handles it, so we could either leave it as is or be user-friendly and add a checking function that converts such values to integers before passing them to the learning_curve functions. A sketch of such a check is below.
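
Something along these lines, say (a hypothetical check_train_sizes helper, not actual SKLL code; the name and the handling of mixed values are illustrative):

def check_train_sizes(train_sizes):
    # Hypothetical helper: if every requested size is >= 1, treat the values
    # as absolute training-set sizes and cast them to int so that they are
    # not rejected as out-of-range floats; otherwise keep them as fractions
    # in the (0, 1] range, which is what learning_curve() expects.
    # Note that a lone 1.0 is ambiguous (one example vs. 100% of the data),
    # so a real implementation would need to document its convention.
    if all(size >= 1 for size in train_sizes):
        return [int(size) for size in train_sizes]
    if all(0 < size <= 1 for size in train_sizes):
        return [float(size) for size in train_sizes]
    raise ValueError("Do not mix fractional sizes in (0, 1] with absolute "
                     "sizes: {}".format(train_sizes))

print(check_train_sizes([1.0, 100.0, 200.0]))  # [1, 100, 200]
print(check_train_sizes([0.1, 0.5, 1.0]))      # [0.1, 0.5, 1.0]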

aloukina (Collaborator) commented Feb 2, 2017

[learning curve plot showing squished x-axis tick labels and negative r2 scores]
Same as Matt's comment above: when the number of different train sizes is reasonably large, the x-axis gets squished. The plot also shows what happens with negative r2.

desilinguist (Member, Author)

Thanks @aloukina! Very useful comments. Will incorporate.

aloukina (Collaborator) left a comment

The only problems that need fixing are:

  • Negative metrics
  • X axis

The floats vs. int issue is just a suggestion.

.format(grid_objectives))

# check whether the right things are set for the given task
if (task == 'evaluate' or task == 'predict') and not test_path:
    raise ValueError('The test set must be set when task is evaluate or '
                     'predict.')
if (task == 'cross_validate' or task == 'train') and test_path:

Collaborator (inline comment on the diff above)

Why should these unused fields trigger a ValueError rather than a Warning?

desilinguist (Member, Author) replied

Because if the user set the test_path, maybe they wanted to do evaluate rather than cross_validate or train and forgot to change the task. The SKLL philosophy has always been to make no assumptions and to let the user fix the config file, lest they run a really long experiment that they didn't actually want.

desilinguist (Member, Author) commented Feb 7, 2017

My latest commits (a) automatically rotate the x-tick labels if any of the sizes is >= 1000 and (b) automatically generate the correct y-limits for the learning curves based on the metrics.
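
Roughly the kind of matplotlib logic involved, shown as a standalone sketch rather than the actual SKLL plotting code; the metric-to-range mapping and function name here are illustrative:

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Illustrative theoretical ranges; None means unbounded on that side.
METRIC_RANGES = {"unweighted_kappa": (-1, 1),
                 "r2": (None, 1),
                 "accuracy": (0, 1)}

def plot_curve_point_means(train_sizes, mean_scores, metric, output_path):
    fig, ax = plt.subplots()
    ax.plot(train_sizes, mean_scores, marker="o")
    ax.set_xticks(train_sizes)
    # (a) rotate the x-tick labels when the sizes are large enough that
    #     horizontal labels would get squished together
    rotation = 45 if any(size >= 1000 for size in train_sizes) else 0
    ax.set_xticklabels([str(size) for size in train_sizes], rotation=rotation)
    # (b) clamp the y-limits to the metric's theoretical range, but do not
    #     show empty space far below the observed minimum score
    lower, upper = METRIC_RANGES.get(metric, (None, None))
    if lower is not None:
        lower = max(lower, min(mean_scores) - 0.05)
    ax.set_ylim(bottom=lower, top=upper)
    ax.set_xlabel("Training set size")
    ax.set_ylabel(metric)
    fig.savefig(output_path)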

Note that if you use the mean_squared_error metric for the learning curve, the curve has negative values because scikit-learn internally turns mean_squared_error into -1 * mean_squared_error so that it can be optimized just like any other function where higher is better. I will be submitting a PR that renames mean_squared_error to neg_mean_squared_error soon.
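
To see that sign convention in isolation (a standalone scikit-learn snippet, not SKLL code):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# A scorer built with greater_is_better=False multiplies the metric by -1 so
# that "higher is better" holds uniformly during model selection.
neg_mse = make_scorer(mean_squared_error, greater_is_better=False)
scores = cross_val_score(LinearRegression(), X, y, scoring=neg_mse, cv=5)
print(scores)  # all values are <= 0; multiply by -1 to recover the raw MSE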

@aloukina and @mulhod can you please re-run your respective tests to make sure that the labels and y-limits are okay?

coveralls commented Feb 7, 2017

Coverage increased (+0.004%) to 91.48% when pulling edb9e72 on feature/add-learning-curves into b3228e2 on master.

coveralls commented Feb 7, 2017

Coverage increased (+0.3%) to 91.75% when pulling 3d863a6 on feature/add-learning-curves into b3228e2 on master.

desilinguist (Member, Author)

Okay, I fixed the coverage issue. @aloukina and @mulhod, I am now waiting on you to test the changes and if everything looks good, I will merge.

mulhod (Contributor) commented Feb 7, 2017

[updated learning curve plot: call_henry_wer_bias (bias, length, wer)]

This looks good!

One nitpick: I understand why it's -1 to 1 for kappa on the y-axis, but is there a way to not make the minimum y-axis value -1 if none of the values are below 0?

desilinguist (Member, Author)

Hmm, yeah I think we can do that. Let me see.

desilinguist (Member, Author)

Okay, I tweaked the plot generation code to hide unnecessary areas of the plot. Here's the new plot of your data, @mulhod. What do you think?
[new learning curve plot: call_henry_wer_bias (bias, length, wer)]

coveralls commented Feb 8, 2017

Coverage increased (+0.3%) to 91.801% when pulling 0f5b3c4 on feature/add-learning-curves into b3228e2 on master.

desilinguist merged commit 108b1a1 into master on Feb 8, 2017.
desilinguist deleted the feature/add-learning-curves branch on February 8, 2017.