This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

Add cross validation function #114

Closed
theotherphil opened this issue Aug 1, 2016 · 12 comments

Comments

@theotherphil
Contributor

At least a naïve version, with basic reporting. Ideally also a fast version for models admitting a monoidal structure, as in https://github.com/mikeizbicki/HLearn/blob/master/README.md.
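
For reference, a minimal sketch of what the naive k-fold split could look like (fold arithmetic only; the names here are illustrative, not the eventual rusty-machine API):

```rust
use std::ops::Range;

/// Naive k-fold split over `n` samples: each fold's test rows are a
/// contiguous range and the training indices are copied out per fold.
/// When `n` is not divisible by `k`, the trailing remainder rows are
/// simply never used as test rows.
fn k_fold_indices(n: usize, k: usize) -> Vec<(Range<usize>, Vec<usize>)> {
    let fold_size = n / k;
    (0..k)
        .map(|fold| {
            let test = fold * fold_size..(fold + 1) * fold_size;
            // Everything outside the test range is used for training.
            let train = (0..n).filter(|i| !test.contains(i)).collect();
            (test, train)
        })
        .collect()
}

fn main() {
    for (test, train) in k_fold_indices(10, 5) {
        println!("test: {:?}, train: {:?}", test, train);
    }
}
```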

@theotherphil
Contributor Author

I'd like to have a go at this. Please let me know if you have any existing plans for validation tools.

@AtheMathmo
Owner

Thanks for this!

I still haven't figured out how this should look, but I agree that even a naive version would be good! I've had a couple of ideas that I'd be happy to share - but feel free to implement this as you see fit and we can work together to get the best approach.

The only requirement from me would be to try not to produce too much memory overhead. If we can manipulate the data in-place and use MatrixSlice it would be easier to extend this later. (Note that the models currently take only Matrix as input, but we're working to fix that and similar issues.)

Let me know if you need anything as you're working on this. If there's any missing functionality you need I'll do my best to get it in quickly.

@theotherphil
Contributor Author

Thanks, will keep that in mind.

@AtheMathmo
Owner

From #124 by @theotherphil:

I've implemented a very WIP version of cross validation, but:

  1. It's hard-coded to only accept models with inputs and targets of type Matrix.
  2. This means it's impossible to write a non-copy version, as there's no way of selecting a random subset of rows and passing them to the train or predict functions of these models.

Options:

  1. Switch models to using Iterator<Item=&[f64]> (which can be implemented for Matrix if it's not already).
  2. Accept that cross-validation will involve copying.

If we go with option 2, we can still either avoid any allocation by shuffling the input rows in place (this sounds like a bad idea), or limit ourselves to allocating a single array of size (k-1)/k * input data size. Maybe the latter option is acceptable?

Datasets also need to be split like this when training random forests (and presumably for some other learning algorithms), which might be an argument in favour of 1. Although it's possible that copying data around, and even allocating, might end up cheaper overall in some cases due to improved locality. Maybe?
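
(A rough sketch of option 2's single allocation, for concreteness: copying the training rows of each fold into one fresh buffer of roughly (k-1)/k the input size. `copy_rows` is a hypothetical helper over a row-major buffer, not rusty-machine API.)

```rust
/// Copy the rows of a row-major `data` buffer whose indices appear in
/// `train_indices` into one new allocation (option 2: accept the copy).
fn copy_rows(data: &[f64], row_len: usize, train_indices: &[usize]) -> Vec<f64> {
    let mut out = Vec::with_capacity(train_indices.len() * row_len);
    for &i in train_indices {
        out.extend_from_slice(&data[i * row_len..(i + 1) * row_len]);
    }
    out
}

fn main() {
    // 4 rows of length 3; rows 0, 2 and 3 form the training set.
    let data: Vec<f64> = (0..12).map(|x| x as f64).collect();
    let train = copy_rows(&data, 3, &[0, 2, 3]);
    assert_eq!(train, vec![0., 1., 2., 6., 7., 8., 9., 10., 11.]);
}
```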


There are some tricky issues here (some of which are why I was nervous to push on with this myself) but I think we can get a decent solution.

Will issue 1 be solved if we get associated types? If not, what is preventing us from allowing Vectors? Is it just the fact that Vectors are less well supported in rulinalg?

With regards to the options you proposed:

  1. This might be an option but I'm not too sure. I feel as though many of the models benefit from assuming locality of the data. After PR #117 ([Breaking Change] update rulinalg to v0.3.0) lands we will hopefully have some better ways to handle model inputs - though I'm not sure it will be enough to handle this in the way you'd like.
  2. This might be the better option - you can limit the overhead somewhat by utilizing the select_rows function (see the sketch after this list). Shuffling the rows in place might also be tolerable, depending on how you've constructed the cross validation. For example, if cross validation consumes the input data I think it is OK to let it mutate the data?
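
(For concreteness, a usage sketch of select_rows as discussed here - it takes a `&[usize]` of row indices and copies the selected rows into a new Matrix. This is hedged against rulinalg 0.3; the API may differ in other versions.)

```rust
extern crate rulinalg;

use rulinalg::matrix::Matrix;

fn main() {
    let data = Matrix::new(4, 2, vec![1., 2., 3., 4., 5., 6., 7., 8.]);
    // Treat row 1 as the test fold; train on the remaining rows.
    let train = data.select_rows(&[0, 2, 3]);
    assert_eq!(train, Matrix::new(3, 2, vec![1., 2., 5., 6., 7., 8.]));
}
```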

I'm still woefully unfamiliar with random forests so can't really comment intelligently on the last part. If this is a requirement then I would say one approach for this would be to have a Transformer which shuffles the input data rows, and then allow the model to take MatrixSlices as inputs (which is easier to achieve than iterators). This way a random forest user can shuffle their input data easily and slice it to feed it in part-by-part.
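
(A rough sketch of that flow - shuffle once, then feed contiguous windows - assuming rulinalg's MatrixSlice::from_matrix constructor taking a start index and dimensions; the exact API may differ by version.)

```rust
extern crate rulinalg;

use rulinalg::matrix::{Matrix, MatrixSlice};

fn main() {
    // Pretend a shuffling Transformer has already permuted the rows.
    let data = Matrix::new(6, 2, (0..12).map(|x| x as f64).collect::<Vec<f64>>());

    // Feed the model contiguous two-row windows, part-by-part, without copying.
    for start in 0..3 {
        // `_chunk` borrows rows 2*start .. 2*start + 2 of `data`; a model
        // taking MatrixSlice inputs could train on it with no copy.
        let _chunk = MatrixSlice::from_matrix(&data, [start * 2, 0], 2, 2);
    }
}
```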

@theotherphil
Contributor Author

Switching from type parameters to associated types won't help with 1. Using Vector<&[f64]> would, but this has the same non-locality issue you object to for using Iterators. I don't think the switch from Matrix to MatrixSlice will help either, as it looks like those only let you select contiguous rows (presumably because otherwise linear algebra routines would be a nightmare).

I'll go with the second option for now. I wasn't aware of select_rows, thanks. It's currently not quite general enough - I only have an Iterator<Item=&usize> rather than a &[usize] (as the set of indices to train with after removing a fold is no longer contiguous). I could provide an ExactSizeIterator though, so I'll create a pull request to support that.

As you might have guessed from how many times I've mentioned them vs how many times I've mentioned anything else, random forests are about the only machine learning algorithm I do know. No changes to the rest of this library are required to implement them - I was just mentioning them as another example of where choosing arbitrary subsets of rows was useful.

@AtheMathmo
Owner

it looks like those only let you select contiguous rows

Yes this is true for the reason you mention (that and optimization/compatibility with other libraries).

I only have an Iterator<Item=&usize> rather than a &[usize]

You could also take a look at learning::toolkit::rand_utils which contains things like in_place_fisher_yates. Though it's possible I am misunderstanding exactly what you need.
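
(For context, this is the shape of a Fisher-Yates shuffle over row indices, written from scratch with a toy RNG so the snippet is dependency-free; the actual in_place_fisher_yates signature may differ.)

```rust
/// A from-scratch Fisher-Yates shuffle, mirroring what
/// learning::toolkit::rand_utils::in_place_fisher_yates presumably does.
/// The toy linear congruential generator keeps the example dependency-free;
/// real code would draw from the `rand` crate instead.
fn fisher_yates<T>(arr: &mut [T], seed: u64) {
    let mut state = seed;
    for i in (1..arr.len()).rev() {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Pick j from 0..=i (uniform enough for a demo) and swap.
        let j = (state >> 33) as usize % (i + 1);
        arr.swap(i, j);
    }
}

fn main() {
    let mut indices: Vec<usize> = (0..10).collect();
    fisher_yates(&mut indices, 42);
    println!("shuffled row order: {:?}", indices);
}
```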

As for random forests I'd be happy to see a PR to support them if you ever have the time/desire!

@theotherphil
Contributor Author

I'm already using your in_place_fisher_yates :-). The issue is that after shuffling the input indices I iterate over (test_indices, training_indices) tuples to train and validate for each fold. test_indices is a slice, but training_indices is a pair of slices (stuff before the test set, then stuff after). It should be easy to make select_rows (and select_cols) more general without breaking any existing code, so I'll give that a go.
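
(To illustrate the shape of the problem: each fold splits the shuffled index buffer into three contiguous pieces, and chaining the two training pieces yields exactly the Iterator<Item=&usize> that a generalized select_rows could consume. Standard-library sketch only.)

```rust
fn main() {
    let indices: Vec<usize> = vec![3, 1, 4, 0, 2, 5]; // already shuffled
    let k = 3;
    let fold_size = indices.len() / k;

    for fold in 0..k {
        // Split into (before test fold, test fold, after test fold).
        let (before, rest) = indices.split_at(fold * fold_size);
        let (test, after) = rest.split_at(fold_size);
        // The training indices are the two non-contiguous pieces around
        // the test fold, chained into a single iterator.
        let train: Vec<usize> = before.iter().chain(after.iter()).cloned().collect();
        println!("fold {}: test {:?}, train {:?}", fold, test, train);
    }
}
```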

I plan on creating a PR for random forests at some point. How soon depends on how busy I am with real life.

@AtheMathmo
Owner

Sounds good!

@AtheMathmo
Owner

The changes you need have been released in rulinalg 0.3.1. The bad news is that we'll need to merge #117 before you can get access to it :(. That PR is pretty close to going in though - we just need to verify there are no performance regressions.

@theotherphil
Contributor Author

Great, thanks.

@AtheMathmo
Owner

Closed with #125!

@AtheMathmo
Owner

Hey @theotherphil ! I just wanted to ping you to say that I merged #133 (a continuation of #117) and so you should have the functions you need to improve the current implementation. Let me know if you want me to address these/write up a tracking issue.
