Add cross validation function #114
Comments
I'd like to have a go at this. Please let me know if you have any existing plans for validation tools.
Thanks for this! I still haven't figured out how this should look, but I agree that even a naive version would be good! I've had a couple of ideas that I'd be happy to share, but feel free to implement this as you see fit and we can work together to find the best approach. The only requirement from me would be to try not to produce too much memory overhead; ideally we can manipulate the data in-place.

Let me know if you need anything as you're working on this. If there's any missing functionality you need, I'll do my best to get it in quickly.
Thanks, will keep that in mind.
From #124 by @theotherphil: I've implemented a very WIP version of cross validation, but:

1. it's hard-coded to only accept models with inputs and targets of type `Matrix`, and
2. this means that it's impossible to write a non-copy version, as there's no way of selecting a random subset of rows and passing them to the `train` or `predict` functions of these models.

Options:
If we go with 2. we can still either avoid any allocation by shuffling the input rows in place (this sounds like a bad idea), or limit ourselves to allocating a single array of size (k-1)/k * the input data size. Maybe the latter option is acceptable? Datasets also need to be split like this when training random forests (and presumably for some other learning algorithms), which might be an argument in favour of 1. Although it's possible that copying data around, and even allocating, might end up cheaper overall in some cases due to improved locality. Maybe?

There are some tricky issues here (some of which are why I was nervous to push on with this myself) but I think we can get a decent solution. Will issue 1 be solved if we get associated types? If not, what is preventing us from allowing it?

With regards to the options you proposed:
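To make the "single allocation of size (k-1)/k * input data size" option concrete, here is a minimal sketch of reusing one training buffer across folds on a row-major array. The function and parameter names are illustrative, not the library's API, and the row-major `&[f64]` layout is an assumption for the sketch.

```rust
/// Copies every row of `data` (row-major, `cols` columns) that is NOT in
/// the test range [test_start, test_end) into `buf`, reusing its
/// allocation. Returns the number of training rows copied.
fn fill_training_buffer(
    data: &[f64],
    cols: usize,
    test_start: usize,
    test_end: usize,
    buf: &mut Vec<f64>,
) -> usize {
    buf.clear(); // keep the existing allocation, drop old contents
    let rows = data.len() / cols;
    for r in 0..rows {
        if r < test_start || r >= test_end {
            buf.extend_from_slice(&data[r * cols..(r + 1) * cols]);
        }
    }
    buf.len() / cols
}

fn main() {
    // 4 rows x 3 cols, k = 4 folds, so each test fold is one row.
    let data: Vec<f64> = (0..12).map(|x| x as f64).collect();
    let (k, rows, cols) = (4usize, 4usize, 3usize);
    // One allocation sized for the largest training set: (k-1)/k of the rows.
    let mut buf = Vec::with_capacity((k - 1) * rows / k * cols);
    let copied = fill_training_buffer(&data, cols, 1, 2, &mut buf);
    assert_eq!(copied, 3); // rows 0, 2, 3 copied; row 1 held out
    assert_eq!(buf[..3], [0.0, 1.0, 2.0]);
}
```

Each subsequent fold refills the same buffer, so the allocation cost is paid once rather than k times; the copying itself may even help locality, as noted above.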
I'm still woefully unfamiliar with random forests, so I can't really comment intelligently on the last part. If this is a requirement then I would say one approach would be to have a
Switching from type parameters to associated types won't help with 1.

I'll go with the second option for now.

As you might have guessed from how many times I've mentioned them vs how many times I've mentioned anything else, random forests are about the only machine learning algorithm I do know. No changes to the rest of this library are required to implement them; I was just mentioning them as another example of where choosing arbitrary subsets of rows is useful.
Yes, this is true for the reason you mention (that, and optimization/compatibility with other libraries).
You could also take a look at `in_place_fisher_yates`.

As for random forests, I'd be happy to see a PR to support them if you ever have the time/desire!
I'm already using your `in_place_fisher_yates` :-). The issue is that after shuffling the input indices, I iterate over `(test_indices, training_indices)` tuples to train and validate for each fold. `test_indices` is a slice, but `training_indices` is a pair of slices (the stuff before the test set, then the stuff after). It should be easy to make `select_rows` (and `select_cols`) more general without breaking any existing code, so I'll give that a go.

I plan on creating a PR for random forests at some point. How soon depends on how busy I am with real life.
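The fold iteration described above can be sketched as follows: shuffle the index vector in place (a textbook Fisher-Yates, standing in for the library's `in_place_fisher_yates`), then for each fold take the test indices as one contiguous slice and the training indices as the (before, after) pair of slices. All names here are hypothetical, and the random source is a caller-supplied closure so the sketch stays deterministic.

```rust
/// Textbook Fisher-Yates shuffle; `next(bound)` must return a value
/// in [0, bound). A stand-in for the library's shuffle utility.
fn fisher_yates_sketch(indices: &mut [usize], mut next: impl FnMut(usize) -> usize) {
    for i in (1..indices.len()).rev() {
        let j = next(i + 1); // pick a slot in [0, i]
        indices.swap(i, j);
    }
}

/// For each of `k` folds, returns (test_indices, (before, after)):
/// the test block as one slice, the training set as a pair of slices.
fn folds(indices: &[usize], k: usize) -> Vec<(&[usize], (&[usize], &[usize]))> {
    let n = indices.len();
    (0..k)
        .map(|fold| {
            let start = fold * n / k;
            let end = (fold + 1) * n / k;
            (&indices[start..end], (&indices[..start], &indices[end..]))
        })
        .collect()
}

fn main() {
    let mut idx: Vec<usize> = (0..6).collect();
    fisher_yates_sketch(&mut idx, |bound| bound / 2); // deterministic "rng"
    // Shuffling is a permutation: every index still present exactly once.
    let mut sorted = idx.clone();
    sorted.sort();
    assert_eq!(sorted, vec![0, 1, 2, 3, 4, 5]);
    for (test, (before, after)) in folds(&idx, 3) {
        assert_eq!(test.len() + before.len() + after.len(), 6);
    }
}
```

This is exactly why a generalized `select_rows` helps: the training set for each fold is two disjoint slices, so a row-selection function that accepts an arbitrary sequence of indices (rather than one contiguous range) is needed to avoid copying.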
Sounds good!
The changes you need have been released in rulinalg 0.3.1. The bad news is that we'll need to merge #117 before you can get access to it :(. That PR is pretty close to going in though - we just need to verify there are no performance regressions.
Great, thanks.
Closed with #125!
Hey @theotherphil! I just wanted to ping you to say that I merged #133 (a continuation of #117) and so you should have the functions you need to improve the current implementation. Let me know if you want me to address these/write up a tracking issue.
At least a naïve version, with basic reporting. Ideally also a fast version for models admitting a monoidal structure, as in https://github.com/mikeizbicki/HLearn/blob/master/README.md.
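The "monoidal structure" idea from HLearn can be sketched like this: if a trained model supports an associative combine operation with an identity, train one sub-model per fold, then build each leave-one-fold-out model from prefix/suffix combines in O(k) total work instead of retraining from scratch k times. The `MeanModel` and all names below are purely illustrative, not anything in this library.

```rust
/// A toy model (a running mean) whose trained state forms a monoid:
/// `empty` is the identity and `combine` is the associative operation.
#[derive(Clone, Copy, Debug, PartialEq)]
struct MeanModel {
    sum: f64,
    count: f64,
}

impl MeanModel {
    fn empty() -> Self {
        MeanModel { sum: 0.0, count: 0.0 }
    }
    fn train(data: &[f64]) -> Self {
        MeanModel { sum: data.iter().sum(), count: data.len() as f64 }
    }
    /// The monoid operation: merging two trained models.
    fn combine(self, other: Self) -> Self {
        MeanModel { sum: self.sum + other.sum, count: self.count + other.count }
    }
    fn predict(&self) -> f64 {
        self.sum / self.count
    }
}

/// For each fold i, the model trained on all folds except i, built from
/// prefix and suffix combines of the per-fold models (O(k) combines total).
fn leave_one_fold_out(fold_models: &[MeanModel]) -> Vec<MeanModel> {
    let k = fold_models.len();
    let mut prefix = vec![MeanModel::empty(); k + 1];
    let mut suffix = vec![MeanModel::empty(); k + 1];
    for i in 0..k {
        prefix[i + 1] = prefix[i].combine(fold_models[i]);
        suffix[k - 1 - i] = fold_models[k - 1 - i].combine(suffix[k - i]);
    }
    (0..k).map(|i| prefix[i].combine(suffix[i + 1])).collect()
}

fn main() {
    let folds = [vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0]];
    let models: Vec<_> = folds.iter().map(|f| MeanModel::train(f)).collect();
    let held_out = leave_one_fold_out(&models);
    // Excluding fold 0 must match training directly on folds 1 and 2.
    assert_eq!(held_out[0], MeanModel::train(&[3.0, 4.0, 5.0, 6.0]));
}
```

The naive version trains k models on (k-1)/k of the data each; the monoidal version trains k small models on 1/k of the data each and pays only cheap combines, which is where HLearn's speedups come from.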