Skip to content
This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

ENH: Add datasets #161

Merged
merged 5 commits into from Dec 27, 2016
Merged

ENH: Add datasets #161

merged 5 commits into from Dec 27, 2016

Conversation

sinhrks
Copy link
Contributor

@sinhrks sinhrks commented Dec 16, 2016

Closes #115. Added Dataset struct which has data() and target() impl (intended for supervised learning).

Adding more data once API looks OK.

Copy link
Owner

@AtheMathmo AtheMathmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to add a feature flag - or at least discuss this.

Otherwise this looks ready to go.

/// Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
/// Irvine, CA: University of California, School of Information and Computer Science.
pub fn load_iris() -> Dataset<Matrix<f64>, Vector<usize>> {
let data: Vec<f64> = vec![5.1, 3.5, 1.4, 0.2,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: It might be easier to use the matrix! macro here. My thinking is that if we need to add a row there's a little less work.


/// Dataset container
#[derive(Clone, Debug)]
pub struct Dataset<D, T> where D: Clone + Debug, T: Clone + Debug {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense for now. We might want to be more strict in future if we want to be generic over DataSets. However, this is something that I don't think we will ever want to do.

@@ -219,3 +219,6 @@ pub mod analysis {
pub mod cross_validation;
pub mod score;
}

/// Module for datasets.
pub mod datasets;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should feature gate this. My thinking is that if we have a few datasets users will not want to download all of this data by default.

To do this:

@sinhrks
Copy link
Contributor Author

sinhrks commented Dec 18, 2016

Added feature gates. Is it ok to be included by default ATM (as it is likely to be used in most tests)?

@AtheMathmo
Copy link
Owner

Thanks for the update.

It looks good but I'm a little cautious about having the datasets flag included by default. I wanted it feature flagged specifically so that it had to be opted-in. I can see that we will probably want to use it in some tests but I'd try a few ways around this first.

  • Can we fit all tests in the datasets module?
  • Can we somehow put the tests in the tests folder behind the feature flag too?
  • As a last resort, do as you have done and keep the "datasets" flag for the "test" feature.

Finally note that we will need to modify the travis CI matrix to include the "datasets" flag.

@sinhrks sinhrks force-pushed the datasets branch 2 times, most recently from 43d6e2d to 90e1944 Compare December 19, 2016 22:53
@sinhrks
Copy link
Contributor Author

sinhrks commented Dec 20, 2016

OK, made datasets as optional, and fixed travis tests.

@AtheMathmo
Copy link
Owner

This looks good to me but before merging I'd like to check out the branch and play around with it a little.

Thanks!

@AtheMathmo
Copy link
Owner

I checked out the code and I have a few thoughts. I am happy to merge this in without any further changes but we should at least write up a tracking issue for improvements.

I think the description of load_iris should include more information on the dataset. Or a link directly to the iris dataset: http://archive.ics.uci.edu/ml/datasets/Iris (instead of just the /ml root). We should say what the attributes and classes are. From the link:

Attribute Information:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class:
    -- Iris Setosa
    -- Iris Versicolour
    -- Iris Virginica

Also I think it might be a good idea to organize the datasets module a little differently. If we add more datasets the module is going to get large quickly and difficult to manage. I think we should move the iris data into a new iris module. This iris module should have a pub fn load() -> DataSet. As a user I would be quite happy to do:

use rusty_machine::datasets:iris;

let (inputs, targets) = iris::load();

But we could also use the current format and have a load_iris function in the datasets module that calls iris::load. In this case we might want to keep the iris module private?

Let me know what you think. If you don't want to make any of these changes now I'll merge and move this information to a separate ticket.

@sinhrks sinhrks force-pushed the datasets branch 2 times, most recently from 4cdb18a to 06e52c4 Compare December 27, 2016 01:48
@sinhrks
Copy link
Contributor Author

sinhrks commented Dec 27, 2016

Thx for the comment. I've did a change requested. Pls take a look when u have a time.

Copy link
Owner

@AtheMathmo AtheMathmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now!

I have a minor nitpick for the features section but so far as I can tell it makes no real difference.

stats = []
datasets = []
test = []
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to include the test or default features. These already exist as defined here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, removed.


use super::Dataset;

/// Load iris dataset.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description is great!

@AtheMathmo
Copy link
Owner

Thank you! Merging now.

@AtheMathmo AtheMathmo merged commit 34a5417 into AtheMathmo:master Dec 27, 2016
@sinhrks sinhrks deleted the datasets branch December 27, 2016 08:56
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants