Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sample behaviour #23

Open
LukasHedegaard opened this issue Apr 6, 2020 · 7 comments
Open

sample behaviour #23

LukasHedegaard opened this issue Apr 6, 2020 · 7 comments
Labels
behaviour Should this behaviour be changed?
Milestone

Comments

@LukasHedegaard
Copy link
Owner

Currently, if more samples are requested on .sample, than are available in the dataset, we will sample some samples multiple times. Should we raise an error instead?

@LukasHedegaard LukasHedegaard added the behaviour Should this behaviour be changed? label Apr 6, 2020
@iliiliiliili
Copy link
Collaborator

We can add no_more_samples parameter to be able to:

  1. 'error' – raise Exception if no more samples.
  2. 'None' – out of bounds samples will be None. (Can it be useful?)
  3. 'cycle' – start from beginning
  4. 'repeat' – repeat last sample

@LukasHedegaard LukasHedegaard added this to Planned in Kanban board via automation Apr 8, 2020
@LukasHedegaard
Copy link
Owner Author

I don't think the 'None' version makes sense. Also, sampling is random, so I don't think a distinction between 'cycle' and 'repeat' makes sense.
Then it's either throw an error or sample additional randomly.
Does it make sense for the user to need to determine this by argument?
A middle-ground solution could be to emit a warning when "oversampling". What do you think?

@iliiliiliili
Copy link
Collaborator

Maybe we can have allow_oversampling=False so it raises Exception when trying to oversample, but if user sets it to True, we will just continue sampling randomly.

@clegaard
Copy link
Collaborator

My question is, does it really make sense to sample outside of len-1 index?
If the user wants to sample more than this could they not use the repeat operation?

@LukasHedegaard
Copy link
Owner Author

LukasHedegaard commented Apr 10, 2020

Scenarios with oversampling will probably be due to an error in user-logic, and I think it’s better to get an error earlier rather than later.

Should a creative user finds the need for it, he could always do:

ds_sampled = ds.sample(42)
ds_oversampled = ds.concat(ds_sampled)

Though not that bad, it does break the call-chain. And seeing as the implementation is trivial, why not add the allow_oversampling flag as @iliiliiliili proposed, with a default arg (False) that optimises for the more frequent scenario (possible error in user logic)?

@clegaard
Copy link
Collaborator

Should this also extend to take function as well? If one is allowed I would expect the other to be as well.
I definitely prefer the flag to do funky stuff like the concat.

However, if the user wants to sample more samples than what is present they could use the repeat function? This would be more transparent in my opinion and would not add any complexity.

ds_sampled = ds.repeat(2).sample(42)
ds_taken = ds.repeat(2).take(42)

@LukasHedegaard
Copy link
Owner Author

Let's remove the oversampling and throw an error instead

@LukasHedegaard LukasHedegaard added this to the 0.1.0 milestone Apr 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
behaviour Should this behaviour be changed?
Projects
Kanban board
  
Backlog
Development

No branches or pull requests

3 participants