sample behaviour #23

LukasHedegaard · 2020-04-06T08:38:34Z

Currently, if more samples are requested via .sample than are available in the dataset, some samples are drawn multiple times. Should we raise an error instead?


iliiliiliili · 2020-04-06T12:24:48Z

We could add a no_more_samples parameter that selects one of the following behaviours:

'error' – raise an Exception if there are no more samples.
'None' – out-of-bounds samples will be None. (Can it be useful?)
'cycle' – start again from the beginning.
'repeat' – repeat the last sample.

LukasHedegaard · 2020-04-08T09:33:20Z

I don't think the 'None' version makes sense. Also, sampling is random, so I don't think a distinction between 'cycle' and 'repeat' makes sense either.
That leaves two options: throw an error, or sample the additional items randomly.
Does it make sense to require the user to choose between these with an argument?
A middle-ground solution could be to emit a warning when "oversampling". What do you think?

iliiliiliili · 2020-04-10T08:46:19Z

Maybe we can have allow_oversampling=False, so that it raises an Exception when trying to oversample, but if the user sets it to True, we just continue sampling randomly.
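To make the proposed flag concrete, here is a minimal sketch on plain Python lists. It is not the library's implementation: the standalone `sample` function, its `seed` argument, and the choice to oversample by drawing with replacement are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample(
    items: Sequence[T],
    num: int,
    allow_oversampling: bool = False,
    seed: Optional[int] = None,
) -> List[T]:
    """Hypothetical stand-in for a dataset's .sample method."""
    rng = random.Random(seed)
    pool = list(items)

    if num <= len(pool):
        # Normal case: draw without replacement, so no item is repeated.
        return rng.sample(pool, num)

    if not allow_oversampling:
        # Default: treat oversampling as a probable error in user logic.
        raise ValueError(
            f"Requested {num} samples, but only {len(pool)} are available."
        )

    # Assumption: oversampling means drawing with replacement.
    return rng.choices(pool, k=num)
```

With the default, `sample(list(range(10)), 42)` raises a ValueError, while passing `allow_oversampling=True` returns 42 items, some of them repeated.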

clegaard · 2020-04-10T09:01:24Z

My question is: does it really make sense to sample beyond index len-1?
If the user wants more samples than that, could they not use the repeat operation?

LukasHedegaard · 2020-04-10T13:01:32Z

Scenarios with oversampling will probably be due to an error in user-logic, and I think it’s better to get an error earlier rather than later.

Should a creative user find the need for it, they could always do:

ds_sampled = ds.sample(42)
ds_oversampled = ds.concat(ds_sampled)

Though not that bad, it does break the call-chain. And seeing as the implementation is trivial, why not add the allow_oversampling flag as @iliiliiliili proposed, with a default arg (False) that optimises for the more frequent scenario (possible error in user logic)?

clegaard · 2020-04-15T09:12:39Z

Should this extend to the take function as well? If one is allowed, I would expect the other to be as well.
I definitely prefer the flag over doing funky stuff like the concat.

However, if the user wants more samples than are present, could they not use the repeat function? This would be more transparent in my opinion and would not add any complexity.

ds_sampled = ds.repeat(2).sample(42)
ds_taken = ds.repeat(2).take(42)
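A rough sketch of why repeat-then-sample needs no flag, again on plain lists rather than the real dataset class. The helpers `repeat` and `strict_sample` are hypothetical, and repeating the whole sequence end to end is an assumption about what repeat does.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def repeat(items: Sequence[T], times: int) -> List[T]:
    """Hypothetical repeat: the whole sequence, `times` times over."""
    return list(items) * times


def strict_sample(
    items: Sequence[T], num: int, seed: Optional[int] = None
) -> List[T]:
    """Hypothetical sample that refuses to oversample."""
    if num > len(items):
        raise ValueError(
            f"Requested {num} samples, but only {len(items)} are available."
        )
    # Draw without replacement from whatever sequence was passed in.
    return random.Random(seed).sample(list(items), num)


data = list(range(30))

# 42 > 30, so sampling the raw data would raise; repeating first makes
# the intent to reuse items explicit at the call site.
oversampled = strict_sample(repeat(data, 2), 42, seed=0)

# Each original item can appear at most as often as it was repeated.
counts = Counter(oversampled)
```

Here the number of times an item may recur is bounded by the repeat count, which a plain oversampling flag would not guarantee.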

LukasHedegaard · 2020-04-15T12:28:58Z

Let's remove the oversampling and throw an error instead.

LukasHedegaard added the behaviour label ("Should this behaviour be changed?") Apr 6, 2020

LukasHedegaard added this to Planned in Kanban board via automation Apr 8, 2020

LukasHedegaard added this to the 0.1.0 milestone Apr 15, 2020
