a seeded sample not working properly when an index is present #255

Closed
jqnatividad opened this issue Dec 26, 2020 · 1 comment

Comments

@jqnatividad

jqnatividad commented Dec 26, 2020

I have a somewhat sparse TSV file with 45 million rows and 24 columns, about 4.3 GB.

When I run `xsv sample --seed 42 1000 file.tsv -o output.csv` without an index, it takes about 15 seconds and produces a reproducible sample.

However, when I create an index (`xsv index file.tsv`, which takes about 12 seconds and produces a 350 MB IDX file) and run a sample with the same seed, it is fast (2 seconds) but produces a different sample on each run, as if I hadn't specified a seed.

@jqnatividad
Author

After looking at the code, I found that this partially explains the issue:

xsv/src/cmd/sample.rs

Lines 17 to 19 in 3de6c04

When an index is present, this command will use random indexing if the sample
size is less than 10% of the total number of records. This allows for efficient
sampling such that the entire CSV file is not parsed.

In effect, the random-indexing path short-circuits the seed parameter.
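To make the quoted behavior concrete, here is a minimal sketch of the branch it describes. The names `Strategy` and `choose_strategy` are hypothetical and are not taken from xsv's source; the only thing assumed from the documentation is the "index present and sample size less than 10% of the records" condition.

```rust
/// Which sampling strategy would be selected.
/// These names are illustrative only; they are not xsv's identifiers.
#[derive(Debug, PartialEq)]
pub enum Strategy {
    /// Jump straight to randomly chosen records via the index.
    RandomIndexing,
    /// Read the whole file and sample while scanning.
    FullScan,
}

/// Assumed rule, as described in the quoted doc comment: random indexing is
/// used only when an index exists and the sample is under 10% of the records.
pub fn choose_strategy(has_index: bool, sample_size: u64, total_records: u64) -> Strategy {
    // Widen to u128 so that `sample_size * 10` cannot overflow.
    let under_ten_percent = (sample_size as u128) * 10 < total_records as u128;
    if has_index && under_ten_percent {
        Strategy::RandomIndexing
    } else {
        Strategy::FullScan
    }
}
```

Under this assumed rule, a sample of 1000 from 45 million indexed records takes the random-indexing branch, while a sample of 5,000,000 (about 11%) falls back to the full scan, which would match the timings and reproducibility I observed.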

I ran a bigger sample (more than 10% of the records) using the same TSV file (`xsv sample --seed 42 5000000 file.tsv -o output2.csv`), and I can confirm it's now reproducible, with and without an index.

However, why does it ignore the seed parameter only when an index is present and the sample size is less than 10%? Shouldn't the seed always take precedence over the <10% sample size check?
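For illustration, here is a rough sketch of what I would expect a seed-aware random-indexing path to look like. This is not xsv's code: `SplitMix64` is a small stand-in generator used only to keep the example free of external crates, and `sample_record_indices` is a hypothetical helper. The point is simply that if the record positions are drawn from a generator built from `--seed`, the indexed path stays fast and becomes reproducible.

```rust
use std::collections::BTreeSet;

/// SplitMix64: a tiny deterministic PRNG, used here only as a stand-in for
/// whatever seeded generator the real command would use.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper: pick `sample_size` distinct record positions in
/// `0..total_records`, driven entirely by `seed`. Returned in ascending order
/// so an index could seek forward through the file.
pub fn sample_record_indices(seed: u64, sample_size: usize, total_records: u64) -> Vec<u64> {
    // Never ask for more distinct positions than there are records.
    let wanted = (sample_size as u64).min(total_records) as usize;
    let mut rng = SplitMix64::new(seed);
    let mut picked = BTreeSet::new();
    // Rejection loop: duplicates are simply redrawn. This is cheap when the
    // sample is small relative to the total (the <10% case). The modulo
    // introduces a slight bias, which is acceptable for a sketch.
    while picked.len() < wanted {
        picked.insert(rng.next_u64() % total_records);
    }
    picked.into_iter().collect()
}
```

Calling `sample_record_indices(42, 1000, 45_000_000)` twice yields the same 1000 positions, which is the behavior I expected from `--seed 42` regardless of whether an index is present.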
