a seeded sample not working properly when an index is present #255

jqnatividad · 2020-12-26T20:00:05Z

I have a somewhat sparse TSV file with 45M rows and 24 columns, about 4.3 GB.

When I run xsv sample --seed 42 1000 file.tsv -o output.csv without an index, it takes about 15 seconds and produces a reproducible sample.

However, when I create an index (xsv index file.tsv, which takes about 12 seconds and produces a 350 MB IDX file) and run a sample using the same seed, it is fast (2 seconds) but produces a different sample on each run, as if I hadn't specified a seed.


jqnatividad · 2020-12-26T21:01:18Z

After looking at the code, this partially explains the issue:

xsv/src/cmd/sample.rs

Lines 17 to 19 in 3de6c04

    
    When an index is present, this command will use random indexing if the sample
    size is less than 10% of the total number of records. This allows for efficient
    sampling such that the entire CSV file is not parsed.

This basically short-circuits the seed parameter.

I ran a bigger sample (more than 10% of the records) using the same TSV file (xsv sample --seed 42 5000000 file.tsv -o output2.csv), and I can confirm it's now reproducible, with and without an index.

However, why is the seed parameter ignored only when an index is present and the sample size is less than 10%? Shouldn't the seed always take precedence over the <10% sample size check?
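To illustrate why the indexed fast path doesn't have to give up reproducibility, here is a minimal sketch of seeded random indexing. This is not xsv's actual code: sample_indices and the small SplitMix64 generator are hypothetical stand-ins so the example builds with the standard library alone (xsv itself would presumably use the rand crate). The only point is that drawing the record indices from a generator initialised with the seed yields the same set of indices on every run.

```rust
use std::collections::BTreeSet;
use std::time::{SystemTime, UNIX_EPOCH};

/// Tiny SplitMix64 generator, used only so this sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper: pick `sample_size` distinct record indices in `0..total`.
/// With `Some(seed)` the result is deterministic; with `None` it varies per run.
fn sample_indices(total: u64, sample_size: usize, seed: Option<u64>) -> Vec<u64> {
    assert!(sample_size as u64 <= total, "sample larger than the file");

    // Only fall back to a time-based seed when the user did not supply one.
    let seed = seed.unwrap_or_else(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos() as u64)
            .unwrap_or(0)
    });
    let mut rng = SplitMix64(seed);

    // Draw until we have enough distinct indices. This is cheap when the
    // sample is a small fraction of the file (the <10% case). The modulo
    // introduces a slight bias, which is acceptable for a sketch.
    let mut picked = BTreeSet::new();
    while picked.len() < sample_size {
        picked.insert(rng.next_u64() % total);
    }

    // BTreeSet iterates in ascending order, so records can be fetched
    // through the index in file order.
    picked.into_iter().collect()
}
```

Calling sample_indices(45_000_000, 1000, Some(42)) twice returns the same 1000 indices, while passing None gives a fresh sample each time. Each index could then be looked up through the IDX file without parsing the whole TSV.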

jqnatividad mentioned this issue Sep 15, 2021

Seeded samples should always be reproducible regardless of sample size jqnatividad/qsv#11

Closed

jqnatividad closed this as completed Dec 1, 2021
