I have a somewhat sparse TSV file with 45M rows and 24 columns (about 4.3 GB).
When I run xsv sample --seed 42 1000 file.tsv -o output.csv without an index, it takes about 15 seconds and produces a reproducible sample.
However, when I create an index (xsv index file.tsv takes about 12 seconds and produces a 350 MB index file) and run a sample with the same seed, it is fast (2 seconds) but produces a different sample on each run, as if I hadn't specified a seed at all.
When an index is present, this command will use random indexing if the sample
size is less than 10% of the total number of records. This allows for efficient
sampling such that the entire CSV file is not parsed.
Basically, it short-circuits the seed parameter.
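For illustration, here is a minimal Python sketch (not xsv's actual Rust code) of what the indexed fast path does conceptually: with an index, the sampler can draw k row offsets and seek straight to them instead of streaming the whole file. If the RNG used for that draw is seeded, the result is reproducible; if it falls back to OS entropy, each run differs — which matches the behavior reported above.

```python
import random

def sample_row_indices(total_rows, k, seed=None):
    """Pick k distinct row offsets out of total_rows, optionally seeded.

    With an index file, a sampler can seek directly to these offsets
    (the <10% fast path) rather than scanning every record.
    """
    rng = random.Random(seed)  # seed given -> reproducible; None -> OS entropy
    return sorted(rng.sample(range(total_rows), k))

# Two seeded runs agree with each other:
a = sample_row_indices(45_000_000, 1000, seed=42)
b = sample_row_indices(45_000_000, 1000, seed=42)
assert a == b
```

The point of the sketch: nothing about the index-based fast path is inherently non-deterministic — it only becomes so if the seed is never threaded through to the RNG that picks the offsets.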
I ran a bigger sample (more than 10%) on the same TSV file (xsv sample --seed 42 5000000 file.tsv -o output2.csv), and I can confirm it's now reproducible, both with and without an index.
However, why does it ignore the seed parameter only when an index is present and the sample size is less than 10%? Shouldn't the seed always take precedence over the <10% sample-size check?