Dependency between seed and number of threads (cpus) #215

izaak-coleman · 2021-09-17T02:49:43Z

Hey all!
First of all, awesome tool. It's been super useful for me so far, very easy to use.

I noticed an issue when trying to recreate an identical dataset (same input, same seed) across multiple machines.

I set the seed to 42 on both machines; one had 8 cpus, the other 4. The output of the tool differed between them:
diff machine_1_R1.fastq machine_2_R1.fastq returned something (i.e. the files were not identical).

I wondered if this was simply the reads being output in a random order due to the parallelism, with the reads themselves still identical. This was not the case; the diff below also returned something:
cat machine_1_R1.fastq | sort > f1.fastq
cat machine_2_R1.fastq | sort > f2.fastq
diff f1.fastq f2.fastq

Next, I wondered whether the reads were being output in a random order due to the parallelism and differed only in their headers (perhaps due to a globally mutexed counter that gives each read a unique id), while the DNA itself was still being sampled identically. To test this, I extracted only the DNA (i.e. not the headers) from the fastq files and ran a diff:
sed -n '2~4p' machine_1_R1.fastq | sort > f1.fastq # this will give us just the sequences
sed -n '2~4p' machine_2_R1.fastq | sort > f2.fastq
diff f1.fastq f2.fastq
Again, this returned something. So it seems the data is genuinely different despite the seed being 42 in both runs!
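
For anyone repeating this check without GNU sed (the 2~4p address is a GNU extension), the same order-independent comparison can be sketched in Python. This is only an illustration: the function names are made up here, and it assumes plain 4-line FASTQ records (no sequences wrapped over several lines).

```python
from collections import Counter
from itertools import islice


def sequence_counts(path):
    """Count each sequence line (the 2nd line of every 4-line FASTQ record)."""
    with open(path) as handle:
        # islice(handle, 1, None, 4) yields lines 2, 6, 10, ... (1-based),
        # i.e. the same lines that sed -n '2~4p' prints.
        return Counter(line.rstrip("\n") for line in islice(handle, 1, None, 4))


def same_sequences(path_a, path_b):
    """True if both files hold the same multiset of sequences, in any order."""
    return sequence_counts(path_a) == sequence_counts(path_b)
```

Calling same_sequences("machine_1_R1.fastq", "machine_2_R1.fastq") would answer the same question as the sort-and-diff above, ignoring both read order and headers.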

The only remaining difference was that I was generating the data with cpu=4 on one machine and cpu=8 on the other.
When I set both to cpu=4, the files were the same:
diff machine_1_R1.fastq machine_2_R1.fastq returned nothing.

The last thing to check was whether the differing machines, rather than the differing cpu numbers, were responsible - perhaps, in some weird way, the randomization algorithm differed between the machines. But (thank the good lord Number Forty-Two) this was not the case. I ran three runs on the same machine and compared the outputs: 8cpu vs 4cpu (run1) vs 4cpu (run2). The 8cpu output differed from both 4cpu outputs, and the two 4cpu outputs were identical to one another.

I assume this is a bug and not a feature - I can't think of any reason why you'd want this. It may be unfixable - sometimes dealing with parallelism is hard (I've been there). But I thought I'd bring it to your attention: right now, identical seed (and data) inputs do not produce identical datasets if the cpu numbers differ.
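
One common way this kind of dependency arises is sketched below. It is an assumption about the general pattern, not something confirmed from the tool's code: if each worker gets its own random stream derived from the seed and the worker index, then changing the number of workers changes which stream produces which read. Deriving the stream from a fixed-size chunk of reads instead makes the result independent of the worker count. All names here (make_read, per_worker_seeding, per_chunk_seeding) and the seed-mixing formula are hypothetical, and the workers are simulated in a plain loop rather than run in parallel.

```python
import random

BASES = "ACGT"


def make_read(rng, length=12):
    """Draw one random read from the given RNG."""
    return "".join(rng.choice(BASES) for _ in range(length))


def per_worker_seeding(seed, n_reads, n_workers):
    """One RNG per worker: the set of reads depends on n_workers."""
    reads = []
    for worker in range(n_workers):
        # Illustrative seed mixing only; each worker gets its own stream.
        rng = random.Random(seed * 1_000_003 + worker)
        # This worker's share of the reads (dealt round-robin).
        share = len(range(worker, n_reads, n_workers))
        reads.extend(make_read(rng) for _ in range(share))
    return sorted(reads)


def per_chunk_seeding(seed, n_reads, n_workers, chunk_size=4):
    """One RNG per fixed-size chunk: the set of reads ignores n_workers."""
    chunks = [
        (start, min(start + chunk_size, n_reads))
        for start in range(0, n_reads, chunk_size)
    ]
    reads = []
    for worker in range(n_workers):
        # Chunks are dealt round-robin, but each chunk's stream depends
        # only on the seed and the chunk index, never on the worker.
        for index in range(worker, len(chunks), n_workers):
            start, stop = chunks[index]
            rng = random.Random(seed * 1_000_003 + index)
            reads.extend(make_read(rng) for _ in range(stop - start))
    return sorted(reads)
```

With seed 42, per_chunk_seeding gives the same sorted reads for 4 and 8 workers, while per_worker_seeding is reproducible only when the worker count is also the same, which matches the behaviour described above. In a NumPy-based code base, numpy.random.SeedSequence and its spawn method would be a more robust way to derive the per-chunk streams than the ad hoc arithmetic used here.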

HadrienG · 2023-08-08T12:34:44Z

Hi!

Thanks for bringing this to my attention. I'm not sure how to go about fixing it, but perhaps in the future.

/Hadrien

HadrienG added bug on hold good idea, might happen in the future labels Aug 8, 2023
