Dependency between seed and number of threads (cpus) #215

Open
izaak-coleman opened this issue Sep 17, 2021 · 1 comment
Labels
bug on hold good idea, might happen in the future

Comments

@izaak-coleman

Hey all!
First of all, awesome tool. It's been super useful for me so far, very easy to use.

I noticed an issue when trying to recreate an identical dataset (same input, same seed) across multiple machines.

I set the seed to 42 on both machines; one had 8 cpus, the other 4. The output of the tool differed:
diff machine_1_R1.fastq machine_2_R1.fastq returned something.

I wondered if this was down to the reads being output in a random order due to the parallelism, with the reads themselves still identical. This was not the case; the diff below still returned something:
cat machine_1_R1.fastq | sort > f1.fastq
cat machine_2_R1.fastq | sort > f2.fastq
diff f1.fastq f2.fastq

Further still, I wondered if the reads were being output in a random order due to the parallelism and differed only in their headers (perhaps due to a globally mutexed counter that gives each read a unique id), while the DNA itself was still being sampled identically. To test this, I output only the DNA (i.e. not the headers) from the fastq files and ran a diff:
sed -n '2~4p' machine_1_R1.fastq | sort > f1.fastq # this will give us just the sequences
sed -n '2~4p' machine_2_R1.fastq | sort > f2.fastq
diff f1.fastq f2.fastq
Again, this returned something. So, it seems the data is genuinely different despite the seed equalling 42!
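
For anyone without GNU sed (the `2~4p` address is a GNU extension), the sequence-only comparison above can be sketched in Python. This is a minimal sketch, not part of the tool: the function names are hypothetical, and it assumes plain 4-line FASTQ records (no wrapped sequence lines), the same assumption the sed command makes.

```python
from collections import Counter


def fastq_sequences(path):
    """Return a multiset of sequence lines (line 2 of each 4-line record)."""
    with open(path) as handle:
        return Counter(
            line.rstrip("\n")
            for index, line in enumerate(handle)
            if index % 4 == 1  # 0-based index 1, 5, 9, ... are the sequence lines
        )


def same_sequences(path_a, path_b):
    """True if both files hold the same sequences, ignoring order and headers."""
    return fastq_sequences(path_a) == fastq_sequences(path_b)
```

Calling `same_sequences("machine_1_R1.fastq", "machine_2_R1.fastq")` answers the same question as the sorted diff: are the sampled reads the same regardless of output order and read ids?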

The only difference left was that on one machine, I was constructing data with cpu=4, the other with cpu=8.
It turns out that when I set both to cpu=4, the files were the same:
diff machine_1_R1.fastq machine_2_R1.fastq returned nothing.

The last thing to check was whether the cause was the differing machines rather than the differing cpu numbers - perhaps, in some weird way, the randomization algorithm would be different between the machines. But (thank the good lord Number Forty-Two) this was not the case. I ran three runs on the same machine and compared the outputs: 8cpu vs 4cpu (run1) vs 4cpu (run2). The 8cpu output differed from the two 4cpu outputs, and the two 4cpu outputs were identical to one another.

I assume this is a bug, and not a feature - I can't think of any reason why you'd want this. It may be unfixable - sometimes dealing with parallelism is hard (I've been there). But I thought I'd bring it to your attention: right now, identical seed (and data) inputs do not produce identical datasets if the cpu numbers differ.
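
To illustrate how a seed can end up coupled to the cpu count, here is a toy sketch. It is an assumption about a common pattern, not a description of how this tool actually seeds its workers; every name in it is hypothetical. If each worker derives its random stream from (seed, worker index) and takes a share of the reads, the output depends on the number of workers. Deriving the stream from (seed, chunk index) with a fixed chunk size removes that dependency, because the worker count then only decides who processes which chunk.

```python
import math
import random

BASES = "ACGT"
READ_LEN = 20


def make_reads(rng, count):
    """Draw `count` random reads from the given generator."""
    return ["".join(rng.choice(BASES) for _ in range(READ_LEN)) for _ in range(count)]


def per_worker_seeding(seed, n_reads, n_workers):
    """One RNG per worker: the work split, and so the output, depends on n_workers."""
    reads = []
    base, extra = divmod(n_reads, n_workers)
    for worker in range(n_workers):
        rng = random.Random(f"{seed}:{worker}")
        reads.extend(make_reads(rng, base + (1 if worker < extra else 0)))
    return reads


def per_chunk_seeding(seed, n_reads, n_workers, chunk_size=50):
    """One RNG per fixed-size chunk: n_workers only affects scheduling."""
    n_chunks = math.ceil(n_reads / chunk_size)
    chunks = {}
    for worker in range(n_workers):  # workers simulated sequentially here
        for chunk in range(worker, n_chunks, n_workers):
            rng = random.Random(f"{seed}:{chunk}")
            count = min(chunk_size, n_reads - chunk * chunk_size)
            chunks[chunk] = make_reads(rng, count)
    # Reassemble in chunk order so the output order is also independent of n_workers.
    return [read for chunk in sorted(chunks) for read in chunks[chunk]]
```

With this sketch, `per_worker_seeding(42, 1000, 4)` and `per_worker_seeding(42, 1000, 8)` would be expected to differ even after sorting, mirroring the behaviour reported above, while `per_chunk_seeding(42, 1000, 4)` and `per_chunk_seeding(42, 1000, 8)` return the same list. The trade-off is that the chunk size becomes part of what defines the dataset, so it would have to stay fixed across runs.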

@HadrienG HadrienG added bug on hold good idea, might happen in the future labels Aug 8, 2023
@HadrienG
Owner

HadrienG commented Aug 8, 2023

Hi!

Thanks for bringing this to my attention. I'm not sure how to go about fixing it, but perhaps in the future.

/Hadrien
