Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multicore Bismark could corrupt unmapped/ambigious reads FASTQ files for unknown reason #495

Closed
iromeo opened this issue Jun 10, 2022 · 9 comments

Comments

@iromeo
Copy link
Contributor

iromeo commented Jun 10, 2022

This issue differs from #494 because I don't have any errors in the log file.

Everything seems OK, reasonable number of reads in BAM files, but FAST.GZ file has odd lines number ( FJ02_hg38_unmapped_reads_1.fq.gz: 590731999 lines, FJ02_hg38_unmapped_reads_2.fq.gz 590731999 lines)
. Obviously, the number of lines in FAST.GZ cannot be odd and the FASTQ.GZ file was corrupted.

P.S: Maybe there is some issue with buffering while saving temp files.


Bismark (0.23.0)

I launched command: bismark -X 600 --gzip --multicore 4 --ambiguous --unmapped --bowtie2 $(dirname resources/indexes/hg38/Bisulfite_Genome) -1 data/reads/wgbs_y23_pooled/Clean/FJ02/FJ02_1.fq.gz -2 data/reads/wgbs_y23_pooled/Clean/FJ02/FJ02_2.fq.gz

It looks like one of the files was truncated, although no errors in the log file. On the example below you can notice that new read @V35.. starts in the end of QC string of previous read. And the QC string is shorter than expected.
image

The file tail looks ok. So likely one of the temp files was truncated during merge:
image

P.S: I'm going to re-align this sample and will known in several days if the problem persists or not.

@iromeo
Copy link
Contributor Author

iromeo commented Jun 10, 2022

Bismark process logs & snakemake wrapper process logs
FJO2.logs.tar.gz

@iromeo iromeo changed the title Bismark corrupts unmapped/ambigious reads FASTQ files with no reason Multicore Bismark corrupts unmapped/ambigious reads FASTQ files with no reason Jun 10, 2022
@iromeo iromeo changed the title Multicore Bismark corrupts unmapped/ambigious reads FASTQ files with no reason Multicore Bismark corrupts unmapped/ambigious reads FASTQ files for no reason Jun 10, 2022
@iromeo
Copy link
Contributor Author

iromeo commented Jun 14, 2022

P.S: After re-run i get:

  • 590732980 instead of 590731999 in FJ02_hg38_unmapped_reads_1.fq.gz
  • Same FJ02_hg38_pe.bam file content

So the error was only during *.fq.gz files merge.

@FelixKrueger
Copy link
Owner

I am very sorry for the slow reply, I will try to look at your questions tomorrow morning. Hope that's still OK?

@FelixKrueger
Copy link
Owner

Hmm, this one is probably tricky. My initial thought would also be that this might be caused by buffering issues.

I can try to run some tests with a similar command, but chances are that I won't be able to replicate the issue. If it was true that this is a buffering issue, I suppose one would fine these 'corruption events' at the intersection of from one temp.fq file to another? But this is quite weird as it really is simply merging uncompressed text files... I'll do some tests and will report back.

@iromeo
Copy link
Contributor Author

iromeo commented Jun 15, 2022

but the chances are that I won't be able to replicate the issue.

I've processed 200 WGBS without any issues. Maybe it is a very rare event.

But this is quite weird as it really is simply merging uncompressed text files..

The launch where I got all problems was in conditions when the cluster likely was short of HDD space. I don't know the exact sequence of events, but the amount of free space was bouncing near zero.

The fact that FASTQ lines number is almost normal (590731999 instead of 590732980 ) sounds like buffering problem or slow IO. So maybe the merge started when not all changes were committed to HDD (not sure that is technically possible here). In case of "not enough disk space" I expect huge chunks missing + one of child processes should fail with IO error + this event should be visible in the log file, like in #494.

results/hg38/bams_bismark/FJ02/FJ02_hg38_ambiguous_reads_1.fq.gz - 324710416
results/hg38/bams_bismark/FJ02/FJ02_hg38_ambiguous_reads_2.fq.gz - 324710416
results/hg38/bams_bismark/FJ02/FJ02_hg38_unmapped_reads_1.fq.gz - 590731999
results/hg38/bams_bismark/FJ02/FJ02_hg38_unmapped_reads_2.fq.gz - 590731999

On the other hand is quite weird that same incorrect line number was in both in unmapped *_1.fq.gz and *_2.fq.gz but not in ambiguous files

@iromeo iromeo changed the title Multicore Bismark corrupts unmapped/ambigious reads FASTQ files for no reason Multicore Bismark could corrupt unmapped/ambigious reads FASTQ files for unknown reason Jun 15, 2022
@FelixKrueger
Copy link
Owner

FelixKrueger commented Jun 15, 2022

I think I might have found the culprit(s), it appears that there never were any explicit close statements for ambiguous or unmapped FastQ files, which is not noticeable in non-multicore runs as the filehandles get closed automatically when Bismark exits (so I would imagine that the files of the child processes would be complete), but I suppose the parent process might occasionally - but not every time - still have a few lines held in buffer rather than having been written out.... So yes, this almost sounds a little corner-case-y (but it shouldn't happen nevertheless).

I have now added explicit closing statements for the unmapped and ambiguous filehandles (9d7a806), I hope this should solve the issue?! Thanks for finding this, would you mind cloning the current dev branch and testing whether it is now resolved?

Best wishes! Felix

@iromeo
Copy link
Contributor Author

iromeo commented Jun 15, 2022

Thx, for the explanation and fix, it sounds reasonable. Is dev branch is considered to be relatively stable? I noticed, that there are many changes since the latest release (2021). I could switch my pipeline to dev version, but just want to be more or less sure that everything is expected to be OK and results expected to be correct. Also, my alignment process takes 4-5 days per sample, so it takes too long just for testing/playground.

@FelixKrueger
Copy link
Owner

I think the dev branch is pretty much equivalent to master branch, but we added support for minimap2 for Nanopore and PacBio alignments, there will most likely be a new release very soon (it has just been presented at AGBT last week). I was just wanting to have development on the dev branch only, and merge into master for releases (which I should probably have done from the start :P).

@FelixKrueger
Copy link
Owner

This should be a solved issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants