Multicore Bismark could corrupt unmapped/ambiguous reads FASTQ files for unknown reason #495

iromeo · 2022-06-10T16:21:52Z

This issue differs from #494 because I don't have any errors in the log file.

Everything seems OK and the BAM files contain a reasonable number of reads, but the FASTQ.GZ files have an odd number of lines (FJ02_hg38_unmapped_reads_1.fq.gz: 590731999 lines, FJ02_hg38_unmapped_reads_2.fq.gz: 590731999 lines). A FASTQ record is four lines long, so the number of lines in a FASTQ.GZ file cannot be odd, which means these files were corrupted.
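
A minimal sketch of how such a line-count check could be done, assuming standard four-line FASTQ records and gzip-compressed files. The helper names `count_lines` and `is_plausible_fastq` are made up for illustration and are not part of Bismark:

```python
import gzip


def count_lines(path):
    """Count newline characters in a gzip-compressed file, reading 1 MiB chunks."""
    total = 0
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            total += chunk.count(b"\n")
    return total


def is_plausible_fastq(line_count):
    """Four lines per record means the line count must be divisible by 4."""
    return line_count % 4 == 0
```

For the counts reported here, `is_plausible_fastq(590731999)` is false. Note that this only counts newline characters, so a final line without a trailing newline is not counted.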

P.S: Maybe there is some issue with buffering while saving temp files.

Bismark (0.23.0)

I launched the command: bismark -X 600 --gzip --multicore 4 --ambiguous --unmapped --bowtie2 $(dirname resources/indexes/hg38/Bisulfite_Genome) -1 data/reads/wgbs_y23_pooled/Clean/FJ02/FJ02_1.fq.gz -2 data/reads/wgbs_y23_pooled/Clean/FJ02/FJ02_2.fq.gz

It looks like one of the files was truncated, although there are no errors in the log file. In the example below you can see that a new read @V35.. starts at the end of the quality string of the previous read, and that quality string is shorter than expected.
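
One way to locate such a spot is to walk the file four lines at a time and stop at the first record that breaks the FASTQ layout. This is only a sketch under the assumption of unwrapped four-line records; `first_bad_record` is a hypothetical helper, not a Bismark function:

```python
import gzip


def first_bad_record(path):
    """Return the 1-based line number where the first malformed record starts, or None."""
    with gzip.open(path, "rt") as handle:
        line_no = 0
        while True:
            record = [handle.readline() for _ in range(4)]
            if record[0] == "":
                return None  # reached end of file on a record boundary
            start = line_no + 1
            line_no += 4
            header, seq, plus, qual = (field.rstrip("\n") for field in record)
            ok = (
                record[3] != ""               # record is not cut off early
                and header.startswith("@")    # header line marker
                and plus.startswith("+")      # separator line marker
                and len(seq) == len(qual)     # one quality char per base
            )
            if not ok:
                return start
```

A read header fused onto a shortened quality string, as described above, changes the length of the quality line, so the length comparison would flag that record.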

The file tail looks OK, so one of the temp files was likely truncated during the merge:

P.S.: I'm going to re-align this sample and will know in several days whether the problem persists.

iromeo · 2022-06-10T16:24:26Z

Bismark process logs & snakemake wrapper process logs
FJO2.logs.tar.gz

iromeo · 2022-06-14T16:37:37Z

P.S.: After the re-run I get:

590732980 lines instead of 590731999 in FJ02_hg38_unmapped_reads_1.fq.gz
The same FJ02_hg38_pe.bam file content

So the error occurred only during the merge of the *.fq.gz files.

FelixKrueger · 2022-06-14T19:48:38Z

I am very sorry for the slow reply, I will try to look at your questions tomorrow morning. Hope that's still OK?

FelixKrueger · 2022-06-15T09:30:21Z

Hmm, this one is probably tricky. My initial thought would also be that this might be caused by buffering issues.

I can try to run some tests with a similar command, but chances are that I won't be able to replicate the issue. If this really is a buffering issue, I suppose one would find these 'corruption events' at the junction between one temp.fq file and the next? But this is quite weird as it really is simply merging uncompressed text files... I'll do some tests and will report back.

iromeo · 2022-06-15T09:44:53Z

"but the chances are that I won't be able to replicate the issue."

I've processed 200 WGBS samples without any issues. Maybe it is a very rare event.

"But this is quite weird as it really is simply merging uncompressed text files.."

The run where I hit these problems happened while the cluster was likely short of HDD space. I don't know the exact sequence of events, but the amount of free space was hovering near zero.

The fact that the FASTQ line count is almost normal (590731999 instead of 590732980) sounds like a buffering problem or slow IO. So maybe the merge started before all changes were committed to the HDD (I'm not sure that is technically possible here). In the "not enough disk space" case I would expect huge chunks to be missing, one of the child processes to fail with an IO error, and the event to be visible in the log file, as in #494.

results/hg38/bams_bismark/FJ02/FJ02_hg38_ambiguous_reads_1.fq.gz - 324710416
results/hg38/bams_bismark/FJ02/FJ02_hg38_ambiguous_reads_2.fq.gz - 324710416
results/hg38/bams_bismark/FJ02/FJ02_hg38_unmapped_reads_1.fq.gz - 590731999
results/hg38/bams_bismark/FJ02/FJ02_hg38_unmapped_reads_2.fq.gz - 590731999

On the other hand, it is quite weird that the same incorrect line count appeared in both the unmapped *_1.fq.gz and *_2.fq.gz files, but not in the ambiguous files.

FelixKrueger · 2022-06-15T10:58:29Z

I think I might have found the culprit(s): it appears that there never were any explicit close statements for the ambiguous or unmapped FastQ files. This is not noticeable in non-multicore runs, as the filehandles get closed automatically when Bismark exits (so I would imagine that the files of the child processes would be complete), but I suppose the parent process might occasionally, though not every time, still have a few lines held in a buffer rather than written out. So yes, this almost sounds a little corner-case-y (but it shouldn't happen nevertheless).

I have now added explicit closing statements for the unmapped and ambiguous filehandles (9d7a806), I hope this should solve the issue?! Thanks for finding this, would you mind cloning the current dev branch and testing whether it is now resolved?
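
The general effect can be illustrated outside of Bismark. The following Python sketch is only an analogy (Bismark itself is written in Perl, and this is not its code); the function name `sizes_before_and_after_close` and the 64 KiB buffer size are assumptions chosen for the demonstration. It writes a small record through a buffered handle and compares the on-disk file size before and after `close()`:

```python
import os
import tempfile


def sizes_before_and_after_close(payload, buffer_size=1 << 16):
    """Write payload via a buffered handle; return on-disk size before and after close()."""
    fd, path = tempfile.mkstemp(suffix=".fq")
    os.close(fd)
    try:
        handle = open(path, "wb", buffering=buffer_size)
        handle.write(payload)
        size_before = os.path.getsize(path)  # data may still sit in the buffer
        handle.close()                       # close() flushes the buffer to disk
        size_after = os.path.getsize(path)
        return size_before, size_after
    finally:
        os.remove(path)


before, after = sizes_before_and_after_close(b"@read1\nACGT\n+\nIIII\n")
print(before, after)
```

Anything that reads or concatenates the file between the write and the close sees only what has already been flushed, which is why an explicit close (or flush) before merging matters.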

Best wishes! Felix

iromeo · 2022-06-15T11:13:42Z

Thanks for the explanation and the fix, it sounds reasonable. Is the dev branch considered relatively stable? I noticed that there have been many changes since the latest release (2021). I could switch my pipeline to the dev version, but I want to be more or less sure that everything is expected to work and that the results are expected to be correct. Also, my alignment process takes 4-5 days per sample, which is too long for testing or experimenting.

FelixKrueger · 2022-06-15T11:33:05Z

I think the dev branch is pretty much equivalent to the master branch, except that we added support for minimap2 for Nanopore and PacBio alignments. There will most likely be a new release very soon (it has just been presented at AGBT last week). I just wanted to keep development on the dev branch only, and merge into master for releases (which I should probably have done from the start :P).

FelixKrueger · 2022-11-10T09:23:20Z

This should be a solved issue.

iromeo changed the title ~~Bismark corrupts unmapped/ambigious reads FASTQ files with no reason~~ Multicore Bismark corrupts unmapped/ambigious reads FASTQ files with no reason Jun 10, 2022

iromeo changed the title ~~Multicore Bismark corrupts unmapped/ambigious reads FASTQ files with no reason~~ Multicore Bismark corrupts unmapped/ambigious reads FASTQ files for no reason Jun 10, 2022

iromeo changed the title ~~Multicore Bismark corrupts unmapped/ambigious reads FASTQ files for no reason~~ Multicore Bismark could corrupt unmapped/ambigious reads FASTQ files for unknown reason Jun 15, 2022

FelixKrueger closed this as completed Nov 10, 2022

tamuanand mentioned this issue May 27, 2023

Failed to close filehandle messages with bismark --parallel #587

Closed
