Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools mpileup handling of corrupted/truncated data/streaming errors #2177

Closed
kachulis opened this issue May 1, 2024 · 3 comments
Closed

Comments

@kachulis
Copy link

kachulis commented May 1, 2024

bcftools mpileup doesn't seem to check if there were errors while reading read data in. This means that bcftools mpileup can appear to have run successfully, but there may have been an underlying data reading problem. As an example, I created 3 bams and crams as follows:
good.{bam,cram} : these are a small valid bam/cram.
no_eof.{bam,cram}: these are good.{bam,cram} but with the eof markers removed from the end of the file.
truncated_corrupted.{bam,cram}: these are good.{bam,cram} but with a random set of bytes (including the eof bytes) removed from the end of the file.

I then ran these files through both bcftools mpileup and samtools view, using the following script.

#!/usr/bin/env bash

call_bcftools_mpileup() {
	echo "bcftools mpileup " $1
	bcftools mpileup -f ~/references/Homo_sapiens_assembly38.fasta -o /dev/null $1
	echo "returns " $?
	yes '' | head -n2
}

call_samtools_view() {
	echo "samtools view " $1
	samtools view -t ~/references/Homo_sapiens_assembly38.fasta -o /dev/null $1
	echo "returns " $?
	yes '' | head -n2
}

call_bcftools_mpileup good.cram

call_samtools_view good.cram

call_bcftools_mpileup good.bam

call_samtools_view good.bam

call_bcftools_mpileup no_eof.cram

call_samtools_view no_eof.cram

call_bcftools_mpileup no_eof.bam

call_samtools_view no_eof.bam

call_bcftools_mpileup truncated_corrupted.cram

call_samtools_view truncated_corrupted.cram

call_bcftools_mpileup truncated_corrupted.bam

call_samtools_view truncated_corrupted.bam

this results in the following output:

bcftools mpileup  good.cram
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns  0


samtools view  good.cram
returns  0


bcftools mpileup  good.bam
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns  0


samtools view  good.bam
returns  0


bcftools mpileup  no_eof.cram
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
[W::hts_close] EOF marker is absent. The input is probably truncated
returns  0


samtools view  no_eof.cram
[W::hts_close] EOF marker is absent. The input is probably truncated
returns  0


bcftools mpileup  no_eof.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns  0


samtools view  no_eof.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
returns  0


bcftools mpileup  truncated_corrupted.cram
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
returns  0


samtools view  truncated_corrupted.cram
samtools view: error reading file "truncated_corrupted.cram"
returns  1


bcftools mpileup  truncated_corrupted.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
[E::bgzf_read_block] Failed to read BGZF block data at offset 289804 expected 9098 bytes; hread returned 8902
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
returns  0


samtools view  truncated_corrupted.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read_block] Failed to read BGZF block data at offset 289804 expected 9098 bytes; hread returned 8902
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
samtools view: error reading file "truncated_corrupted.bam"
samtools view: error closing "truncated_corrupted.bam": -1
returns  1

A few things to note.

  1. When the EOF marker is missing, but the input data is otherwise fine, both samtools view and bcftools mpileup will print a warning, but return a code of 0, indicating success. This feels a little counterintuitive to me (I would expect an error return code here), but certainly I see the argument for this behavior.
  2. Much more problematically, IMO, when the truncated cram is used, bcftools will not print any warning or error at all, and will still return a code of 0. This is much more of a problem, imo, as this data is clearly corrupted. On this file, samtools does print an error and return a non-zero error code.
  3. The truncated bam behaves a little bit better, in that at least an error message is printed, though bcftools still returns a return code of 0.

My main concern about this is that this mimics the behavior we would see if a totally valid cram or bam was being streamed from a google bucket, and the google endpoint decided to return a 429 (or 404, or some other error code) at some point in the middle of streaming the data. I think that in this scenario bcftools would output a valid pileup file, with no indication that anything was amiss.

@jmarshall
Copy link
Member

Re your note (1), this behaviour comes from the days when many tools producing BAM files did not write EOF markers, so treating its absence as a hard error would have been counter-productive. Perhaps that behaviour could be revisited now — especially for CRAM files, which probably defined an EOF marker block right from the beginning.

It would be interesting to add testing of samtools mpileup to your script too, to check that it still checks all input errors. Apparently the mpileup code was forked into bcftools before samtools/samtools#1379 was applied, and no-one has done a similar audit of the bcftools code.

The good news is that propagating ret in mpileup_reg() and actually checking mpileup_reg()'s return value will probably fix most of this at a stroke, though only for mpileup.

@kachulis
Copy link
Author

kachulis commented May 2, 2024

@jmarshall You are quite right, samtools mpileup does catch this input error.

samtools mpileup  truncated_corrupted.cram
[mpileup] 1 samples in 1 input files
samtools mpileup: error reading from input file
returns  1

samtools mpileup  truncated_corrupted.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[mpileup] 1 samples in 1 input files
[E::bgzf_read_block] Failed to read BGZF block data at offset 289804 expected 9098 bytes; hread returned 8902
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
samtools mpileup: error reading from input file
returns  1

glad to hear it is an easy fix!

@pd3
Copy link
Member

pd3 commented May 24, 2024

Thank you for the report, this is now fixed in 57b9072, the command bcftools mpileup should now exit with a non-zero code. I believe the rest of bcftools is already doing the right thing and throws an error on truncated vcf.gz/bcf files.

@pd3 pd3 closed this as completed May 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants