Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BCF.chrom always returns 1 #25

Open
kaskarn opened this issue Jun 12, 2019 · 3 comments
Open

BCF.chrom always returns 1 #25

kaskarn opened this issue Jun 12, 2019 · 3 comments

Comments

@kaskarn
Copy link

kaskarn commented Jun 12, 2019

Hello,

I am unable to get BCF.chrom to output anything other than 1 (Julia 1.1.0; RHEL 7.6).

I tested this with multiple BCF files. When inspected with bcftools view, these same files show the correct chromosome name, in the expected field:

> bcftools view bcftest.bcf.gz
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype calls">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##contig=<ID=22>
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype call probabilities">
##INFO=<ID=QUAL,Number=1,Type=Float,Description="Imputation Quality">
##INFO=<ID=TAG,Number=1,Type=String,Description="rs_id from annotation files">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.9+htslib-1.9
##bcftools_viewCommand=view bcftest.bcf.gz; Date=Wed Jun 12 02:41:11 2019
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  XXXX
22      16050075        22:16050075:A:G A       G       0.123   PASS    TAG=22:16050075:A:G;AC=0;AN=0   GT:DS:GP        ./.:0:1,0,0
22      16050115        22:16050115:G:A G       A       0.404   PASS    TAG=22:16050115:G:A;AC=0;AN=0   GT:DS:GP        ./.:0:1,0,0

However, BCF.chrom tells a different story:

julia> using GeneticVariation

julia> reader = BCF.Reader(open("bcftest.bcf.gz", "r"))
GeneticVariation.BCF.Reader{IOStream}((0x02, 0x02), GeneticVariation.VCF.Header:
  metainfo tags: fileformat FILTER FORMAT contig INFO
     sample IDs: XXXX, BGZFStreams.BGZFStream{IOStream}(<mode=read>))

julia> record = read(reader)
GeneticVariation.BCF.Record:
   chromosome: 1
     position: 16050075
   identifier: 22:16050075:A:G
    reference: A
    alternate: G
      quality: 0.123
       filter: 1
  information: Tuple{Int64,Any}[(5, "22:16050075:A:G"), (6, 0), (7, 0)]

julia> BCF.chrom(record)
1

julia> BCF.pos(record)
16050075

The issue doesn't seem connected to the actual implementation of BCF.chrom(). Reading straight from the BGZFStream causes the same issue:

julia> reader = BCF.Reader(open("bcftest.bcf.gz", "r"))
GeneticVariation.BCF.Reader{IOStream}((0x02, 0x02), GeneticVariation.VCF.Header:
  metainfo tags: fileformat FILTER FORMAT contig INFO
     sample IDs: XXXX, BGZFStreams.BGZFStream{IOStream}(<mode=read>))

julia> misc = read(reader.stream, Int64) #skip first 8 bytes
115964117068

julia> chrom = read(reader.stream, Int32) %Int + 1 #next 4 bytes are CHR
1

julia> pos = read(reader.stream, Int32) %Int + 1 #next 4 bytes are POS
16050075


Reproducible example attached: bcftest.bcf.gz

@kaskarn
Copy link
Author

kaskarn commented Jun 12, 2019

An update:

This does seem to be a bug in GeneticVariation's handling of BCF files. Long story short, instead of storing the explicit chromosome name like in plaintext VCF format, the CHROM field in BCF files stores indices to the list of 'contig' field headers.

In the example file, there is only one ##contig, with corresponding value 22. Instead of 1, BCF.Chrom should probably return the value specified by the first ##contig, i.e. 22.

Per the VCFv4.3 specs:

6.2.2 Dictionary of contigs
The CHROM field in BCF2 is encoded as an integer offset into the list of ##contig field headers in the VCF header. The offsets begin, like the dictionary of strings, at 0. So for example if in BCF2 the contig value is 10, this indicates that the actual chromosome is the 11th element in the ordered list of ##contig elements [...]

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Jun 12, 2019

Can you test how this is with TranscodingStreams.jl? All the IO packages are to be upgraded to use transcoding streams eventually for their 2.0 releases. If there's an issue with BGZFStreams, it may be better to use this issue as a reason to get a move on with moving GeneticVariation over to using TranscodingStreams, rather than bug hunt in a package we intend to drop eventually.

@kaskarn
Copy link
Author

kaskarn commented Jun 12, 2019

I'm fairly sure the issue involves the handling of correctly-decoded data (looks like I updated my post seconds before you commented), so moving to TranscodingStreams.jl shouldn't move the needle much on this particular problem. If GeneticVariation is still intended as the main package for BCF/VCF handling, I'm happy to contribute PR on both fronts, though

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants