Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools-1.19 csq: output is not compatible with previous --local-csq format #2073

Open
mmokrejs opened this issue Jan 9, 2024 · 1 comment

Comments

@mmokrejs
Copy link

mmokrejs commented Jan 9, 2024

Hi Peter,
I wrote a simple parser for the consequences output by the --local-csq code. It turned out the haplotype-aware code outputs way different format. I am guessing there are some pointers to previous lines (see e.g. @988 below) which make processing harder as I would need to cache previous consequence (e.g. missense value). But it also seems two resulting consequences are on a single line. So in summary, the --local-csq has output everything (actualy three) consenquences on a single line. The haplotype aware caller has ouput 7 consequences spanning 3 lines if I am counting correctly (yes, after reformatting through bcftools +split-vep).

$ bcftools csq -c CSQ -f ref.fasta -g ref.gff3 -p m -n 150 -s - testcase.sorted.vcf' | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1' | bcftools +split-vep -f '%CHROM\t%POS\t%Consequence\t%amino_acid_change\t%dna_change\t%DP\t%ADF\t%ADR\t%AD\t%CSQ\t[ %VAF]\t[ %VAF1]\n' | grep
-v '^#' | head
Parsing ref.gff3 ...
Indexed 1 transcripts, 1 exons, 1 CDSs, 0 UTRs
Calling...
Warning: fewer CSQ fields than expected at ref:989, filling with dots. This warning is printed only once.
ref     988     missense        4T>4H   988A>C+989C>A+990A>C    641908  320881,23,14,8  0,0,0,0 320881,23,14,8  missense|SOMEGENE|12345678|protein_coding|+|4T>4H|988A>C+989C>A+990A>C  7.16676e-05,4.36238e-05,2.49279e-05     0.000140219
ref     989     @988    .       .       641908  320884,25,12,7  0,0,0,0 320884,25,12,7  @988     7.78991e-05,3.73916e-05,2.18117e-05     0.000137102
ref     990     @988    .       .       641466  319802,815,73,44        0,0,0,0 319802,815,73,44        @988     0.00254105,0.000227603,0.000137185      0.00290583
ref     991     missense        5A>5Y   991G>T+992C>A+993T>C    641720  320684,100,34,43        0,0,0,0 320684,100,34,43        missense|SOMEGENE|12345678|protein_coding|+|5A>5Y|991G>T+992C>A+993T>C  0.000311661,0.000105965,0.000134014     0.000551641
ref     992     @991    .       .       641550  320685,82,6,4   0,0,0,0 320685,82,6,4   @991     0.000255629,1.87046e-05,1.24697e-05     0.000286804
ref     993     @991    .       .       641138  320448,37,78,8  0,0,0,0 320448,37,78,8  @991     0.000115419,0.000243316,2.49555e-05     0.00038369
ref     994     missense        6S>6V   994A>G+995G>T+996C>A    641516  320604,48,92,16 0,0,0,0 320604,48,92,16 missense|SOMEGENE|12345678|protein_coding|+|6S>6V|994A>G+995G>T+996C>A  0.000149645,0.000286819,4.98815e-05     0.000486345
ref     995     @994    .       .       641772  320742,133,10,3 0,0,0,0 320742,133,10,3 @994     0.000414475,3.11635e-05,9.34906e-06     0.000454987
ref     996     @994    .       .       641018  320371,115,13,12        0,0,0,0 320371,115,13,12        @994     0.000358802,4.05602e-05,3.74402e-05     0.000436802
ref     997     missense        7H>7S   997C>T+998A>C+999T>G    641906  320448,26,38,6  0,0,0,0 320448,26,38,6  missense|SOMEGENE|12345678|protein_coding|+|7H>7S|997C>T+998A>C+999T>G  8.11187e-05,0.000118558,1.87197e-05     0.000218396

The former --local-csq output was:

$ bcftools csq -c CSQ -f ref.fasta -g ref.gff3 -p m -n 150 -l testcase.sorted.vcf | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1' | bcftools +split-vep -f '%CHROM\t%POS\t%Consequence\t%amino_acid_change\t%dna_change\t%DP\t%ADF\t%ADR\t%AD\t%CSQ\t[ %VAF]\t[ %VAF1]\n' | grep -v '^#' | head
Parsing ref.gff3 ...
Indexed 1 transcripts, 1 exons, 1 CDSs, 0 UTRs
Calling...
ref     988     missense,missense,missense      4T>4P,4T>4A,4T>4S       988A>C,988A>G,988A>T    641908  320881,23,14,8  0,0,0,0 320881,23,14,8  missense|SOMEGENE|12345678|protein_coding|+|4T>4P|988A>C,missense|SOMEGENE|12345678|protein_coding|+|4T>4A|988A>G,missense|SOMEGENE|12345678|protein_coding|+|4T>4S|988A>T    7.16676e-05,4.36238e-05,2.49279e-05     0.000140219
ref     989     missense,missense,missense      4T>4K,4T>4I,4T>4R       989C>A,989C>T,989C>G    641908  320884,25,12,7  0,0,0,0 320884,25,12,7  missense|SOMEGENE|12345678|protein_coding|+|4T>4K|989C>A,missense|SOMEGENE|12345678|protein_coding|+|4T>4I|989C>T,missense|SOMEGENE|12345678|protein_coding|+|4T>4R|989C>G    7.78991e-05,3.73916e-05,2.18117e-05     0.000137102
ref     990     synonymous,synonymous,synonymous        4T,4T,4T        990A>C,990A>G,990A>T    641466  319802,815,73,44        0,0,0,0 319802,815,73,44        synonymous|SOMEGENE|12345678|protein_coding|+|4T|990A>C,synonymous|SOMEGENE|12345678|protein_coding|+|4T|990A>G,synonymous|SOMEGENE|12345678|protein_coding|+|4T|990A>T       0.00254105,0.000227603,0.000137185      0.00290583
ref     991     missense,missense,missense      5A>5S,5A>5T,5A>5P       991G>T,991G>A,991G>C    641720  320684,100,34,43        0,0,0,0 320684,100,34,43        missense|SOMEGENE|12345678|protein_coding|+|5A>5S|991G>T,missense|SOMEGENE|12345678|protein_coding|+|5A>5T|991G>A,missense|SOMEGENE|12345678|protein_coding|+|5A>5P|991G>C    0.000311661,0.000105965,0.000134014     0.000551641
ref     992     missense,missense,missense      5A>5D,5A>5G,5A>5V       992C>A,992C>G,992C>T    641550  320685,82,6,4   0,0,0,0 320685,82,6,4   missense|SOMEGENE|12345678|protein_coding|+|5A>5D|992C>A,missense|SOMEGENE|12345678|protein_coding|+|5A>5G|992C>G,missense|SOMEGENE|12345678|protein_coding|+|5A>5V|992C>T    0.000255629,1.87046e-05,1.24697e-05     0.000286804
ref     993     synonymous,synonymous,synonymous        5A,5A,5A        993T>C,993T>A,993T>G    641138  320448,37,78,8  0,0,0,0 320448,37,78,8  synonymous|SOMEGENE|12345678|protein_coding|+|5A|993T>C,synonymous|SOMEGENE|12345678|protein_coding|+|5A|993T>A,synonymous|SOMEGENE|12345678|protein_coding|+|5A|993T>G       0.000115419,0.000243316,2.49555e-05     0.00038369
ref     994     missense,missense,missense      6S>6G,6S>6R,6S>6C       994A>G,994A>C,994A>T    641516  320604,48,92,16 0,0,0,0 320604,48,92,16 missense|SOMEGENE|12345678|protein_coding|+|6S>6G|994A>G,missense|SOMEGENE|12345678|protein_coding|+|6S>6R|994A>C,missense|SOMEGENE|12345678|protein_coding|+|6S>6C|994A>T    0.000149645,0.000286819,4.98815e-05     0.000486345
ref     995     missense,missense,missense      6S>6I,6S>6N,6S>6T       995G>T,995G>A,995G>C    641772  320742,133,10,3 0,0,0,0 320742,133,10,3 missense|SOMEGENE|12345678|protein_coding|+|6S>6I|995G>T,missense|SOMEGENE|12345678|protein_coding|+|6S>6N|995G>A,missense|SOMEGENE|12345678|protein_coding|+|6S>6T|995G>C    0.000414475,3.11635e-05,9.34906e-06     0.000454987
ref     996     missense,synonymous,missense    6S>6R,6S,6S>6R  996C>A,996C>T,996C>G    641018  320371,115,13,12        0,0,0,0 320371,115,13,12        missense|SOMEGENE|12345678|protein_coding|+|6S>6R|996C>A,synonymous|SOMEGENE|12345678|protein_coding|+|6S|996C>T,missense|SOMEGENE|12345678|protein_coding|+|6S>6R|996C>G     0.000358802,4.05602e-05,3.74402e-05     0.000436802
ref     997     missense,missense,missense      7H>7Y,7H>7N,7H>7D       997C>T,997C>A,997C>G    641906  320448,26,38,6  0,0,0,0 320448,26,38,6  missense|SOMEGENE|12345678|protein_coding|+|7H>7Y|997C>T,missense|SOMEGENE|12345678|protein_coding|+|7H>7N|997C>A,missense|SOMEGENE|12345678|protein_coding|+|7H>7D|997C>G    8.11187e-05,0.000118558,1.87197e-05     0.000218396

Would it be possible to mimic the previous output format?

I don't understand the dots in the 4th and 5th column of the haplotype-aware results and how to work with the supposedly "next data" section in columns 10 to 12. I know this is a TSV output from bcftools +split-vep but I hope you get my point. You have this particular testcase data in your email from yesterday. Am I just misunderstanding the output or have unreal expectations of the output format?

@pd3
Copy link
Member

pd3 commented Jan 29, 2024

There seems to be some confusion. Running with --local-csq or without can give different outputs, and that's a feature, not a bug. In the haplotype-aware mode multiple variants can be part of the same consequence prediction ("compound variant"). The program outputs only one record detailing the compound variant and will print pointers to this line for the rest. There is no plan to change this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants