Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in msa2vcf ? #232

Open
wwood opened this issue Mar 31, 2023 · 4 comments
Open

Bug in msa2vcf ? #232

wwood opened this issue Mar 31, 2023 · 4 comments

Comments

@wwood
Copy link

wwood commented Mar 31, 2023

Subject of the issue

Output seems incorrect.

Your environment

On Linux, see output for version info

Steps to reproduce

With this input:

$ head ~/t/eg.aln ~/t/eg_ref.fna
==> /home/woodcrob/t/eg.aln <==
>ref
AGGATTACTCGCTTTTTGTTAGTCATCGGTATCTGGACCCTGATAGCTGTTGCTTCCCAC
>mut
AGCATTACT-GCTTTTTGTTAGTCATCGGTATCTGGACCCCGATAGCTGATGCTTCCCAC

==> /home/woodcrob/t/eg_ref.fna <==
>ref
AGGATTACTCGCTTTTTGTTAGTCATCGGTATCTGGACCCTGATAGCTGTTGCTTCCCAC

and running

$ java -jar /mnt/hpccs01/home/woodcrob/bioinfo/jvarkit/dist/jvarkit.jar msa2vcf -R ~/t/eg_ref.fna ~/t/eg.aln 
[INFO][MsaToVcf]format : Fasta
##fileformat=VCFv4.2
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##contig=<ID=/home/woodcrob/t/eg_ref.fna,length=60>
##msa2vcf.meta=compilation:20230331082601 githash:8e5f425de htsjdk:3.0.4 date:20230331125743 cmd:-R /home/woodcrob/t/eg_ref.fna /home/woodcrob/t/eg.aln
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  mut     ref
/home/woodcrob/t/eg_ref.fna     3       .       C       G       .       .       DP=2    GT:DP   0/0:1   1/1:1
/home/woodcrob/t/eg_ref.fna     9       .       TC      T       .       .       DP=2    GT:DP   1/1:1   0/0:1
/home/woodcrob/t/eg_ref.fna     41      .       C       T       .       .       DP=2    GT:DP   0/0:1   1/1:1
/home/woodcrob/t/eg_ref.fna     50      .       A       T       .       .       DP=2    GT:DP   0/0:1   1/1:1

The first 2 are right, but the 2nd two aren't - they are T->C not C->T, and T->A not A->T. Did I get that right? Here's the blast output to help visualise

REF| Query  1   AGGATTACTCGCTTTTTGTTAGTCATCGGTATCTGGACCCTGATAGCTGTTGCTTCCCAC  60
                || |||||| |||||||||||||||||||||||||||||| |||||||| ||||||||||
ALT| Sbjct  1   AGCATTACT-GCTTTTTGTTAGTCATCGGTATCTGGACCCCGATAGCTGATGCTTCCCAC  59

Otherwise this tools seems like exactly what we need, so any help much appreciated.
Thanks, ben

@wwood
Copy link
Author

wwood commented Mar 31, 2023

FWIW, it seems like that switch happens after an indel, as observed with a second example not shown.

@lindenb
Copy link
Owner

lindenb commented Mar 31, 2023

Hi,
I think there is a misunderstanding here:

when using a fasta as input, the REFERENCE among the fasta records must be specified using option -R. Otherwise the most frequent base is used as ref.

then option -f is used to save the consensus:

    -f, --fasta
      save computed fasta sequence in this file.

@lindenb
Copy link
Owner

lindenb commented Mar 31, 2023

furthermore, you should also have a look at : https://github.com/sanger-pathogens/snp_sites

@wwood
Copy link
Author

wwood commented Mar 31, 2023

Hi,

Thanks for the quick response. You're right, I hadn't quite appreciated about the consensus. From the help it says

    -R, --REF
      reference name used for the CHROM column. Optional
      Default: chrUn

So as I understand, that only changes what appears in the first column, and doesn't impact what the sequence of the reference is. Are you saying it also changes the consensus?

Anyway, I tried to workaround by adding the same ref sequence twice, slightly changing the name of one to maintain uniqueness. Then finally the non-ref sequence. So the consensus should always be the reference.

That worked for the above example, but it seemed to do something unexpected when the reference has gaps. Specifically, including gaps, the input sequences were 5002 bp, and 5000 bp not including them. I was expecting the consensus to be 5000 bp, but it comes out as 5002 bp. Is that expected behaviour? If you don't quite understand I can cook up a reduced example as above?

snp_sites is what I tried first, but that doesn't seem to have any option to report INDELs, only SNPs. Maybe there's something I missed?

Anyway, happy to help further here if helpful for you, but I just decided to write this code myself, since I only have 2 sequences for my use-case. Let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants