Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools consensus --mark-ins #2124

Open
loieeyet opened this issue Mar 12, 2024 · 2 comments
Open

bcftools consensus --mark-ins #2124

loieeyet opened this issue Mar 12, 2024 · 2 comments

Comments

@loieeyet
Copy link

Hello,

I'm a researcher who has been using bcftools effectively. I was very grateful for the update you provided last time that allowed inserting characters into mark-del. This time, I'd like to discuss the functionality of mark-ins.

Here's an example of a VCF for a single sample:

chr1 786694 - A ATT . PASS AC=- GT 0|1

When running the following command:

samtools faidx Homo_sapiens_assembly38.fasta chr1:786694-786700 | 
bcftools consensus \
--mark-ins lc \ 
--mark-del d \
-H 1 \  # or 2
-s  sample1 \
test.vcf.gz

With H set to 1, the output is ACGCAT,
and with H set to 2, the output becomes AttCGCAT.

In this scenario, I would like the output to be A--CGCAT when H is set to 1. In situations of heterozygosity where one allele has an insertion, I'd like to fill in specific characters to ensure both sequences are of equal length.

Do you have plans to support this feature? If not, I would be very grateful if you could direct me to which part to refer to or modify so that I can implement this feature myself.

@pd3
Copy link
Member

pd3 commented Mar 14, 2024

How would it work if the genotype was, say, triploid or tetraploid, and there were multiple indels of different length? I am struggling to define the behavior for a generalized case.

@loieeyet
Copy link
Author

Considering I was mainly focused on human haploid situations, I realize it's quite a task for you to design a tool that covers generalized cases.

Thinking it over, it seems challenging to add new functionality to the --mark-ins option. What about introducing an option called --pad-insertion CHAR? This would pad sequences at insertion points with a specific CHAR to make lengths consistent, except for the alt allele. It would specifically apply to phased data and require a certain --haplotype N option to be set. For triploid or tetraploid scenarios, this option could take into account the various genotypes, producing a consensus sequence that's appropriately padded. And when dealing with multiple indels of differing lengths, maybe padding according to the longest indel might be a good strategy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants