Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversion from standard VCF to Local-alleles VCF #2105

Open
jkbonfield opened this issue Feb 20, 2024 · 3 comments
Open

Conversion from standard VCF to Local-alleles VCF #2105

jkbonfield opened this issue Feb 20, 2024 · 3 comments

Comments

@jkbonfield
Copy link
Contributor

jkbonfield commented Feb 20, 2024

This is a request for improved documentation.

Doing a hunt through the documentation I find local alleles mentioned in one section only:

$ egrep -n -i "local.*allele" ../bcftools/doc/*.txt
2008:*-L, --local-alleles* 'INT'::
2012:    and alternate alleles. The *-L, --local-alleles* option allows replacement of such tags
2013:    with a localized tag (FORMAT/LPL) which only includes a subset of alternate alleles relevant

This is the -L option of bcftools merge. The usage for this claims:

    -L, --local-alleles INT           If more than INT alt alleles are encountered, drop FMT/PL and output LAA+LPL instead; 0=unlimited [0]

It's unclear how we can use this as a conversion tool for an existing file. I can see two possible routes.

  1. Take an existing many-sample VCF and split into potentially thousands of per-sample files, and then merge them back together again. This seems excessive and probably extremely time consuming. Plus unlike samtools, there's no bcftools split subcommand, so I assume this would be many passes through the data with bcftools view -S which feels like it'll end up be O(N^2) complexity in the number of samples. Or am I missing a split command under a different name? [Edit: yes - see below. I found a plugin later on]

  2. Merging the existing VCF with an empty VCF. I'm unsure of how to go about this.

I tried method 2 initially as it seemed the most viable, but immediately hit an error.

$ echo '##fileformat=VCFv4.2' > _
$ bcftools merge -L 0 /nfs/users/nfs_j/jkb/scratch/data/gnomad-g1k.10k.vcf _
Error: "--local-alleles 0" makes no sense, expected value bigger or equal than 1

Yet the usage statement claims 0 means unlimited. (And confusingly implies it's the default via [0], although it has no default as it's a required argument.)

Not deterred, I soldiered on and took it at its word and went with 1.

[Incidentally - a feature request. Please permit tools to work on uncompressed data when we're reading the entire file. It's tedious to have to compress and index everything when I know everything is in the same order already.]

I also hit problems with not being sure what a valid empty VCF file looks like. Is there a command to take the header from an existing VCF and strip off all the samples, so we get an empty one? I did it with egrep and echoes in the end.

After compressing and indexing that, my merge now produces LAA in the header, but as far as I can see it's totally absent in the output file.

So am I correct in assuming now that it's not possible by my guess number 2 above, and the only route to producing LAA is a full sample by sample split and then a remerge back together again?

The back story to all of this is I just want to be able to produce a baseline from an existing merged file that's not using LAA to see how much better it can get, before looking into other formats and other tools. Help would be appreciated, and I'm sure it's not just beneficial to me (hence why putting the issue here rather than direct email).

@jkbonfield
Copy link
Contributor Author

Ah sorry, I just found bcftools plugin split, so that answers the "how to split" question.

It's still not clear if this is the only way to convert from normal to local-allele though.

@pd3
Copy link
Member

pd3 commented Feb 20, 2024

It's unclear how we can use this as a conversion tool for an existing file. I can see two possible routes.

1. Take an existing many-sample VCF and split into potentially thousands of per-sample files, and then merge them back together again.  This seems excessive and probably extremely time consuming.  Plus unlike samtools, there's no `bcftools split` subcommand, so I assume this would be many passes through the data with `bcftools view -S` which feels like it'll end up be O(N^2) complexity in the number of samples.  Or am I missing a split command under a different name?

2. Merging the existing VCF with an empty VCF.  I'm unsure of how to go about this.

Currently we have no command to convert between PL and LPL. The intention was to provide the functionality where it is actually needed and avoid producing such monster sites, which happens at the merge step.

Is the purpose of your experimentation just testing or does it come from a real world scenario? In any case, a dedicated program, or maybe an option in view or convert would be the way to go about it. Splitting would be a very heavy solution, given this is usually an issue in files with tens of thousands samples.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Feb 20, 2024

Well it's chicken and egg.

I don't know of any publically available many-sample data files using LPL. I do know of ones using normal PL.

I agree the answer is not to start from here, but here is where I'm at! If you know of public data with LPL then feel free to point me towards it. Thanks.

I'm testing the split method anyway as this is really just an experiment to see how this works rather than something for a pipeline. One feature request is +split needs a --threads option. It's dominated by bgzf compression time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants