Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

convert to called genotypes #29

Closed
olivierlabayle opened this issue Dec 14, 2022 · 7 comments
Closed

convert to called genotypes #29

olivierlabayle opened this issue Dec 14, 2022 · 7 comments

Comments

@olivierlabayle
Copy link

Hi,

Sorry for all the feature requests, I'm working a lot with BGEN files and would love to move all of my preprocessing to Julia if possible.

In SnpArrays, there is the possibility to convert the snp array to genotypes and I was wondering if it could be possible to have the same function here using the probabilities contained in the BGEN file, possibly with a "calling" threshold.

Cheers,
Olivier

@kose-y
Copy link
Member

kose-y commented Dec 14, 2022

Sure, we can do that. One thing is that it will not be much faster than you writing a function for that yourself. BGEN format has variant-by-variant compression, and the computational bottleneck is the decompression for each variant.

@kose-y
Copy link
Member

kose-y commented Jan 20, 2023

I need clarification here--are you working on phased data or unphased data? If unphased data, isn't it natural to store a dosage matrix rather than doing genotype calling?

@olivierlabayle
Copy link
Author

My goal eventually would be to convert any probability array returned by probabilities! to a CategoricalVector (or similar if you don't want to depend on CategoricalArrays) of genotypes given a calling threshold. This would be for both phased and unphased data as I'd like to support both in my project.

@kose-y
Copy link
Member

kose-y commented Mar 20, 2023

I'm sorry for not getting back to you earlier. Are you planning to limit the use of it for human data (biallelic, diploid)? If so, it should be relatively simple than what I initially imagined.

I might use some ideas like this:
https://www.cog-genomics.org/plink/2.0/input#dosage_import_settings

@olivierlabayle
Copy link
Author

Sorry, I missed that. Yes at the moment my project is limited to human data!

@kose-y
Copy link
Member

kose-y commented Apr 17, 2023

Since the target is biallelic and diploid, I am adding a function that takes the dosage vector as an input and an integer genotype vector as an output, named hardcall!() and hardcall(). The documentation will be also available. Please note that the first_allele_dosage!() function supports both phased and unphased data. This will be available in v0.1.13. Please let me know if you need further support.

@kose-y
Copy link
Member

kose-y commented Apr 20, 2023

Check out v0.1.14.

@kose-y kose-y closed this as completed May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants