nPhase needs a guide/tutorial for working on large/very heterozygous genomes #13

Open
OmarOakheart opened this issue Oct 13, 2021 · 2 comments

Comments

@OmarOakheart
Owner

Until nPhase makes efficient use of heuristics to drastically speed up prediction, users will run into problems when running it on large genomes. They would benefit from a guide on how to reduce the time it takes to obtain results and how to interpret those results.

@HMPNK

HMPNK commented Jun 22, 2023

Absolutely... Any news on how to cope with this?

Some features that should be added:

-BAM support. I am trying nPhase on a hexaploid plant (haploid genome ~650 Mbp), and nPhase inflates the data enormously. If this goes on, I will run out of disk space:

-rw-rw-r-- 1 309G Jun 22 07:04 hexa.sam
-rw-rw-r-- 1 302G Jun 22 08:23 hexa.pass.sam
-rw-rw-r-- 1 302G Jun 22 10:53 hexa.sorted.header.sam
-rw-rw-r-- 1 231G Jun 22 12:06 hexa.sorted.sam

@OmarOakheart
Owner Author

Hi,

Those are some really large files. I imagine you have very high coverage?

Unfortunately this will require you to make some manual modifications to reduce the computational burden.

My recommendations would be to do the following:

  1. Reduce the coverage of your input files to 10X per haplotype (60X total for a hexaploid).
  2. Run nPhase one chromosome at a time. One way to do this is to use a separate reference FASTA for each chromosome, though there may be better ways.
  3. Save time by using nphase partial and reusing the same VCF file each time (called once on the entire genome).
  4. You can also save time by reusing the same long-read SAM file if you have one with the reads fully mapped to the genome, since nPhase will only look at positions in the VCF file.
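
As a rough illustration of steps 1 and 2, here is a minimal Python sketch. It is not part of nPhase: the function names (`target_total_coverage`, `subsample_fraction`, `split_fasta_by_chromosome`) are hypothetical helpers made up for this example. It assumes you already know your current total coverage and that the sequence names in your reference FASTA are safe to use as file names.

```python
from pathlib import Path


def target_total_coverage(ploidy, per_haplotype=10.0):
    """Total coverage to aim for, e.g. 6 haplotypes * 10X = 60X."""
    return ploidy * per_haplotype


def subsample_fraction(current_coverage, ploidy, per_haplotype=10.0):
    """Fraction of reads to keep to reach the target coverage.

    Returns 1.0 when the data is already at or below the target.
    """
    if current_coverage <= 0:
        raise ValueError("current_coverage must be positive")
    target = target_total_coverage(ploidy, per_haplotype)
    return min(1.0, target / current_coverage)


def split_fasta_by_chromosome(fasta_path, out_dir):
    """Write one FASTA file per sequence; return the paths written.

    Assumption: the first word of each header line is a valid file name.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # A new sequence starts: close the previous output file.
                    if handle is not None:
                        handle.close()
                    parts = line[1:].split()
                    name = parts[0] if parts else f"sequence_{len(written) + 1}"
                    path = out_dir / f"{name}.fasta"
                    handle = open(path, "w")
                    written.append(path)
                if handle is not None:
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return written
```

For a hexaploid sequenced at 240X total, `subsample_fraction(240, 6)` gives 0.25, meaning roughly a quarter of the reads would be kept by whatever downsampling tool you use. Each file returned by `split_fasta_by_chromosome` could then serve as the reference for one per-chromosome run.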

If you'd like, you can email me at omaroakheart@gmail.com and we can set up a call to talk about your use case for nPhase. It's possible that nPhase isn't going to give you the data that you're looking for. For example, it is unlikely to give you a chromosome-scale phasing. But there are things it can do well, like phasing individual genes and regions in a ploidy-agnostic way. It depends on what information you're trying to get.
