Phased human genomes #22

Closed
LRizzardi1 opened this issue Aug 28, 2018 · 4 comments

@LRizzardi1

I was wondering how SNPsplit would handle phased human genomes. I didn't see anything in the documentation that seemed to take it into account but I could've missed it. Thanks!

@FelixKrueger
Owner

Hi Lindsay,

If you have phased SNP information for human data, SNPsplit should work in pretty much the same way. There are currently two ways of arriving at the point where you can align the data to the N-masked genome, use SNPsplit, and then carry on with your downstream analyses:

  1. You could modify the SNPsplit_genome_preparation script to work with your VCF file. This would probably require changes in a few places, but I have done this for phased human data myself before. If the chromosomes are not named in the same way as they are in the mouse genomes VCF files, I believe one main change was to set the list of chromosomes the script uses to:
    # HUMAN GENOME 
    @chroms = (1..22,'X','Y','MT');

I would be happy to get this to work for you if you could supply a copy of your VCF file (because they all look different...).
This option would have the advantage that it will generate an N-masked genome as well as the SNP file which is required later on for the SNPsplit processing itself.

  2. The other option is that you prepare the N-masked genome as well as the SNP file for SNPsplit in some other way yourself (and I am afraid you would be on your own there).
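
Before editing the chromosome list for the first option, it may help to check how chromosomes are actually named in your VCF file. The following is only a rough sketch, not part of SNPsplit: the function names `chromosome_names` and `unexpected_chromosomes` are made up for illustration, and the expected list simply mirrors the `@chroms` line above.

```python
# Expected human chromosome names, mirroring @chroms = (1..22,'X','Y','MT')
HUMAN_CHROMS = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def chromosome_names(vcf_lines):
    """Return the distinct chromosome names of a VCF, in order of first appearance."""
    seen = []
    known = set()
    for line in vcf_lines:
        line = line.strip()
        # Skip blank lines and VCF header lines
        if not line or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in known:
            known.add(chrom)
            seen.append(chrom)
    return seen


def unexpected_chromosomes(vcf_lines, expected=HUMAN_CHROMS):
    """Return chromosome names in the VCF that are not in the expected list."""
    expected_set = set(expected)
    return [c for c in chromosome_names(vcf_lines) if c not in expected_set]
```

If this reports names such as `chr1` instead of `1`, either the chromosome list in the script or the names in the VCF file would need to be adjusted so that they agree.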

Once the N-masked genome has been generated, you can:

  1. index it with either Bismark, Bowtie2, HISAT2 etc.,
  2. run your alignments, and then
  3. use SNPsplit on the resulting BAM file(s).
  4. In case of bisulfite data, you would have to run deduplicate_bismark on the files split up by SNPsplit, and then
  5. run the bismark_methylation_extractor on them.

Just as a comment, the information of the phased genome is preserved in a way, because the SNP given as REF will be used as Genome 1, and ALT is used as Genome 2. I hope this helps?! Felix
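
To illustrate the REF/ALT convention described above, here is a minimal sketch. It is not SNPsplit's actual implementation: the function names and the labels `unassigned` and `conflicting` are chosen here for illustration only, and the sketch ignores bisulfite conversion and base qualities.

```python
def assign_base(read_base, ref, alt):
    """Label one read base at a SNP position: REF -> genome1, ALT -> genome2."""
    base = read_base.upper()
    if base == ref.upper():
        return "genome1"
    if base == alt.upper():
        return "genome2"
    return None  # base matches neither allele (e.g. N or a sequencing error)


def assign_read(calls):
    """Label a read from its (read_base, ref, alt) tuples at all covered SNPs."""
    labels = {assign_base(base, ref, alt) for base, ref, alt in calls}
    labels.discard(None)
    if not labels:
        return "unassigned"   # no informative SNP position on this read
    if len(labels) > 1:
        return "conflicting"  # SNP positions disagree about the genome of origin
    return labels.pop()
```

For example, `assign_read([("T", "T", "G")])` yields `genome1`, while a read carrying the ALT base at that position would be labelled `genome2`.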

@cloudred20

Hi,
I'm studying allele-specific methylation in human cancer cell lines and would like to prepare hg19 using SNPsplit_genome_preparation and the following VCF file: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-All.vcf.gz.

  1. What are the major changes I will have to make in the SNPsplit_genome_preparation script?
  2. Also, the Bismark user guide mentions that deduplication is not recommended for RRBS. But as you've commented above, is it necessary when using SNPsplit?

@FelixKrueger
Owner

Hi @Megha20 ,

  • To 1:
    I downloaded the VCF file you linked, and took a quick look. It would appear that this is a list of all SNPs annotated in dbSNP? Before you get started with this endeavour, I would like to make sure that a few points are clear. SNPsplit is not a generalised SNP tool that will work for all situations, but is rather meant to discriminate reads if both parental genotypes are known. You could possibly still use all dbSNP SNPs (as long as they are clearly defined), and look at cancer cell lines. Since the genomes are not phased, the only thing you could look at would be allelic imbalance (in expression, ChIP-Seq binding or whatever you want to look at), but you can't truly assign reads to parental alleles.

If you wanted to go ahead with this, there is good and bad news. The good news is that you don't really have to deal with strains and so on, as you are interested in all of the SNPs. This is, however, also a big problem, as the file you linked has more than 320 million lines! Since this has to be held in memory, such an 'all-dbSNP' approach would consume a HUGE amount of RAM (probably more than 100GB).

You could either change the entire code that looks for high confidence SNPs in the VCF file or write a new script that will simply write out every SNP that has a single REF and a single ALT base into a folder called SNPs_<Strain_name>, and then use the option --skip_filtering:

--skip_filtering              This option skips reading and filtering the VCF file. This assumes that a folder named
                              'SNPs_<Strain_Name>' exists in the working directory, and that text files with SNP information
                              are contained therein in the following format:

                                          SNP-ID     Chromosome  Position    Strand   Ref/SNP
                              example:   33941939        9       68878541       1       T/G
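
A new script of the kind described above could look roughly like the following sketch. It streams the VCF line by line instead of holding it in memory and keeps only records with a single REF and a single ALT base. Several things here are assumptions rather than confirmed SNPsplit behaviour: the function names, writing one tab-separated file per chromosome named `chr<chromosome>.txt`, setting the Strand column to 1 as in the example line, and passing the VCF ID through unchanged (dbSNP IDs carry an `rs` prefix, whereas the example ID is purely numeric). Please check these against what SNPsplit_genome_preparation expects before relying on it.

```python
import gzip
import os

BASES = {"A", "C", "G", "T"}


def single_base_snps(vcf_lines):
    """Yield (chromosome, SNP line) for records with one REF and one ALT base."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, snp_id, ref, alt = fields[:5]
        # Multi-base or multi-allelic entries (e.g. "AT" or "C,G") are skipped
        if ref not in BASES or alt not in BASES:
            continue
        # Columns: SNP-ID, Chromosome, Position, Strand, Ref/SNP
        yield chrom, f"{snp_id}\t{chrom}\t{pos}\t1\t{ref}/{alt}"


def write_snp_folder(vcf_path, strain_name, base_dir="."):
    """Write SNPs_<strain_name>/chr<chromosome>.txt files (naming is an assumption)."""
    out_dir = os.path.join(base_dir, f"SNPs_{strain_name}")
    os.makedirs(out_dir, exist_ok=True)
    opener = gzip.open if vcf_path.endswith(".gz") else open
    handles = {}
    try:
        with opener(vcf_path, "rt") as vcf:
            for chrom, snp_line in single_base_snps(vcf):
                if chrom not in handles:
                    path = os.path.join(out_dir, f"chr{chrom}.txt")
                    handles[chrom] = open(path, "w")
                handles[chrom].write(snp_line + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return out_dir
```

Calling `write_snp_folder("00-All.vcf.gz", "dbSNP")` in the working directory would then create a folder `SNPs_dbSNP`, after which the genome preparation could be attempted with --skip_filtering.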

  • Regarding 2:

Unless you have used a UMI approach for the RRBS, it is indeed recommended not to deduplicate. SNPsplit itself doesn't really care about what you feed it with.

I hope this helps,
Best, Felix
