Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mask reference genome for human genome build 37 using bisulphite fasta file #65

Closed
nmfad opened this issue Jul 13, 2022 · 6 comments
Closed

Comments

@nmfad
Copy link

nmfad commented Jul 13, 2022

Hi All,
Am using the SNPsplit_genome_preparation command from SNPSplit v0.5.0 for human reference build 37 fasta file we use for WGBS analysis.

The SNP file I provided as input is the dbSNP v135 hg19 vcf downloaded from the DBSNP website.

When I run the genome preparation step, the arguments I provided are

  1. --reference_genome - the build 37 fasta file used for WGBS analysis
  2. --nmasking
  3. --vcf - The DBSNP v135 VCF file.
  4. --skip_filtering
  5. --genome_build hg37
  6. --strain HUMAN

The command does work, but at the end the summary says 0N were newly introduced into the reference file. If I take the --skip_filtering option away it seems to not work and throws an error.

Alternatively I also converted the dbSNP v135 VCF to a text file with only the required columns as shown in your example in the readme file and ran it with the 6 arguments above, but I still get the same output - the command works, but does not add any new Ns to the masked reference.

Can you please provide a guide on how this needs to be done for the human genome build37 fasta file used for whole genome bisulphite sequencing analysis.

Also incase no new Ns are added to the human genome fasta reference file, can it still be used to align WGBS data and can the resultant bam be successfully split into 2 alleles without the N masked reference ??

@FelixKrueger
Copy link
Owner

Hi @nmfad

This issue has come up a few times in the past (e/g. here), and I have also done this myself a few times. I should probably restate that SNPsplit is meant to work on hybrid strains where both parental genotypes are know. As a result, the input VCF file supposedly contains either the REF or ALT genotypes in a homozygous manner; so the SNPsplit genome preparation looks for a homozygous change in the ALT strain, e.g. 1/1 or 2/2 etc, and N-masks these positions.

If you wanted to use SNPsplit for the human genome with a dbSNP VCF file, you would be interested in heterozygous positions (0/1, 1/0, 0/2, etc) instead of homozygous calls; as a result, you would indeed have to modify the genome preparation script to follow this new logic, and mask heterozygous positions (maybe with some additional stringency filters) as Ns.

It is also worth reiterating that such an approach doesn't necessarily allow you to split the alignments into a paternal and a maternal copy downstream, as this would require phasing of parental genomes. Not sure if this is possible with dbSNP data at all, but I heard that haplotype phasing could be a very useful approach for this: https://whatshap.readthedocs.io/en/latest/. What you probably can do is look on a per-SNP basis for allelic imbalances or changes in methylation level.

If you need any further help with this I could probably either send you a script where tried to adapt the genome preparation for the H9 cell line, or potentially take a look at your VCF file.

And just briefly with regard to your second question, I am afraid the answer is: No. SNPsplit really requires an N-masking approach for the allele-specific sorting.

@nmfad
Copy link
Author

nmfad commented Jul 15, 2022

Hi Felix,

Thank you so much for your detailed response. Providing an answer to your first 2 paragraphs, I tried running the step with a SNP text file that looks like this.

SNP-ID Chromosome Position Strand Ref/SNP
3883917 M 64 1 C/T
147830800 M 97 1 G/A
144402189 M 125 1 T/C
139684161 M 127 1 T/C
72619361 M 146 1 T/C

I am using SNP split version v0.5.0 and other than changing the second column to contain the word "chr" in the chromosome name, do I need to make any other changes to this file ?? How should I represent the heterozygous genotypes to "behave" like homozygous positions. ??

@FelixKrueger
Copy link
Owner

The SNP file you presented here is the file that is taken for the actual SNPsplit allele-sorting step. By the looks of it this should just work.

What you will also need to do, however, is to change the SNPsplit_genome_preparation to not look for homozygous genotypes, but use heterozygous calls instead (should be somewhere around this area:

# Filtering for genotype. If the genotype is the same as the reference we will move on to the next position
.

Once you get the genome preparation to accept and N-mask the correct positions, it will also print out the correct file for use with SNPsplit later on (after indexing and mapping to the N-masked genome).

@nmfad
Copy link
Author

nmfad commented Jul 19, 2022

Thank you Felix for your kind response. The format I am using of the VCF file to be input for the genome preparation step is given in the link following link
https://github.com/FelixKrueger/SNPsplit/blob/master/SNPsplit_User_Guide.md#full-list-of-options-for-snpsplit_genome_preparation
It specifies that if we dont want the SNPsplit_genome_preparation step to do the filtering, then we can provide it a filtered VCF file inside the a directory called SNPs_strain/<vcf.txt> . This VCF.txt can be of the input with the format we discussed in prior comments. Also to add to your point, this is not the VCF I will be using for splitting the genome bams once aligned to BISMARCK. I will be using HET VCFs from the patient's WGS data for splitting the genome bams. Am only using the filtered DBSNP vcf to create the Nmasked reference. I hope that makes sense,

here is my filtered VCF file format
SNP-ID Chromosome Position Strand Ref/SNP
example: 33941939 9 68878541 1 T/G

On trying the SNPsplit_genome_preparation with the arguments I specified in my initial comment #1, the script does run successfully, but it gives me a summary as follows
"Summary
0 Ns were newly introduced into the N-masked genome for strain HUMAN in total"

Now with regard to changing the genotype, as you can see this filtered VCF file does not have a genotype column. Can I simply add a GT column and make it look homozygous but in the true sense, its actually het ? So i would not need to change the change the code, just make the genotype HOM even though the variant is not ?? Because changing the code in one location would require making changes to many places and I want to make sure I do this correctly.

@FelixKrueger
Copy link
Owner

Hmm, the file format you link there is in fact the SNP output from the SNPsplit genome preparation, which you would use for the SNPsplit step itself later on.

At least in its current form, the genome preparation expects a VCF file (more specifically this one: https://github.com/FelixKrueger/SNPsplit/blob/master/SNPsplit_User_Guide.md#filtering-and-processing-high-confidence-snps-from-the-vcf-file), which would look something like this:

CHROM	POS	REF	ALT	CAST_EiJ
1	3000023	C	A	1/1:15:4:0:79,15,0:67,12,0:2:24:4:0,0,4,0:0:-0.556411:.:0
1	3000126	G	T	0/0:.:5:0:.,.,.:.,.,.:2:47:4:1,0,4,0:0:-0.556411:.:0
1	3000185	G	T	1/1:43:12:0:276,43,0:255,36,0:2:54:12:0,0,10,2:0:-0.680642:.:1

As your file seems to contain the chromosome, position, REF and ALT bases already, you should in theory also be able to change the script to use these values directly (you could just set $fi to 1 or something. If you can't get this to work you could maybe send me the file that contains the positions you want to mask out, and I could take a look for you. Just as a side note: not sure how many positions are present in your file, but I would make sure that it's not hundreds of millions as you ill otherwise incorporate a LOT of Ns, and if your individual samples contain several Ns per read you might run into issues where the genotypes don't match up, and reads will then be considered conflicting - and discarded.

@FelixKrueger
Copy link
Owner

Haven't heard back in a while, closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants