Kamil S. Jaron edited this page Aug 7, 2023 · 9 revisions
1. What are heterozygous kmers? Are these kmer pairs with a close but not perfect sequence match?

Yes. Right now heterozygous kmers are those that:

  • are exactly one SNP from each other (for instance AATCA ACTCA)
  • form a unique pair (i.e. there are no other kmers one SNP away from them; for instance ATGATCA ATGCTCA ATGGTCA would be discarded because there are three of them, not two)

This way we heavily subsample the genome, but so far this has been sufficient to sample enough heterozygous kmers to see the genome structure.
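The two criteria above can be sketched in a few lines of Python. This is only an illustration of the pairing rule, not the code smudgeplot actually runs: the function names `one_snp_neighbours` and `unique_het_pairs` are made up for this example, and it ignores canonical kmers, coverage thresholds and everything else needed for real data.

```python
def one_snp_neighbours(kmer, alphabet="ACGT"):
    """Yield every sequence that differs from `kmer` by exactly one substitution."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def unique_het_pairs(kmers):
    """Return kmer pairs that are one SNP apart and have no other one-SNP neighbour."""
    kmers = set(kmers)
    # For every kmer, list the observed kmers exactly one substitution away.
    neighbours = {k: [n for n in one_snp_neighbours(k) if n in kmers] for k in kmers}
    pairs = set()
    for kmer, found in neighbours.items():
        # Keep the pair only if both members have exactly one neighbour (each other).
        if len(found) == 1 and len(neighbours[found[0]]) == 1:
            pairs.add(tuple(sorted((kmer, found[0]))))
    return sorted(pairs)
```

With the two examples from the list, `unique_het_pairs(["AATCA", "ACTCA", "ATGATCA", "ATGCTCA", "ATGGTCA"])` keeps only the AATCA / ACTCA pair, because each of the three 7-mers has two neighbours.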

2. Are you equating kmer with haplotype? If not, how do you infer haplotypes?

I assume that at least some of the kmers will be heterozygous - i.e. one kmer of the pair is from one haplotype and the other kmer is from the other. If the ploidy is higher, like triploid, we expect to find locations that have the same kmer in two of the haplotypes and one SNP difference in the third haplotype. If the genome is tetraploid, we will detect some pairs where two haplotypes are similar and the other two are too diverged (looking like AB), some cases where one haplotype is diverged but the other three are "triploid-like" (looking like AAB), and finally the majority of heterozygosity will be carried by either AABB or AAAB structures, which will also tell us whether the genome structure is AA'BB' or AA'A''B (i.e. what the branching of the haplotypes is).

3. In the example plot, I suspect that the red dot at x = 1/3 indicates triploidy, but only if two of the subgenomes are more similar to each other relative to the third (which I presume is meant by the AAB label).

Yes. Because we search for unique pairs, a pair that comes from three haplotypes must consist of the same kmer twice (A) and a kmer one SNP away once (B).
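As a rough illustration of where each structure is expected to land on the plot, the sketch below assumes the x axis is the coverage of the rarer kmer divided by the total coverage of the pair, and that every haplotype copy contributes the same coverage. The function name `expected_position` and the `haploid_cov` parameter are hypothetical and exist only for this example.

```python
from fractions import Fraction


def expected_position(structure, haploid_cov=1):
    """Expected (x, total coverage) of a kmer pair with a structure like 'AAB'.

    Assumes x = minor kmer coverage / total pair coverage and equal
    coverage per haplotype copy (an idealisation of real data).
    """
    copies_a = structure.count("A")
    copies_b = structure.count("B")
    total_copies = copies_a + copies_b
    x = Fraction(min(copies_a, copies_b), total_copies)
    return x, total_copies * haploid_cov


for structure in ("AB", "AAB", "AAAB", "AABB"):
    x, total = expected_position(structure)
    print(structure, x, total)
```

Under these assumptions AB and AABB share the same x (1/2) and are separated only by their total coverage (2 versus 4 haploid coverages), while AAB sits at 1/3 and AAAB at 1/4.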

4. If the three genomes of a triploid organism were equidistant to each other, then that blob would move to and fuse with the one labelled 'AB' at x = 0.5 because you are considering pairs. Is that correct?

This could happen, but under slightly different circumstances. Only if one haplotype were very diverged, so that the genome would be ABC (where A and B are similar and C is distant), would we be unable to identify C as the corresponding haplotype, and the smudgeplot would indeed look diploid (because the only kmer pairs one SNP away from each other would be heterozygous kmers between A and B). If all three haplotypes are evenly distant but still close, then the AAB smudge will be a mixture of genomic positions where haplotypes 1 and 2 are the same and 3 has a SNP, where 1 and 3 are the same and 2 is different, and where 2 and 3 are the same and 1 is different. Smudgeplots do not phase haplotypes, so we cannot make strong claims about haplotype divergence for triploids (although I am trying to make some guesses).

5. What are the assumptions of smudgeplots?

We require:

  • single individual (or clonal population)
  • not too high coverage variance (this can be compensated for with higher genome coverage, but for instance a 100x genome with whole genome amplification was not enough)
  • low sequencing error rates (<1%)

However, the method should be robust to:

  • contamination (to a certain extent)
  • genome subsampling (it should work on exon capture etc)
  • quite some coverage variance if the coverage is sufficient (I see a signal in tardigrade data, which was a messy sequencing run too)
  • presence of adapters (because they are high frequency kmers, but this was not verified yet)

Most importantly, we do NOT use a reference genome, de novo genome assembly, mapping or base quality scores for the inference. We are therefore free of all the downstream biases that come from those steps.

5. How much coverage do I need?

More than ~50x, but as much as you can get. For details, check the wiki page "everything about input data".

6. What does a smudgeplot of a haploid species look like?

The interpretation of a haploid genome smudgeplot would be very similar to the interpretation of smudgeplots of completely homozygous diploid genomes: all the real genomic kmer pairs represent genomic paralogs, so the smudgeplot would be a visualisation of the paralog structure of the genome. Keep in mind for the interpretation that only kmers that are 1 nucleotide from each other are considered as pairs, which means that more diverged paralogs won't be detected.

7. Can I run smudgeplot on long reads?

You certainly can run it on PacBio HiFi reads. Corrected long reads (as output by Canu) are not as good, but if you have sufficient coverage it is very likely you will be able to fit genomescope to the k-mer spectrum. To decide whether there is a point in running smudgeplot, look at the k-mer spectrum: does it have meaningfully distinct genomic peaks? Are they at least more or less separated from the sequencing errors? If the answer to both is yes, then smudgeplot should work as well.
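If you prefer a number to eyeballing the histogram, a crude check could look like the sketch below. It is not part of smudgeplot or genomescope; the function name `error_valley_and_genomic_peak` and the input layout (a plain list where position `i` holds the number of distinct kmers seen `i + 1` times) are assumptions made for this example. It finds the first valley after the error peak and the tallest point to the right of it.

```python
def error_valley_and_genomic_peak(hist):
    """Locate the error valley and the main genomic peak in a kmer spectrum.

    hist[i] is the number of distinct kmers observed exactly i + 1 times.
    Returns (valley_coverage, peak_coverage, peak_to_valley_ratio), or None
    when the spectrum only decreases (no genomic peak separated from errors).
    """
    # Walk right while the counts keep dropping: this is the error tail.
    valley = 0
    while valley + 1 < len(hist) and hist[valley + 1] < hist[valley]:
        valley += 1
    if valley + 1 >= len(hist):
        return None
    # The tallest bin to the right of the valley is taken as the genomic peak.
    peak = max(range(valley + 1, len(hist)), key=hist.__getitem__)
    return valley + 1, peak + 1, hist[peak] / max(hist[valley], 1)
```

A high peak-to-valley ratio suggests the genomic kmers are separated from the errors; what counts as "high enough" is a judgement call, and a noisy real spectrum may need smoothing before a simple scan like this gives a sensible valley.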

8. How are individual smudgeplot versions named?

Window types in order listed here.

9. Can I use smudgeplot on reduced representation sequencing?

RAD-seq, exon sequencing, or any other form of reduced representation sequencing is unfortunately difficult to process. These datasets show massive coverage variations and biases that usually generate a very misleading picture. In principle, with PCR and optical duplicates removed and with careful consideration of the data, one could use the same principles to determine ploidy, but it won't work out of the box with smudgeplot and is not supported by the current version.
