This tool extracts heterozygous kmer pairs from kmer dump files and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovA / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc.
Smudgeplots are computed from raw or even better from trimmed reads and show the haplotype structure using heterozygous kmer pairs. For example:
Every haplotype structure has a unique smudge on the graph and the heat of the smudge indicates how frequently the haplotype structure is represented in the genome compared to the other structures. The image above is an ideal case, where the sequencing coverage is sufficient to beautifully separate all the smudges, providing very strong and clear evidence of triploidy.
This tool is planned to be a part of GenomeScope in the near future.
You need a program for counting kmers installed, such as KMC and you should definitely run as well GenomeScope (a classical kmer spectra analysis). It's not rare that both GenomeScope and smudgeplot are needed to make a sense out of the sequencing data.
To run smudgpelot, you will need
python3 with couple of pretty standard packages and
- Download this repository
git clone https://github.com/KamilSJaron/smudgeplot cd smudgeplot
- Install the R package that is needed for plotting. For the installation you will need R package
devtools, the rest of the dependencies is installed automatically.
- install two scripts
install -C exec/smudgeplot.py /usr/local/bin install -C exec/smudgeplot_plot.R /usr/local/bin
Congratulations, you should have
Just to sure, check that the smudgeplot script works
Running smudgeplot v0.2.0 is expected to be printed.
If the installation procedure does not work, if you encounter any other problem, or if you would like to get help with the interpretation of your smudgeplot, please open an issue.
The input is a set of whole genome sequencing reads, the more coverage the better. The method is designed to process big datasets, don't hesitate to pull all single-end/pair-end libraries together.
The workflow is automatic, but it's not fool-proof. It requires some decisions. The usage is shown here is using KMC and GenomeScope. We provide also a real data tutorial on our wiki: strawberry genome analysis. If you are interested in running Jellyfish instead of KMC, look at the manual of smudgeplot with Jellyfish.
Give KMC all the files with trimmed reads to calculate kmer frequencies and then generate a histogram of kmers:
mkdir tmp ls *.fastq.gz > FILES # kmer 21, 16 threads, 64G of memory, counting kmer coverages between 1 and 10000x kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmer_counts tmp kmc_tools transform kmer_counts histogram kmer_k21.hist -cx10000
-k is the kmer length,
-m is the approximate amount of RAM to use in GB (1 to 1024),
-ci<value> excludes kmers occurring less than <value> times,
-cs is the maximum value of a counter,
FILES is a file name with a list of input files,
kmer_counts is the output file name prefix,
tmp is a temporary directory, and
-cx<value> is the maximum value of counter to be stored in the histogram file.
The next step is to extract genomic kmers using reasonable coverage thresholds. You can either inspect the kmer spectra and choose the L (lower) and U (upper) coverage thresholds via visual inspection, or you can estimate them using command
smudgeplot.py cutoff <kmer.hist> <L/U>. Then, extract kmers in the coverage range from
kmc_dump. Then run
smudgeplot.py hetkmers on the dump of kmers to compute the set of kmer pairs.
L=$(smudgeplot.py cutoff kmer_k21.hist L) U=$(smudgeplot.py cutoff kmer_k21.hist U) echo $L $U # these need to be sane values # L should be like 20 - 200 # U should be like 500 - 3000 kmc_tools transform kmer_counts -ci$L -cx$U dump -s kmer_k21.dump smudgeplot.py hetkmers -o kmer_pairs < kmer_k21.dump
now you can finally generate the smudgeplot using the coverages of the identified kmer pairs (
*_coverages.tsv file). You can either supply the haploid kmer coverage (reported by GenomeScope) or let it be estimated directly from the data. Something like this
smudgeplot.py plot kmer_pairs_coverages.tsv
will generate a basic smugeplot. To see all the parameters of
smudgeplot.py plot you can run
smudgeplot.py plot --help.
Smudgeplot generates two plots, one with coloration on a log scale and on a linear scale. The legend indicates approximate kmer pairs per tile densities. Note that a single polymorphism generates multiple heterozygous kmers. As such, the reported numbers do not directly correspond to the number of variants. Instead, the actual number is approximately 1/k times the reported numbers, where k is the kmer size (in summary already recalculated). It's important to note that this process does not exhaustively attempt to find all of the heterozygous kmers from the genome. Instead, only a sufficient sample is obtained in order to identify relative genome structure. You can also report the minimal number of loci that are heterozygous if the inference is correct.
GenomeScope for estimation of L/U
Rscript genomescope.R kmer_k21.hist <k-mer_length> <read_length> <output_dir> [kmer_max] [verbose]
This script estimates the size, heterozygosity, and repetitive fraction of the genome. By inspecting the fitted model you can determine the location of the smallest peak after the error tail. Then, you can decide the low end cutoff below which all kmers will be discarded as errors (cca 1/2 of the haploid kmer coverage), and the high end cutoff above which all kmers will be discarded (cca 8x of the haploid kmer coverage).
Frequently Asked Questions
Are collected on our wiki. Smudgeplot is not much demanding on computational resources, but make sure you check memory requirements before you extract kmer pairs (
hetkmers task). If you won't find an answer for your question in FAQ, open an issue or drop us an email.
Check projects to see how the development goes.
GenomeScope 2.0 and Smudgeplots: Reference-free profiling of polyploid genomes Timothy Rhyker Ranallo-Benavidez, Kamil S. Jaron, Michael C. Schatz bioRxiv 747568; doi: https://doi.org/10.1101/747568
This blogpost by Myles Harrison has largely inspired the visual output of smudgeplots. The colourblind friendly colour theme was suggested by @ggrimes. Grateful for helpful comments of beta testers and pre-release chatters!