Skip to content

manual of smudgeplot with jellyfish

Kamil S. Jaron edited this page Jan 14, 2020 · 8 revisions

This is a manual using smudgeplot v0.2.1

Previous versions:


jellyfish is an analogous program to KMC. KMC is faster and more versatile tool, but a basic smudgeplot of a small genome Jellyfish will do the job just fine. There are two possible algorithms to extract kmer pairs (the task called hetkmers), all and middle. The default algorithm all extracts kmer pairs that differ by a single nucleotide on any position (we consider all positions, therefore all). The way faster alternative extracts only kmers that differ by the middle nucleotide (therefore middle) and requires the dump file on the input to be sorted. You can sort the dump file using for instance bash sort, but I would discourage you to do that as this process is not optimized at all for nucleotide sequences, use KMC instead. Therefore here we show usage of smudgeplot with all algorithm.

  1. Give jellyfish all the sequencing read files to calculate kmer frequencies and then generate a histogram of kmers
jellyfish count -C -m 21 -s 1000000000 -t 8 *.fastq -o kmer_counts.jf
jellyfish histo kmer_counts.jf > kmer_k21.hist
  1. Extract genomic kmers using reasonable coverage thresholds. You can either inspect the kmer spectra and choose the L (lower) and U (upper) coverage thresholds via visual inspection, or you can estimate them using the script smudgeplot.py cutoff <kmer.hist> <L/U>. Then, extract kmers in the coverage range from L to U using Jellyfish dump and pipe them directly to smudgeplot.py hetkmers to compute the set of kmer pairs.
L=$(smudgeplot.py cutoff kmer_k21.hist L)
U=$(smudgeplot.py cutoff kmer_k21.hist U)
echo $L $U # these need to be sane values like 30 800 or so
jellyfish dump -c -L $L -U $U kmer_counts.jf | smudgeplot.py hetkmers -o kmer_pairs
# note that if you would like use --middle flag, you would have to sort the jellyfish dump first
  1. After generating the list of kmer pair coverages, generate the smudgeplot using the coverages of the kmer pairs (*_coverages.tsv file). You can either supply the haploid kmer coverage (reported by GenomeScope) or let it be estimated directly from the data and compare it afterwards. If GenomeScope correctly identifies the peak of haploid kmers, the expected positions of the haplotype structures will overlap with high-density smudges on the smudgeplot. If the overlap is not great you might consider adjusting the GenomeScope model and redoing the plot with a better estimate of the haploid coverage. Something like
smudgeplot.py plot kmer_pairs_coverages_2.tsv -o my_genome

will generate a basic smugdeplot, the full usage is described in the help of smudgeplot smudgeplot.py plot -h.