Autotetraploid

Autotetraploids

Autotetraploids are polyploids that originated via a whole-genome duplication event. Assuming the lineage has been tetraploid for a few generations, but not too many, each of the four haplotypes carries a more or less random selection of alleles, meaning there won't be any structure to them. This is of course a bit simplistic, because in nature polyploids, regardless of their origin, re-diploidise over time, and their initially perfectly 4-way-symmetrical genome eventually evolves a structure.

Autotetraploid yeast

Let's start with a supposedly tetraploid strain of Saccharomyces [Knaus et al 2018]. Yeasts are amazing for studying ploidy: they come in all sorts of ploidies with all sorts of origins, and they have one more advantage. The genome is so small that it does not take much computation to do everything on your own. We even show how to generate the k-mer spectrum using KMC. Start by getting the data:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_2.fastq.gz

Then use `kmc` to build a k-mer database (`kmer_counts`) and `kmc_tools` to create a k-mer histogram from it.

mkdir tmp logs # create a directory for temporary files
ls SRR3265401_1.fastq.gz SRR3265401_2.fastq.gz > FILES # create a file with both the raw read files
kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmer_counts tmp # run kmc
kmc_tools transform kmer_counts histogram kmer_k21.hist -cx10000

Alternatively you can download the k-mer spectra from Google Drive.
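Before fitting any model, it can help to list where the peaks of the histogram are. The following is a minimal sketch, assuming the two-column (coverage, count) layout of the histogram written above; `find_peaks` is a hypothetical helper, not part of KMC or GenomeScope. It reports every coverage whose count is strictly higher than all of its neighbours within a fixed window, and the defaults for the minimum coverage and the window size are arbitrary choices.

```shell
# find_peaks HIST [MIN_COV] [WINDOW]
# Print each coverage whose k-mer count is strictly higher than all counts
# within WINDOW rows on either side. Rows with coverage below MIN_COV are
# never reported, which hides the error peak at very low coverage.
find_peaks() {
  awk -v min_cov="${2:-5}" -v w="${3:-5}" '
    { cov[NR] = $1 + 0; cnt[NR] = $2 + 0 }
    END {
      # only rows with a full window on both sides are considered
      for (i = w + 1; i <= NR - w; i++) {
        if (cov[i] < min_cov) continue
        is_peak = 1
        for (j = i - w; j <= i + w; j++) {
          if (j != i && cnt[j] >= cnt[i]) { is_peak = 0; break }
        }
        if (is_peak) print cov[i]
      }
    }' "$1"
}

# usage on the histogram generated above, if it is present
if [ -f kmer_k21.hist ]; then
  find_peaks kmer_k21.hist
fi
```

Real histograms are noisy, so a small window can report spurious peaks and a large one can merge neighbouring peaks; treat the output as a hint on where to look in the plot, not as a result.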

Now, let's fit a default tetraploid GenomeScope model and see what happens.

genomescope.R -i kmer_k21.hist -k 21 -p 4 -o . -n Scer_genomescope

strange_genomescope2

TODO: double check generation, might need coverage prior

TODO: add interpretation

Here we took a shortcut: we knew this is an autotetraploid, so we fitted the right model right away. But what if we don't know all that? The two peaks to the left of the main peak are a bit puzzling: are both of them part of the genome? How do we know?

strange_smudgeplot

Looking at the posted smudgeplot: yeah, it's really weird. Besides the AB smudge, there is one more smudge at the L * cov line (the error line). I suspect that L was too low, but I need to investigate a bit to see whether I am right.

The analysis

Alright. We see three peaks, at ~15x, ~30x and ~60x. So the only question is whether the smallest peak is the 1n peak of the genome. I also expect that choosing L = 15 and a high U (3000) will generate a smudgeplot similar to the one posted:

kmc_dump -ci15 -cx3000 kmer_counts kmer_k21_L15.dump
smudgeplot.py hetkmers -o kmer_pairs_L15 < kmer_k21_L15.dump
smudgeplot.py plot -o saccharomyces_L15 -t "yeast" kmer_pairs_L15_coverages.tsv

and indeed

replicated_smudgeplot

Alright, what is going on? The 1n estimate is ~35x (although, looking at the smudge behind the annotation, it seems to be slightly less). Therefore the AA and AB smudges are at (a bit less than) ~70x. Hence they probably correspond to the pairing of k-mers from the 15x peak with k-mers from the 45x peak in the AA case, and to the pairing of k-mers from the 30x peak with each other in the AB case. Interestingly enough, there is no smudge around 30x, where one would expect a smudge made of k-mer pairs within the 1n peak. The absence of pairs within the 1n peak is actually the main reason why smudgeplot got the annotation of the smudges wrong. So perhaps the smudge is real but misannotated, and the ~15x peak is a real genomic peak. So if we fit a tetraploid model with a coverage prior of 16 (it will converge on the best-fitting value)

genomescope.R -i kmer_k21.hist -k 21 -p 4 -o . -l 16 -n Scer_genomescope_p4

then we will get a decent fit. The genome size is approximately right (14 Mbp vs the expected 12 Mbp), and at least the spacing of the individual peaks seems to match the tetraploid model exactly.

strange_genomescope2
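The genome size can also be sanity-checked by hand. This is a back-of-the-envelope sketch, not the model GenomeScope fits: it assumes that every position of the haploid genome is covered by roughly ploidy × 1n-coverage k-mers, so the haploid genome length is approximately the total number of non-error k-mers divided by that product. `haploid_genome_size` is a hypothetical helper, and the default error cutoff of 5x is an assumption that should be adjusted to the trough in your own histogram.

```shell
# haploid_genome_size HIST HAPLOID_COV PLOIDY [MIN_COV]
# Sum coverage * count over all histogram rows with coverage >= MIN_COV
# (rows below it are treated as sequencing errors), then divide by the
# number of k-mers expected per position of the haploid genome.
haploid_genome_size() {
  awk -v n="$2" -v p="$3" -v min_cov="${4:-5}" '
    $1 + 0 >= min_cov { total += $1 * $2 }
    END { printf "%.0f\n", total / (n * p) }' "$1"
}

# usage with a tetraploid model and a 1n coverage of ~17x, if the histogram is present
if [ -f kmer_k21.hist ]; then
  haploid_genome_size kmer_k21.hist 17 4
fi
```

The number it prints is only a coarse estimate to compare with the GenomeScope report; it ignores the overlap between the error peak and the 1n peak, which matters more the lower the 1n coverage is.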

The coverage has converged to ~17x haploid coverage. Let's then redo the smudgeplot while enforcing what we think is the haploid coverage:

smudgeplot.py plot -o saccharomyces_L15_n17 -n 17 -t "yeast" kmer_pairs_L15_coverages.tsv

and there you go...

reannotated_smudgeplot

we get a tetraploid smudgeplot.
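The reasoning behind the reannotation can be checked with a few lines of arithmetic. The sketch below is only an illustration, not smudgeplot's actual code: `annotate_pair` is a hypothetical helper that takes the coverages of the two k-mers in a pair plus an assumed 1n coverage, and reports the summed coverage, the share contributed by the less covered k-mer, and the summed coverage rounded to a whole number of genome copies.

```shell
# annotate_pair COV_MINOR COV_MAJOR HAPLOID_COV
annotate_pair() {
  awk -v a="$1" -v b="$2" -v n="$3" 'BEGIN {
    total  = a + b                  # summed coverage of the k-mer pair
    ratio  = a / total              # share of the less covered k-mer
    copies = int(total / n + 0.5)   # summed coverage in units of 1n, rounded
    printf "sum=%dx ratio=%.2f copies=%d\n", total, ratio, copies
  }'
}

annotate_pair 15 45 35   # 1n = 35x -> sum=60x ratio=0.25 copies=2
annotate_pair 15 45 17   # 1n = 17x -> sum=60x ratio=0.25 copies=4
annotate_pair 30 30 17   # 1n = 17x -> sum=60x ratio=0.50 copies=4
```

With a 1n coverage of 35x, a pair summing to 60x rounds to two genome copies, but a share of 0.25 does not fit any two-copy genotype, which is why the first smudgeplot looked so odd. With 17x, both kinds of pairs sum to four copies, with shares of 0.25 and 0.50, which is what AAAB and AABB smudges of a tetraploid look like.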
