Repetitive mollusc genome interpretation #98

mason-linscott · 2022-03-11T16:15:23Z

I have had some troubles understanding my smudgeplot of a highly repetitive genome that has undergone LTR expansion recently. I counted 21-mers from 10x linked reads with the barcode information pruned from the R1 reads uisng KMC and then input that into smudgepairs. Coverage of the 10x data is approximately 60x.

smudgeplot.py plot 21kmc_L15_U2700_coverages.tsv -o "Land_snail" -t "O. idahoensis"

smudgeplot indicates the genome is a diploid which agrees with nQuire estimates on high map quality alignments. However, it also appears there is a substantial signal for triploidy. Is it possible this a signal of the recently expanded LTR elements? Or, is this artifact perhaps due to lower coverage than desired for the genome? I would appreciate hearing your interpretation of this plot.

All the best,
M.

The text was updated successfully, but these errors were encountered:

KamilSJaron · 2022-03-13T22:43:39Z

Hi M.,

10x is not the best for making smudgeplots, the coverage variation is a bit higher compared to other types of libraries (surprisingly even PacBio HiFi is better for kmer spectra analyses). So I think your analysis will be a bit more sensitive to the L parameter you chose. What did you use? And can you show us also a genomescope plot? That helps a lot to decide about a meaningful L.

I am a bit puzzled of the 1n coverage estimate, I am not 100% how smudgeplot ended up on that number... (the labels are not well overlaping with the smudges). But I also supect your L was too low and the bottom part of the smudgeplot is simply error kmers paired with the unique genomic kmers and then the upper smudge is the real diploid one (which would also estimate the 1n coverage a bit closer to the expected 30x, assuming that the 60x is what people call genome coverage, not per-haplotype coverage).

mason-linscott · 2022-03-14T19:25:45Z

Hi Kamil,

Thank you for looking into this! Here is the genomscope plot made using 31-mers (high K chosen due to genome repetitiveness). I was not sure whether to also apply a higher K for smudgeplot and just followed the guide. I hope it helps with your interpretation.

L was set to 15 and U to 2700 based on the output from the smudgeplot.py cutoff. I can raise L on the next smudgepairs run (takes a day and a half).

Thanks again,
Mason

KamilSJaron · 2022-03-18T14:19:17Z

Hi Mason,

that is a giant genome. That must be a huge dataset you are dealing with.

So, the first thing first. Your sequencing is not the cleanest (which is understandable for giant mollusc genomes) - the error peak and genome peak are overlapping. The problem is that you don't have a clean cut between errors and real genomic kmers, which is what we want L to be.

Good news is that the genomescope looks alright (both het and homo zygous peaks are kind of visible and the model relativelly well explains the data). Also, you have very little evidence of polyploidy (assuming your 1n coverage fit is right), because those usually have some of the higher coverage peaks relatively high as well. So the only thing I would suggest to you is to redo the smudgeplot PLOT with a coverage prior from the genomescope model. You can run it as:

smudgeplot.py plot 21kmc_L15_U2700_coverages.tsv -n 18 -o "Land_snail" -t "O. idahoensis"

I am suspecting you will get the peaks annotated right this time (as AB and AABB). I suspect the proportion of AB vs AABB kmers will be a bit different and quite possibly favouring AABB. That practically means that the 2n peak kmers have very often a very similar kmers within the same 2n peak. It is a puzzling signature for genomes with high heterozygosity, but perhaps we should not be bothered by this that much given how much troubles we had separating 1n and error peaks. Perhaps just keep in mind for downstream analyses that you might have quite a lot of paralogy in your genome (and keep even deeper in your mind that there polyploidy was not conclusively rejected in this analysis).

KamilSJaron added smudgeplot_included if smudgeplot was posted with the quesiton / problem genomescope_included labels Apr 13, 2022

KamilSJaron closed this as completed Aug 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repetitive mollusc genome interpretation #98

Repetitive mollusc genome interpretation #98

mason-linscott commented Mar 11, 2022 •

edited

KamilSJaron commented Mar 13, 2022

mason-linscott commented Mar 14, 2022

KamilSJaron commented Mar 18, 2022 •

edited

Repetitive mollusc genome interpretation #98

Repetitive mollusc genome interpretation #98

Comments

mason-linscott commented Mar 11, 2022 • edited

KamilSJaron commented Mar 13, 2022

mason-linscott commented Mar 14, 2022

KamilSJaron commented Mar 18, 2022 • edited

mason-linscott commented Mar 11, 2022 •

edited

KamilSJaron commented Mar 18, 2022 •

edited