Repetitive mollusc genome interpretation #98

Closed
mason-linscott opened this issue Mar 11, 2022 · 3 comments
Labels: genomescope_included, smudgeplot_included

Comments

@mason-linscott
mason-linscott commented Mar 11, 2022

I have had some trouble understanding my smudgeplot of a highly repetitive genome that has recently undergone LTR expansion. I counted 21-mers from 10x linked reads (with the barcode information pruned from the R1 reads) using KMC and then fed the result into smudgepairs. Coverage of the 10x data is approximately 60x.

smudgeplot.py plot 21kmc_L15_U2700_coverages.tsv -o "Land_snail" -t "O. idahoensis" 

smudgeplot indicates the genome is diploid, which agrees with nQuire estimates from high-mapping-quality alignments. However, there also appears to be a substantial signal for triploidy. Is it possible this is a signal of the recently expanded LTR elements? Or is it perhaps an artifact of lower coverage than desired for this genome? I would appreciate hearing your interpretation of this plot.

All the best,
M.
[Image: Land_snail_smudgeplot]

@KamilSJaron

Hi M.,

10x is not the best for making smudgeplots; the coverage variation is a bit higher than in other types of libraries (surprisingly, even PacBio HiFi is better for kmer spectra analyses). So I think your analysis will be a bit more sensitive to the L parameter you chose. What did you use? And can you also show us a genomescope plot? That helps a lot in deciding on a meaningful L.

I am a bit puzzled by the 1n coverage estimate; I am not 100% sure how smudgeplot ended up at that number (the labels do not overlap well with the smudges). But I also suspect your L was too low, so the bottom part of the smudgeplot is simply error kmers paired with unique genomic kmers, and the upper smudge is the real diploid one (which would also bring the 1n coverage estimate a bit closer to the expected 30x, assuming that the 60x is what people call genome coverage, not per-haplotype coverage).
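To make the "expected 30x" arithmetic concrete, here is a minimal sketch of how a genome-wide read coverage translates into a per-haplotype (1n) kmer coverage. It uses the standard relation for error-free reads, kmer coverage = read coverage × (read length − k + 1) / read length. The 150 bp read length is an assumption (it is not stated in this thread), and `kmer_coverage` is just an illustrative helper, not part of smudgeplot.

```python
def kmer_coverage(read_coverage: float, read_length: int, k: int) -> float:
    """Expected kmer coverage for error-free reads of a fixed length."""
    if k > read_length:
        raise ValueError("k must not exceed the read length")
    return read_coverage * (read_length - k + 1) / read_length


genome_coverage = 60.0  # read coverage quoted in the thread
ploidy = 2              # diploid, as smudgeplot and nQuire suggest
read_length = 150       # ASSUMPTION: read length is not given in the thread
k = 21                  # kmer size used for the smudgeplot

# Read coverage of a single haplotype.
per_haplotype = genome_coverage / ploidy
# Expected 1n kmer coverage, before accounting for sequencing errors.
expected_1n = kmer_coverage(per_haplotype, read_length, k)

print(f"per-haplotype read coverage: {per_haplotype:.0f}x")
print(f"expected 1n kmer coverage:   {expected_1n:.1f}x")
```

Under these assumptions the 1n kmer coverage comes out somewhat below 30x, and sequencing errors push the observed value lower still, so a fitted 1n peak well under 30x would not be surprising.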

@mason-linscott

Hi Kamil,

Thank you for looking into this! Here is the genomescope plot made using 31-mers (a high k chosen because of the genome's repetitiveness). I was not sure whether to also use a higher k for smudgeplot, so I just followed the guide. I hope it helps with your interpretation.

[Image: G3_sup_fig_1_genomscope]

L was set to 15 and U to 2700, based on the output of smudgeplot.py cutoff. I can raise L on the next smudgepairs run (it takes a day and a half).

Thanks again,
Mason

@KamilSJaron

KamilSJaron commented Mar 18, 2022

Hi Mason,

that is a giant genome. That must be a huge dataset you are dealing with.

So, first things first. Your sequencing is not the cleanest (which is understandable for giant mollusc genomes): the error peak and the genome peak overlap. The problem is that you don't have a clean cut between errors and real genomic kmers, which is what we want L to be.
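As an illustration of what a "clean cut" means, the sketch below looks for the first valley in a kmer histogram, i.e. the coverage at which counts stop falling (error kmers) and start rising again (genomic kmers). This is a simple heuristic written for this example, not the method used by smudgeplot.py cutoff, and both toy histograms are made up.

```python
def valley_cutoff(hist):
    """Return the coverage of the first local minimum after the error peak.

    hist is a list of (coverage, count) pairs sorted by coverage, like a
    two-column kmer histogram. Returns None when counts never rise again,
    meaning no valley separates errors from genomic kmers.
    """
    for i in range(1, len(hist) - 1):
        prev_count = hist[i - 1][1]
        cov, count = hist[i]
        next_count = hist[i + 1][1]
        if count <= prev_count and count < next_count:
            return cov
    return None


# Toy data (invented): error peak and genome peak are well separated.
clean = [(1, 1000), (2, 400), (3, 150), (4, 60), (5, 40),
         (6, 55), (7, 90), (8, 140), (9, 180), (10, 160)]

# Toy data (invented): the error tail runs straight into the genome peak.
overlapping = [(1, 1000), (2, 700), (3, 500), (4, 400),
               (5, 330), (6, 280), (7, 240), (8, 200)]

print(valley_cutoff(clean))
print(valley_cutoff(overlapping))
```

In the first histogram a sensible L sits at the valley; in the second there is no valley at all, so any L is a compromise between keeping error kmers and discarding real heterozygous ones. On real, noisy histograms one would want to smooth the counts before looking for a minimum.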

The good news is that the genomescope fit looks alright (both the heterozygous and homozygous peaks are more or less visible, and the model explains the data relatively well). Also, you have very little evidence of polyploidy (assuming your 1n coverage fit is right), because polyploids usually have some of the higher-coverage peaks relatively high as well. So the only thing I would suggest is to redo just the smudgeplot plot step with a coverage prior from the genomescope model. You can run it as:

smudgeplot.py plot 21kmc_L15_U2700_coverages.tsv -n 18 -o "Land_snail" -t "O. idahoensis" 

I suspect you will get the peaks annotated correctly this time (as AB and AABB). I also suspect the proportion of AB vs AABB kmers will be a bit different, quite possibly favouring AABB. In practice, that means that kmers in the 2n peak very often have a very similar kmer within the same 2n peak. It is a puzzling signature for a genome with high heterozygosity, but perhaps we should not be bothered by it too much, given how much trouble we had separating the 1n and error peaks. Perhaps just keep in mind for downstream analyses that you might have quite a lot of paralogy in your genome (and keep even deeper in mind that polyploidy was not conclusively rejected in this analysis).
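To show why the 1n coverage prior changes the annotation, here is a rough sketch of how a kmer pair could be assigned to a smudge. Each pair is placed by its total coverage (minor + major) and its minor fraction (minor / total); a smudge of ploidy p with m minor copies is expected at total coverage p × n and fraction m / p. The nearest-centre rule and the function names are assumptions made for this example, not smudgeplot's actual implementation.

```python
def expected_smudges(n, max_ploidy=4):
    """Map genotype label -> (ploidy, expected minor fraction) for 1n coverage n."""
    smudges = {}
    for ploidy in range(2, max_ploidy + 1):
        for minor in range(1, ploidy // 2 + 1):
            label = "A" * (ploidy - minor) + "B" * minor
            smudges[label] = (ploidy, minor / ploidy)
    return smudges


def nearest_smudge(cov_minor, cov_major, n, max_ploidy=4):
    """Label a kmer pair with the closest expected smudge centre."""
    total = cov_minor + cov_major
    frac = cov_minor / total

    def distance(item):
        ploidy, exp_frac = item[1]
        # Compare total coverage in units of n so both axes are of order 1.
        return (total / n - ploidy) ** 2 + (frac - exp_frac) ** 2

    return min(expected_smudges(n, max_ploidy).items(), key=distance)[0]


# With the 1n coverage set to 18 (the value passed via -n above):
print(nearest_smudge(17, 19, n=18))  # pair near total 36, fraction 1/2
print(nearest_smudge(35, 37, n=18))  # pair near total 72, fraction 1/2

# The same second pair under a hypothetical 1n estimate twice as high:
print(nearest_smudge(35, 37, n=36))
```

The same pair of kmers is labelled AABB when n is 18 but AB when n is taken to be 36, which is why fixing the prior from the genomescope fit matters more than the raw shape of the plot.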

KamilSJaron added the smudgeplot_included and genomescope_included labels on Apr 13, 2022