Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dnaPipeTE estimates #22

Closed
3 tasks done
KamilSJaron opened this issue Aug 19, 2019 · 46 comments
Closed
3 tasks done

dnaPipeTE estimates #22

KamilSJaron opened this issue Aug 19, 2019 · 46 comments

Comments

@KamilSJaron
Copy link
Owner

KamilSJaron commented Aug 19, 2019

  • A. ricciae The problem:

Something is also wrong with the TE level for A. ricciae in figure 4 and main text, which is much lower than the other bdelloid rotifers – but they show similar levels of TE content in the analysis of Nowell et al. (2018). It looks the numbers are over-estimated in the other 3 species compared to that paper. Combined with the above, perhaps there is a problem with the A. ricciae analysis.

  • A. melifera estimated with dnaPipeTE 1.3.1 leads to pracically no TEs. Is honey bee genome known to be TE-less?

  • M. belhari estimates I got have 0's everywhere

@KamilSJaron KamilSJaron mentioned this issue Aug 20, 2019
24 tasks
@KamilSJaron KamilSJaron changed the title TE load estimate in A. ricciae dnaPipeTE estimates Aug 22, 2019
@KamilSJaron
Copy link
Owner Author

This is how we compute % of annotated TEs from Counts.txt files (here it's in a script):

    TEs <- read.table(TE_file)

    genome_tab[row,'TEs'] <- round(sum(TEs[1:6,'V2']) / TEs[nrow(TEs),'V2'], 3) * 100
    genome_tab[row,'other_repeats'] <- round(sum(TEs[7:12,'V2']) / TEs[nrow(TEs),'V2'], 3) * 100
    genome_tab[row,'all_repeats'] <- round(sum(TEs[-nrow(TEs),'V2']) / TEs[nrow(TEs),'V2'], 3) * 100

i.e. the initial sampling is based on estimated genome size, but the % of TEs is estimated as "proportion of reads annotated as TEs", hence it's ploid/genome size agnostic.

@KamilSJaron
Copy link
Owner Author

KamilSJaron commented Aug 22, 2019

Re: Apis melifera: https://www.nature.com/articles/nature05260 Yes, it seems that there are 6 described transposons in Apis genome (< than 0.5% of the genome). Also verified with newer Apis melifera assembly: https://www.ncbi.nlm.nih.gov/assembly/GCA_003254395.2 (0.47%). And finally "Most of the groups of retrotransposable elements were detected in the genome of the honey bee. In comparison to many other organisms, the most striking difference is the extremely low diversity and abundance of these elements." from https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-86
--> Apis is expected not to have any transposons.

As a sanity check I am rerunning some of the previous estimates (Anan1);

I am also recalculating again A. ricciae with 1.3.1, P. fallax and M. belhari.

(follow up on https://docs.google.com/document/d/1a8ER3bvDBn4wpehsAUN-s3QIEUwGSqlEYs_2ksWOkNA/edit?disco=AAAADaT02Ig)

@jensbast
Copy link
Collaborator

This is how we compute % of annotated TEs from Counts.txt files (here it's in a script):

    TEs <- read.table(TE_file)

    genome_tab[row,'TEs'] <- round(sum(TEs[1:6,'V2']) / TEs[nrow(TEs),'V2'], 3) * 100
    genome_tab[row,'other_repeats'] <- round(sum(TEs[7:12,'V2']) / TEs[nrow(TEs),'V2'], 3) * 100
    genome_tab[row,'all_repeats'] <- round(sum(TEs[-nrow(TEs),'V2']) / TEs[nrow(TEs),'V2'], 3) * 100

i.e. the initial sampling is based on estimated genome size, but the % of TEs is estimated as "proportion of reads annotated as TEs", hence it's ploid/genome size agnostic.

Yes, the read percentage TE caclulation is genome size independent.
But the previous TE identification and assembly stage is dependent on a coverage estimate, as it subsamples overrespresented reads (>1x coverage of genome)

@KamilSJaron
Copy link
Owner Author

Patrick managed to run dnaPipeTE v1.2 that produces consistent estimates with the previous version. Thank you, Patrick!

We got estimates for all three honey bees and Mbel, now rerunning A ricciae. If we get similar estimates we need to discuss the fragility of these results.

@KamilSJaron
Copy link
Owner Author

Sorry folks for the delay.

I updated the table with newer estimates of TEs using dnaPipeTE. The difference now is that I have used the "diploid" genomescope model to specify the genome size. The A. ricciae is still the same as before (0.7%), but A. vaga is now more similar (2.2%). I don't think there is anything we could do besides disclaiming that identification of TEs might be dependent on their frequency in the genome and phylogenetic distance to known TEs which is possibly causing this approach to identify the low copy-number TEs that are found in the assembly and previous literature.

We also have there three new honey bees. The third one is a really strange sample too. The estimated genome size is substantially smaller than in the case of the other two bee strains. It does not seem like inference problem, it looks like there is something weird with the genome. I mean, we might remove this value for TEs.

@KamilSJaron
Copy link
Owner Author

We could possibly include yet one more supplementary section on this. Or perhaps add it to the "Rotifer genome structure". What do you think @reubwn @jensbast ?

@reubwn
Copy link
Collaborator

reubwn commented Sep 27, 2019

Hi Kamil,

Sorry folks for the delay

No problem man!

I updated the table with newer estimates of TEs using dnaPipeTE.

Where is the table? in the Gdoc or somewhere on Github? Would be interesting to see the details. Also what was the Genomescope estimates for diploid genome size for the bdelloids? I have a feeling that this is at the root of this funny differences between Ar/Av... caused by something weird going on in the A. ricciae data, most likely

The A. ricciae is still the same as before (0.7%), but A. vaga is now more similar (2.2%).

Apologies if I'm missing something obvious, but what is the denominator in this calculation? a few comments up you say it's "proportion of reads annotated as TE", but in the ms it seems to be related to assembly span.

I don't think there is anything we could do besides disclaiming that identification of TEs might be dependent on their frequency in the genome and phylogenetic distance to known TEs which is possibly causing this approach to identify the low copy-number TEs that are found in the assembly and previous literature.

I'm sure this is true, but then it is likely to be true for many of these species, not just the rotifers: Repbase is totally biased towards a select few model species (and is now behind a paywall.. 😮). I guess A. vaga is a bit of an outlier in that a huge effort has gone into manual curation of TEs found there - even A. ricciae, in the same genus (kinda), will have a whole bunch of stuff unique to it, but this doesn't explain why the discrepancy relative to the other Rotaria species.

What do the raw TE counts look like, for your re-analyses? I can compare them to the data I have and see if anything jumps out.

@reubwn
Copy link
Collaborator

reubwn commented Sep 27, 2019

Also, @jensbast, I see now you asked a question on the comment that I only just saw - sorry!

which library for annotation did you use in the DNApipeTE?

I used a custom TE library from A. vaga (I guess this is what you worked on in 2013?), from Irina Arkhipova, combined with all known protostome TEs pulled from Repbase. Apparently, the Av TE library only partially made it into Repbase (the DNA TEs where never submitted). This could be a partial reason for the difference between Av and Ar, but again doesn't really explain why the Rotaria species are also quite high. I don't think that's the root cause here.

We could possibly include yet one more supplementary section on this. Or perhaps add it to the "Rotifer genome structure".

Perhaps... it's still a puzzle though. It's the difference between Ar and Av (.7 vs 2.2%) that bothers me the most. Maybe a supplementary that basically says "bdelloids are weird" will do the trick 😄

@KamilSJaron
Copy link
Owner Author

Here is the table:
https://github.com/KamilSJaron/genomic-features-of-asexual-animals/blob/master/tables/genome_table.tsv

Within the table you also have the latest (diploid) genomescope estimates of haploid size (and if you would browse history you could dig out the previous models).

@KamilSJaron
Copy link
Owner Author

Apologies if I'm missing something obvious, but what is the denominator in this calculation? a few comments up you say it's "proportion of reads annotated as TE", but in the ms it seems to be related to assembly span.

dnaPipeTE outputs reads that are TEs and reads that are not. Therefore you divide by the total number of sampled reads.

@reubwn
Copy link
Collaborator

reubwn commented Sep 27, 2019

is that what is on the x-axis of Fig. 4?

@KamilSJaron
Copy link
Owner Author

KamilSJaron commented Sep 27, 2019

Yes.

This is a Counts.txt output from dnaPipeTE

LTR	93195
LINE	81783
SINE	153
DNA	88499
MITE	0
Helitron	24130
rRNA	117945
Low_Complexity	0
Satellite	2948
Tandem_repeats	0
Simple_repeat	2499
others	7649
na	1838261
Others	0
Total	43312092

I process this file as described in this previous post: #22 (comment)

And that's x-axis on the Figure 4.

--- edit ---

This is Aric1 Counts.txt file.

@jensbast
Copy link
Collaborator

Hi both,
sorry, it takes longer to reply, I am on holidays until 7th of October.

I used a custom TE library from A. vaga (I guess this is what you worked on in 2013?), from Irina Arkhipova, combined with all known protostome TEs pulled from Repbase. Apparently, the Av TE library only partially made it into Repbase (the DNA TEs where never submitted). This could be a partial reason for the difference between Av and Ar, but again doesn't really explain why the Rotaria species are also quite high. I don't think that's the root cause here.

We exclusively based all TE identification on the DNApipeTE standard database, which is the repbase one (Yes, it is really annoying that it is behind a paywall now).
Because we did want to treat any species TE estimate the same, so we added never additional species-specific TEs to the library.

We could possibly include yet one more supplementary section on this. Or perhaps add it to the "Rotifer genome structure".

I would not. We can just say "rotifers are weird and say that these estimates depend on genome structure and so on, which is tricky (What Reuben said).

Also sorry that I reads like we use the assembly span. I have to recheck, this is not true.

@KamilSJaron
Copy link
Owner Author

during a meeting with Marc and Tanja we agreed that the best would be just to explain, why we think there is this difference between our and known estimates in Adinita.

@KamilSJaron
Copy link
Owner Author

Hey, I updated the figure 4 in the google docs so I belive that everything is ready for finally fixing this paragraph. Is there a volunteer who will take it?

@reubwn
Copy link
Collaborator

reubwn commented Sep 30, 2019

Hi Kamil, I had a busy day today (we are moving the lab to Oxford!) but tomorrow I will have a bit of time to look at this - still a bit unsure about the dnaPipeTE estimates but I'll have a dig around tomorrow

@jensbast
Copy link
Collaborator

jensbast commented Sep 30, 2019 via email

@KamilSJaron
Copy link
Owner Author

Hi Kamil, I had a busy day today (we are moving the lab to Oxford!) but tomorrow I will have a bit of time to look at this - still a bit unsure about the dnaPipeTE estimates but I'll have a dig around tomorrow

I understand, but I am not sure what else we can do about it. I am open to suggestions though :-)

@reubwn
Copy link
Collaborator

reubwn commented Oct 9, 2019

Hi guys,

Sorry for the delay. Here are a couple of thoughts!

the % of TEs is estimated as "proportion of reads annotated as TEs", hence it's ploid/genome size agnostic

These values in the Counts.txt file... are you sure it's the number of sampled reads, and not the span? In the documentation for dnaPipeTE, it is stated the file contains:

count of bp of the sample aligned for each TE class (used for the pieChart)

Which would suggest it is a sum of base pairs attributed to each TE class, rather than number of reads. I wasn't sure so I had another look at the relevant snippet of code in dnaPipeTE.py. It seems to take the 4th column of the reads_vs_annoted.blast.out file, which is the length of the blast hit (it varies from 1 to whatever is the read length of the fastq files, which makes sense). The "length" in BLAST output is alignment length of the HSP, and so a more accurate computation of total length attributed to each TE class should really delete any gaps that are present, but anyway that probably doesn't make a huge difference..

Then the last row ("Total") is computed as the total span of the fasta file at renamed.blasting_reads.fasta (these lines in the program). I checked this, and it was indeed the number of bases in that file, not the number of reads. So unless I'm missing something, the values in Counts.txt are bases, not reads? Which makes the X-axis on Fig. 4 something like "proportion of bases annotated as TEs, out of total bases sampled", I think.

Thinking about it, it kinda shouldn't make a big difference to the overall results (i.e. the %'s you compute from Counts file) because the number of sampled bases should be proportional to the number of sampled reads (I think? unless crazy things like biases in read trimming might affect that), for given a genome size and coverage thresholds.

As Jens already pointed out, probably more important is the effect of the -genome_size parameter on how dnaPipeTE does the subsampling: if the inputted genome size is way off, then dnaPipeTE will miscalculate the number of reads (or bases) needed to hit the specified coverage, which might in turn affect the ratio of bases attributed to the different TE classes versus the "Total" bases sampled (especially if real coverage is badly undersampled).

@reubwn
Copy link
Collaborator

reubwn commented Oct 9, 2019

Related to this point, I found an old Counts.txt file for A. ricciae, using only metazoan repeats from Repbase to annotate (i.e. w/o anything bdelloid-specific from Irina):

LTR	426319
LINE	750521
SINE	2081
DNA	821852
MITE	0
Helitron	490889
rRNA	273500
Low_Complexity	245246
Satellite	831
Tandem_repeats	0
Simple_repeat	1250267
others	22080
na	11107597
Others	0
Total	86348365

This gives the results (from R script above):

Type Proportion
TEs 2.40%
other_repeats 2.52%
all_repeats 17.92%

Thus the TE estimate for Ar is similar with your recent result for A. vaga, at ~2/3%.

In this case, I simply used assembly span, ~175 Mb, for the -genome_size parameter. You used the haploid assembly length from the diploid Genomescope model, which was 87.5 Mb, right? I wonder, if you put -genome_size 175000000 in your dnaPipeTE pipeline, what are the results? In fact do we know anything about the robustness of dnaPipeTE results depending on the specified genome size and/or coverage?

I am basically wondering if the extreme peculiarity of the A. ricciae genome is throwing off the haploid genome size from Genomescope, which in turn is affecting the sampling regime in dnaPipeTE, which is leading to an underestimate of TE % in the Ar data.

But I agree with what you said above Kamil: the best way forward would be to simply explain that somewhere in the ms, without changing Fig 4, which should remain as it is because it is consistent across the taxa. But better if we can have some evidence that we know Ar is more fragile to the -genome_size parameter, and maybe why.

@KamilSJaron
Copy link
Owner Author

I will read this in detail soon.

For now:

  1. You are certainly right on the Count.txt. Bases are far more logical than reads. The point I wanted to insist on was that it's a relative measure "TE bases" vs "Total bases", therefore the x axis is not that dependent on the genome size - only for the sampling purposes
  2. rerunning Ar with different genome sizes. Sounds good. I could try to reproduce the 2.4% run you did earlier.

Talk to you latter.

@jensbast
Copy link
Collaborator

jensbast commented Oct 10, 2019

I might have mixed up what counts mean for the DNApipeTE because I worked with PopoolationTE2 counts files as well (and other measurements of counts).
I think you are right, it seems to be number of bases, sorry for the confusion.

I agree on rerunning with different genome sizes, but conceptually this is a bit tricky to explain, because for the other analyses we always used the genome size from kmer estimates (but maybe we can explain why it is different for Ar).

Sorry, I am not at work this week, so I am a bit slow to respond and check stuff.

@reubwn
Copy link
Collaborator

reubwn commented Oct 11, 2019

No problem. Yes bases seemed more logical but I'm glad I looked at it because I was never exactly sure what was in the Counts.txt file.

I agree on rerunning with different genome sizes, but conceptually this is a bit tricky to explain, because for the other analyses we always used the genome size from kmer estimates (but maybe we can explain why it is different for Ar).

I'll set this analysis running now, varying -genome_size and also maybe -coverage for Ar to see what happens. I agree, I think we should still stick with the genome-size-from-kmer approach to maintain consistency across the different taxa. Depending on what we see, the Ar re-analysis would probably only be used to back up / explain why the TE estimate for Ar is low (but, it might become problematic if we essentially have to admit that the kmer approach doesn't really work for Ar).

therefore the x axis is not that dependent on the genome size - only for the sampling purposes

I guess the point is that perhaps the dnaPipeTE approach might be more sensitive to this than I realised, if there is a correlation between -genome_size and the final TE results... if that is the case, then it might be hard to argue the method is independent of genome size, for this species at least. Anyway, let's see what happens with the Ar re-analysis first.

PS I'm on holiday next week, but back the week after. It will take a few days to re-run dnaPipe anyway.
Cheers! -R

@KamilSJaron
Copy link
Owner Author

Thinking more about it, it's wise to do the various `-genome_size. So, did it work @reubwn ? I don't mean to pressure you, just if dnaPipeTE would misbehave (again?) I could also try to rerun it on our cluster (I am waiting for P. fallax results anyway).

@reubwn
Copy link
Collaborator

reubwn commented Oct 24, 2019

Hey man, indeed it has finished, summary results below and also in full here in Google Sheet.

I ran 3 genome sizes (87.5,175,262.5 Mb) each with 3 coverages (0.3,0.5,0.7). I based the genome sizes on the idea that 175 Mb (the assembly span) represents at least two subgenomes, so "true" genome size might vary in steps of ~87.5 Mb.

The results show that TE "amount" (expressed either as absolute count or % total sampled bp) does systematically vary both across -coverage values within a given genome size, and across -genome_size for a given coverage value. Basically, the trend is that the more bp you sample, the more TE you get back...

That's a bit worrying, I think? i.e. that the result should be so dependent on the specified input genome_size/coverage parameters. Hmmmmm... so it seems:

  1. It is important to get the genome size right. OK, fair enough, I can see why this would be important, but I guess that would mean the method is really dependent on an accurate genome size
  2. More weird is that -coverage would also make a difference, within each genome size class. I would expect increased coverage would change the absolute values, but here there is a trend in the % graph too: as I increase coverage, I find proportionally more and more TEs in the sampled nucleotides. Moreover the increase seems to get bigger and bigger: for -genome_size 175e6, the difference between 0.3 and 0.5 is 1.02%, but between 0.5 and 0.7 it is 1.11% (well, there are only 3 data points so maybe that's pushing it too far, but still the trend is consistent and I would have thought it would go the other way round if anything, that it would approach an asymptote).

What do you guys think? Maybe this is just for Ar, I haven't looked at any other species.

TEs absolute count
TEs as % total bp sampled

-genome_size -coverage TE count proportion
87.5 0.3 TEs 237951 0.92%
87.5 0.3 other_repeats 659476 2.54%
87.5 0.3 all_repeats 897427 3.45%
87.5 0.5 TEs 503153 1.16%
87.5 0.5 other_repeats 1796103 4.15%
87.5 0.5 all_repeats 2299256 5.31%
87.5 0.7 TEs 885489 1.46%
87.5 0.7 other_repeats 4324112 7.13%
87.5 0.7 all_repeats 5209601 8.59%
         
175 0.3 TEs 623262 1.20%
175 0.3 other_repeats 2817586 5.42%
175 0.3 all_repeats 3440848 6.62%
175 0.5 TEs 1921429 2.22%
175 0.5 other_repeats 13589635 15.69%
175 0.5 all_repeats 15511064 17.91%
175 0.7 TEs 4041685 3.33%
175 0.7 other_repeats 40840766 33.69%
175 0.7 all_repeats 44882451 37.02%
         
262.5 0.3 TEs 1436031 1.84%
262.5 0.3 other_repeats 9603651 12.32%
262.5 0.3 all_repeats 11039682 14.16%
262.5 0.5 TEs 4774248 3.68%
262.5 0.5 other_repeats 50279818 38.71%
262.5 0.5 all_repeats 55054066 42.38%
262.5 0.7 TEs 12489114 6.87%
262.5 0.7 other_repeats 112178283 61.68%
262.5 0.7 all_repeats 124667397 68.55%

@KamilSJaron
Copy link
Owner Author

Phew, that's really worrying... It's seems that it's simply a function of sampled nucleotides, look at this plot...

sampled_vs_annoted_Aric1

I wonder if this is something related to the low frequency of individual TEs. And I wonder if it would get saturated. If we set genome size to 262.5Mbp and genome sampling to 3, 4, 5 or something that crazy.

With other genomes we know that our estimates correlated with the published numbers (although they are bit higher), hence I don't think all dnaPipeTE estimates are bs...

@reubwn
Copy link
Collaborator

reubwn commented Oct 24, 2019

Phew, that's really worrying

Yup.

I wonder if this is something related to the low frequency of individual TEs.

I have wondered about that. But, I can't see how low copy number of genuine TEs (as is the case in A. vaga anyway) would lead to this correlation? I would have expected it to lead to systematic underrepresentation, maybe, but the correlation is weird.

I also wondered if the ploidy is somehow messing with dnaPipeTE, e.g., if just normal tetraploid genes are being flagged as TEs somehow. It might be something like this, but the plots are from TEs annotated as LINES, LTRs etc - so they must have some basis of similarity to a TE in Repbase (I used Metazoa TEs as the -RM_lib).

And I wonder if it would get saturated.

Doesn't look that way from the shape of the graphs! If anything it shows an exponential increase in TE discovery...

If we set genome size to 262.5Mbp and genome sampling to 3, 4, 5 or something that crazy.

I can set this running, will takes ~3 days to run a full pipeline x3 (maybe more if coverage is so high).

With other genomes we know that our estimates correlated with the published numbers

That is good to know. But what is correlated exactly? The absolute counts compared to RepeatMasker estimates, or proportions?

I can run the same experiment on a non-bdelloid, to test if just A. ricciae is totally whack - which would be the best one do you think?

@KamilSJaron
Copy link
Owner Author

I have wondered about that. But, I can't see how low copy number of genuine TEs (as is the case in A. vaga anyway) would lead to this correlation? I would have expected it to lead to systematic underrepresentation, maybe, but the correlation is weird.

dnaPipeTE subsamples reads and assembles it with Trinity. If there is one TE per family in the genome, you would increase fraction of "assemblable" TEs by increasing coverage. (unlike in the case of abbundant TEs in the genome, when 0.5x should be sufficient to sample enough coverage to assemble them). At least this is how I understand dnaPipeTE and this is why I expect it to get saturated as some point.

I do agree that these plots do not suggest that there will be any sarturation, but there should be.

That is good to know. But what is correlated exactly? The absolute counts compared to RepeatMasker estimates, or proportions?

That depends on studies. We correlated what was in papers with what we have measured. We did not check how exactly they estimated TEs in every study (because that varied a lot). It was just a sanity check.

I can run the same experiment on a non-bdelloid, to test if just A. ricciae is totally whack - which would be the best one do you think?

I hoped to wrap it up with minimum number of computing hours, but it sounds like a reasonable thing to do. What about Leptopilina clavipes? Relatively small genome, sane TE content but not too small.

@reubwn
Copy link
Collaborator

reubwn commented Oct 25, 2019

dnaPipeTE subsamples reads and assembles it with Trinity. If there is one TE per family in the genome, you would increase fraction of "assemblable" TEs by increasing coverage. (unlike in the case of abbundant TEs in the genome, when 0.5x should be sufficient to sample enough coverage to assemble them). At least this is how I understand dnaPipeTE and this is why I expect it to get saturated as some point.

Yes I see your point. But, then we should expect a linear increase with increased coverage, but instead we see an exponential increase. I wonder, is there a selection step in dnaPipeTE pipeline, where only reads above a certain abundance threshold are carried forward to assembly stage? If yes, then single-copy TEs might be "lost" because they never occur in the high-abundance selected reads. I looked again at the paper but it was unclear. Maybe tetra/polyploidy makes a difference too, because it will raise the genome-average abundance threshold even higher (i.e. even 'normal' genes are in 4x).

Below are the 3 graphs for "Bases_per_component" for -genome_size 175e6 and -coverage {0.3,0.5,0.7}. You can see that whatever the program is doing, it designates bigger proportion of the genome as "repeats" (dark pink) and especially "rep < 0.01 %" (light pink) as coverage increases. Hmm, not sure what exactly the light pink represents, but something is strange here...?

What about Leptopilina clavipes?

Is there an easy way you can share the QC reads? I will set the program running on this, using same RM library etc. Let's hope it's just Ar that is the weird one.

dnapipete_vAr_175_cov0 3 Bases_per_components
dnapipete_vAr_175_cov0 5 Bases_per_components
dnapipete_vAr_175_cov0 7 Bases_per_components

@KamilSJaron
Copy link
Owner Author

Unfortunately, I can not share them on the ftp anymore (my data were clansed and I archived only non-easily-recomputable bits).

However, dl and clenaing should be rather straightforward.
Reads are SRR7028347 (wanted to look up fastq files in EBI, but the ftp seems to be down now).
And the trimming is kind of basic Skewer:

skewer -z -m pe -n -q 26 -l 21 -t 8 -o $OUT_NAME $R1 $R2

@KamilSJaron
Copy link
Owner Author

I am not sure if I understand right the graphs, but maybe this is already good enough to open a ticket at their GitHub. This behaviour really needs some explanation.

@reubwn
Copy link
Collaborator

reubwn commented Oct 25, 2019

I am not sure if I understand right the graphs

Me neither, but all I see is that the pink bit gets bigger when I don't think it should do that, as the x-axis is a proportion. Possibly I am just misunderstanding something.

OK I will try and get it running on SRR7028347 over weekend so we can chat about it next week maybe. Yeah it's a bit worrying, but testing on a non-bdelloid is a good idea first imo. Bdelloids are weird, not really a good benchmark

@reubwn
Copy link
Collaborator

reubwn commented Nov 1, 2019

Rplot
Hi guys. Partial results for Lcla1 (the 548 Mb run is taking much longer) above. Genome coverage values shows as symbols, different genome sizes as the colours. It shows the same thing: increased TEs as a function of increasing sampled nucleotides... It looks kinda linear, but I'm not sure it is.

Plot showing the % TEs for 137 Mb (black) and 274 Mb (red) below:
Rplot01
This shows that within a given genome size, the %TEs that come out the other end are dependent on the coverage supplied. E.g., 0.3 to 0.7 coverage for the red one goes from 15.7% to 18.0%, an increase of 2.3% absolute; for Ar, it went from 1.2% to 3.3%, so that's actually a smaller absolute increase than for Lcla1, but I guess it looks worse because the numbers are so small already for Ar.

This was after using Skewer to trim the reads from SRR7028347, dnaPipeTE v.1.3.1_07.dec.2017 (is that the same as yours?) with -RM_lib pointed at metazoan repeats.

Basically, I think plot 1 should be linear at lower values, eventually reaching a plateau as you sample all of the TEs in the genome. Plot 2 should be flat. Neither of these seem to be what we see. What are you guys thinking about this?

@KamilSJaron
Copy link
Owner Author

This looks so weird and it contradict benchmark they talk about it in their paper (in the supplement they say that the increasing coverage did not yield in higher fractions of annotated TEs, but it increased the computational time). i think it's the time to talk to developers of dnaPipeTE. I wonder if there is a way how to ratain sanity of the analysis.

I will also once again do the figure of our dnaPipeTE estimates vs published values and post it here for completeness.

@KamilSJaron
Copy link
Owner Author

Well, Clém said that our sampling is too high, not too low and he speculated that the saturation should happen only on the level of annotated TEs and that this was observed before. Could we check this out? If we see the pattern even if we exclude the unknown TEs.

I would be happy to do that if you share me the Count file @reubwn .

If that's the case, we could either exclude unknown from all the estimates and claim to report the "more active part of transposons" or we could rerun the estimates with smaller sampling values to avoid this strange behaviour (as also suggested by Clém).

Not sure what is the right course of action here, but I feel we need to decide fast as this is kind of blocking the resubmission.

@reubwn
Copy link
Collaborator

reubwn commented Nov 5, 2019

Well, Clém said that our sampling is too high, not too low and he speculated that the saturation should happen only on the level of annotated TEs and that this was observed before. Could we check this out? If we see the pattern even if we exclude the unknown TEs.

The graphs I made above are already only for annotated TEs; unannotated, "na" and "Other" etc have already been excluded (see the data at the Google sheet, which has all the Counts.txt data in there too).

I can run Lcla1 and Ar at lower coverage values to see if we can observe any saturation, this will be much quicker than before (it can be done by the end of the week). I would expect not however, because of above. If we do see it, we could quite easily re-run the analysis on the samples (e.g. I can help with the compute to speed things up).

Alternative ideas:

  1. Use the results that we have already, and say explicitly that the values for Ar (maybe all the bdelloids) might be unreliable (see below).
  2. Do you have RepeatMasker results from the assembled genomes? Of course there are other issues with using this tool, but I think that method is easier to understand, maybe the results are less controversial.

Overall, my understanding is that we do not make a big conclusion from the TEs results, therefore I think it is perfectly OK to just explain the limitations here. Estimating TE content in these non-model organisms is (bloody) difficult and the results are a hypothesis; we do not claim to annotate every bp with complete accuracy. In that sense, so long as we are clear with our limitations and explain the caveats that we do already know about, I don't think this needs to be a major issue or stand in the way of resubmission for very long! (but I'm still interested to get to the bottom of it because I've been using dnaPipeTE a lot recently.)

Just let me know how I can help.

@jensbast
Copy link
Collaborator

jensbast commented Nov 6, 2019

I agree, good to rerun with low coverage value.
And if we do not get more confident results, I guess we go with strategy 1 and be clear about it, as you say, @reubwn.
I think strategy 2 (RMing the genomes) is not preferable, because we wanted to base as far as we could the results on read-only based methods. And as you say, RM brings its own problems and biases (as many as DNApipeTE).

@reubwn
Copy link
Collaborator

reubwn commented Nov 6, 2019

Cool, I'll run Aric1 and Lcla1 at 0.05, 0.1, 0.15, 0.2 and 0.25 now, at least that will give us a better idea of what's going on across the range of coverage values.

I think strategy 2 (RMing the genomes) is not preferable, because we wanted to base as far as we could the results on read-only based methods.

Yep, I think the approach is the right one for sure. I've got loads of data for a bunch of bdelloids on RM and dnaPipeTE, eventually I'll do some tests to see how much they correlate. But for this I think we stick with dnaPipeTE - it removes entirely the question of how each genome was assembled, which (hopefully) is a real strength of the comparative analysis.

@KamilSJaron
Copy link
Owner Author

The graphs I made above are already only for annotated TEs; unannotated, "na" and "Other" etc have already been excluded (see the data at the Google sheet, which has all the Counts.txt data in there too).

Alright. Then unknown do not explain (and sorry for not checking the sheet, I was a bit in rush when I tried to respond the last time).

  1. Use the results that we have already, and say explicitly that the values for Ar (maybe all the bdelloids) might be unreliable (see below).

This would be an easy way how if we would manage to phrase it well.

@KamilSJaron
Copy link
Owner Author

KamilSJaron commented Nov 7, 2019

Ha @reubwn, I found one little detail in the spreadsheet you shared. You used Ar total for Lcla fractions. When corrected, it looks like this. There is still the pattern, but the numbers are not THAT worrisome.

P.S. The graphs you posted here are correct. Just the numbers in the Lcla1_simple in the sheet I shared are now corresponding with the figure.

@reubwn
Copy link
Collaborator

reubwn commented Nov 7, 2019

Weird... well spotted! But yeah, the graphs are correct - for Lcla1, genome size = 274 MB, we see an increase from 15.7% to 16.4% to 18.0% TEs across the coverage values 0.3, 0.5, 0.7. It's true that this is not such a huge increase. It's more the trend that is worrying - I guess there would always be a bit of variation (and we have only 3 data points).

We'll know more when the lower coverage counts are ready. The new Google sheet seems to be different - can you give us edit access and I can put the new dnaPipeTE values into your version when the program has finished?

@KamilSJaron
Copy link
Owner Author

Alright, one more question. Do you think we should the put investigation of this thread in supplementary materials?

I guess it would not harm and show that we actually looked into it. However, it's yet another section.

@reubwn
Copy link
Collaborator

reubwn commented Nov 13, 2019

Hi Kamil, here are the results from the lower coverage runs:
Rplot01
Fig 1 shows relationship between sampled nt vs TE nt across a range of coverage values for 2 separate genome size values (black: 87.5 Mb, red: 175 Mb) for Aric1.

Rplot
Fig 2 shows same thing but comparing Aric1 175 Mb vs Lcla1 274 Mb.

Rplot03
Fig 3 shows same data but expressed as % TEs of sampled nt.

Counts.txt data and code are attached below.
coverage_analysis.zip

@KamilSJaron
Copy link
Owner Author

It does not appear to me there either of the species is converging around any value. :-(

-> I still have hard times figuring out how exactly to phrase it. I will try to get it done over the weekend.

Thanks @reubwn for all this analyses! Although a bit frustrating, they were very helpful.

@reubwn
Copy link
Collaborator

reubwn commented Nov 15, 2019

Yes - although the blue one for Lcla1 seems to level off from 0.1 to 0.25, but then it climbs again.

Thanks @reubwn for all this analyses! Although a bit frustrating, they were very helpful.

No worries, and sorry that it's been frustrating (I can understand that!). I don't really know what to think about it - happy to Skype and chat more next week if you want.

@KamilSJaron
Copy link
Owner Author

Alright, closing this issue for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants