Reproducible Scripts for Cho et al., 2018
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
README.md

README.md

Reproducible Scripts for the Publication

Jungnam Cho, Matthias Benoit, Marco Catoni, Hajk-Georg Drost, Anna Brestovitsky, Matthijs Oosterbeek and Jerzy Paszkowski. Sensitive detection of pre-integration intermediates of LTR retrotransposons in crop plants. Nature Plants (2018). doi: http://dx.doi.org/10.1038/s41477-018-0320-9

For the computational reproducibility of de novo annotations of LTR retrotransposons we implemented the pipeline LTRpred. LTRpred calls the command line tools suffixerator, LTRharvest (Ellinghaus et al., 2008), and LTRdigest (Steinbiss et al., 2009), which are part of the GenomeTools library (Gremme et al., 2013) to screen for repeated LTRs, specific sequence motifs such as primer binding sites (PBS), polypurine tract motifs (PPT), and target site duplications (TSD) and for conserved protein domains such as reverse transcriptase (gag), integrase DNA binding domain, integrase Zinc binding domain, RNase H, and the integrase core domain. Subsequently, LTRpred implements customized parser functions to import LTRdigest output and structures the data in a tidy data format (Wickham, 2014) which subsequently enables automation of false positive curation. In a second step, open reading frame (ORF) prediction is performed by a customized wrapper function that runs the command line tool usearch (Edgar, 2010). This automated step allows to automatically filter out RTEs that might have conserved protein domains such as an integrase or a reverse transcriptase, but fail to have ORFs and thus are not expressed. In a third step, RTE family clustering is performed using the command line tool vsearch (Rognes et al., 2016) which defines family members by >90% sequence homology of the full element to each other. In a fourth step, an automated hmmer search (Finn et al., 2011) against the Dfam database (Hubley et al., 2016) is performed to assign super-family associations such as Copia or Gypsy by comparing the protein domains of de novo predicted RTEs with already annotated RTEs in the Dfam database. For each step, LTRpred implements customized parser functions to import usearch, vsearch, and Dfam output and transforms this output in tidy data format for subsequent automated false positive curation. In a fifth step, for each predicted element the count and proportion (count divided by element length) of CHH, CHG, CG, and NNN motifs are quantified for the entire element, the 3’ LTR and 5’ LTR separately. In a sixth step, automated false positive curation is performed by the LTRpred function quality.filter() to conservatively reduce false positive predictions.

Please install the following R packages before running the reproducible scripts:

install.packages("dplyr")
install.packages("ggplot2")
install.packages("readr")
install.packages("readxl")

source("http://bioconductor.org/biocLite.R")
biocLite('biomartr')

biocLite("devtools")
biocLite("HajkD/LTRpred")

Please also make sure that you follow the INSTALLATION instructions of the LTRpred package to install all command line tools that LTRpred depends on.

Resources for Annotation

tRNAs

We retrieved tRNA sequences in *.fasta format from the following databases:

We combined tRNA sequences from both databases to have a comprehensive collection of tRNA sequences specific for each kingdom of life.

HMM Models

We retrieved the HMM models for protein domain annotation of the region between de novo predicted LTRs from Pfam:

Running LTRpred

The following code can be run on a computer with 4 cores. Please be aware that computation times might correspond to days due to the genome sizes of the respective species.

For further details about LTRpred please consult the LTRpred: Introduction Vignette.

Generating de novo LTR retrotransposon annotation for Arabidopsis thaliana

library(LTRpred)
# de novo LTR transposon prediction of 'A. thaliana'
LTRpred(
      genome.file = "Athaliana.fa",
      cluster     = TRUE,
      cores       = 4,
      copy.number.est = FALSE,
      minlenltr   = 100,
      maxlenltr   = 5000,
      mindistltr  = 4000,
      maxdistltr  = 30000,
      mintsd      = 3,
      maxtsd      = 20,
      vic         = 80,
      overlaps    = "no",
      xdrop        = 7,
      motifmis    = 1,
      pbsradius   = 60,
      pbsalilen   = c(8,40),
      pbsoffset   = c(0,10),
      quality.filter = TRUE,
      n.orfs      = 0
      )
# import LTRpred output
Athaliana_LTRpred <- read.ltrpred("Athaliana_ltrpred/Athaliana_LTRpred_DataSheet.tsv")

Generating de novo LTR retrotransposon annotation for Solanum lycopersicum

library(LTRpred)
# de novo LTR transposon prediction of 'S. lycopersicum'
LTRpred(
      genome.file = "Slycopersicum.fa",
      cluster     = TRUE,
      cores       = 4,
      copy.number.est = FALSE,
      minlenltr   = 100,
      maxlenltr   = 5000,
      mindistltr  = 4000,
      maxdistltr  = 30000,
      mintsd      = 3,
      maxtsd      = 20,
      vic         = 80,
      overlaps    = "no",
      xdrop        = 7,
      motifmis    = 1,
      pbsradius   = 60,
      pbsalilen   = c(8,40),
      pbsoffset   = c(0,10),
      quality.filter = TRUE,
      n.orfs      = 0
      )
# import LTRpred output
Slycopersicum_LTRpred <- read.ltrpred("Slycopersicum_ltrpred/Slycopersicum_LTRpred_DataSheet.tsv")

Generating de novo LTR retrotransposon annotation for Oryza sativa

library(LTRpred)
# de novo LTR transposon prediction of 'O. sativa'
LTRpred(
      genome.file = "Osativa.fa",
      cluster     = TRUE,
      cores       = 4,
      copy.number.est = FALSE,
      minlenltr   = 100,
      maxlenltr   = 5000,
      mindistltr  = 4000,
      maxdistltr  = 30000,
      mintsd      = 3,
      maxtsd      = 20,
      vic         = 80,
      overlaps    = "no",
      xdrop        = 7,
      motifmis    = 1,
      pbsradius   = 60,
      pbsalilen   = c(8,40),
      pbsoffset   = c(0,10),
      quality.filter = TRUE,
      n.orfs      = 0
      )
# import LTRpred output
Osativa_LTRpred <- read.ltrpred("Osativa_ltrpred/Osativa_LTRpred_DataSheet.tsv")

References

Edgar,R.C. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461.

Ellinghaus,D. et al. (2008) LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics, 9, 18.

Finn,R.D. et al. (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Res., 39, W29–W37.

Gremme,G. et al. (2013) GenomeTools: A Comprehensive Software Library for Efficient Processing of Structured Genome Annotations. IEEE/ACM Trans. Comput. Biol. Bioinforma., 10, 645–656.

Hubley,R. et al. (2016) The Dfam database of repetitive DNA families. Nucleic Acids Res., 44, D81–D89.

Lawrence,M. et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput. Biol., 9, 1–10.

Rognes,T. et al. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584.

Steinbiss,S. et al. (2009) Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Res., 37, 7002–7013.

Wickham,H. (2009) ggplot2 Springer New York, New York, NY.

Wickham,H. (2014) Tidy Data. J. Stat. Softw., 59, 1–23.