-
Notifications
You must be signed in to change notification settings - Fork 8
/
Introduction.Rmd
1369 lines (1133 loc) · 75.9 KB
/
Introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Introduction to LTRpred"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction}
%\VignetteEngine{knitr::rmarkdown}
%\usepackage[utf8]{inputenc}
---
# Table of Contents
- [1. Introduction](#introduction)
- [2. Getting Started](#getting-started)
- [2.1 Installation](#installation)
- [2.1.1 LTRpred Docker Container](#ltrpred-docker-container)
- [2.1.2 Install tool dependencies on Linux](#install-tool-dependencies-on-linux)
- [2.2 Quick Start](#quick-start)
- [2.3 LTRpred output](#ltrpred-output)
- [2.4 Import LTRpred output](#import-ltrpred-output)
- [2.5 Output file format of LTRpred()](#output-file-format-of-ltrpred)
- [3. Detailed LTRpred run](#detailed-ltrpred-run)
- [3.1 HMM Models](#hmm-models)
- [3.2 Detailed description of adjustable LTRpred parameters](#detailed-description-of-adjustable-ltrpred-parameters)
- [4. Metagenome scale annotations](#metagenome-scale-annotations)
# Introduction
The `LTRpred` package implements a software pipeline and provides an integrated workflow to screen for intact and potentially active LTR retrotransposons in any genomic sequence of interest. For this purpose, this package provides a rich set of analytics tools to allow researchers to quickly start annotating and explore their own genomes.
The _de novo_ prediction of LTR transposons in `LTRpred` is based on the command line tools
[LTRharvest](http://www.zbh.uni-hamburg.de/?id=206) and [LTRdigest](http://www.zbh.uni-hamburg.de/forschung/arbeitsgruppe-genominformatik/software/ltrdigest.html) and extends these search strategies by additional analytics
modules to filter the search space of putative LTR transposons for biologically meaningful candidates.
Please make sure that all command line tools are installed properly before running any `LTRpred` function.
# Getting Started
The rationale for implementing `LTRpred` was to implement an R based pipeline combining the most
sensitive, accurate, and conservative state-of-the art LTR retrotransposon
detection tools and extend their inference by additional analyses and quality filtering steps to
screen for functional and structurally intact elements. Hence, _LTRpred_
aims to provide a high-level _de novo_ prediction infrastructure to generate high quality annotations
of intact LTR retrotransposons. All `LTRpred` functions are generic so that parameters can be changed to detect any form of LTR retrotransposon in any genome.
Internally, _LTRpred_ is based on the `de novo` annotation tools `LTRharvest` and `LTRdigest` which use prior knowledge
about DNA sequence features (also referred to as `Structure-based methods`) such as the homology of Long Terminal Repeats (LTRs), Primer Binding Sites (PBS), _gag_ and _pol_ protein domains, and
Target Site Duplications (TSDs) that are known to enable the process of transposition ([Bergman and Quesneville, 2007](http://bib.oxfordjournals.org/content/8/6/382.long)) to infer LTR retrotransposons in any genome.
Hence, these `de novo` annotation tools are designed to screen the genome systematically and efficiently for
these structural DNA features. __Figure 1__ shows the structural features of LTR retrotransposons
that are used for predicting putative LTR transposons `de novo` in any genome of interest.
![Sequence features of LTR retrotransposons. The structural element (LTR length, LTR similarity, TSD, max. size of full LTR retrotransposon, PBS length, tRNA binding, protein domain search, etc.) can be modified and controlled separately by corresponding arguments implemented in the `LTRpred()` function. In addition to controlling the structural features of candidate LTR retrotransposons, a open reading frame prediction, Dfam annotation, copy number clustering, and LTR copy number estimation, and methylation context quantification is performed subsequently after reporting putative LTR retrotransposons.](LTRfeatures.png)
Based on the optimized output of these tools, the `LTRpred` package aims to provide researchers with maximum flexibility of
adjustable parameters to detect any type of functional LTR retrotransposon. `LTRpred` package allows users to modify a vast range of
parameters to screen for potential LTR transposons, so having this template shown in
__Figure 1__ in mind will help researchers to modify structural parameters in a biologically meaningful manner.
## Installation
The fastest way to run `LTRpred` is to download the `ltrpred` Docker container which
includes all pre-installed tool dependencies. For users who cannot use a Docker environment
the individual installation instructions for each dependency tool are listed below (only for Linux).
### LTRpred Docker Container
Please be aware that the `drostlab/ltrpred` container (command line version) is suitable for `Linux`, `macOS`, and `Windows` users. Whereas the `RStudio` container version `drostlab/ltrpred_rstudio` is not suitable for `Windows` users, because a port bridge cannot be established.
Please make sure to create a [Docker](https://www.docker.com/get-started) account and to [install Docker](https://docs.docker.com/engine/install/) on your system.
#### Download `drostlab/ltrpred` container for use with R command line
```bash
# retrieve docker image from dockerhub
docker pull drostlab/ltrpred
# run ltrpred container
docker run --rm -ti drostlab/ltrpred
# start R prompt within ltrpred container
~:/app# R
```
Within the ltrpred container R prompt run the `ltrpred` example:
```r
LTRpred::LTRpred(genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"))
```
To exit R in the container run:
```r
q()
```
And to exit the `ltrpred` container run:
```bash
~:/app# exit
```
Now, users can add their own genome data as well as the Dfam database for
further annotation to the `ltrpred` container by following these steps (in a different Terminal window):
```bash
# go to the folder path in which you want to
# store all genome and Dfam data you want to
# mount in the ltrpred container and then run:
# create a new folder which will store
# all files that will be required in the
# ltrpred container
mkdir ltrpred_data
cd ltrpred_data
# create a dfam database folder
mkdir Dfam
cd Dfam
```
Now users can download and format the Dfam database as follows (within the `Dfam` folder created above). Unfortunately, the `Dfam` database size is too large to make it part of the `drostlab/ltrpred` container. In addition, the database is frequently curated and updated. Thus, it is recommended that users download and format the `Dfam` database to their local hard drive and mount it to the running `drostlab/ltrpred` container.
__To format the Dfam database locally, users need to install [HMMER](http://hmmer.org/download.html) on their local machine (to use `hmmpress`). However, within the `drostlab/ltrpred` and `drostlab/ltrpred_rstudio` containers HMMER is already preinstalled and does not need to be installed by the user.__ An example installation of HMMER for Linux machines is listed below.
For `macOS` users, please install `wget` on your `masOS` machine using [Homebrew](http://brew.sh/).
```bash
wget https://www.dfam.org/releases/Dfam_3.1/families/Dfam.hmm.gz
gunzip Dfam.hmm.gz
# format database by running hmmpress
hmmpress Dfam.hmm
cd ..
```
Next, make sure to also store the genome assembly file (in `fasta` format) you
want to _de novo_ annotate with `LTRpred` in the `ltrpred_data` folder you just created.
A possible way to retrieve such a genome is (within R) using [biomartr](https://github.com/ropensci/biomartr).
If you use `biomartr` please make sure to [install all `biomartr` package dependencies](https://github.com/ropensci/biomartr#installation) before running the following code.
```r
# install.packages("biomartr")
biomartr::getGenome(db = "ensembl",
organism = "Saccharomyces cerevisiae",
path = "yeast_genome",
gunzip = TRUE)
```
The respective genome assembly file is now stored at `yeast_genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa` and needs to be copied
into the `ltrpred_data` folder you just created.
Now users can mount the `ltrpred_data` folder to the `ltrpred` Docker container (using the `-v` option). This `-v` mounting option is also available for the `RStudio` container version and can also be run within the `RStudio` Terminal.
```bash
docker run --rm -p 8787:8787 -v /put/here/your/path/to/ltrpred_data:/app/ltrpred_data -ti ltrpred
```
Within the `ltrpred` Docker container the `ltrpred_data` folder is now stored in the
working directory `/app`.
When running the `ltrpred` Docker container you should be able to see the
mounted `ltrpred_data` folder as following:
```bash
~:/app# ls
```
```
latticeExtra_0.6-28.tar.gz
ltrpred_data
software_downloads
```
Next, users can again run the `R prompt` within the `ltrpred` Docker container to
run `LTRpred` with the local data that was mounted:
```bash
~:/app# R
```
```r
# run LTRpred on the yeast genome that was mounted
# to the ltrpred container from your local 'ltrpred_data' folder
LTRpred(genome.file = "ltrpred_data/yeast_genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa", cores = 2)
```
As you can see, within the `ltrpred` container `R prompt` the current working directory is `/app`.
To also include the `Dfam` database for further annotation users can specify
the path to the Dfam folder:
```r
# run LTRpred on the yeast genome that was mounted
# to the ltrpred container from your local 'ltrpred_data' folder
LTRpred(genome.file = "ltrpred_data/yeast_genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa",
annotate = "Dfam",
Dfam.db = "ltrpred_data/Dfam",
cores = 2)
```
Please be aware that using the `Dfam` database for further annotation significantly increases
the computation time of the `LTRpred` pipeline.
#### Download `drostlab/ltrpred_rstudio` container for use with RStudio Server
```bash
# retrieve docker image from dockerhub
docker pull drostlab/ltrpred_rstudio
# run ltrpred container
docker run -e PASSWORD=ltrpred --rm -p 8787:8787 -ti drostlab/ltrpred_rstudio
```
To open `RStudio` and interact with the container go to your standard web browser and type in the following url:
`http://localhost:8787`
Username: `rstudio`
Password: `ltrpred`
Within `RStudio` you can now run the example:
```r
LTRpred::LTRpred(genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"))
```
Users can `exit` the container by pressing `Ctrl + c` multiple times.
Next, users can mount their `ltrpred_data` folder to the RStudio server run
the same way they mounted folders in the command line container version (using `-v`).
This folder mounting can also be run within the `RStudio` Terminal of the `drostlab/ltrpred_rstudio` container.
```bash
# retrieve docker image from dockerhub
docker pull drostlab/ltrpred_rstudio
# run ltrpred container
docker run -e PASSWORD=ltrpred --rm -p 8787:8787 -v /put/here/your/path/to/ltrpred_data:/home/rstudio/ltrpred_data -ti drostlab/ltrpred_rstudio
```
Now go to your standard web browser and type in the following url:
`http://localhost:8787`
Username: `rstudio`
Password: `ltrpred`
In RStudio type:
```r
list.files()
```
You should be able to see the `ltrpred_data` folder.
Users can `exit` the container by pressing `Ctrl + c` multiple times.
__Please read more details about how to transfer genome files and the Dfam database in the
previous section `Download ltrpred container for use with R command line`__.
### Install Tool Dependencies on Linux
#### Programming Languages
Please make sure that the following programming languages are installed on your system:
- [Perl](https://www.perl.org/get.html)
- [Ruby](https://www.ruby-lang.org/en/documentation/installation/)
- [Python](https://www.python.org/downloads/)
- [C/C++](https://www.cplusplus.com/)
- [R](http://lib.stat.cmu.edu/R/CRAN/)
#### Install Programming languages and Linux Tools
```bash
apt-get update \
&& apt-get -y install apt-utils \
&& apt-get -y install gcc \
&& apt-get -y install python3 \
&& apt-get -y install perl \
&& apt-get -y install make \
&& apt-get -y install sudo \
&& apt-get -y install wget \
&& apt-get -y install genometools \
&& apt-get -y install git \
&& apt-get -y install autoconf \
&& apt-get -y install g++ \
&& apt-get -y install ncbi-blast+ \
&& apt-get -y install build-essential \
&& apt-get -y install libcurl4-gnutls-dev \
&& apt-get -y install libxml2-dev \
&& apt-get -y install libssl-dev \
&& apt-get -y install libpq-dev \
&& apt-get -y install software-properties-common
```
#### Install [HMMER](http://hmmer.org/download.html)
A detailed description of how to install `HMMER` for several operating systems
can be found [here](http://hmmer.org/download.html).
```bash
mkdir software_downloads \
&& cd software_downloads \
&& wget http://eddylab.org/software/hmmer/hmmer-3.3.tar.gz \
&& tar xf hmmer-3.3.tar.gz \
&& cd hmmer-3.3 \
&& ./configure \
&& make \
&& make check \
&& sudo make install \
&& cd ..
```
#### Install [USEARCH](http://drive5.com/usearch/download.html)
A detailed description of how to install `USEARCH` for several operating systems
can be found [here](http://drive5.com/usearch/manual/install.html).
First, users will need to register and download USEARCH for their operating system
from http://drive5.com/usearch/download.html .
After downloading USEARCH you will need to install it as a command line tool in your `/usr/local/` directory and you should be able to execute the following command in your Terminal:
```sh
cd software_downloads \
&& wget https://www.drive5.com/downloads/usearch11.0.667_i86linux32.gz \
&& gunzip usearch11.0.667_i86linux32.gz \
&& chmod +x usearch11.0.667_i86linux32 \
&& sudo mv usearch11.0.667_i86linux32 usearch \
&& sudo cp usearch /usr/local/bin/usearch \
&& cd ..
usearch -version
```
```
usearch v11.0.667_i86linux32
```
#### Install [VSEARCH](https://github.com/torognes/vsearch)
A detailed description of how to install `VSEARCH` for several operating systems
can be found [here](https://github.com/torognes/vsearch).
Please [install git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) before running the following commands.
```bash
cd software_downloads \
&& wget https://github.com/torognes/vsearch/archive/v2.14.2.tar.gz \
&& tar xzf v2.14.2.tar.gz \
&& cd vsearch-2.14.2 \
&& sudo ./autogen.sh \
&& sudo ./configure \
&& sudo make \
&& sudo make install \
&& cd ..
```
### Install [dfamscan.pl](http://www.dfam.org/web_download/Current_Release/dfamscan.pl)
[dfamscan.pl](https://www.dfam.org/releases/Dfam_3.1/infrastructure/dfamscan.pl.gz) needs to be unzipped and stored at `/usr/local/bin/dfamscan.pl` and executabe via `perl /usr/local/bin/dfamscan.pl -help`. This is important to be able to run the hmmer search against the Dfam database.
```bash
cd software_downloads \
&& wget https://www.dfam.org/releases/Dfam_3.1/infrastructure/dfamscan.pl.gz \
&& gunzip dfamscan.pl.gz \
&& sudo cp dfamscan.pl /usr/local/bin/dfamscan.pl \
&& cd ..
```
Users can download the [Dfam database v3.1](https://www.dfam.org/releases/Dfam_3.1/families/) by running:
```sh
wget https://www.dfam.org/releases/Dfam_3.1/families/Dfam.hmm.gz
gunzip Dfam.hmm.gz
# run hmmpress
hmmpress Dfam.hmm
```
The path to the folder where the formatted `Dfam` database can be found can then be passed
as argument `Dfam.db` to the `LTRpred::LTRpred()` function.
### Install R packages
Please make sure that [Bioconductor](https://www.bioconductor.org/install/) and all package dependencies are installed on the system on which you would like to run `LTRpred`.
Install prerequisite CRAN and Bioconductor packages:
```r
install.packages('devtools')
install.packages('tidyverse')
install.packages('BiocManager')
BiocManager::install()
BiocManager::install(c('rtracklayer', 'GenomicFeatures', 'GenomicRanges', 'GenomeInfoDb', 'biomaRt', 'Biostrings'))
install.packages(c('tidyverse', 'data.table', 'seqinr', 'biomartr', 'ape', 'dtplyr', 'devtools'))
devtools::install_github('HajkD/metablastr', build_vignettes = TRUE, dependencies = TRUE)
install.packages(c('BSDA', 'ggrepel', 'gridExtra'))
https://cran.r-project.org/src/contrib/Archive/latticeExtra/latticeExtra_0.6-28.tar.gz
install.packages('latticeExtra_0.6-28.tar.gz', type = 'source')
install.packages('survival')
BiocManager::install('ggbio')
```
Now users may install `LTRpred` as follows:
```r
# install.packages("devtools")
devtools::install_github("HajkD/LTRpred")
```
## Quick Start
The fastest way to generate a LTR retrotransposon prediction for a genome of interest (after [installing](https://hajkd.github.io/LTRpred/articles/Introduction.html#installation) all prerequisite command line tools) is to use the
`LTRpred()` function and relying on the default parameters. In the following example,
a LTR transposon prediction is performed for parts of the Human Y chromosome.
```r
# load LTRpred package
library(LTRpred)
# de novo LTR transposon prediction for the Human Y chromosome
LTRpred(
genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"),
cores = 4
)
```
```
Running LTRpred on genome '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/LTRpred/Hsapiens_ChrY.fa' with 4 core(s) and searching for retrotransposons using the overlaps option (overlaps = 'no') ...
No hmm files were specified, thus the internal HMM library will be used! See '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/LTRpred/HMMs/hmm_*' for details.
No tRNA files were specified, thus the internal tRNA library will be used! See '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/LTRpred/tRNAs/tRNA_library.fa' for details.
The output folder 'Hsapiens_ChrY_ltrpred' seems to exist already and will be used to store LTRpred results ...
LTRpred - Step 1:
Run LTRharvest...
LTRharvest: Generating index file Hsapiens_ChrY_ltrharvest/Hsapiens_ChrY_index.fsa with gt suffixerator...
Running LTRharvest and writing results to Hsapiens_ChrY_ltrharvest...
LTRharvest analysis finished!
LTRpred - Step 2:
Generating index file Hsapiens_ChrY_ltrdigest/Hsapiens_ChrY_index_ltrdigest.fsa with suffixerator...
LTRdigest: Sort index file...
Running LTRdigest and writing results to Hsapiens_ChrY_ltrdigest...
LTRdigest analysis finished!
LTRpred - Step 3:
Import LTRdigest Predictions...
Input: Hsapiens_ChrY_ltrdigest/Hsapiens_ChrY_LTRdigestPrediction.gff -> Row Number: 179
Remove 'NA' -> New Row Number: 179
(1/8) Filtering for repeat regions has been finished.
(2/8) Filtering for LTR retrotransposons has been finished.
(3/8) Filtering for inverted repeats has been finished.
(4/8) Filtering for LTRs has been finished.
(5/8) Filtering for target site duplication has been finished.
(6/8) Filtering for primer binding site has been finished.
(7/8) Filtering for protein match has been finished.
(8/8) Filtering for RR tract has been finished.
LTRpred - Step 4:
Perform ORF Prediction using 'usearch -fastx_findorfs' ...
usearch v8.1.1861_i86osx32, 4.0Gb RAM (17.2Gb total), 8 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
00:00 1.9Mb 100.0% Working
Join ORF Prediction table: nrow(df) = 24 candidates.
unique(ID) = 24 candidates.
unique(orf.id) = 24 candidates.
LTRpred - Step 5:
Perform methylation context quantification..
Join methylation context (CG, CHG, CHH, CCG) count table: nrow(df) = 24 candidates.
unique(ID) = 24 candidates.
unique(orf.id) = 24 candidates.
Copy files to result folder 'Hsapiens_ChrY_ltrpred'.
LTRpred - Step 6:
Starting retrotransposon evolutionary age estimation by comparing the 3' and 5' LTRs using the molecular evolution model 'K80' and the mutation rate '1.3e-07' (please make sure the mutation rate can be assumed for your species of interest!) for 24 predicted elements ...
Please be aware that evolutionary age estimation based on 3' and 5' LTR comparisons are only very rough time estimates and don't take reverse-transcription mediated retrotransposon recombination between family members of retroelements into account! Please consult Sanchez et al., 2017 Nature Communications and Drost & Sanchez, 2019 Genome Biology and Evolution for more details on retrotransposon recombination.
LTRpred - Step 7:
The LTRpred prediction table has been filtered (default) to remove potential false positives. Predicted LTRs must have an PBS or Protein Domain and must fulfill thresholds: sim = 70%; #orfs = 0. Furthermore, TEs having more than 10% of N's in their sequence have also been removed.
Input #TEs: 24
Output #TEs: 21
LTRpred finished all analyses successfully. All output files were stored at 'Hsapiens_ChrY_ltrpred'.
[1] "Successful job 1 ."
```
### LTRpred output
The `LTRpred()` function internally generates a folder named `*_ltrpred` which
stores all output annotation and sequence files.
In detail, the following files and folders are generated by the `LTRpred()` function:
- __Folder `*_ltrpred`__
- `*_ORF_prediction_nt.fsa` : Stores the predicted open reading frames within the predicted LTR transposons as DNA sequence.
- `*_ORF_prediction_aa.fsa` : Stores the predicted open reading frames within the predicted LTR transposons as protein sequence.
- `*_LTRpred.gff` : Stores the LTRpred predicted LTR transposons in GFF format.
- `*_LTRpred.bed` : Stores the LTRpred predicted LTR transposons in BED format.
- `*_LTRpred_DataSheet.tsv` : Stores the output table as data sheet.
- __Folder `*_ltrharvest`__
- `*_ltrharvest/*_BetweenLTRSeqs.fsa` : DNA sequences of the region between the LTRs in fasta format.
- `*_ltrharvest/*_Details.tsv` : A spread sheet containing detailed information about the predicted LTRs.
- `*_ltrharvest/*_FullLTRRetrotransposonSeqs.fsa` : DNA sequences of the entire predicted LTR retrotransposon.
- `*_ltrharvest/*_index.fsa` : The suffixarray index file used to predict putative LTR retrotransposons.
- `*_ltrharvest/*_Prediction.gff` : A spread sheet containing detailed additional information about the predicted LTRs (partially redundant with the *_Details.tsv file).
- `*_ltrdigest/*_index_ltrdigest.fsa` : The suffixarray index file used to predict putative LTR retrotransposonswith LTRdigest.
- __Folder `*_ltrdigest`__
- `*_ltrdigest/*_LTRdigestPrediction.gff` : A spread sheet containing detailed information about the predicted LTRs.
- `*_ltrdigest/*-ltrdigest_tabout.csv` : A spread sheet containing additional detailed information about the predicted LTRs.
- `*_ltrdigest/*-ltrdigest_complete.fas` : The full length DNA sequences of all predicted LTR transposons.
- `*_ltrdigest/*-ltrdigest_conditions.csv` : Contains information about the parameters used for a given LTRdigest run.
- `*_ltrdigest/*-ltrdigest_pbs.fas` : Stores the predicted PBS sequences for the putative LTR retrotransposons.
- `*_ltrdigest/*-ltrdigest_ppt.fas` : Stores the predicted PPT sequences for the putative LTR retrotransposons.
- `*_ltrdigest/*-ltrdigest_5ltr.fas` and `*-ltrdigest_3ltr.fas`: Stores the predicted 5' and 3' LTR sequences. Note: If the direction of the putative retrotransposon could be predicted, these files will contain the corresponding 3' and 5' LTR sequences. If no direction could be predicted, forward direction with regard to the original sequence will be assumed by LTRdigest, i.e. the 'left' LTR will be considered the 5' LTR.
- `*_ltrdigest/*-ltrdigest_pdom_<domainname>.fas` : Stores the DNA sequences of the HMM matches to the LTR retrotransposon candidates.
- `*_ltrdigest/*-ltrdigest_pdom_<domainname>_aa.fas` : Stores the concatenated protein sequences of the HMM matches to the LTR retrotransposon candidates.
- `*_ltrdigest/*-ltrdigest_pdom_<domainname>_ali.fas` : Stores the alignment information for all matches of the given protein domain model to the translations of all candidates.
### Import LTRpred output
The `LTRpred()` output table `*_LTRpred_DataSheet.tsv` is in [tidy](http://vita.had.co.nz/papers/tidy-data.html) format and can then be imported using `read.ltrpred()`.
The `tidy` output format is designed to work seamlessly with the [tidyverse](https://www.tidyverse.org/) and [R data science](http://r4ds.had.co.nz/) framework.
```r
# import LTRpred prediction output
Hsapiens_chrY <- read.ltrpred("Hsapiens_ChrY_ltrpred/Hsapiens_ChrY_LTRpred_DataSheet.tsv")
# look at some results
dplyr::select(Hsapiens_chrY, ltr_similarity:end, tRNA_motif, Clust_cn)
```
```
# A tibble: 14 x 9
ltr_similarity similarity protein_domain orfs chromosome
<dbl> <chr> <chr> <int> <chr>
1 80.73 (80,82] RVT_1 1 NC000024.10Homosa
2 89.85 (88,90] RVT_1 1 NC000024.10Homosa
3 79.71 (78,80] <NA> 0 NC000024.10Homosa
4 84.93 (84,86] RVT_1 0 NC000024.10Homosa
5 75.52 <NA> RVT_1 0 NC000024.10Homosa
6 76.47 [76,78] RVT_1 1 NC000024.10Homosa
7 80.28 (80,82] <NA> 0 NC000024.10Homosa
8 76.92 [76,78] RVT_1 0 NC000024.10Homosa
9 89.53 (88,90] RVT_1 0 NC000024.10Homosa
10 93.57 (92,94] RVT_1 1 NC000024.10Homosa
11 82.35 (82,84] RVT_1 2 NC000024.10Homosa
12 79.51 (78,80] RVT_1 0 NC000024.10Homosa
13 78.42 (78,80] RVT_1 3 NC000024.10Homosa
14 92.75 (92,94] RVT_1 1 NC000024.10Homosa
# ... with 4 more variables: start <int>, end <int>, tRNA_motif <chr>,
# Clust_cn <int>
```
Looking at all columns:
```r
dplyr::glimpse(Hsapiens_chrY)
```
```
Observations: 21
Variables: 92
$ species <chr> "Hsapiens_ChrY", "Hsapien...
$ ID <chr> "Hsapiens_ChrY_LTR_retrot...
$ dfam_target_name <chr> NA, NA, NA, NA, NA, NA, N...
$ ltr_similarity <dbl> 80.73, 89.85, 79.71, 83.2...
$ ltr_age_mya <dbl> 0.7936246, 0.2831139, 0.7...
$ similarity <chr> "(80,82]", "(88,90]", "(7...
$ protein_domain <chr> "RVT_1", "RVT_1", NA, NA,...
$ orfs <int> 1, 1, 0, 0, 0, 0, 0, 1, 0...
$ chromosome <chr> "NC000024.10Homosa", "NC0...
$ start <int> 3143582, 3275798, 3313536...
$ end <int> 3162877, 3299928, 3318551...
$ strand <chr> "-", "-", "+", "+", "-", ...
$ width <int> 19296, 24131, 5016, 12952...
$ annotation <chr> "LTR_retrotransposon", "L...
$ pred_tool <chr> "LTRpred", "LTRpred", "LT...
$ frame <chr> ".", ".", ".", ".", ".", ...
$ score <chr> ".", ".", ".", ".", ".", ...
$ lLTR_start <int> 3143582, 3275798, 3313536...
$ lLTR_end <int> 3143687, 3276408, 3313665...
$ lLTR_length <int> 106, 611, 130, 126, 218, ...
$ rLTR_start <int> 3162769, 3299338, 3318414...
$ rLTR_end <int> 3162877, 3299928, 3318551...
$ rLTR_length <int> 109, 591, 138, 137, 219, ...
$ lTSD_start <int> 3143578, 3275794, 3313532...
$ lTSD_end <int> 3143581, 3275797, 3313535...
$ lTSD_motif <chr> "acag", "ttgt", "ttag", "...
$ rTSD_start <int> 3162878, 3299929, 3318552...
$ rTSD_end <int> 3162881, 3299932, 3318555...
$ rTSD_motif <chr> "acag", "ttgt", "ttag", "...
$ PPT_start <int> NA, NA, NA, NA, NA, 34660...
$ PPT_end <int> NA, NA, NA, NA, NA, 34660...
$ PPT_motif <chr> NA, NA, NA, NA, NA, "agag...
$ PPT_strand <chr> NA, NA, NA, NA, NA, "+", ...
$ PPT_offset <int> NA, NA, NA, NA, NA, 23, N...
$ PBS_start <int> NA, NA, 3313667, 3372512,...
$ PBS_end <int> NA, NA, 3313677, 3372522,...
$ PBS_strand <chr> NA, NA, "+", "+", "-", "+...
$ tRNA <chr> NA, NA, "Homo_sapiens_tRN...
$ tRNA_motif <chr> NA, NA, "aattagctgga", "c...
$ PBS_offset <int> NA, NA, 1, 3, 0, 5, 2, 5,...
$ tRNA_offset <int> NA, NA, 1, 0, 2, 5, 1, 5,...
$ `PBS/tRNA_edist` <int> NA, NA, 1, 1, 1, 1, 1, 1,...
$ orf.id <chr> "NC000024.10Homosa_314358...
$ repeat_region_length <int> 19304, 24139, 5024, 12960...
$ PPT_length <int> NA, NA, NA, NA, NA, 27, N...
$ PBS_length <int> NA, NA, 11, 11, 11, 11, 1...
$ dfam_acc <chr> NA, NA, NA, NA, NA, NA, N...
$ dfam_bits <dbl> NA, NA, NA, NA, NA, NA, N...
$ dfam_e_value <dbl> NA, NA, NA, NA, NA, NA, N...
$ dfam_bias <dbl> NA, NA, NA, NA, NA, NA, N...
$ `dfam_hmm-st` <dbl> NA, NA, NA, NA, NA, NA, N...
$ `dfam_hmm-en` <dbl> NA, NA, NA, NA, NA, NA, N...
$ dfam_strand <chr> NA, NA, NA, NA, NA, NA, N...
$ `dfam_ali-st` <dbl> NA, NA, NA, NA, NA, NA, N...
$ `dfam_ali-en` <dbl> NA, NA, NA, NA, NA, NA, N...
$ `dfam_env-st` <dbl> NA, NA, NA, NA, NA, NA, N...
$ `dfam_env-en` <dbl> NA, NA, NA, NA, NA, NA, N...
$ dfam_modlen <dbl> NA, NA, NA, NA, NA, NA, N...
$ dfam_target_description <chr> NA, NA, NA, NA, NA, NA, N...
$ Clust_Cluster <chr> NA, NA, NA, NA, NA, NA, N...
$ Clust_Target <chr> NA, NA, NA, NA, NA, NA, N...
$ Clust_Perc_Ident <dbl> NA, NA, NA, NA, NA, NA, N...
$ Clust_cn <int> NA, NA, NA, NA, NA, NA, N...
$ TE_CG_abs <dbl> 62, 125, 35, 70, 139, 83,...
$ TE_CG_rel <dbl> 0.003213101, 0.005180059,...
$ TE_CHG_abs <dbl> 659, 830, 150, 396, 742, ...
$ TE_CHG_rel <dbl> 0.03415216, 0.03439559, 0...
$ TE_CHH_abs <dbl> 2571, 3454, 748, 1743, 31...
$ TE_CHH_rel <dbl> 0.1332400, 0.1431354, 0.1...
$ TE_CCG_abs <dbl> 13, 24, 6, 15, 33, 16, 4,...
$ TE_CCG_rel <dbl> 0.0006737148, 0.000994571...
$ TE_N_abs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ CG_3ltr_abs <dbl> 1, 0, 1, 4, 8, 3, 1, 11, ...
$ CG_3ltr_rel <dbl> 0.009433962, 0.000000000,...
$ CHG_3ltr_abs <dbl> 2, 24, 9, 8, 14, 14, 9, 9...
$ CHG_3ltr_rel <dbl> 0.01886792, 0.03927987, 0...
$ CHH_3ltr_abs <dbl> 18, 69, 18, 26, 43, 70, 2...
$ CHH_3ltr_rel <dbl> 0.16981132, 0.11292962, 0...
$ CCG_3ltr_abs <dbl> 0, 0, 0, 2, 2, 0, 1, 4, 5...
$ CCG_3ltr_rel <dbl> 0.000000000, 0.000000000,...
$ N_3ltr_abs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ CG_5ltr_abs <dbl> 1, 0, 1, 4, 8, 3, 1, 11, ...
$ CG_5ltr_rel <dbl> 0.009433962, 0.000000000,...
$ CHG_5ltr_abs <dbl> 2, 24, 9, 8, 14, 14, 9, 9...
$ CHG_5ltr_rel <dbl> 0.01886792, 0.03927987, 0...
$ CHH_5ltr_abs <dbl> 18, 69, 18, 26, 43, 70, 2...
$ CHH_5ltr_rel <dbl> 0.16981132, 0.11292962, 0...
$ CCG_5ltr_abs <dbl> 0, 0, 0, 2, 2, 0, 1, 4, 5...
$ CCG_5ltr_rel <dbl> 0.000000000, 0.000000000,...
$ N_5ltr_abs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ cn_3ltr <dbl> NA, NA, NA, NA, NA, NA, N...
$ cn_5ltr <dbl> NA, NA, NA, NA, NA, NA, N...
```
### Output file format of `LTRpred()`
The columns of the prediction file store the following information:
tidy format (each row contains all information for a predicted LTR retrotransposon):
- `species` : Name of the species or genomic sequence that is passed as input to `LTRpred`
- `ID` : unique identifier that is generated by `LTRpred` to mark individual LTR retrotransposon predictions
- `dfam_target_name` : name of hmmer search match in Dfam database (see [LTRpred HMM Models](#hmm-models) for details)
- `ltr_similarity` : sequence similarity (global alignment) between the 5' and 3' long terminal repeat (LTR)
- `ltr_age_mya`: Rough estimation of retrotransposon insertion age in Mya based on 5 prime and 3 prime LTR sequence homology
- `ltr_evo_distance`: the molecular evolutionary distance value returned by the `evolutionary model` (e.g. Kimura-2P-Model) used to estamate DNA distances
- `similarity`: `ltr_similarity` classified into 2% intervals
- `chromosome` : chromosome at which LTR retrotransposon was found
- `start` : start coordinate of the LTR retrotransposon within the chromosome
- `end` : end coordinate of the LTR retrotransposon within the chromosome
- `strand` : strand on which the corresponding LTR retrotransposon was found
- `dfam_acc` : Dfam accession
- `width` : length/width of the LTR retrotransposon
- `orfs` : number of predicted open reading frames within the LTR retrotransposon sequence
- `protein_domain` : name of protein domains found between the 5' and 3' LTRs (see [LTRpred HMM Models](#hmm-models) for details)
- `Clust_Cluster` : cluster identifier
- `Clust_Target` : LTR retrotransposon identifier that builds the cluster center of cluster `Clust_Cluster`
- `Clust_Perc_Ident` : sequence similarity (global alignment) with `Clust_Target`
- `Clust_cn` : number of LTR retrotransposon that belong to the same cluster (see `Clust_Target`) - if `Clust_cn` = 1 -> unique element
- `lLTR_start` : start coordinate of the 5' LTR within the chromosome
- `lLTR_end` : end coordinate of the 5' LTR within the chromosome
- `lLTR_length` : length/width of the 5' LTR
- `rLTR_start` : start coordinate of the 3' LTR within the chromosome
- `rLTR_end` : end coordinate of the 3' LTR within the chromosome
- `rLTR_length` : length/width of the 3' LTR
- `lTSD_start` : start coordinate of the 5' target site duplication (TSD) within the chromosome
- `lTSD_end` : end coordinate of the 5' target site duplication (TSD) within the chromosome
- `lTSD_motif` : 5' TSD sequence motif
- `PPT_start` : start coordinate of the PPT motif within the chromosome
- `PPT_end` : end coordinate of the PPT motif within the chromosome
- `PPT_motif` : PPT sequence motif
- `PPT_strand` : strand of PPT sequence
- `PPT_offset` : width between `PPT_end` and `rLTR_start`
- `PBS_start` : start coordinate of the PBS motif within the chromosome
- `PBS_end` : end coordinate of the PBS motif within the chromosome
- `PBS_strand` : strand of PBS sequence
- `tRNA` : name of tRNA that matches PBS sequence
- `tRNA_motif` : tRNA sequence motif
- `PBS_offset` : width between `lLTR_end` and `PBS_start`
- `tRNA_offset` : number of nucleotide that are allowed for tRNA to offset the PBS
- `PBS/tRNA_edist` : edit distance between tRNA motif and PBS motif
- `PPT_length` : length/width of the PPT motif
- `PBS_length` : length/width of the PBS motif
- `dfam_target_description` : description of Dfam target hit
- `orf.id` : id generated by ORF prediction
- `TE_CG_abs` : total number of CG motifs found within the full length LTR retrotransposon sequence
- `TE_CG_rel` : relative number of CG motifs found within the full length LTR retrotransposon (normalized by element length - `width`)
- `TE_CHG_abs` : total number of CHG motifs found within the full length LTR retrotransposon sequence
- `TE_CHG_rel` : relative number of CHG motifs found within the full length LTR retrotransposon sequence (normalized by element length - `width`)
- `TE_CHH_abs` : total number of CHH motifs found within the full length LTR retrotransposon sequence
- `TE_CHH_rel` : relative number of CHH motifs found within the full length LTR retrotransposon sequence (normalized by element length - `width`)
- `TE_N_abs` : total number of N's found within the full length LTR retrotransposon sequence
- `CG_3ltr_abs` : total number of CG motifs found within only the 3' LTR sequence
- `CG_3ltr_rel` : relative number of CG motifs found within only the 3' LTR sequence (normalized by LTR length)
- `CHG_3ltr_abs` : total number of CHG motifs found within only the 3' LTR sequence
- `CHG_3ltr_rel` : relative number of CHG motifs found within only the 3' LTR sequence (normalized by LTR length)
- `CHH_3ltr_abs` : total number of CHH motifs found within only the 3' LTR sequence
- `CG_5ltr_abs` : total number of CG motifs found within only the 5' LTR sequence
- `CG_5ltr_rel` : relative number of CG motifs found within only the 5' LTR sequence (normalized by LTR length)
- `N_3ltr_abs` : total number of N's found within only the 3' LTR sequence
- `CHG_5ltr_abs` : total number of CHG motifs found within only the 5' LTR sequence
- `CHG_5ltr_rel` : relative number of CHG motifs found within only the 5' LTR sequence (normalized by LTR length)
- `CHH_5ltr_abs` : total number of CHH motifs found within only the 5' LTR sequence
- `CHH_5ltr_rel` : relative number of CHH motifs found within only the 5' LTR sequence (normalized by LTR length)
- `N_5ltr_abs` : total number of N's found within only the 5' LTR sequence
- `cn_3ltr` : number of solo 3' LTRs found in the genome
- `cn_5ltr` : number of solo 5' LTRs found in the genome (minus number of solo 3' LTRs)
## Detailed LTRpred run
The `Quick Start` section aimed at showing you an example `LTRpred` run and the corresponding output files and file formats.
In this section, users will learn about the input data, the available parameters that can be altered and about the diverse set of functions to perform functional annotation of LTR retrotransposons. Please make sure that you have all [prerequisite tools](https://github.com/HajkD/LTRpred/blob/master/vignettes/Installation.Rmd) installed on your system.
As an example, we will run LTRpred on the `Saccharomyces cerevisiae` genome.
__Please be aware that functional de novo annotation is a computationally heavy task that requires substantial computing resources. For larger genomes recommend running LTRpred on a computing server or a high performance computing cluster. Large genome computations can take up to days even with tens of computing cores.__
The following example, however, will terminate within a few minutes.
First we use the [biomartr](https://github.com/ropensci/biomartr) package to download the genome of `Saccharomyces cerevisiae` from NCBI RefSeq:
```r
# retrieve S. cerevisiae genome from from NCBI RefSeq
Scerevisiae_genome <- biomartr::getGenome(db = "refseq",
organism = "Saccharomyces cerevisiae")
# show path to S. cerevisiae genome
Scerevisiae_genome
```
Next, we run `LTRpred` to generate functional annotation for `S. cerevisiae` LTR retrotransposons. In general, the `LTRpred()` function takes the path to a genome file in `fasta` format (compressed files can be used as well) as input to the `genome.file` argument. The `trnas` argument takes a file path to a `fasta` file storing tRNA sequences.
In this case, for example reasons only, we use the tRNA file `sacCer3-tRNAs.fa` that comes with the `LTRpred` package. In addition, users can find tRNA files at http://gtrnadb.ucsc.edu/ or http://gtrnadb2009.ucsc.edu/. The `hmms` argument takes a path to a folder storing the HMM models of the gag and pol protein domains. Protein domains
can be found at http://pfam.xfam.org/ . When renaming the protein domains using the specification `hmm_*` users can specify `hmm_*` which indicates that all files in the folder starting with `hmm_` shall be used. The `LTRpred` package already stores the most prominent gag and pol protein domains and can be used by specifying the path `paste0(system.file("HMMs/", package = "LTRpred"), "hmm_*")`. In detail, the following HMM models are stored in `LTRpred`:
#### HMM Models:
We retrieved the HMM models for protein domain annotation of the region
between de novo predicted LTRs from [Pfam](http://pfam.xfam.org):
- RNA dependent RNA polymerase: [Overview](http://pfam.xfam.org/clan/CL0027)
- [RdRP_1](http://pfam.xfam.org/family/PF00680#tabview=tab6)
- [RdRP_2](http://pfam.xfam.org/family/PF00978#tabview=tab6)
- [RdRP_3](http://pfam.xfam.org/family/PF00998#tabview=tab6)
- [RdRP_4](http://pfam.xfam.org/family/PF02123#tabview=tab6)
- [RVT_1](http://pfam.xfam.org/family/PF00078#tabview=tab6)
- [RVT_2](http://pfam.xfam.org/family/PF07727#tabview=tab6)
- [Integrase DNA binding domain](http://pfam.xfam.org/family/PF00552#tabview=tab6)
- [Integrase Zinc binding domain](http://pfam.xfam.org/family/PF02022#tabview=tab6)
- [Retrotrans_gag](http://pfam.xfam.org/family/PF03732#tabview=tab6)
- [RNase H](http://pfam.xfam.org/family/PF00075#tabview=tab6)
- [Integrase core domain](http://pfam.xfam.org/family/PF00665#tabview=tab6)
- [Several Gypsy/Env proteins from Drosophila (Gypsy)](http://pfam.xfam.org/family/PF07253#tabview=tab6)
- [ENV polyprotein (TLV coat polyprotein)](http://pfam.xfam.org/family/PF00429#tabview=tab6)
The argument `cluster = TRUE` indicates that individually predicted LTR retrotransposons are clustered by sequence similarity into families using the `vsearch` program. The `clust.sim`
defines the sequence similarity threshold for considering family/cluster members.
```r
library(LTRpred)
# de novo LTR transposon prediction of 'D. simulans'
LTRpred(
genome.file = Scerevisiae_genome,
trnas = paste0(system.file("tRNAs/", package = "LTRpred"),"sacCer3-tRNAs.fa"),
hmms = paste0(system.file("HMMs/", package = "LTRpred"), "hmm_*"),
cluster = TRUE,
clust.sim = 0.9,
copy.number.est = TRUE,
cores = 4
)
```
```
Running LTRpred on genome '_ncbi_downloads/genomes/Saccharomyces_cerevisiae_genomic_refseq.fna.gz' with 4 core(s) and searching for retrotransposons using the overlaps option (overlaps = 'no') ...
The output folder 'Saccharomyces_cerevisiae_genomic_refseq_ltrpred' does not seem to exist yet and will be created ...
LTRpred - Step 1:
Run LTRharvest...
LTRharvest: Generating index file Saccharomyces_cerevisiae_genomic_refseq_ltrharvest/Saccharomyces_cerevisiae_genomic_refseq_index.fsa with gt suffixerator...
Running LTRharvest and writing results to Saccharomyces_cerevisiae_genomic_refseq_ltrharvest...
LTRharvest analysis finished!
LTRpred - Step 2:
Generating index file Saccharomyces_cerevisiae_genomic_refseq_ltrdigest/Saccharomyces_cerevisiae_genomic_refseq_index_ltrdigest.fsa with suffixerator...
LTRdigest: Sort index file...
Running LTRdigest and writing results to Saccharomyces_cerevisiae_genomic_refseq_ltrdigest...
LTRdigest analysis finished!
LTRpred - Step 3:
Import LTRdigest Predictions...
Input: Saccharomyces_cerevisiae_genomic_refseq_ltrdigest/Saccharomyces_cerevisiae_genomic_refseq_LTRdigestPrediction.gff -> Row Number: 283
Remove 'NA' -> New Row Number: 283
(1/8) Filtering for repeat regions has been finished.
(2/8) Filtering for LTR retrotransposons has been finished.
(3/8) Filtering for inverted repeats has been finished.
(4/8) Filtering for LTRs has been finished.
(5/8) Filtering for target site duplication has been finished.
(6/8) Filtering for primer binding site has been finished.
(7/8) Filtering for protein match has been finished.
(8/8) Filtering for RR tract has been finished.
LTRpred - Step 4:
Perform ORF Prediction using 'usearch -fastx_findorfs' ...
usearch v8.1.1861_i86osx32, 4.0Gb RAM (17.2Gb total), 8 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
00:00 2.2Mb 100.0% Working
Join ORF Prediction table: nrow(df) = 36 candidates.
unique(ID) = 36 candidates.
unique(orf.id) = 36 candidates.
Perform clustering of similar LTR transposons using 'vsearch --cluster_fast' ...
Running CLUSTpred with 90% as sequence similarity threshold using 4 cores ...
Reading file /Users/h-gd/Desktop/Projekte/R Packages/Packages/LTRpred/Saccharomyces_cerevisiae_genomic_refseq_ltrdigest/Saccharomyces_cerevisiae_genomic_refseq-ltrdigest_complete.fas 100%
248182 nt in 36 seqs, min 5189, max 24168, avg 6894
Sorting by length 100%
Counting unique k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 10 Size min 1, max 19, avg 3.6
Singletons: 7, 19.4% of seqs, 70.0% of clusters
Sorting clusters by abundance 100%
vsearch v1.10.2_osx_x86_64, 16.0GB RAM, 8 cores
https://github.com/torognes/vsearch
CLUSTpred output has been stored in: Saccharomyces_cerevisiae_genomic_refseq_ltrpred
Join Cluster table: nrow(df) = 36 candidates.
unique(ID) = 36 candidates.
unique(orf.id) = 36 candidates.
Join Cluster Copy Number table: nrow(df) = 36 candidates.
unique(ID) = 36 candidates.
unique(orf.id)) = 36 candidates.
LTRpred - Step 5:
Perform methylation context quantification..
Join methylation context (CG, CHG, CHH, CCG) count table: nrow(df) = 36 candidates.
unique(ID) = 36 candidates.
unique(orf.id) = 36 candidates.
Copy files to result folder 'Saccharomyces_cerevisiae_genomic_refseq_ltrpred'.
LTRpred - Step 6:
Starting retrotransposon evolutionary age estimation by comparing the 3' and 5' LTRs using the molecular evolution model 'K80' and the mutation rate '1.3e-07' (please make sure the mutation rate can be assumed for your species of interest!) for 36 predicted elements ...
Please be aware that evolutionary age estimation based on 3' and 5' LTR comparisons are only very rough time estimates and don't take reverse-transcription mediated retrotransposon recombination between family members of retroelements into account! Please consult Sanchez et al., 2017 Nature Communications and Drost & Sanchez, 2019 Genome Biology and Evolution for more details on retrotransposon recombination.
LTRpred - Step 7:
The LTRpred prediction table has been filtered (default) to remove potential false positives. Predicted LTRs must have an PBS or Protein Domain and must fulfill thresholds: sim = 70%; #orfs = 0. Furthermore, TEs having more than 10% of N's in their sequence have also been removed.
Input #TEs: 36
Output #TEs: 31
Perform solo LTR Copy Number Estimation....
Run makeblastdb of the genome assembly...
Building a new DB, current time: 01/24/2020 14:58:55
New DB name: _ncbi_downloads/genomes/Saccharomyces_cerevisiae_genomic_refseq.fna
New DB title: _ncbi_downloads/genomes/Saccharomyces_cerevisiae_genomic_refseq.fna
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 17 sequences in 0.129384 seconds.
Perform BLAST searches of 3' prime LTRs against genome assembly...
Perform BLAST searches of 5' prime LTRs against genome assembly...
Import BLAST results...
Filter hit results...
Estimate CNV for each LTR sequence...
Finished LTR CNV estimation!
TRpred finished all analyses successfully. All output files were stored at 'Saccharomyces_cerevisiae_genomic_refseq_ltrpred'.
[1] "Successful job 1 ."
Warning message:
The LTR copy number estimation returned an empty file. This suggests that there were no solo LTRs found in the input genome sequence.
```
The output can then be imported using:
```r
# import LTRpred output for S. cerevisiae
file <- file.path("Saccharomyces_cerevisiae_genomic_refseq_ltrpred",
"Saccharomyces_cerevisiae_genomic_refseq_LTRpred_DataSheet.tsv")
Scerevisiae_LTRpred <- read.ltrpred(file)
# look at output
dplyr::glimpse(Scerevisiae_LTRpred)
```
```
Observations: 31
Variables: 92
$ species <chr> "Saccharomyces_cerevisiae...
$ ID <chr> "Saccharomyces_cerevisiae...
$ dfam_target_name <chr> NA, NA, NA, NA, NA, NA, N...
$ ltr_similarity <dbl> 100.00, 97.15, 99.40, 96....
$ ltr_age_mya <dbl> 0.00000000, 0.05613384, 0...
$ similarity <chr> "(98,100]", "(96,98]", "(...
$ protein_domain <chr> "rve/RVT_2", "rve/RVT_2",...
$ orfs <int> 2, 2, 2, 2, 2, 3, 2, 2, 2...
$ chromosome <chr> "NC", "NC", "NC", "NC", "...
$ start <int> 160238, 221030, 259578, 2...
$ end <int> 166162, 226960, 265494, 2...
$ strand <chr> "-", "+", "+", "-", "+", ...
$ width <int> 5925, 5931, 5917, 5927, 5...
$ annotation <chr> "LTR_retrotransposon", "L...
$ pred_tool <chr> "LTRpred", "LTRpred", "LT...
$ frame <chr> ".", ".", ".", ".", ".", ...
$ score <chr> ".", ".", ".", ".", ".", ...
$ lLTR_start <int> 160238, 221030, 259578, 2...
$ lLTR_end <int> 160574, 221380, 259909, 2...
$ lLTR_length <int> 337, 351, 332, 350, 380, ...
$ rLTR_start <int> 165826, 226615, 265163, 2...
$ rLTR_end <int> 166162, 226960, 265494, 2...
$ rLTR_length <int> 337, 346, 332, 339, 369, ...
$ lTSD_start <int> 160233, 221026, 259573, 2...
$ lTSD_end <int> 160237, 221029, 259577, 2...
$ lTSD_motif <chr> "gaacc", "aaca", "gtaat",...
$ rTSD_start <int> 166163, 226961, 265495, 2...
$ rTSD_end <int> 166167, 226964, 265499, 2...
$ rTSD_motif <chr> "gaacc", "aaca", "gtaat",...
$ PPT_start <int> NA, NA, NA, NA, NA, NA, N...
$ PPT_end <int> NA, NA, NA, NA, NA, NA, N...
$ PPT_motif <chr> NA, NA, NA, NA, NA, NA, N...
$ PPT_strand <chr> NA, NA, NA, NA, NA, NA, N...
$ PPT_offset <int> NA, NA, NA, NA, NA, NA, N...
$ PBS_start <int> NA, NA, NA, NA, NA, NA, N...
$ PBS_end <int> NA, NA, NA, NA, NA, NA, N...
$ PBS_strand <chr> NA, NA, NA, NA, NA, NA, N...