Matthias Benoit, Hajk-Georg Drost, Marco Catoni, Quentin Gouil, Sara Lopez-Gomollon, David Charles Baulcombe, Jerzy Paszkowski. Environmental and epigenetic regulation of Rider retrotransposons in tomato. PloS Genetics (2019) (in press).
-
1. De novo Annotation of RIDER retrotransposons in the plant kingdom
-
2. Determine the distribution of RIDER-like retrotransposons in the plant kingdom
The following sections enable to computationally reproduce all LTR retrotransposon annotation and analytics steps. Please make sure that you follow all steps and install all necessary software tools to be able to reproduce the data generated for the paper.
For the computational reproducibility of de novo annotations of LTR retrotransposons we implemented the R package LTRpred.
LTRpred
calls the command line tools suffixerator
, LTRharvest
(Ellinghaus et al., 2008), and LTRdigest
(Steinbiss et al., 2009), which are part of the GenomeTools library (Gremme et al., 2013) to screen for repeated LTRs, specific sequence motifs such as primer binding sites (PBS), polypurine tract motifs (PPT), and target site duplications (TSD) and for conserved protein domains such as reverse transcriptase (gag), integrase DNA binding domain, integrase Zinc binding domain, RNase H, and the integrase core domain. Subsequently, LTRpred
implements customized parser functions to import LTRdigest
output and structures the data in a tidy data format
(Wickham, 2014) which subsequently enables automation of false positive curation. In a second step, open reading frame (ORF) prediction is performed by a customized wrapper function that runs the command line tool usearch
(Edgar, 2010). This automated step allows to automatically filter out RTEs that might have conserved protein domains such as an integrase or a reverse transcriptase, but fail to have ORFs and thus are not expressed. In a third step, RTE family clustering is performed using the command line tool vsearch
(Rognes et al., 2016) which defines family members by >90% sequence homology of the full element to each other. In a fourth step, an automated hmmer search
(Finn et al., 2011) against the Dfam
database (Hubley et al., 2016) is performed to assign super-family associations such as Copia or Gypsy by comparing the protein domains of de novo predicted RTEs with already annotated RTEs in the Dfam
database. For each step, LTRpred
implements customized parser functions to import usearch
, vsearch
, and Dfam
output and transforms this output in tidy data format
for subsequent automated false positive curation. In a fifth step, for each predicted element the count and proportion (count divided by element length) of CHH, CHG, CG, and NNN motifs are quantified for the entire element, the 3’ LTR and 5’ LTR separately. In a sixth step, automated false positive curation is performed by the LTRpred
function quality.filter()
to conservatively reduce false positive predictions.
Please install the following R packages before running the reproducible scripts:
install.packages("dplyr")
install.packages("ggplot2")
install.packages("readr")
install.packages("readxl")
source("http://bioconductor.org/biocLite.R")
biocLite('biomartr')
biocLite("devtools")
biocLite("HajkD/LTRpred")
Please also make sure that you follow the INSTALLATION instructions
of the LTRpred
package to install all command line tools that LTRpred
depends on. Otherwise, LTRpred
will not generate proper annotation files.
Please also make sure that you follow the INSTALLATION instructions
of the LTRpred
package to install all command line tools that LTRpred
depends on.
We retrieved tRNA sequences in *.fasta
format from the following databases:
We combined tRNA sequences from both databases to have a comprehensive collection of tRNA sequences specific for each kingdom of life.
We retrieved the HMM models for protein domain annotation of the region between de novo predicted LTRs from Pfam:
- RNA dependent RNA polymerase: Overview
Please make sure that you follow the INSTALLATION instructions
of the LTRpred
package to install all command line tools that LTRpred
depends on. Otherwise, LTRpred
will not generate proper annotation files.
The following code can be run on a computer with n
cores. Please be aware that computation times might correspond to days due to the genome sizes of the respective species.
For further details about LTRpred
please consult the LTRpred: Introduction Vignette.
We assume that users will store the genome assembly files in fasta
format
of the species Asterids:
Capsicum annuum, Capsicum baccatum MLFT02_5, Capsicum chinense MCIT02_5, Coffea canephora, Petunia axillaris, Phytophthora inflata, Solanum arcanum, Solanum habrochaites, Solanum lycopersicum, Solanum melongena, Solanum pennellii, Solanum pimpinellifolium, Solanum tuberosum; Rosids:
Arabidopsis thaliana, Vitis vinifera, and Cucumis melo; monocots:
Oryza sativa in a folder named genomes/
which will then be passed
as string to the argument genome.folder
in the LTRpred.meta()
function.
All annotation files generated with LTRpred
are then stored in the new folder
LTRpred_results
as is specified in the argument output.folder
in the function LTRpred.meta()
.
library(LTRpred)
# generate LTRpred annotations for all genomes stored in folder genomes/
LTRpred::LTRpred.meta(
genome.folder = "genomes",
output.folder = "LTRpred_results",
cluster = FALSE,
cores = 32,
copy.number.est = FALSE,
minlenltr = 100,
maxlenltr = 5000,
mindistltr = 4000,
maxdistltr = 30000,
mintsd = 3,
maxtsd = 20,
vic = 80,
overlaps = "no",
xdrop = 7,
motifmis = 1,
pbsradius = 60,
pbsalilen = c(8,40),
pbsoffset = c(0,10),
quality.filter = TRUE,
n.orfs = 0
)
# import all LTRpred annotations for all individual species
# and combine them to a large meta-annotation file
meta_tbl <- LTRpred::meta.summarize(
result.folder = "LTRpred_results/",
ltr.similarity = 80,
quality.filter = TRUE,
n.orfs = 0)
# store the large meta-annotation file locally as `*.tsv` file
readr::write_tsv(meta_tbl, "LTRpred_Rider_MetaTable.tsv")
Users interested in the exact algorithms and procedures running inside of the LTRpred.meta()
function
can access the LTRpred
open-source code here.
Alternatively, it is also possible to write a for
-loop around the function
LTRpred()
to run LTRpred annotations individually for all species.
RIDER
clustering denotes the analytical step in which LTRpred
annotated
retrotransposon sequences are clustered across species using an optimal global alignment approach (full dynamic programming Needleman-Wunsch)
(performed using the tool VSEARCH).
The cluster file generated with VSEARCH
can be found at solanocea.uc.
# import the LTRpred_Rider_MetaTable.tsv file generated in the previous annotation step
LTRpred_Rider_MetaTable <- LTRpred::read.ltrpred("LTRpred_Rider_MetaTable.tsv")
LTRpred_Rider_MetaTable <- dplyr::select(LTRpred_Rider_MetaTable, -Clust_Cluster, -Clust_Target, -Clust_Perc_Ident)
# import cluster result from VSEARCH
cluster <- read.uc("solanocea.uc")
# filter for clusters
cluster_filter <- dplyr::filter(cluster, Type == "H")
cluster_filter <-
dplyr::select(cluster_filter,
Cluster,
Query,
Target,
Perc_Ident)
names(cluster_filter) <-
paste0("Clust_", names(cluster_filter))
names(cluster_filter)[2] <- "orf.id"
# join clusters with annotation file
LTRpred_Rider_MetaTable_joined_85 <- dplyr::left_join(LTRpred_Rider_MetaTable, cluster_filter, by = "orf.id")
# select only retrotransposons that have at least 85% sequence homology
# between their 5' and 3' LTRs
LTRpred_Rider_MetaTable_joined_85 <- LTRpred::quality.filter(LTRpred_Rider_MetaTable_joined_85, 0.85, 1, "stringent")
# select only closely related species
meta_tbl_sarcanum_shabrochaites_spimpinellifolium <-
dplyr::filter(
LTRpred_Rider_MetaTable_joined_85,
stringr::str_detect(species, "arcanum") |
stringr::str_detect(species, "habrochaites") |
stringr::str_detect(species, "Spimpinellifolium") |
stringr::str_detect(species, "Spennellii$")
)
# select only S pennellii specific RIDER TEs
Spennellii_Rider <- dplyr::filter(LTRpred_Rider_MetaTable_joined_85,stringr::str_detect(species, "Spennellii$"))
# select only S pimpinellifolium specific RIDER TEs
Spimpinellifolium_Rider <- dplyr::filter(LTRpred_Rider_MetaTable_joined_85,stringr::str_detect(species, "Spimpinellifolium"))
# store as *.tsv file
LTRpred::pred2tsv(
Spennellii_Rider,
"meta_tbl_spennellii.tsv"
)
# store as *.tsv file
LTRpred::pred2fasta(
Spennellii_Rider,
"../solanocea-complete.fas",
"meta_tbl_spennellii.fasta"
)
# store as *.tsv file
LTRpred::pred2tsv(
Spimpinellifolium_Rider,
"meta_tbl_spimpinellifolium.tsv"
)
# store as *.fasta file
LTRpred::pred2fasta(
Spimpinellifolium_Rider,
"../solanocea-complete.fas",
"meta_tbl_spimpinellifolium.fasta"
)
For determining the distribution of RIDER-like retrotransposons in the plant kingdom we use the developer versions of the R packages biomartr and metablastr which need to be installed before running the following commands.
The biomartr
package enables large-scale retrieval of all plant genomes
from NCBI RefSeq
.
# retrieve all >110 plant genomes stored at NCBI RefSeq
install.packages("magrittr")
library(magrittr)
biomartr::meta.retrieval(kingdom = "plant",
db = "refseq",
type = "genome") %>%
biomartr::clean.retrieval()
Next, we assume that all plant genomes are stored in the folder plant
.
Please make sure that you manually remove the documentation
folder
in the folder plant
before running the next command. In addition,
you will need to store the following reference sequences (rider_annotation_85_sequence_file, Rider_LTR_5.fasta) in the same folder
from which you run the following metablastr
commands.
# retrieve all genome names
subj_genomes_all_plants <- file.path("plant",
list.files("plant"))
# run BLAST analysis to determine RIDER-like element distribution across the
# plant kingdom
Rider_kingdom_blast <- metablastr::blast_genomes(
"rider_annotation_85_sequence_file.fasta", # reference RIDER (query) sequence
subj_genomes_all_plants,
task = "blastn",
blast_output_path = "rider_genome_blast",
cores = 26,
evalue = 1E-5,
max.target.seqs = 5000
)
# store BLAST results as `*.tsv` file
readr::write_tsv(Rider_kingdom_blast, "Rider_kingdom_blast_results_all.tsv")
# retrieve all genome names
subj_genomes_all_plants <- file.path("plant", list.files("plant"))
# run BLAST analysis to determine RIDER LTR distribution across the
# plant kingdom
Rider_LTR <- metablastr::blast_genomes(
"Rider_LTR_5.fasta", # reference RIDER LTR sequence (= query)
subj_genomes_all_plants,
task = "blastn",
blast_output_path = "rider_solo_ltr_blast_result",
cores = 28
)
# store BLAST results as `*.tsv` file
readr::write_tsv(Rider_LTR, "Rider_LTR_5_blast_results_all.tsv")
# import and filter for valid Rider LTR sequences
Rider_LTR <- readr::read_tsv("Rider_LTR_5_blast_results_all.tsv")
Rider_LTR <- dplyr::mutate(Rider_LTR, scope = 1 - (abs(q_len - alig_length) / q_len))
Rider_LTR <- dplyr::filter(Rider_LTR, perc_identity >= 50)
Please install the R package tidygenomics to be able to perform the following analyses:
Rider_kingdom_blast <-
readr::read_tsv("Rider_kingdom_blast_results_all.tsv", col_names = TRUE)
Rider_kingdom_blast_filtered_between_50_85_similarity <-
dplyr::filter(Rider_kingdom_blast,
perc_identity >= 50,
perc_identity < 85,
qcovhsp >= 50)
subj_genomes_all_plants <-
file.path("rider_blast_genomes/Plants",
list.files("rider_blast_genomes/Plants/"))
metablastr::extract_hit_seqs_from_genomes(
Rider_kingdom_blast_filtered_between_50_85_similarity,
subject_genomes = subj_genomes_all_plants,
file_name = "Rider_BLAST_Hits_Sequences_between_50_and_85_homology.fa"
)
rider_blast_Slycopersicum_rider_like <-
metablastr::blast_nucleotide_to_nucleotide(
"Rider_BLAST_Hits_Sequences_between_50_and_85_homology.fa",
"rider_blast_genomes/Plants/Slycopersicum.fa",
evalue = 1E-5,
max.target.seqs = 1,
task = "blastn",
cores = 28
)
readr::write_tsv(
rider_blast_Slycopersicum_rider_like,
"rider_blast_Slycopersicum_rider_like.tsv"
)
Slycopersicum_rider_like <-
metablastr::read_blast(file = "rider_blast_Slycopersicum_rider_like_perc_identity_85_qcovhsp_85.csv", out.format = "csv")
Rider_coordinates_all <- readxl::read_xls("Rider_coordinates_all.xls")
Slyco_ltrpred <-
LTRpred::read.ltrpred("Slycopersicum_v3_ltrpred/Slycopersicum_v3_LTRpred_DataSheet.tsv")
Slyco_x_plus <-
dplyr::select(dplyr::filter(Slyco_ltrpred, strand == "+"),
ID,
chromosome,
start,
end)
Slyco_y_plus <-
dplyr::select(
dplyr::filter(Slycopersicum_rider_like, s_end - s_start + 1 > 1),
query_id,
subject_id ,
s_start,
s_end
)
names(Slyco_y_plus) <- c("ID", "chromosome", "start", "end")
Slyco_x_minus <-
dplyr::select(dplyr::filter(Slyco_ltrpred, strand == "-"),
ID,
chromosome,
start,
end)
names(Slyco_x_minus) <- c("ID", "chromosome", "start", "end")
Slyco_y_minus <-
dplyr::select(
dplyr::filter(Slycopersicum_rider_like, s_start - s_end + 1 > 1),
query_id,
subject_id ,
s_start,
s_end
)
names(Slyco_y_minus) <- c("ID", "chromosome", "start", "end")
Slyco_y_minus_new <- Slyco_y_minus
Slyco_y_minus_new <-
dplyr::mutate(Slyco_y_minus_new, start = Slyco_y_minus$end)
Slyco_y_minus_new <-
dplyr::mutate(Slyco_y_minus_new, end = Slyco_y_minus$start)
Slyco_z_plus <-
dplyr::select(dplyr::filter(Rider_coordinates_all, strand == "+"),
ID,
chromosome,
start,
end)
names(Slyco_z_plus) <- c("ID", "chromosome", "start", "end")
Slyco_z_minus <-
dplyr::select(dplyr::filter(Rider_coordinates_all, strand == "-"),
ID,
chromosome,
start,
end)
names(Slyco_z_minus) <- c("ID", "chromosome", "start", "end")
Slyco_z_minus_new <- Slyco_z_minus
Slyco_z_minus_new <-
dplyr::mutate(Slyco_z_minus_new, start = Slyco_z_minus$end)
Slyco_z_minus_new <-
dplyr::mutate(Slyco_z_minus_new, end = Slyco_z_minus$start)
Slyco_rider_overlap_plus <- tidygenomics::genome_intersect(Slyco_x_plus, Slyco_y_plus, by=c("chromosome", "start", "end"), mode="both")
Slyco_rider_overlap_minus <- tidygenomics::genome_intersect(Slyco_x_minus_new, Slyco_y_minus_new, by=c("chromosome", "start", "end"), mode="both")
Slyco_rider_overlap_plus_rider_coordinates <- tidygenomics::genome_intersect(Slyco_z_plus, Slyco_y_plus, by=c("chromosome", "start", "end"), mode="both")
Slyco_rider_overlap_minus_rider_coordinates <- tidygenomics::genome_intersect(Slyco_z_minus, Slyco_y_minus_new, by=c("chromosome", "start", "end"), mode="both")
# Annotating short LTR sequences from Rider family
### Short Rider LTR
Rider_LTR_short <- dplyr::filter(
Rider_LTR,
dplyr::between(alig_length, 150, 300),
perc_identity >= 50,
scope >= 0.5,
species %in% c(
"Slycopersicum",
"Spimpinellifolium",
"Sarcanum_LA2157",
"Spennellii",
"Shabrochaites_LYC4",
"Stuberosum",
"Cannuum",
"Athaliana"
)
)
# ectract the sequences of short RIDER LTR sequences
metablastr::extract_hit_seqs_from_genomes(
blast_tbl = Rider_LTR_short,
subject_genomes = subj_genomes_all_plants,
file_name = "Rider_LTR_short.fa" ,
separated_by_genome = FALSE
)
### Annotating long LTR sequences from Rider family
Rider_LTR_long <- dplyr::filter(
Rider_LTR,
dplyr::between(alig_length, 350, 450),
perc_identity >= 50,
scope >= 0.5,
species %in% c(
"Slycopersicum",
"Spimpinellifolium",
"Sarcanum_LA2157",
"Spennellii",
"Shabrochaites_LYC4",
"Stuberosum",
"Cannuum",
"Athaliana"
)
)
# ectract the sequences of long RIDER LTR sequences
metablastr::extract_hit_seqs_from_genomes(
blast_tbl = Rider_LTR_long,
subject_genomes = subj_genomes_all_plants,
file_name = "Rider_LTR_long.fa" ,
separated_by_genome = FALSE
)
install.packages("dplyr")
install.packages("gridExtra")
library(dplyr)
# add species to plot that didn't generate BLAST hits fulfilling
# the filter criteria
missing_species <- tibble::tibble(query_id = c("None", "None", "None"),
subject_id = c("None", "None", "None"),
perc_identity = c(0,0,0),
num_ident_matches = c(0,0,0),
alig_length = c(50,50,50),
mismatches = c(0,0,0),
gap_openings = c(0,0,0),
n_gaps = c(0,0,0),
pos_match = c(0,0,0),
ppos = c(0,0,0),
q_start = c(20,20,20),
q_end = c(100,100,100),
q_len = c(100,100,100),
qcov = c(0,0,0),
qcovhsp = c(0,0,0),
s_start = c(0,0,0),
s_end = c(0,0,0),
s_len = c(0,0,0),
evalue = c(0,0,0),
bit_score = c(0,0,0),
score_raw = c(0,0,0),
scope = c(0.1,0.1,0.1),
species = c("Stuberosum", "Cannuum", "Athaliana"))
# plot BLAST hit distribution of only RIDER LTR sequences
p1_rider_ltr <- Rider_LTR %>% dplyr::filter(
species %in% c(
"Slycopersicum",
"Spimpinellifolium",
"Sarcanum_LA2157",
"Shabrochaites_LYC4",
"Spennellii",
"Stuberosum",
"Cannuum",
"Athaliana"
)
) %>% dplyr::bind_rows(missing_species) %>% gg_blast_hits(
type = "scope",
scope_cutoff = 0.5,
xlab = "Sequence homology of BLAST hits to initial RIDER copy in %",
ylab = "Density over Number of BLAST hits",
levels = rev(
c(
"Slycopersicum",
"Spimpinellifolium",
"Sarcanum_LA2157",
"Shabrochaites_LYC4",
"Spennellii",
"Stuberosum",
"Cannuum",
"Athaliana"
)
),
xticks = 10
) + ggplot2::scale_fill_manual(values = ggsci::pal_lancet("lanonc")(6)[c(2,3,4,5,6)]) + ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n = 8))
# plot BLAST alignment length distribution of only RIDER LTR sequences
p2_rider_ltr <- Rider_LTR %>% dplyr::filter(
species %in% c(
"Slycopersicum",
"Spimpinellifolium",
"Sarcanum_LA2157",
"Shabrochaites_LYC4",
"Spennellii",
"Stuberosum",
"Cannuum",
"Athaliana"
)
) %>% dplyr::bind_rows(missing_species) %>% gg_blast_hits(
type = "alig_length",
scope_cutoff = 0.5,
xlab = "Alignent length of BLAST hits with initial RIDER LTR copy in bp",
ylab = "Density over Number of BLAST hits",
levels = rev(
c(
"Slycopersicum",
"Spimpinellifolium",
"Sarcanum_LA2157",
"Shabrochaites_LYC4",
"Spennellii",
"Stuberosum",
"Cannuum",
"Athaliana"
)
),
xticks = 10
) + ggplot2::scale_fill_manual(values = ggsci::pal_lancet("lanonc")(6)[c(2,3,4,5,6)]) + ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) + ggplot2::theme(axis.text.y = ggplot2::element_blank()) + ggplot2::ggtitle("")
p_rider_ltr <- gridExtra::grid.arrange(
p1_rider_ltr,
p2_rider_ltr,
nrow = 1
)
# store plot as pdf file
cowplot::save_plot(
"Rider_5'LTR_BLAST_hits_selected_species_50perc.pdf",
p_rider_ltr,
base_height = 10,
base_width = 18
)
# import BLAST hits of RIDER elements
rider_blast_df_all <- readr::read_tsv("Rider_kingdom_blast_results_all.tsv")
rider_blast_df_all <- dplyr::mutate(rider_blast_df_all, scope = 1 - (abs(q_len - alig_length) / q_len))
# select only hits with at least 50% perc_identity
rider_blast_df_all_over_50 <- dplyr::filter(rider_blast_df_all, scope >= 0.5, perc_identity >= 50)
# prepare selected species
rider_blast_df_selected_Figure <-
dplyr::filter(
rider_blast_df_all_over_50,
species %in% c(
"Slycopersicum",
"Spimpinellifolium",
"Sarcanum_LA2157",
"Shabrochaites_LYC4",
"Spennellii",
"Stuberosum",
"Cannuum",
"Athaliana"
)
)
rider_blast_df_selected_Figure
missing_species_2 <- tibble::tibble(query_id = c("None", "None", "None"),
subject_id = c("None", "None", "None"),
perc_identity = c(0,0,0),
num_ident_matches = c(0,0,0),
alig_length = c(50,50,50),
mismatches = c(0,0,0),
gap_openings = c(0,0,0),
n_gaps = c(0,0,0),
pos_match = c(0,0,0),
ppos = c(0,0,0),
q_start = c(20,20,20),
q_end = c(100,100,100),
q_len = c(100,100,100),
qcov = c(0,0,0),
qcovhsp = c(0,0,0),
s_start = c(0,0,0),
s_end = c(0,0,0),
s_len = c(0,0,0),
evalue = c(0,0,0),
bit_score = c(0,0,0),
score_raw = c(0,0,0),
scope = c(0.1,0.1,0.1),
species = c("Stuberosum", "Cannuum", "Spimpinellifolium"))
p1 <- rider_blast_df_selected_Figure %>%
dplyr::bind_rows(missing_species_2) %>%
gg_blast_hits(
type = "scope",
scope_cutoff = 0.5,
xlab = "Sequence homology of BLAST hits to initial RIDER copy in %",
ylab = "Density over Number of BLAST hits",
levels = rev(c(
"Slycopersicum",
"Spimpinellifolium",
"Sarcanum_LA2157",
"Shabrochaites_LYC4",
"Spennellii",
"Stuberosum",
"Cannuum",
"Athaliana"
)),
) + ggplot2::scale_fill_manual(values = ggsci::pal_lancet("lanonc")(6)[c(1,2,3,4,6)]) + ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n = 8))
p2 <- rider_blast_df_selected_Figure %>%
dplyr::bind_rows(missing_species_2) %>%
gg_blast_hits(
type = "alig_length",
scope_cutoff = 0.5,
xlab = "Alignent length of BLAST hits with initial RIDER copy in bp",
ylab = "Density over Number of BLAST hits",
levels = rev(c(
"Slycopersicum",
"Spimpinellifolium",
"Sarcanum_LA2157",
"Shabrochaites_LYC4",
"Spennellii",
"Stuberosum",
"Cannuum",
"Athaliana"
))
) + ggplot2::scale_fill_manual(values = ggsci::pal_lancet("lanonc")(6)[c(1,2,3,4,6)]) + ggplot2::theme(axis.text.y = ggplot2::element_blank()) + ggplot2::ggtitle("") + ggplot2::scale_x_continuous(breaks = scales::pretty_breaks(n = 8))
p_all <- gridExtra::grid.arrange(
p1,
p2,
nrow = 1
)
# save plot
cowplot::save_plot(
"Rider_full_TE_BLAST_hits_selected_species_50perc.pdf",
p_all,
base_height = 10,
base_width = 18
)
# extract 1000 randomly sampled loci of length 5000 (each) from S lycopersicum
Slycopersicum_RandomSeqs_1000 <- metablastr::extract_random_seqs_from_genome(
size = 1000,
interval_width = 5000,
subject_genome = "Slycopersicum.fa",
file_name = "Slycopersicum_RandomSeqs_1000.fa"
)
ABA_motifs <- c("ACGCGC",
"ACGCGG",
"ACGCGT",
"CCGCGC",
"CCGCGG",
"CCGCGT",
"GCGCGC",
"GCGCGG",
"GCGCGT",
"ACCGAC",
"GCCGAC",
"ACCGAC",
"ATCGAC",
"GCCGAC",
"GTCGAC",
"CAGTTA",
"CAGTTG",
"CTGTTA",
"CTGTTG",
"CGGTTA",
"CGGTTG",
"CCGTTA",
"CCGTTG",
"CAACGG",
"CAACTG",
"TAACGG",
"TAACTG",
"CCGAC",
"ACGT",
"ACGTG",
"GAAAAA")
ABA_motifs <- unique(ABA_motifs)
Sly_ABA_motif_count <- metablastr::motif_compare("Slycopersicum_RiderCoordinates_All.fa",
Slycopersicum_RandomSeqs_1000,
motifs = ABA_motifs)
Sly_ABA_motif_enrichment <- metablastr::motif_enrichment("Slycopersicum_RiderCoordinates_All.fa",
Slycopersicum_RandomSeqs_1000,
motifs = ABA_motifs)
Sly_ABA_motif_enrichment <-
dplyr::mutate(
Sly_ABA_motif_enrichment,
status = ifelse(fisher_pval <= 0.01, "significant", "not significant")
)
readr::write_excel_csv(Sly_ABA_motif_enrichment, "Sly_ABA_motif_enrichment.csv")
readr::write_excel_csv(Sly_ABA_motif_count, "Sly_ABA_motif_counts.csv")
# Alternative motifs
Sly_alternative_motif_count <- metablastr::motif_compare("Slycopersicum_RiderCoordinates_All.fa",
Slycopersicum_RandomSeqs_1000,
motifs = c("CACGTA", "CGCGTT"))
Sly_alternative_motif_enrichment <- metablastr::motif_enrichment("Slycopersicum_RiderCoordinates_All.fa",
Slycopersicum_RandomSeqs_1000,
motifs = c("CACGTA", "CGCGTT"))
Sly_alternative_motif_enrichment <-
dplyr::mutate(
Sly_alternative_motif_enrichment,
status = ifelse(fisher_pval <= 0.01, "significant", "not significant")
)
readr::write_excel_csv(Sly_alternative_motif_enrichment, "Sly_alternative_motif_enrichment.csv")
readr::write_excel_csv(Sly_alternative_motif_count, "Sly_alternative_motif_counts.csv")
### S pennellii ABA motif enrichment
meta_tbl_spennellii.fasta
# negative control
Sly_negative_control_motif_count <- metablastr::motif_compare("Slycopersicum_RiderCoordinates_All.fa",
Slycopersicum_RandomSeqs_1000,
motifs = "TGTCGG")
Sly_negative_control_motif_enrichment <- metablastr::motif_enrichment("Slycopersicum_RiderCoordinates_All.fa",
Slycopersicum_RandomSeqs_1000,
motifs = "TGTCGG")
Sly_negative_control_motif_enrichment <-
dplyr::mutate(
Sly_negative_control_motif_enrichment,
status = ifelse(fisher_pval <= 0.01, "significant", "not significant")
)
readr::write_excel_csv(Sly_negative_control_motif_enrichment, "Sly_negative_control_motif_enrichment.csv")
readr::write_excel_csv(Sly_negative_control_motif_count, "Sly_negative_control_motif_count.csv")
Rider_LTR_short_motif_compare_multi <- motif_compare_multi(
blast_tbl = Rider_LTR_short,
subject_genomes = Rider_LTR_short_subject_genomes,
size = 1000,
interval_width = 300,
motifs = c(ABA_motifs, "CACGTA", "CGCGTT")
)
Rider_LTR_short_motif_enrichment_multi <- motif_enrichment_multi(
blast_tbl = Rider_LTR_short,
subject_genomes = Rider_LTR_short_subject_genomes,
size = 1000,
interval_width = 300,
motifs = c(ABA_motifs, "CACGTA", "CGCGTT")
)
Rider_LTR_long_motif_compare_multi <- motif_compare_multi(
blast_tbl = Rider_LTR_long,
subject_genomes = Rider_LTR_long_subject_genomes,
size = 1000,
interval_width = 440,
motifs = c(ABA_motifs, "CACGTA", "CGCGTT")
)
Rider_LTR_long_motif_enrichment_multi <- motif_enrichment_multi(
blast_tbl = Rider_LTR_long,
subject_genomes = Rider_LTR_long_subject_genomes,
size = 1000,
interval_width = 300,
motifs = c(ABA_motifs, "CACGTA", "CGCGTT")
# Alternative approach: compare motif enrichment between
# RIDER sequences and all gene promotors 400bp upstream of S. lycopersicum
biomartr::getCollection("ensemblgenomes", "Solanum lycopersicum")
# Extracting 400 bp promotors
extract_upstream_promotor_seqs(
organism = "Solanum lycopersicum",
annotation_file = "_db_downloads/collections/ensemblgenomes/Solanum_lycopersicum/Solanum_lycopersicum.SL3.0.42_ensemblgenomes.gtf",
annotation_format = "gtf",
genome_file = "_db_downloads/collections/ensemblgenomes/Solanum_lycopersicum/Solanum_lycopersicum.SL3.0.dna.toplevel.fa",
promotor_width = 400,
replaceUnstranded = TRUE
)
ABA_motifs <- c("ACGCGC",
"ACGCGG",
"ACGCGT",
"CCGCGC",
"CCGCGG",
"CCGCGT",
"GCGCGC",
"GCGCGG",
"GCGCGT",
"ACCGAC",
"GCCGAC",
"ACCGAC",
"ATCGAC",
"GCCGAC",
"GTCGAC",
"CAGTTA",
"CAGTTG",
"CTGTTA",
"CTGTTG",
"CGGTTA",
"CGGTTG",
"CCGTTA",
"CCGTTG",
"CAACGG",
"CAACTG",
"TAACGG",
"TAACTG",
"CCGAC",
"ACGT",
"ACGTG",
"GAAAAA")
ABA_motifs <- unique(ABA_motifs)
Sly_ABA_motif_count_400 <- metablastr::motif_compare("Slycopersicum_RiderCoordinates_All.fa",
"Solanum_lycopersicum_all_genes_promotor_seqs_400.fa",
motifs = ABA_motifs)
Sly_ABA_motif_enrichment_400 <- metablastr::motif_enrichment("Slycopersicum_RiderCoordinates_All.fa",
"Solanum_lycopersicum_all_genes_promotor_seqs_400.fa",
motifs = ABA_motifs)
Sly_ABA_motif_enrichment_400 <-
dplyr::mutate(
Sly_ABA_motif_enrichment_400,
status = ifelse(fisher_pval <= 0.01, "significant", "not significant")
)
readr::write_excel_csv(Sly_ABA_motif_enrichment_400, "Sly_ABA_motif_enrichment_promotors_all_genes_400bp.csv")
readr::write_excel_csv(Sly_ABA_motif_count_400, "Sly_ABA_motif_counts_promotors_all_genes_400bp.csv")
# Alternative motifs
Sly_alternative_motif_count_400 <- metablastr::motif_compare("Slycopersicum_RiderCoordinates_All.fa",
"Solanum_lycopersicum_all_genes_promotor_seqs_400.fa",
motifs = c("CACGTA", "CGCGTT"))
Sly_alternative_motif_enrichment_400 <- metablastr::motif_enrichment("Slycopersicum_RiderCoordinates_All.fa",
"Solanum_lycopersicum_all_genes_promotor_seqs_400.fa",
motifs = c("CACGTA", "CGCGTT"))
Sly_alternative_motif_enrichment_400 <-
dplyr::mutate(
Sly_alternative_motif_enrichment_400,
status = ifelse(fisher_pval <= 0.01, "significant", "not significant")
)
readr::write_excel_csv(Sly_alternative_motif_enrichment_400, "Sly_alternative_motif_enrichment_all_genes_promotor_seqs_400.csv")
readr::write_excel_csv(Sly_alternative_motif_count_400, "Sly_alternative_motif_counts_all_genes_promotor_seqs_400.csv")
# negative control
Sly_negative_control_motif_count_promotor_400 <- metablastr::motif_compare("Slycopersicum_RiderCoordinates_All.fa",
"Solanum_lycopersicum_all_genes_promotor_seqs_400.fa",
motifs = "TGTCGG")
Sly_negative_control_motif_enrichment_promotor_400 <- metablastr::motif_enrichment("Slycopersicum_RiderCoordinates_All.fa",
"Solanum_lycopersicum_all_genes_promotor_seqs_400.fa",
motifs = "TGTCGG")
Sly_negative_control_motif_enrichment_promotor_400 <-
dplyr::mutate(
Sly_negative_control_motif_enrichment_promotor_400,
status = ifelse(fisher_pval <= 0.01, "significant", "not significant")
)
readr::write_excel_csv(Sly_negative_control_motif_enrichment_promotor_400, "Sly_negative_control_motif_enrichment_promotor_400.csv")
readr::write_excel_csv(Sly_negative_control_motif_count_promotor_400, "Sly_negative_control_motif_count_promotor_400.csv")
Define N50 computation function.
N50 <- function(len) {
# sort scaffold or chromosome lenghts in descending order
len.sorted <- rev(sort(len))
# compute N50 over all scaffold or chromosome lenghts in Mbp
N50 <- len.sorted[cumsum(len.sorted) >= sum(len.sorted)*0.5][1] / 1000000
return(N50)
}
Slycopersicum <- biomartr::read_genome("Slycopersicum.fa")
Slycopersicum_N50 <- N50(Slycopersicum@ranges@width)
Sarcanum <- biomartr::read_genome("Sarcanum_LA2157.fasta.gz")
Sarcanum_N50 <- N50(Sarcanum@ranges@width)
Spennellii <- biomartr::read_genome("Spennellii.fa")
Spennellii_N50 <- N50(Spennellii@ranges@width)
Shabrochaites <- biomartr::read_genome("Shabrochaites_LYC4.fasta.gz")
Shabrochaites_N50 <- N50(Shabrochaites@ranges@width)
Stuberosum <- biomartr::read_genome("Stuberosum.fa")
Stuberosum_N50 <- N50(Stuberosum@ranges@width)
Spimpinellifolium <- biomartr::read_genome("Spimpinellifolium.fa")
Spimpinellifolium_N50 <- N50(Spimpinellifolium@ranges@width)