This README
outlines all the steps required to fully build (or update) the datasets in RLBase.
It has only been tested on Ubuntu 20.04/18.04 LTS
.
Processed data-sets can be accessed via the RLHub
R package.
Time required: It took around 3 weeks to fully run this protocol with 100 threads and 250GiB of RAM allocated. Updating the database with new data will also take substantial time and computing resources in proportion to the number and size of FASTQ files which must be aligned.
- Clone the repos
git clone https://github.com/Bishop-Laboratory/RLBase-data.git
git clone https://github.com/Bishop-Laboratory/RLPipes.git
git clone https://github.com/Bishop-Laboratory/RLSeq.git
- Create the environment (requires conda)
cd RLBase-data/misc-data/
git clone https://github.com/srhartono/SkewR.git
wget https://github.com/KorfLab/StochHMM/archive/refs/tags/v0.38.tar.gz
tar -xvzf v0.38.tar.gz
cd StochHMM-0.38/
./configure
make
cd ../
RLBASEDIR=$(pwd)
export PATH="$RLBASEDIR/misc-data/StochHMM-0.38/:$PATH"
conda install -c conda-forge mamba -y
mamba env create -f env.yml --force
conda activate rlbaseData
- Install
RLPipes
andRLSeq
pip install -e ../RLPipes/
R -e "BiocManager::install(c('EnsDb.Hsapiens.v86', 'EnsDb.Mmusculus.v79'))"
R -e "install.packages(c('ggprism', 'tableone'), repos = 'http://cran.us.r-project.org')"
# If prior to bioc release:
# R -e "BiocManager::install(verion='devel')"
R -e "remotes::install_local('../RLHub/', dependencies=TRUE, force=TRUE)"
R -e "remotes::install_local('../RLSeq/', dependencies=TRUE, force=TRUE)"
The following steps taking to generate and update RLBase datasets.
If you have write permissions on the AWS buckets used here, you can complete
every step if awscli
is configured:
aws configure
If you do not, then you will only be able to mirror and download existing buckets.
These preliminary steps illustrate how the original annotation data and genome info was collated.
- Available genome info
This script downloads available genomes from the UCSC Genome Browser FTP.
It then uses unique-kmers.py
(from the khmer
pacakge) to calculate the
effective genome size at various read lengths
(36bp, 50bp, 75bp, 100bp, 125bp, 150p, 250bp). It then combines these results
with the other metadata available for the genome, saving the output in a
data.frame
stored in misc-data/available_genomes.rda
.
conda deactivate
Rscript scripts/makeAvailableGenomes.R
conda activate rlbaseData
- RLFS Beds files
R-loop-forming sequences (RLFS) were obtained for each genome of interest using
the QmRLFS-finder.py
script.
QmRLFS-finder.py
was downloaded by the RLBase authors in October 2020 from the public repository here and used with the permission of its creator.
CORES=40 # Number of cores for parallel operations
Rscript scripts/makeRLFSBeds.R 1 FALSE # Downloads needed genomes
Rscript scripts/makeRLFSBeds.R $CORES TRUE # Runs QmRLFS-Finder.py in parallel
- Generate Annotation Database
This script generates the genomic annotations needed by RLSuite. See the "Annotation Sources" section for further details.
Annotation sources
Annotations relevant to R-loop biology were aggregated from a variety of sources:
- UCSC tables (obtained via
rtracklayer::ucscTableQuery()
)- cpgIslandExt
- centromeres
- encodeCcreCombined
- knownGene
- microsat
- rmsk
- knownAl
- wgRn
- tRNAs
- coriellDelDup
- wgEncodeGencodePolyaV38
- encRegTfbsClustered
- UCSC Ensembl Gene GTFs
- G4-Quadruplex Predictions - link
- G4-Quadruplex Experiments - link (+); link (-)
- Encode
- Histone BED narrowPeak files (Full manifest in
misc-data/histone_encode_manifest.csv
) - RBP binding sites (ChIP and eCLiP) contributed by this study
- Histone BED narrowPeak files (Full manifest in
- SkewR annotations of human and mouse genomes - generated by running
skewr
- link - RBP binding site predictions from oRNAment - link
- Cohesin binding sites generated by the authors of this study.
# Runs the annotation script
CORES=15 # Set low due to high memory consumption per thread
Rscript scripts/getGenomicFeatures.R $CORES
# Compress resulting files
find misc-data/annotations/ -name "*.csv" -exec gzip -f {} \;
- (optional) Upload the results to AWS (Requires admin privileges)
aws s3 cp misc-data/available_genomes.rda s3://rlbase-data/misc/
aws s3 cp misc-data/rlfs/ s3://rlbase-data/rlfs-beds/ --recursive --exclude "*" --include "*.bed" --exclude "*/*"
aws s3 sync misc-data/annotations/ s3://rlbase-data/annotations/
This part of the protocol requires a catalog of publicly-available R-loop-mapping samples, hand-curated in the manner described in the RLSuite publication.
The following steps are performed to generate (or update) the database:
-
Prepare catalog of publicly-available samples. The current catalog can serve as a template.
-
Make pipeline manifests from catalog
CATALOG="rlbase-data/rlbase_catalog.xlsx"
MANIFEST="rlbase-data/rlbase_manifest.csv"
Rscript scripts/makeManifest.R $CATALOG $MANIFEST
- Build the pipeline config
RLPIPESOUT="rlbase-data/rlpipes-out/"
RLPipes build $RLPIPESOUT $MANIFEST
cp $RLPIPESOUT/config.tsv $RLPIPESOUT/config.save.tsv
- Wrangle the config (fix genomes and add condition types)
Rscript scripts/wrangleConfig.R $CATALOG $RLPIPESOUT/config.tsv
- Remove any datasets which have to be re-run
Rscript scripts/cleanOld.R $CATALOG $RLPIPESOUT/config.tsv
- Then check the pipeline to ensure it will work correctly:
RLPipes check $RLPIPESOUT --bwamem2 --noreport --tsv
- Run the pipeline
CORES=44
RLPipes run $RLPIPESOUT --bwamem2 --noreport --tsv -t $CORES
- Archive quant folders
rm -r $RLPIPESOUT/quant_save
cp -r $RLPIPESOUT/quant $RLPIPESOUT/quant_save
cd $RLPIPESOUT/quant_save
find . -type d -maxdepth 1 -mindepth 1 -print0 | parallel -0 tar cfJ {}.tar.xz {}
find . -type d -maxdepth 1 -mindepth 1 -exec rm -rf {} \;
cd $RLBASEDIR
- Archive peaks
mkdir $RLPIPESOUT/peaks_save
find $RLPIPESOUT/peaks -type f -maxdepth 1 -mindepth 1 -name "*.broadPeak" -exec cp {} $RLPIPESOUT/peaks_save \;
- Upload datasets to aws (requires admin privledges)
cd $RLBASEDIR
aws s3 sync --size-only $RLPIPESOUT/coverage/ s3://rlbase-data/coverage/
aws s3 sync --size-only $RLPIPESOUT/peaks_save/ s3://rlbase-data/peaks/
aws s3 sync --size-only $RLPIPESOUT/quant_save/ s3://rlbase-data/quant/
aws s3 sync --size-only $RLPIPESOUT/bam_stats/ s3://rlbase-data/bam_stats/
aws s3 sync --size-only $RLPIPESOUT/fastq_stats/ s3://rlbase-data/fastq_stats/
- Store .bam files (we used BOX for this part)
FTPSITE="ftp.box.com"
REMOTEDIR="RLBase-Archive/"
lftp -c "open -u $(read -p "User: ";echo $REPLY),$(read -sp "Password: ";echo $REPLY) $FTPSITE; mirror -P 2 -n -R $RLPIPESOUT/bam/ $REMOTEDIR"
- Calculate RLFS enrichment for each sample
CORES=44
RLPIPESOUT="rlbase-data/rlpipes-out/"
Rscript scripts/rlfsAnalyze.R $RLPIPESOUT/peaks $RLPIPESOUT/rlfs_rda $CORES
- Upload to AWS
aws s3 sync --size-only $RLPIPESOUT/rlfs_rda/ s3://rlbase-data/rlfs_rda/
- If you didn't do the previous steps, then download the data needed for this part:
RLPIPESOUT="rlbase-data/rlpipes-out/"
aws s3 sync s3://rlbase-data/peaks/ $RLPIPESOUT/peaks/
aws s3 sync s3://rlbase-data/rlfs_rda/ $RLPIPESOUT/rlfs_rda/
aws s3 sync s3://rlbase-data/quant/ $RLPIPESOUT/quant/
cd $RLPIPESOUT/quant/ && find . -name "*.tar.xz" -exec tar -xJf {} \; && cd $RLBASEDIR
aws s3 sync s3://rlbase-data/coverage/ $RLPIPESOUT/coverage/
aws s3 sync s3://rlbase-data/fastq_stats/ $RLPIPESOUT/fastq_stats/
aws s3 sync s3://rlbase-data/bam_stats/ $RLPIPESOUT/bam_stats/
- Run the interactive model-building app (requires AWS privileges)
CONFIG=$RLPIPESOUT/config.tsv
HOST="0.0.0.0"
PORT=4848
Rscript scripts/models.R $CONFIG $HOST $PORT
- Classify all samples
Rscript scripts/classifySamples.R
- Select final peakset
Rscript scripts/prepareConsensus.R
- Run the R-loop consensus pipeline
CORES=44
MANIFEST="rlbase_samples.tsv"
BLACKLIST="https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz"
snakemake --snakefile scripts/rlregions.smk -d rlbase-data/ --cores $CORES --dryrun --config manifest=$MANIFEST blacklist=$BLACKLIST # Verify
snakemake --snakefile scripts/rlregions.smk -d rlbase-data/ --cores $CORES --config manifest=$MANIFEST blacklist=$BLACKLIST --dag | dot -Tpng > rlbase-data/rlregions/dag.png # DAG
snakemake --snakefile scripts/rlregions.smk -d rlbase-data/ --config manifest=$MANIFEST blacklist=$BLACKLIST --cores $CORES # Run it
- Annotate peaks (genomic features and genes)
CORES=44
PEAKS=$RLPIPESOUT/peaks
OUTANNO="rlbase-data/misc/annotatedPeaks.tsv"
Rscript scripts/annotatePeaks.R $PEAKS $OUTANNO $CORES
- Build the new
gene_expression
table and calculate correlations.
# Creates misc/gene_expression.rda
MANIFEST="rlbase-data/rlbase_samples.tsv"
GENE_EXP_TABLE="rlbase-data/misc/gene_expression.rda"
QUANTDIR="rlbase-data/rlpipes-out/quant/"
Rscript scripts/buildExpression.R $QUANTDIR $GENE_EXP_TABLE $MANIFEST
- Get the genomic feature -> rlregion mapping
# Creates misc/rlregion_features.rda
Rscript scripts/rlregionsToFeatures.R
- Build the new correlation matrix.
Rscript scripts/gsgCorr.R
- Get RLRegion count matrix
CORES=44
Rscript scripts/rlregionCountMat.R $CORES 1
- Build/Update the RLHub
Rscript scripts/prepRLHub.R
find misc-data/rlhub/ -name "*.rda" -exec aws s3 cp {} s3://rlbase-data/RLHub/ \;
- Rebuild all RLRanges and HTML notebooks
CORES=32
## Clear previous runs
# rm -r "misc-data/reports/"
# rm -r "misc-data/rlranges/"
Rscript scripts/runRLSeq.R $CORES
- Upload results to AWS
aws s3 sync misc-data/reports s3://rlbase-data/reports
aws s3 sync misc-data/rlranges s3://rlbase-data/rlranges
- Update the RLHub Genome Browser TrackHub
Rscript scripts/buildGenomeBrowserHub.R
aws s3 sync misc-data/RLBase_TrackHub/ s3://rlbase-data/misc/RLBase_TrackHub/
- Upload the consensus peaksets
aws s3 cp rlbase-data/rlregions/rlregions_dRNH.narrowPeak s3://rlbase-data/misc/rlregions_dRNH.narrowPeak
aws s3 cp rlbase-data/rlregions/rlregions_S96.narrowPeak s3://rlbase-data/misc/rlregions_S96.narrowPeak