RLBase-Datasets

This README outlines all the steps required to fully build (or update) the datasets in RLBase. It has only been tested on Ubuntu 20.04/18.04 LTS. Processed data-sets can be accessed via the RLHub R package.

Time required: It took around 3 weeks to fully run this protocol with 100 threads and 250GiB of RAM allocated. Updating the database with new data will also take substantial time and computing resources in proportion to the number and size of FASTQ files which must be aligned.

Setup

Clone the repos

git clone https://github.com/Bishop-Laboratory/RLBase-data.git
git clone https://github.com/Bishop-Laboratory/RLPipes.git
git clone https://github.com/Bishop-Laboratory/RLSeq.git

Create the environment (requires conda)

cd RLBase-data/misc-data/
git clone https://github.com/srhartono/SkewR.git
wget https://github.com/KorfLab/StochHMM/archive/refs/tags/v0.38.tar.gz
tar -xvzf v0.38.tar.gz
cd StochHMM-0.38/
./configure
make
cd ../
RLBASEDIR=$(pwd)
export PATH="$RLBASEDIR/misc-data/StochHMM-0.38/:$PATH" 
conda install -c conda-forge mamba -y
mamba env create -f env.yml --force
conda activate rlbaseData

Install RLPipes and RLSeq

pip install -e ../RLPipes/
R -e "BiocManager::install(c('EnsDb.Hsapiens.v86', 'EnsDb.Mmusculus.v79'))"
R -e "install.packages(c('ggprism', 'tableone'), repos = 'http://cran.us.r-project.org')"
# If prior to bioc release:
# R -e "BiocManager::install(verion='devel')"
R -e "remotes::install_local('../RLHub/', dependencies=TRUE, force=TRUE)"
R -e "remotes::install_local('../RLSeq/', dependencies=TRUE, force=TRUE)"

Generate datasets

The following steps taking to generate and update RLBase datasets. If you have write permissions on the AWS buckets used here, you can complete every step if awscli is configured:

aws configure

If you do not, then you will only be able to mirror and download existing buckets.

Preliminary

These preliminary steps illustrate how the original annotation data and genome info was collated.

Available genome info

This script downloads available genomes from the UCSC Genome Browser FTP. It then uses unique-kmers.py (from the khmer pacakge) to calculate the effective genome size at various read lengths (36bp, 50bp, 75bp, 100bp, 125bp, 150p, 250bp). It then combines these results with the other metadata available for the genome, saving the output in a data.frame stored in misc-data/available_genomes.rda.

conda deactivate
Rscript scripts/makeAvailableGenomes.R
conda activate rlbaseData

RLFS Beds files

R-loop-forming sequences (RLFS) were obtained for each genome of interest using the QmRLFS-finder.py script.

QmRLFS-finder.py was downloaded by the RLBase authors in October 2020 from the public repository here and used with the permission of its creator.

CORES=40  # Number of cores for parallel operations
Rscript scripts/makeRLFSBeds.R 1 FALSE  # Downloads needed genomes
Rscript scripts/makeRLFSBeds.R $CORES TRUE  # Runs QmRLFS-Finder.py in parallel

Generate Annotation Database

This script generates the genomic annotations needed by RLSuite. See the "Annotation Sources" section for further details.

Annotation sources

Annotations relevant to R-loop biology were aggregated from a variety of sources:

UCSC tables (obtained via rtracklayer::ucscTableQuery())
- cpgIslandExt
- centromeres
- encodeCcreCombined
- knownGene
- microsat
- rmsk
- knownAl
- wgRn
- tRNAs
- coriellDelDup
- wgEncodeGencodePolyaV38
- encRegTfbsClustered
UCSC Ensembl Gene GTFs
- hg38 - link
- mm10 - link
G4-Quadruplex Predictions - link
G4-Quadruplex Experiments - link (+); link (-)
Encode
- Histone BED narrowPeak files (Full manifest in misc-data/histone_encode_manifest.csv)
- RBP binding sites (ChIP and eCLiP) contributed by this study
SkewR annotations of human and mouse genomes - generated by running skewr - link
RBP binding site predictions from oRNAment - link
Cohesin binding sites generated by the authors of this study.

# Runs the annotation script
CORES=15  # Set low due to high memory consumption per thread
Rscript scripts/getGenomicFeatures.R $CORES

# Compress resulting files
find misc-data/annotations/ -name "*.csv" -exec gzip -f {} \;

(optional) Upload the results to AWS (Requires admin privileges)

aws s3 cp misc-data/available_genomes.rda s3://rlbase-data/misc/
aws s3 cp misc-data/rlfs/ s3://rlbase-data/rlfs-beds/ --recursive --exclude "*" --include "*.bed" --exclude "*/*"
aws s3 sync misc-data/annotations/ s3://rlbase-data/annotations/

Run RLPipes on all public samples

This part of the protocol requires a catalog of publicly-available R-loop-mapping samples, hand-curated in the manner described in the RLSuite publication.

The following steps are performed to generate (or update) the database:

Prepare catalog of publicly-available samples. The current catalog can serve as a template.
Make pipeline manifests from catalog

CATALOG="rlbase-data/rlbase_catalog.xlsx"
MANIFEST="rlbase-data/rlbase_manifest.csv"
Rscript scripts/makeManifest.R $CATALOG $MANIFEST

Build the pipeline config

RLPIPESOUT="rlbase-data/rlpipes-out/"
RLPipes build $RLPIPESOUT $MANIFEST
cp $RLPIPESOUT/config.tsv $RLPIPESOUT/config.save.tsv

Wrangle the config (fix genomes and add condition types)

Rscript scripts/wrangleConfig.R $CATALOG $RLPIPESOUT/config.tsv

Remove any datasets which have to be re-run

Rscript scripts/cleanOld.R $CATALOG $RLPIPESOUT/config.tsv

Then check the pipeline to ensure it will work correctly:

RLPipes check $RLPIPESOUT --bwamem2 --noreport --tsv

Run the pipeline

CORES=44
RLPipes run $RLPIPESOUT --bwamem2 --noreport --tsv -t $CORES

Archive quant folders

rm -r $RLPIPESOUT/quant_save
cp -r $RLPIPESOUT/quant $RLPIPESOUT/quant_save
cd $RLPIPESOUT/quant_save
find . -type d -maxdepth 1 -mindepth 1 -print0 | parallel -0 tar cfJ {}.tar.xz {}
find . -type d -maxdepth 1 -mindepth 1 -exec rm -rf {} \;
cd $RLBASEDIR

Archive peaks

mkdir $RLPIPESOUT/peaks_save
find $RLPIPESOUT/peaks -type f -maxdepth 1 -mindepth 1 -name "*.broadPeak" -exec cp {} $RLPIPESOUT/peaks_save \;

Upload datasets to aws (requires admin privledges)

cd $RLBASEDIR
aws s3 sync --size-only $RLPIPESOUT/coverage/ s3://rlbase-data/coverage/
aws s3 sync --size-only $RLPIPESOUT/peaks_save/ s3://rlbase-data/peaks/
aws s3 sync --size-only $RLPIPESOUT/quant_save/ s3://rlbase-data/quant/
aws s3 sync --size-only $RLPIPESOUT/bam_stats/ s3://rlbase-data/bam_stats/
aws s3 sync --size-only $RLPIPESOUT/fastq_stats/ s3://rlbase-data/fastq_stats/

Store .bam files (we used BOX for this part)

FTPSITE="ftp.box.com"
REMOTEDIR="RLBase-Archive/"
lftp -c "open -u $(read -p "User: ";echo $REPLY),$(read -sp "Password: ";echo $REPLY) $FTPSITE; mirror -P 2 -n -R $RLPIPESOUT/bam/ $REMOTEDIR"

Calculate RLFS enrichment for each sample

CORES=44
RLPIPESOUT="rlbase-data/rlpipes-out/"
Rscript scripts/rlfsAnalyze.R $RLPIPESOUT/peaks $RLPIPESOUT/rlfs_rda $CORES

Upload to AWS

aws s3 sync --size-only $RLPIPESOUT/rlfs_rda/ s3://rlbase-data/rlfs_rda/

Downstream data processing

If you didn't do the previous steps, then download the data needed for this part:

RLPIPESOUT="rlbase-data/rlpipes-out/"
aws s3 sync s3://rlbase-data/peaks/ $RLPIPESOUT/peaks/ 
aws s3 sync s3://rlbase-data/rlfs_rda/ $RLPIPESOUT/rlfs_rda/ 
aws s3 sync s3://rlbase-data/quant/ $RLPIPESOUT/quant/ 
cd $RLPIPESOUT/quant/ && find . -name "*.tar.xz" -exec tar -xJf {} \; && cd $RLBASEDIR
aws s3 sync s3://rlbase-data/coverage/ $RLPIPESOUT/coverage/ 
aws s3 sync s3://rlbase-data/fastq_stats/ $RLPIPESOUT/fastq_stats/ 
aws s3 sync s3://rlbase-data/bam_stats/ $RLPIPESOUT/bam_stats/

Build discriminator model

Run the interactive model-building app (requires AWS privileges)

CONFIG=$RLPIPESOUT/config.tsv
HOST="0.0.0.0"
PORT=4848
Rscript scripts/models.R $CONFIG $HOST $PORT

Classify all samples

Rscript scripts/classifySamples.R

Get R-loop consensus

Select final peakset

Rscript scripts/prepareConsensus.R

Run the R-loop consensus pipeline

CORES=44
MANIFEST="rlbase_samples.tsv"
BLACKLIST="https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz"
snakemake --snakefile scripts/rlregions.smk -d rlbase-data/ --cores $CORES --dryrun --config manifest=$MANIFEST blacklist=$BLACKLIST # Verify
snakemake --snakefile scripts/rlregions.smk -d rlbase-data/ --cores $CORES --config manifest=$MANIFEST blacklist=$BLACKLIST --dag | dot -Tpng > rlbase-data/rlregions/dag.png  # DAG
snakemake --snakefile scripts/rlregions.smk -d rlbase-data/ --config manifest=$MANIFEST blacklist=$BLACKLIST --cores $CORES # Run it

Update processed datasets

Annotate peaks (genomic features and genes)

CORES=44
PEAKS=$RLPIPESOUT/peaks
OUTANNO="rlbase-data/misc/annotatedPeaks.tsv"
Rscript scripts/annotatePeaks.R $PEAKS $OUTANNO $CORES

Build the new gene_expression table and calculate correlations.

# Creates misc/gene_expression.rda
MANIFEST="rlbase-data/rlbase_samples.tsv"
GENE_EXP_TABLE="rlbase-data/misc/gene_expression.rda"
QUANTDIR="rlbase-data/rlpipes-out/quant/"
Rscript scripts/buildExpression.R $QUANTDIR $GENE_EXP_TABLE $MANIFEST

Get the genomic feature -> rlregion mapping

# Creates misc/rlregion_features.rda
Rscript scripts/rlregionsToFeatures.R

Build the new correlation matrix.

Rscript scripts/gsgCorr.R

Get RLRegion count matrix

CORES=44
Rscript scripts/rlregionCountMat.R $CORES 1

Build/Update the RLHub

Rscript scripts/prepRLHub.R
find misc-data/rlhub/ -name "*.rda" -exec aws s3 cp {} s3://rlbase-data/RLHub/ \;

Rebuild all RLRanges and HTML notebooks

CORES=32
## Clear previous runs
# rm -r "misc-data/reports/"
# rm -r "misc-data/rlranges/"
Rscript scripts/runRLSeq.R $CORES

Upload results to AWS

aws s3 sync misc-data/reports s3://rlbase-data/reports
aws s3 sync misc-data/rlranges s3://rlbase-data/rlranges

Update the RLHub Genome Browser TrackHub

Rscript scripts/buildGenomeBrowserHub.R
aws s3 sync misc-data/RLBase_TrackHub/ s3://rlbase-data/misc/RLBase_TrackHub/

Upload the consensus peaksets

aws s3 cp rlbase-data/rlregions/rlregions_dRNH.narrowPeak s3://rlbase-data/misc/rlregions_dRNH.narrowPeak
aws s3 cp rlbase-data/rlregions/rlregions_S96.narrowPeak s3://rlbase-data/misc/rlregions_S96.narrowPeak

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
misc-data		misc-data
rlbase-data		rlbase-data
scripts		scripts
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
RLBase-data.Rproj		RLBase-data.Rproj
env.yml		env.yml
rlbase-data.smk		rlbase-data.smk

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RLBase-Datasets

Setup

Generate datasets

Preliminary

Run RLPipes on all public samples

Downstream data processing

Build discriminator model

Get R-loop consensus

Update processed datasets

About

Releases

Packages

Languages

License

Bishop-Laboratory/RLBase-data

Folders and files

Latest commit

History

Repository files navigation

RLBase-Datasets

Setup

Generate datasets

Preliminary

Run RLPipes on all public samples

Downstream data processing

Build discriminator model

Get R-loop consensus

Update processed datasets

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages