Add manual version of pipeline

ParkerLab · Dec 4, 2019 · d5c9d45 · d5c9d45
1 parent 89d97e0
commit d5c9d45
Show file tree

Hide file tree

Showing 2 changed files with 158 additions and 41 deletions.
diff --git a/README.md b/README.md
@@ -1,17 +1,22 @@
 # Parker Lab BMO
 
 - [Overview](#overview)
-- [Required input](#required-input)
-- [Algorithm overview](#algorithm-overview)
+  * [Algorithm overview](#algorithm-overview)
 - [Installation](#installation)
+  * [Create and activate BMO conda environment (optional)](#create-and-activate-bmo-conda-environment--optional-)
+    + [Using the provided conda environment](#using-the-provided-conda-environment)
+    + [Or manually creating your own environment](#or-manually-creating-your-own-environment)
+  * [Installing dependencies manually](#installing-dependencies-manually)
 - [Running BMO](#running-bmo)
+  * [Required input](#required-input)
   * [Using Snakemake](#using-snakemake)
-    + [1) Create and activate BMO conda environment (optional)](#1--create-and-activate-bmo-conda-environment--optional-)
-    + [2) Setting up the Snakemake configuration file](#2--setting-up-the-snakemake-configuration-file)
-    + [3) Running BMO pipeline in Snakemake](#3--running-bmo-pipeline-in-snakemake)
-    + [4) Running the example data](#4--running-the-example-data)
+    + [Setting up the Snakemake configuration file](#setting-up-the-snakemake-configuration-file)
+    + [Running BMO pipeline](#running-bmo-pipeline)
   * [Manually (standalone version)](#manually--standalone-version-)
   * [Run time and memory requirements](#run-time-and-memory-requirements)
+- [Running the example data](#running-the-example-data)
+  * [With Snakemake](#with-snakemake)
+  * [Manually](#manually)
 - [Interpreting BMO results](#interpreting-bmo-results)
   * [Processed BED files](#processed-bed-files)
   * [Full output](#full-output)
@@ -25,39 +30,21 @@ BMO (pronounced *beemo*) is an algorithm to predict TF binding from ATAC-seq dat
 
 **Platforms tested**: Linux (Debian 9), macOS (10.13)
 
-
-# Required input
-1. ATAC-seq experiment processed <sup>1</sup> and indexed BAM file
-2. ATAC-seq peak calls <sup>2</sup>
-3. Motif BED6 <sup>3</sup> files. One file per PWM scan <sup>4</sup>
-
-<sup>1</sup> We recommend high quality (MAPQ ≥ 30), autosomal, duplicate removed (we use [Picard](https://broadinstitute.github.io/picard/)), properly paired, and uniquely mapped reads only (SAM [flags](https://broadinstitute.github.io/picard/explain-flags.html) `-f 3 -F 4 -F 8 -F 256 -F 1024 -F 2048`). If you don't have an ATAC-seq pipeline set up yet, you should consider using the [Parker Lab ATAC-seq Snakemake pipeline](https://github.com/ParkerLab/ATACseq-Snakemake).
-
-<sup>2</sup> We call them with [MACS2](https://github.com/taoliu/MACS), using flags `--broad --nomodel --shift -100 --extsize 200 --keep-dup all` and then intersect out [blacklisted regions](https://sites.google.com/site/anshulkundaje/projects/blacklists).
-
-<sup>3</sup> Mandatory. Pipeline will crash otherwise.
-
-<sup>4</sup> We use [FIMO](http://meme-suite.org/doc/fimo.html).
-
-
-# Algorithm overview
+## Algorithm overview
 ![Workflow overview](docs/diagram.png)
 
-
 # Installation
-From your machine run the command below wherever you want BMO's main directory to be (*e.g.* `~/software`).
+Installation time: \~5 minutes. Will take longer (up to 1h) if you create the conda environment.
+
+From your machine, run the command below wherever you want BMO's main directory to be (*e.g.* `~/software`).
 ```sh
 git clone https://github.com/ParkerLab/BMO.git
 ```
 
-# Running BMO
-## Using Snakemake
-We strongly recommend you run BMO using [Snakemake](https://snakemake.readthedocs.io), an extremely powerful tool for running reproducible pipelines. 
-
-### 1) Create and activate BMO conda environment (optional)
+## Create and activate BMO conda environment (optional)
 **Requirements**: Having a working [Anaconda](https://docs.anaconda.com/anaconda/install/) or [Miniconda](https://conda.io/docs/user-guide/install/index.html) installation in your system.
 
-#### Using the provided conda environment
+### Using the provided conda environment
 If you do not have Snakemake installed, don't have administrative privileges in your machine/cluster, or don't want to overwrite anything in your main environment, we provide a separate [conda environment](https://conda.io/docs/user-guide/tasks/manage-environments.html) with everything you need to get started (it's actually just a barebones python 3 with Snakemake, pysam, and R installed). To create the environment, run the command below from wherever you cloned BMO:
 ```sh
 conda env create -f conda/environment.yml
@@ -74,7 +61,7 @@ Rscript conda/install_R_dependencies.R
 ```
 You are now ready to run BMO.
 
-#### Or manually creating your own environment
+### Or manually creating your own environment
 Some users reported an [issue](https://github.com/ContinuumIO/anaconda-issues/issues/9480) with conda where the environment creation from a file fails with a `ResolvePackageNotFound` error. If you encountered this or any error trying to use the provided environment file, you can manually create BMO environment by running:
 ```
 # Create and activate environment
@@ -93,7 +80,32 @@ conda install r-essentials
 Rscript conda/install_R_dependencies.R
 ```
 
-### 2) Setting up the Snakemake configuration file
+## Installing dependencies manually
+Here's the core software you need to have installed in your machine:
+* [bedtools](https://bedtools.readthedocs.io/en/latest/)
+* [htslib](http://www.htslib.org)
+* [pysam](https://pysam.readthedocs.io/en/latest/)
+* [R](https://www.r-project.org) (required libraries: MASS, fitdistrplus, and metap)
+
+# Running BMO
+
+## Required input
+1. ATAC-seq experiment processed <sup>1</sup> and indexed BAM file
+2. ATAC-seq peak calls <sup>2</sup>
+3. Motif BED6 <sup>3</sup> files. One file per PWM scan <sup>4</sup>
+
+<sup>1</sup> We recommend high quality (MAPQ ≥ 30), autosomal, duplicate removed (we use [Picard](https://broadinstitute.github.io/picard/)), properly paired, and uniquely mapped reads only (SAM [flags](https://broadinstitute.github.io/picard/explain-flags.html) `-f 3 -F 4 -F 8 -F 256 -F 1024 -F 2048`). If you don't have an ATAC-seq pipeline set up yet, you should consider using the [Parker Lab ATAC-seq Snakemake pipeline](https://github.com/ParkerLab/ATACseq-Snakemake).
+
+<sup>2</sup> We call them with [MACS2](https://github.com/taoliu/MACS), using flags `--broad --nomodel --shift -100 --extsize 200 --keep-dup all` and then intersect out [blacklisted regions](https://sites.google.com/site/anshulkundaje/projects/blacklists).
+
+<sup>3</sup> Mandatory. Pipeline will crash otherwise.
+
+<sup>4</sup> We use [FIMO](http://meme-suite.org/doc/fimo.html).
+
+## Using Snakemake
+We strongly recommend you run BMO using [Snakemake](https://snakemake.readthedocs.io), an extremely powerful tool for running reproducible pipelines. 
+
+### Setting up the Snakemake configuration file
 Our Snakemake pipeline makes use of a configuration file, which includes information of paths and input files. The template is located in `config/config.yaml`. Let's go through each of the fields:
 * **`bmo_dir`**: this is the folder where you cloned this repository.
 * **`motif_dir`**: path to your motif BED6 files.
@@ -107,7 +119,7 @@ Our Snakemake pipeline makes use of a configuration file, which includes informa
 
 IMPORTANT: Because the config is a [yaml](http://yaml.org) file, **make sure to maintain all the indentations as they appear** or unexpected behaviors may occur.
 
-### 3) Running BMO pipeline in Snakemake
+### Running BMO pipeline
 To execute the pipeline, you need to copy both the `Snakefile` and `input/config.yaml` to wherever you want to run the analysis in your machine/cluster and then update the config file as described in the previous section. After that, just run the code below and Snakemake will take care of the rest. Simple as that.
 
 ```
@@ -117,21 +129,41 @@ snakemake [-j {threads}] [--resources io_limit={concurrency}] --configfile confi
 * **`{threads}`** is an integer value for the total number of cores Snakemake is allowed to use. if running on a HPC cluster, then set to 999 or some arbitrarily high value.
 * **`{concurrency}`** is also an integer, and determines the number of maximum I/O intensive jobs to run simultaneously (we recommend starting with 1-3 and keeping an eye on the `%iowait` column of `sar` to see how much your machine can handle). Alternatively, the high I/O jobs can be pushed to RAM or to an SSD partition by changing `use_shm: True` in the config file (see details in the previous section). If using the latter option, then also add `shm_limit={shm_concurrency}` to the `--resources` call above, where `{shm_concurrency}` is an integer for the maximum number of concurring jobs when using the RAM/SSD partition. If either `io_limit` or `shm_limit` are not set, then all jobs will be submitted with no regards to maximum concurrency (which should not be an issue if running in a cluster).
 
-### 4) Running the example data
-We included some example data at `examples`. It consists of a heavily downsampled chromosome 1 of one of the Buenrostro *et al* 2013<sup>[1]</sup> original GM12878 ATAC-seq samples and chromosomes 1 of our CTCF_known2 and ELF1_known1 motif scans. The example also includes a config file, which we will use to instruct Snakemake. To run the example, execute the code below.
+## Manually (standalone version)
+Edit the variables in the `# Config` section of `src/bmo_manual.sh` similarly to the `config.yaml` file and run:
 ```sh
-source activate BMO_env #optional
-snakemake -j 4 --configfile examples/example_config.yaml
+bash scripts/bmo_manual.sh > pipeline.sh
 ```
 
-[1] doi: 10.1038/nmeth.2688
-
-## Manually (standalone version)
-*Coming soon*
+This will create the `work` directory structure and a file with the list of commands named `pipeline.sh`. If you wish to use a job scheduler, then change the  `# drmr:` lines to whatever resource call parameters your scheduler uses. The example here uses [drmr](https://github.com/ParkerLab/drmr):
+```sh
+drmr pipeline.sh
+```
 
 ## Run time and memory requirements
 The expected run time will vary depending on the size of your input files. As an example, a relatively large motif file (\~400K instances) running on a very large BAM file (\~80M reads) takes less than 5 minutes and 500Mb of RAM using 4 cores. The same motif file takes less than 2 minutes and the same amount of RAM to run on a smaller BAM file (\~35M reads) using the same number of cores.
 
+# Running the example data
+We included some example data at `examples`. It consists of a heavily downsampled chromosome 1 of one of the Buenrostro *et al* 2013<sup>[1]</sup> original GM12878 ATAC-seq samples and chromosomes 1 of our CTCF_known2 and ELF1_known1 motif scans. The example code will generate the output files at `work/bmo/example_buenrostro_rep1/bound`.
+
+[1] doi: 10.1038/nmeth.2688
+
+## With Snakemake
+The example also includes a config file, which we will use to instruct Snakemake. To run the example, execute the code below.
+```sh
+# source activate BMO_env  # if using conda environment
+snakemake -j 4 --configfile examples/example_config.yaml
+```
+
+## Manually
+The `# Config` preamble of `src/bmo_manual.sh` is already set to run the example data.
+```sh
+# Generate and execute pipeline file
+bash src/bmo_manual.sh > pipeline.sh
+bash pipeline.sh
+```
+
+
 # Interpreting BMO results
 ## Processed BED files
 Once the pipeline finishes running, you can find results in the `work/bmo/{sample}/bound` folder. The BED6 files there will correspond to the **predicted bound motif instances** for each motif.

diff --git a/src/bmo_manual.sh b/src/bmo_manual.sh
@@ -0,0 +1,85 @@
+#!/bin/bash
+
+# Config
+bmo_dir="."
+sample_name="example_buenrostro_rep1"
+bam="examples/${sample_name}.subsampled.bam"
+peaks="examples/example_peaks.bed.gz"
+motif_list="examples/motifs.txt"
+motif_dir="examples"
+motif_ext=".bed.gz"
+cores="4"
+
+
+# Run
+raw_signals_dir="work/raw_signals/${sample_name}/all_regions"
+outside_peaks_dir="work/raw_signals/${sample_name}/outside_peaks"
+cooccur_dir="work/raw_signals/${sample_name}/co_occurring"
+nb_dir="work/negative_binomials/${sample_name}"
+bmo_outdir="work/bmo/${sample_name}"
+
+mkdir -p ${raw_signals_dir} ${outside_peaks_dir} ${cooccur_dir} \
+    ${nb_dir} ${bmo_outdir} 
+
+
+# Pipeline
+
+echo "# measure_raw_signal"
+echo "# drmr:job processors=4 processor_memory=1gb time_limit=15:00"
+while read motif
+do
+    motif_file="${motif_dir}/${motif}${motif_ext}"
+    raw_signals="${raw_signals_dir}/${motif}.bed"
+
+    echo "${bmo_dir}/src/measureRawSignal.py -p ${cores} -b ${bam} \
+        -m ${motif_file} > ${raw_signals}"
+done < ${motif_list}
+
+
+echo -e "\n\n# motifs_outside_peaks"
+echo "# drmr:job processors=1 processor_memory=1gb time_limit=15:00"
+while read motif
+do
+    raw_signals="${raw_signals_dir}/${motif}.bed"
+    outside_peaks="${outside_peaks_dir}/${motif}.bed"
+
+    echo "intersectBed -v -a ${raw_signals} -b ${peaks} > ${outside_peaks}"
+done < ${motif_list}
+
+
+echo -e "\n\n# count_overlapping_motifs"
+echo "# drmr:job processors=1 processor_memory=1gb time_limit=15:00"
+while read motif
+do
+    raw_signals="${raw_signals_dir}/${motif}.bed"
+    cooccur="${cooccur_dir}/${motif}.bed"
+
+    echo "${bmo_dir}/src/count_co-occuring_motifs.sh ${raw_signals} 100 \
+        ${bmo_dir} > ${cooccur}"
+done < ${motif_list}
+
+
+echo -e "\n\n# fit_nbinoms"
+echo "# drmr:job processors=1 processor_memory=5gb time_limit=30:00"
+while read motif
+do
+    nb_handle="${nb_dir}/${motif}"
+    nb_out="${nb_handle}.bed.gz"
+    echo "Rscript ${bmo_dir}/src/rawSigNBModel.R -f ${motif}.bed \
+            --dir1 ${raw_signals_dir}/ --dir2 ${outside_peaks_dir}/ \
+            -o ${nb_handle} -c 7 --writeBed"
+done < ${motif_list}
+
+
+echo -e "\n\n# BMO"
+echo "# drmr:job processors=1 processor_memory=5gb time_limit=15:00"
+while read motif
+do
+    cooccur="${cooccur_dir}/${motif}.bed"
+    nb_handle="${nb_dir}/${motif}"
+    nb_out="${nb_handle}.bed.gz"
+
+    echo "Rscript ${bmo_dir}/src/bmo.R --f1 ${nb_out} --f2 ${cooccur} \
+        -o ${bmo_outdir} -n ${motif} -p 1"
+done < ${motif_list}
+