Skip to content

Commit

Permalink
Add manual version of pipeline
Browse files Browse the repository at this point in the history
  • Loading branch information
rdalbanus committed Dec 4, 2019
1 parent 89d97e0 commit d5c9d45
Show file tree
Hide file tree
Showing 2 changed files with 158 additions and 41 deletions.
114 changes: 73 additions & 41 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,17 +1,22 @@
# Parker Lab BMO

- [Overview](#overview)
- [Required input](#required-input)
- [Algorithm overview](#algorithm-overview)
* [Algorithm overview](#algorithm-overview)
- [Installation](#installation)
* [Create and activate BMO conda environment (optional)](#create-and-activate-bmo-conda-environment--optional-)
+ [Using the provided conda environment](#using-the-provided-conda-environment)
+ [Or manually creating your own environment](#or-manually-creating-your-own-environment)
* [Installing dependencies manually](#installing-dependencies-manually)
- [Running BMO](#running-bmo)
* [Required input](#required-input)
* [Using Snakemake](#using-snakemake)
+ [1) Create and activate BMO conda environment (optional)](#1--create-and-activate-bmo-conda-environment--optional-)
+ [2) Setting up the Snakemake configuration file](#2--setting-up-the-snakemake-configuration-file)
+ [3) Running BMO pipeline in Snakemake](#3--running-bmo-pipeline-in-snakemake)
+ [4) Running the example data](#4--running-the-example-data)
+ [Setting up the Snakemake configuration file](#setting-up-the-snakemake-configuration-file)
+ [Running BMO pipeline](#running-bmo-pipeline)
* [Manually (standalone version)](#manually--standalone-version-)
* [Run time and memory requirements](#run-time-and-memory-requirements)
- [Running the example data](#running-the-example-data)
* [With Snakemake](#with-snakemake)
* [Manually](#manually)
- [Interpreting BMO results](#interpreting-bmo-results)
* [Processed BED files](#processed-bed-files)
* [Full output](#full-output)
Expand All @@ -25,39 +30,21 @@ BMO (pronounced *beemo*) is an algorithm to predict TF binding from ATAC-seq dat

**Platforms tested**: Linux (Debian 9), macOS (10.13)


# Required input
1. ATAC-seq experiment processed <sup>1</sup> and indexed BAM file
2. ATAC-seq peak calls <sup>2</sup>
3. Motif BED6 <sup>3</sup> files. One file per PWM scan <sup>4</sup>

<sup>1</sup> We recommend high quality (MAPQ ≥ 30), autosomal, duplicate removed (we use [Picard](https://broadinstitute.github.io/picard/)), properly paired, and uniquely mapped reads only (SAM [flags](https://broadinstitute.github.io/picard/explain-flags.html) `-f 3 -F 4 -F 8 -F 256 -F 1024 -F 2048`). If you don't have an ATAC-seq pipeline set up yet, you should consider using the [Parker Lab ATAC-seq Snakemake pipeline](https://github.com/ParkerLab/ATACseq-Snakemake).

<sup>2</sup> We call them with [MACS2](https://github.com/taoliu/MACS), using flags `--broad --nomodel --shift -100 --extsize 200 --keep-dup all` and then intersect out [blacklisted regions](https://sites.google.com/site/anshulkundaje/projects/blacklists).

<sup>3</sup> Mandatory. Pipeline will crash otherwise.

<sup>4</sup> We use [FIMO](http://meme-suite.org/doc/fimo.html).


# Algorithm overview
## Algorithm overview
![Workflow overview](docs/diagram.png)


# Installation
From your machine run the command below wherever you want BMO's main directory to be (*e.g.* `~/software`).
Installation time: \~5 minutes. Will take longer (up to 1h) if you create the conda environment.

From your machine, run the command below wherever you want BMO's main directory to be (*e.g.* `~/software`).
```sh
git clone https://github.com/ParkerLab/BMO.git
```

# Running BMO
## Using Snakemake
We strongly recommend you run BMO using [Snakemake](https://snakemake.readthedocs.io), an extremely powerful tool for running reproducible pipelines.

### 1) Create and activate BMO conda environment (optional)
## Create and activate BMO conda environment (optional)
**Requirements**: Having a working [Anaconda](https://docs.anaconda.com/anaconda/install/) or [Miniconda](https://conda.io/docs/user-guide/install/index.html) installation in your system.

#### Using the provided conda environment
### Using the provided conda environment
If you do not have Snakemake installed, don't have administrative privileges in your machine/cluster, or don't want to overwrite anything in your main environment, we provide a separate [conda environment](https://conda.io/docs/user-guide/tasks/manage-environments.html) with everything you need to get started (it's actually just a barebones python 3 with Snakemake, pysam, and R installed). To create the environment, run the command below from wherever you cloned BMO:
```sh
conda env create -f conda/environment.yml
Expand All @@ -74,7 +61,7 @@ Rscript conda/install_R_dependencies.R
```
You are now ready to run BMO.

#### Or manually creating your own environment
### Or manually creating your own environment
Some users reported an [issue](https://github.com/ContinuumIO/anaconda-issues/issues/9480) with conda where the environment creation from a file fails with a `ResolvePackageNotFound` error. If you encountered this or any error trying to use the provided environment file, you can manually create BMO environment by running:
```
# Create and activate environment
Expand All @@ -93,7 +80,32 @@ conda install r-essentials
Rscript conda/install_R_dependencies.R
```

### 2) Setting up the Snakemake configuration file
## Installing dependencies manually
Here's the core software you need to have installed in your machine:
* [bedtools](https://bedtools.readthedocs.io/en/latest/)
* [htslib](http://www.htslib.org)
* [pysam](https://pysam.readthedocs.io/en/latest/)
* [R](https://www.r-project.org) (required libraries: MASS, fitdistrplus, and metap)

# Running BMO

## Required input
1. ATAC-seq experiment processed <sup>1</sup> and indexed BAM file
2. ATAC-seq peak calls <sup>2</sup>
3. Motif BED6 <sup>3</sup> files. One file per PWM scan <sup>4</sup>

<sup>1</sup> We recommend high quality (MAPQ ≥ 30), autosomal, duplicate removed (we use [Picard](https://broadinstitute.github.io/picard/)), properly paired, and uniquely mapped reads only (SAM [flags](https://broadinstitute.github.io/picard/explain-flags.html) `-f 3 -F 4 -F 8 -F 256 -F 1024 -F 2048`). If you don't have an ATAC-seq pipeline set up yet, you should consider using the [Parker Lab ATAC-seq Snakemake pipeline](https://github.com/ParkerLab/ATACseq-Snakemake).

<sup>2</sup> We call them with [MACS2](https://github.com/taoliu/MACS), using flags `--broad --nomodel --shift -100 --extsize 200 --keep-dup all` and then intersect out [blacklisted regions](https://sites.google.com/site/anshulkundaje/projects/blacklists).

<sup>3</sup> Mandatory. Pipeline will crash otherwise.

<sup>4</sup> We use [FIMO](http://meme-suite.org/doc/fimo.html).

## Using Snakemake
We strongly recommend you run BMO using [Snakemake](https://snakemake.readthedocs.io), an extremely powerful tool for running reproducible pipelines.

### Setting up the Snakemake configuration file
Our Snakemake pipeline makes use of a configuration file, which includes information of paths and input files. The template is located in `config/config.yaml`. Let's go through each of the fields:
* **`bmo_dir`**: this is the folder where you cloned this repository.
* **`motif_dir`**: path to your motif BED6 files.
Expand All @@ -107,7 +119,7 @@ Our Snakemake pipeline makes use of a configuration file, which includes informa

IMPORTANT: Because the config is a [yaml](http://yaml.org) file, **make sure to maintain all the indentations as they appear** or unexpected behaviors may occur.

### 3) Running BMO pipeline in Snakemake
### Running BMO pipeline
To execute the pipeline, you need to copy both the `Snakefile` and `input/config.yaml` to wherever you want to run the analysis in your machine/cluster and then update the config file as described in the previous section. After that, just run the code below and Snakemake will take care of the rest. Simple as that.

```
Expand All @@ -117,21 +129,41 @@ snakemake [-j {threads}] [--resources io_limit={concurrency}] --configfile confi
* **`{threads}`** is an integer value for the total number of cores Snakemake is allowed to use. if running on a HPC cluster, then set to 999 or some arbitrarily high value.
* **`{concurrency}`** is also an integer, and determines the number of maximum I/O intensive jobs to run simultaneously (we recommend starting with 1-3 and keeping an eye on the `%iowait` column of `sar` to see how much your machine can handle). Alternatively, the high I/O jobs can be pushed to RAM or to an SSD partition by changing `use_shm: True` in the config file (see details in the previous section). If using the latter option, then also add `shm_limit={shm_concurrency}` to the `--resources` call above, where `{shm_concurrency}` is an integer for the maximum number of concurring jobs when using the RAM/SSD partition. If either `io_limit` or `shm_limit` are not set, then all jobs will be submitted with no regards to maximum concurrency (which should not be an issue if running in a cluster).

### 4) Running the example data
We included some example data at `examples`. It consists of a heavily downsampled chromosome 1 of one of the Buenrostro *et al* 2013<sup>[1]</sup> original GM12878 ATAC-seq samples and chromosomes 1 of our CTCF_known2 and ELF1_known1 motif scans. The example also includes a config file, which we will use to instruct Snakemake. To run the example, execute the code below.
## Manually (standalone version)
Edit the variables in the `# Config` section of `src/bmo_manual.sh` similarly to the `config.yaml` file and run:
```sh
source activate BMO_env #optional
snakemake -j 4 --configfile examples/example_config.yaml
bash scripts/bmo_manual.sh > pipeline.sh
```

[1] doi: 10.1038/nmeth.2688

## Manually (standalone version)
*Coming soon*
This will create the `work` directory structure and a file with the list of commands named `pipeline.sh`. If you wish to use a job scheduler, then change the `# drmr:` lines to whatever resource call parameters your scheduler uses. The example here uses [drmr](https://github.com/ParkerLab/drmr):
```sh
drmr pipeline.sh
```

## Run time and memory requirements
The expected run time will vary depending on the size of your input files. As an example, a relatively large motif file (\~400K instances) running on a very large BAM file (\~80M reads) takes less than 5 minutes and 500Mb of RAM using 4 cores. The same motif file takes less than 2 minutes and the same amount of RAM to run on a smaller BAM file (\~35M reads) using the same number of cores.

# Running the example data
We included some example data at `examples`. It consists of a heavily downsampled chromosome 1 of one of the Buenrostro *et al* 2013<sup>[1]</sup> original GM12878 ATAC-seq samples and chromosomes 1 of our CTCF_known2 and ELF1_known1 motif scans. The example code will generate the output files at `work/bmo/example_buenrostro_rep1/bound`.

[1] doi: 10.1038/nmeth.2688

## With Snakemake
The example also includes a config file, which we will use to instruct Snakemake. To run the example, execute the code below.
```sh
# source activate BMO_env # if using conda environment
snakemake -j 4 --configfile examples/example_config.yaml
```

## Manually
The `# Config` preamble of `src/bmo_manual.sh` is already set to run the example data.
```sh
# Generate and execute pipeline file
bash src/bmo_manual.sh > pipeline.sh
bash pipeline.sh
```


# Interpreting BMO results
## Processed BED files
Once the pipeline finishes running, you can find results in the `work/bmo/{sample}/bound` folder. The BED6 files there will correspond to the **predicted bound motif instances** for each motif.
Expand Down
85 changes: 85 additions & 0 deletions src/bmo_manual.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
#!/bin/bash

# Config
bmo_dir="."
sample_name="example_buenrostro_rep1"
bam="examples/${sample_name}.subsampled.bam"
peaks="examples/example_peaks.bed.gz"
motif_list="examples/motifs.txt"
motif_dir="examples"
motif_ext=".bed.gz"
cores="4"


# Run
raw_signals_dir="work/raw_signals/${sample_name}/all_regions"
outside_peaks_dir="work/raw_signals/${sample_name}/outside_peaks"
cooccur_dir="work/raw_signals/${sample_name}/co_occurring"
nb_dir="work/negative_binomials/${sample_name}"
bmo_outdir="work/bmo/${sample_name}"

mkdir -p ${raw_signals_dir} ${outside_peaks_dir} ${cooccur_dir} \
${nb_dir} ${bmo_outdir}


# Pipeline

echo "# measure_raw_signal"
echo "# drmr:job processors=4 processor_memory=1gb time_limit=15:00"
while read motif
do
motif_file="${motif_dir}/${motif}${motif_ext}"
raw_signals="${raw_signals_dir}/${motif}.bed"

echo "${bmo_dir}/src/measureRawSignal.py -p ${cores} -b ${bam} \
-m ${motif_file} > ${raw_signals}"
done < ${motif_list}


echo -e "\n\n# motifs_outside_peaks"
echo "# drmr:job processors=1 processor_memory=1gb time_limit=15:00"
while read motif
do
raw_signals="${raw_signals_dir}/${motif}.bed"
outside_peaks="${outside_peaks_dir}/${motif}.bed"

echo "intersectBed -v -a ${raw_signals} -b ${peaks} > ${outside_peaks}"
done < ${motif_list}


echo -e "\n\n# count_overlapping_motifs"
echo "# drmr:job processors=1 processor_memory=1gb time_limit=15:00"
while read motif
do
raw_signals="${raw_signals_dir}/${motif}.bed"
cooccur="${cooccur_dir}/${motif}.bed"

echo "${bmo_dir}/src/count_co-occuring_motifs.sh ${raw_signals} 100 \
${bmo_dir} > ${cooccur}"
done < ${motif_list}


echo -e "\n\n# fit_nbinoms"
echo "# drmr:job processors=1 processor_memory=5gb time_limit=30:00"
while read motif
do
nb_handle="${nb_dir}/${motif}"
nb_out="${nb_handle}.bed.gz"
echo "Rscript ${bmo_dir}/src/rawSigNBModel.R -f ${motif}.bed \
--dir1 ${raw_signals_dir}/ --dir2 ${outside_peaks_dir}/ \
-o ${nb_handle} -c 7 --writeBed"
done < ${motif_list}


echo -e "\n\n# BMO"
echo "# drmr:job processors=1 processor_memory=5gb time_limit=15:00"
while read motif
do
cooccur="${cooccur_dir}/${motif}.bed"
nb_handle="${nb_dir}/${motif}"
nb_out="${nb_handle}.bed.gz"

echo "Rscript ${bmo_dir}/src/bmo.R --f1 ${nb_out} --f2 ${cooccur} \
-o ${bmo_outdir} -n ${motif} -p 1"
done < ${motif_list}

0 comments on commit d5c9d45

Please sign in to comment.