![encodelogo](images/encodelogo.gif)

## ENCODE ChIP-seq pipeline on AWS

This notebook demonstrates how to launch the [ENCODE ChIP-seq pipeline](https://github.com/ENCODE-DCC/chip-seq-pipeline2) on AWS. The pipeline is defined in [Workflow Description Language (WDL)](https://software.broadinstitute.org/wdl/), executed by [Cromwell](https://github.com/broadinstitute/cromwell), and is meant to be reproducible across compute environments.

## Configure AWS

First we set up our compute environment on AWS by following the directions here: https://docs.opendata.aws/genomics-workflows/orchestration/cromwell/cromwell-overview/. For simplicity we deploy the *Cromwell All-in-One* stack, which requires specifying the name of the S3 bucket where we want to store our results, the maximum number of CPUs our instances can use, and the key pair we want to use to SSH onto the Cromwell server.

## Connect to Cromwell server

Once the stack is created, SSH onto the Cromwell server:
```bash
$ ssh -i </path/to/ssh/keypair.pem> ec2-user@<cromwell-server-public-dns>
```

Install Emacs (optional) and Git:
```bash
$ sudo yum install emacs
$ sudo yum install git
```

Clone the ChIP-seq pipeline repository:
```bash
$ git clone https://github.com/ENCODE-DCC/chip-seq-pipeline2.git
```

Build the hg38 genome data used as input to the pipeline:
```bash
$ ./chip-seq-pipeline2/genome/download_genome_data.sh hg38
```

## Gather ENCODE files for processing

Here we will use [ENCSR659LJJ](https://www.encodeproject.org/experiments/ENCSR659LJJ/), which is an ENCODE ChIP-seq experiment for the [HDAC2](https://www.genecards.org/cgi-bin/carddisp.pl?gene=HDAC2) transcription factor in the [A549](https://en.wikipedia.org/wiki/A549_cell) cell line. The FASTQ files we want from this experiment are:

1. [ENCFF735IUI](https://www.encodeproject.org/files/ENCFF735IUI/)
2. [ENCFF093KEW](https://www.encodeproject.org/files/ENCFF093KEW/) 

We also need the files from the control experiment [ENCSR787VQB](https://www.encodeproject.org/experiments/ENCSR787VQB/) and the male genome index [ENCFF643CGH](https://www.encodeproject.org/files/ENCFF643CGH/).

## Create input.json

```javascript
{
    "chip.pipeline_type" : "tf",
    "chip.genome_tsv" : "gs://encode-pipeline-genome-data/hg38_google.tsv",
    "chip.fastqs" : [
        [["s3://encode-public/2016/03/03/a3da815e-482c-429d-ba2a-d01ab1923cbf/ENCFF735IUI.fastq.gz"]],
        [["s3://encode-public/2016/03/03/6bfd3839-5f50-464f-8282-0e56b0cb27f2/ENCFF093KEW.fastq.gz"]]
    ],
    "chip.ctl_fastqs" : [
        [["s3://encode-public/2016/03/03/5a77f593-86f0-4a3c-adc1-a8dbde1183c8/ENCFF214KMN.fastq.gz",
          "s3://encode-public/2016/03/03/ea2f7095-dc5c-42b9-900f-4f4b45718cb6/ENCFF343EVP.fastq.gz"]],
        [["s3://encode-public/2016/03/03/bc0b0714-aefa-4932-886f-8e9072d6ba5f/ENCFF920WNX.fastq.gz"]]
    ],
    "chip.paired_end" : false,
    "chip.always_use_pooled_ctl" : true
}
```
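Because `input.json` is edited by hand, it can be worth checking that it parses as strict JSON before submitting it. The following is a minimal sketch, assuming `python3` is available on the Cromwell server (an assumption, not something the stack is known to guarantee); the helper name `validate_json` is hypothetical.

```bash
# Hypothetical helper: succeed only if the given file parses as strict JSON.
# Assumes python3 is on the PATH; json.tool exits non-zero on a parse error,
# for example when an object ends with a trailing comma.
validate_json() {
    python3 -m json.tool "$1" > /dev/null
}

# Example call, run from the directory that holds input.json:
#   validate_json input.json && echo "input.json is valid JSON"
```

If the file does not parse, `json.tool` prints the location of the first error, which is usually enough to find the offending line.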

We can find each file's path in the `s3://encode-public` bucket by appending `?format=json` to its ENCODE file URL (listed above) and reading the `s3_uri` field of the returned JSON.
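The lookup described above can be scripted. This is a rough sketch rather than an official ENCODE client: the function names `encode_metadata_url` and `extract_s3_uri` are hypothetical, and it assumes `curl` and `python3` are installed and that `s3_uri` is a top-level field of the file's JSON.

```bash
# Hypothetical helper: build the JSON metadata URL for an ENCODE file
# accession such as ENCFF735IUI.
encode_metadata_url() {
    printf 'https://www.encodeproject.org/files/%s/?format=json\n' "$1"
}

# Hypothetical helper: read a JSON document on stdin and print the value
# of its top-level "s3_uri" field. Assumes python3 is on the PATH.
extract_s3_uri() {
    python3 -c 'import json, sys; print(json.load(sys.stdin)["s3_uri"])'
}

# Example call (requires network access):
#   curl -sL -H "accept: application/json" \
#       "$(encode_metadata_url ENCFF735IUI)" | extract_s3_uri
```

Running the example call once per accession gives the paths to paste into `chip.fastqs` and `chip.ctl_fastqs`.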

## Submit job to Cromwell server

```bash
$ curl -X POST "http://localhost:8000/api/workflows/v1" \
    -H "accept: application/json" \
    -F "workflowSource=@chip-seq-pipeline2/chip.wdl" \
    -F "workflowInputs=@input.json" \
    -F "workflowOptions=@chip-seq-pipeline2/workflow_opts/docker.json"
```
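Cromwell responds to the submission with a small JSON document containing the workflow `id`. That ID can then be used with Cromwell's status and outputs endpoints to follow the run. The sketch below assumes the server is still reachable at `http://localhost:8000`, as in the submission command; the function names and the `CROMWELL_URL` variable are hypothetical conveniences, not part of Cromwell.

```bash
# Base URL of the Cromwell REST API; defaults to the address used for the
# submission. CROMWELL_URL is a hypothetical variable for this sketch.
CROMWELL_URL="${CROMWELL_URL:-http://localhost:8000}"

# Print the status endpoint URL for a workflow ID.
workflow_status_url() {
    printf '%s/api/workflows/v1/%s/status\n' "$CROMWELL_URL" "$1"
}

# Print the outputs endpoint URL for a workflow ID.
workflow_outputs_url() {
    printf '%s/api/workflows/v1/%s/outputs\n' "$CROMWELL_URL" "$1"
}

# Ask Cromwell for the current status of a workflow.
workflow_status() {
    curl -s -X GET "$(workflow_status_url "$1")" -H "accept: application/json"
}

# Ask Cromwell for the outputs of a workflow.
workflow_outputs() {
    curl -s -X GET "$(workflow_outputs_url "$1")" -H "accept: application/json"
}

# Example calls, with <workflow-id> taken from the submission response:
#   workflow_status <workflow-id>
#   workflow_outputs <workflow-id>
```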

## Examine results