![encodelogo](images/encodelogo.gif)

## ENCODE ChIP-seq pipeline on AWS

This notebook demonstrates how to launch the ENCODE ChIP-seq pipeline on AWS. The pipeline is defined in [Workflow Description Language (WDL)](https://software.broadinstitute.org/wdl/), executed by [Cromwell](https://github.com/broadinstitute/cromwell), and is meant to be reproducible across compute environments. The full code describing the pipeline can be found at https://github.com/ENCODE-DCC/chip-seq-pipeline2.

First we set up our compute environment on AWS following the directions here: https://docs.opendata.aws/genomics-workflows/orchestration/cromwell/cromwell-overview/. For simplicity we deploy the *Cromwell All-in-One* stack, which requires three parameters: the name of the S3 bucket where we want to store our results, the maximum number of CPUs our instances can use, and the key pair we want to use to SSH into the Cromwell server.

Once the stack is created we SSH into the Cromwell server and create our AWS configuration file:

```hocon
// aws.conf
include required(classpath("application"))

webservice {
    interface = localhost
    port = 8000
}

// this stanza controls how fast Cromwell submits jobs to AWS Batch
// and avoids running into API request limits
system {
    job-rate-control {
        jobs = 1
        per = 2 second
    }
}

// this stanza defines how your server will authenticate with your AWS
// account.  it is recommended to use the "default-credential-provider" scheme.
aws {
  application-name = "cromwell"
  auths = [{
      name = "default"
      scheme = "default"
  }]

  // you must provide your operating region here - e.g. "us-east-1"
  // this should be the same region your S3 bucket and AWS Batch resources
  // are created in
  region = "<your-region>"
}

engine {
  filesystems {
    s3 { auth = "default" }
  }
}

backend {
  // this configures the AWS Batch Backend for Cromwell
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        root = "s3://<your-bucket-name>/cromwell-execution"
        auth = "default"

        numSubmitAttempts = 3
        numCreateDefinitionAttempts = 3

        default-runtime-attributes {
          queueArn: "<your-queue-arn>"
        }

        filesystems {
          s3 {
            auth = "default"
          }
        }
      }
    }
  }
}
```

Before using this file, we must replace the `<your-region>`, `<your-bucket-name>`, and `<your-queue-arn>` placeholders with our region, S3 bucket name, and AWS Batch queue ARN. We will pass this configuration file to Cromwell along with the pipeline WDL and input files.
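As a rough illustration of these two steps, the sketch below fills the placeholders in a shortened copy of the configuration and then assembles a Cromwell `run` command. The helper names (`fill_placeholders`, `cromwell_run_command`), the bucket name, the queue ARN, and the file names `cromwell.jar`, `aws.conf`, `chip.wdl`, and `input.json` are all assumptions made for the example; substitute the values from your own stack.

```python
import re

# Matches any placeholder of the form <your-...> used in aws.conf.
PLACEHOLDER = re.compile(r"<your-[a-z-]+>")


def fill_placeholders(template, values):
    """Replace each <name> placeholder and fail if any is left unfilled."""
    filled = template
    for name, value in values.items():
        filled = filled.replace(f"<{name}>", value)
    leftover = PLACEHOLDER.findall(filled)
    if leftover:
        raise ValueError(f"unfilled placeholders: {sorted(set(leftover))}")
    return filled


def cromwell_run_command(jar, config, wdl, inputs):
    """Build the argument list for running one workflow with a custom config."""
    return [
        "java", f"-Dconfig.file={config}",
        "-jar", jar,
        "run", wdl,
        "--inputs", inputs,
    ]


# Shortened copy of the three lines in aws.conf that carry placeholders.
template = '''
region = "<your-region>"
root = "s3://<your-bucket-name>/cromwell-execution"
queueArn: "<your-queue-arn>"
'''

# Hypothetical values for illustration only.
values = {
    "your-region": "us-east-1",
    "your-bucket-name": "example-bucket",
    "your-queue-arn": "arn:aws:batch:us-east-1:123456789012:job-queue/example-queue",
}

filled = fill_placeholders(template, values)
command = cromwell_run_command("cromwell.jar", "aws.conf", "chip.wdl", "input.json")

print(filled)
print(" ".join(command))
```

Raising an error on leftover placeholders is a deliberate choice here: a forgotten `<your-queue-arn>` would otherwise only surface later, when Cromwell tries to submit a job.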

Next we find the FASTQ files from each replicate that we want to use as inputs to the pipeline. Here we will use [ENCSR659LJJ](https://www.encodeproject.org/experiments/ENCSR659LJJ/), an ENCODE ChIP-seq experiment for the [HDAC2](https://www.genecards.org/cgi-bin/carddisp.pl?gene=HDAC2) transcription factor in the [A549](https://en.wikipedia.org/wiki/A549_cell) cell line. The FASTQ files we want from this experiment are [ENCFF735IUI](https://www.encodeproject.org/files/ENCFF735IUI/) and [ENCFF093KEW](https://www.encodeproject.org/files/ENCFF093KEW/). We also need the files from the control experiment [ENCSR787VQB](https://www.encodeproject.org/experiments/ENCSR787VQB/) and the male genome index [ENCFF643CGH](https://www.encodeproject.org/files/ENCFF643CGH/).
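To make the inputs concrete, here is a minimal sketch that turns the two FASTQ accessions above into download links. It assumes the ENCODE portal's usual `/files/<accession>/@@download/<accession>.fastq.gz` URL layout; the helper name `fastq_download_url` is hypothetical, and the control and genome index files are left out because their file-level details are not listed here.

```python
ENCODE_BASE = "https://www.encodeproject.org"


def fastq_download_url(accession):
    """Build a download link for a FASTQ file, assuming the @@download layout."""
    return f"{ENCODE_BASE}/files/{accession}/@@download/{accession}.fastq.gz"


# FASTQ accessions from experiment ENCSR659LJJ, as named in the text.
experiment_fastqs = ["ENCFF735IUI", "ENCFF093KEW"]

urls = [fastq_download_url(accession) for accession in experiment_fastqs]

for url in urls:
    print(url)
```

These links could then be referenced from the pipeline's input file, or the files could first be copied into the S3 bucket created with the stack.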
