![encodelogo](images/encodelogo.gif)

# Exploring ENCODE data from EC2 with Jupyter notebook

This notebook demonstrates how to mount *s3://encode-public* on an EC2 instance using Goofys. Goofys makes an S3 bucket appear as a normal file system, which is useful for tools that expect a local file path. Once the bucket is mounted we can launch a Jupyter notebook on the instance and connect to it remotely. The benefit of using EC2 is that you can scale the compute to the analysis you would like to perform, and you don't have to download anything to your own machine.

# Spin up instance

We will log into our AWS console and start an EC2 instance from a base Ubuntu image (it is also possible to find images that include most of the dependencies that we will install manually below).

1. Go to EC2 instances.

![launch1](images/ec2_goofys_jupyter/launch1.png)

2. Click launch instance.

![launch2](images/ec2_goofys_jupyter/launch2.png)

3. Choose base Ubuntu image.

![launch3](images/ec2_goofys_jupyter/launch3.png)

4. Choose instance type.

![launch4](images/ec2_goofys_jupyter/launch4.png)

5. Add key pair.

![launch5](images/ec2_goofys_jupyter/launch5.png)

For this example we will use the `t2.xlarge` instance type. Make sure to provide or create a key pair for your instance so we can SSH in later.

# SSH into the instance

Search for the instance you just created and find its public DNS.

![launch6](images/ec2_goofys_jupyter/launch6.png)

Open a terminal and connect to the instance using SSH, filling in the path to your private key file and your instance address:
```
$ ssh -i ~/.ssh/keenan.pem ubuntu@ec2-54-191-241-6.us-west-2.compute.amazonaws.com
```

# Install dependencies

We will install:

[Anaconda](https://www.anaconda.com/distribution/)
```
$ curl -O https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
$ bash Anaconda3-2019.03-Linux-x86_64.sh
$ source ~/.bashrc
$ conda create -n encode-public python=3.7
$ conda activate encode-public
```

[awscli](https://github.com/aws/aws-cli)
```
$ pip install awscli
```

[Jupyter notebook](https://jupyter.org/)
```
$ conda install jupyter
```

[pandas](https://pandas.pydata.org/)
```
$ conda install pandas
```

[seaborn](https://seaborn.pydata.org/)
```
$ conda install seaborn
```

[pyBigWig](https://github.com/deeptools/pyBigWig)
```
$ conda install pybigwig -c bioconda
```

[Go](https://golang.org/)
```
$ sudo apt-get update
$ sudo apt-get install golang-go
```


[Goofys](https://github.com/kahing/goofys)
```
$ export GOPATH=$HOME/work
$ go get github.com/kahing/goofys
$ go install github.com/kahing/goofys
```

[Tree](http://manpages.ubuntu.com/manpages/trusty/man1/tree.1.html)
```
$ sudo apt-get install tree
```

# Mount S3 bucket

Goofys expects valid AWS credentials (though they don't need to have permission to do anything since we are mounting a public bucket). Run `aws configure` and enter your *aws_access_key_id*, *aws_secret_access_key*, and default region (e.g. `us-west-2`).

Mount *s3://encode-public* to a local folder called *encode-public*:

```
$ mkdir encode-public
$ $GOPATH/bin/goofys encode-public/ encode-public/
```
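Before moving on it can help to confirm that the mount actually succeeded. The following is a minimal sketch using only the Python standard library. The helper name `describe_mount` is hypothetical, and using `encode_file_manifest.tsv` as a marker file is an assumption based on the bucket listing shown later in this notebook.

```python
import os


def describe_mount(path, marker="encode_file_manifest.tsv"):
    """Report whether `path` is a mount point and whether `marker` is visible in it.

    `marker` is an assumption: a file we expect at the top level of the bucket.
    """
    return {
        "is_mount": os.path.ismount(path),
        "has_marker": os.path.exists(os.path.join(path, marker)),
    }


if __name__ == "__main__":
    # Run from the directory that contains the encode-public folder.
    print(describe_mount("encode-public"))
```

If both values are `True`, the folder is a mount point and the bucket contents are visible through it.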

# Start Jupyter notebook

Now we can run a Jupyter notebook on the EC2 instance and connect to it remotely.
```
$ jupyter notebook --no-browser --port=8888
```
Note the token in the returned URL (e.g. http://localhost:8888/?token=213b9a2799fe83807ab9e2e1254677ed3eb82cea9d05f452).

# Link local port to remote port

Open another terminal window and type (again filling in your details):

```
$ ssh -i ~/.ssh/keenan.pem -L 8000:localhost:8888 ubuntu@ec2-54-191-241-6.us-west-2.compute.amazonaws.com
```

This links port 8000 on your local machine to the Jupyter notebook running on port 8888 of your EC2 instance. Launch a browser and type in `localhost:8000`. You should see a Jupyter window asking you for the token from above.

![launch7](images/ec2_goofys_jupyter/launch7.png)
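As an alternative to pasting the token into the login form, the URL printed by Jupyter can be rewritten to point at the forwarded local port. This is a small sketch, not part of the original workflow. The helper names `extract_token` and `forwarded_url` are hypothetical, and the default port of 8000 simply mirrors the SSH command above.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def extract_token(jupyter_url):
    """Return the value of the `token` query parameter, or None if absent."""
    query = parse_qs(urlsplit(jupyter_url).query)
    return query.get("token", [None])[0]


def forwarded_url(jupyter_url, local_port=8000):
    """Rewrite the host and port of a Jupyter URL to the forwarded local port."""
    parts = urlsplit(jupyter_url)
    netloc = f"localhost:{local_port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    remote = "http://localhost:8888/?token=213b9a2799fe83807ab9e2e1254677ed3eb82cea9d05f452"
    print(extract_token(remote))
    print(forwarded_url(remote))
```

Opening the rewritten URL in a local browser should log you in directly, since the token travels in the query string.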

# Create notebook

Create a new Jupyter notebook using Python 3.

![launch8](images/ec2_goofys_jupyter/launch8.png)

# Explore bucket structure

In the notebook we can `ls` the *encode-public* folder to list the contents of the S3 bucket.

In [1]:
!ls encode-public/

2008  2010  2012  2014	2016  2018  encode_file_manifest.tsv
2009  2011  2013  2015	2017  2019  robots.txt


We can see that the files are organized by year/month/day and that there is a TSV file manifest. To get a better idea of the structure we can use `tree` to recursively list the contents of the 2008 folder.

In [2]:
!tree encode-public/2008

[01;34mencode-public/2008[00m
└── [01;34m11[00m
    └── [01;34m24[00m
        ├── [01;34m034e3689-9903-4c86-9237-040f8f795b73[00m
        │   └── [01;31mENCFF001SNN.broadPeak.gz[00m
        ├── [01;34m0868284e-8c3c-488d-89e6-487cd89971c3[00m
        │   └── ENCFF000AAU.broadPeak.bigbed
        ├── [01;34m0b903d8b-824c-4e34-9b24-a1d23e31d83f[00m
        │   └── [01;31mENCFF001SNC.broadPeak.gz[00m
        ├── [01;34m0e0d13f7-4e7c-4cee-95a8-e3dc1c1351d1[00m
        │   └── ENCFF000ABT.broadPeak.bigbed
        ├── [01;34m0e5cd34b-4a2a-4bd5-a063-15a46fa48016[00m
        │   └── [01;31mENCFF001SMZ.broadPeak.gz[00m
        ├── [01;34m0ecf15e8-7cf9-4f33-9e1c-71d5b384c12e[00m
        │   └── [01;31mENCFF001SNQ.broadPeak.gz[00m
        ├── [01;34m0ffbb597-0ddf-4b0a-96b4-390e9d6dc997[00m
        │   └── ENCFF000ABS.broadPeak.bigbed
        ├── [01;34m16c640cb-a136-4a9e-b539-7ca412d7de00[00m
        │   └── ENCFF000ABM.broadPeak.bigbed
        ├── [01;34m17795607-720

Notice that every file is identified by a UUID (e.g. 034e3689-9903-4c86-9237-040f8f795b73) and an accession (e.g. ENCFF001SNN). Besides consulting the file manifest, you can always append the UUID or accession to the end of https://www.encodeproject.org/ to get more information about the file.

For example these are equivalent:
* https://www.encodeproject.org/034e3689-9903-4c86-9237-040f8f795b73
* https://www.encodeproject.org/ENCFF001SNN
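The path layout and the two equivalent URLs can be sketched in code. This is an illustrative helper, not an official ENCODE utility. The name `parse_bucket_path` is hypothetical, and it assumes every file follows the year/month/day/UUID/filename layout seen in the `tree` output and that the accession is the part of the filename before the first dot.

```python
from pathlib import PurePosixPath

PORTAL = "https://www.encodeproject.org/"


def parse_bucket_path(path):
    """Split a bucket path into its components and build the portal URLs.

    Assumes the layout .../<year>/<month>/<day>/<uuid>/<filename>.
    """
    year, month, day, uuid, filename = PurePosixPath(path).parts[-5:]
    accession = filename.split(".", 1)[0]
    return {
        "date": f"{year}-{month}-{day}",
        "uuid": uuid,
        "accession": accession,
        "uuid_url": PORTAL + uuid,
        "accession_url": PORTAL + accession,
    }


if __name__ == "__main__":
    info = parse_bucket_path(
        "encode-public/2008/11/24/034e3689-9903-4c86-9237-040f8f795b73/ENCFF001SNN.broadPeak.gz"
    )
    for key, value in info.items():
        print(f"{key}: {value}")
```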

# Open ENCODE bigWig using local path

The nice thing about having a Jupyter notebook running on the EC2 instance is that we can open the file manifest directly in *pandas*.

In [3]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pyBigWig
import seaborn as sns

Open the tab-delimited manifest.

In [4]:
files = pd.read_csv('encode-public/encode_file_manifest.tsv', sep='\t')

Every row is a file (~400,000 files in the bucket).

In [5]:
files.shape

(412361, 18)

The columns contain important metadata about the files, such as their format and full S3 URI.

In [6]:
files.columns

Index(['accession', 'status', 'file_format', 'file_type', 'assembly',
       'award.rfa', 's3_uri', 'cloud_metadata.url', 'dataset', 'lab.@id',
       'output_type', 'output_category', 'file_size', 'date_created', 'md5sum',
       'cloud_metadata.md5sum_base64', 'replicate_libraries',
       'analysis_step_version.analysis_step.name'],
      dtype='object')

We can see how many files there are by format.

In [7]:
files.file_format.value_counts()

bigWig      126742
bam          78986
bed          70165
bigBed       65757
fastq        47718
tsv          15459
tagAlign      2268
tar           1657
gtf           1125
gff            720
idat           554
hdf5           280
rcc            227
sam            188
wig            188
hic            160
csfasta         49
csqual          37
vcf             37
fasta           27
bedpe            9
CEL              8
Name: file_format, dtype: int64
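If you only need the counts and would rather not load the whole manifest into memory, the same tally can be computed by streaming the file row by row. The sketch below uses only the standard library. The helper name `count_by_format` is hypothetical, and the three-row sample is synthetic data made up for illustration, not real manifest content.

```python
import csv
import io
from collections import Counter


def count_by_format(tsv_lines, column="file_format"):
    """Count rows per value of `column` in a tab-delimited file object."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return Counter(row[column] for row in reader)


# Synthetic stand-in for the manifest (hypothetical accessions).
sample = io.StringIO(
    "accession\tfile_format\n"
    "FILE_A\tbigWig\n"
    "FILE_B\tbam\n"
    "FILE_C\tbigWig\n"
)
counts = count_by_format(sample)
print(counts.most_common())  # [('bigWig', 2), ('bam', 1)]
```

To run it on the real manifest, pass an open file handle instead, for example `open('encode-public/encode_file_manifest.tsv', newline='')`.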

We can also filter the manifest to include only selected files. For example, we can select all the GRCh38 bigWigs from [ENCSR901SIL](https://www.encodeproject.org/experiments/ENCSR901SIL/), an H3K4me3 ChIP-seq experiment in heart tissue.

In [8]:
filtered_files = files[
    (files.dataset == '/experiments/ENCSR901SIL/')
    & (files.file_format == 'bigWig')
    & (files.assembly == 'GRCh38')
].reset_index(drop=True)
filtered_files[['accession', 'dataset', 'file_format', 'assembly', 'output_type', 's3_uri']]

| | accession | dataset | file_format | assembly | output_type | s3_uri |
|---|---|---|---|---|---|---|
| 0 | ENCFF254JZR | /experiments/ENCSR901SIL/ | bigWig | GRCh38 | fold change over control | s3://encode-public/2017/03/21/e8e286f4-14a2-4c... |
| 1 | ENCFF112WFU | /experiments/ENCSR901SIL/ | bigWig | GRCh38 | signal p-value | s3://encode-public/2017/03/21/52a1bef1-d28c-4e... |


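Removing the *s3://* prefix from each `s3_uri` gives the path to the file in our locally mounted bucket. This works because the local folder has the same name as the bucket and we are running the notebook from the directory that contains it.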

In [9]:
filtered_files['local_path'] = filtered_files.s3_uri.apply(lambda x: x.replace('s3://', ''))
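A slightly more defensive version of this conversion strips the prefix only from the start of the string and can prepend the directory that holds the mount. This is a sketch under stated assumptions: the helper name `s3_uri_to_local_path` and the `mount_parent` parameter are hypothetical, and `/home/ubuntu` in the usage line is only a guess at where the *encode-public* folder was created.

```python
import posixpath

S3_PREFIX = "s3://"


def s3_uri_to_local_path(s3_uri, mount_parent=""):
    """Map s3://<bucket>/<key> to <mount_parent>/<bucket>/<key>.

    Assumes the bucket is mounted in a folder named after the bucket.
    """
    if not s3_uri.startswith(S3_PREFIX):
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    relative = s3_uri[len(S3_PREFIX):]
    return posixpath.join(mount_parent, relative) if mount_parent else relative


if __name__ == "__main__":
    uri = "s3://encode-public/2017/03/21/52a1bef1-d28c-4e7e-849d-c7fa4da3c589/ENCFF112WFU.bigWig"
    print(s3_uri_to_local_path(uri))
    # Hypothetical absolute location of the mount folder.
    print(s3_uri_to_local_path(uri, mount_parent="/home/ubuntu"))
```

Returning an absolute path makes the result independent of the notebook's working directory.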

Now we can open the signal p-value bigWig using `pyBigWig` and the local path. 

In [10]:
path_to_ENCFF112WFU = filtered_files.iloc[1].local_path
path_to_ENCFF112WFU

'encode-public/2017/03/21/52a1bef1-d28c-4e7e-849d-c7fa4da3c589/ENCFF112WFU.bigWig'

In [11]:
bw = pyBigWig.open(path_to_ENCFF112WFU)

# Explore bigWig file

In [12]:
bw.chroms()

{'chrUn_KI270748v1': 93321,
 'chrUn_KI270337v1': 1121,
 'chrUn_KI270749v1': 158759,
 'chr1_KI270713v1_random': 40745,
 'chrUn_KI270418v1': 2145,
 'chr13': 114364328,
 'chr12': 133275309,
 'chrUn_KI270304v1': 2165,
 'chr10': 133797422,
 'chr17': 83257441,
 'chr16': 90338345,
 'chr15': 101991189,
 'chrUn_KI270305v1': 1472,
 'chrUn_GL000218v1': 161147,
 'chr19': 58617616,
 'chr18': 80373285,
 'chrUn_KI270320v1': 4416,
 'chrUn_GL000219v1': 179198,
 'chrUn_KI270518v1': 2186,
 'chr3_GL000221v1_random': 155397,
 'chrUn_GL000213v1': 164239,
 'chrUn_KI270746v1': 66486,
 'chrUn_KI270516v1': 1300,
 'chr16_KI270728v1_random': 1872759,
 'chrUn_KI270521v1': 7642,
 'chrUn_GL000214v1': 137718,
 'chr9_KI270720v1_random': 39050,
 'chrUn_KI270593v1': 3041,
 'chrUn_KI270538v1': 91309,
 'chr22_KI270731v1_random': 150754,
 'chr1_KI270707v1_random': 32032,
 'chrUn_KI270322v1': 21476,
 'chrUn_KI270579v1': 31033,
 'chr1_KI270708v1_random': 127682,
 'chrUn_KI270378v1': 1048,
 'chr5': 181538259,
 'chr15_KI270727