## Learning Goals
* Introduction to ENCODE datasets 
* Extract sequences corresponding to promoter or enhancer regions for a gene from ChIP-seq data 
    * Mastery of the bedtools command ["intersectBed"](http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html) 

###  Linux command reference 
Link to [Unix Command Reference](../Unix_Basics.ipynb)

## Working with [ENCODE](https://www.encodeproject.org/) data


<img src="images/overview_2016-05May-17.png" style="width: 60%; height: 60%" align="center"/>
Promoter-like regions and enhancer-like regions are considered 'middle-level' data.

Promoter-like regions: 
<img src="images/Example.promoter.png" style="width: 60%; height: 60%" align="center"/>

Enhancer-like regions: 
<img src="images/Example.enhancerlike.png" style="width: 60%; height: 60%" align="center"/>


We will fetch a list of enhancer-like and promoter-like regions for the human H1 embryonic stem cell line (hg19 assembly). In each form, select chromosome 1:1-10000000.  

Use these links to fetch the regions of interest: 

[Promoters](http://zlab-annotations.umassmed.edu/promoters/#)
Make sure your selections in the form match the values below: 
<img src="images/encode_promoter.png" align="center"/>

[Enhancers](http://zlab-annotations.umassmed.edu/enhancers/) 
Make sure  your selections in the form match those below: 
<img src="images/encode_enhancer.png" align="center"/>


Click on the "Download" link and save the zipped files in the same directory as this notebook. 



In [None]:
## Extract the data you have downloaded. 
!unzip *promoter*zip 
!unzip *enhancer*zip 
!gzip -d *gz 
%ls

## Use the "more" command to examine the contents of the promoter file  
%more ENCSR000DRY_predictions.bed 

## Look at the first 10 lines of the enhancer file. 
## YOUR CODE HERE 


## The default file names are not informative. Rename the files to indicate which one annotates 
## promoter regions and which one annotates enhancer regions. 
!mv ENCSR000DRY_predictions.bed ENCSR000DRY_predictions_promoter.bed 
!mv ENCSR000ANP_predictions.bed ENCSR000ANP_predictions_enhancer.bed
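
Before answering the questions below, it can help to check how many tab-separated fields each record has and which chromosomes appear in a file. The following is a minimal sketch and not part of the original exercise: the helper name `summarize_bed` and the two sample records are made up for illustration, and the function makes no assumption about what each column means.

```python
from collections import Counter

def summarize_bed(lines):
    """Count records per chromosome and per number of tab-separated fields."""
    chrom_counts = Counter()
    field_counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the optional header lines a BED file may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom_counts[fields[0]] += 1
        field_counts[len(fields)] += 1
    return chrom_counts, field_counts

# Hypothetical records for illustration only; they are NOT taken from the ENCODE files.
sample = [
    "chr1\t100\t200\tpeak_a\t0\n",
    "chr1\t300\t450\tpeak_b\t0\n",
]
chroms, widths = summarize_bed(sample)
print(dict(chroms), dict(widths))
```

To run it on a downloaded file, pass an open file handle instead of the sample list, for example `summarize_bed(open("ENCSR000DRY_predictions_promoter.bed"))`.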

#### What file format is the data in? What is contained in each of the first five columns? 
Your answer: 
#### Bonus question -- can you explain the meaning of the remaining columns in the file (column 6 and up)? 
Your answer:

## Bedtools Intersect 
Prior studies have shown that the gene [*LIN28A*](http://www.genecards.org/cgi-bin/carddisp.pl?gene=LIN28A) is associated with cell differentiation, and you hypothesize that this gene is likely to be expressed in the H1 cell line, as these are embryonic cells that are undergoing differentiation. Use the LIN28A GeneCards link above to find the chromosome and position of the LIN28A gene, making sure the coordinates you read off are for the same genome assembly as the peak files (hg19). Fill in the chromosome and the start and end coordinates of LIN28A below:


In [None]:
chrom_lin28a="" #replace "" with the chromosome of the gene 
startpos_lin28a=None #replace None with the gene start position 
endpos_lin28a=None #replace None with the gene end position

## ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 
## NOTE: confirm that these coordinates are on the same assembly as the peak files (hg19) before use. 
chrom_lin28a="chr1" # chromosome of the gene 
startpos_lin28a=26410778 # gene start position 
endpos_lin28a=26429728 # gene end position


We will use the bedtools intersect command to find the H1 promoter and enhancer peaks that are within 50 kb of the LIN28A gene. 
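
One way to read "within 50 kb of the gene" is "overlaps the gene interval after it has been extended by 50 kb on each side". The sketch below illustrates that reading with BED-style half-open coordinates (0-based start, end not included). The function names and the example coordinates are hypothetical, and this is an illustration of the idea rather than a description of how bedtools is implemented.

```python
def pad_interval(start, end, pad):
    """Extend [start, end) by `pad` bases on each side, never going below 0."""
    return max(0, start - pad), end + pad

def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end

# Hypothetical gene and peaks, chosen only to show the different cases.
gene_start, gene_end = 100_000, 120_000
win_start, win_end = pad_interval(gene_start, gene_end, 50_000)  # (50000, 170000)

print(overlaps(60_000, 60_500, win_start, win_end))    # peak inside the window
print(overlaps(169_900, 170_400, win_start, win_end))  # peak straddling the window edge
print(overlaps(170_000, 170_300, win_start, win_end))  # peak starting exactly at the window end
```

Because the end coordinate is exclusive, a peak that starts exactly where the window ends shares no base with it and is not counted as an overlap.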

In [1]:
## Use the --help flag to learn about the inputs and outputs of the bedtools intersect command 
!bedtools intersect --help


Tool:    bedtools intersect (aka intersectBed)
Version: v2.26.0
Summary: Report overlaps between two feature files.

Usage:   bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>

	Note: -b may be followed with multiple databases and/or 
	wildcard (*) character(s). 
Options: 
	-wa	Write the original entry in A for each overlap.

	-wb	Write the original entry in B for each overlap.
		- Useful for knowing _what_ A overlaps. Restricted by -f and -r.

	-loj	Perform a "left outer join". That is, for each feature in A
		report each overlap with B.  If no overlaps are found, 
		report a NULL feature for B.

	-wo	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlaps restricted by -f and -r.
		  Only A features with overlap are reported.

	-wao	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlapping features restricted by -f 

In [None]:
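## Aha! You need to store LIN28A's coordinates in a BED file in order to run the intersect command. 
## Execute this code block to generate a BED file containing the LIN28A coordinates, extended by 50 kb 
## upstream and downstream of the gene. 

# Add 50 kb of flanking sequence on each side. New variables are used so that re-running this cell 
# does not widen the window again; max() keeps the start from going below 0. 
flank = 50000
window_start = max(0, startpos_lin28a - flank)
window_end = endpos_lin28a + flank

# Write a single tab-separated BED record: chrom, start, end, name. 
with open("LIN28A.coords.bed", 'w') as outf:
    outf.write('\t'.join(str(i) for i in [chrom_lin28a, window_start, window_end, 'lin28a']) + '\n')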

In [None]:
## Now you can intersect the promoter and enhancer peak files with the LIN28A coordinates 

## promoter peaks 
!bedtools intersect -wa -a *promoter.bed -b LIN28A.coords.bed 

## You can also use the intersectBed shortcut; it does the same thing. 
!intersectBed -wa -a *promoter.bed -b LIN28A.coords.bed

In [None]:
## enhancer peaks 
!bedtools intersect -wa -a *enhancer.bed -b LIN28A.coords.bed

## and again with the shortcut command: 
!intersectBed -wa -a *enhancer.bed -b LIN28A.coords.bed
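
As a sanity check on the bedtools output, the same selection can be reproduced in a few lines of plain Python. This is a sketch under stated assumptions: it assumes both files are tab-separated with chromosome, start, and end in the first three fields, it mimics `-wa` by echoing each matching peak line unchanged, and the function names `read_intervals` and `peaks_overlapping` are made up here. A peak that overlaps several window records is returned once, whereas `bedtools intersect -wa` writes the entry once per overlap; with the single-record LIN28A.coords.bed that makes no difference.

```python
def read_intervals(path):
    """Yield (chrom, start, end, original_line) for each record in a BED file."""
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and optional header lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split("\t")
            yield fields[0], int(fields[1]), int(fields[2]), line

def peaks_overlapping(peaks_path, windows_path):
    """Return peak lines that share at least one base with any window record."""
    windows = [(c, s, e) for c, s, e, _ in read_intervals(windows_path)]
    hits = []
    for chrom, start, end, line in read_intervals(peaks_path):
        # Same chromosome and half-open interval overlap.
        if any(chrom == wc and start < we and ws < end for wc, ws, we in windows):
            hits.append(line)
    return hits
```

For example, `peaks_overlapping("ENCSR000DRY_predictions_promoter.bed", "LIN28A.coords.bed")` should list the same promoter peaks that the bedtools command printed.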