## Learning Goals
* Introduction to ENCODE datasets 
* Extract sequences corresponding to promoter or enhancer regions for a gene from ChIP-seq data 
    * Mastery of bedtools command:  ["intersectBed"](http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html) 
* Introduction to FASTQ and FASTA files
    * Mastery of bedtools command: ["getFastaFromBed"](http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html)

###  Linux command reference 
Link to [Unix Command Reference](../Unix_Basics.ipynb)

## Working with [ENCODE](https://www.encodeproject.org/) data


<img src="images/overview_2016-05May-17.png" style="width: 60%; height: 60%" align="center"/>
Promoter-like regions and enhancer-like regions are considered 'middle-level' data.

Promoter-like regions: 
<img src="images/Example.promoter.png" style="width: 60%; height: 60%" align="center"/>

Enhancer-like regions: 
<img src="images/Example.enhancerlike.png" style="width: 60%; height: 60%" align="center"/>


We will fetch a list of enhancer-like and promoter-like regions for the H1 embryonic stem cell line in humans (hg19). Select chromosome 1:1-10000000.  

Use these links to fetch the regions of interest: 

[Promoters](http://zlab-annotations.umassmed.edu/promoters/#)
Make sure your selections in the form match the values below: 
<img src="images/encode_promoter.png" align="center"/>

[Enhancers](http://zlab-annotations.umassmed.edu/enhancers/) 
Make sure your selections in the form match those below: 
<img src="images/encode_enhancer.png" align="center"/>


Click on the "Download" link and save the zipped files in the same directory as this notebook. 



In [1]:
## Extract the data you have downloaded. 
!unzip *promoter*zip 
!unzip *enhancer*zip 
!gzip -d *gz 
%ls

## Use the "more" command to examine the contents of the promoter file  
%more ENCSR000DRY_predictions.bed 

## Look at the first 10 lines of the enhancer file. 
## YOUR CODE HERE 


## the naming conventions of the files are not helpful. Let's rename the files to indicate which annotates 
## promoter regions and which annotates enhancer regions 
!mv ENCSR000DRY_predictions.bed ENCSR000DRY_predictions_promoter.bed 
!mv ENCSR000ANP_predictions.bed ENCSR000ANP_predictions_enhancer.bed

Archive:  20170417-184329-promoter-like-hg19-H3K4me3.v3.zip
 extracting: ENCSR000DRY_predictions.bed.gz  
Archive:  20170417-192546-enhancer-like-hg19-H3K27ac.v3.zip
 extracting: E093_H3K27ac_predictions.bed.gz  
 extracting: ENCSR000ANP_predictions.bed.gz  
20170417-184329-promoter-like-hg19-H3K4me3.v3.zip
20170417-192546-enhancer-like-hg19-H3K27ac.v3.zip
E093_H3K27ac_predictions.bed
ENCSR000ANP_predictions.bed
ENCSR000DRY_predictions.bed
Extracting_Regulatory_Sequences.ipynb
images/


#### What file format is the data in? What is contained in each of the first five columns? 
Your answer: 
#### Bonus question -- can you explain the meaning of the remaining columns in the file (column 6 and up)? 
Your answer:

## Bedtools Intersect 
Prior studies have shown that the gene [*LIN28A*](http://www.genecards.org/cgi-bin/carddisp.pl?gene=LIN28A) is associated with cell differentiation, and you hypothesize that this gene is likely to be expressed in the H1 cell line, as these are embryonic cells that are undergoing differentiation. Use the LIN28A GeneCards link above to find the chromosome and position of the LIN28A gene. Fill in the chromosome and the starting and ending coordinates of LIN28A below:


In [2]:
chrom_lin28a="" #replace "" with the chromosome of the gene 
startpos_lin28a=None #replace None with the gene start position 
endpos_lin28a=None #replace None with the gene end position

##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 
chrom_lin28a="chr1" #replace "" with the chromosome of the gene 
startpos_lin28a=26410778 #replace None with the gene start position 
endpos_lin28a=26429728 #replace None with the gene end position


We will use the bedtools intersect command to find the H1  promoter and enhancer peaks that are within 50kb of the LIN28A gene. 

In [3]:
## use the --help flag to learn about the inputs and outputs of the bedtools intersect command 
!bedtools intersect --help


Tool:    bedtools intersect (aka intersectBed)
Version: v2.26.0
Summary: Report overlaps between two feature files.

Usage:   bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>

	Note: -b may be followed with multiple databases and/or 
	wildcard (*) character(s). 
Options: 
	-wa	Write the original entry in A for each overlap.

	-wb	Write the original entry in B for each overlap.
		- Useful for knowing _what_ A overlaps. Restricted by -f and -r.

	-loj	Perform a "left outer join". That is, for each feature in A
		report each overlap with B.  If no overlaps are found, 
		report a NULL feature for B.

	-wo	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlaps restricted by -f and -r.
		  Only A features with overlap are reported.

	-wao	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlapping features restricted by -f 

In [4]:
## Aha! You need to store LIN28A's coordinates in a bed file in order to run the intersect command 
## Execute this code block to generate a bed file containing the LIN28A coordinates, including regions 50k 
## upstream & downstream of the gene 

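# Add 50 kb to each side of the gene. The padded coordinates go into new variables,
# so re-running this cell does not widen the window a second time.
flank=50000
window_start=max(0,startpos_lin28a-flank) # BED start coordinates cannot be negative
window_end=endpos_lin28a+flank

# The "with" block closes the file even if the write fails
with open("LIN28A.coords.bed",'w') as outf:
    outf.write('\t'.join([str(i) for i in [chrom_lin28a,window_start,window_end,'lin28a']])+'\n')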

In [5]:
## Now you can intersect the promoter and enhancer peak files with the LIN28A coordinates 

## promoter peaks 
!bedtools intersect -wa -a *promoter.bed -b LIN28A.coords.bed 

## you can use the intersectBed shortcut, it will do the same thing 
!intersectBed -wa -a *promoter.bed -b LIN28A.coords.bed

chr1	26436688	26441779	Proximal-Prediction-797	1	.	26436688	26439727	255,0,0
chr1	26436688	26441779	Proximal-Prediction-797	1	.	26436688	26439727	255,0,0


In [6]:
## enhancer peaks 
!bedtools intersect -wa -a *enhancer.bed -b LIN28A.coords.bed

## and again with the shortcut command: 
!intersectBed -wa -a *enhancer.bed -b LIN28A.coords.bed

chr1	26437411	26438670	Proximal-Prediction-3068	1	.	26438332	26438670	255,0,0
chr1	26437411	26438670	Proximal-Prediction-3068	1	.	26438332	26438670	255,0,0


You should observe a single promoter peak and a single enhancer peak located in the vicinity of the LIN28A gene. 
Now, we want to figure out what the underlying DNA sequence is. 
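
As a cross-check on the intersect results, the overlap test can be sketched in a few lines of Python. This is only an illustration of the idea, not how bedtools is implemented; the helper name `overlaps` is made up for this example, and the coordinates are the ones used in this notebook. BED intervals are half-open (the start is included, the end is not), so two intervals on the same chromosome overlap when each one starts before the other ends.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """Return True if two half-open intervals [start, end) share at least one base."""
    return start_a < end_b and start_b < end_a

# LIN28A window: the gene coordinates from this notebook, padded by 50 kb on each side.
window = ("chr1", 26410778 - 50000, 26429728 + 50000)

# Peaks reported by bedtools intersect (chrom, start, end).
promoter_peak = ("chr1", 26436688, 26441779)
enhancer_peak = ("chr1", 26437411, 26438670)

for name, (chrom, start, end) in [("promoter", promoter_peak), ("enhancer", enhancer_peak)]:
    # Intervals on different chromosomes can never overlap.
    hit = chrom == window[0] and overlaps(start, end, window[1], window[2])
    print(name, hit)
```

Both peaks should be reported as overlapping the padded window, in agreement with the bedtools output.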

## Introduction to FASTA and FASTQ data formats

You have already seen data in the FASTA format. The first line contains the sequence label, preceded by ">". The second line contains the actual sequence bases (A,C,G,T): 

**>FORJUSP02AJWD1** 

**CCGTCAATTCATTTAAGTTTTAACCTT**
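
In larger FASTA files, such as a reference genome, the sequence for one label is usually wrapped across many lines. The sketch below shows one way to read such records in plain Python; `parse_fasta` is a hypothetical helper written for this example (not part of bedtools), and the wrapped input is the example sequence above split in two by hand.

```python
def parse_fasta(lines):
    """Yield (label, sequence) pairs from an iterable of FASTA lines."""
    label, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record.
    if label is not None:
        yield label, "".join(chunks)

# The example record from above, with its sequence wrapped over two lines.
example = [">FORJUSP02AJWD1", "CCGTCAATTCATT", "TAAGTTTTAACCTT"]
records = list(parse_fasta(example))
print(records)
```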

FASTQ format takes this a step further by including sequence quality information  in ASCII characters. 
<img src="images/fastq_fig.jpg" align="center"/>



In [7]:
## You can convert the ASCII-encoded quality values to numeric Q scores with the 'ord' function. You must subtract 33
## from the converted value to obtain a Q score

quality_ascii='A:99@::??@@::FFAA'
numerical=[ord(c)-33 for c in quality_ascii]
print(numerical)

[32, 25, 24, 24, 31, 25, 25, 30, 30, 31, 31, 25, 25, 37, 37, 32, 32]
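
A Q score is easier to interpret once it is converted to an error probability. Under the standard Phred definition, Q = -10 * log10(P), so P = 10^(-Q/10): Q20 corresponds to a 1-in-100 chance that the base call is wrong, and Q30 to 1 in 1000. The sketch below applies this to the quality string from the previous cell; the function name `phred_to_error_prob` is chosen for this example, and the +33 offset is assumed, as above.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

quality_ascii = 'A:99@::??@@::FFAA'
q_scores = [ord(c) - 33 for c in quality_ascii]  # Phred+33 encoding
error_probs = [phred_to_error_prob(q) for q in q_scores]

# Report each quality character with its Q score and error probability.
for char, q, p in zip(quality_ascii, q_scores, error_probs):
    print(char, q, round(p, 5))

# Average the probabilities, not the Q scores, to estimate the expected error rate per base.
mean_error = sum(error_probs) / len(error_probs)
print("mean error probability:", mean_error)
```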


## Bedtools getFastaFromBed

Bedtools provides the **getFastaFromBed** command to extract the FASTA sequence from a specific set of chromosome coordinates. 

In [8]:
!bedtools getfasta 


Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.26.0
Summary: Extract DNA sequences from a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
	-fi	Input FASTA file
	-bed	BED/GFF/VCF file of ranges to extract from -fi
	-name	Use the name field for the FASTA header
	-split	given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
	-tab	Write output in TAB delimited format.
		- Default is FASTA format.

	-s	Force strandedness. If the feature occupies the antisense,
		strand, the sequence will be reverse complemented.
		- By default, strand information is ignored.

	-fullHeader	Use full fasta header.
		- By default, only the word before the first space or tab is used.



Alternatively, you can use the shortcut for the command: 

In [9]:
!fastaFromBed


Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.26.0
Summary: Extract DNA sequences from a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
	-fi	Input FASTA file
	-bed	BED/GFF/VCF file of ranges to extract from -fi
	-name	Use the name field for the FASTA header
	-split	given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
	-tab	Write output in TAB delimited format.
		- Default is FASTA format.

	-s	Force strandedness. If the feature occupies the antisense,
		strand, the sequence will be reverse complemented.
		- By default, strand information is ignored.

	-fullHeader	Use full fasta header.
		- By default, only the word before the first space or tab is used.



The command takes an input FASTA file, a BED file containing your regions of interest, and an output FASTA file name. The reference file in our case is **hg19.genome.fa**, which contains all DNA bases in the hg19 version of the human genome. You can access this file here:  
#TODO: Replace the nandi-specific path with the hg19.fa path on the class server 
```
/mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa 
```

In [13]:
%cat  /mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa | head -n10

>chr10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
cat: write error: Broken pipe


We need to store our LIN28A enhancer peak and promoter peak in a bed file. To do this, we can redirect the output from **intersectBed** to a file. 

In [14]:
## running intersectBed, like before, but this time we redirect the output to a new file 
!intersectBed -wa -a *promoter.bed -b LIN28A.coords.bed > LIN28A.peaks.bed 
!intersectBed -wa -a *enhancer.bed -b LIN28A.coords.bed >> LIN28A.peaks.bed 

##note the use of the ">>" syntax instead of the ">" syntax. 
## If your specified output file already exists, >> will append to the existing file rather than overwriting it. 
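
The same distinction exists in Python's `open` modes, which may help if you later write files from Python rather than from the shell: mode `'w'` behaves like `>` (truncate, then write) and mode `'a'` behaves like `>>` (append). A minimal sketch, writing to a temporary directory so that no notebook files are touched; the file name `demo.bed` and its contents are made up for this example.

```python
import os
import tempfile

# Work in a scratch directory so the notebook's own files are left alone.
path = os.path.join(tempfile.mkdtemp(), "demo.bed")

with open(path, "w") as handle:   # like ">": create the file or overwrite it
    handle.write("chr1\t5\t10\tfirst\n")

with open(path, "a") as handle:   # like ">>": keep existing lines, add to the end
    handle.write("chr1\t20\t25\tsecond\n")

with open(path) as handle:
    lines = handle.read().splitlines()
print(lines)
```

Swapping the second `"a"` for `"w"` would leave only the second line in the file.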

In [15]:
## check to see that the LIN28A.peaks.bed file is properly formatted 
%cat LIN28A.peaks.bed

chr1	26436688	26441779	Proximal-Prediction-797	1	.	26436688	26439727	255,0,0
chr1	26437411	26438670	Proximal-Prediction-3068	1	.	26438332	26438670	255,0,0


In [16]:
## now extract the FASTA sequence for the two peaks 
!fastaFromBed -fi /mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa -bed LIN28A.peaks.bed -fo LIN28A.peaks.fasta 

In [17]:
## examine the resulting FASTA file
%cat LIN28A.peaks.fasta 

>chr1:26436688-26441779
GGATCAATAACCCACACAGGTAGTTAGCTGTCTACGCAGAGCCTTCTATATTTTCCCTCTACTTTTGAGGGCGGGAAATAAGTCTTGGCTGACACCAGTGTTTCCAGCAAACAGCTGAAGACTTGACACTTGGATGGCACTCAGGACAAAAATCTGAACAGATTACAGAACCGTTGTTGTTTTTTTTTTTTAAGGGGAGGGTGTGTTACACCAGGAATGGGTATTCAAGGTAGACACATCCCAAAACAGACACTTCAGTGAGAAGAGGCAGAGGTgtttcctcacctgtaaaaacaggataataaaacctactcttgtggggtttggttaggattaagcaagatcatttatataaagacctaacccagtgccagacTCTCGGGAGAAGGGGCATAAAGGACAGAAAATAAAATGAGTTGATCTTTGAGCACAGCGACAGGGATTTTTTGTACATCGTGCCTTACAGAGCGAGCGCTGGGAGGAGGAATCATCACCCTGCTGAGAAGTAATGGGTGCAGAATAAATCGAGTAAGAAGGAATGGACTTTTCTGGCGGATGCTTACTGATCTAACAACTGGATGACTGAGGAATGCGGTCCTAAGTGACCCCGGGCTGGCTGTGGGGCCAACAGCGTGCGGGATCTGACTAACAGCGTGTGGCAGCCGAGCGACTGCAGCAGCGGCGGGGACCCCATTGACTGTCCAGCCCCGAGAGCGGCAGGAGCGCGGTGGATCCTGGCTCGGACCAAGGCCTCTCCCCTGCGCCGACTGCAGACCAACCCTCCACCAGGAGCCTCGCGGGGGGGCGCCGTGGGCACCGCAGCCAGGTGCAGGCGCCTCCGCGCCCTCACTCCCCACGCCGAGGAGGTGGCCCTGGCTCCCTGCGGCCGCAGCCGGGGAACAATGAGGCGCCGGCCTCTTTAAGGGAGATACCACCGCGCCGGCGGGGCAGGGGACGGCATCCTCCCAGATGATC
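
One quick sanity check on getfasta output is to compare each sequence's length with the coordinates in its header: with the default header format `chrom:start-end`, the sequence should be `end - start` bases long. The helper below is a hypothetical sketch for that check. It assumes the default header format and will reject headers produced with `-name`.

```python
import re

def expected_length(header):
    """Return end - start for a header of the form 'chrom:start-end' (leading '>' optional)."""
    match = re.match(r"^>?([^:]+):(\d+)-(\d+)", header)
    if match is None:
        raise ValueError("unrecognized header: " + header)
    start, end = int(match.group(2)), int(match.group(3))
    return end - start

# Headers for the two peaks stored in LIN28A.peaks.bed
for header in [">chr1:26436688-26441779", ">chr1:26437411-26438670"]:
    print(header, expected_length(header))
```

Comparing these numbers with `len(sequence)` for each record in LIN28A.peaks.fasta confirms that the full regions were extracted.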

### Exploring the capabilities of fastaFromBed
The LIN28A example above illustrates the default behavior of the **intersectBed** and **fastaFromBed** commands. Refer to lecture 4 if you would like a refresher on the full set of functionality of the **intersectBed** command. The code snippets below provide mini-examples to explore **fastaFromBed** 

In [18]:
## create a simple fasta file 
!echo ">chr1" >  test.fa 
!echo "AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG" >> test.fa 
#examine the contents of the file 
!cat test.fa

## create a simple bed file 
## printf expands \t to a tab in every shell; echo only does so in some shells 
!printf "chr1\t5\t10\tmyseq\n" >  test.bed 
#examine the contents of the file
!cat test.bed

>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
chr1	5	10	myseq


In [19]:
## first, the default behavior: 
!bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out
#examine the output
!cat test.fa.out

index file test.fa.fai not found, generating...
>chr1:5-10
AAACC
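
The result follows from the BED coordinate convention: start positions are 0-based and end positions are exclusive. This is the same convention as Python slicing, so the extraction can be reproduced by hand. The snippet is a sketch for checking your understanding, not a replacement for getfasta.

```python
# The sequence stored in test.fa
sequence = "AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG"

# BED interval chr1:5-10 covers 0-based positions 5, 6, 7, 8 and 9
start, end = 5, 10
extracted = sequence[start:end]

print(extracted)       # AAACC
print(len(extracted))  # 5, i.e. end - start
```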


Using the -name option, one can set the FASTA header for each extracted sequence to the “name” column (column 4) of the BED feature.

In [20]:
!bedtools getfasta -fi test.fa -bed test.bed -name -fo test.fa.out

#examine the output 
!cat test.fa.out


>myseq::chr1:5-10
AAACC


Using the -tab option, the -fo output file will be tab-delimited instead of in FASTA format.

In [21]:
!bedtools getfasta -fi test.fa -bed test.bed -tab -fo test.fa.out
#examine the output 
!cat test.fa.out

chr1:5-10	AAACC


bedtools getfasta will extract the sequence in the orientation defined in the strand column when the “-s” option is used.

In [22]:
#We re-write our test.bed file to include strand information: 
!printf "chr1\t20\t25\tforward\t1\t+\n" > test.bed 
!printf "chr1\t20\t25\treverse\t1\t-\n" >> test.bed 
!cat test.bed 

!bedtools getfasta -fi test.fa -bed test.bed -s -name -fo test.fa.out 
#examine the output 
!cat test.fa.out

chr1	20	25	forward	1	+
chr1	20	25	reverse	1	-
>forward::chr1:20-25(+)
CGCTA
>reverse::chr1:20-25(-)
TAGCG
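
The minus-strand record is the reverse complement of the plus-strand record: the bases are complemented (A↔T, C↔G) and the order is reversed. A minimal Python sketch that reproduces both records from the test sequence; `reverse_complement` is a helper written for this example.

```python
# Translation table mapping each base to its complement (case is preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]

# The sequence stored in test.fa
sequence = "AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG"

forward = sequence[20:25]              # plus-strand record for chr1:20-25
reverse = reverse_complement(forward)  # minus-strand record for the same interval

print(forward)  # CGCTA
print(reverse)  # TAGCG
```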
