## Learning Goals
* Introduction to FASTQ and FASTA files
     * Master the bedtools command ["getfasta"](http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html) (also known as fastaFromBed)

### Linux command reference
Link to [Unix Command Reference](../Unix_Basics.ipynb)

## Introduction to FASTA and FASTQ data formats

You have already seen data in the FASTA format. The first line contains the sequence label, preceded by ">". The second line contains the actual sequence bases (A,C,G,T): 

**>FORJUSP02AJWD1** 

**CCGTCAATTCATTTAAGTTTTAACCTT**
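
As a minimal sketch of how this layout can be read programmatically, the Python snippet below parses FASTA text into a dictionary that maps each label to its sequence. The function name `parse_fasta` is hypothetical. The sketch also accepts sequences wrapped over several lines, which real FASTA files commonly use.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {label: sequence} dictionary."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            label = line[1:]  # drop the leading ">"
            records[label] = ""
        elif label is not None:
            records[label] += line  # append (possibly wrapped) sequence lines
    return records


example = ">FORJUSP02AJWD1\nCCGTCAATTCATTTAAGTTTTAACCTT\n"
records = parse_fasta(example)
print(records)
```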

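FASTQ format takes this a step further by storing a quality score for every base, encoded as ASCII characters. Each record spans four lines: a label line beginning with "@", the sequence, a separator line beginning with "+", and the quality string, which has the same length as the sequence.
<img src="images/fastq_fig.jpg" align="center"/>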



In [1]:
## You can convert the ASCII-encoded quality values to numeric Q scores with the 'ord' function. You must subtract 33
## from the converted value to obtain a Q score

quality_ascii='A:99@::??@@::FFAA'
numerical=[ord(c)-33 for c in quality_ascii]
print(numerical)

[32, 25, 24, 24, 31, 25, 25, 30, 30, 31, 31, 25, 25, 37, 37, 32, 32]
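
A Phred quality score Q is defined as Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. The sketch below extends the cell above by converting each Q score to an error probability. It assumes the Phred+33 encoding used above (some older Illumina pipelines used an offset of 64 instead), and the variable names are illustrative.

```python
quality_ascii = 'A:99@::??@@::FFAA'
q_scores = [ord(c) - 33 for c in quality_ascii]  # Phred+33 decoding, as above

# Invert the Phred definition: Q = -10 * log10(P)  =>  P = 10 ** (-Q / 10)
error_probs = [10 ** (-q / 10) for q in q_scores]

# Show the first few bases: a higher Q means a lower error probability
for q, p in zip(q_scores[:3], error_probs[:3]):
    print(f"Q={q}  P(error)={p:.5f}")
```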


## Bedtools getfasta (fastaFromBed)

Bedtools provides the **getfasta** command (also available under the name **fastaFromBed**) to extract the FASTA sequence for a specific set of chromosome coordinates. 

In [2]:
!bedtools getfasta 


Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.24.0
Summary: Extract DNA sequences into a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta> 

Options: 
	-fi	Input FASTA file
	-bed	BED/GFF/VCF file of ranges to extract from -fi
	-fo	Output file (can be FASTA or TAB-delimited)
	-name	Use the name field for the FASTA header
	-split	given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
	-tab	Write output in TAB delimited format.
		- Default is FASTA format.

	-s	Force strandedness. If the feature occupies the antisense,
		strand, the sequence will be reverse complemented.
		- By default, strand information is ignored.

	-fullHeader	Use full fasta header.
		- By default, only the word before the first space or tab is used.



Alternatively, you can call the same tool through its standalone alias, **fastaFromBed**: 

In [3]:
!fastaFromBed


Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.24.0
Summary: Extract DNA sequences into a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta> 

Options: 
	-fi	Input FASTA file
	-bed	BED/GFF/VCF file of ranges to extract from -fi
	-fo	Output file (can be FASTA or TAB-delimited)
	-name	Use the name field for the FASTA header
	-split	given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
	-tab	Write output in TAB delimited format.
		- Default is FASTA format.

	-s	Force strandedness. If the feature occupies the antisense,
		strand, the sequence will be reverse complemented.
		- By default, strand information is ignored.

	-fullHeader	Use full fasta header.
		- By default, only the word before the first space or tab is used.



The command requires an input FASTA file, a BED file containing your regions of interest, and an output FASTA file name. The reference file in our case is **hg19.genome.fa**, which contains all DNA bases in the hg19 version of the human genome. You can access this file here:  
#TODO: Replace the nandi-specific path with the hg19.fa path on the class server 
```
/mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa 
```

In [4]:
%cat  /mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa | head -n10

cat: /mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa: No such file or directory


We need to store our LIN28A enhancer and promoter peaks in a BED file. To do so, we redirect the output of **intersectBed** to a file. 

In [5]:
## running intersectBed, like before, but this time we redirect the output to a new file 
!intersectBed -wa -a *promoter.bed -b LIN28A.coords.bed > LIN28A.peaks.bed 
!intersectBed -wa -a *enhancer.bed -b LIN28A.coords.bed >> LIN28A.peaks.bed 

##note the use of the ">>" syntax instead of the ">" syntax. 
## If your specified output file already exists, >> will append to the existing file rather than overwriting it. 

Error: Unable to open file *promoter.bed. Exiting.
Error: Unable to open file *enhancer.bed. Exiting.


In [6]:
## check to see that the LIN28A.peaks.bed file is properly formatted 
%cat LIN28A.peaks.bed

In [7]:
## now extract the FASTA sequence for the two peaks 
!fastaFromBed -fi /mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa -bed LIN28A.peaks.bed -fo LIN28A.peaks.fasta 

Error: The requested fasta database file (/mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa) could not be opened. Exiting!


In [8]:
## examine the resulting FASTA file
%cat LIN28A.peaks.fasta 

### Exploring the capabilities of fastaFromBed
The LIN28A example above illustrates basic usage of the **intersectBed** and **fastaFromBed** commands. Refer to lecture 4 if you would like a refresher on the full functionality of the **intersectBed** command. The code snippets below provide mini-examples for exploring the options of **fastaFromBed**. 

In [9]:
## create a simple fasta file 
!echo ">chr1" >  test.fa 
!echo "AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG" >> test.fa 
#examine the contents of the file 
!cat test.fa

## create a simple bed file 
## (printf expands "\t" to a tab in every shell; echo only does so in some) 
!printf "chr1\t5\t10\tmyseq\n" >  test.bed 
#examine the contents of the file
!cat test.bed

>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
chr1	5	10	myseq


In [10]:
## first, the default behavior: 
!bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out
#examine the output
!cat test.fa.out

index file test.fa.fai not found, generating...
>chr1:5-10
AAACC
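
BED intervals are 0-based and half-open: the start coordinate is included and the end coordinate is excluded. That convention matches Python slicing, so the result above can be cross-checked with a short sketch (the variable names are illustrative).

```python
seq = "AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG"  # contents of test.fa
start, end = 5, 10                                   # coordinates from test.bed

# BED start is inclusive and BED end is exclusive, exactly like a Python slice
extracted = seq[start:end]
print(f">chr1:{start}-{end}")
print(extracted)
```

The extracted sequence has length `end - start`, which is a quick sanity check for any BED interval.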


Using the -name option, one can set the FASTA header for each extracted sequence to be the “name” column from the BED feature.

In [11]:
!bedtools getfasta -fi test.fa -bed test.bed -name -fo test.fa.out

#examine the output 
!cat test.fa.out


>myseq
AAACC
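
The two header styles seen so far can be reproduced from the BED fields alone. The sketch below only mirrors the output of the v2.24.0 runs shown above. Other bedtools releases may format the `-name` header differently, so check the help text of your installed version.

```python
bed_line = "chr1\t5\t10\tmyseq"  # the single record in test.bed
chrom, start, end, name = bed_line.split("\t")[:4]

default_header = f">{chrom}:{start}-{end}"  # header in the default run above
named_header = f">{name}"                   # header in the -name run above
print(default_header)
print(named_header)
```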


Using the -tab option, the -fo output file will be tab-delimited instead of in FASTA format.

In [12]:
!bedtools getfasta -fi test.fa -bed test.bed -tab -fo test.fa.out
#examine the output 
!cat test.fa.out

chr1:5-10	AAACC
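
Tab-delimited output is convenient for downstream scripting because each line splits cleanly into a region label and a sequence. The sketch below parses it with Python's `csv` module. The output shown above is inlined as a string so that the snippet is self-contained; in practice you would open `test.fa.out` instead.

```python
import csv
import io

# Inlined copy of the tab-delimited output shown above
tab_output = "chr1:5-10\tAAACC\n"

# Each row holds two fields: the region label and the extracted sequence
reader = csv.reader(io.StringIO(tab_output), delimiter="\t")
sequences = {region: seq for region, seq in reader}
print(sequences)
```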


When the “-s” option is used, bedtools getfasta extracts each sequence in the orientation given by the strand column: features on the “-” strand are reverse complemented.

In [13]:
#We re-write our test.bed file to include strand information 
#(printf expands "\t" to a tab in every shell; echo only does so in some): 
!printf "chr1\t20\t25\tforward\t1\t+\n" > test.bed 
!printf "chr1\t20\t25\treverse\t1\t-\n" >> test.bed 
!cat test.bed 

!bedtools getfasta -fi test.fa -bed test.bed -s -name -fo test.fa.out 
#examine the output 
!cat test.fa.out

chr1	20	25	forward	1	+
chr1	20	25	reverse	1	-
>forward
CGCTA
>reverse
TAGCG
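
To see where the second sequence comes from, the reverse complement can be computed by hand. The sketch below uses a hypothetical helper, `reverse_complement`, together with the same slice-based extraction that BED coordinates imply.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


seq = "AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG"  # contents of test.fa

forward = seq[20:25]                   # "+" strand: sequence as stored
reverse = reverse_complement(forward)  # "-" strand: complement, then reverse
print(forward, reverse)
```

Both records cover the same coordinates, so they differ only in the strand from which the bases are read.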
