# Big Data for Biologists: Decoding Genomic Function- Class 3

## How can we predict the amino acid sequence for the protein product made from an mRNA sequence? 
## What are genomic coordinates and how are they used to designate the position of a gene in the human genome? 

##  Learning Objectives
***Students should be able to***
 <ol>
 <li><a href=#Aminoacidsequence>Use the genetic code to determine the amino acid sequence of the protein product of an mRNA sequence </a></li>
 <li><a href=#PythonDictionary>Make a Python dictionary for the genetic code (also called a look up table)</a> </li>
 <li><a href=#DefineFunction>Define and call a function in a Python script </a></li>
 <li><a href=#PredictProteinSequence>Predict a protein sequence from a processed mRNA sequence using a Python dictionary</a></li>
 <li><a href=#SaveFunction>Save functions to a .py file so they can be used in other programs </a></li> 
  <li><a href=#ReferenceGenome>Explain what a reference genome is </a></li>
 <li><a href=#GenomicCoordinates>Explain how genomic coordinates are used to designate the position of a gene or feature in the human reference genome </a></li>
 <li><a href=#BEDformat>Use genomic coordinates to make a file in BED format </a></li>
 <li><a href=#makeFASTAfromBED>Use the genomic analysis package Bedtools to make an mRNA sequence with exons from a gene sequence</a></li>



## How can we use the genetic code to determine the amino acid sequence of the protein product made from an mRNA sequence? <a name='Aminoacidsequence' />

In the last class, we wrote code to transcribe DNA to pre-mRNA and concluded by finding the start and stop codons in an mRNA sequence. Today we are going to look at the next step in gene expression, the translation of an mRNA sequence into protein. 

<img src="../Images/1-CentralDogma.png" style="width: 40%; height: 50%" align="center"//>

As a reminder, during translation, every three bases in an mRNA sequence, starting at the start codon, code for one amino acid. These three-base sequences are called codons. 

The start codon, as we saw previously, is AUG in the mRNA (ATG in the DNA sequence), which codes for the amino acid methionine. Below is the **genetic code** for the rest of the amino acids as well as the three stop codons. 


<img src="../Images/3-Genetic Code.png" style="width: 40%; height: 40%" align="center"//>

 

In the final step of our last class we found three possible combinations of start and stop codons: 

* start codon: 60 stop codon: 390 orf length: 332
* start codon: 72 stop codon: 390 orf length: 320
* start codon: 442 stop codon: 448 orf length: 8

The actual combination of start and stop codons is often the combination that results in the longest sequence, but the true start and stop codons need to be determined experimentally. 

In this case, the actual start codon for human insulin is at base 60 in the mRNA sequence that we were using in the last class. 

Here is the mRNA sequence for the first 21 bases (seven codons) of the insulin coding sequence, starting at base 60 of the sequence from the previous class: 

AUGGCCCUGUGGAUGCGCCUC 

As an exercise, write out the amino acid sequence corresponding to the mRNA sequence. 

The amino acid sequence should be: 

MALWMRL

We are now going to learn additional Python tools to help us create the code to write out the amino acid sequence of a protein product that is made from an mRNA sequence.

## Making a python dictionary for the genetic code<a name='PythonDictionary' />

The python code we will be writing today has some similarities to scripts that we looked at in the last class. 

Last time, when we wrote out the complementary DNA sequence we made four substitutions and when we wrote out the mRNA sequence we made one substitution. 

To write out the amino acid sequence that will be produced from an mRNA sequence, we will need to make sixty-four substitutions, one for each possible codon! 

To simplify the code, we will use a Python tool known as dictionaries or look-up tables. 

Python dictionaries let you define a number of substitutions in one line rather than as a series of lines. 

There are a few different ways to define a dictionary in Python. If you are interested, this  [link](https://docs.python.org/2/library/stdtypes.html#typesmapping) gives the complete syntax options. 

Before we define the dictionary for protein translation, we are going to practice writing a dictionary that we could use to write out a complementary DNA sequence.

We'll use the same code as last time but we will use a python dictionary to do the substitutions instead of using if statements. 

The syntax for creating the dictionary is: 

DNAdict={'A':'T','T':'A','G':'C','C':'G'}

DNAdict is the name of the dictionary.

For the first entry, the 'A' before the : is the key (the element in the original sequence). The 'T' after the : is the value (the element in the new sequence). 
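
As a quick illustration of how a look-up works, here is a minimal sketch that uses the same DNAdict on a short sequence. The sequence `ATGC` is made up for this example and does not come from the insulin file.

```python
# The same dictionary as above: each original base is paired with its complement
DNAdict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

# Look up a single entry with square brackets
print(DNAdict['A'])  # prints T

# Look up every base in a short, made-up example sequence
example = 'ATGC'
complement = ''
for base in example:
    complement = complement + DNAdict[base]
print(complement)  # prints TACG
```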


In [None]:
#Write out the complementary sequence for a DNA sequence using a look up table
FASTAgenesequence=open('../class_1/data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])
genesequence=''.join(genesequence)
genesequence=genesequence.replace('\n','') #this line removes the \n values from the genesequence

#This line defines the substitutions that will be made when the dictionary is called 
DNAdict={'A':'T','T':'A','G':'C','C':'G'}

complementarysequence='' #this defines the variable 'complementarysequence'
for i in genesequence:
    #this line looks up the base i in the dictionary and adds its complement to complementarysequence
    complementarysequence=complementarysequence+str(DNAdict[i])
#[::-1] reverses the string before it is printed
print (complementarysequence[::-1])

Using the space below, start writing a python dictionary for the genetic code starting with the four entries for the upper left hand corner of the table in the figure above. 

In [None]:

##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 
geneticcode3let={'UUU':'Phe','UUC':'Phe','UUA':'Leu','UUG':'Leu',
      }

We will be using a complete dictionary for the genetic code below when we write out the protein sequence for an mRNA sequence, but first we will be covering one other tool for writing more complex scripts, creating functions. 

## Defining and calling functions in Python scripts <a name='DefineFunction' />

As you start to write more complex Python scripts, it is very helpful to define a **function** for any set of commands that accomplishes a particular task and is used repeatedly, either within a script or across scripts. 

Once the function is defined the series of commands can be run with one line of code rather than multiple lines. 

An example of a task that we could put into a function is converting between DNA and RNA sequences (Ts to Us). 

We wrote the code for the conversion of DNA to RNA sequences in the last class. In this class we are going to write that as a function.   

Functions are defined using the **def** keyword followed by the name of the function and then, in parentheses, any necessary inputs separated by commas. The line ends with a colon. 

In the example below: 

    def write_mRNA_from_DNA(filename_FASTAgenesequence):
 
defines a function called "write_mRNA_from_DNA". 

and 

    RNAsequence=write_mRNA_from_DNA('../class_1/data/Human-Insulin NM_000207.2.txt')

calls the function "write_mRNA_from_DNA". 

Within the function, the input file name '../class_1/data/Human-Insulin NM_000207.2.txt' is assigned the name 'filename_FASTAgenesequence'.  

At the end, the "return" command defines the output of the function that is saved for the rest of the program. 
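
To see the pieces (def, input, return, call) in isolation, here is a minimal sketch of a function that works on a sequence typed directly into the script rather than on a file. The function name `DNA_to_RNA` and the example sequence are made up for illustration.

```python
# def starts the definition; DNAsequence is the name the input gets inside the function
def DNA_to_RNA(DNAsequence):
    RNAsequence = ''
    for base in DNAsequence:
        if base == 'T':
            RNAsequence = RNAsequence + 'U'
        else:
            RNAsequence = RNAsequence + base
    # return hands the result back to the line that called the function
    return RNAsequence

# Calling the function: the value in parentheses is assigned to DNAsequence
example_RNA = DNA_to_RNA('ATGGCC')
print(example_RNA)  # prints AUGGCC
```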

In [None]:
#define a function to write mRNA from DNA sequence in FASTA format

def write_mRNA_from_DNA(filename_FASTAgenesequence):
    FASTAgenesequence=open(filename_FASTAgenesequence,'r')
    genesequence=(FASTAgenesequence.readlines()[1:])
    genesequence=''.join(genesequence)
    genesequence=genesequence.replace('\n','')
    RNAsequence='' #this defines the variable 'RNAsequence'
    for i in genesequence:
        if i=='T':
            RNAsequence=RNAsequence+'U'
        else:
            RNAsequence=RNAsequence+ i
    return(RNAsequence)

RNAsequence=write_mRNA_from_DNA('../class_1/data/Human-Insulin NM_000207.2.txt')
print(RNAsequence)

## Predict a protein sequence from a processed RNA sequence using a Python dictionary<a name='PredictProteinSequence' />

We are now ready to write our script to predict a protein sequence from a processed mRNA sequence. 

In the code, we will call a function and use a dictionary. 

The code has three main sections. 
1. Convert the Ts in the processed mRNA sequence to Us using the write_mRNA_from_DNA function
2. Define a complete python dictionary for the three letter genetic code
3. Iterate over the RNAsequence from the start to the stop codon and convert each codon to the corresponding amino acid defined in the dictionary. 

We will use base 60 as the position of the start codon and base 390 as the position of the stop codon, since those are the start and stop codons in the actual insulin sequence.  
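
Step 3 relies on moving through a sequence three bases at a time. As a hint, here is a small sketch of that pattern on the short example sequence from the beginning of this class. It only collects and prints the codons and does not do the translation; the variable names are made up for this example.

```python
# The seven-codon example sequence from the start of this class
short_mRNA = 'AUGGCCCUGUGGAUGCGCCUC'

codons = []
# range(start, stop, step): i takes the values 0, 3, 6, ... and stops before len(short_mRNA)
for i in range(0, len(short_mRNA), 3):
    # short_mRNA[i:i+3] is the slice holding the three bases that begin at index i
    codons.append(short_mRNA[i:i+3])
print(codons)
```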

Complete the code below to write out the protein sequence for an mRNA sequence. 
 

In [None]:
#Write out the protein sequence for a mRNA sequence

#calls the function defined above called write_mRNA_from_DNA
RNAsequence=

#defines the python dictionary for the three letter genetic code 
geneticcode3let={'UUU':'Phe','UUC':'Phe','UUA':'Leu','UUG':'Leu',
     'CUU':'Leu','CUC':'Leu','CUA':'Leu','CUG':'Leu',
     'AUU':'Ile','AUC':'Ile','AUA':'Ile','AUG':'Met',
     'GUU':'Val','GUC':'Val','GUA':'Val','GUG':'Val',
     'UCU':'Ser','UCC':'Ser','UCA':'Ser','UCG':'Ser',
     'CCU':'Pro','CCC':'Pro','CCA':'Pro','CCG':'Pro',
     'ACU':'Thr','ACC':'Thr','ACA':'Thr','ACG':'Thr',
     'GCU':'Ala','GCC':'Ala','GCA':'Ala','GCG':'Ala',
     'UAU':'Tyr','UAC':'Tyr','UAA':'Stop','UAG':'Stop',
     'CAU':'His','CAC':'His','CAA':'Gln','CAG':'Gln',
     'AAU':'Asn','AAC':'Asn','AAA':'Lys','AAG':'Lys',
     'GAU':'Asp','GAC':'Asp','GAA':'Glu','GAG':'Glu',
     'UGU':'Cys','UGC':'Cys','UGA':'Stop','UGG':'Trp',
     'CGU':'Arg','CGC':'Arg','CGA':'Arg','CGG':'Arg',
     'AGU':'Ser','AGC':'Ser','AGA':'Arg','AGG':'Arg',
     'GGU':'Gly','GGC':'Gly','GGA':'Gly','GGG':'Gly'}

#translates the RNAsequence into protein 

proteinseq=''

for i in range(59,390,3): 
    proteinseq=proteinseq+str(geneticcode3let[])
print ( )



##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 

#Write out the protein sequence for a mRNA sequence

#calls the function defined above called write_mRNA_from_DNA
RNAsequence=write_mRNA_from_DNA('../class_1/data/Human-Insulin NM_000207.2.txt') 

#defines the python dictionary for the three letter genetic code 
geneticcode3let={'UUU':'Phe','UUC':'Phe','UUA':'Leu','UUG':'Leu',
     'CUU':'Leu','CUC':'Leu','CUA':'Leu','CUG':'Leu',
     'AUU':'Ile','AUC':'Ile','AUA':'Ile','AUG':'Met',
     'GUU':'Val','GUC':'Val','GUA':'Val','GUG':'Val',
     'UCU':'Ser','UCC':'Ser','UCA':'Ser','UCG':'Ser',
     'CCU':'Pro','CCC':'Pro','CCA':'Pro','CCG':'Pro',
     'ACU':'Thr','ACC':'Thr','ACA':'Thr','ACG':'Thr',
     'GCU':'Ala','GCC':'Ala','GCA':'Ala','GCG':'Ala',
     'UAU':'Tyr','UAC':'Tyr','UAA':'Stop','UAG':'Stop',
     'CAU':'His','CAC':'His','CAA':'Gln','CAG':'Gln',
     'AAU':'Asn','AAC':'Asn','AAA':'Lys','AAG':'Lys',
     'GAU':'Asp','GAC':'Asp','GAA':'Glu','GAG':'Glu',
     'UGU':'Cys','UGC':'Cys','UGA':'Stop','UGG':'Trp',
     'CGU':'Arg','CGC':'Arg','CGA':'Arg','CGG':'Arg',
     'AGU':'Ser','AGC':'Ser','AGA':'Arg','AGG':'Arg',
     'GGU':'Gly','GGC':'Gly','GGA':'Gly','GGG':'Gly'}

#translates the RNAsequence into protein 

#In the range command: 
#The first number is the index of the first base of the start codon, using numbering starting at zero (base 60 is index 59).
#The second number is where the range stops. This value itself is not included, so the last codon read begins at index 389 (base 390), which is the stop codon.
#The third number is the step size: we move forward three bases every iteration since codons come in threes.

proteinseq=''

for i in range(59,390,3): 
    proteinseq=proteinseq+str(geneticcode3let[RNAsequence[i:i+3]])
print (proteinseq)

In the code below we have started a script to write out the same protein sequence using one letter amino acid symbols. Add in the missing lines to write out the protein sequence using the one letter amino acid symbols. 

In [None]:
#calls the function defined above called write_mRNA_from_DNA
RNAsequence=write_mRNA_from_DNA('../class_1/data/Human-Insulin NM_000207.2.txt') 

#defines the python dictionary for the one letter genetic code 
geneticcode1let={'UUU':'F','UUC':'F','UUA':'L','UUG':'L',
     'CUU':'L','CUC':'L','CUA':'L','CUG':'L',
     'AUU':'I','AUC':'I','AUA':'I','AUG':'M',
     'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
     'UCU':'S','UCC':'S','UCA':'S','UCG':'S',
     'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
     'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
     'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
     'UAU':'Y','UAC':'Y','UAA':'*','UAG':'*',
     'CAU':'H','CAC':'H','CAA':'Q','CAG':'Q',
     'AAU':'N','AAC':'N','AAA':'K','AAG':'K',
     'GAU':'D','GAC':'D','GAA':'E','GAG':'E',
     'UGU':'C','UGC':'C','UGA':'*','UGG':'W',
     'CGU':'R','CGC':'R','CGA':'R','CGG':'R',
     'AGU':'S','AGC':'S','AGA':'R','AGG':'R',
     'GGU':'G','GGC':'G','GGA':'G','GGG':'G'}

 

##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 

geneticcode1let={'UUU':'F','UUC':'F','UUA':'L','UUG':'L',
     'CUU':'L','CUC':'L','CUA':'L','CUG':'L',
     'AUU':'I','AUC':'I','AUA':'I','AUG':'M',
     'GUU':'V','GUC':'V','GUA':'V','GUG':'V',
     'UCU':'S','UCC':'S','UCA':'S','UCG':'S',
     'CCU':'P','CCC':'P','CCA':'P','CCG':'P',
     'ACU':'T','ACC':'T','ACA':'T','ACG':'T',
     'GCU':'A','GCC':'A','GCA':'A','GCG':'A',
     'UAU':'Y','UAC':'Y','UAA':'*','UAG':'*',
     'CAU':'H','CAC':'H','CAA':'Q','CAG':'Q',
     'AAU':'N','AAC':'N','AAA':'K','AAG':'K',
     'GAU':'D','GAC':'D','GAA':'E','GAG':'E',
     'UGU':'C','UGC':'C','UGA':'*','UGG':'W',
     'CGU':'R','CGC':'R','CGA':'R','CGG':'R',
     'AGU':'S','AGC':'S','AGA':'R','AGG':'R',
     'GGU':'G','GGC':'G','GGA':'G','GGG':'G'}

proteinseq=''
for i in range(59,390,3):
    proteinseq=proteinseq+str(geneticcode1let[RNAsequence[i:i+3]])
print (proteinseq)

## Save functions to a .py file that can be imported into other programs<a name='SaveFunction' />

Once you have written a function or set of functions, it is helpful to be able to save the function(s) in a format that lets you use them in other scripts. 

In Python, a file with a set of functions is called a **module**. Module files are saved with the extension .py and they can be called from other Python scripts using the import command. 
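
The sketch below walks through the import mechanism from start to finish without touching the class files. It writes a tiny module to a temporary folder, tells Python where to look for it, and then imports a function from it. The module name `tiny_helpers` and the function `count_bases` are made up for this illustration.

```python
import os
import sys
import tempfile

# Make a temporary folder and write a one-function module file into it
module_folder = tempfile.mkdtemp()
module_path = os.path.join(module_folder, 'tiny_helpers.py')
with open(module_path, 'w') as module_file:
    module_file.write("def count_bases(sequence):\n")
    module_file.write("    return len(sequence)\n")

# Python searches the folders listed in sys.path when it sees an import command
sys.path.append(module_folder)

# Import the function from the module and call it like any other function
from tiny_helpers import count_bases
print(count_bases('AUGGCC'))  # prints 6
```

The same pattern should apply to any folder of helper files: add the folder to sys.path, then import the functions you need by name.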
 
In the course of this class, you will learn about the vast resources of .py files that are publicly available and that you can use to view or analyze sequences or data without having to write algorithms from scratch. 

Below, we are going to write two functions, write_mRNA_from_DNA and write_protein_from_mRNA, to a .py file called central_dogma_helpers.py.  

The first line defines the name of the .py file and provides the instructions to write the contents of the box to a file. 

We already copied the first function, write_mRNA_from_DNA, for you into the box. 

Fill in the code for the second function, write_protein_from_mRNA. You will need the code that you wrote above as well as your knowledge about how to format a function. 

In [None]:
%%writefile ../helpers/central_dogma_helpers.py
#The %%writefile line above must be the first line of the box. It writes the code from this box in the notebook into a file. 

def write_mRNA_from_DNA(filename_FASTAgenesequence):
    FASTAgenesequence=open(filename_FASTAgenesequence,'r')
    genesequence=(FASTAgenesequence.readlines()[1:])
    genesequence=''.join(genesequence)
    genesequence=genesequence.replace('\n','')
    RNAsequence='' #this defines the variable 'RNAsequence'
    for i in genesequence:
        if i=='T':
            RNAsequence=RNAsequence+'U'
        else:
            RNAsequence=RNAsequence+ i
    return(RNAsequence)


##ANSWER -- REMOVE BEFORE GIVING TO STUDENTS ## 

def write_mRNA_from_DNA(filename_FASTAgenesequence):
    FASTAgenesequence=open(filename_FASTAgenesequence,'r')
    genesequence=(FASTAgenesequence.readlines()[1:])
    genesequence=''.join(genesequence)
    genesequence=genesequence.replace('\n','')
    RNAsequence='' #this defines the variable 'RNAsequence'
    for i in genesequence:
        if i=='T':
            RNAsequence=RNAsequence+'U'
        else:
            RNAsequence=RNAsequence+ i
    return(RNAsequence)



def write_protein_from_mRNA(RNAsequence):

    #defines the python dictionary for the three letter genetic code 
    geneticcode3let={'UUU':'Phe','UUC':'Phe','UUA':'Leu','UUG':'Leu',
         'CUU':'Leu','CUC':'Leu','CUA':'Leu','CUG':'Leu',
         'AUU':'Ile','AUC':'Ile','AUA':'Ile','AUG':'Met',
         'GUU':'Val','GUC':'Val','GUA':'Val','GUG':'Val',
         'UCU':'Ser','UCC':'Ser','UCA':'Ser','UCG':'Ser',
         'CCU':'Pro','CCC':'Pro','CCA':'Pro','CCG':'Pro',
         'ACU':'Thr','ACC':'Thr','ACA':'Thr','ACG':'Thr',
         'GCU':'Ala','GCC':'Ala','GCA':'Ala','GCG':'Ala',
         'UAU':'Tyr','UAC':'Tyr','UAA':'Stop','UAG':'Stop',
         'CAU':'His','CAC':'His','CAA':'Gln','CAG':'Gln',
         'AAU':'Asn','AAC':'Asn','AAA':'Lys','AAG':'Lys',
         'GAU':'Asp','GAC':'Asp','GAA':'Glu','GAG':'Glu',
         'UGU':'Cys','UGC':'Cys','UGA':'Stop','UGG':'Trp',
         'CGU':'Arg','CGC':'Arg','CGA':'Arg','CGG':'Arg',
         'AGU':'Ser','AGC':'Ser','AGA':'Arg','AGG':'Arg',
         'GGU':'Gly','GGC':'Gly','GGA':'Gly','GGG':'Gly'}

    #translates the RNAsequence into protein 

    #In the range command: 
    #The first number is the index of the first base of the start codon, using numbering starting at zero (base 60 is index 59).
    #The second number is where the range stops. This value itself is not included, so the last codon read begins at index 389 (base 390), which is the stop codon.
    #The third number is the step size: we move forward three bases every iteration since codons come in threes.

    proteinseq=''

    for i in range(59,390,3): 
        proteinseq=proteinseq+str(geneticcode3let[RNAsequence[i:i+3]])
    return (proteinseq)

You should now be able to see a ../helpers/central_dogma_helpers.py file. Take a look at the file and make sure you see what you would expect. 

## What is a reference genome? <a name='ReferenceGenome' />
 
Now that you have learned some tools for how to work with single DNA, RNA and protein sequences, we are going to start to learn about genomics data. 

We will be working with data from the Human Genome Project as well as larger scale sequencing projects such as the 1000 Genomes Project.  

The Human Genome Project produced what is called a human **reference genome**, a publicly available, mostly complete sequence of the human genome that the scientific community agreed to use as a basis for comparison for new sequencing information. 

The sequence is "mostly complete" because some regions are difficult to sequence. The human genome still has some gaps.  

The initial reference genome was made from the sequences of a small number of individuals. Now, with more individuals having been sequenced, the reference genome captures more, but still not all, of human genetic diversity. 

Researchers are still updating the reference genome. It is maintained by a consortium. You can find out more details [here](https://www.ncbi.nlm.nih.gov/grc/human).  

As an introduction to working with data from the human genome, we are going to look at how to find, in the human reference genome, the human insulin sequence that we have been working with.


## What are genomic coordinates and how are they used to designate the position of a gene in the human genome?<a name='GenomicCoordinates' />

The positions of genes in the human reference genome are specified by their **genomic coordinates**. 

The genomic coordinates for a gene include the chromosome, the start position, and the stop position. 

Whenever you are using genomic coordinates, it's important to also keep track of the version of the reference genome, because these numbers change as the reference genome sequence is updated. 
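
One simple way to keep these pieces together in Python is to store them in a dictionary. This is only a sketch: the key names and the coordinate values below are made up for illustration and are not the real position of any gene.

```python
# Hypothetical coordinates; the numbers are placeholders, not a real gene position
gene_coordinates = {'genome_version': 'GRCh38',
                    'chromosome': 'chr11',
                    'start': 1000,
                    'end': 2500}

# Build a chromosome:start-end label and print it together with the genome version
region = (gene_coordinates['chromosome'] + ':' + str(gene_coordinates['start'])
          + '-' + str(gene_coordinates['end']))
print(region, 'in', gene_coordinates['genome_version'])
```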

Human genomes typically have 23 pairs of chromosomes. Reference genomes are typically haploid, meaning that they have sequence information for one copy of each chromosome (the 22 numbered chromosomes plus X and Y) rather than for both members of each pair. 

<img src="../Images/3-Human Chromosomes.jpg" style="width: 40%; height: 50%" align="center"//>


## Use genomic coordinates to make a file in BED format<a name='BEDformat' />


The format for specifying genomic coordinates depends on the particular programs that you plan to use. There are several formats, including BED, VCF, and GFF/GTF. 

For this class, we will primarily be using a suite of programs called Bedtools, so we will be using the BED format for genomic coordinates. 

BED files are text files that contain genomic coordinate information, and are typically given the .bed extension. The format of a BED file is specified here: https://genome.ucsc.edu/FAQ/FAQformat.html#format1. Only the first three columns are mandatory for bed files, and they contain the following information:

Columns in a BED file:
- Column 1: chromosome name, for example chr11 (chr1 to chr22 for the numbered chromosomes and chrX or chrY for the sex chromosomes) 
- Column 2: start position (the beginning of the first base is indicated by the start position "0"; the beginning of the 5th base is indicated by the start position "4")
- Column 3: end position (the end of the first base is indicated by the end position "1"; the end of the 5th base is indicated by the end position "5")

People are often confused by the fact that the same "base" is referred to by a different number depending on whether you are referring to the start or the end. A simple way to understand this is to realize that the positions are not referring to the numbering of the bases themselves, but to the boundary between bases, as illustrated in the figure below:

<img src="./array_slice_indexing.png">

As a reminder from our previous class, this convention is also consistent with how slicing in python works, as illustrated below:

In [None]:
dna_string = "ACCTG"
print(dna_string[0:4])
print(dna_string[1:5])
print(dna_string[0:5])

We can use Python to make a file in BED format. 

For this example, we are going to make a .bed file that allows us to print the exon regions of the human insulin gene from a FASTA file of the human insulin gene. 

Originally, the exon boundaries would be determined experimentally, but we can obtain them from genomics resources such as this [link](https://www.ncbi.nlm.nih.gov/nuccore/161086962?report=genbank&from=4986&to=6416) from NCBI.

The exon boundaries for the human insulin gene that we looked at in classes one and two are: 
- 1-42
- 222-425
- 1213-1431
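
These boundaries count the first base as position 1 and include both end points, while BED start positions count from 0. Assuming that convention, the conversion is to subtract 1 from the start and leave the end unchanged. A minimal sketch follows; the function name `to_bed_interval` is made up for this example.

```python
def to_bed_interval(first_base, last_base):
    # BED start is zero-based, so subtract 1; the BED end stays the same
    return (first_base - 1, last_base)

# Exon boundaries listed above, as (first base, last base)
exons = [(1, 42), (222, 425), (1213, 1431)]
for first_base, last_base in exons:
    start, end = to_bed_interval(first_base, last_base)
    # end - start gives the number of bases in the interval
    print('chr11', start, end, 'length:', end - start)
```

A useful side effect of the BED convention is that the length of a region is simply the end minus the start.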
 

In [32]:
#Note that the exon locations have been adjusted to the zero-based numbering system described above
#\t is a tab character, and \n is a newline character
file1 = open('human_insulin_exon_boundaries.bed', 'w') #opens a new file with the name "human_insulin_exon_boundaries.bed" for writing
#writes the lines of the file
file1.write("chr11\t0\t42\n") 
file1.write("chr11\t221\t425\n")
file1.write("chr11\t1212\t1431\n")
file1.close()



#NEED TO CHECK NUMBERS FOR REFERENCE GENOME-- these are GRCh37
#file1.write("chr11\t2181009\t2181227\n") 
#file1.write("chr11\t2182015\t2182218\n")
#file1.write("chr11\t2182398\t2182439\n")

#NEED TO CHECK NUMBERS FOR REFERENCE GENOME-- these are GRCh38
#file1.write("chr11\t2159779\t2159997\n") 
#file1.write("chr11\t2160785\t2160988\n")
#file1.write("chr11\t2161168\t2161209\n")

Let's view the lines in the created file:

In [33]:
file1 = open('human_insulin_exon_boundaries.bed', 'r')
file1_contents=file1.read()
print(file1_contents)

chr11	0	42
chr11	221	425
chr11	1212	1431



## Use the genomic analysis package Bedtools to make an mRNA sequence with exons from a gene sequence<a name='makeFASTAfromBED' />


We are now going to use Bedtools to obtain the sequence of the exon regions for the human insulin gene from the human reference genome. 

We can use the output of Bedtools to make the mRNA sequence by pasting the exon sequences together using Python. 

Bedtools provides the getFastaFromBed command to extract the FASTA sequence from a specific set of chromosome coordinates.

The FASTA sequences must contain chromosome information in the headers.  

For details on the syntax of the command see ["getFastaFromBed"](http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html)
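
Conceptually, extracting the sequence for a BED interval is the same as slicing a string with the start and end positions. The sketch below illustrates that idea in plain Python. It is not how Bedtools itself is implemented, and the sequence name `chrTest`, the sequence, and the intervals are all made up for the example.

```python
# A made-up "chromosome" sequence stored under its name
sequences = {'chrTest': 'ACCTGGATTACA'}

# Made-up BED lines: chromosome, start, and end separated by tabs
bed_lines = ['chrTest\t0\t5', 'chrTest\t7\t12']

extracted = []
for line in bed_lines:
    chromosome, start, end = line.split('\t')
    # BED start and end positions work like Python slice positions
    piece = sequences[chromosome][int(start):int(end)]
    extracted.append(piece)
    # Print a FASTA-style header followed by the extracted sequence
    print('>' + chromosome + ':' + start + '-' + end)
    print(piece)
```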

The syntax for the full command is:

    !bedtools getfasta

A shortcut is: 

    !fastaFromBed

The command requires an input FASTA file, a BED file containing your regions of interest, and an output FASTA file name. The reference file in our case is hg19.fa, which contains all DNA bases in the hg19 version of the human genome. You can access this file here:

#TODO: Replace the nandi-specific path with the hg19.fa path on the class server
/mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa



In [None]:
!head -n 10 /mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa

In [35]:
## first, the default behavior: 
!bedtools getfasta -fi data/Human-Insulin-NG_007114.1.txt -bed human_insulin_exon_boundaries.bed -fo human_insulin_exons.fa.out
#examine the output
!cat human_insulin_exons.fa.out

##NEED TO REPLACE WITH Command for hg19.genome.fa 
#!bedtools getfasta -fi hg19.genome.fa  -bed human_insulin_exons.bed -fo human_insulin_exons.fa.out


>chr11:0-42
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAG
>chr11:221-425
ATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGG
>chr11:1212-1431
TGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCGCCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGC


We can use the startswith command in Python to take out the lines starting with '>'. Most of the other lines in this next script should already be familiar! 

In [62]:
#opens the FASTA sequences output from the bedtools command
file1 = open('human_insulin_exons.fa.out','r')
file1_contents=file1.readlines()
file1.close()

mRNA=[]
for line in file1_contents:
    if not line.startswith('>'):
        #if the line does not start with > it gets added to mRNA
        mRNA.append(line)

#Joins the exon sequences together and removes the linebreaks from mRNA 
mRNA=''.join(mRNA)
mRNA=mRNA.replace('\n','')
print(mRNA)

#Writes out the mRNA sequence to a file
file3= open('human_insulin_mRNA.out','w')
file3.write(mRNA)
file3.close()
         

AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCGCCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGC


If you compare the sequence you made above to the mRNA sequence that we got from NCBI in the last class, they should be the same.  

As we have mentioned previously, it can be important in genomics studies to keep track of which strand the sequence you are looking at comes from. 

Strand information can also be provided in extra columns in a BED file. 

Adding to the BED file columns from above: 
- Column 1: chromosome name, for example chr11 (chr1 to chr22 for the numbered chromosomes and chrX or chrY for the sex chromosomes)
- Column 2: start position (the beginning of the first base is indicated by the start position "0"; the beginning of the 5th base is indicated by the start position "4")
- Column 3: end position (the end of the first base is indicated by the end position "1"; the end of the 5th base is indicated by the end position "5")
- Column 4: name, which defines the BED feature, e.g. the exon number
- Column 5: score. We'll hear more about this later. 
- Column 6: strand, which can be either '+' or '-' 

Bedtools getfasta extracts the sequence in the orientation defined in the strand column when the “-s” option is used.
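
For a feature on the '-' strand, the extracted sequence is the reverse complement of the '+' strand sequence. As a sketch of what that means, the code below builds the reverse complement of the first exon sequence from the earlier Bedtools output, using the same dictionary approach we used at the start of this class. The variable names are chosen just for this example.

```python
DNAdict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

# Sequence reported earlier for chr11:0-42 on the forward strand
forward = 'AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAG'

complement = ''
for base in forward:
    complement = complement + DNAdict[base]
# Reversing the complement gives the sequence as read along the opposite strand
reverse_complement = complement[::-1]
print(reverse_complement)
```

You can compare the printed sequence with the chr11:0-42(-) entry in the Bedtools output below.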

In [27]:
#We write a new BED file that includes strand information: 

file2 = open('human_insulin_exon_boundaries_strand.bed', 'w') #opens a new file with the name "human_insulin_exon_boundaries_strand.bed" for writing
#writes the lines of the file
file2.write("chr11\t0\t42\tforward\t1\t+\n") 
file2.write("chr11\t221\t425\tforward\t2\t+\n")
file2.write("chr11\t1212\t1431\tforward\t3\t+\n")
file2.write("chr11\t0\t42\treverse\t1\t-\n") 
file2.write("chr11\t221\t425\treverse\t2\t-\n")
file2.write("chr11\t1212\t1431\treverse\t3\t-\n")

file2.close()

!cat human_insulin_exon_boundaries_strand.bed

!bedtools getfasta -fi data/Human-Insulin-NG_007114.1.txt -s -bed human_insulin_exon_boundaries_strand.bed -fo test.fa.out

#examine the output 
!cat test.fa.out

chr11	0	42	forward	1	+
chr11	221	425	forward	2	+
chr11	1212	1431	forward	3	+
chr11	0	42	reverse	1	-
chr11	221	425	reverse	2	-
chr11	1212	1431	reverse	3	-
>chr11:0-42(+)
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAG
>chr11:221-425(+)
ATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGG
>chr11:1212-1431(+)
TGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCGCCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGC
>chr11:0-42(-)
CTGCTTGATGGCCTCTTCTGATGCAGCCTGTCCTGGAGGGCT
>chr11:221-425(-)
CCTGCAGGTCCTCTGCCTCCCGGCGGGTCTTGGGTGTGTAGAAGAAGCCTCGTTCCCCGCACACTAGGTAGAGAGCTTCCACCAGGTGTGAGCCGCACAGGTGTTGGTTCACAAAGGCTGCGGCTGGGTCAGGTCCCCAGAGGGCCAGCAGCGCCAGCAGGGGCAGGAGGCGCATCCACAGGGCCATGGCAGAAGGACAGTGAT
>chr11:1212-1431(-)
GCTGGTTCAAGGGCTTTATTCCAT