In this recipe we'll explore how to perform six-frame translation on a nucleotide sequence. Briefly, when an unknown DNA sequence is obtained from the environment, there are six possible reading frames for translation. Three of them start at positions 1, 2, and 3 in the forward orientation, and the other three start at positions 1, 2, and 3 in the reverse orientation (that is, on the reverse complement of the sequence).

Six-frame translation can be used to find the possible proteins a nucleotide sequence might encode.
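To make the idea of six reading frames concrete, here is a minimal plain-Python sketch that splits a short sequence into codons for each frame. It does not use scikit-bio; the helper names `reverse_complement` and `six_frames` and the toy sequence are our own illustrations, and the sketch assumes the sequence contains only A, C, G, and T.

```python
# Illustrative only: a plain-Python view of the six reading frames.
# Assumes an unambiguous DNA string (A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return six lists of codons.

    Frames 1-3 read the sequence as given, starting at offsets 0, 1, and 2.
    Frames 4-6 do the same on the reverse complement.
    Trailing bases that do not fill a whole codon are dropped.
    """
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            usable = len(strand) - (len(strand) - offset) % 3
            frames.append([strand[i:i + 3] for i in range(offset, usable, 3)])
    return frames


for number, codons in enumerate(six_frames("ATGAAATAGC"), start=1):
    print(number, codons)
```

Each frame is simply a different way of chopping the same bases into triplets, which is why a single sequence can yield six different candidate translations.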

First let's define a single DNA sequence. This is a [cholera toxin](http://www.ebi.ac.uk/interpro/potm/2005_9/Page1.htm) sequence, which we obtained from [GenBank record EU828588.1](http://www.ncbi.nlm.nih.gov/nuccore/EU828588.1):

In [1]:
from skbio import DNASequence
dna = DNASequence("ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATACATACGCTAAATGATAAGATATTGTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAACTTTTCAAGTAGAAGTACCAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAAGATATAAAAAAGCCCAC")
dna

<DNASequence: ATGATTAAAT... (length: 392)>

Since this is a curated sequence from NCBI, let's add 25 random nucleotides to both ends of the sequence to simulate a DNA fragment obtained from the environment (e.g., from shotgun metagenomic sequencing), where we don't know ahead of time what protein it might encode (or if it even does).

In [2]:
import random
def random_dna_seq(n):
    return ''.join(random.choice('ACGT') for _ in range(n))

In [3]:
random_front = random_dna_seq(25)
random_back = random_dna_seq(25)
dna = DNASequence(random_front + str(dna) + random_back)
dna

<DNASequence: CCTAAGAGGA... (length: 442)>
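Because the flanking nucleotides are random, the exact sequence (and the translations further down) will differ from run to run. If reproducible output is wanted, one option is to draw from a seeded generator. The sketch below is a variant of `random_dna_seq` with an extra, hypothetical `rng` parameter; the seed value is arbitrary.

```python
import random


def random_dna_seq(n, rng=random):
    """Return n random nucleotides drawn from rng (the random module by default)."""
    return ''.join(rng.choice('ACGT') for _ in range(n))


# A dedicated, seeded generator yields the same flank on every run.
rng = random.Random(42)  # 42 is an arbitrary seed chosen for illustration
flank = random_dna_seq(25, rng)
print(flank)
```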

Next let's create a genetic code object. The default genetic code in scikit-bio is the vertebrate nuclear genetic code, but others exist which contain minor differences (e.g., codons code for different amino acids, or the set of stop codons is slightly different) and can be obtained via the `genetic_code` factory. Since we're going to translate the cholera toxin DNA sequence (produced by the *Vibrio cholerae* bacterium), we'll use NCBI's [Bacterial, Archaeal and Plant Plastid Code](http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG11) (transl_table=11):

In [4]:
from skbio.sequence import genetic_code
gc = genetic_code(11)

To perform six-frame translation of our DNA sequence:

In [5]:
prot_seqs = gc.translate_six_frames(dna)
prot_seqs

[<ProteinSequence: PKRNSKLPDD... (length: 147)>,
 <ProteinSequence: LRGIRNYPMI... (length: 147)>,
 <ProteinSequence: *EEFEITR*L... (length: 146)>,
 <ProteinSequence: KATRGTLVRG... (length: 147)>,
 <ProteinSequence: RPRGALLCVG... (length: 147)>,
 <ProteinSequence: GHEGHSCAWA... (length: 146)>]
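For readers curious about what a six-frame translation involves, the following is a rough plain-Python sketch. It is not scikit-bio's implementation. The codon table is built from the amino acid string NCBI lists for the standard code in T, C, A, G base order; translation table 11 uses the same amino acid assignments and differs in its permitted start codons. The names `translate` and `six_frame_translations` are our own, and codons containing anything other than A, C, G, or T are rendered as `X`.

```python
from itertools import product

# Amino acids for all 64 codons, with bases ordered T, C, A, G and the first
# base varying slowest (the layout NCBI uses for its genetic code tables).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a DNA string codon by codon, ignoring any incomplete final codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate three forward frames, then three reverse-complement frames."""
    rc = seq.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (seq, rc) for offset in range(3)]


for frame, protein in enumerate(six_frame_translations("ATGAAATAGC"), start=1):
    print(frame, protein)
```

Applied to a 442-nucleotide sequence, this scheme yields translations of length 147, 147, and 146 on each strand, which matches the lengths reported above.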

The six protein sequences represent each possible reading frame in the DNA sequence. Start and stop codons are not taken into account, because we don't know what the actual reading frame is. A useful next step is therefore to identify which of the translations are most likely to be real proteins.

First let's see how many *stop translation points* there are in each translation (they are denoted by an asterisk in the protein sequence, and correspond to the locations of stop codons in the DNA sequence). The real protein sequence is unlikely to have very many stop translation points. 

In [6]:
for i, seq in enumerate(prot_seqs):
    print("Sequence %d has %d stop codons." % (i + 1, seq.count('*')))

Sequence 1 has 12 stop codons.
Sequence 2 has 1 stop codons.
Sequence 3 has 10 stop codons.
Sequence 4 has 9 stop codons.
Sequence 5 has 6 stop codons.
Sequence 6 has 10 stop codons.


The second translated sequence has the fewest stop codons, so it seems a likely candidate for encoding a protein. Let's trim the sequence to include only the regions between a start and a stop codon. This functionality does not yet exist in scikit-bio, though there are plans to rework the genetic code functionality in the near future to make this process easier. For now, let's define a function to find these regions in a protein sequence. Note that a region which begins with a start codon but reaches the end of the sequence without a stop codon is discarded:

In [7]:
from skbio import ProteinSequence, SequenceCollection
def find_proteins(translated_seq):
    proteins = []
    curr_protein = []
    in_protein = False
    for c in translated_seq:
        if c == 'M':
            # start translation point, corresponds to a start codon
            in_protein = True
        if in_protein:
            if c == '*':
                # end translation point, corresponds to a stop codon
                proteins.append(ProteinSequence(curr_protein, id='%d' % (len(proteins) + 1)))
                curr_protein = []
                in_protein = False
            else:
                curr_protein.append(c)
    return SequenceCollection(proteins)

If we run this function on our candidate translation, we receive a `SequenceCollection` containing a single protein:

In [8]:
proteins = find_proteins(prot_seqs[1])
proteins

<SequenceCollection: n=1; mean +/- std length=124.00 +/- 0.00>

Let's inspect the protein in more detail:

In [9]:
print(proteins)

>1
MIKLKFGVFFTVLLSSAYAHGTPQNITDLCAEYHNTQIHTLNDKILSYTESLAGKREMAIITFKNGATFQVEVPGSQHIDSQKKAIERMKDTLRIAYLTEAKVEKLCVWNNKTPHAIAAISMAN



Note that the sequence starts with M (the amino acid encoded by ATG, the most common start codon in this genetic code) and that the stop translation character has been trimmed off. At 124 amino acids, this region seems a likely candidate for a protein.
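The same trimming rule (start at an M, stop at the next asterisk, drop the asterisk) can also be applied to plain strings across all six frames at once. The sketch below is one way to do that with a regular expression and then pick the longest candidate. The function names `find_protein_strings` and `longest_candidate` and the toy inputs are hypothetical and not part of scikit-bio.

```python
import re

# An M, followed by any run of non-stop characters, terminated by a stop (*).
# The stop itself sits outside the capture group, so it is trimmed off.
START_TO_STOP = re.compile(r"(M[^*]*)\*")


def find_protein_strings(translated):
    """Return start-to-stop regions of a translated string, without the stop."""
    return [match.group(1) for match in START_TO_STOP.finditer(translated)]


def longest_candidate(translations):
    """Return (frame_number, protein) for the longest region, or None if no frame has one."""
    best = None
    for frame, translated in enumerate(translations, start=1):
        for protein in find_protein_strings(translated):
            if best is None or len(protein) > len(best[1]):
                best = (frame, protein)
    return best


toy_translations = ["AAMKL*GGMP*MQ", "LR*MIKLKFGV*N", "QQQ*"]
print(longest_candidate(toy_translations))
```

Ranking by length is only a heuristic: long start-to-stop regions are unlikely to arise by chance, but a short region is not necessarily spurious.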

A next step could be to search the protein against a reference database (e.g., using NCBI's `blastp` tool). In the future, we may add external reference database searching to scikit-bio.