In [1]:
from Bio import SeqIO,SeqRecord,SeqUtils
from Bio.Seq import Seq

In [2]:
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
protein = messenger_rna.translate()
protein

Seq('MAIVMGR*KGAR*')

## Translation can also be done directly from the coding strand of the DNA

In [3]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG") 
protein = coding_dna.translate()
protein

Seq('MAIVMGR*KGAR*')

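### Note the `*` characters in `Seq('MAIVMGR*KGAR*')`

- Each `*` marks a stop codon. The first one is an internal (in-frame) stop; the last one is the terminal stop.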

By default, translation uses the standard genetic code (NCBI table id 1). If we are dealing with a mitochondrial sequence, we need to tell the translation function to use the relevant genetic code instead:

In [4]:
protein_1 = coding_dna.translate(table='Vertebrate Mitochondrial')
protein_1

Seq('MAIVMGRWKGAR*')

In [5]:
# Or specify the table by its NCBI id (2 is Vertebrate Mitochondrial):

protein_1 = coding_dna.translate(table=2)
protein_1

Seq('MAIVMGRWKGAR*')

### Translate the nucleotides up to the first in-frame stop codon, and then stop

In [6]:
protein_1 = coding_dna.translate(to_stop=True)
protein_1

Seq('MAIVMGR')
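
To make the effect of the stop-related options easier to compare, here is a minimal sketch that translates the same coding strand three ways. It assumes a Biopython version whose `translate` method accepts the `to_stop` and `stop_symbol` keyword arguments; the `@` symbol and the variable names are arbitrary choices for illustration.

```python
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Full translation: every stop codon is reported as "*"
full = coding_dna.translate()

# Stop at the first in-frame stop codon; the stop symbol itself is dropped
truncated = coding_dna.translate(to_stop=True)

# Same as the full translation, but with a custom stop symbol (illustrative choice)
custom = coding_dna.translate(stop_symbol="@")

# Zero-based positions of the stop symbols in the full translation
stop_positions = [i for i, aa in enumerate(str(full)) if aa == "*"]

print(full)            # MAIVMGR*KGAR*
print(truncated)       # MAIVMGR
print(custom)          # MAIVMGR@KGAR@
print(stop_positions)  # [7, 12]
```

The first stop position (7) is the internal stop; `to_stop=True` returns exactly the residues before it.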

### What if your sequence uses a non-standard start codon? This happens a lot in bacteria, for example the gene yaaX in E. coli K12:

In [7]:
gene = Seq(
    "GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
    "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT"
    "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT"
    "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT"
    "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA"
)
# Drop any trailing bases so that the length is a multiple of three (whole codons only)
gene_3 = gene[:len(gene) - len(gene) % 3]

In [8]:
gene_3.translate(table='Bacterial')

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*')
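
In the bacterial genetic code GTG is a valid start codon, and as a start codon it is read as methionine rather than valine. The sketch below shows one way to express this with the `cds=True` option of `translate`. It assumes that `gene` really is a complete coding sequence (valid start codon, whole number of codons, a single terminal stop), since that is what `cds=True` checks.

```python
from Bio.Seq import Seq

gene = Seq(
    "GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
    "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT"
    "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT"
    "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT"
    "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA"
)

# Plain translation: the leading GTG is read as valine (V) and the stop is kept as "*"
as_plain = gene.translate(table="Bacterial")

# Treat the sequence as a complete CDS: the start codon becomes M, the stop is dropped
as_cds = gene.translate(table="Bacterial", cds=True)

print(as_plain[:5], len(as_plain))
print(as_cds[:5], len(as_cds))
```

If the sequence were not a valid complete CDS under the chosen table, the `cds=True` call would raise an exception instead of returning a protein.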

### Let's focus on just two choices: the Standard translation table, and the translation table for Vertebrate Mitochondrial DNA.

In [14]:
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_name['Standard']
mito_table = CodonTable.unambiguous_dna_by_name['Vertebrate Mitochondrial']
mito_table.id,standard_table.id

# The Standard table has NCBI id 1 and the Vertebrate Mitochondrial table has id 2.

print(standard_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
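
Reading two printed tables side by side is error-prone, so here is a rough sketch that lists only the codons whose meaning differs between the two tables. It relies on the `forward_table` and `stop_codons` attributes of the codon table objects; the helper `codon_meaning` and the `differences` dictionary are names introduced here for illustration.

```python
from Bio.Data import CodonTable

standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]


def codon_meaning(table, codon):
    """Return the amino acid for a codon, or '*' if the table treats it as a stop."""
    if codon in table.stop_codons:
        return "*"
    return table.forward_table[codon]


bases = "TCAG"
differences = {}
for codon in (a + b + c for a in bases for b in bases for c in bases):
    std = codon_meaning(standard_table, codon)
    mito = codon_meaning(mito_table, codon)
    if std != mito:
        differences[codon] = (std, mito)

for codon, (std, mito) in sorted(differences.items()):
    print(f"{codon}: standard={std}  vertebrate mitochondrial={mito}")
```

The TGA entry in this list is the reason the internal `*` in `MAIVMGR*KGAR*` becomes `W` when the mitochondrial table is used.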

## Comparing Seq objects


Sequence comparison is actually a very complicated topic, and there is no easy way to decide if two sequences are equal.

The basic problem is that the meaning of the letters in a sequence is context dependent: the letter “A” could be part of a DNA, RNA or protein sequence.

Biopython can track the molecule type, so comparing two `Seq` objects could mean considering this too.

- For example, should a DNA fragment “ACG” and an RNA fragment “ACG” be equal? What about the peptide “ACG”? Or the Python string “ACG”?

In practice, `Seq` comparison looks only at the letters, so it behaves like Python string comparison:

In [17]:
seq1 = Seq("ACGT")
"ACGT" == seq1

True
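
A few more comparisons may help make this concrete. This is a small sketch assuming a recent Biopython release, in which a `Seq` carries only its letters; the variable names are illustrative.

```python
from Bio.Seq import Seq

dna = Seq("ACG")
other = Seq("ACG")  # could be meant as RNA or a peptide; the object only stores letters

same_letters = dna == other          # Seq vs Seq with identical letters
same_as_str = dna == "ACG"           # Seq vs plain Python string
different = dna == Seq("ACGT")       # different letters
transcribed = Seq("ACGT").transcribe() == "ACGU"  # T replaced by U before comparing

print(same_letters, same_as_str, different, transcribed)
```

Whether two sequences with the same letters should count as the same molecule is therefore a decision left to the calling code.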

- Sometimes the length of a sequence is known, but not the actual letters that make it up.

In [19]:
unknown_seq = Seq(None, length=10)

The Seq object thus created has a well-defined length. Any attempt to access the sequence contents, however, will raise an `UndefinedSequenceError`:

In [23]:
len(unknown_seq)
print(unknown_seq)

UndefinedSequenceError: Sequence content is undefined
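
When code has to cope with sequences that may or may not have contents, the exception can be caught explicitly. The following is a minimal sketch, assuming a Biopython version that exports `UndefinedSequenceError` from `Bio.Seq`; the fallback value `None` is an illustrative choice.

```python
from Bio.Seq import Seq, UndefinedSequenceError

unknown_seq = Seq(None, length=10)

# The length is always available
length = len(unknown_seq)

# Accessing the letters raises, so guard the conversion
try:
    contents = str(unknown_seq)
except UndefinedSequenceError:
    contents = None  # illustrative fallback for "letters not known"

print(length, contents)
```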

## Sequences with partially defined sequence contents

Sometimes the sequence contents are defined for parts of the sequence only, and undefined elsewhere. For example, the following excerpt of a MAF (Multiple Alignment Format) file shows an alignment of human, chimp, macaque, mouse, rat, dog, and opossum genome sequences:

s hg38.chr7     117512683 36 + 159345973 TTGAAAACCTGAATGTGAGAGTCAGTCAAGGATAGT

s panTro4.chr7  119000876 36 + 161824586 TTGAAAACCTGAATGTGAGAGTCACTCAAGGATAGT

s rheMac3.chr3  156330991 36 + 198365852 CTGAAATCCTGAATGTGAGAGTCAATCAAGGATGGT

s mm10.chr6      18207101 36 + 149736546 CTGAAAACCTAAGTAGGAGAATCAACTAAGGATAAT

s rn5.chr4       42326848 36 + 248343840 CTGAAAACCTAAGTAGGAGAGACAGTTAAAGATAAT

s canFam3.chr14  56325207 36 +  60966679 TTGAAAAACTGATTATTAGAGTCAATTAAGGATAGT

s monDom5.chr8  173163865 36 + 312544902 TTAAGAAACTGGAAATGAGGGTTGAATGACAAACTT

In each row, the first number indicates the starting position (in zero-based coordinates) of the aligned sequence on the chromosome, followed by the size of the aligned sequence, the strand, the size of the full chromosome, and the aligned sequence.
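
The field layout just described can be sketched in plain Python. The `MafRow` tuple and `parse_maf_row` function below are hypothetical helpers written for this note, not part of Biopython, and they only handle `s` rows like the ones in the excerpt above.

```python
from typing import NamedTuple


class MafRow(NamedTuple):
    source: str       # e.g. "hg38.chr7"
    start: int        # zero-based start of the aligned segment
    size: int         # size of the aligned segment
    strand: str       # "+" or "-"
    source_size: int  # size of the full chromosome
    text: str         # the aligned sequence


def parse_maf_row(line: str) -> MafRow:
    """Split one whitespace-separated MAF 's' row into typed fields."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError(f"not a MAF 's' row: {line!r}")
    _, source, start, size, strand, source_size, text = fields
    return MafRow(source, int(start), int(size), strand, int(source_size), text)


row = parse_maf_row(
    "s hg38.chr7     117512683 36 + 159345973 TTGAAAACCTGAATGTGAGAGTCAGTCAAGGATAGT"
)

# A {start: letters} mapping plus the chromosome size describes the known segment
segment = {row.start: row.text}
print(row.source, row.start, row.size, row.strand, row.source_size)
```

For this row the aligned text contains no gap characters, so its length equals the size field.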

A `Seq` object representing such a partially defined sequence can be created using a dictionary for the data argument, where the keys are the starting coordinates of the known sequence segments, and the values are the corresponding sequence contents:

In [None]:
partial_seq1 = Seq({117512683: "TTGAAAACCTGAATGTGAGAGTCAGTCAAGGATAGT"}, length=159345973)
partial_seq1


Seq({117512683: 'TTGAAAACCTGAATGTGAGAGTCAGTCAAGGATAGT'}, length=159345973)

**Extracting a subsequence from a partially defined sequence may return a fully defined sequence, an undefined sequence, or a partially defined sequence, depending on the coordinates:**

In [38]:
partial_seq1_slice = partial_seq1[100:200]
partial_seq1_slice


Seq(None, length=100)
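
The slice above falls entirely outside the known segment, which is why it comes back undefined. The sketch below tries one slice of each kind. The helper `is_fully_defined` is a hypothetical convenience function introduced here; it assumes that converting a `Seq` with missing letters to `str` raises `UndefinedSequenceError`.

```python
from Bio.Seq import Seq, UndefinedSequenceError

partial_seq1 = Seq({117512683: "TTGAAAACCTGAATGTGAGAGTCAGTCAAGGATAGT"}, length=159345973)


def is_fully_defined(seq):
    """Return True if every letter of seq is known (hypothetical helper)."""
    try:
        str(seq)
    except UndefinedSequenceError:
        return False
    return True


undefined = partial_seq1[1000:1020]            # entirely outside the known segment
defined = partial_seq1[117512690:117512700]    # entirely inside the known segment
straddling = partial_seq1[117512670:117512690] # starts before it, ends inside it

for name, piece in [("undefined", undefined), ("defined", defined), ("straddling", straddling)]:
    print(name, len(piece), is_fully_defined(piece))
```

Only the slice that lies entirely inside the known segment can be converted to a string.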