# Sequence objects

The two most important objects are `Seq` and `SeqRecord`. `Seq` holds a sequence and is aware of the alphabet used. `SeqRecord` holds additional metadata for the sequence (biological annotation, etc.). Sequence-related objects and functions are in the [`Bio.Seq`](http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html) module.

## Seq

* Much like a standard string
* But with a few additions, e.g. the `.translate()` method


In [1]:
from Bio.Seq import Seq

s=Seq("ACGCGGCGTG")
print "s=", s
#Standard slicing works as usual
print s[:3]
print s[1:]
print s[::-1] #what will this do?


s= ACGCGGCGTG
ACG
CGCGGCGTG
GTGCGGCGCA


`print` pretty-prints the sequence for us, so

In [2]:
print s
print "ACGCGGCGTG"

ACGCGGCGTG
ACGCGGCGTG


gives us no indication that `s` is a slightly more complex object than a plain string. If you want to know more about an object, you can try this:

In [3]:
print repr(s)
print s.__class__ #two underscores not one
#help(s)
print s.alphabet

Seq('ACGCGGCGTG', Alphabet())
<class 'Bio.Seq.Seq'>
Alphabet()


So what is this `Alphabet()`? Sequences need to know their alphabet for many operations to work correctly (e.g. `translate()` makes no sense for a protein sequence). Alphabets are in the [`Bio.Alphabet`](http://biopython.org/DIST/docs/api/Bio.Alphabet-module.html) module, and the commonly used IUPAC alphabets are in [`Bio.Alphabet.IUPAC`](http://biopython.org/DIST/docs/api/Bio.Alphabet.IUPAC-module.html).

In [4]:
from Bio.Alphabet import IUPAC

print IUPAC.unambiguous_dna

IUPACUnambiguousDNA()


So let's try translating a sequence:

In [5]:
s=Seq("ACGCGGCGTG")
print s.translate()

TRR




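Oops. Biopython warns here because the length of `s` (10) is not a multiple of three, so the final base is only a partial codon. That warning is not nice. Head over to [Ville](https://ville.utu.fi) and fix that, i.e. trim `s` to the longest subsequence whose length is divisible by `3` and print the translation.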
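
To see why the length matters, here is a minimal plain-Python sketch (no Biopython needed) that splits a sequence into complete codons and reports whatever is left over. The helper name `split_codons` is made up for this illustration and is not part of Biopython.

```python
def split_codons(seq):
    """Split a sequence into complete codons plus any leftover bases."""
    text = str(seq)  # works for plain strings; str() of a Seq gives its letters
    usable = len(text) - len(text) % 3  # largest multiple of 3 not exceeding the length
    codons = [text[i:i + 3] for i in range(0, usable, 3)]
    leftover = text[usable:]  # 0, 1 or 2 bases that do not form a full codon
    return codons, leftover

codons, leftover = split_codons("ACGCGGCGTG")
print(codons)    # ['ACG', 'CGG', 'CGT']
print(leftover)  # G
```

The three complete codons correspond to the three letters of the translation `TRR`; the single leftover base is what triggers the warning.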

`Seq` objects have a number of methods in common with [Python strings](https://docs.python.org/2/library/stdtypes.html#string-methods). For example, they have the `count()` method. Try to find a way to use it to calculate the *GC content* of the sequence, i.e. the proportion of G's and C's out of the total length of the sequence. For example, the GC content of `ACGTATTA` is `0.25` (`2/8`).

A little side note on division. The following probably does not do what you would intuitively expect:

In [6]:
print 2/3

0


That is because in Python 2, dividing one integer by another performs integer division by default. You need to make one of the operands a floating-point number (`float` type) for it to become real division. The simplest way is the `float()` function, which tries to make a `float` out of anything you feed into it:

In [7]:
print float(2)/3
print float(2/3) #this doesn't do what you think
print 2/3/1.5 #neither does this

0.666666666667
0.0
0.0
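
The last two lines fail because the integer division happens first, before any float is involved. The sketch below spells this out in a form that runs the same way under Python 2 and Python 3 (where `/` between two integers is already real division). It uses the explicit floor-division operator `//` to reproduce what Python 2 does; the helper name `proportion` is made up for this illustration.

```python
def proportion(part, total):
    """Return part/total as a float under both Python 2 and Python 3."""
    return float(part) / total  # convert BEFORE dividing

print(proportion(2, 3))  # roughly 0.667
print(2 // 3)            # 0: // is always integer (floor) division
print((2 // 3) / 1.5)    # 0.0: how Python 2 evaluates 2/3/1.5, left to right
print(2 / (3 / 1.5))     # 1.0: parentheses force the float division to happen first
```

Division is evaluated left to right, so in `2/3/1.5` the integer result `0` is computed before `1.5` ever gets a chance to turn the expression into a float.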


Now head over to [Ville](https://ville.utu.fi) and solve that exercise.

## Sequence comparison

*Note: this will behave differently on older versions of Biopython.*

In [8]:
s1=Seq("ACGT")
s2=Seq("ACGT")
print s1==s2

True


Sequences behave much like strings, so you can index them in the same way. For example, `s1[3]` is the fourth character. Now head over to Ville and try to figure out the Hamming distance exercise. Here's a little loop to get you started:

In [9]:
s1=Seq("ACGTACGT")

for i in range(3):
    print s1[i]

A
C
G


In [10]:
s1=Seq("ACGTACG")
s2=Seq("ACGACGT")

assert len(s1)==len(s2), (len(s1),len(s2))
for c1,c2 in zip(s1,s2):
    print c1,c2
    


A A
C C
G G
T A
A C
C G
G T


## Translation

In [11]:
from Bio.Alphabet import IUPAC
mrna=Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
translated=mrna.translate()
print translated
print repr(translated)

MAIVMGR*KGAR*
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))


In [12]:
import Bio.Alphabet

s=\
"""TTAAAAGATGCAATCTATCGTACTCGCACTT
TCCCTGGTTCTGGTCGCTCCCATGGCAGCACAGGCT
GCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAG
ATAGGCGATCGTGATAATCGTGGCTATTACTGGGAT
GGAGGTCACTGGCGCGTAACCACGGCTGGTGGAAAC
CATTATGAATGGCGAGGCAATCGCTGGCACCTACAC
GGACCGCCGCCACCGCCGCGCCACCATAAGAAAGCT
CCTCATGATCATCACGGCGGTCATGGTCCAGGCAAA
CATCACCGCTAA""".replace("\n","")

gene=Seq(s,Bio.Alphabet.generic_dna)
translation1=gene.translate(table="Bacterial")
print "Trans 1", translation1
translation2=gene.translate(table="Bacterial", to_stop=True)
print "Trans 2", translation2


Trans 1 LKDAIYRTRTFPGSGRSHGSTGCGNYVSPVSKITDRRS**SWLLLGWRSLARNHGWWKPL*MARQSLAPTRTAATAAPP*ESSS*SSRRSWSRQTSPL
Trans 2 LKDAIYRTRTFPGSGRSHGSTGCGNYVSPVSKITDRRS
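
The difference between a full translation and one with `to_stop=True` can be illustrated with a small plain-Python sketch. This is not how Biopython implements `translate()`; it is a toy version under stated assumptions: `TOY_TABLE` and `toy_translate` are names made up for this example, and the table contains only the codons that occur in the example mRNA from the earlier cell, not a complete genetic code.

```python
# Toy codon table: ONLY the codons that occur in the example mRNA below.
# This is an illustration, not a complete genetic code table.
TOY_TABLE = {
    "AUG": "M", "GCC": "A", "AUU": "I", "GUA": "V",
    "GGC": "G", "GGU": "G", "CGC": "R", "CGA": "R",
    "AAG": "K", "UGA": "*", "UAG": "*",
}

def toy_translate(seq, table, to_stop=False):
    """Translate codon by codon; optionally stop at the first stop codon."""
    text = str(seq)
    usable = len(text) - len(text) % 3  # ignore a trailing partial codon
    protein = []
    for i in range(0, usable, 3):
        amino_acid = table[text[i:i + 3]]  # KeyError for codons missing from the toy table
        if to_stop and amino_acid == "*":
            break  # the stop symbol itself is not included
        protein.append(amino_acid)
    return "".join(protein)

mrna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
print(toy_translate(mrna, TOY_TABLE))                # MAIVMGR*KGAR*
print(toy_translate(mrna, TOY_TABLE, to_stop=True))  # MAIVMGR
```

Swapping in a different dictionary changes the result without touching the loop, which is the sense in which translation is driven by a table.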


The translation is driven by a genetic code table, of which there are [many](http://biopython.org/DIST/docs/api/Bio.Data.CodonTable-module.html). You can specify the table either by its name or by its number (check out the file linked in the module documentation). The tables are accessible via the various dictionaries in the CodonTable module:


In [13]:
import Bio.Data.CodonTable as CT

print "By name:", CT.generic_by_name
print "By ID:", CT.generic_by_id

By name: {'Ascidian Mitochondrial': <Bio.Data.CodonTable.NCBICodonTable object at 0x3504b50>, 'SGC9': <Bio.Data.CodonTable.NCBICodonTable object at 0x35043d0>, 'Coelenterate Mitochondrial': <Bio.Data.CodonTable.NCBICodonTable object at 0x3493950>, 'Protozoan Mitochondrial': <Bio.Data.CodonTable.NCBICodonTable object at 0x3493950>, 'Vertebrate Mitochondrial': <Bio.Data.CodonTable.NCBICodonTable object at 0x34933d0>, 'Plant Plastid': <Bio.Data.CodonTable.NCBICodonTable object at 0x3504650>, 'Thraustochytrium Mitochondrial': <Bio.Data.CodonTable.NCBICodonTable object at 0x3512a90>, 'Blepharisma Macronuclear': <Bio.Data.CodonTable.NCBICodonTable object at 0x3512090>, 'Mold Mitochondrial': <Bio.Data.CodonTable.NCBICodonTable object at 0x3493950>, 'Invertebrate Mitochondrial': <Bio.Data.CodonTable.NCBICodonTable object at 0x3493bd0>, 'Standard': <Bio.Data.CodonTable.NCBICodonTable object at 0x34930d0>, 'Trematode Mitochondrial': <Bio.Data.CodonTable.NCBICodonTable object at 0x3512590>, 'Scen

That is not very readable. Let's try this:

In [14]:
print "All names:", CT.generic_by_name.keys()
#Sometimes the help() function tells you something about the object you're dealing with
help(CT.generic_by_name["SGC9"])
print "Start codons of the SGC9 table:",CT.generic_by_name["SGC9"].start_codons

All names: ['Ascidian Mitochondrial', 'SGC9', 'Coelenterate Mitochondrial', 'Protozoan Mitochondrial', 'Vertebrate Mitochondrial', 'Plant Plastid', 'Thraustochytrium Mitochondrial', 'Blepharisma Macronuclear', 'Mold Mitochondrial', 'Invertebrate Mitochondrial', 'Standard', 'Trematode Mitochondrial', 'Scenedesmus obliquus Mitochondrial', 'Euplotid Nuclear', 'Yeast Mitochondrial', 'Spiroplasma', 'Alternative Flatworm Mitochondrial', 'Ciliate Nuclear', 'SGC8', 'Alternative Yeast Nuclear', 'Hexamita Nuclear', 'SGC5', 'SGC4', 'SGC3', 'SGC2', 'SGC1', 'SGC0', 'Flatworm Mitochondrial', 'Dasycladacean Nuclear', 'Chlorophycean Mitochondrial', 'Mycoplasma', 'Pterobranchia Mitochondrial', 'Bacterial', 'Echinoderm Mitochondrial']
Help on NCBICodonTable in module Bio.Data.CodonTable object:

class NCBICodonTable(CodonTable)
 |  Method resolution order:
 |      NCBICodonTable
 |      CodonTable
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, id, names, table, start_c

Now that you know you can list all available tables using `Bio.Data.CodonTable.generic_by_name.keys()`, you can also loop over these tables and solve the next exercise in Ville. Here's yet another little for loop to get you started.

In [15]:
import Bio.Data.CodonTable

for table_name in Bio.Data.CodonTable.generic_by_name.keys():
    print table_name

Ascidian Mitochondrial
SGC9
Coelenterate Mitochondrial
Protozoan Mitochondrial
Vertebrate Mitochondrial
Plant Plastid
Thraustochytrium Mitochondrial
Blepharisma Macronuclear
Mold Mitochondrial
Invertebrate Mitochondrial
Standard
Trematode Mitochondrial
Scenedesmus obliquus Mitochondrial
Euplotid Nuclear
Yeast Mitochondrial
Spiroplasma
Alternative Flatworm Mitochondrial
Ciliate Nuclear
SGC8
Alternative Yeast Nuclear
Hexamita Nuclear
SGC5
SGC4
SGC3
SGC2
SGC1
SGC0
Flatworm Mitochondrial
Dasycladacean Nuclear
Chlorophycean Mitochondrial
Mycoplasma
Pterobranchia Mitochondrial
Bacterial
Echinoderm Mitochondrial


## Complements

In [16]:
mrna=Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
print mrna.complement()
print translated.complement() #a protein has no complement, so this raises ValueError

UACCGGUAACAUUACCCGGCGACUUUCCCACGGGCUAUC


ValueError: Proteins do not have complements!
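
What `complement()` computes for the RNA sequence above can be sketched in plain Python as a base-by-base lookup. The names `RNA_COMPLEMENT` and `rna_complement` are made up for this illustration and are not part of Biopython, and the sketch handles only the four unambiguous RNA bases.

```python
# Pairing rules for unambiguous RNA bases (illustrative; ambiguity codes are not handled)
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def rna_complement(seq):
    """Return the base-by-base complement of an RNA sequence."""
    # A letter outside A/U/G/C (e.g. an amino acid) raises KeyError
    return "".join(RNA_COMPLEMENT[base] for base in str(seq))

mrna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
print(rna_complement(mrna))  # UACCGGUAACAUUACCCGGCGACUUUCCCACGGGCUAUC

# Reversing the complement with the usual slice gives the reverse complement
print(rna_complement(mrna)[::-1])
```

Protein letters have no pairing partner in this mapping, which mirrors why calling `complement()` on the translated sequence fails.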