# Sequence objects

This section covers two important objects: `Seq` and `SeqRecord`. A `Seq` holds a sequence and is aware of the alphabet it uses. A `SeqRecord` holds additional metadata for the sequence (biological annotation, etc.). Sequence-related objects and functions are in the [`Bio.Seq`](http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html) module.

## Seq

* Behaves much like a standard string
* Adds a few extras, e.g. the `.translate()` method


In [1]:
from Bio.Seq import Seq

s=Seq("ACGCGGCGTG")
print "s=", s
#Standard slicing works as usual
print s[:3]
print s[1:]
print s[::-1] #what will this do?


s= ACGCGGCGTG
ACG
CGCGGCGTG
GTGCGGCGCA


`print` pretty-prints the sequence for us, so

In [2]:
print s

ACGCGGCGTG


gives no indication that `s` is a more complex object than a plain string. If you want to know more about an object, you can try this:

In [3]:
print repr(s)
print s.__class__ #two underscores not one
print s.alphabet

Seq('ACGCGGCGTG', Alphabet())
<class 'Bio.Seq.Seq'>
Alphabet()


So what is this `Alphabet()`? Sequences need to know their alphabet for many operations to work correctly (e.g. `translate()` makes no sense for a protein sequence). Alphabets live in the [`Bio.Alphabet`](http://biopython.org/DIST/docs/api/Bio.Alphabet-module.html) module, and the commonly used IUPAC alphabets are in [`Bio.Alphabet.IUPAC`](http://biopython.org/DIST/docs/api/Bio.Alphabet.IUPAC-module.html).

In [4]:
from Bio.Alphabet import IUPAC

print IUPAC.unambiguous_dna

IUPACUnambiguousDNA()


So let's try

In [5]:
print s.translate()

TRR




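Oops. Running this also produces a warning (not shown here), because the length of `s` (10) is not a multiple of three, and that is not nice. Head over to [Ville](https://ville.utu.fi) and fix that, i.e. trim `s` to the longest subsequence whose length is divisible by `3` and print the translation.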

`Seq` objects have a number of methods in common with [Python strings](https://docs.python.org/2/library/stdtypes.html#string-methods). For example, they have the `count()` method. Try to find a way to use it to calculate the *GC content* of the sequence, i.e. the proportion of G's and C's out of the total length of the sequence. For example, the GC content of `ACGTATTA` is `0.25` (`2/8`).
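
As a small hint that stops short of solving the exercise, here is how `count()` behaves on a plain Python string. The variable names are made up for illustration, and the sketch assumes that `Seq.count()` counts non-overlapping occurrences the same way `str.count()` does.

```python
# count() on a plain string; a Seq is assumed to behave the same way here.
text = "ACGTATTA"

# Occurrences of a single letter.
a_count = text.count("A")

# Occurrences of a longer, non-overlapping substring.
ta_count = text.count("TA")

print(a_count)
print(ta_count)
print(len(text))
```

Combining a few such counts with `len()` is enough to work out a proportion.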

A little side note on division. The following probably does not do what you would intuitively expect:

In [6]:
print 2/3

0


That is because in Python 2, dividing one integer by another performs integer division by default. You need to make one of the operands a real number (`float` type) for it to become real division. The simplest way is the `float()` function, which tries to make a `float` out of anything you feed into it:

In [7]:
print float(2)/3

0.666666666667
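
If you want to be explicit about which kind of division you mean, the following sketch may help. It assumes nothing beyond the standard language: `//` always means floor division, and a `float` operand always forces real division, in both Python 2 and Python 3. The variable names are illustrative only.

```python
# Explicit forms that do not depend on the default behaviour of "/".
floor_result = 2 // 3        # floor (integer) division
true_result = 2 / 3.0        # a float operand forces real division
gc_fraction = float(2) / 8   # the 2/8 GC content example from above

print(floor_result)
print(true_result)
print(gc_fraction)
```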


Now head over to [Ville](https://ville.utu.fi) and solve that exercise.

## Sequence comparison

*Note: this behaves differently in older versions of Biopython.*

In [8]:
s1=Seq("ACGT")
s2=Seq("ACGT")
print s1==s2

True


## Translation

In [11]:
from Bio.Alphabet import IUPAC
mrna=Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
translated=mrna.translate()
print translated
print repr(translated)

MAIVMGR*KGAR*
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))


In [25]:
import Bio.Alphabet

s=\
"""TTAAAAGATGCAATCTATCGTACTCGCACTT
TCCCTGGTTCTGGTCGCTCCCATGGCAGCACAGGCT
GCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAG
ATAGGCGATCGTGATAATCGTGGCTATTACTGGGAT
GGAGGTCACTGGCGCGTAACCACGGCTGGTGGAAAC
CATTATGAATGGCGAGGCAATCGCTGGCACCTACAC
GGACCGCCGCCACCGCCGCGCCACCATAAGAAAGCT
CCTCATGATCATCACGGCGGTCATGGTCCAGGCAAA
CATCACCGCTAA""".replace("\n","")

gene=Seq(s,Bio.Alphabet.generic_dna)
translation1=gene.translate(table="Bacterial")
print "Trans 1", translation1
translation2=gene.translate(table="Bacterial", to_stop=True)
print "Trans 2", translation2


Trans 1 LKDAIYRTRTFPGSGRSHGSTGCGNYVSPVSKITDRRS**SWLLLGWRSLARNHGWWKPL*MARQSLAPTRTAATAAPP*ESSS*SSRRSWSRQTSPL
Trans 2 LKDAIYRTRTFPGSGRSHGSTGCGNYVSPVSKITDRRS


The translation is driven by a genetic code table, of which there are [many](http://biopython.org/DIST/docs/api/Bio.Data.CodonTable-module.html). You can specify the table either by its name or by its number (check out the file linked in the module documentation).
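
To make the idea of a table-driven translation concrete, here is a minimal pure-Python sketch. It is not how Biopython implements `translate()`: `PARTIAL_TABLE` and `translate_with_table` are hypothetical names, and the table is only a hand-picked subset of the standard genetic code, just large enough for the mRNA example above.

```python
# Illustrative subset of the standard genetic code (RNA codons).
# NOT a complete table and not Biopython's own data structure.
PARTIAL_TABLE = {
    "AUG": "M", "GCC": "A", "AUU": "I", "GUA": "V", "GGC": "G",
    "CGC": "R", "AAG": "K", "GGU": "G", "CGA": "R",
    "UGA": "*", "UAG": "*", "UAA": "*",
}


def translate_with_table(rna, table, to_stop=False):
    """Look up each codon in the table; optionally halt before the first stop."""
    protein = []
    # Visit only positions where a full three-letter codon fits.
    for start in range(0, len(rna) - 2, 3):
        amino_acid = table[rna[start:start + 3]]
        if to_stop and amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna_text = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
full = translate_with_table(mrna_text, PARTIAL_TABLE)
until_stop = translate_with_table(mrna_text, PARTIAL_TABLE, to_stop=True)
print(full)
print(until_stop)
```

Swapping in a different dictionary changes the result without touching the loop, which is the essence of selecting a table by name or number.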

Now head over to [Ville](https://ville.utu.fi) and solve the "which table" exercise. Here's a little hint:

In [26]:
for i in xrange(1,25):
    print i

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


## Complements

In [13]:
print mrna.complement()
print translated.complement()

UACCGGUAACAUUACCCGGCGACUUUCCCACGGGCUAUC


ValueError: Proteins do not have complements!
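
The complement printed above simply pairs A with U and C with G. As a rough sketch of that idea on a plain string (the names `RNA_COMPLEMENT` and `complement_rna` are hypothetical, and only the four unambiguous RNA letters are handled, unlike Biopython's own method):

```python
# Base pairing for unambiguous RNA; ambiguity codes are deliberately left out.
RNA_COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def complement_rna(rna):
    """Return the base-by-base complement of an unambiguous RNA string."""
    return "".join(RNA_COMPLEMENT[base] for base in rna)


mrna_text = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
complemented = complement_rna(mrna_text)
print(complemented)
```

A protein letter such as `M` has no entry in this mapping, which mirrors why asking for the complement of a protein makes no sense.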