# What is Biopython?

The Biopython Project is an international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology. Python is an object-oriented, interpreted, flexible language that is becoming increasingly popular for scientific computing. It is easy to learn, has a very clear syntax, and can easily be extended with modules written in C, C++ or FORTRAN.

Source: http://biopython.org/DIST/docs/install/Installation.html

## What are the uses of Biopython?


The main Biopython releases have lots of functionality, including:

1. The ability to parse bioinformatics files into Python data structures, including support for the following formats:
    * Blast output, both from standalone and WWW Blast
    * Clustalw
    * FASTA
    * GenBank
    * PubMed and Medline
    * ExPASy files, like Enzyme and Prosite
    * SCOP, including ‘dom’ and ‘lin’ files
    * UniGene
    * SwissProt
2. Code to deal with popular online bioinformatics destinations such as:
    * NCBI: Blast, Entrez and PubMed services
 
3. Tools for performing common operations on sequences, such as translation, transcription and weight calculations.

4. And much more, since Python rocks.


# Installation:

Set up a Python environment and install the required packages listed in `requirements.txt`.

# Sequence objects:


In [5]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC


my_seq = Seq('AGTACACTGGT',IUPAC.unambiguous_dna)
my_seq


Seq('AGTACACTGGT', IUPACUnambiguousDNA())

In [8]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC


my_seq = Seq('AGTACACTGGT',IUPAC.protein)
my_seq

Seq('AGTACACTGGT', IUPACProtein())

In [9]:
my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
for letter in my_seq:
    print(letter)

G
A
T
C
G


In [28]:
my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
for no, letter in enumerate(my_seq):
    print(no, letter)

0 G
1 A
2 T
3 C
4 G


In [29]:
print(my_seq[2])

T


In [40]:
print(my_seq[1:3])

AT


In [31]:
print(my_seq[-1])

G


### Slicing Sequences

In [36]:
print(my_seq[0::2])

GTG


In [37]:
print(my_seq[1::2])

AC


In [38]:
print(my_seq[2::2])

TG
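The stride slices above behave the same way on plain Python strings, which makes them easy to experiment with. The following is a minimal sketch on a `str` rather than a `Seq` (the variable names are illustrative only): a stride of 3 picks out the first, second and third codon positions, and a stride of -1 reverses the sequence.

```python
# Plain-string sketch: slicing with a stride works the same way on str and Seq.
dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"

# Every third letter, starting at offsets 0, 1 and 2 (the three codon positions).
first_positions = dna[0::3]
second_positions = dna[1::3]
third_positions = dna[2::3]

# A stride of -1 walks the sequence backwards, i.e. reverses it.
reversed_dna = dna[::-1]

print(first_positions)
print(second_positions)
print(third_positions)
print(reversed_dna)
```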


## Playing around with sequences as strings:

In [18]:
'GATCGATGGGCCTATATAGGATCGAAAATCGC'.count("AA")

2

In [20]:
from Bio.Seq import Seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC').count("AA")

2
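Both counts above are non-overlapping: the run `AAAA` contributes two matches of `AA`, not three. If overlapping matches matter, something like the sketch below can be used. `count_overlapping` is a hypothetical helper written for this note in plain Python; it is not part of the API shown in this notebook.

```python
def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Move one position forward so overlapping matches are found too.
        start = sequence.find(sub, start + 1)
    return count


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(dna.count("AA"))               # non-overlapping count
print(count_overlapping(dna, "AA"))  # overlapping count
```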

In [21]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
len(my_seq)

32

In [22]:
my_seq.count("G")

9

### Calculate GC%

In [23]:
100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)

46.875

In [25]:
from Bio.SeqUtils import GC
my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
GC(my_seq)

46.875
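The manual calculation above can be wrapped in a small function. This is a plain-Python sketch, not Biopython's implementation: the name `gc_percent` is made up here, it ignores ambiguity codes, and it returns 0 for an empty sequence to avoid dividing by zero. Note also that recent Biopython releases appear to replace `GC` with `gc_fraction`, which returns a fraction rather than a percentage, so check the version you have installed.

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters in a sequence (0.0 if empty)."""
    sequence = str(sequence).upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
```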

### Concatenating or adding sequences

In [44]:
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
protein_seq = Seq("EVRNAK", IUPAC.protein)
dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
protein_seq + dna_seq

TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()

In [43]:
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
protein_seq = Seq("EVRNAK", IUPAC.protein)
dna_seq = Seq("ACGT", IUPAC.protein)
protein_seq + dna_seq

Seq('EVRNAKACGT', IUPACProtein())

In [50]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_nucleotide
from Bio.Alphabet import IUPAC, NucleotideAlphabet
from Bio.Alphabet.IUPAC import IUPACUnambiguousDNA
nuc_seq = Seq("GATCGATGC", generic_nucleotide)
dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
nuc_seq

Seq('GATCGATGC', NucleotideAlphabet())

In [52]:
dna_seq

Seq('ACGT', IUPACUnambiguousDNA())

In [53]:
nuc_seq + dna_seq


Seq('GATCGATGCACGT', NucleotideAlphabet())

### Complement and Reverse Complement

In [54]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
my_seq

Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())

In [55]:
my_seq.complement()

Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())

In [56]:
my_seq.reverse_complement()

Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
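To see what these two methods compute, here is a plain-Python sketch for unambiguous DNA. It is an illustration only, not how Biopython implements them, and `COMPLEMENT`, `complement` and `reverse_complement` are names defined just for this example.

```python
# Map each unambiguous DNA letter to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap every base for its partner, keeping the reading direction."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence, then reverse it to read the other strand 5' to 3'."""
    return complement(dna)[::-1]


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(dna))
print(reverse_complement(dna))
```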

## Transcription

        DNA coding strand (aka Crick strand, strand +1)	 

    5’	ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG	3’
          |||||||||||||||||||||||||||||||||||||||	 
    3’	TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC	5’

        DNA template strand (aka Watson strand, strand −1)	 
         
                |	 
            Transcription	 
                ↓	 
 
    5’	AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG	3’
        Single stranded messenger RNA

In [57]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
coding_dna


Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

In [58]:
template_dna = coding_dna.reverse_complement()
template_dna

Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA())

In [59]:
coding_dna

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

In [60]:
messenger_rna = coding_dna.transcribe()
messenger_rna

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

### Back Transcription method:

In [61]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
messenger_rna

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

In [62]:
messenger_rna.back_transcribe()

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
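Working from the coding strand, transcription and back-transcription are letter substitutions: T becomes U, and U becomes T. A minimal plain-Python sketch of that idea follows. The function names mirror the methods above for readability, but they are standalone helpers written for this note.

```python
def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return coding_dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna):
    """mRNA back to coding-strand DNA: replace U with T."""
    return rna.replace("U", "T").replace("u", "t")


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
messenger_rna = transcribe(coding_dna)
print(messenger_rna)
print(back_transcribe(messenger_rna))
```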

## Translation

Let’s translate this mRNA into the corresponding protein sequence.

In [80]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
messenger_rna

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

In [64]:
messenger_rna.translate()

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

In [71]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
coding_dna

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

In [72]:
coding_dna.translate()

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

In [73]:
coding_dna.translate(table="Vertebrate Mitochondrial")

Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

In [74]:
coding_dna.translate(table=2)

Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

In [75]:
coding_dna.translate()

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

In [76]:
coding_dna.translate(to_stop=True)

Seq('MAIVMGR', IUPACProtein())

In [77]:
coding_dna.translate(table=2)

Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

In [78]:
coding_dna.translate(table=2, to_stop=True)

Seq('MAIVMGRWKGAR', IUPACProtein())

In [79]:
coding_dna.translate(table=2, stop_symbol="@")

Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@'))
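The effect of `table` and `to_stop` can be reproduced with a small codon lookup. The sketch below is a simplified stand-in, not Biopython's code: it builds the standard genetic code (NCBI table 1) from its compact amino-acid string, derives the vertebrate mitochondrial code (NCBI table 2) by overriding four codons, and handles only unambiguous codons. All names are defined here for illustration.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, listed in TCAG codon order.
STANDARD_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
STANDARD_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), STANDARD_AMINO_ACIDS)
}

# The vertebrate mitochondrial code differs from the standard one in four codons.
VERTEBRATE_MITO_TABLE = dict(STANDARD_TABLE, AGA="*", AGG="*", ATA="M", TGA="W")


def translate(sequence, table=STANDARD_TABLE, to_stop=False):
    """Translate DNA or RNA codon by codon; optionally stop at the first stop codon."""
    sequence = sequence.upper().replace("U", "T")
    protein = []
    # Ignore a trailing partial codon, if any.
    for i in range(0, len(sequence) - len(sequence) % 3, 3):
        amino_acid = table[sequence[i:i + 3]]
        if to_stop and amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))
print(translate(coding_dna, to_stop=True))
print(translate(coding_dna, table=VERTEBRATE_MITO_TABLE))
```

The only codon in this example that the two tables read differently is `TGA`: a stop in the standard code and tryptophan (W) in the vertebrate mitochondrial code, which is why the second protein is longer when `to_stop=True`.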

In [81]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
            "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT"
            "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT"
            "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT"
            "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
            generic_dna)
gene.translate(table="Bacterial")

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

In [82]:
gene.translate(table="Bacterial", to_stop=True)

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())

In [83]:
gene.translate(table="Bacterial", cds=True)

Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())
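With `cds=True` the sequence is treated as a complete coding sequence: the first codon must be a valid start codon (and is translated as M, which is why the leading V becomes M above), the length must be a multiple of three, and there must be a single in-frame stop codon at the very end. The sketch below mirrors those checks in plain Python. It is an illustration under stated assumptions: `looks_like_complete_cds` is a hypothetical helper, and `START_CODONS` lists only a subset of the start codons the bacterial table allows.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: only a subset of the start codons allowed by the bacterial table.
START_CODONS = {"ATG", "GTG", "TTG"}


def looks_like_complete_cds(dna):
    """Return True if dna passes the basic structural checks of a complete CDS."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        return False
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if len(codons) < 2:
        return False
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    # No stop codon is allowed before the final one.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


print(looks_like_complete_cds("GTGAAATAA"))     # start, one codon, stop
print(looks_like_complete_cds("GTGTAAAAATAA"))  # contains an internal stop
```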

## Comparing Seq objects

In [84]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
seq2 = Seq("ACGT", IUPAC.ambiguous_dna)
str(seq1) == str(seq2)


True

In [85]:
str(seq1) == str(seq1)

True

In [93]:
# Seq objects are compared by their letters only. The alphabets are checked too,
# but a mismatch (here DNA vs. protein) only triggers a warning, so the result is still True:

from Bio.Seq import Seq
from Bio.Alphabet import generic_dna, generic_protein
dna_seq = Seq("ACGT", generic_dna)
prot_seq = Seq("ACGT", generic_protein)
dna_seq == prot_seq

True