<h1>biopython Introduction</h1>
<p>
For the full documentation, see http://biopython.org/wiki/Biopython; for the long tutorial, see http://biopython.org/DIST/docs/tutorial/Tutorial.html; for the cookbook, see http://biopython.org/wiki/Category%3ACookbook.
</p>
<p>
biopython is a general-purpose bioinformatics package for Python, coordinated by the Open Bioinformatics Foundation (https://www.open-bio.org/wiki/Main_Page), which also coordinates, amongst other projects, BioPerl (http://bioperl.org/) and BioJava (http://biojava.org/). If you use the Anaconda Python distribution, the easiest way to install biopython is through the Anaconda Navigator. If you use pip, you can install biopython with the command "pip install biopython".
</p>
<p>
Once you have installed biopython, you can import it into a Python script like this (note that you may have to import submodules of biopython to access all of its functions):
</p>

In [1]:
import Bio
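
<p>
Before running the cells below, it can help to confirm that the installation step worked. The following is a minimal sketch that uses only the standard library; the helper name biopython_version is made up for this example and is not part of biopython.
</p>

```python
from importlib import metadata


def biopython_version():
    """Return the installed biopython version string, or None if it is missing."""
    try:
        # "biopython" is the distribution name used by "pip install biopython"
        return metadata.version("biopython")
    except metadata.PackageNotFoundError:
        return None


version = biopython_version()
print(version if version else "biopython is not installed")
```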

<h3>1. Storing DNA, RNA and Amino Acid Sequences</h3>

In [14]:
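from Bio.Seq import Seq
# Bio.Alphabet was removed in Biopython 1.78; on newer releases, drop this import and the alphabet arguments
from Bio.Alphabet import IUPAC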

In [15]:
# An unambiguous DNA sequence
unambiguous = Seq("ACGT", IUPAC.unambiguous_dna)
print(unambiguous)
print(unambiguous.alphabet)
print(unambiguous.count("AC")) # Returns only non-overlapping occurrences
print(len(unambiguous))

ACGT
IUPACUnambiguousDNA()
1
4
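
<p>
The remark about non-overlapping occurrences matters for repetitive sequences. The sketch below uses plain Python strings, whose count method also counts only non-overlapping matches, to show the difference; count_overlapping is a hypothetical helper written for this example.
</p>

```python
def count_overlapping(sequence, pattern):
    """Count every position at which pattern starts, overlaps included."""
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Move on by one character only, so overlapping matches are found
        start = sequence.find(pattern, start + 1)
    return count


sequence = "AAAA"
non_overlapping = sequence.count("AA")          # matches at positions 0 and 2
overlapping = count_overlapping(sequence, "AA")  # matches at positions 0, 1 and 2
print(non_overlapping, overlapping)
```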


In [22]:
# An ambiguous DNA sequence
ambiguous = Seq("ACGT", IUPAC.ambiguous_dna)
ambiguous

Seq('ACGT', IUPACAmbiguousDNA())

In [23]:
# An ambiguous RNA sequence
ambiguous = Seq("ACGU", IUPAC.ambiguous_rna)
ambiguous

Seq('ACGU', IUPACAmbiguousRNA())

In [12]:
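# An amino acid sequence (U, selenocysteine, is not one of the 20 letters of IUPAC.protein; Seq does not validate the letters)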
protein = Seq("TRUMP", IUPAC.protein)
protein

Seq('TRUMP', IUPACProtein())

<h3>2. Translation and transcription</h3>

In [27]:
# Complementary DNA
sequence = Seq("AGCCCTYA", IUPAC.ambiguous_dna)
sequence.complement()

Seq('TCGGGART', IUPACAmbiguousDNA())

In [30]:
# Reverse-complementary DNA
sequence.reverse_complement()

Seq('TRAGGGCT', IUPACAmbiguousDNA())
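
<p>
To make the handling of ambiguity codes such as Y and R visible, here is a minimal pure-Python sketch of complementing with the IUPAC nucleotide codes. It is an illustration of the idea, not the code biopython uses internally.
</p>

```python
# IUPAC nucleotide codes and their complements
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R",  # purine (A/G) <-> pyrimidine (C/T)
    "S": "S", "W": "W",  # strong (C/G) and weak (A/T) map to themselves
    "K": "M", "M": "K",
    "B": "V", "V": "B",
    "D": "H", "H": "D",
    "N": "N",
}


def complement(dna):
    """Complement each base without changing the order."""
    return "".join(COMPLEMENT[base] for base in dna.upper())


def reverse_complement(dna):
    """Complement each base and reverse the result."""
    return complement(dna)[::-1]


print(complement("AGCCCTYA"))
print(reverse_complement("AGCCCTYA"))
```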

In [40]:
# DNA->RNA
dna = Seq("AAATTTCCCGGG", IUPAC.unambiguous_dna)
rna = dna.transcribe()
rna

Seq('AAAUUUCCCGGG', IUPACUnambiguousRNA())

In [41]:
# RNA->Proteins
print(dna.translate()) # The length of the DNA or RNA should be a multiple of 3
print(rna.translate())

KFPG
KFPG
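
<p>
What transcribe() and translate() do can be sketched in a few lines of plain Python. The version below assumes the standard genetic code and unambiguous bases only, and it raises an error for lengths that are not a multiple of 3; it is meant as an illustration, not as a replacement for the biopython methods.
</p>

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Turn a coding-strand DNA string into RNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(sequence):
    """Translate a DNA or RNA string codon by codon."""
    dna = sequence.upper().replace("U", "T")
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


dna = "AAATTTCCCGGG"
rna = transcribe(dna)
print(rna)
print(translate(dna), translate(rna))
```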


<h3>3. Access to common databases</h3>

In [43]:
# NCBI Entrez...
from Bio import Entrez

In [46]:
# ...get all databases...
handle = Entrez.einfo()
result = handle.read()
handle.close()
result

'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">\n<eInfoResult>\n<DbList>\n\n\t<DbName>pubmed</DbName>\n\t<DbName>protein</DbName>\n\t<DbName>nuccore</DbName>\n\t<DbName>ipg</DbName>\n\t<DbName>nucleotide</DbName>\n\t<DbName>nucgss</DbName>\n\t<DbName>nucest</DbName>\n\t<DbName>structure</DbName>\n\t<DbName>sparcle</DbName>\n\t<DbName>genome</DbName>\n\t<DbName>annotinfo</DbName>\n\t<DbName>assembly</DbName>\n\t<DbName>bioproject</DbName>\n\t<DbName>biosample</DbName>\n\t<DbName>blastdbinfo</DbName>\n\t<DbName>books</DbName>\n\t<DbName>cdd</DbName>\n\t<DbName>clinvar</DbName>\n\t<DbName>clone</DbName>\n\t<DbName>gap</DbName>\n\t<DbName>gapplus</DbName>\n\t<DbName>grasp</DbName>\n\t<DbName>dbvar</DbName>\n\t<DbName>gene</DbName>\n\t<DbName>gds</DbName>\n\t<DbName>geoprofiles</DbName>\n\t<DbName>homologene</DbName>\n\t<DbName>medgen</DbName>\n\t<DbName>mesh</DbName>\n\t
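
<p>
The raw XML string above is hard to read. One possible way to pull out just the database names is the standard library's ElementTree, sketched below on a shortened sample of the same structure (the XML declaration and DOCTYPE are left out of the sample). The helper name list_databases is an assumption of this example; Entrez.read, used in the next cell, is another option.
</p>

```python
import xml.etree.ElementTree as ET


def list_databases(einfo_xml):
    """Return the text of every <DbName> element in an einfo result."""
    root = ET.fromstring(einfo_xml)
    return [node.text for node in root.iter("DbName")]


# Shortened sample with the same element structure as the output above
sample = """<eInfoResult>
<DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nuccore</DbName>
</DbList>
</eInfoResult>"""

databases = list_databases(sample)
print(databases)
```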

In [49]:
# ...look up in pubmed
Entrez.email = "" # Fill in your own e-mail address; NCBI asks for one with each request
handle = Entrez.esearch(db="pubmed", term="Vibrio")
record = Entrez.read(handle)
record

{'Count': '26222', 'RetMax': '20', 'RetStart': '0', 'IdList': ['28878009', '28876229', '28875385', '28874446', '28870859', '28870858', '28870828', '28869845', '28867634', '28867630', '28867285', '28866277', '28864971', '28864654', '28863970', '28863889', '28863888', '28860258', '28860256', '28860073'], 'TranslationSet': [{'From': 'Vibrio', 'To': '"vibrio"[MeSH Terms] OR "vibrio"[All Fields]'}], 'TranslationStack': [{'Term': '"vibrio"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '18772', 'Explode': 'Y'}, {'Term': '"vibrio"[All Fields]', 'Field': 'All Fields', 'Count': '26224', 'Explode': 'N'}, 'OR', 'GROUP'], 'QueryTranslation': '"vibrio"[MeSH Terms] OR "vibrio"[All Fields]'}

In [2]:
# KEGG...
from Bio.KEGG import REST
from Bio.KEGG import Enzyme

In [10]:
# ...get pathways for Vibrio alginolyticus...
pathways = REST.kegg_list("pathway", "vag").read()

In [8]:
# ...select a pathway...
pathway_line = pathways.rstrip().split("\n")[0]
entry, description = pathway_line.split("\t")
entry

'path:vag00010'
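
<p>
The cell above picks only the first line of the listing. As a sketch, the whole tab-separated text can be turned into a dictionary in the same way. The sample text below is illustrative and was not copied from a real KEGG response; parse_kegg_list is a hypothetical helper.
</p>

```python
def parse_kegg_list(text):
    """Map each entry ID to its description, one tab-separated pair per line."""
    entries = {}
    for line in text.strip().splitlines():
        entry, description = line.split("\t", 1)
        entries[entry] = description
    return entries


# Illustrative sample in the "entry<TAB>description" layout
sample = (
    "path:vag00010\tGlycolysis / Gluconeogenesis\n"
    "path:vag00020\tCitrate cycle (TCA cycle)\n"
)

pathway_table = parse_kegg_list(sample)
for entry, description in pathway_table.items():
    print(entry, "->", description)
```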

In [12]:
# ...and analyze this pathway
pathway_file = REST.kegg_get(entry).read()

<h3>4. Sequence Alignments</h3>

In [30]:
# Simple Needleman-Wunsch (note: two cells below, a more complex alternative is shown)
from Bio import pairwise2
from Bio import SeqIO

seq1 = Seq("AAATTGGGCCGG", IUPAC.protein)
seq2 = Seq("AAAGGTTAAA", IUPAC.protein)

alignments = pairwise2.align.globalxx(seq1, seq2)
print(alignments[0]) # First optimal alignment

('AAATTGGGCCGG-----', 'AAA-------GGTTAAA', 5.0, 0, 17)
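
<p>
The score of 5.0 can be reproduced with a small dynamic-programming table. The sketch below computes only the Needleman-Wunsch score (no traceback), assuming the globalxx scheme of 1 point per match and no penalty for mismatches or gaps; the function name global_score and its parameters are made up for this example.
</p>

```python
def global_score(a, b, match=1.0, mismatch=0.0, gap=0.0):
    """Return the optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    # First row and column: align a prefix against gaps only
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


best = global_score("AAATTGGGCCGG", "AAAGGTTAAA")
print(best)
```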


In [31]:
# Simple Smith-Waterman (note: two cells below, a more complex alternative is shown)
alignments2 = pairwise2.align.localxx(seq1, seq2)
print(alignments2[0]) # First optimal alignment

('AAATTGGGCCGG', 'AAA--GGTTAAA', 5.0, 0, 12)


In [28]:
# Better global alignment alternative with penalizing options
from Bio.SubsMat.MatrixInfo import blosum62

alignments3 = pairwise2.align.globalds(seq1, seq2, blosum62, -10, -0.5)
print(alignments3[0])

('AAATTGGGCCGG', 'AAA--GGTTAAA', 10.5, 0, 12)


In [29]:
# Better local alignment alternative with penalizing options
alignments4 = pairwise2.align.localds(seq1, seq2, blosum62, -10, -1)
print(alignments4[0])

('AAATTGGGCCGG', '--AAAGGTTAAA', 16.0, 2, 7)


<h3>5. BLAST</h3>

In [33]:
# The biopython option is not reliable. Call the command-line tool instead, e.g. with subprocess.run (assumed here; it expects the arguments as one list)
from subprocess import run
blastn_output = run(["blastn", "-query", "temp_in.txt", "-out", "temp_out.txt", "-db", "nt", "-outfmt", '6 sacc pident qstart qend evalue bitscore stitle sscinames sskingdoms', "-max_target_seqs", "5"])
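
<p>
With output format 6, blastn writes one tab-separated line per hit, with the columns in the order given in the -outfmt string. A minimal sketch for reading such output follows; the sample line is invented for illustration and does not come from a real search, and parse_blast_tabular is a hypothetical helper.
</p>

```python
import csv
import io

# Column names in the order given in the -outfmt string above
BLAST_COLUMNS = "sacc pident qstart qend evalue bitscore stitle sscinames sskingdoms".split()


def parse_blast_tabular(text):
    """Parse tab-separated BLAST output into a list of dicts with typed numbers."""
    reader = csv.DictReader(
        io.StringIO(text), fieldnames=BLAST_COLUMNS,
        delimiter="\t", quoting=csv.QUOTE_NONE,
    )
    hits = []
    for row in reader:
        row["pident"] = float(row["pident"])
        row["qstart"] = int(row["qstart"])
        row["qend"] = int(row["qend"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        hits.append(row)
    return hits


# Invented sample line, for illustration only
sample = "XX000001.1\t98.50\t1\t200\t1e-95\t350\tHypothetical subject title\tVibrio alginolyticus\tBacteria\n"

hits = parse_blast_tabular(sample)
print(hits[0]["sacc"], hits[0]["pident"], hits[0]["sscinames"])
```

For a real run, the text would come from the output file, for example open("temp_out.txt").read().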

<h3>6. Sequence Motif Analysis</h3>

In [40]:
from Bio import motifs

In [43]:
# Generate nucleotide count report
seqset = [Seq("ACGTG"),
          Seq("ATGTT"),
          Seq("TTTTT")]
motif = motifs.create(seqset)
motif.counts

{'A': [2, 0, 0, 0, 0],
 'C': [0, 1, 0, 0, 0],
 'G': [0, 0, 2, 0, 1],
 'T': [1, 2, 1, 3, 2]}

In [44]:
# Access the counts for one nucleotide
motif.counts["T"]

[1, 2, 1, 3, 2]

In [46]:
# Get consensus
motif.consensus

Seq('ATGTT', IUPACUnambiguousDNA())
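
<p>
The count matrix and the consensus are easy to reproduce by hand, which helps when checking results. The following sketch counts each letter per position and picks the most frequent one; the function names are made up for this example, and ties are resolved by alphabet order, which may differ from what biopython does.
</p>

```python
from collections import Counter


def count_matrix(sequences, alphabet="ACGT"):
    """Count how often each letter occurs at each position."""
    length = len(sequences[0])
    if any(len(sequence) != length for sequence in sequences):
        raise ValueError("all sequences must have the same length")
    columns = [Counter(column) for column in zip(*sequences)]
    return {letter: [column[letter] for column in columns] for letter in alphabet}


def consensus(counts):
    """Pick the most frequent letter at each position."""
    length = len(next(iter(counts.values())))
    return "".join(
        max(counts, key=lambda letter: counts[letter][i]) for i in range(length)
    )


counts = count_matrix(["ACGTG", "ATGTT", "TTTTT"])
print(counts)
print(consensus(counts))
```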

In [47]:
# Get anticonsensus
motif.anticonsensus

Seq('GGAGA', IUPACUnambiguousDNA())

In [49]:
# Get degenerate consensus
motif.degenerate_consensus

Seq('ATGTT', IUPACAmbiguousDNA())

In [50]:
# Create a logo (using a web service)
motif.weblogo("./biopython/motif.png")

<img src="./biopython/motif.png"></img>

PSB 2017