# Biopython

## Introduction

Biopython is an open-source Python library for bioinformatics. It is a collection of modules that provide functions for working with DNA, RNA, and protein sequences, such as reverse-complementing and finding motifs. It can also read data from common sequence databases and file formats, including GenBank, SwissProt, and FASTA.

## Installation

```sh
pip install biopython
```

To upgrade an old version:

```sh
pip install --upgrade biopython
```

## Sequence

In [1]:
from Bio.SeqIO import parse
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

In [4]:
with open("example.fasta", "r") as file:
    records = parse(file, 'fasta')
    for record in records:
        print(f"Id: {record.id}")
        print(f"Name: {record.name}")
        print(f"Description: {record.description}")
        print(f"Annotations: {record.annotations}")
        print(f"Sequence Data: {record.seq}")
        print("")

Id: HBB
Name: HBB
Description: HBB | Human beta-globin gene
Annotations: {}
Sequence Data: ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACCCTTAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGGTGAGTCTATGGGACGCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGTTAAGTTCATGTCATAGGAAGGGGAGAAGTAACAGGGTACAGTTTAGAATGGGAAACAGACGAATGATTGCATCAGTGTGGAAGTCTCAGGATCGTTTTAGTTTCTTTTATTTGCTGTTCATAACAATTGTTTTCTTTTGGTTATTTCTGTTCTTTTTTTTTTCTTCTCCGCAATTTTTACTATTATACTTAATGCCTTAACATTGTGTATAACAAAAGGAAATATCTCTGAGATACATTAAGTAACTTAAAAAAAAACTTTACACAGTCTGCCTAGTACATTACTATTTGGAATATATGTGTGCTTATTTGCATATTCATAATCTCCCTACTTTATTTTCTTTTATTTTTAATTGATACATAATCATTATACATATTTATGGGTTAAAGTGTAATGTTTTAATATGTGTACACAT
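To make the fields printed above more concrete, here is a minimal pure-Python sketch of how a FASTA header line can be split into an id and a description. It is not Biopython's parser: the helper names `read_fasta` and `_record` are hypothetical, and the input is only the first 24 bases of the record above, wrapped over two lines for illustration. Consistent with the output above, the id is taken as the first whitespace-delimited word of the header and the description as the whole header.

```python
from io import StringIO


def _record(header, chunks):
    """Build an (id, description, sequence) tuple from a header and its sequence lines."""
    words = header.split()
    record_id = words[0] if words else ""
    return record_id, header, "".join(chunks)


def read_fasta(handle):
    """Yield (id, description, sequence) tuples from FASTA-formatted text."""
    header, chunks = None, []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)  # finish the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may wrap over several lines
    if header is not None:
        yield _record(header, chunks)


# Illustrative input: a shortened copy of the HBB record.
fasta_text = ">HBB | Human beta-globin gene\nATGGTGCACCTG\nACTCCTGAGGAG\n"
records = list(read_fasta(StringIO(fasta_text)))
for record_id, description, sequence in records:
    print(record_id, description, sequence, sep="\n")
```

For real files, `Bio.SeqIO.parse` remains the better choice, since it also handles the other formats and returns full `SeqRecord` objects.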

A sequence is a series of letters that represents a protein, DNA, or RNA molecule. In Biopython it is represented by the `Seq` class, defined in the `Bio.Seq` module.

In [5]:
from Bio.Seq import Seq

# A super simple sequence
seq = Seq('AGCT')
seq

Seq('AGCT')

Biopython exposes bioinformatics reference data, such as alphabets and codon tables, through the `Bio.Data` module. For example, `IUPACData.protein_letters` lists the one-letter codes of the 20 standard amino acids:

In [7]:
from Bio.Data import IUPACData
IUPACData.protein_letters

'ACDEFGHIKLMNPQRSTVWY'
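One possible use of this letter set is a quick sanity check on a protein string. The sketch below is plain Python: the alphabet is copied from the output above, and the helper name `invalid_letters` is hypothetical rather than part of Biopython.

```python
# The 20 standard amino-acid letters, copied from the output above.
PROTEIN_LETTERS = "ACDEFGHIKLMNPQRSTVWY"


def invalid_letters(sequence, alphabet=PROTEIN_LETTERS):
    """Return the sorted characters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence.upper()) - set(alphabet))


print(invalid_letters("MAIVMGRLKGCP"))  # only standard letters
print(invalid_letters("MAIVBZX"))       # B, X and Z are outside the 20-letter set
```

An empty list means every character belongs to the standard alphabet.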

## Basic Operations

In [8]:
# To get the first value in the sequence
seq = Seq("AGCTAGCT")
seq[0]

'A'

In [9]:
# To print the first two letters
seq[0:2]

Seq('AG')

In [10]:
# To get the length of the sequence
len(seq)

8

In [11]:
# To count a particular nucleotide
seq.count('A')

2

In [13]:
# To find a single letter or sequence of letters inside the given sequence
seq = Seq("AGUAGCACUGGU")
seq.find('G')

1

In [14]:
seq.find('GG')

9
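As the two calls above show, `find` reports only the first match. To list every occurrence of a motif, one option is to call `find` repeatedly with a moving start position. The sketch below uses a plain `str` so it runs without Biopython, and the helper name `find_all` is hypothetical. The same loop should carry over to a `Seq` if its `find` accepts a start argument the way `str.find` does, but that is an assumption not tested here.

```python
def find_all(sequence, motif):
    """Return the start index of every (possibly overlapping) occurrence of `motif`."""
    positions = []
    start = sequence.find(motif)
    while start != -1:  # str.find returns -1 when there is no further match
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


rna = "AGUAGCACUGGU"
print(find_all(rna, "G"))   # all positions of G
print(find_all(rna, "AG"))  # all positions of the motif AG
print(find_all(rna, "CC"))  # motif that does not occur
```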

## Advanced Operations

A nucleotide sequence can be complemented or reverse-complemented to obtain a new sequence, and reverse-complementing the result a second time returns the original sequence. Biopython provides two methods for this: `complement()` and `reverse_complement()`.

In [15]:
seq = Seq("TCGAAGTCAGTC")
seq.complement()

Seq('AGCTTCAGTCAG')

In [17]:
rev = seq.reverse_complement()
rev

Seq('GACTGACTTCGA')

In [18]:
rev.reverse_complement() == seq

True
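To show what these two methods compute, here is a minimal pure-Python sketch for unambiguous DNA. It is an illustration, not Biopython's implementation: the function names mirror the methods above but are defined locally, and ambiguity codes and lowercase letters are left unchanged.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Swap every base for its partner, keeping the order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


dna = "TCGAAGTCAGTC"
print(complement(dna))
print(reverse_complement(dna))
print(reverse_complement(reverse_complement(dna)) == dna)
```

The last line checks the round trip: applying the reverse complement twice gives back the starting sequence.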

The base composition of genomic DNA, summarized as its GC content, is predicted to significantly affect genome functionality and species ecology. GC content is the number of G and C nucleotides divided by the total number of nucleotides. To compute it, use `gc_fraction` from `Bio.SeqUtils` (available in Biopython 1.80 and later):

In [20]:
from Bio.SeqUtils import gc_fraction
seq = Seq("GACTGACTTCGAATAGC")
gc_fraction(seq)

0.47058823529411764
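The value above can be checked by hand: the sequence contains 4 G and 4 C among 17 letters, so the fraction is 8/17. The sketch below reproduces that arithmetic in plain Python. The name `gc_fraction_sketch` is hypothetical, and unlike the library function it makes no attempt to handle ambiguous nucleotide codes.

```python
def gc_fraction_sketch(sequence):
    """Return (count of G + count of C) / length, or 0.0 for an empty sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # avoid dividing by zero
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


dna = "GACTGACTTCGAATAGC"
print(dna.count("G"), dna.count("C"), len(dna))
print(gc_fraction_sketch(dna))
```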

Transcription is the process of converting a DNA sequence into an RNA sequence. Biologically, the mRNA is synthesized as the reverse complement of the DNA template strand. In bioinformatics, however, we typically work directly with the coding strand, so the mRNA sequence is obtained simply by replacing each T with U:

In [21]:
from Bio.Seq import Seq
from Bio.Seq import transcribe
seq = Seq("ATGCCGATCGTAT")
transcribe(seq)

Seq('AUGCCGAUCGUAU')

To reverse the transcription, change each U back to T:

In [22]:
rna = transcribe(seq)
rna.back_transcribe()

Seq('ATGCCGATCGTAT')

To get the DNA template strand, take the `reverse_complement` of the back-transcribed RNA:

In [23]:
rna.back_transcribe().reverse_complement()

Seq('ATACGATCGGCAT')
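The relationship between the coding strand, the template strand, and the mRNA can be checked with a short pure-Python sketch. The function names below are hypothetical and the code assumes unambiguous uppercase DNA. It shows that the T-to-U shortcut on the coding strand gives the same mRNA as reverse-complementing the template strand into RNA letters.

```python
# DNA base -> DNA partner, and DNA base -> RNA partner.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")


def template_strand(coding):
    """Reverse complement of the coding strand, written 5' to 3'."""
    return coding.translate(DNA_COMPLEMENT)[::-1]


def transcribe_coding(coding):
    """Bioinformatics shortcut: replace every T with U."""
    return coding.replace("T", "U")


def transcribe_template(template):
    """Biological view: the mRNA is the reverse complement of the template, in RNA letters."""
    return template.translate(RNA_COMPLEMENT)[::-1]


coding = "ATGCCGATCGTAT"
template = template_strand(coding)
print(template)
print(transcribe_coding(coding))
print(transcribe_template(template) == transcribe_coding(coding))
```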

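Translation is the process of converting an RNA sequence into a protein sequence. Consider the RNA sequences below. Neither has a length that is a multiple of three, so Biopython translates the complete codons and emits a `BiopythonWarning` about the trailing partial codon: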

In [25]:
rna_seq = Seq("AUGGCCAUUGUAAU")
rna_seq.translate()



Seq('MAIV')

In [26]:
rna = Seq("AUGGCCAUUGUAAUGGGCCGGCUGAAAGGGUGCCCGA")
rna.translate()



Seq('MAIVMGRLKGCP')
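The first result above can be reproduced by hand: split the RNA into codons of three letters and look each one up. The sketch below is a simplified illustration, not Biopython's algorithm. The dictionary `CODONS` holds only the four codons needed for this example (the full standard table has 64 entries), the name `translate_sketch` is hypothetical, and stop codons are not handled.

```python
# Partial codon table: only the entries needed for the example below.
CODONS = {"AUG": "M", "GCC": "A", "AUU": "I", "GUA": "V"}


def split_codons(rna):
    """Cut the sequence into consecutive chunks of three letters."""
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]


def translate_sketch(rna, table=CODONS):
    """Translate the complete codons and ignore a trailing partial codon."""
    complete = [codon for codon in split_codons(rna) if len(codon) == 3]
    return "".join(table[codon] for codon in complete)


rna = "AUGGCCAUUGUAAU"
print(split_codons(rna))      # the last chunk, 'AU', is a partial codon
print(translate_sketch(rna))
```

A codon missing from the partial table raises a `KeyError`, which keeps the sketch from silently producing a wrong protein.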

The NCBI Genetic Codes page provides the full list of translation tables used by Biopython. The standard table can be printed as follows:

In [28]:
from Bio.Data import CodonTable
table = CodonTable.unambiguous_dna_by_name["Standard"]
print(table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------