# Biopython

Biopython is a collection of freely available Python tools for computational molecular biology. It has parsers (helpers for reading) for many common file formats used by bioinformatics tools and databases, such as BLAST, ClustalW, FASTA, GenBank, PubMed, ExPASy, SwissProt, and many more. Biopython also provides modules to connect to popular online services such as NCBI’s BLAST, Entrez and PubMed, or ExPASy’s Swiss-Prot, UniProt and Prosite.

In [1]:
# if you work on Google Colab, uncomment the line below and run this cell
# !pip install -qq biopython

## Working with sequences

Disputably (of course!), the central object in bioinformatics is the sequence. Most of the time when we think about sequences, we have in mind a string of letters like ‘AGTACACTGGT’. You can create a `Seq` object from such a string as follows.

In [33]:
from Bio.Seq import Seq
my_seq = Seq("ATGGCCATTGTAATGGGCCGCTAG")
my_seq

Seq('ATGGCCATTGTAATGGGCCGCTAG')

The `Seq` object differs from the Python string in the methods it supports. You can’t do this with a plain string:

In [34]:
my_seq.complement()

Seq('TACCGGTAACATTACCCGGCGATC')

In [35]:
my_seq.reverse_complement()

Seq('CTAGCGGCCCATTACAATGGCCAT')

In [36]:
my_seq.transcribe()

Seq('AUGGCCAUUGUAAUGGGCCGCUAG')

In [37]:
my_seq.transcribe().translate()

Seq('MAIVMGR*')
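
To make these methods less of a black box, here is a minimal plain-Python sketch of what complementing, reverse-complementing and transcribing do to the example sequence. The helper names (`complement`, `reverse_complement`, `transcribe`) are illustrative and are not Biopython's implementation; the sketch assumes an uppercase DNA string without ambiguous bases.

```python
# Illustrative helpers only: Biopython's Seq methods also handle lowercase
# and ambiguous bases, which this sketch deliberately ignores.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna: str) -> str:
    """Complement the sequence and read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(dna: str) -> str:
    """Turn a coding-strand DNA string into mRNA by replacing T with U."""
    return dna.replace("T", "U")


dna = "ATGGCCATTGTAATGGGCCGCTAG"
print(complement(dna))
print(reverse_complement(dna))
print(transcribe(dna))
```

The three printed strings should agree with the `Seq` outputs shown above.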

In [39]:
from Bio.SeqUtils import GC

GC(my_seq)

54.166666666666664
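
The GC percentage is simply the share of G and C bases in the sequence. In newer Biopython releases the `GC` helper has been superseded by `gc_fraction` (which returns a fraction rather than a percentage), so a version-independent check can be handy. The sketch below is a plain-Python approximation under the assumption that only the letters G and C count; the name `gc_percent` is hypothetical.

```python
def gc_percent(sequence: str) -> float:
    """Return the percentage of G/C letters in a sequence (0.0 if empty)."""
    if not sequence:
        return 0.0
    # Count G and C in either case; ambiguous codes are not considered here.
    gc_count = sum(1 for base in sequence if base in "GCgc")
    return 100 * gc_count / len(sequence)


print(gc_percent("ATGGCCATTGTAATGGGCCGCTAG"))
```

For the example sequence this counts 13 G/C bases out of 24, which is consistent with the value reported above.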

You can always convert a `Seq` back to a plain Python string.

In [10]:
str(my_seq)

'ATGGCCATTGTAATGGGCCGCTAG'

The next most important class is the `SeqRecord`, or Sequence Record. It holds a sequence (as a `Seq` object) together with additional annotation, including an identifier, a name and a description.

## Parsing sequence file formats with `Bio.SeqIO`

A large part of bioinformatics work involves dealing with the many file formats designed to hold biological data. These files are loaded with interesting biological data, and a special challenge is parsing them into a form that you can manipulate with a programming language. However, the task of parsing these files can be frustrated by the fact that the formats can change quite regularly, and that they may contain small subtleties which can break even the most well-designed parsers.

We start with some sequence data from Lady Slipper Orchids (if you wonder why, have a look at some [Lady Slipper Orchids photos](https://www.flickr.com/search/?q=lady+slipper+orchid&s=int&z=t) on Flickr, or try a Google Image Search). If you open the lady slipper orchids FASTA file [ls_orchid.fasta](data/ls_orchid.fasta) in your favourite text editor, you’ll see that the file starts like this:

```
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
```

It contains 94 records; each has a header line starting with “>” (greater-than symbol) followed by the sequence on one or more lines. You can easily read them all in Python:

In [6]:
from Bio import SeqIO

for seq_record in SeqIO.parse("data/ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
748
gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT')
744
gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA')
733
gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC')
718
gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT')
730
gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA')
704
gi|2765649|emb|Z78524.1|CFZ78524
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC')
740
...
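
The FASTA layout described above (a “>” header line followed by one or more sequence lines) is simple enough to read by hand. The following is a minimal sketch of that idea, not how `SeqIO.parse` is implemented; the function name `parse_fasta` and the two toy records are made up for illustration, and the sketch ignores everything a real parser has to worry about (comments, malformed input, alphabets).

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA text handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            parts = line[1:].split(maxsplit=1)
            # The identifier is the first whitespace-delimited token.
            identifier = parts[0] if parts else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Toy input standing in for a real file such as ls_orchid.fasta.
example = io.StringIO(
    ">toy1 first made-up record\n"
    "ACGTAC\n"
    "GTA\n"
    ">toy2 second made-up record\n"
    "TTGA\n"
)
records = list(parse_fasta(example))
for identifier, sequence in records:
    print(identifier, sequence, len(sequence))
```

In practice `SeqIO.parse` remains the better choice, since it returns full `SeqRecord` objects and supports many formats through the same interface.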

Now let’s load the GenBank file [ls_orchid.gbk](data/ls_orchid.gbk) instead. Notice that the code is almost identical to the snippet used above for the FASTA file; the only differences are the filename and the format string:

In [7]:
for seq_record in SeqIO.parse("data/ls_orchid.gbk", "genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
740
Z78532.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
753
Z78531.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
748
Z78530.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT')
744
Z78529.1
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA')
733
Z78527.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC')
718
Z78526.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT')
730
Z78525.1
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA')
704
Z78524.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC')
740
Z78523.1
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTGCGGCAGGATCATTGTTGAGACAGCAG...AAG')
709
Z78522.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...GAG')
700
Z78521.1
Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAGAATATATGATCGAGT...ACC')
726
...

## Final exercise

1. Download all human cDNA (complementary DNA) sequences; you can find them on Ensembl, for example.

In [42]:
!wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.abinitio.fa.gz

2. Parse the FASTA file and convert it into a pandas `DataFrame` with the columns `id` (Ensembl transcript id), `seq` (cDNA sequence) and `GC` (GC percentage).

You can either unzip the `Homo_sapiens.GRCh38.cdna.abinitio.fa.gz` file or use the `gzip` library as suggested below.

In [1]:
import gzip
import pandas as pd
from Bio import SeqIO

# Open the gzipped FASTA in text mode ("rt") so SeqIO receives str lines.
with gzip.open("Homo_sapiens.GRCh38.cdna.abinitio.fa.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        ## YOUR CODE HERE
        pass  # placeholder so the cell runs before you fill it in
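
The `gzip` approach can be tried on a toy file before running it on the full download. The sketch below uses only the standard library: it writes a tiny, made-up gzipped FASTA file into a temporary directory and reads it back in text mode (`"rt"`), counting the header lines. The file name `toy.fa.gz` and its contents are assumptions for illustration only.

```python
import gzip
import os
import tempfile

# Made-up two-record FASTA content, standing in for the Ensembl file.
toy_fasta = ">toy1\nACGT\n>toy2\nGGCC\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.fa.gz")

    # "wt" compresses text on the fly while writing.
    with gzip.open(path, "wt") as handle:
        handle.write(toy_fasta)

    # "rt" decompresses on the fly and yields str lines, just like a
    # normal text file, so no separate unzip step is needed.
    with gzip.open(path, "rt") as handle:
        header_count = sum(1 for line in handle if line.startswith(">"))

print(header_count)
```

Reading in text mode matters because FASTA parsers expect strings; the default mode of `gzip.open` is binary and would yield bytes instead.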

3. Calculate the mean GC percentage and visualize the distribution of GC percentages with a histogram ([seaborn.histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html)).

In [None]:
import seaborn as sns

## YOUR CODE HERE