### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 09 - Introduction to Biopython - Sequences

*Written by:* Mateusz Kaczyński

**This notebook provides a general introduction to the Biopython library and an in-depth look at working with sequences - the cornerstone of bioinformatics.**


### What is Biopython?

**Biopython** is a popular open-source toolbox for computational biology and bioinformatics. It contains tools and connectors for various resources and provides functions for running common bioinformatics tasks and parsing common bioinformatics data formats.


### Why Biopython?
**Biopython** is a de facto standard for accessing many databases and tools, which makes it easier to share your work and results. It simplifies the use of libraries written in other languages and technologies, as well as those hosted online. While it is possible to use those directly, Biopython simplifies the process and provides the components you need.
 


## Contents


1. [Sequence basics](#Sequence-basics)
2. [Transcription and translation](#Transcription-and-translation)
3. [Alignment](#Alignment)
4. [Downloading and reading FASTA files](#Downloading-and-reading-FASTA-files)
5. [Discussion](#Discussion)
-----

#### Extra Resources:

- [Official Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) - A comprehensive guide to the library capabilities.
- [Biopython API documentation](https://biopython.org/docs/latest/api/index.html) - A long, detailed list of all methods and connectors provided by Biopython.
- [Rosalind](http://rosalind.info) - A bioinformatics learning platform that includes exercises.




#### Importing Biopython
Once installed, **Biopython** can be included in a project by importing the `Bio` module. It is not uncommon for a library's import name to differ from the name it is known by.

In [127]:
import Bio
print("Module", Bio.__name__, "version", Bio.__version__)

Module Bio version 1.79


## Sequence basics
Managing sequences of DNA, RNA or peptides is a core part of any bioinformatics workflow. Being able to understand, transform and use them is one of the most beneficial uses of computers in biology.

Here is how we define a simple sequence.


In [128]:
from Bio.Seq import Seq
dna_sequence = Seq("ACTG")

You can see that the sequence can be created from a simple `string` structure representing the ordered items (in this case - DNA). However, it is no longer a simple data type.

Let's take a closer look at the Python representation.

In [129]:
print("The sequence compared with the original string:", dna_sequence == "ACTG")
print("The type of dna_sequence:", type(dna_sequence))

The sequence compared with the original string: True
The type of dna_sequence: <class 'Bio.Seq.Seq'>


### Objects and classes
We can see that, while we can compare it with the original `string` object without an issue, the type of the variable is `Seq`.

`dna_sequence` is an `object` of `class` `Seq`. 
    
Let's unfold those concepts:
  - `object` - a fundamental building block of a program. It can contain properties (data) and functions. In Python, almost everything is an `object`.
  - `class` - a blueprint for creating new objects. It provides a way to bundle data with the functions that operate on it.
  - `Seq` - a particular `class` defined in the `Bio.Seq` module.
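
To make the `class` / `object` distinction concrete, here is a minimal sketch of a toy class. `MiniSeq` and its `gc_count` method are hypothetical names invented for this illustration; they are not part of Biopython and are far simpler than the real `Seq`.

```python
class MiniSeq:
    """A toy, hypothetical stand-in for a sequence class (not Biopython's Seq)."""

    def __init__(self, data):
        # The data (a plain string) is stored on the object itself.
        self.data = data

    def __len__(self):
        # Lets len(obj) work, just as it does for strings.
        return len(self.data)

    def gc_count(self):
        # A function that lives alongside the data it operates on.
        return self.data.count("G") + self.data.count("C")


toy = MiniSeq("ACTG")  # `toy` is an object of class MiniSeq
print(type(toy).__name__, len(toy), toy.gc_count())
```

The class is the blueprint; `toy` is one object built from it, carrying both its data and its methods.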


When in doubt, we can use the `help` function to learn how to use `Seq` objects.

In [130]:
help(dna_sequence)

Help on Seq in module Bio.Seq object:

class Seq(_SeqAbstractBaseClass)
 |  Seq(data, length=None)
 |  
 |  Read-only sequence object (essentially a string with biological methods).
 |  
 |  Like normal python strings, our basic sequence object is immutable.
 |  This prevents you from doing my_seq[5] = "A" for example, but does allow
 |  Seq objects to be used as dictionary keys.
 |  
 |  The Seq object provides a number of string like methods (such as count,
 |  find, split and strip).
 |  
 |  The Seq object also provides some biological methods, such as complement,
 |  reverse_complement, transcribe, back_transcribe and translate (which are
 |  not applicable to protein sequences).
 |  
 |  Method resolution order:
 |      Seq
 |      _SeqAbstractBaseClass
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __hash__(self)
 |      Hash of the sequence as a string for comparison.
 |      
 |      See Seq object comparison documentation (method ``__eq__`` in

### String-like behaviour
In the first workshop, we became familiar with how Python `string` objects - simple text representations - behave.

The `Seq` class is built to offer the same methods, which makes it easy to exchange information between the two types and presents a familiar interface to the user.

Let's verify this with some common methods, including indexing and slicing.

In [131]:
print("The length of the sequence:", len(dna_sequence))
print("AC found in the sequence:  ", "AC" in dna_sequence)
print("Sequence ends with 'TG':   ", dna_sequence.endswith("TG"))
print("The first base:            ", dna_sequence[0])
print("The last base:             ", dna_sequence[-1])
print("Slicing - first two bases: ", dna_sequence[0:2])
print("How many Adenines are in the sequence:", dna_sequence.count("A"))

The length of the sequence: 4
AC found in the sequence:   True
Sequence ends with 'TG':    True
The first base:             A
The last base:              G
Slicing - first two bases:  AC
How many Adenines are in the sequence: 1


What about case sensitivity? Does it matter if we represent Adenine with `a` or `A`? 

It turns out that it matters a lot. Since the removal of the `Alphabet` concept from **Biopython**, there is no easy way to verify or enforce the content of `Seq` objects. Users are required to perform their own verification.

Let's have a closer look.

In [132]:
dna_sequence_mixedcase = Seq("acTG")
print("Upper case sequence equal to mixed case sequence:       ", 
      dna_sequence == dna_sequence_mixedcase)
print("How about if we change the original to lower case?:     ", 
      dna_sequence.lower() == dna_sequence_mixedcase)
print("How about moving both to the same case? Are they equal?:", 
      dna_sequence.lower() == dna_sequence_mixedcase.lower())

Upper case sequence equal to mixed case sequence:        False
How about if we change the original to lower case?:      False
How about moving both to the same case? Are they equal?: True
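
Since verification is left to the user, a small helper can do the job. The sketch below is one possible approach, not a Biopython feature: `is_valid_dna` and its `allow_lowercase` parameter are hypothetical names, and only the four unambiguous bases are accepted (an assumption; real data may contain `N` or other ambiguity codes).

```python
DNA_LETTERS = set("ACGT")


def is_valid_dna(sequence, allow_lowercase=False):
    """Return True if the sequence is non-empty and contains only A, C, G or T."""
    text = str(sequence)  # works for plain strings and for Seq objects alike
    if allow_lowercase:
        text = text.upper()
    return len(text) > 0 and set(text) <= DNA_LETTERS


print(is_valid_dna("ACTG"))                        # True
print(is_valid_dna("acTG"))                        # False
print(is_valid_dna("acTG", allow_lowercase=True))  # True
print(is_valid_dna("ACUG"))                        # False (U is an RNA letter)
```

Normalising the case once, at the point where data enters the program, avoids the surprising comparison results shown above.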


To extend the sequence, we can use the familiar `string` syntax. 

Again, it does not matter whether the added part is a `string` or a `Seq` object.

In [133]:
print("Extended with another sequence:", dna_sequence + Seq("TAA"))
print("Extended with a string        :", dna_sequence + "TGG")

Extended with another sequence: ACTGTAA
Extended with a string        : ACTGTGG


### Example sequence-specific functions 
In addition to `string`-like behaviour, `Seq` objects contain methods applicable in their specific domain. 

For sequences that contain gaps, the `ungap` method removes them.

In [134]:
sequence_with_gaps = Seq("-".join(3*[str(dna_sequence)]))
print("Sequence with gaps   :", sequence_with_gaps)
print("Sequence without gaps:", sequence_with_gaps.ungap())

Sequence with gaps   : ACTG-ACTG-ACTG
Sequence without gaps: ACTGACTGACTG
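
Removing gaps is, at heart, a plain string replacement. The sketch below shows the same result using only `str` methods. It assumes `-` is the gap character, and it can serve as a fallback if `ungap` is unavailable in your installed Biopython version (worth checking, as this notebook was written against 1.79).

```python
gapped = "-".join(3 * ["ACTG"])     # the same toy sequence, as a plain string
ungapped = gapped.replace("-", "")  # drop every gap character

print(gapped)
print(ungapped)
```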


**Biopython** contains various tools for sequence analysis. For example, a function that calculates GC content can be imported as follows.

In [135]:
from Bio.SeqUtils import GC
print("The guanine-cytosine content of the sequence (%):", GC(dna_sequence))

The guanine-cytosine content of the sequence (%): 50.0
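
To show what the calculation involves, here is a minimal pure-Python sketch. `gc_percent` is a hypothetical helper, not the library function: it counts only `G` and `C` (ignoring ambiguity codes such as `S`) and returns 0.0 for an empty sequence, both of which are assumptions made for simplicity.

```python
def gc_percent(sequence):
    """Percentage of G and C letters in the sequence (case-insensitive)."""
    text = str(sequence).upper()
    if not text:
        return 0.0  # assumption: treat an empty sequence as 0% GC
    gc = text.count("G") + text.count("C")
    return 100.0 * gc / len(text)


print(gc_percent("ACTG"))  # 50.0
```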


## Transcription and translation
One of the most powerful features of **Biopython** is switching from one representation to another, converting between sequence types.

Let's take a look at how to generate complementary and reverse sequences.

In [136]:
print("Sequence:          ", dna_sequence)
print("Complement:        ", dna_sequence.complement())
print("Reverse:           ", dna_sequence[::-1])
print("Reverse complement:", dna_sequence.reverse_complement())

Sequence:           ACTG
Complement:         TGAC
Reverse:            GTCA
Reverse complement: CAGT
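
The complement is a letter-for-letter substitution, and the reverse complement is that substitution read backwards. The following sketch reproduces the results above on plain strings. The function names are hypothetical and only the four unambiguous DNA bases are handled (other characters pass through unchanged).

```python
# Translation table mapping each base to its complementary base.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap A<->T and C<->G, keeping the original order."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


print(complement("ACTG"))          # TGAC
print(reverse_complement("ACTG"))  # CAGT
```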


Simple DNA-to-RNA transcription can be achieved with the following methods.

In [137]:
print("Sequence:              ", dna_sequence)
print("Transcribed:           ", dna_sequence.transcribe())
print("Reverse complement RNA:", dna_sequence.reverse_complement_rna())

Sequence:               ACTG
Transcribed:            ACUG
Reverse complement RNA: CAGU


The reverse process - getting from RNA back to DNA - is also supported.

In [138]:
transcribed = dna_sequence.transcribe()
print("Transcribed:     ", transcribed)
print("Back transcribed:", transcribed.back_transcribe())

Transcribed:      ACUG
Back transcribed: ACTG
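
As the outputs suggest, this kind of transcription treats the DNA as the coding strand and simply swaps thymine for uracil. A minimal sketch on plain strings, with hypothetical function names, could look like this:

```python
def transcribe_dna(dna):
    """Coding-strand DNA to RNA: replace T with U."""
    return dna.replace("T", "U").replace("t", "u")


def back_transcribe_rna(rna):
    """RNA back to coding-strand DNA: replace U with T."""
    return rna.replace("U", "T").replace("u", "t")


rna = transcribe_dna("ACTG")
print(rna)                       # ACUG
print(back_transcribe_rna(rna))  # ACTG
```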


**Biopython** makes translation from RNA (or directly from DNA) to an amino acid sequence very easy.

*Note that different codon tables are also supported.*

In [139]:
sequence = Seq("ACGCGACGA")
sequence.translate()

Seq('TRR')
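
Translation reads the sequence three letters at a time and looks each codon up in a codon table. The sketch below builds the standard genetic code from its usual compact listing (bases ordered `TCAG`) and reproduces the result above. `translate_codons` is a hypothetical helper, not the library method: it ignores a trailing partial codon, does not stop at stop codons, and raises `KeyError` on ambiguous letters, all of which are simplifying assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed for codons TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codons(sequence):
    """Translate DNA or RNA codon by codon; '*' marks a stop codon."""
    text = str(sequence).upper().replace("U", "T")
    usable = len(text) - len(text) % 3  # drop an incomplete final codon
    return "".join(CODON_TABLE[text[i:i + 3]] for i in range(0, usable, 3))


print(translate_codons("ACGCGACGA"))  # TRR
```

Supporting a different codon table would only require a different `AMINO_ACIDS` string.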

### Using `Seq` methods directly on `string`s. 
It is possible to bypass the `Seq` class and instead import the relevant functions and call them directly on simple `string`s.

In [140]:
from Bio.Seq import transcribe
print("Transcription directly from 'ACGT' string, without using Seq object:", transcribe("ACGT"))

Transcription directly from 'ACGT' string, without using Seq object: ACGU


## Alignment

One of the most common uses of bioinformatics tools is sequence alignment. 
**Biopython** provides support for various command-line and online alignment tools. The details can be found [here](https://biopython.org/docs/latest/api/Bio.Align.Applications.html). 

For this section of the workshop, we will use the built-in, out-of-the-box alignment method from the `pairwise2` module. It aligns two sequences at a time.

We will first import it, run a global (full sequence) alignment and present relevant results.

In [141]:
from Bio import pairwise2

# Notice we can supply either Seq's or strings directly.
global_alignments = pairwise2.align.globalxx(Seq("ACGT"), "ACGC")
for ga in global_alignments:
    print(ga)

Alignment(seqA='ACGT-', seqB='ACG-C', score=3.0, start=0, end=5)
Alignment(seqA='ACGT', seqB='ACGC', score=3.0, start=0, end=4)


This representation is not very user-friendly.

Let's try to make it look better.

In [142]:
# Print the first available alignment. 
# The * unpacks the Alignment (a named tuple) into the separate arguments expected by format_alignment.
print(pairwise2.format_alignment(*global_alignments[0]))

ACGT-
|||  
ACG-C
  Score=3
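
The `xx` suffix in `globalxx` means that a match scores 1 and that mismatches and gaps carry no penalty. Under that scoring, the best global score can be found with a small dynamic-programming table. The sketch below is a simplified illustration (it returns the score only, not the alignments) and is not how `pairwise2` is implemented; `global_score_xx` is a hypothetical name.

```python
def global_score_xx(a, b):
    """Best global alignment score with match = 1, mismatch = 0, gap = 0."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            match = 1 if a[i - 1] == b[j - 1] else 0
            score[i][j] = max(
                score[i - 1][j - 1] + match,  # align a[i-1] with b[j-1]
                score[i - 1][j],              # gap in b (free)
                score[i][j - 1],              # gap in a (free)
            )
    return score[-1][-1]


print(global_score_xx("ACGT", "ACGC"))  # 3
```

Because gaps are free, several different alignments can reach the same top score, which is why two alignments were reported above.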



Similar to the global alignment, we can use `pairwise2` to perform local alignment of 2 sequences.

In [143]:
local_alignments = pairwise2.align.localxx(Seq("ACGT"), "ACGC")
print(pairwise2.format_alignment(*local_alignments[0]))

1 ACG
  |||
1 ACG
  Score=3



Multiple sequence alignment (MSA) is used to align three or more sequences at a time.

*Note: `MultipleSeqAlignment` does not compute an alignment; it stores sequences that are already aligned (gaps included), so they must all have the same length.*

In [144]:
from Bio.Align import MultipleSeqAlignment, AlignInfo
from Bio.SeqRecord import SeqRecord

seqs = [
    "ACGTACGT",
    "ACGTGCGC",
    "ACGTA--T",
    "CCGTACGG",
    "A-GTACCC",
    "ACGTA--T",
    "CTG-ACG-",
    "AGGTACG-"
]
aligned = MultipleSeqAlignment([SeqRecord(Seq(s)) for s in seqs])

An alignment is of limited value on its own.

Let's see what summary statistics can be retrieved from it.

In [145]:
align_info = AlignInfo.SummaryInfo(aligned)
help(align_info)

Help on SummaryInfo in module Bio.Align.AlignInfo object:

class SummaryInfo(builtins.object)
 |  SummaryInfo(alignment)
 |  
 |  Calculate summary info about the alignment.
 |  
 |  This class should be used to calculate information summarizing the
 |  results of an alignment. This may either be straight consensus info
 |  or more complicated things.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, alignment)
 |      Initialize with the alignment to calculate information on.
 |      
 |      ic_vector attribute. A list of ic content for each column number.
 |  
 |  dumb_consensus(self, threshold=0.7, ambiguous='X', require_multiple=False)
 |      Output a fast consensus sequence of the alignment.
 |      
 |      This doesn't do anything fancy at all. It will just go through the
 |      sequence residue by residue and count up the number of each type
 |      of residue (ie. A or G or T or C for DNA) in all sequences in the
 |      alignment. If the percentage of the most common 

In [146]:
print("The simple consensus:", align_info.dumb_consensus())
print("Alignment score for particular bases at a given position.")
print("Dumb consensus on the y axis up-to-down, bases on the x-axis.")
print(align_info.pos_specific_score_matrix())

The simple consensus: ACGTACGX
Alignment score for particular bases at a given position.
Dumb consensus on the y axis up-to-down, bases on the x-axis.
    A   C   G   T
A  6.0 2.0 0.0 0.0
C  0.0 5.0 1.0 1.0
G  0.0 0.0 8.0 0.0
T  0.0 0.0 0.0 7.0
A  7.0 0.0 1.0 0.0
C  0.0 6.0 0.0 0.0
G  0.0 1.0 5.0 0.0
X  0.0 2.0 1.0 3.0
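
The consensus above can be reproduced by counting letters column by column. The sketch below follows the behaviour described in the `dumb_consensus` help text (most common residue per column, accepted only if it reaches the threshold), but it is an independent illustration rather than the library's code. `simple_consensus` is a hypothetical name, and skipping gap characters and treating ties as ambiguous are assumptions.

```python
from collections import Counter


def simple_consensus(aligned_rows, threshold=0.7, ambiguous="X"):
    """Per-column majority letter, or `ambiguous` if below the threshold."""
    consensus = []
    for column in zip(*aligned_rows):
        counts = Counter(ch for ch in column if ch not in "-.")  # ignore gaps
        if not counts:
            consensus.append(ambiguous)  # assumption: all-gap column
            continue
        letter, top = counts.most_common(1)[0]
        total = sum(counts.values())
        tied = sum(1 for c in counts.values() if c == top) > 1
        consensus.append(letter if top / total >= threshold and not tied else ambiguous)
    return "".join(consensus)


rows = [
    "ACGTACGT", "ACGTGCGC", "ACGTA--T", "CCGTACGG",
    "A-GTACCC", "ACGTA--T", "CTG-ACG-", "AGGTACG-",
]
print(simple_consensus(rows))  # ACGTACGX
```

In the last column `T` appears in only 3 of the 6 non-gap positions, which is below the 0.7 threshold, hence the `X`.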



## Downloading and reading FASTA files

FASTA is a simple, plain-text format for representing sequences, and it supports multiple records in the same file. An example:

>\> MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

We can see that, in addition to the raw amino acid sequence, there is some metadata denoting where the sequence came from and what it represents.
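
The structure of the format is simple enough to read by hand: a line starting with `>` opens a record, and the lines that follow hold its sequence. The sketch below is a deliberately minimal reader with a hypothetical name (`parse_fasta`) and made-up toy records. It is meant only to illustrate the layout; a proper parser should be preferred for real files.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


example = """>seq1 first toy record
ACGT
ACGT
>seq2 second toy record
TTGA
"""
toy_records = list(parse_fasta(example))
for description, sequence in toy_records:
    print(description, "->", sequence)
```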

For our example, we will be using the [CFTR gene](https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000001626;r=7:117287120-117715971), which encodes [a protein](https://www.uniprot.org/uniprot/P13569).

First, let's download the relevant FASTA file.

In [147]:
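import os
from urllib.request import urlretrieve

CFTR_FASTA_path = "data/ENSG00000001626.fasta"
# Make sure the target directory exists before downloading into it.
os.makedirs(os.path.dirname(CFTR_FASTA_path), exist_ok=True)
result_location, http_response = urlretrieve("https://rest.ensembl.org/sequence/id/ENSG00000001626.fasta", CFTR_FASTA_path)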
print("Downloaded file to: ", result_location)

print("\nDownload metadata and statistics:")
print(http_response)

Downloaded file to:  data/ENSG00000001626.fasta

Download metadata and statistics:
Vary: Content-Type
Vary: Origin
Content-Type: text/x-fasta; charset=UTF-8
Date: Tue, 23 Nov 2021 18:45:51 GMT
X-RateLimit-Limit: 55000
X-RateLimit-Reset: 849
X-Runtime: 1.262950
Connection: close
X-RateLimit-Period: 3600
X-RateLimit-Remaining: 54998
Content-Length: 436062




The message above may look complicated: it lists the HTTP response headers, which describe the transfer and the type of the file.

To open the file and read the relevant information from it, we use the `SeqIO` module.

In [148]:
# Reading FASTA files. Note that each file can contain multiple records.
from Bio import SeqIO
records = list(SeqIO.parse(CFTR_FASTA_path, "fasta"))
print("Total records extracted:", len(records))
print("The first record:       ", records[0])
print("Type of records:        ", type(records[0]))
help(records[0])

Total records extracted: 1
The first record:        ID: ENSG00000001626.16
Name: ENSG00000001626.16
Description: ENSG00000001626.16 chromosome:GRCh38:7:117287120:117715971:1
Number of features: 0
Seq('AGGCGGATCACAAGTTCATGAGATCGAGACCATCTTGGCCAACATGGTGAGACC...ACA')
Type of records:         <class 'Bio.SeqRecord.SeqRecord'>
Help on SeqRecord in module Bio.SeqRecord object:

class SeqRecord(builtins.object)
 |  SeqRecord(seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None, features=None, annotations=None, letter_annotations=None)
 |  
 |  A SeqRecord object holds a sequence and information about it.
 |  
 |  Main attributes:
 |   - id          - Identifier such as a locus tag (string)
 |   - seq         - The sequence itself (Seq object or similar)
 |  
 |  Additional attributes:
 |   - name        - Sequence name, e.g. gene name (string)
 |   - description - Additional text (string)
 |   - dbxrefs     - List of database cross references (list o

We can see that `SeqRecord`, the class used to represent items read from FASTA files, is relatively big and contains placeholders for a lot of possible information, including references, a description, annotations etc.

To get the actual `Seq` object, we access the `seq` attribute.

In [149]:
record = records[0]
print("Sequence length:", len(record.seq))
print("GC content:", GC(record.seq))

Sequence length: 428852
GC content: 36.895245912342716


## Discussion

In this notebook, we have discussed the basic ways in which sequences, the bread and butter of bioinformatics, can be constructed and operated on, including transcription and translation. We have seen how **Biopython** supports sequence alignment - both pairwise and MSA. We have looked at how to obtain and open FASTA files to retrieve the records with their associated metadata.

This notebook only touches on the tools and integrations provided with the **Biopython** library, and on the plethora of formats and tools that can be used to store and process sequence and alignment data.

This notebook relied on very simple toy sequences. In practice, this is unrealistic. Try going through this notebook using longer, more representative sequences and experimenting with the code.

The exercise section that accompanies this notebook will allow you to use the code here in more practical scenarios.