### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 05 - Introduction to Biopython - Sequences

*Written by:* Mateusz Kaczyński

**This notebook provides a general introduction to the Biopython library and an in-depth look at working with sequences, the cornerstone of bioinformatics.**


### What is Biopython?

**Biopython** is a popular open-source toolbox for computational biology and bioinformatics. It contains tools and connectors for various resources and provides functions for running common bioinformatics tasks and parsing common bioinformatics data formats.


### Why Biopython?
**Biopython** is a de facto standard for accessing many databases and tools, which makes it easier to share your work and results. It simplifies the use of libraries written in other languages and technologies, as well as those hosted online. While it is possible to use those resources directly, Biopython simplifies the process and provides the components you need.
 


## Contents


1. [Sequence basics](#Sequence-basics)
2. [Transcription and translation](#Transcription-and-translation)
3. [Alignment](#Alignment)
4. [Downloading and reading FASTA files](#Downloading-and-reading-FASTA-files)
5. [Discussion](#Discussion)
-----

#### Extra Resources:

- [Official Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) - A comprehensive guide to the library capabilities.
- [Biopython API documentation](https://biopython.org/docs/latest/api/index.html) - A long, detailed list of all methods and connectors provided by Biopython.
- [Rosalind](http://rosalind.info) - A bioinformatics learning platform that includes exercises.




#### Importing Biopython
Once installed, **Biopython** can be included in a project by importing the `Bio` module. Note that the import name (`Bio`) differs from the library name, which is not uncommon for Python packages.

In [None]:
import Bio
print("Module", Bio.__name__, "version", Bio.__version__)

## Sequence basics
Managing sequences of DNA, RNA or peptides is a core part of any bioinformatics workflow. Being able to understand, transform and use them is one of the most beneficial uses of computers in biology.

Here is how we define a simple sequence.


In [None]:
from Bio.Seq import Seq
dna_sequence = Seq("ACTG")

You can see that the sequence can be created from a simple `string` structure representing the ordered items (in this case - DNA). However, it is no longer a simple data type.

Let's take a closer look at the Python representation.

In [None]:
print("The sequence compared with the original string:", dna_sequence == "ACTG")
print("The type of dna_sequence:", type(dna_sequence))

### Objects and classes
We can see that, while we can compare it with the original `string` object without an issue, the type of the variable is `Seq`.

`dna_sequence` is an `object` of `class` `Seq`. 
    
Let's unfold those concepts:
  - `object` - a fundamental building block of a program - it can contain properties and functions. In Python almost anything is an `object`.
  - `class` - a blueprint for creating new objects, provides a way to mix data and functions
  - `Seq` - a particular `class` defined in `Bio.Seq` module.
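
To make the distinction between a `class` and an `object` more concrete, here is a minimal sketch. The `ToySequence` class is hypothetical and far simpler than Biopython's `Seq`; it only illustrates how a class bundles data together with functions.

```python
class ToySequence:
    """A hypothetical, minimal sequence class (not part of Biopython)."""

    def __init__(self, data):
        # Data stored on each object created from this class.
        self.data = data

    def count_base(self, base):
        # A function (method) that works on the object's own data.
        return self.data.count(base)


# Two independent objects created from the same blueprint.
first = ToySequence("ACTG")
second = ToySequence("AAAA")

print("Class of first:", type(first).__name__)
print("Adenines in first and second:", first.count_base("A"), second.count_base("A"))
```

Both objects share the same methods, but each keeps its own data.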


When in doubt, we can use the `help` function to learn how to use `Seq` objects.

In [None]:
help(dna_sequence)

### String-like behaviour
In the first workshop, we became familiar with how Python `string` objects, which represent simple text, behave.

The `Seq` class is built to provide the same methods, which makes it easier to exchange information between the two types and presents a familiar interface to the user.

Let's verify this with some common methods, including indexing and slicing.

In [None]:
print("The length of the sequence:", len(dna_sequence))
print("AC found in the sequence:  ", "AC" in dna_sequence)
print("Sequence ends with 'TG':   ", dna_sequence.endswith("TG"))
print("The first base:            ", dna_sequence[0])
print("The last base:             ", dna_sequence[-1])
print("Slicing - first two bases: ", dna_sequence[0:2])
print("How many Adenines are in the sequence:", dna_sequence.count("A"))

What about case sensitivity? Does it matter if we represent Adenine with `a` or `A`? 

It turns out that it matters a lot. Since the `Alphabet` concept was removed from **Biopython**, there is no easy way to verify or enforce the content of `Seq` objects. Users are required to perform their own verification.

Let's have a closer look.

In [None]:
dna_sequence_mixedcase = Seq("acTG")
print("Upper case sequence equal to mixed case sequence:       ",
      dna_sequence == dna_sequence_mixedcase)
print("How about if we change the original to lower case?:     ", 
      dna_sequence.lower() == dna_sequence_mixedcase)
print("How about moving both to the same case? Are they equal?:", 
      dna_sequence.lower() == dna_sequence_mixedcase.lower())
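
Because the content of a sequence is not checked for us, a small helper can perform the verification. The following is a minimal sketch in plain Python; the name `is_valid_dna` and the choice of allowed characters are assumptions made for illustration, not part of Biopython.

```python
DNA_BASES = set("ACGT")


def is_valid_dna(sequence, allowed=DNA_BASES):
    """Return True if every character, ignoring case, is an allowed base."""
    # str() lets the helper accept plain strings as well as string-like objects.
    return set(str(sequence).upper()) <= allowed


print("acTG is valid DNA:", is_valid_dna("acTG"))
print("ACXG is valid DNA:", is_valid_dna("ACXG"))
```

Note that an empty sequence passes this check, and ambiguity codes such as `N` are rejected unless they are added to `allowed`.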

To extend the sequence, we can use the familiar `string` syntax. 

Again, it does not matter whether you use a `string` or a `Seq` object.

In [None]:
print("Extended with another sequence:", dna_sequence + Seq("TAA"))
print("Extended with a string        :", dna_sequence + "TGG")

### Example sequence-specific functions 
In addition to `string`-like behaviour, `Seq` objects contain methods applicable in their specific domain. 

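To combine multiple `string`s (e.g. from a `list`), we can use the `join` method. It binds the items together, using the string it is called on as a separator. Take a look at how it is used to generate a longer sequence with gaps.

Where a sequence contains gaps, they can be removed by replacing the gap character with an empty string. (Older **Biopython** releases offered an `ungap` method for this, which has since been deprecated.)

In [None]:
sequence_with_gaps = Seq("-".join(3*[str(dna_sequence)]))
print("Sequence with gaps   :", sequence_with_gaps)
# Remove gaps via plain string replacement, which does not depend on the Biopython version.
print("Sequence without gaps:", Seq(str(sequence_with_gaps).replace("-", "")))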

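**Biopython** contains various tools for sequence analysis. For example, a GC content calculation can be imported as shown below. Note that newer **Biopython** releases replace `GC` with `gc_fraction`, which returns a fraction between 0 and 1 rather than a percentage.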

In [None]:
from Bio.SeqUtils import GC
print("The guanine-cytosine content of the sequence (%):", GC(dna_sequence))
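
To see what a GC content calculation involves, here is a minimal plain-Python sketch. The function name `gc_percent` is hypothetical, and the sketch ignores ambiguity codes, so it is not a drop-in replacement for the library function.

```python
def gc_percent(sequence):
    """Percentage of G and C characters in a sequence (case-insensitive)."""
    text = str(sequence).upper()
    if not text:
        # Avoid dividing by zero for an empty sequence.
        return 0.0
    gc_count = text.count("G") + text.count("C")
    return 100 * gc_count / len(text)


print("GC content of ACTG (%):", gc_percent("ACTG"))  # 2 of 4 bases -> 50.0
```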

## Transcription and translation
One of the most powerful features of **Biopython** is converting between sequence types, switching from one representation to another.

Let's take a look at how to generate complementary and reverse sequences.

In [None]:
print("Sequence:          ", dna_sequence)
print("Complement:        ", dna_sequence.complement())
print("Reverse:           ", dna_sequence[::-1])
print("Reverse complement:", dna_sequence.reverse_complement())
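
The same operations can be sketched in plain Python, which may help clarify what they do. The helper names below are illustrative, and only the four unambiguous DNA bases are handled.

```python
# Translation table mapping each base to its complement, for both cases.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(sequence):
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return str(sequence).translate(COMPLEMENT_TABLE)


def reverse_complement(sequence):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(sequence)[::-1]


print("Complement:        ", complement("ACTG"))          # TGAC
print("Reverse complement:", reverse_complement("ACTG"))  # CAGT
```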

Simple DNA-to-RNA transcription can be achieved with the following methods.

In [None]:
print("Sequence:              ", dna_sequence)
print("Transcribed:           ", dna_sequence.transcribe())
print("Reverse complement RNA:", dna_sequence.reverse_complement_rna())

The reverse process, getting from RNA back to DNA, is also supported.

In [None]:
transcribed = dna_sequence.transcribe()
print("Transcribed:     ", transcribed)
print("Back transcribed:", transcribed.back_transcribe())

**Biopython** makes translation from RNA (or directly from DNA) to an amino acid sequence very easy.

*Note that different codon tables are also supported.*

In [None]:
sequence = Seq("ACGCGACGA")
sequence.translate()
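
As a rough illustration of what translation does, the sketch below builds the standard genetic code as a dictionary and reads a sequence three bases at a time. The names `CODON_TABLE` and `translate_dna` are hypothetical; the sketch assumes the standard codon table, drops any incomplete trailing codon, and raises a `KeyError` on characters other than A, C, G, T or U.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(sequence):
    """Translate DNA (or RNA) into a one-letter amino acid string; '*' marks a stop."""
    text = str(sequence).upper().replace("U", "T")
    usable_length = len(text) - len(text) % 3  # ignore an incomplete last codon
    codons = (text[i:i + 3] for i in range(0, usable_length, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


print(translate_dna("ACGCGACGA"))  # ACG CGA CGA -> TRR
```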

#### Using `Seq` methods directly on `string`s. 
It is possible to bypass the `Seq` class and import and run the relevant functions directly on simple `string`s.

In [None]:
from Bio.Seq import transcribe
print("Transcription directly from 'ACGT' string, without using Seq object:", transcribe("ACGT"))

## Alignment

One of the most common uses of bioinformatics tools is sequence alignment. 
**Biopython** provides support for various command-line and online alignment tools. The details can be found [here](https://biopython.org/docs/latest/api/Bio.Align.Applications.html). 

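For this section of the workshop, we will use the built-in alignment method contained in the `pairwise2` module, which aligns 2 sequences at a time. (Recent **Biopython** releases deprecate `pairwise2` in favour of `Bio.Align.PairwiseAligner`; the code below assumes a version where `pairwise2` is still available.)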

We will first import it, run a global (full sequence) alignment and present relevant results.

In [None]:
from Bio import pairwise2

# Notice we can supply either Seq's or strings directly.
global_alignments = pairwise2.align.globalxx(Seq("ACGT"), "ACGC")
for ga in global_alignments:
    print(ga)

`score` represents the quality of the alignment. With `globalxx`, it equals the number of positions that matched exactly: a match scores 1, while mismatches and gaps carry no penalty. The match score and the mismatch and gap penalties can be adjusted when necessary.
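
To illustrate how match, mismatch and gap values shape the score, here is a minimal dynamic-programming sketch of global alignment scoring. It is not the `pairwise2` implementation: the function name and parameters are assumptions, only a simple per-position gap penalty is supported, and only the score (not the alignment itself) is returned.

```python
def global_alignment_score(a, b, match=1, mismatch=0, gap=0):
    """Best global alignment score of a and b under a simple scoring scheme."""
    a, b = str(a), str(b)
    # previous[j] holds the best score for aligning the processed prefix of a with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            pair_score = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair_score,  # align a[i-1] with b[j-1]
                previous[j] + gap,             # gap in b
                current[j - 1] + gap,          # gap in a
            ))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "ACGC"))                       # 3
print(global_alignment_score("ACGT", "ACGC", mismatch=-1, gap=-2))  # 2
```

With the default values, the result is the number of matching positions. Once mismatches and gaps are penalised, the final `T`/`C` mismatch lowers the score.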

This representation is not really user friendly. 

Let's try to make it look better.

In [None]:
# Print the first available alignment. 
# The * unpacks the Alignment object into the separate arguments required by the format_alignment function.
print(pairwise2.format_alignment(*global_alignments[0]))
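
The `*` syntax is plain Python rather than anything specific to **Biopython**. The sketch below shows it with a hypothetical `ToyAlignment` tuple and a `describe` function, both invented for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for an alignment result; the field names are illustrative.
ToyAlignment = namedtuple("ToyAlignment", ["seq_a", "seq_b", "score"])


def describe(seq_a, seq_b, score):
    return f"{seq_a} vs {seq_b}: score {score}"


result = ToyAlignment("ACGT", "ACGC", 3.0)

# The two calls below are equivalent; * spreads the tuple over the parameters.
print(describe(*result))
print(describe(result[0], result[1], result[2]))
```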

Similar to the global alignment, we can use `pairwise2` to perform local alignment of 2 sequences.

In [None]:
local_alignments = pairwise2.align.localxx(Seq("ACGT"), "ACGC")
print(pairwise2.format_alignment(*local_alignments[0]))

A multiple sequence alignment (MSA) covers 3 or more sequences at a time.

`MultipleSeqAlignment` requires a list of `SeqRecord`s. This class, in addition to the raw `Seq`, allows for storing some metadata: name, description, annotations etc.

*Note: `MultipleSeqAlignment` stores sequences that are already aligned rather than computing the alignment itself, so all sequences are expected to be of the same length (gaps are marked with `-`).*

In [None]:
from Bio.Align import MultipleSeqAlignment, AlignInfo
from Bio.SeqRecord import SeqRecord

seqs = [
    "ACGTACGT",
    "ACGTGCGC",
    "ACGTA--T",
    "CCGTACGG",
    "A-GTACCC",
    "ACGTA--T",
    "CTG-ACG-",
    "AGGTACG-"
]

seq_records = []
for s in seqs:
    seq_records.append(SeqRecord(Seq(s)))
                       
aligned = MultipleSeqAlignment(seq_records)

An alignment is of limited value on its own.

We may want to learn what statistics can be retrieved from it.

In [None]:
align_info = AlignInfo.SummaryInfo(aligned)
help(align_info)

In [None]:
print("The simple consensus:", align_info.dumb_consensus())
print("Alignment score for particular bases at a given position.")
print("Dumb consensus on the y axis up-to-down, bases on the x-axis.")
print(align_info.pos_specific_score_matrix())
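
To make the idea of per-position statistics more tangible, here is a plain-Python sketch that counts the characters in each alignment column and picks the most common non-gap one. It is a simplification and not the algorithm behind `dumb_consensus`; the function names are hypothetical, and ties are resolved in favour of the character seen first.

```python
from collections import Counter

aligned_rows = [
    "ACGTACGT",
    "ACGTGCGC",
    "ACGTA--T",
    "CCGTACGG",
    "A-GTACCC",
    "ACGTA--T",
    "CTG-ACG-",
    "AGGTACG-",
]


def column_counts(rows):
    """Count the characters found in each column of the alignment."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("All aligned sequences must have the same length")
    # zip(*rows) turns the list of rows into a sequence of columns.
    return [Counter(column) for column in zip(*rows)]


def majority_consensus(rows, gap="-"):
    """Most common non-gap character per column (gap if the column has only gaps)."""
    consensus = []
    for counts in column_counts(rows):
        residues = {base: n for base, n in counts.items() if base != gap}
        consensus.append(max(residues, key=residues.get) if residues else gap)
    return "".join(consensus)


print("Counts in the first column:", dict(column_counts(aligned_rows)[0]))
print("Majority consensus:        ", majority_consensus(aligned_rows))
```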

## Downloading and reading FASTA files

FASTA is a simple, plain-text format for representing sequences, which supports multiple records in the same file. An example:

>\> MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

We can see that, in addition to the raw amino acid sequence, there is some metadata denoting where the sequence came from and what it represents.
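
The structure of the format can be sketched with a few lines of plain Python. The `parse_fasta` function and the toy records below are made up for illustration; real files are better handled with a dedicated parser, which covers more edge cases.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may be split over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """>seq1 first toy record
ACGT
ACGT
>seq2 second toy record
TTGA
"""

for header, sequence in parse_fasta(example):
    print(header, "| length:", len(sequence))
```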

For our example, we will be using [CFTR gene](https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000001626;r=7:117287120-117715971), encoding [a protein](https://www.uniprot.org/uniprot/P13569).

First, let's download the relevant FASTA file.

In [None]:
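import os
from urllib.request import urlretrieve

CFTR_FASTA_path = "data/ENSG00000001626.fasta"
# Make sure the target directory exists; urlretrieve does not create it.
os.makedirs(os.path.dirname(CFTR_FASTA_path), exist_ok=True)
result_location, http_response = urlretrieve("https://rest.ensembl.org/sequence/id/ENSG00000001626.fasta", CFTR_FASTA_path)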
print("Downloaded file to: ", result_location)

print("\nDownload metadata and statistics:")
print(http_response)

The above message may look complicated - it contains a lot of information related to the transfer and the type of the file. 

To open the file and read relevant information from it, we use the `SeqIO` module.

In [None]:
# Reading FASTA files. Note that each file can contain multiple records.
from Bio import SeqIO

records = []
for r in SeqIO.parse(CFTR_FASTA_path, "fasta"):
    records.append(r)

print("Total records extracted:", len(records))
print("The first record:       ", records[0])
print("Type of records:        ", type(records[0]))
help(records[0])

To get the actual `Seq` object of a `SeqRecord`, we access its `seq` property.

In [None]:
record = records[0]
print("Sequence length:", len(record.seq))
print("GC content:", GC(record.seq))

## Discussion

In this notebook we have discussed the basic ways in which sequences, the bread and butter of bioinformatics, can be constructed and operated on, including transcription and translation. We have seen how **Biopython** supports sequence alignment, both pairwise and MSA. We have looked at how to obtain and open FASTA files to retrieve the records with associated metadata.

This notebook only touches on the tools and integrations provided with the **Biopython** library, and on the plethora of formats and tools that can be used to store and process sequence and alignment data.

This notebook relied on very simple toy sequences. In practice, this is unrealistic. Try going through this notebook using longer, more representative sequences and experimenting with the code.

The exercise section that accompanies this notebook will allow you to use the code here in more practical scenarios.

Click [here](#Contents) to go back to the contents.