### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 05 - Handling sequences with Biopython

*Written by:* Mateusz Kaczyński

**This notebook introduces the Biopython library, focusing on in-depth sequence handling—the cornerstone of bioinformatics.**


### What is Biopython?

**Biopython** is a widely used, open-source toolkit for computational biology and bioinformatics. It includes tools, resource connectors, and functions for executing standard bioinformatics tasks and parsing common data formats.

*Parsing is just a fancier word for interpreting or analyzing data to extract meaningful information.*

### Why Biopython?

**Biopython** is the de facto standard for accessing numerous bioinformatics databases and tools, making it easier to share your work and results. It simplifies interaction with libraries written in other languages and technologies, as well as those hosted online. While it is possible to use these resources directly, Biopython streamlines the process and provides the essential components you need.

-----

## Contents

1. [Sequence basics](#Sequence-basics)
2. [Transcription and translation](#Transcription-and-translation)
3. [Alignment](#Alignment)
4. [Downloading and reading FASTA files](#Downloading-and-reading-FASTA-files)
5. [Folding](#Folding)
6. [Discussion](#Discussion)

-----

### Extra resources:

- [Official Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) - A comprehensive guide to the capabilities of the library.
- [Biopython API documentation](https://biopython.org/docs/latest/api/index.html) - A long, detailed list of all methods and connectors provided by Biopython.
- [Rosalind](http://rosalind.info) - A bioinformatics learning platform that includes exercises.

-----

### Installing Biopython
To use the library locally, we first need to install it. Most Python installations come with the `pip` tool, which can be run directly from the notebook. It downloads and installs the relevant packages from the [Python Package Index (PyPI)](https://pypi.org).

In [None]:
!pip install Biopython

### Importing Biopython
When installed, **Biopython** can be included in the project by importing it as the `Bio` module. 

It is not uncommon for a library's name to differ from the name used to import it. In this case, **Biopython** is imported as `Bio`.

In [None]:
import Bio
print("Module", Bio.__name__, "version", Bio.__version__)

-----
## Sequence basics
Managing DNA, RNA, or peptide sequences is central to any bioinformatics workflow. The ability to understand, analyse, and transform these sequences is one of the most fundamental applications of computing in biology.

Here’s how to define a basic sequence.

In [None]:
from Bio.Seq import Seq
dna_sequence = Seq("ACTG")
dna_sequence

`Seq` is a submodule within `Bio`, specifically designed to handle biological sequences. The `Seq` class represents a sequence of nucleotides or amino acids and provides various methods for analysing and manipulating these sequences.

You can see that a sequence can be created from a basic string that represents the ordered elements, in this case a DNA sequence. However, the result is no longer a simple string.

Let’s examine this more closely with Python’s representation.

In [None]:
print("The sequence compared with the original string:", dna_sequence == "ACTG")
print("The type of dna_sequence:", type(dna_sequence))

The first line checks if `dna_sequence` is equivalent to the original string `"ACTG"`, demonstrating that the content is the same even though the data type is different.

The second line displays the type of `dna_sequence`, which will show it is a `Bio.Seq.Seq` object rather than a standard Python string, highlighting the enhanced functionality provided by the `Seq` class.

#### Objects and classes
We can see that, while `dna_sequence` can be compared with the original string without any issues, its type is actually `Seq`. This means that `dna_sequence` is an object of the `Seq` class. 
    
Let's unpack these key concepts:
  - **Object** - A fundamental building block in programming that can hold data (properties) and functions (methods). In Python, almost everything is an object, from numbers to complex data types.
  - **Class** - A blueprint or template for creating new objects. A class allows us to group data and related functions together, defining how each instance (object) should behave.
  - `Seq` - A particular class defined in the `Bio.Seq` module within Biopython, designed to handle biological sequences.
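
To make the object/class distinction concrete, here is a minimal sketch of a toy class. `DnaFragment` is a hypothetical name invented for illustration; it is not part of Biopython and is far simpler than `Seq`.

```python
class DnaFragment:
    """A toy class (hypothetical, not part of Biopython) bundling data with behaviour."""

    def __init__(self, bases):
        # Data stored on each object (a property)
        self.bases = bases.upper()

    def count_base(self, base):
        # Behaviour shared by every object of the class (a method)
        return self.bases.count(base.upper())


# Create an object (instance) from the class blueprint
fragment = DnaFragment("actg")

print("Class of the object:", type(fragment).__name__)
print("Stored bases:       ", fragment.bases)
print("Number of adenines: ", fragment.count_base("a"))
```

`Seq` follows the same pattern: the sequence text is the data, and methods such as `.count()` are the behaviour attached to it.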

When in doubt, we can use the `help()` function to understand how to use `Seq`. This will display detailed information about the class, including available methods, usage examples, and descriptions of its functionalities.

In [None]:
help(dna_sequence)

#### String-like behaviour

In the first session, we explored how Python strings (simple objects representing text) behave. 

The `Seq` class is designed to mimic these string methods, making it easier to interchange information between strings and sequences while providing a familiar interface for the user.

Let’s verify this by using some common string methods, including indexing and slicing.

In [None]:
print("Length of the sequence:   ", len(dna_sequence))
print("AC found in the sequence?:", "AC" in dna_sequence)
print("Sequence ends with 'TG'?: ", dna_sequence.endswith("TG"))
print("The first base:           ", dna_sequence[0])
print("The last base:            ", dna_sequence[-1])
print("First two bases:          ", dna_sequence[0:2])
print("Adenines in the sequence: ", dna_sequence.count("A"))

Does case sensitivity matter when representing bases in a sequence, like `"A"` versus `"a"` for adenine?

Yes, it does. The `Seq` class is case-sensitive, meaning that uppercase and lowercase letters are treated as different characters. It is best to convert all bases to either uppercase or lowercase before performing comparisons or analyses. `Seq` objects provide methods like `.upper()` and `.lower()` to make this easy.

In [None]:
dna_sequence_mixedcase = Seq("acTG")

print("Is the uppercase sequence equal to the mixed case sequence?   ",
      dna_sequence == dna_sequence_mixedcase)

print("What if we change the original sequence to lowercase?:        ",
      dna_sequence.lower() == dna_sequence_mixedcase)

print("What if we make both sequences the same case? Are they equal?:",
      dna_sequence.lower() == dna_sequence_mixedcase.lower())

To extend the sequence, we can use the familiar string syntax as `Seq` objects in Biopython behave similarly to Python strings in this context.

In [None]:
print("Extended with another sequence:", dna_sequence + Seq("TAA"))
print("Extended with a string        :", dna_sequence + "TGG")

For the cases where there are gaps, we can remove them using the `.replace()` method.

In [None]:
sequence_with_gaps = Seq("ACGT-ACGT-ACGT")
print("Sequence with gaps   :", sequence_with_gaps)
print("Sequence without gaps:", sequence_with_gaps.replace("-", ""))

#### Sequence analysis 

**Biopython** offers a variety of tools for sequence analysis. For example, GC content calculation, a common metric in genomics, can be accessed directly from Biopython.
[The online documentation](https://biopython.org/docs/dev/api/Bio.SeqUtils.html) provides more details.

In [None]:
from Bio.SeqUtils import gc_fraction
print("The guanine-cytosine content of the sequence:", gc_fraction(dna_sequence)*100, "%")
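
To see what a GC-content calculation involves, the following plain-Python sketch counts G and C bases by hand. The function name `gc_fraction_sketch` is made up for this example, and the sketch deliberately ignores gaps and ambiguity codes, which the library version treats with more care.

```python
def gc_fraction_sketch(sequence):
    """Return the fraction of G and C bases in a sequence (0.0 if it is empty)."""
    sequence = str(sequence).upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


print("GC content of ACTG:", gc_fraction_sketch("ACTG") * 100, "%")
```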

## Transcription and translation
In addition to their string-like behaviour, `Seq` objects have methods specific to bioinformatics tasks.

One of the most powerful features of **Biopython** is its ability to switch between representations, allowing seamless conversion between different sequence types.

Let’s explore how to generate complementary and reverse sequences.

In [None]:
print("Sequence:          ", dna_sequence)
print("Complement:        ", dna_sequence.complement())
print("Reverse:           ", dna_sequence[::-1])
print("Reverse complement:", dna_sequence.reverse_complement())
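
The logic behind these methods can be sketched with plain strings. This is an illustration of the idea rather than Biopython's implementation; the helper names are invented here, and only the four unambiguous DNA bases are handled.

```python
# Map each DNA base to its Watson-Crick partner (upper and lower case)
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement_sketch(dna):
    """Swap every base for its complementary base."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement_sketch(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement_sketch(dna)[::-1]


print("Complement:        ", complement_sketch("ACTG"))
print("Reverse complement:", reverse_complement_sketch("ACTG"))
```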

Simple DNA-to-RNA transcription can be achieved with the following methods.

In [None]:
print("Sequence:              ", dna_sequence)
print("Transcribed:           ", dna_sequence.transcribe())
print("Reverse complement RNA:", dna_sequence.reverse_complement_rna())

The reverse process, going from RNA back to DNA, is also supported.

In [None]:
transcribed = dna_sequence.transcribe()
print("Transcribed     :", transcribed)
print("Back transcribed:", transcribed.back_transcribe())
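
Because the input is treated as the coding strand, transcription here amounts to replacing thymine with uracil, and back-transcription does the opposite. A minimal plain-string sketch of that idea (the function names are assumptions made for this example):

```python
def transcribe_sketch(dna):
    """Turn a coding-strand DNA string into RNA by replacing T with U."""
    return dna.replace("T", "U").replace("t", "u")


def back_transcribe_sketch(rna):
    """Turn an RNA string back into DNA by replacing U with T."""
    return rna.replace("U", "T").replace("u", "t")


rna = transcribe_sketch("ACTG")
print("Transcribed     :", rna)
print("Back transcribed:", back_transcribe_sketch(rna))
```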

**Biopython** also simplifies translating an RNA sequence (or a DNA sequence directly) into an amino acid sequence.

*Note: Different codon tables are also supported, making it flexible for various organisms and use cases.*

In [None]:
sequence = Seq("ACGCGACGA")
sequence.translate()
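
To illustrate what a codon table is, the sketch below builds the standard genetic code as a Python dictionary and derives the vertebrate mitochondrial table from it by changing four codons. It is a simplified stand-in written for this notebook, not Biopython's code; the names are invented, and ambiguous or gapped codons are not handled (they raise a `KeyError`).

```python
from itertools import product

BASES = "TCAG"

# Amino acids of the standard genetic code, in the codon order TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Pair each of the 64 codons with its amino acid ("*" marks a stop codon)
STANDARD_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), STANDARD_AMINO_ACIDS)
}

# The vertebrate mitochondrial code reassigns four codons
VERTEBRATE_MITO_TABLE = {**STANDARD_TABLE, "TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"}


def translate_sketch(sequence, table=STANDARD_TABLE):
    """Translate DNA or RNA codon by codon, ignoring any incomplete final codon."""
    sequence = sequence.upper().replace("U", "T")
    usable_length = len(sequence) - len(sequence) % 3
    codons = (sequence[i:i + 3] for i in range(0, usable_length, 3))
    return "".join(table[codon] for codon in codons)


print("Standard table:     ", translate_sketch("ACGCGACGA"))
print("TGA, standard:      ", translate_sketch("TGA"))
print("TGA, mitochondrial: ", translate_sketch("TGA", VERTEBRATE_MITO_TABLE))
```

The same codon can therefore produce a different amino acid, or a stop, depending on which table applies to the organism or organelle.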

#### Using `Seq` methods directly on strings. 
It’s possible to bypass the `Seq` class abstraction and directly import and use relevant methods on simple strings.

In [None]:
from Bio.Seq import transcribe
print('Transcription directly from "ACGT" string, without using a Seq object:', transcribe("ACGT"))

## Alignment

One of the most common uses of bioinformatics tools is sequence alignment.

We will use Biopython’s built-in, out-of-the-box [pairwise aligner](https://biopython.org/docs/latest/api/Bio.Align#Bio.Align.PairwiseAligner), which allows us to align two sequences at a time.

We’ll start by importing the `PairwiseAligner`, running a full sequence alignment, and displaying the relevant results.

In [None]:
from Bio.Align import PairwiseAligner

# Initialise the `PairwiseAligner`
aligner = PairwiseAligner()

# Notice we can provide either `Seq` objects or simple strings as input
alignments = aligner.align(Seq("ACGT"), "ACGC")

# Display alignment results
for alignment in alignments:
    print("Score:", alignment.score)
    print(alignment)

The `score` attribute represents the quality of the alignment. By default, it indicates the number of elements that matched exactly. However, both the match score and the penalty for gaps can be adjusted as needed to customize the alignment criteria.
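
To show how match, mismatch, and gap values shape the score, here is a small plain-Python sketch of global (Needleman-Wunsch) scoring with a linear gap score. It is not how `PairwiseAligner` is implemented, and the default values below are assumptions chosen to reproduce the "count the matches" behaviour described above; the library's own defaults may differ between versions.

```python
def global_alignment_score(seq_a, seq_b, match=1.0, mismatch=0.0, gap=0.0):
    """Return the best global alignment score using dynamic programming."""
    # Scores for aligning an empty prefix of seq_a against each prefix of seq_b
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, base_a in enumerate(seq_a, start=1):
        current = [i * gap]
        for j, base_b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if base_a == base_b else mismatch)
            gap_in_b = previous[j] + gap
            gap_in_a = current[j - 1] + gap
            current.append(max(diagonal, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]


print("Counting matches only:", global_alignment_score("ACGT", "ACGC"))
print("Penalising mismatches and gaps:",
      global_alignment_score("ACGT", "ACGC", match=1, mismatch=-1, gap=-2))
```

With penalties in place, the single mismatch lowers the score, and alignments that introduce gaps become less attractive than the ungapped one.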

**Multiple Sequence Alignment (MSA)** is used when aligning three or more sequences simultaneously.

For MSA, a list of `SeqRecord` objects is required. The `SeqRecord` class, unlike the basic `Seq` class, allows for additional metadata storage, such as:
- **Name**: A short identifier for the sequence.
- **Description**: A longer explanation or label.
- **Annotations**: Key-value pairs for storing various details.

*Note: For MSA, all sequences are expected to be of the same length.*

In [None]:
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord

# Define a list of sequences as strings
seqs = [
    "ACGTACGT",
    "ACGTGCGC",
    "ACGTA--T",
    "CCGTACGG",
    "A-GTACCC",
    "ACGTA--T",
    "CTG-ACG-",
    "AGGTACG-"
]

# Convert each string in the `seqs` list into a `SeqRecord` object containing a `Seq`
seq_records = [SeqRecord(Seq(s)) for s in seqs]

# Create a `MultipleSeqAlignment` object with the `SeqRecord` objects
aligned = MultipleSeqAlignment(seq_records)

# No output is expected for this cell

We can access the sequence records in the `aligned` object by their index, similar to accessing elements in a `list` or `tuple`.

In [None]:
# Access the first sequence record
print(aligned[0], "\n")

# Access the second sequence record
print(aligned[1], "\n")

# Access the last sequence record
print(aligned[-1])

We can also access specific parts of the alignment by slicing both rows and columns, using 2-dimensional slicing.

In [None]:
# Access the first three sequences in the alignment
print("First three sequences:\n", aligned[:3], "\n")

# Access the first four positions (columns) across all sequences
print("First four columns:\n", aligned[:, :4], "\n")

# Access rows 1 to 3 and columns 2 to 5 (the end index of a slice is excluded)
print("Rows 1 to 3, columns 2 to 5:\n", aligned[1:4, 2:6])
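
The row and column indexing follows Python's usual half-open slice rules. The following sketch reproduces the same selections on a plain list of strings (the first four rows of the alignment above), which may help when checking which rows and columns a slice includes.

```python
rows = ["ACGTACGT", "ACGTGCGC", "ACGTA--T", "CCGTACGG"]

# Column slice: apply the same string slice to every row
first_four_columns = [row[:4] for row in rows]

# Row and column slice: rows 1, 2, 3 and columns 2, 3, 4, 5
block = [row[2:6] for row in rows[1:4]]

print("First four columns:", first_four_columns)
print("Rows 1 to 3, columns 2 to 5:", block)
```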

Aligning sequences alone provides limited value. To gain meaningful insights, we can extract various statistics from the alignment.

For example, Biopython’s `AlignInfo` module offers methods to summarize alignment information and retrieve useful statistics.

In [None]:
from Bio.Align import AlignInfo

align_info = AlignInfo.SummaryInfo(aligned)

# No output is expected from this cell

To explore the functions available in the `AlignInfo` module, we can use the `help()` function to view its documentation and learn more about its methods and attributes.

In [None]:
help(align_info)

Let’s try out some of the useful methods provided by `AlignInfo` to retrieve insights from the alignment.

In [None]:
print("The simple consensus:", align_info.dumb_consensus())
print("Alignment score for particular bases at a given position:")
print(align_info.pos_specific_score_matrix())
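
The idea behind a simple consensus and a position-specific count can be sketched with the standard library. This is an illustrative re-implementation, not Biopython's code: the 0.7 threshold and the `"X"` placeholder are assumptions made here, and gaps are ignored when counting.

```python
from collections import Counter

rows = [
    "ACGTACGT", "ACGTGCGC", "ACGTA--T", "CCGTACGG",
    "A-GTACCC", "ACGTA--T", "CTG-ACG-", "AGGTACG-",
]


def column_counts(rows):
    """Count the non-gap symbols in each column of equal-length rows."""
    return [Counter(symbol for symbol in column if symbol != "-") for column in zip(*rows)]


def simple_consensus(rows, threshold=0.7, ambiguous="X"):
    """Pick the most common symbol per column if it is frequent enough."""
    consensus = []
    for counts in column_counts(rows):
        total = sum(counts.values())
        top_count = max(counts.values(), default=0)
        winners = [symbol for symbol, count in counts.items() if count == top_count]
        # Require a single winner that reaches the threshold among non-gap symbols
        if total and len(winners) == 1 and top_count / total >= threshold:
            consensus.append(winners[0])
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


print("Simple consensus:", simple_consensus(rows))
for position, counts in enumerate(column_counts(rows)):
    print(position, dict(counts))
```

In the last column no base reaches the threshold, so the consensus reports the ambiguous placeholder there.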

## Downloading and reading FASTA files

FASTA is a simple, plain-text format commonly used for storing sequence data. Each sequence in a FASTA file includes:
- A **header line** starting with `>`, containing metadata such as sequence ID, name, and description.
- The **sequence itself** in subsequent lines, represented by single-letter codes for nucleotides or amino acids.

#### Example FASTA Format

>\> MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

In addition to the amino acid sequence, the header line provides metadata about the sequence's origin and description.
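
The two-part structure described above is simple enough to parse by hand. The sketch below is a minimal reader for FASTA text, written only to illustrate the format; the function name and the toy records are invented, and it skips the validation a real parser performs.

```python
def parse_fasta_text(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are joined to form one continuous sequence
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example_text = """>seq1 first toy record
ACGT
ACGT
>seq2 second toy record
TTGA
"""

parsed_records = list(parse_fasta_text(example_text))
for header, sequence in parsed_records:
    print(header, "->", sequence)
```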

#### Example sequence

For our example, we will use the [CFTR protein](https://www.uniprot.org/uniprot/P13569), associated with cystic fibrosis. Mutations in the CFTR gene are linked to this condition.

First, let’s download the `P13569.fasta` file.

1. Go to the UniProt link for the CFTR protein: [Cystic fibrosis transmembrane conductance regulator protein | UniProt](https://www.uniprot.org/uniprot/P13569).
2. Click on the **Download** button next to **Tools**.
3. Under **Format**, select **FASTA (canonical)**, then click **Generate URL for API**. Finally, click the **Copy** button.

You should now have the link: `"https://rest.uniprot.org/uniprotkb/P13569.fasta"`. The code for downloading `.fasta` files when you have the URL is shown below.

In [None]:
from urllib.request import urlretrieve 

# Specify the destination path for the downloaded file
CFTR_FASTA_path = "P13569.fasta"

# Download the FASTA file from the UniProt URL and save it to the specified path
result_location, http_response = urlretrieve("https://rest.uniprot.org/uniprotkb/P13569.fasta", CFTR_FASTA_path)

# Confirm the download location
print("Downloaded file to:", result_location)

# Print metadata and statistics related to the download
print("\nDownload metadata and statistics:")
print(http_response)

The `urlretrieve` function from Python’s `urllib.request` module downloads a file from a given URL and saves it locally. Here it receives two arguments, the URL `"https://rest.uniprot.org/uniprotkb/P13569.fasta"` and the destination `CFTR_FASTA_path`, and returns a tuple of two elements, which we unpack into `result_location` and `http_response`.
- The `CFTR_FASTA_path` specifies the filename and path where the `.fasta` file will be saved.
- The `result_location` confirms the path to the downloaded file.
- The `http_response` holds the headers sent by the server, such as the content type and date. Despite its name, it is not the full response and does not include the HTTP status code.
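
When a notebook is re-run, it can be convenient to skip the download if the file is already present. The sketch below wraps `urlretrieve` in a hypothetical helper, `download_if_missing`, and demonstrates it offline by using a local `file://` URL in place of the UniProt address.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, destination):
    """Download `url` to `destination` unless that file already exists.

    Returns the destination path and a flag saying whether a download happened.
    """
    destination = Path(destination)
    if destination.exists():
        return destination, False
    urlretrieve(url, str(destination))
    return destination, True


# Offline demonstration: a file:// URL stands in for a remote address
with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "source.fasta"
    source.write_text(">demo toy record\nACGT\n")
    target = Path(folder) / "copy.fasta"

    _, downloaded_first = download_if_missing(source.as_uri(), target)
    _, downloaded_again = download_if_missing(source.as_uri(), target)
    copied_text = target.read_text()

print("Downloaded on first call: ", downloaded_first)
print("Downloaded on second call:", downloaded_again)
```

In this notebook, the equivalent call would be `download_if_missing("https://rest.uniprot.org/uniprotkb/P13569.fasta", CFTR_FASTA_path)`.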

To open the file and read relevant information from it, we use the `SeqIO` module from Biopython.

In [None]:
from Bio import SeqIO

# Note that each file can contain multiple records
# Initialise an empty list to store the records
records = []

# Parse the FASTA file and add each record to the list
for r in SeqIO.parse(CFTR_FASTA_path, "fasta"):
    records.append(r)

# Print information about the extracted records
print("Total records:", len(records), "\n")
print("First record:") 
print(records[0])
print("\nType of first record:", type(records[0]))

- `SeqIO.parse`: Reads the file in FASTA format and returns an iterable of `SeqRecord` objects, one for each sequence in the file.
- `records` list: Collects each `SeqRecord` object, allowing us to access the sequences and associated metadata later.
- ***Output Details***:
  - **Total records**: Displays the count of records in the file, indicating if there are multiple sequences.
  - **First record**: Shows the first `SeqRecord` object to provide a sample of the data structure.
  - **Type of first record**: Confirms the type of the first item on the list, which should be `SeqRecord`.
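
Headers of UniProt FASTA files pack several fields into a single line, separated by `|` and spaces. The sketch below splits such a header with plain string methods. The example header is illustrative and shortened, so the file you download may carry additional fields; `parse_uniprot_header` is a name invented for this example.

```python
def parse_uniprot_header(header):
    """Split a UniProt-style FASTA header into its main parts."""
    # The identifier is everything before the first space
    identifier, _, description = header.lstrip(">").partition(" ")
    database, accession, entry_name = identifier.split("|")
    return {
        "database": database,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


# Illustrative, shortened header (trailing fields omitted)
example_header = (
    "sp|P13569|CFTR_HUMAN Cystic fibrosis transmembrane "
    "conductance regulator OS=Homo sapiens"
)

parsed_header = parse_uniprot_header(example_header)
for field, value in parsed_header.items():
    print(f"{field:12}: {value}")
```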

To get the actual `Seq` object from a `SeqRecord`, we access the `seq` property.

In [None]:
# Access the first record from the list of records
record = records[0]

# Print the length of the sequence
print("Sequence length:", len(record.seq), "\n")

# Print the sequence itself
print(record.seq)

## Folding

We have covered the initial steps of the ***central dogma of molecular biology***: transcribing DNA into RNA and then translating it into a protein sequence. The final step, i.e. converting a sequence into a protein structure, is known as **protein folding**. This process is considerably more complex and computationally expensive to predict.

One method that has recently gained prominence is **[AlphaFold](https://www.nature.com/articles/s41586-021-03819-2)**. It predicts protein structures based on known folding patterns and sequence alignments, using deep learning techniques to achieve remarkable accuracy. 
The **[2024 Nobel Prize in Chemistry](https://www.nature.com/articles/d41586-024-03214-7)** was awarded to John Jumper and Demis Hassabis of Google DeepMind for developing this AI tool, and to David Baker for his work on computational protein design. Their contributions have revolutionised biology and hold the potential to transform drug discovery.

You can make your own predictions by using [the official AlphaFold notebook](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb#scrollTo=rowN0bVYLe9n) on Google Colab.

## Discussion

In this notebook, we explored fundamental ways to construct and manipulate sequences, including transcription and translation. We examined how **Biopython** facilitates *sequence alignment*, both pairwise and multiple sequence alignment (MSA), and demonstrated how to *download and read FASTA files* to access sequence records and metadata.

This notebook provides only an introduction to the extensive tools and integrations available within the **Biopython** library, which supports a wide range of formats and methods for storing and processing sequence and alignment data.

Please note, this notebook uses very simple sequences. In real-world applications, sequences are often much longer and more complex. We encourage you to revisit this notebook with longer sequences and to experiment with the code to gain a deeper understanding.

You can now move on to the exercise notebook, which will allow you to use the code introduced here in more practical scenarios.

If you want to learn more, there are extra external resources linked at the beginning of this notebook. You can click [here](#Contents) to go back to the top.