### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 07 - Handling protein data with Biopython (supplementary material)

*Written by:* Mateusz Kaczyński

**This notebook covers handling protein data with Biopython, as well as analysis, property predictions, and similarity searches.**

-----

## Contents

1. [Basic analysis](#Basic-analysis)
2. [Property prediction](#Property-prediction)
3. [BLAST](#BLAST)
4. [PDB files](#PDB-files)
5. [Discussion](#Discussion)

-----

### Extra resources:

- [Official Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) - A comprehensive guide to the capabilities of the library.
- [Biopython API documentation](https://biopython.org/docs/latest/api/index.html) - A long, detailed list of all methods and connectors provided by Biopython.
- [Rosalind](http://rosalind.info) - A bioinformatics learning platform that includes exercises.

-----

### Installing Biopython

In [None]:
# No need to run if you have already installed Biopython when going through the previous notebooks.
!pip install Biopython

### Importing required modules and functions

In [None]:
import Bio
print("Module", Bio.__name__, "version", Bio.__version__)
from urllib.request import urlretrieve
from Bio import SeqIO

-----
## Basic analysis

**Biopython** offers a range of tools for protein analysis.

In this example, we will work with the [CFTR protein](https://www.uniprot.org/uniprot/P13569), which is associated with cystic fibrosis. Mutations in the CFTR gene are known to cause this condition.

For more details on CFTR, you can refer to the following resources:
- **Ensembl**: [CFTR Gene Summary on Ensembl](https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000001626;r=7:117287120-117715971)
- **UniProt**: [CFTR Protein Entry on UniProt](https://www.uniprot.org/uniprot/P13569)

First, we will make a `data_supplementary` directory and then download the corresponding FASTA file to extract the sequence.

In [None]:
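import os

# Create the directory; `exist_ok=True` avoids an error if it already exists
os.makedirs("data_supplementary", exist_ok=True)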

In [None]:
# Download the FASTA file for the CFTR protein and save it to the specified path
urlretrieve("https://www.uniprot.org/uniprot/P13569.fasta", "data_supplementary/P13569.fasta")

# Use the `next()` function to retrieve the first sequence record from the FASTA file
cftr_aa = next(SeqIO.parse("data_supplementary/P13569.fasta", "fasta"))

# Print the retrieved sequence record
print(cftr_aa)

**Biopython** provides the `ProteinAnalysis` class, which offers a range of tools for analysing protein sequences.

This class includes functionality for tasks such as:
- Calculating molecular weight
- Determining amino acid composition
- Estimating isoelectric points
- Estimating the fractions of residues associated with helix, turn, and sheet

By using `ProteinAnalysis`, you can perform comprehensive protein analysis with minimal code.


In [None]:
from Bio.SeqUtils.ProtParam import ProteinAnalysis
analysis = ProteinAnalysis(str(cftr_aa.seq))

# No output is expected from this cell

To explore the full functionality of the `ProteinAnalysis` class, you can use the `help()` function. Uncomment the next line to see what other information can be obtained from the `analysis` object.

In [None]:
# help(analysis)

Let’s generate a quick summary of the amino acid composition in the protein sequence.

Here, we’ll use the `ProteinAnalysis` class to get the count of each amino acid, and we’ll use `pprint` (PrettyPrint) to format the output, making the dictionary easier to read.

In [None]:
import pprint

count_of_aas = analysis.count_amino_acids()
print("Count of particular amino acids:")
print(count_of_aas)

print("\nCount of particular amino acids using PrettyPrint:")
pprint.pprint(count_of_aas)
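
Under the hood, counting residues amounts to tallying characters in a string. As a rough illustration only, the sketch below does this with Python's standard `collections.Counter` on a made-up peptide (`toy_sequence` is an assumption for the example, not the CFTR sequence) and also derives each residue's fraction of the total.

```python
from collections import Counter

# Toy peptide used only for illustration (not the CFTR sequence)
toy_sequence = "MKLLVVIAG"

# Counter tallies how many times each one-letter code occurs
toy_counts = Counter(toy_sequence)

# Convert the counts into fractions of the total sequence length
toy_fractions = {aa: n / len(toy_sequence) for aa, n in toy_counts.items()}

for aa, n in sorted(toy_counts.items()):
    print(aa, n, "{:.2f}".format(toy_fractions[aa]))
```

Unlike `count_amino_acids()`, which should report all 20 standard amino acids (including those with a count of zero), `Counter` only lists the letters that actually occur.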

Let's take a look at some protein properties available.

The `"{:.2f}"` format specification prints a `float` rounded to two decimal places.
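
As a quick, self-contained illustration of this format specification (the values below are arbitrary examples, not protein properties):

```python
value = 3.14159

# ".2f" means fixed-point notation with two digits after the decimal point
formatted = "{:.2f}".format(value)
print(formatted)          # str.format style
print(f"{value:.2f}")     # equivalent f-string style

# The value is rounded, not truncated
rounded_up = "{:.2f}".format(1.999)
print(rounded_up)
```

Both styles produce the same string, so the choice between `str.format` and f-strings is a matter of preference.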

In [None]:
print("Molecular weight    :", "{:.2f}".format(analysis.molecular_weight()))
print("\nCharge at a given pH:", "{:.2f}".format(analysis.charge_at_pH(5.8)))
print("\nIsoelectric point   :", "{:.2f}".format(analysis.isoelectric_point()))
in_helix, in_turn, in_sheet = analysis.secondary_structure_fraction()
print(
    "\nFractions of amino acids associated with different secondary structures\n"\
    "               Helix: {:.2f}\n"\
    "               Turn: {:.2f}\n"\
    "               Sheet: {:.2f}".format(in_helix, in_turn, in_sheet)
)

We can use Biopython’s helper functions to create custom statistics. 

For example, let’s calculate the content of **branched-chain amino acids (BCAAs)**—leucine (L), isoleucine (I), and valine (V)—in the protein. These amino acids are often important for protein structure and function.

In [None]:
total_number_of_LIV_aas = 0

# For each amino acid L, I, and V, add the count from the dictionary
for aa in ["L", "I", "V"]:
    total_number_of_LIV_aas += count_of_aas[aa]

# Express the BCAA content as a percentage, rounded to two decimal places
print("BCAA content: {:.2f}%".format(100 * total_number_of_LIV_aas / len(cftr_aa)))

## Property prediction

In this section, we will analyse the hydrophobicity of the protein.

The [Kyte-Doolittle scale](https://doi.org/10.1016/0022-2836(82)90515-0) is a valuable tool for predicting a protein’s hydropathic character. This scale is based on experimentally derived properties of amino acids, with higher values indicating greater hydrophobicity (i.e., a tendency to repel water). Hydrophobic regions identified using this scale can give insights into protein folding, structure, and potential membrane-binding regions.

In [None]:
# Define the Kyte-Doolittle hydrophobicity scale for each amino acid
Kyte_and_Doolittle_scale = {
    "A": 1.8,  "C": 2.5,  "D": -3.5, "E": -3.5, "F": 2.8, 
    "G": -0.4, "H": -3.2, "I": 4.5,  "K": -3.9, "L": 3.8,
    "M": 1.9,  "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2,  "W": -0.9, "Y": -1.3
}

# Amino acid sequence of the CFTR protein
sequence = """
MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLEREWDRE
LASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLLGRIIASYDPDNKEERSIA
IYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLIYKKTLKLSSRVLDKISIGQL
VSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWELLQASAFCGLGFLIVLALFQAGL
GRMMMKYRDQRAGKISERLVITSEMIENIQSVKAYCWEEAMEKMIENLRQTELKLTRKAA
YVRYFNSSAFFFSGFFVVFLSVLPYALIKGIILRKIFTTISFCIVLRMAVTRQFPWAVQT
WYDSLGAINKIQDFLQKQEYKTLEYNLTTTEVVMENVTAFWEEGFGELFEKAKQNNNNRK
TSNGDDSLFFSNFSLLGTPVLKDINFKIERGQLLAVAGSTGAGKTSLLMVIMGELEPSEG
KIKHSGRISFCSQFSWIMPGTIKENIIFGVSYDEYRYRSVIKACQLEEDISKFAEKDNIV
LGEGGITLSGGQRARISLARAVYKDADLYLLDSPFGYLDVLTEKEIFESCVCKLMANKTR
ILVTSKMEHLKKADKILILHEGSSYFYGTFSELQNLQPDFSSKLMGCDSFDQFSAERRNS
ILTETLHRFSLEGDAPVSWTETKKQSFKQTGEFGEKRKNSILNPINSIRKFSIVQKTPLQ
MNGIEEDSDEPLERRLSLVPDSEQGEAILPRISVISTGPTLQARRRQSVLNLMTHSVNQG
QNIHRKTTASTRKVSLAPQANLTELDIYSRRLSQETGLEISEEINEEDLKECFFDDMESI
PAVTTWNTYLRYITVHKSLIFVLIWCLVIFLAEVAASLVVLWLLGNTPLQDKGNSTHSRN
NSYAVIITSTSSYYVFYIYVGVADTLLAMGFFRGLPLVHTLITVSKILHHKMLHSVLQAP
MSTLNTLKAGGILNRFSKDIAILDDLLPLTIFDFIQLLLIVIGAIAVVAVLQPYIFVATV
PVIVAFIMLRAYFLQTSQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFGRQPYFETLFHK
ALNLHTANWFLYLSTLRWFQMRIEMIFVIFFIAVTFISILTTGEGEGRVGIILTLAMNIM
STLQWAVNSSIDVDSLMRSVSRVFKFIDMPTEGKPTKSTKPYKNGQLSKVMIIENSHVKK
DDIWPSGGQMTVKDLTAKYTEGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRL
LNTEGEIQIDGVSWDSITLQQWRKAFGVIPQKVFIFSGTFRKNLDPYEQWSDQEIWKVAD
EVGLRSVIEQFPGKLDFVLVDGGCVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPVT
YQIIRRTLKQAFADCTVILCEHRIEAMLECQQFLVIEENKVRQYDSIQKLLNERSLFRQA
ISPSDRVKLFPHRNSSKCKSKPQIAALKEETEEEVQDTRL
""".replace("\n", "")  # Remove newline characters

# No output is expected from this cell

We will use a **sliding window approach** to calculate a smooth hydrophobicity profile across the protein sequence.

> For a pre-defined window size `n`, at each position in the sequence, we average its current hydrophobicity value along with the values of the `(n-1)/2` preceding and following amino acids.

You can think of this approach as a fixed-size rectangle that moves across the sequence, averaging values within the window to get a mean hydrophobicity for each section of the protein. This helps reveal broader hydrophobic and hydrophilic regions, smoothing out minor fluctuations.

The `enumerate` function will be used here to generate tuples containing index positions and corresponding values, making it easy to track the position in the sequence and the associated hydrophobicity score.
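
To make the window arithmetic concrete, here is a minimal sketch of the same idea on a short list of numbers. The function name `sliding_window_mean` and the toy values are assumptions made for illustration, and the sketch assumes an odd window size.

```python
def sliding_window_mean(values, window_size):
    """Centred moving average; positions without a full window get None."""
    half = (window_size - 1) // 2  # assumes an odd window size
    means = []
    for i in range(len(values)):
        start, end = i - half, i + half + 1
        if start < 0 or end > len(values):
            means.append(None)  # the window would run off the sequence
        else:
            means.append(sum(values[start:end]) / window_size)
    return means

toy_values = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_profile = sliding_window_mean(toy_values, window_size=3)
print(toy_profile)
```

With a window of 3, only the three middle positions have a complete window, so the first and last entries are `None`.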

In [None]:
window_size = 11
hydrophobicity = []  # The hydrophobicity value at a given point in the sequence

# The `enumerate(sequence)` will return a tuple of both the index and the amino acid at each position
for i, aa in enumerate(sequence):  
    window_start = int(i - (window_size-1)/2)
    window_end = int(i + (window_size-1)/2)+1
    # If `window_start` is less than 0 or `window_end` goes beyond the sequence length, 
    # `window_hydrophobicity` is set to `None` since the window cannot be fully contained
    # within the sequence
    if window_start < 0 or window_end > len(sequence):
        window_hydrophobicity = None 
    else:
        # For valid windows, `aas_in_window` retrieves the amino acids within the current window
        aas_in_window = sequence[window_start:window_end]
        # Each amino acid's hydrophobicity score is fetched from `Kyte_and_Doolittle_scale`,
        # a dictionary mapping each amino acid to its hydrophobicity score, and the average hydrophobicity 
        # for the window is calculated by summing the hydrophobicity scores and dividing by `window_size`
        window_hydrophobicity = sum([Kyte_and_Doolittle_scale[aa] for aa in aas_in_window]) / window_size
    # The calculated `window_hydrophobicity` value is appended to the `hydrophobicity` list
    hydrophobicity.append(window_hydrophobicity)

print("We have calculated the hydrophobicity for {} positions".format(len(hydrophobicity)))
print("Hydrophobicity list:")
print(hydrophobicity[:15], "...", hydrophobicity[-15:])

# Note that this is slightly different from GRAVY in the reference paper.
# Positions without a full window hold `None`; `h if h is not None else 0` counts them as 0,
# so the positions at both ends pull the average slightly towards zero.
print("Average hydrophobicity:", "{:.4f}".format(sum(h if h is not None else 0 for h in hydrophobicity) / len(hydrophobicity)))
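
For comparison, the GRAVY (grand average of hydropathy) score mentioned in the comment above is the sum of the per-residue Kyte-Doolittle values divided by the sequence length, with no windowing. The sketch below is a minimal version under that definition; `kd_scale`, `gravy`, and the peptide `AILK` are names and data invented for illustration.

```python
# Kyte-Doolittle hydropathy values (same numbers as the scale defined earlier)
kd_scale = {
    "A": 1.8,  "C": 2.5,  "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5,  "K": -3.9, "L": 3.8,
    "M": 1.9,  "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2,  "W": -0.9, "Y": -1.3,
}

def gravy(seq):
    """Mean Kyte-Doolittle value over all residues of `seq`."""
    return sum(kd_scale[aa] for aa in seq) / len(seq)

toy_peptide = "AILK"  # made-up peptide, not taken from CFTR
print("GRAVY of {}: {:.2f}".format(toy_peptide, gravy(toy_peptide)))
```

The windowed average differs from this figure because end positions are counted as zero and interior residues contribute to several windows. Biopython's `ProteinAnalysis` also provides a `gravy()` method, which should return a comparable figure for the full sequence.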

Now, let’s plot the hydrophobicity along the sequence to visually identify hydrophobic and hydrophilic regions.

We’ll initialise `matplotlib` for visualisation within notebook cells, then plot the `hydrophobicity` list that was generated using the sliding window approach. This [Hydrophilicity Plot link](https://en.wikipedia.org/wiki/Hydrophilicity_plot) may be helpful for reference.

In [None]:
# Enable inline plotting in Jupyter notebooks
%matplotlib inline

import matplotlib.pyplot as plt

# Set plot resolution
plt.rcParams['figure.dpi'] = 80

plt.plot(hydrophobicity)
plt.title("Hydrophobicity per sequence region using window size {}".format(window_size))
plt.xlabel("Position in sequence")
plt.ylabel("Hydrophobicity score")
plt.show()
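
One way to read such a profile programmatically is to look for runs of consecutive positions whose windowed score stays above a cut-off. The sketch below is a minimal illustration: `hydrophobic_stretches`, the toy profile, and the cut-off of 1.6 are all assumptions chosen for the example (a suitable threshold depends on the scale and window size), and a stretch found this way is only a candidate hydrophobic region, not a confirmed membrane segment.

```python
def hydrophobic_stretches(profile, threshold, min_length=1):
    """Return (start, end) index pairs, end exclusive, where profile > threshold."""
    stretches = []
    start = None
    for i, value in enumerate(profile):
        above = value is not None and value > threshold
        if above and start is None:
            start = i  # a new stretch begins here
        elif not above and start is not None:
            if i - start >= min_length:
                stretches.append((start, i))
            start = None  # the stretch has ended
    # Close a stretch that runs to the end of the profile
    if start is not None and len(profile) - start >= min_length:
        stretches.append((start, len(profile)))
    return stretches

# Invented scores; None marks positions without a full window
toy_profile = [None, 0.2, 1.9, 2.1, 2.4, 0.5, -1.0, 1.8, 1.7, None]
toy_stretches = hydrophobic_stretches(toy_profile, threshold=1.6, min_length=2)
print(toy_stretches)
```

Applied to the notebook's data, the call would take the `hydrophobicity` list in place of `toy_profile`.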

## BLAST

The **Basic Local Alignment Search Tool (BLAST)** allows users to find similar regions across protein sequences and retrieve the most similar matches. 

**Biopython** provides tools for running BLAST searches both locally (using command-line tools) and remotely via online computation services. In this section, we will use NCBI's BLAST cloud services.

*Note: Running BLAST searches can be computationally intensive, especially with large sequence databases, so expect any queries to take several minutes.*

The **Biopython** BLAST module includes two main classes:
 - **`NCBIWWW`**: Used to issue BLAST queries to the remote NCBI server.
 - **`NCBIXML`**: Used to parse BLAST results (returned in XML format) into an object that can be easily processed in code.

In [None]:
from Bio.Blast import NCBIWWW, NCBIXML

# Submit the BLAST query to NCBI's remote server
query_handle = NCBIWWW.qblast(
    "blastp",             # Use "blastp" for protein BLAST
    database="nr",        # Search against the "nr" (non-redundant) protein sequence database
    sequence=cftr_aa.seq  # Query sequence from the CFTR protein
)

# Parse the XML-formatted BLAST results from NCBI
# Using `next()` to get the first (or only) BLAST record from the parsed results
blast_results = next(NCBIXML.parse(query_handle))

print("BLAST finished")

In order to visualise the results, we could simply iterate over them, printing the relevant information.

In BLAST results, each `alignment` can contain multiple High-scoring Segment Pairs (HSPs), accessible via `alignment.hsps`. An **HSP** represents a specific local alignment between the query and database sequence.

- **Alignment Properties** (`alignment.title`, `alignment.accession`, `alignment.length`): These properties describe the entire matched sequence in the database and are common across all HSPs within that alignment.
  
- **HSP-specific Properties** (`hsp.expect` for E-value and `hsp.identities` for Identity): These values are specific to each HSP. The E-value estimates how many matches of this quality would be expected by chance, and Identity is the number of identical positions in the aligned segment. Since there can be multiple HSPs within an alignment, each HSP has its own `E-value` and `Identity` values.

In the code, `alignment.hsps` is only accessed for **E-value** and **Identity** because they are segment-specific properties, while the rest of the properties apply to the overall alignment.

Here, we only include the first HSP for each alignment, focusing on a single top alignment for each sequence hit rather than showing every HSP. This approach is suitable when only the top alignment is relevant per sequence hit.

In [None]:
# List to store summary information for each alignment
blast_summary = []

# Iterate through alignments, focusing on the first HSP only
for alignment in blast_results.alignments:
    hsp = alignment.hsps[0]  # First (top) HSP for this alignment
    blast_summary.append({
        "Alignment title": alignment.title,
        "Accession code": alignment.accession,
        "Alignment length": alignment.length,
        # HSP-specific values taken from the first HSP
        "E-value": hsp.expect,
        "Identity": hsp.identities
    })

# Print summary information for each alignment
for result in blast_summary:
    print("Alignment title:", result["Alignment title"])
    print("Accession code:", result["Accession code"])
    print("Alignment length:", result["Alignment length"])
    print("E-value:", result["E-value"])
    print("Identity:", result["Identity"])
    print()
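
The loop above keeps only the first HSP of each alignment. If every HSP matters, a nested loop can visit them all, and an E-value cut-off can discard weak matches. The sketch below uses `SimpleNamespace` objects as hypothetical stand-ins for parsed BLAST alignments (the accessions, E-values, and counts are invented) and an illustrative cut-off of `1e-5`; the attribute names `expect`, `identities`, and `align_length` are meant to mirror those on Biopython's HSP objects.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for parsed alignments; all values are invented
toy_alignments = [
    SimpleNamespace(
        accession="HIT_A",
        hsps=[
            SimpleNamespace(expect=1e-50, identities=90, align_length=100),
            SimpleNamespace(expect=0.5, identities=12, align_length=40),
        ],
    ),
    SimpleNamespace(
        accession="HIT_B",
        hsps=[SimpleNamespace(expect=2.0, identities=10, align_length=50)],
    ),
]

E_VALUE_CUTOFF = 1e-5  # illustrative threshold (an assumption, not a standard)

significant_hsps = []
for alignment in toy_alignments:
    for hsp in alignment.hsps:  # visit every HSP, not just the first
        if hsp.expect < E_VALUE_CUTOFF:
            # Identity as a percentage of the aligned length
            percent_identity = 100 * hsp.identities / hsp.align_length
            significant_hsps.append((alignment.accession, hsp.expect, percent_identity))

print(significant_hsps)
```

With real results, `toy_alignments` would be replaced by `blast_results.alignments`.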

We can also use the `pandas` library to display the results. This code builds a `pandas` DataFrame from the BLAST alignment results.

In [None]:
import pandas as pd

df = pd.DataFrame([
    {
        "Alignment title": alignment.title,
        "Accession code": alignment.accession,
        "Hit definition": alignment.hit_def,
        "Alignment length": alignment.length,
        "E-value": alignment.hsps[0].expect,
        "Identity": alignment.hsps[0].identities
    }
    for alignment in blast_results.alignments
])

df

## PDB files

PDB files contain detailed 3D representations of proteins, which may be experimentally derived or predicted. These files allow us to study a protein's structure at the atomic level, including its spatial arrangements and interactions.

In this example, we will download and parse a [PDB file of the protein encoded by the CFTR gene](https://www.rcsb.org/structure/6O1V). Software like PyMOL can then be used to visualise and analyse the 3D structure in detail.

1. Visit the [RCSB PDB database](https://www.rcsb.org/) and search for the entry by its ID, `6O1V`. This is only one of the many structures available for the protein, and it was solved via electron microscopy (EM).
2. Under **Download Files**, right-click on **PDB Format** and select **Copy Link**.

You should now have the link: `"https://files.rcsb.org/download/6O1V.pdb"`. The code for downloading `.pdb` files when you already know the URL is shown below.

In [None]:
from Bio.PDB.PDBParser import PDBParser

# Download the PDB file for the CFTR protein
result_location, http_response = urlretrieve("https://files.rcsb.org/download/6O1V.pdb", "data_supplementary/6O1V.pdb")
print("File downloaded to:", result_location)

# Parse the PDB file
parser = PDBParser()
structure = parser.get_structure("6O1V", "data_supplementary/6O1V.pdb")

Warnings like the one above are common when reading PDB files, often due to minor deviations from the PDB standard in generated files. **Biopython** offers an option for strict parsing (`PDBParser(PERMISSIVE=False)`), which turns these problems into errors, but this is typically impractical because minor discrepancies are so widespread. If the warnings are merely distracting, `PDBParser(QUIET=True)` suppresses them.

PDB files use a hierarchical structure to represent protein data, organised as follows:
- **Structure**: The top level of the hierarchy, representing the full protein structure.
- **Model**: Nested within the structure; represents different models of the protein if provided.
- **Chain**: Each model can contain multiple chains (in this case, two chains).
- **Residue**: Part of a chain; each residue typically represents one amino acid.
- **Atom**: Part of a residue; each atom stores its own 3D coordinates.

For more information on PDB structure and representation, see [this guide](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction).
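
To give a feel for where this hierarchy comes from, the sketch below slices a few `ATOM` records by their fixed column positions and regroups them by chain and residue. It is only an illustration of the file layout, not how `PDBParser` is implemented: the three records are invented, `parse_atom_line` is a hypothetical helper, and real files contain many other record types that this sketch ignores.

```python
# Three invented ATOM records laid out in the fixed-column PDB format
toy_pdb_lines = [
    "ATOM      1  N   TRP A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  TRP A   1      12.560   6.290  -6.300  1.00  0.00           C",
    "ATOM      3  CA  GLY B   7      -1.250  10.000   4.125  1.00  0.00           C",
]

def parse_atom_line(line):
    """Slice one ATOM record using the fixed-column layout (0-based slices)."""
    return {
        "atom_name": line[12:16].strip(),
        "resname": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "coord": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "element": line[76:78].strip(),
    }

toy_atoms = [parse_atom_line(line) for line in toy_pdb_lines if line.startswith("ATOM")]

# Rebuild the chain -> residue -> atom nesting from the flat records
hierarchy = {}
for atom in toy_atoms:
    residue_key = (atom["resname"], atom["resseq"])
    hierarchy.setdefault(atom["chain"], {}).setdefault(residue_key, []).append(atom["atom_name"])

print(hierarchy)
```

In practice, `PDBParser` handles these details (and many edge cases) for you, which is why the notebook relies on it.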

We’ll now traverse the parsed PDB file and calculate some basic statistics, focusing on:
- total count of tryptophan (`"TRP"`) residues,
- and total number of carbon atoms.

In [None]:
total_TRP_residues = 0
total_carbon_atoms = 0

for model in structure:
    for chain in model:
        for residue in chain:
            if residue.resname == "TRP":
                total_TRP_residues += 1
            for atom in residue:
                if atom.element == "C":
                    total_carbon_atoms += 1

print("Total number of tryptophan residues:", total_TRP_residues)
print("Total number of carbon atoms:       ", total_carbon_atoms)

To calculate the *bounding box* around the protein structure, we can use the atomic coordinates provided in the PDB file. The *bounding box* is defined by the minimum and maximum coordinates in each spatial dimension (x, y, and z).

The code below initialises extreme values for the minimum and maximum coordinates and iterates over all atoms in the structure. For each atom, it checks if the coordinate values are lower or higher than the current extremes and updates the *bounding box* values accordingly.

In [None]:
# Initialise extreme values for minimum and maximum coordinates
# (infinities guarantee that the first atom always updates both lists)
min_atom_coord = [float("inf")] * 3
max_atom_coord = [float("-inf")] * 3

# Traverse each atom to update the minimum and maximum coordinates
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                coord = atom.coord
                # Check each dimension (x, y, z)
                # The `dim` takes values 0, 1, and 2 for x, y, and z, respectively
                # The `val` contains the actual coordinate value of the atom in the respective dimension
                for dim, val in enumerate(coord):
                    if val < min_atom_coord[dim]:
                        min_atom_coord[dim] = val
                    # A separate `if` (not `elif`) so that one value can update both extremes
                    if val > max_atom_coord[dim]:
                        max_atom_coord[dim] = val

# Display the bounding box coordinates
print("Minimum coordinates for each dimension:", min_atom_coord)
print("Maximum coordinates for each dimension:", max_atom_coord)
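
The same bounding box can be computed more compactly with the built-in `min` and `max` functions once the coordinates are grouped per axis. A minimal sketch, using invented coordinates in place of the `atom.coord` values from the structure:

```python
# Invented (x, y, z) coordinates standing in for atom.coord values
toy_coords = [
    (1.0, 5.0, -2.0),
    (4.0, -1.0, 3.0),
    (-3.0, 2.0, 0.5),
]

# zip(*toy_coords) regroups the points into one tuple per axis
xs, ys, zs = zip(*toy_coords)

min_corner = [min(xs), min(ys), min(zs)]
max_corner = [max(xs), max(ys), max(zs)]

# Edge lengths of the box along x, y, and z
box_size = [hi - lo for lo, hi in zip(min_corner, max_corner)]

print("Min corner:", min_corner)
print("Max corner:", max_corner)
print("Box size:  ", box_size)
```

This version needs no starting extremes at all, at the cost of holding all coordinates in memory at once.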

The `Bio.PDB` module in **Biopython** provides additional utilities for working with **PDB** files, including features to acquire, save, transform, and superimpose protein structures. It also supports working with **mmCIF** files, an alternative format for macromolecular structures.

These tools allow for filtering and adjusting the structure data before conducting deeper analysis with 3D visualisation tools like PyMOL. With Biopython’s utilities, you can prepare and manipulate structural data efficiently for downstream applications.

## Discussion

This notebook introduced protein-related functionalities in **Biopython**, covering how to perform analyses and search for related proteins in a programmable, repeatable, and scalable way.

**Biopython** is a comprehensive library with an extensive range of functionalities. It provides tested algorithms and connections to databases, accelerating bioinformatics research. However, due to its breadth *(we could easily spend this time just exploring each module)*, we have focused on a small subset of its capabilities.

You can now move on to the supplementary exercise notebook.

If you want to learn more, there are some extra external resources linked at the beginning of this notebook. You can click [here](#Contents) to go back to the top.