### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 08 - Supplementary exercises (Session 2)

*Written by:* Mateusz Kaczyński

**This notebook contains exercises on basic protein analysis with the Biopython toolkit.**

Try to complete the exercises before the next session and feel free to refer back to the content in the previous notebooks to help you complete the tasks.

You should work through the tasks consecutively.

Remember to save your changes.

-----

## Contents

1. [Task 3](#Task-3) – Plotting relative mutability
2. [Task 4](#Task-4) – BLAST and analyse

-----

#### Installation

In [None]:
# No need to run if you have already installed Biopython when going through the previous notebooks.
!pip install Biopython

#### Imports

Some imports you may or may not need to complete the tasks.

In [None]:
# Run this cell before you attempt the exercises
%matplotlib notebook
import matplotlib.pyplot as plt
import pandas as pd
from urllib.request import urlretrieve 
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Blast import NCBIWWW, NCBIXML
from Bio.SeqUtils.ProtParam import ProteinAnalysis

## Task 3

#### Plotting relative mutability

In this exercise, we’ll use the relative amino acid mutability scale described by *Dayhoff M.O., Schwartz R.M., Orcutt B.C. in "Atlas of Protein Sequence and Structure", Vol.5, Suppl.3 (1978)*. This scale gives an empirically derived mutability value for each amino acid, expressed relative to alanine (Ala=100).

1. Download a FASTA file for any protein of interest from [UniProt](https://uniprot.org). *(Alternatively, you can provide the sequence manually).*
2. Plot the relative mutability across the protein sequence using a sliding window of 15 amino acids.

##### Notes
- Use the relative mutability scale values with alanine set as the baseline (Ala=100).
- A sliding window will help smooth out fluctuations and provide a clearer view of mutable regions.
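
A minimal sketch of the sliding-window idea, assuming a plain unweighted mean over each full window. The helper name `sliding_window_means` and the toy inputs are illustrative only and are not part of the exercise:

```python
def sliding_window_means(sequence, scale, window=15):
    """Return the mean scale value of every full window along the sequence."""
    values = [scale[residue] for residue in sequence]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# Toy subset of the scale and a made-up sequence, for illustration only.
toy_scale = {"A": 100, "C": 20, "W": 18}
toy_sequence = "AACCWW"

# A window of 3 over 6 residues yields 4 averaged values.
print(sliding_window_means(toy_sequence, toy_scale, window=3))
```

The resulting list can be passed to `plt.plot` to draw the profile. Biopython's `ProteinAnalysis.protein_scale` provides a comparable windowed calculation if you prefer not to write your own.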

In [None]:
aa_relative_mutability = {
    "A": 100, "C": 20,  "D": 106, "E": 102, "F": 41, 
    "G": 49,  "H": 66,  "I": 96,  "K": 56,  "L": 40,
    "M": 94,  "N": 134, "P": 56,  "Q": 102, "R": 65,
    "S": 120, "T": 97,  "V": 74,  "W": 18,  "Y": 41
}
# Write your solution here, adding more cells if necessary.

## Task 4

#### BLAST and analyse

```python
query_sequence = """
EVSIIQSMGYRNRAKRLLQSEPENPSLQETSLSVQLSNLGTVRTLRTKQRIQPQKTSVYI
ELGSDSSEDTVNKATYCSVGDQELLQITPQGTRDEISLDSAKKAACEFSETDVTNTEHHQ
PSNNDLNTTEKRAAERHPEKYQGSSVSNLHVEPCGTNTHASSLQHENSSLLLTKDRMNVE
KAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCERKEWNKQKLPC
"""
```
1. Run a BLAST search with the provided protein sequence against NCBI’s non-redundant protein sequence database (`nr`).
2. Download the first two available protein sequence hits. *(If you’re not sure how to do this in Python, you can manually copy and paste the sequence).*
3. Calculate the molecular weight of both. Identify which one is heavier.
4. Using the `aa_relative_mutability` dictionary from the previous exercise, determine which of the two proteins is more prone to mutation.
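
One possible way to compare two proteins, sketched under the assumption that "more prone to mutation" is read as a higher mean mutability per residue. The helper `mean_scale_value` and the toy values are hypothetical. The sketch also strips whitespace first, because a triple-quoted sequence such as the query above contains newline characters that are not amino acids:

```python
def mean_scale_value(sequence, scale):
    """Average the scale over all residues, ignoring any whitespace."""
    cleaned = "".join(sequence.split())  # removes newlines and spaces
    return sum(scale[residue] for residue in cleaned) / len(cleaned)


# Toy subset of the scale and a made-up multi-line sequence, for illustration only.
toy_scale = {"A": 100, "W": 18}
toy_sequence = """
AA
WW
"""

print(mean_scale_value(toy_sequence, toy_scale))
```

Averaging per residue keeps the comparison fair when the two proteins differ in length, whereas a plain sum would favour the longer one.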

##### Notes
- **Molecular weight calculation**: Use Biopython’s `ProteinAnalysis` class to compute the molecular weights of both sequences.
- **Constructing the FASTA download URL using Entrez Programming Utilities (E-utilities)**: Have a look at the [help pages](https://www.ncbi.nlm.nih.gov/books/NBK25500/#_chapter1_Downloading_Full_Records_).
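
A minimal sketch of building such a URL with the standard library, assuming the EFetch endpoint with the `protein` database and FASTA text output. The helper name `build_fasta_url` and the accession shown are placeholders, not actual BLAST hits:

```python
from urllib.parse import urlencode

# Base address of the EFetch utility.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession):
    """Build an EFetch URL that requests a protein record as FASTA text."""
    params = {
        "db": "protein",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# "XP_000000.1" is a placeholder; substitute an accession from your own BLAST results.
print(build_fasta_url("XP_000000.1"))
```

The returned URL can then be handed to `urlretrieve` to save the record locally, and the saved file read with `SeqIO.read(path, "fasta")`.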

In [None]:
query_sequence = """
EVSIIQSMGYRNRAKRLLQSEPENPSLQETSLSVQLSNLGTVRTLRTKQRIQPQKTSVYI
ELGSDSSEDTVNKATYCSVGDQELLQITPQGTRDEISLDSAKKAACEFSETDVTNTEHHQ
PSNNDLNTTEKRAAERHPEKYQGSSVSNLHVEPCGTNTHASSLQHENSSLLLTKDRMNVE
KAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCERKEWNKQKLPC
"""
# Write your solution here, adding more cells if necessary.