### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 06 - Exercises (Session 2)

*Written by:* Mateusz Kaczyński

**This notebook contains exercises to help you understand the concepts introduced during Session 2 of the Python workshop. The exercises are designed to give you practical experience in applying these tools to bioinformatics tasks.**

Try to complete the exercises before the next session and feel free to refer back to the content in the previous notebooks to help you complete the tasks.

You should work through the tasks consecutively.

Remember to save your changes.

-----

## Contents
1. [Task 1](#Task-1) – Following the central dogma
2. [Task 2](#Task-2) – Sequence alignment

-----

#### Installation

In [None]:
# No need to run if you have already installed Biopython when going through the previous notebook.
!pip install Biopython

#### Imports

Some imports that you may or may not need to complete the tasks.

In [None]:
# Run this cell before you attempt the exercises
from urllib.request import urlretrieve
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import PairwiseAligner, MultipleSeqAlignment, AlignInfo

## Task 1

#### Following the central dogma

Given the following DNA sequence:

> ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAG\
CCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTG\
CGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGG\
CAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCC\
TGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAA\
CTACTGCAAC

1. Transcribe the DNA sequence to RNA.
2. Translate the RNA sequence to a protein sequence.
3. Modify the DNA sequence by replacing the 12th nucleotide (index 11 in Python's 0-based indexing) from `G` to `A`.
4. Translate the modified DNA sequence to a protein sequence.
5. Comment on the results and discuss any potential caveats.

<details>
    <summary>Hint</summary>
    <pre>Seq(sequence_string)</pre>
</details>

<details>
    <summary>Another hint</summary>
    <pre>new_sequence = sequence_string[:11] + "A" + sequence_string[12:]</pre>
</details>

<details>
    <summary>Example solution</summary>
    <pre>
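    seq = Seq(sequence_string)
    # Steps 1 and 2: transcribe to RNA, then translate the RNA
    rna = seq.transcribe()
    print("Original sequence - Transcribed:", rna)
    print("Original sequence - Translated:", rna.translate())
    # Step 3: index 11 (the 12th nucleotide) changes from G to A
    mutated_string = sequence_string[:11] + "A" + sequence_string[12:]
    seq_mutated = Seq(mutated_string)
    # Step 4: translate the modified DNA sequence
    print("Mutated sequence - Translated:", seq_mutated.translate())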
    </pre>
</details>
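
One caveat worth exploring in step 5 is how stop codons are handled during translation. The sketch below uses a short, made-up sequence (`toy_dna` and the other names are hypothetical and not part of the exercise) to illustrate that `Seq.translate()` by default translates the whole sequence and marks stop codons with `*`, whereas passing `to_stop=True` ends the protein at the first in-frame stop codon.

```python
from Bio.Seq import Seq

# Hypothetical toy sequence (not the exercise sequence): Met-Trp-Stop
toy_dna = "ATGTGGTAA"
toy_seq = Seq(toy_dna)

# Transcription replaces T with U
toy_rna = toy_seq.transcribe()
print("RNA:", toy_rna)

# Default translation covers the whole sequence and marks stop codons with '*'
print("Protein:", toy_rna.translate())

# Replace the nucleotide at index 5 (the 6th nucleotide): TGG (Trp) becomes TGA (stop)
mutated_dna = toy_dna[:5] + "A" + toy_dna[6:]
mutated_seq = Seq(mutated_dna)
print("Mutated protein (full):", mutated_seq.translate())

# to_stop=True truncates the protein at the first in-frame stop codon
print("Mutated protein (to_stop):", mutated_seq.translate(to_stop=True))
```

Comparing the last two printed lines shows why a single-nucleotide change can shorten a protein even though the default output has the same number of symbols.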

In [None]:
sequence_string = """\
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAG\
CCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTG\
CGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGG\
CAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCC\
TGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAA\
CTACTGCAAC\
"""
# Write your solution here, adding more cells if necessary

## Task 2

#### Sequence alignment

Below are URLs for SARS-CoV-2 virus ***spike protein*** sequences:

- **Wild-type/reference protein**: [https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=YP_009724390.1&rettype=fasta](https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=YP_009724390.1&rettype=fasta)
- **Alpha variant**: [https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=QWE88920.1&rettype=fasta](https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=QWE88920.1&rettype=fasta)
- **Delta variant**: [https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=QWK65230.1&rettype=fasta](https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=QWK65230.1&rettype=fasta)

Using these protein sequences, complete the following steps:

1. Download each sequence and save it as a FASTA file.
2. Read the sequences into Biopython as `Seq` objects.
3. Determine and print the length of each sequence.
4. Perform and print (global) pairwise alignments, along with the alignment scores, between the reference and one of the variants.
5. Align the Alpha and Delta variants with each other and compare the result with their alignments against the reference. What insights does this provide about these two lineages?
6. *(Optional)* Perform a multiple sequence alignment (MSA) on all sequences. Note that Biopython's `MultipleSeqAlignment` requires all sequences to be of the same length.
 
<details>
    <summary>Hint</summary>
    <pre>urlretrieve("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=YP_009724390.1&rettype=fasta", "data/reference.fasta")</pre>
</details>

<details>
    <summary>Another hint</summary>
    <pre>reference = next(SeqIO.parse("data/reference.fasta", "fasta"))</pre>
</details>
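
For step 4, it may help to see a global pairwise alignment on something small first. The following is a minimal sketch on two made-up peptides (`toy_a` and `toy_b` are hypothetical, not the spike sequences), with scoring values chosen purely for illustration; the defaults of `PairwiseAligner` differ, so scores from this sketch are not comparable with those from an aligner left at its defaults.

```python
from Bio.Align import PairwiseAligner

# Hypothetical toy peptides (not the spike sequences); toy_b lacks one residue
toy_a = "MKTAYIAK"
toy_b = "MKTYIAK"

aligner = PairwiseAligner()
aligner.mode = "global"  # align the sequences end to end

# Illustrative scoring values, assumed for this sketch only
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -1

alignments = aligner.align(toy_a, toy_b)
best = alignments[0]

# Printing an alignment shows the two sequences with gaps marked by '-'
print(best)
print("Score:", best.score)
```

With these values, seven matching residues and one single-residue gap give a score of 7 - 2 = 5. The same pattern should carry over to the full-length sequences, where the printed alignment is much longer.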

<details>
    <summary>Example solution</summary>
<pre>
print("# 1. Download and save data")
def download_file(url, location):
    # urlretrieve returns the local file path and the response headers
    result_location, headers = urlretrieve(url, location)
    print("Downloaded " + url + " to " + result_location)

file_reference = "reference.fasta"
file_alpha = "alpha.fasta"
file_delta = "delta.fasta"

download_file("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=YP_009724390.1&rettype=fasta", file_reference)
download_file("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=QWE88920.1&rettype=fasta", file_alpha)
download_file("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=QWK65230.1&rettype=fasta", file_delta)

print("# 2. Read as Biopython Seq objects")
def read_first_sequence(fasta_filepath):
    # Each file holds a single record, so take the first one from the parser
    record = next(SeqIO.parse(fasta_filepath, "fasta"))
    return record.seq

seq_reference = read_first_sequence(file_reference)
seq_alpha = read_first_sequence(file_alpha)
seq_delta = read_first_sequence(file_delta)

print("# 3. How long are the sequences?")
print("Length of the reference sequence    :", len(seq_reference))
print("Length of the Alpha variant sequence:", len(seq_alpha))
print("Length of the Delta variant sequence:", len(seq_delta))

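print("# 4. Pairwise alignment - reference and Alpha variant")
aligner = PairwiseAligner()
aligner.mode = "global"  # global is the default mode; set explicitly for clarity

reference_alpha_alignments = aligner.align(seq_reference, seq_alpha)

# Print the top alignment itself, then its score
print(reference_alpha_alignments[0])
print("Top alignment score - reference and Alpha variant:", reference_alpha_alignments[0].score)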

print("# 5. Pairwise alignments between 3 sequences")
reference_delta_alignments = aligner.align(seq_reference, seq_delta)
alpha_delta_alignments = aligner.align(seq_alpha, seq_delta)

print("Reference-Alpha alignment score:", reference_alpha_alignments[0].score)
print("Reference-Delta alignment score:", reference_delta_alignments[0].score)
print("Alpha-Delta alignment score    :", alpha_delta_alignments[0].score)    
</pre>

</details>
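
The example solution above does not cover the optional step 6. As a rough sketch of the same-length requirement, the snippet below builds a `MultipleSeqAlignment` from three made-up records that are assumed to be already aligned (gaps written as `-`). The record names and sequences are hypothetical, and this only shows how the container behaves; it does not compute an alignment of the spike sequences.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical, already-aligned toy sequences; '-' marks a gap
records = [
    SeqRecord(Seq("MKTAYIAK"), id="toy_reference"),
    SeqRecord(Seq("MKT-YIAK"), id="toy_variant_1"),
    SeqRecord(Seq("MKTAYLAK"), id="toy_variant_2"),
]

msa = MultipleSeqAlignment(records)
print(msa)
print("Rows:", len(msa), "Columns:", msa.get_alignment_length())

# Slicing a single column returns the residues at that position in every row
print("Column at index 3:", msa[:, 3])

# Rows of different lengths are rejected with a ValueError
try:
    MultipleSeqAlignment([
        SeqRecord(Seq("MKTAYIAK"), id="a"),
        SeqRecord(Seq("MKTYIAK"), id="b"),
    ])
    lengths_accepted = True
except ValueError:
    lengths_accepted = False
print("Unequal lengths accepted:", lengths_accepted)
```

For the real sequences, the lengths printed in step 3 indicate whether gaps would first need to be introduced, for example from the pairwise alignments against the reference.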

In [None]:
# Write your solution here, adding more cells if necessary