### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="https://github.com/MEDC0106/PythonWorkshop/blob/main/resources/static/Banner.png?raw=1" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 05 - Handling sequences with Biopython

*Written by:* Mateusz Kaczyński

**This notebook introduces the Biopython library, focusing on in-depth sequence handling—the cornerstone of bioinformatics.**


### What is Biopython?

**Biopython** is a widely used, open-source toolkit for computational biology and bioinformatics. It includes tools, resource connectors, and functions for executing standard bioinformatics tasks and parsing common data formats.

*Parsing is just a fancier word for interpreting or analyzing data to extract meaningful information.*

### Why Biopython?

**Biopython** is the de facto standard for accessing numerous bioinformatics databases and tools, making it easier to share your work and results. It simplifies interaction with libraries written in other languages and technologies, as well as those hosted online. While it is possible to use these resources directly, Biopython streamlines the process and provides the essential components you need.

-----

## Contents

1. [Sequence basics](#Sequence-basics)
2. [Transcription and translation](#Transcription-and-translation)
3. [Alignment](#Alignment)
4. [Downloading and reading FASTA files](#Downloading-and-reading-FASTA-files)
5. [Discussion](#Discussion)

-----

### Extra resources:

- [Official Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) - A comprehensive guide to the capabilities of the library.
- [Biopython API documentation](https://biopython.org/docs/latest/api/index.html) - A long, detailed list of all methods and connectors provided by Biopython.
- [Rosalind](http://rosalind.info) - A bioinformatics learning platform that includes exercises.

-----

### Installing Biopython
To use the library locally, we first need to install it. Most Python installations come with the `pip` tool, which can be run directly from the notebook. It downloads and installs the relevant packages from the [Python Package Index (PyPI)](https://pypi.org).

In [4]:
!pip install Biopython

Collecting Biopython
  Downloading biopython-1.86-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (13 kB)
Downloading biopython-1.86-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (3.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 39.8 MB/s eta 0:00:00
Installing collected packages: Biopython
Successfully installed Biopython-1.86


### Importing Biopython
Once installed, **Biopython** can be used in a project by importing it as the `Bio` module.

It is not uncommon for a library's import name to differ from its package name. In this case, the **Biopython** package is imported as **Bio**.

In [5]:
import Bio
print("Module", Bio.__name__, "version", Bio.__version__)

Module Bio version 1.86


-----
## Sequence basics
Managing DNA, RNA, or peptide sequences is central to any bioinformatics workflow. The ability to understand, analyse, and transform these sequences is one of the most fundamental applications of computing in biology.

Here’s how to define a basic sequence.

In [6]:
from Bio.Seq import Seq
dna_sequence = Seq("ACTG")
dna_sequence

Seq('ACTG')

`Bio.Seq` is a submodule of `Bio`, specifically designed to handle biological sequences. Its `Seq` class represents a sequence of nucleotides or amino acids and provides various methods for analysing and manipulating these sequences.

You can see that a sequence can be created from a basic string that represents the ordered elements–in this case, a DNA sequence. However, it is no longer treated as a simple data type.

Let’s examine this more closely with Python’s representation.

In [7]:
print("The sequence compared with the original string:", dna_sequence == "ACTG")
print("The type of dna_sequence:", type(dna_sequence))

The sequence compared with the original string: True
The type of dna_sequence: <class 'Bio.Seq.Seq'>


The first line checks if `dna_sequence` is equivalent to the original string `"ACTG"`, demonstrating that the content is the same even though the data type is different.

The second line displays the type of `dna_sequence`, which will show it is a `Bio.Seq.Seq` object rather than a standard Python string, highlighting the enhanced functionality provided by the `Seq` class.

#### Objects and classes
We can see that, while `dna_sequence` can be compared with the original string without any issues, its type is actually `Seq`. This means that `dna_sequence` is an object of the `Seq` class.
    
Let's unpack these key concepts:
  - **Object** - A fundamental building block in programming that can hold data (properties) and functions (methods). In Python, almost everything is an object, from numbers to complex data types.
  - **Class** - A blueprint or template for creating new objects. A class allows us to group data and related functions together, defining how each instance (object) should behave.
  - `Seq` - A particular class defined in the `Bio.Seq` module within Biopython, designed to handle biological sequences.
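
To make the distinction concrete, here is a toy class written purely for illustration. `ToySequence` and its `gc_count` method are hypothetical names that do not exist in Biopython, and the real `Seq` class is considerably more involved.

```python
class ToySequence:
    """A toy sequence class (hypothetical, not Biopython's Seq)."""

    def __init__(self, data):
        # Property: the data held by each individual object
        self.data = data

    def gc_count(self):
        # Method: a function that works on the object's own data
        return self.data.count("G") + self.data.count("C")


# Two objects (instances) created from the same class (blueprint)
first = ToySequence("ACTG")
second = ToySequence("GGCC")

print(type(first).__name__)  # ToySequence
print(first.gc_count())      # 2
print(second.gc_count())     # 4
```

Both objects share the same methods because they come from the same class, but each one holds its own data.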

When in doubt, we can use the `help()` function to understand how to use `Seq`. This will display detailed information about the class, including available methods, usage examples, and descriptions of its functionalities.

In [5]:
help(dna_sequence)

Help on Seq in module Bio.Seq object:

class Seq(_SeqAbstractBaseClass)
 |  Seq(data: str | bytes | bytearray | Bio.Seq._SeqAbstractBaseClass | Bio.Seq.SequenceDataAbstractBaseClass | dict | None, length: int | None = None)
 |
 |  Read-only sequence object (essentially a string with biological methods).
 |
 |  Like normal python strings, our basic sequence object is immutable.
 |  This prevents you from doing my_seq[5] = "A" for example, but does allow
 |  Seq objects to be used as dictionary keys.
 |
 |  The Seq object provides a number of string like methods (such as count,
 |  find, split and strip).
 |
 |  The Seq object also provides some biological methods, such as complement,
 |  reverse_complement, transcribe, back_transcribe and translate (which are
 |  not applicable to protein sequences).
 |
 |  Method resolution order:
 |      Seq
 |      _SeqAbstractBaseClass
 |      abc.ABC
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __hash__(self)
 |      Hash of the sequ

#### String-like behaviour

In the first session, we explored how Python strings (simple objects representing text) behave.

The `Seq` class is designed to mimic these string methods, making it easier to interchange information between strings and sequences while providing a familiar interface for the user.

Let’s verify this by using some common string methods, including indexing and slicing.

In [8]:
print("Length of the sequence:   ", len(dna_sequence))
print("AC found in the sequence?:", "AC" in dna_sequence)
print("Sequence ends with 'TG'?: ", dna_sequence.endswith("TG"))
print("The first base:           ", dna_sequence[0])
print("The last base:            ", dna_sequence[-1])
print("First two bases:          ", dna_sequence[0:2])
print("Adenines in the sequence: ", dna_sequence.count("A"))

Length of the sequence:    4
AC found in the sequence?: True
Sequence ends with 'TG'?:  True
The first base:            A
The last base:             G
First two bases:           AC
Adenines in the sequence:  1


Does case sensitivity matter when representing bases in a sequence, like `"A"` versus `"a"` for adenine?

Yes, it does. The `Seq` class is case-sensitive, meaning that uppercase and lowercase letters are treated as different characters. It is best to convert all bases to either uppercase or lowercase before performing comparisons or analyses. `Seq` objects provide methods like `.upper()` and `.lower()` to make this easy.

In [9]:
dna_sequence_mixedcase = Seq("acTG")

print("Is the uppercase sequence equal to the mixed case sequence?   ",
      dna_sequence == dna_sequence_mixedcase)

print("What if we change the original sequence to lowercase?:        ",
      dna_sequence.lower() == dna_sequence_mixedcase)

print("What if we make both sequences the same case? Are they equal?:",
      dna_sequence.lower() == dna_sequence_mixedcase.lower())

Is the uppercase sequence equal to the mixed case sequence?    False
What if we change the original sequence to lowercase?:         False
What if we make both sequences the same case? Are they equal?: True


To extend the sequence, we can use the familiar string syntax as `Seq` objects in Biopython behave similarly to Python strings in this context.

In [10]:
print("Extended with another sequence:", dna_sequence + Seq("TAA"))
print("Extended with a string        :", dna_sequence + "TGG")

Extended with another sequence: ACTGTAA
Extended with a string        : ACTGTGG


For the cases where there are gaps, we can remove them using the `.replace()` method.

In [9]:
sequence_with_gaps = Seq("ACGT-ACGT-ACGT")
print("Sequence with gaps   :", sequence_with_gaps)
print("Sequence without gaps:", sequence_with_gaps.replace("-", ""))

Sequence with gaps   : ACGT-ACGT-ACGT
Sequence without gaps: ACGTACGTACGT


#### Sequence analysis

**Biopython** offers a variety of tools for sequence analysis. For example, GC content calculation, a common metric in genomics, can be accessed directly from Biopython.
[The online documentation](https://biopython.org/docs/dev/api/Bio.SeqUtils.html) provides more details.

In [10]:
from Bio.SeqUtils import gc_fraction
print("The guanine-cytosine content of the sequence:", gc_fraction(dna_sequence)*100, "%")

The guanine-cytosine content of the sequence: 50.0 %
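
To see what this metric measures, the calculation can be sketched in plain Python. `simple_gc_fraction` is a hypothetical helper written for this notebook, and it is a simplification: it only counts the unambiguous letters G and C and makes no attempt to handle ambiguity codes the way a library function might.

```python
def simple_gc_fraction(sequence):
    """Return the fraction of G and C bases in a sequence (0.0 to 1.0)."""
    sequence = str(sequence).upper()  # treat "g" and "G" alike
    if not sequence:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


print(simple_gc_fraction("ACTG"))  # 0.5
print(simple_gc_fraction("ggccat"))
```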


## Transcription and translation
In addition to their string-like behaviour, `Seq` objects have methods specific to bioinformatics tasks.

One of the most powerful features of **Biopython** is its ability to switch between representations, allowing seamless conversion between different sequence types.

Let’s explore how to generate complementary and reverse sequences.

In [11]:
print("Sequence:          ", dna_sequence)
print("Complement:        ", dna_sequence.complement())
print("Reverse:           ", dna_sequence[::-1])
print("Reverse complement:", dna_sequence.reverse_complement())

Sequence:           ACTG
Complement:         TGAC
Reverse:            GTCA
Reverse complement: CAGT
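
For intuition, the same three operations can be sketched on a plain string without Biopython. The helper names below are made up for this example, and the sketch covers only the four unambiguous DNA bases.

```python
# Map each DNA base to its pairing partner (upper- and lowercase)
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def simple_complement(sequence):
    """Swap every base for its complement, keeping the order."""
    return sequence.translate(COMPLEMENT_TABLE)


def simple_reverse_complement(sequence):
    """Complement the sequence, then read it backwards."""
    return simple_complement(sequence)[::-1]


print(simple_complement("ACTG"))          # TGAC
print("ACTG"[::-1])                       # GTCA
print(simple_reverse_complement("ACTG"))  # CAGT
```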


Simple DNA-to-RNA transcription can be achieved with the following methods.

In [12]:
print("Sequence:              ", dna_sequence)
print("Transcribed:           ", dna_sequence.transcribe())
print("Reverse complement RNA:", dna_sequence.reverse_complement_rna())

Sequence:               ACTG
Transcribed:            ACUG
Reverse complement RNA: CAGU


The reverse process, going from RNA back to DNA, is also supported.

In [11]:
transcribed = dna_sequence.transcribe()
print("Transcribed     :", transcribed)
print("Back transcribed:", transcribed.back_transcribe())

Transcribed     : ACUG
Back transcribed: ACTG


**Biopython** simplifies the process of translating RNA (or even directly from DNA) into an amino acid sequence.

*Note: Different codon tables are also supported, making it flexible for various organisms and use cases.*

In [14]:
sequence = Seq("ACGCGACGA")
sequence.translate()

Seq('TRR')
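
As a short sketch of the codon table support mentioned in the note, the example below translates one made-up coding sequence in three ways. It assumes the `table` and `to_stop` arguments of `Seq.translate()`; the sequence itself is only an illustration.

```python
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Standard genetic code (the default); "*" marks a stop codon
standard = coding_dna.translate()
print("Standard table:      ", standard)

# Vertebrate mitochondrial code (NCBI table 2), where TGA encodes tryptophan
mitochondrial = coding_dna.translate(table=2)
print("Mitochondrial table: ", mitochondrial)

# Stop translating at the first in-frame stop codon
until_stop = coding_dna.translate(to_stop=True)
print("Up to the first stop:", until_stop)
```

The same DNA yields a different protein under the mitochondrial table, which is why it matters to pick the table that matches the organism or organelle being studied.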

#### Using `Bio.Seq` functions directly on strings
It’s possible to bypass the `Seq` class and apply the equivalent module-level functions, imported from `Bio.Seq`, directly to plain strings.

In [15]:
from Bio.Seq import transcribe
print('Transcription directly from "ACGT" string, without using a Seq object:', transcribe("ACGT"))

Transcription directly from "ACGT" string, without using a Seq object: ACGU


## Alignment

One of the most common uses of bioinformatics tools is sequence alignment.

We will use Biopython’s built-in [pairwise aligner](https://biopython.org/docs/latest/api/Bio.Align#Bio.Align.PairwiseAligner), which aligns two sequences at a time.

We’ll start by importing the `PairwiseAligner`, running a full sequence alignment, and displaying the relevant results.

In [12]:
from Bio.Align import PairwiseAligner

# Initialise the `PairwiseAligner`
aligner = PairwiseAligner()

# Notice we can provide either `Seq` objects or simple strings as input
alignments = aligner.align(Seq("ACGT"), "ACGC")

# Display alignment results
for alignment in alignments:
    print("Score:", alignment.score)
    print(alignment)

Score: 3.0
target            0 ACGT 4
                  0 |||. 4
query             0 ACGC 4



The `score` attribute summarises the quality of the alignment. In this example it equals the number of positions that match exactly (three out of four). The match and mismatch scores and the gap penalties can all be adjusted to customise the alignment criteria.
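
As a minimal sketch of adjusting these parameters, the snippet below sets explicit scores on a `PairwiseAligner`. The numeric values are arbitrary choices for illustration, not recommended settings.

```python
from Bio.Align import PairwiseAligner

aligner = PairwiseAligner()
aligner.mode = "global"          # align the sequences end to end
aligner.match_score = 2          # reward for each identical position
aligner.mismatch_score = -1      # penalty for each substitution
aligner.open_gap_score = -2      # penalty for starting a gap
aligner.extend_gap_score = -0.5  # penalty for each further gap position

# Best alignment: three matches (3 x 2) and one mismatch (-1)
score = aligner.score("ACGT", "ACGC")
print("Score with custom parameters:", score)  # 5.0
```

With these settings, opening a gap costs more than accepting a single mismatch, so the aligner prefers the ungapped alignment.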

**Multiple Sequence Alignment (MSA)** is used when aligning three or more sequences simultaneously. Biopython’s `MultipleSeqAlignment` class does not compute such an alignment itself; it stores sequences that have already been aligned.

To build one, a list of `SeqRecord` objects is required. The `SeqRecord` class, unlike the basic `Seq` class, allows for additional metadata storage, such as:
- **Name**: A short identifier for the sequence.
- **Description**: A longer explanation or label.
- **Annotations**: Key-value pairs for storing various details.

*Note: All sequences in a `MultipleSeqAlignment` must have the same length, with gaps written as `-`.*

In [13]:
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord

# Define a list of sequences as strings
seqs = [
    "ACGTACGT",
    "ACGTGCGC",
    "ACGTA--T",
    "CCGTACGG",
    "A-GTACCC",
    "ACGTA--T",
    "CTG-ACG-",
    "AGGTACG-"
]

# Convert each string in the `seqs` list into a `SeqRecord` object containing a `Seq`
seq_records = [SeqRecord(Seq(s)) for s in seqs]

# Create a `MultipleSeqAlignment` object with the `SeqRecord` objects
aligned = MultipleSeqAlignment(seq_records)

# No output is expected for this cell
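
The records above carry no metadata, so Biopython labels them `<unknown id>` when they are printed. The sketch below shows one way to attach identifiers and descriptions when creating the records. The names `seq1`, `seq2`, `seq3` and the descriptions are made up for illustration.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# A few already-aligned example sequences of equal length
example_seqs = ["ACGTACGT", "ACGTA--T", "CCGTACGG"]

# Give every record an ID and a description (both invented for this example)
named_records = [
    SeqRecord(Seq(s), id=f"seq{i}", description=f"example sequence {i}")
    for i, s in enumerate(example_seqs, start=1)
]
named_alignment = MultipleSeqAlignment(named_records)

print(named_alignment)
print("Rows:   ", len(named_alignment))
print("Columns:", named_alignment.get_alignment_length())
```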

We can access the sequence records in the `aligned` object by their index, similar to accessing elements in a `list` or `tuple`.

In [14]:
# Access the first sequence record
print(aligned[0], "\n")

# Access the second sequence record
print(aligned[1], "\n")

# Access the last sequence record
print(aligned[-1])

ID: <unknown id>
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Seq('ACGTACGT') 

ID: <unknown id>
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Seq('ACGTGCGC') 

ID: <unknown id>
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Seq('AGGTACG-')


We can also access specific parts of the alignment by slicing both rows and columns, using 2-dimensional slicing.

In [15]:
# Access the first three sequences in the alignment
print("First three sequences:\n", aligned[:3], "\n")

# Access the first four positions (columns) across all sequences
print("First four columns:\n", aligned[:, :4], "\n")

# Access rows 1 to 3 and columns 2 to 5 (the end index of a slice is excluded)
print("Rows 1 to 3, columns 2 to 5:\n", aligned[1:4, 2:6])

First three sequences:
 Alignment with 3 rows and 8 columns
ACGTACGT <unknown id>
ACGTGCGC <unknown id>
ACGTA--T <unknown id> 

First four columns:
 Alignment with 8 rows and 4 columns
ACGT <unknown id>
ACGT <unknown id>
ACGT <unknown id>
CCGT <unknown id>
A-GT <unknown id>
ACGT <unknown id>
CTG- <unknown id>
AGGT <unknown id> 

Rows 1 to 3, columns 2 to 5:
 Alignment with 3 rows and 4 columns
GTGC <unknown id>
GTA- <unknown id>
GTAC <unknown id>


Aligning sequences alone provides limited value. To gain meaningful insights, we can extract various statistics from the alignment.

Biopython’s `AlignInfo` module historically provided tools such as `SummaryInfo` for examining multiple sequence alignments, for example, extracting columns, computing consensus sequences, or calculating information content.

However, `SummaryInfo` is now **deprecated**. If you inspect it with `help()`, Biopython clearly indicates that the class is no longer recommended and only exposes a minimal set of older methods.

In [16]:
from Bio.Align import AlignInfo

align_info = AlignInfo.SummaryInfo(aligned)
help(align_info)

Help on SummaryInfo in module Bio.Align.AlignInfo object:

class SummaryInfo(builtins.object)
 |  SummaryInfo(alignment)
 |
 |  Calculate summary info about the alignment.  (DEPRECATED)
 |
 |  This class should be used to calculate information summarizing the
 |  results of an alignment. This may either be straight consensus info
 |  or more complicated things.
 |
 |  Methods defined here:
 |
 |  __init__(self, alignment)
 |      Initialize with the alignment to calculate information on.
 |
 |      ic_vector attribute. A list of ic content for each column number.
 |
 |  get_column(self, col)
 |      Return column of alignment.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables
 |
 |  __weakref__
 |      list of weak references to the object




>>> align_info = AlignInfo.SummaryInfo(msa)
>>> sequence = align_info.get_column(1)

please use

>>> alignment = msa.alignment  # to get a new-style Alignment object
>>> sequence = alignment[:, 1]

Here, `msa` is a MultipleSeqAlignment object and `alignment` is an
`Alignment` object.


Instead, modern Biopython encourages working directly with the new-style `Alignment` object. This object supports intuitive, NumPy-like slicing, for example to extract a single column.

In [17]:
# Get the Alignment object from the MultipleSeqAlignment
alignment = aligned.alignment     # aligned is your MultipleSeqAlignment

column = alignment[:, 5]   # retrieves the column at index 5 (the sixth column)
column

'CC-CC-CC'

In older versions, `AlignInfo.SummaryInfo()` would have allowed you to compute a “dumb” consensus sequence or inspect per-position scoring matrices directly within Biopython. For example, you could call:

```python
print("The simple consensus:", align_info.dumb_consensus())
print("Alignment score for particular bases at a given position:")
print(align_info.pos_specific_score_matrix())
```

If you want to compute consensus sequences, position-specific statistics, or more sophisticated alignment summaries, one option is an external library such as [**scikit-bio**](https://scikit.bio/docs/latest/index.html), which provides tools for working with multiple sequence alignments.

After installing it with `!pip install scikit-bio`, you can run code like the snippet below to build an MSA from the sequences defined earlier, extract a column, and compute a majority-rule consensus.

```python
from skbio import DNA, TabularMSA

# Convert the aligned strings (`seqs`, defined above) to DNA objects
records = [DNA(s) for s in seqs]

# Build TabularMSA object
msa = TabularMSA(records)

# Print the alignment
print(msa)

# Extract the column at index 5
print("Column 5:", msa[:, 5])

# Compute a simple majority-rule consensus
consensus = msa.consensus()
print("Consensus:", consensus)
```
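
If installing another library is not an option, a simple per-column majority consensus can also be sketched in plain Python. `majority_consensus` is a hypothetical helper written for this notebook. It ignores gaps when counting, and when two bases tie it keeps whichever appears first in the column, which is an arbitrary choice.

```python
from collections import Counter

seqs = [
    "ACGTACGT",
    "ACGTGCGC",
    "ACGTA--T",
    "CCGTACGG",
    "A-GTACCC",
    "ACGTA--T",
    "CTG-ACG-",
    "AGGTACG-",
]


def majority_consensus(sequences, gap="-"):
    """Return the most common non-gap symbol of every alignment column."""
    consensus = []
    for column in zip(*sequences):  # iterate over columns, not rows
        counts = Counter(base for base in column if base != gap)
        # Fall back to the gap symbol if the column holds nothing but gaps
        consensus.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(consensus)


print("Consensus:", majority_consensus(seqs))  # Consensus: ACGTACGT
```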

## Downloading and reading FASTA files

FASTA is a simple, plain-text format commonly used for storing sequence data. Each sequence in a FASTA file includes:
- A **header line** starting with `>`, containing metadata such as sequence ID, name, and description.
- The **sequence itself** in subsequent lines, represented by single-letter codes for nucleotides or amino acids.

#### Example FASTA Format

>\> MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

In addition to the amino acid sequence, the header line provides metadata about the sequence's origin and description.
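
Because the format is so simple, its structure can be illustrated with a few lines of plain Python. `parse_fasta` below is a hypothetical, minimal reader written for this example, and the two records are made up. For real work, a dedicated parser is the safer choice.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined together
    if header is not None:
        yield header, "".join(chunks)


example = """>seq1 first made-up record
ACGTAC
GTACGT
>seq2 second made-up record
TTGACA
"""

for header, sequence in parse_fasta(example):
    print(header, "->", sequence, f"({len(sequence)} letters)")
```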

#### Example sequence

For our example, we will use the [CFTR protein](https://www.uniprot.org/uniprot/P13569), associated with cystic fibrosis. Mutations in the CFTR gene are linked to this condition.

First, let’s download the `P13569.fasta` file.

1. Go to the UniProt link for the CFTR protein: [Cystic fibrosis transmembrane conductance regulator protein | UniProt](https://www.uniprot.org/uniprot/P13569).
2. Click on the **Download** button next to **Tools**.
3. Under **Format**, select **FASTA (canonical)**, then click **Generate URL for API**. Finally, click the **Copy** button.

You should now have the link: `"https://rest.uniprot.org/uniprotkb/P13569.fasta"`. The code for downloading `.fasta` files when you have the URL is shown below.

In [18]:
from urllib.request import urlretrieve

# Specify the destination path for the downloaded file
CFTR_FASTA_path = "P13569.fasta"

# Download the FASTA file from the UniProt URL and save it to the specified path
result_location, http_response = urlretrieve("https://rest.uniprot.org/uniprotkb/P13569.fasta", CFTR_FASTA_path)

# Confirm the download location
print("Downloaded file to:", result_location)

# Print metadata and statistics related to the download
print("\nDownload metadata and statistics:")
print(http_response)

Downloaded file to: P13569.fasta

Download metadata and statistics:
Vary: accept,accept-encoding,x-uniprot-release,x-api-deployment-date, User-Agent
Cache-Control: public, max-age=43200
x-cache: miss cached
Content-Type: text/plain;format=fasta
Access-Control-Allow-Credentials: true
Access-Control-Expose-Headers: Link, X-Total-Results, X-UniProt-Release, X-UniProt-Release-Date, X-API-Deployment-Date
X-API-Deployment-Date: 11-December-2025
Strict-Transport-Security: max-age=31536000; includeSubDomains
Date: Wed, 07 Jan 2026 06:49:24 GMT
Access-Control-Max-Age: 1728000
X-UniProt-Release: 2025_04
Access-Control-Allow-Origin: *
Accept-Ranges: bytes
Connection: close
Access-Control-Allow-Methods: GET, PUT, POST, DELETE, PATCH, OPTIONS
Access-Control-Allow-Headers: DNT,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization
Content-Length: 1621
X-UniProt-Release-Date: 15-October-2025




The `urlretrieve` function from Python’s `urllib.request` module downloads a file from a URL and saves it locally. Here it receives two arguments, the URL `"https://rest.uniprot.org/uniprotkb/P13569.fasta"` and the destination `CFTR_FASTA_path`, and it returns a tuple of two elements, which we unpack into `result_location` and `http_response`.
- `CFTR_FASTA_path` specifies the filename and path where the `.fasta` file will be saved.
- `result_location` confirms the path to the downloaded file.
- `http_response` holds the HTTP response headers sent by the server, such as the content type, content length, and date shown above.

To open the file and read relevant information from it, we use the `SeqIO` module from Biopython.

In [19]:
from Bio import SeqIO

# Note that each file can contain multiple records
# Initialise an empty list to store the records
records = []

# Parse the FASTA file and add each record to the list
for r in SeqIO.parse(CFTR_FASTA_path, "fasta"):
    records.append(r)

# Print information about the extracted records
print("Total records:", len(records), "\n")
print("First record:")
print(records[0])
print("\nType of first record:", type(records[0]))

Total records: 1 

First record:
ID: sp|P13569|CFTR_HUMAN
Name: sp|P13569|CFTR_HUMAN
Description: sp|P13569|CFTR_HUMAN Cystic fibrosis transmembrane conductance regulator OS=Homo sapiens OX=9606 GN=CFTR PE=1 SV=3
Number of features: 0
Seq('MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLE...TRL')

Type of first record: <class 'Bio.SeqRecord.SeqRecord'>


- `SeqIO.parse`: Reads the file in FASTA format and returns an iterable of `SeqRecord` objects, one for each sequence in the file.
- `records` list: Collects each `SeqRecord` object, allowing us to access the sequences and associated metadata later.
- ***Output Details***:
  - **Total records**: Displays the count of records in the file, indicating if there are multiple sequences.
  - **First record**: Shows the first `SeqRecord` object to provide a sample of the data structure.
  - **Type of first record**: Confirms the type of the first item on the list, which should be `SeqRecord`.
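
The ID and description shown above pack several pieces of information into plain text. As a sketch, and assuming the header follows the UniProt layout seen in this record (`db|accession|entry name`, followed by `KEY=value` fields), they can be pulled apart with standard string tools. The strings are copied from the output above so that the example runs on its own.

```python
import re

record_id = "sp|P13569|CFTR_HUMAN"
description = (
    "sp|P13569|CFTR_HUMAN Cystic fibrosis transmembrane conductance "
    "regulator OS=Homo sapiens OX=9606 GN=CFTR PE=1 SV=3"
)

# The ID holds three parts separated by "|"
database, accession, entry_name = record_id.split("|")

# Each KEY=value field runs until the next two-letter key or the end of the line
fields = dict(re.findall(r"\b([A-Z]{2})=(.+?)(?=\s[A-Z]{2}=|$)", description))

print("Database:  ", database)
print("Accession: ", accession)
print("Entry name:", entry_name)
print("Organism:  ", fields["OS"])
print("Gene name: ", fields["GN"])
```

With a parsed record, the same strings are available as `record.id` and `record.description`.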

To get the actual `Seq` object from a `SeqRecord`, we access its `seq` attribute.

In [20]:
# Access the first record from the list of records
record = records[0]

# Print the length of the sequence
print("Sequence length:", len(record.seq), "\n")

# Print the sequence itself
print(record.seq)

Sequence length: 1480 

MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLLGRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWELLQASAFCGLGFLIVLALFQAGLGRMMMKYRDQRAGKISERLVITSEMIENIQSVKAYCWEEAMEKMIENLRQTELKLTRKAAYVRYFNSSAFFFSGFFVVFLSVLPYALIKGIILRKIFTTISFCIVLRMAVTRQFPWAVQTWYDSLGAINKIQDFLQKQEYKTLEYNLTTTEVVMENVTAFWEEGFGELFEKAKQNNNNRKTSNGDDSLFFSNFSLLGTPVLKDINFKIERGQLLAVAGSTGAGKTSLLMVIMGELEPSEGKIKHSGRISFCSQFSWIMPGTIKENIIFGVSYDEYRYRSVIKACQLEEDISKFAEKDNIVLGEGGITLSGGQRARISLARAVYKDADLYLLDSPFGYLDVLTEKEIFESCVCKLMANKTRILVTSKMEHLKKADKILILHEGSSYFYGTFSELQNLQPDFSSKLMGCDSFDQFSAERRNSILTETLHRFSLEGDAPVSWTETKKQSFKQTGEFGEKRKNSILNPINSIRKFSIVQKTPLQMNGIEEDSDEPLERRLSLVPDSEQGEAILPRISVISTGPTLQARRRQSVLNLMTHSVNQGQNIHRKTTASTRKVSLAPQANLTELDIYSRRLSQETGLEISEEINEEDLKECFFDDMESIPAVTTWNTYLRYITVHKSLIFVLIWCLVIFLAEVAASLVVLWLLGNTPLQDKGNSTHSRNNSYAVIITSTSSYYVFYIYVGVADTLLAMGFFRGLPLVHTLITVSKILHHKMLHSVLQAPMSTLNTLKAGGILNRF

## Folding

We have covered the initial steps of the ***central dogma of molecular biology***: transcribing DNA into RNA and then translating it into a protein sequence. The final step, i.e. converting a sequence into a protein structure, is known as **protein folding**. This process is considerably more complex and computationally expensive to predict.

One method that has recently gained prominence is **[AlphaFold](https://www.nature.com/articles/s41586-021-03819-2)**. It predicts protein structures based on known folding patterns and sequence alignments, using deep learning techniques to achieve remarkable accuracy.
The **[2024 Nobel Prize in Chemistry](https://www.nature.com/articles/d41586-024-03214-7)** was awarded to John Jumper and Demis Hassabis of Google DeepMind for developing this AI tool, and to David Baker for his work on computational protein design. Their contributions have revolutionised biology and hold the potential to transform drug discovery.

You can make your own predictions by using [the official AlphaFold notebook](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb#scrollTo=rowN0bVYLe9n) on Google Colab.

## Discussion

In this notebook, we explored fundamental ways to construct and manipulate sequences, including transcription and translation. We examined how **Biopython** facilitates *sequence alignment*, both pairwise and multiple sequence alignment (MSA), and demonstrated how to *download and read FASTA files* to access sequence records and metadata.

This notebook provides only an introduction to the extensive tools and integrations available within the **Biopython** library, which supports a wide range of formats and methods for storing and processing sequence and alignment data.

Please note, this notebook uses very simple sequences. In real-world applications, sequences are often much longer and more complex. We encourage you to revisit this notebook with longer sequences and to experiment with the code to gain a deeper understanding.

You can now move on to the exercise notebook, which will allow you to use the code introduced here in more practical scenarios.

If you want to learn more, there are some extra external resources linked at the beginning of this notebook. You can click [here](#Contents) to go back to the top.