# Machine Learning and Its Application in Biology

__Artificial Intelligence Applied to the Analysis of Biological Data__

_MeIA_

`2025`

## AlphaFold: The Revolution in Protein Structure Prediction

# A Brief History of Protein Structural Studies

Proteins are composed of a sequence of amino acids. This sequence determines their structural properties, which in turn govern their biological functions within the cell.

<img src="./Figures/Seq_protein.png" width="600" height="600"/>

## Secondary Structures 


<img src="./Figures/Seq_protein-2.png" width="800" height="800"/>

## Structural Motifs

<img src="./Figures/Motifs.png" width="800" height="800"/>

## How can we determine if a protein has a structural motif? Let’s examine the case of `MarA` (https://www.uniprot.org/uniprotkb/P0ACH5/entry):

<img src="./Figures/MarA.png" width="800" height="800"/>



# Therefore, studying protein structure allows us to:

1. Design more effective and selective drugs;

2. Investigate protein misfolding in diseases such as Alzheimer’s disease, Parkinson’s disease, type 2 diabetes, senile dementia, amyotrophic lateral sclerosis (ALS), Huntington’s disease, and others;

3. Assess the impact of amino acid mutations, for example those in the main protease (Mpro) of SARS-CoV and SARS-CoV-2;

4. Apply bioengineering for enzymatic catalysis.

# The key question is: how can we obtain protein models to investigate their structure?

https://www.rcsb.org/

<img src="./Figures/PDB.png" width="800" height="800"/>


In some cases, protein models are obtained experimentally using X-ray crystallography, NMR (nuclear magnetic resonance) spectroscopy, or, more recently, cryo-EM (cryo-electron microscopy).

However, __protein crystallization is not always feasible, nor is it consistently possible to obtain sufficiently purified samples for structural characterization.__
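When an experimental model does exist, its coordinate file can be fetched from the RCSB PDB by entry ID. The following is a minimal sketch using only the Python standard library. It assumes the `https://files.rcsb.org/download/<id>.cif` file-download pattern and classic four-character PDB IDs. The helper names `rcsb_cif_url` and `download_cif` are ours, and the default folder `PDB_files` simply mirrors the folder used by the notebook cells.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: RCSB serves mmCIF files under this download pattern.
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.cif"


def rcsb_cif_url(pdb_id: str) -> str:
    """Return the download URL for the mmCIF file of a PDB entry."""
    pdb_id = pdb_id.strip().lower()
    # Assumption: classic four-character alphanumeric PDB identifiers.
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"Expected a 4-character PDB ID, got {pdb_id!r}")
    return RCSB_DOWNLOAD_URL.format(pdb_id=pdb_id)


def download_cif(pdb_id: str, folder: str = "PDB_files") -> Path:
    """Download the mmCIF file into `folder` unless it is already there."""
    target = Path(folder) / f"{pdb_id.strip().lower()}.cif"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(rcsb_cif_url(pdb_id), str(target))
    return target


# Example (requires network access):
# path = download_cif("5uit")
```

Checking for an existing file before downloading keeps repeated notebook runs from hitting the server again.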

## If a protein's structure is predominantly determined by its amino acid sequence, how can we predict its structure from that sequence alone?

This is a fundamental challenge in structural biology. Given the problem's complexity, the CASP (Critical Assessment of Structure Prediction) competition was established to evaluate computational methods for protein structure prediction.

During CASP14 (2020), AlphaFold achieved breakthrough accuracy in predicting protein structures, marking a transformative moment for the field (https://predictioncenter.org/casp14/zscores_final.cgi).

<div style="background-color: #e6f3ff; padding: 10px; border: 1px solid #0077cc; border-radius: 5px;">
    <strong>📌 Important:</strong>  
    In CASP14, <strong>AlphaFold</strong> was the top-ranked protein structure prediction method by a large margin, producing predictions with high accuracy. While the system still has some limitations, the CASP results suggest AlphaFold has immediate potential to help us understand the structure of proteins and advance biological research.
</div>


<img src="./Figures/AlphaFold.png" width="800" height="800"/>

AlphaFold is an AI approach that integrates biophysical knowledge of protein structure into a deep learning architecture and leverages multiple sequence alignments in its algorithm design. For the detailed methodology, we recommend consulting the original publication:

Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2

<img src="./Figures/AlphaFold-2.1.png" width="1000" height="1000"/>

# AlphaFold DB provides open access to over 200 million protein structure predictions to accelerate scientific research.

<img src="./Figures/AlphaFold-3.png" width="800" height="800"/>
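Predicted models in AlphaFold DB are indexed by UniProt accession, so the accession of `MarA` (P0ACH5) is enough to look up its prediction. The sketch below is illustrative only. It assumes that the public endpoint `https://alphafold.ebi.ac.uk/api/prediction/<accession>` returns a JSON list of records and that each record carries a `pdbUrl` field; both details should be checked against the current AlphaFold DB API documentation. The function names are ours.

```python
import json
from urllib.request import urlopen

# Assumption: AlphaFold DB exposes predictions per UniProt accession here.
API_TEMPLATE = "https://alphafold.ebi.ac.uk/api/prediction/{accession}"


def prediction_api_url(accession: str) -> str:
    """Build the lookup URL for one UniProt accession."""
    return API_TEMPLATE.format(accession=accession.strip().upper())


def model_urls(records: list) -> list:
    """Collect model file URLs from a decoded API response.

    Assumption: each record is a dict with a 'pdbUrl' key; records
    without that key are skipped.
    """
    return [record["pdbUrl"] for record in records if "pdbUrl" in record]


def fetch_model_urls(accession: str, timeout: float = 30.0) -> list:
    """Query the API (network access required) and return model URLs."""
    with urlopen(prediction_api_url(accession), timeout=timeout) as response:
        return model_urls(json.load(response))


# Example (requires network access):
# print(fetch_model_urls("P0ACH5"))
```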





# AlphaFold's breakthrough was so significant for protein science that its developers, Demis Hassabis and John Jumper, shared the 2024 Nobel Prize in Chemistry.

<img src="./Figures/AlphaFold-4.png" width="800" height="800"/>

<div style="background-color: #e6f3ff; padding: 10px; text-align: center;">
    <strong>AlphaFold Breakthrough:</strong>  
    "Protein structure prediction revolutionized by AI."
</div>

<div style="background-color: #e6f3ff; padding: 10px; text-align: center;">
   🧬 <strong>"I hope when we look back on AlphaFold, it will be the first proof point of AI’s incredible potential."</strong> 
 🧬 </div>

# Using Protein Visualization Tools

We will now use the `nglview` library to visualize proteins. To install it, run:

<div style="background: #f5f5f5; padding: 8px; border-left: 3px solid #555; font-family: monospace;">
pip install nglview
</div>

In [8]:
# pip install nglview

In [9]:
import os
import nglview as nv

In [10]:
# Build a portable path to the mmCIF file of PDB entry 5uit and display it
filepath = os.path.join('PDB_files', '5uit.cif')
view = nv.show_file(filepath)
view

NGLWidget()

In [11]:
# Same steps for PDB entry 6bfn, kept in a separate widget
filepath2 = os.path.join('PDB_files', '6bfn.cif')
view2 = nv.show_file(filepath2)
view2

NGLWidget()
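The default rendering can be replaced to highlight the secondary structure elements discussed earlier. The helper below is a small sketch, not part of the original notebook. It relies on the `clear_representations` and `add_representation` methods of the nglview widget and on the NGL color scheme name `sstruc`. The function name `color_by_secondary_structure` is ours.

```python
def color_by_secondary_structure(view):
    """Show only a cartoon of the protein, colored by secondary structure.

    `view` is expected to be an nglview widget, for example the object
    returned by nv.show_file(...).
    """
    # Drop the representations that nglview adds by default.
    view.clear_representations()
    # 'sstruc' is the NGL color scheme for helices, strands, and coils.
    view.add_representation("cartoon", selection="protein", color="sstruc")
    return view


# In the notebook:
# view = color_by_secondary_structure(nv.show_file(filepath))
# view
```

Returning the widget lets the call sit on the last line of a cell so the styled view is displayed directly.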

# ColabFold: making protein folding accessible to all

Mirdita, M., Schütze, K., Moriwaki, Y. et al. ColabFold: making protein folding accessible to all. Nat Methods 19, 679–682 (2022). https://doi.org/10.1038/s41592-022-01488-1


https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb

# Respiratory complex I from Escherichia coli

Respiratory complex I is a multi-subunit membrane protein complex that reversibly couples NADH oxidation and ubiquinone reduction with proton translocation against the transmembrane potential. Complex I from Escherichia coli is among the best functionally characterized complexes, but its structure remained unknown for a long time, which hindered studies of the enzyme's coupling mechanism. The study cited below reports the structure of the E. coli complex reconstituted into lipid nanodiscs.

Kolata, P., Efremov, R. G. Structure of Escherichia coli respiratory complex I reconstituted into lipid nanodiscs reveals an uncoupled conformation. eLife 10, e68710 (2021).

<img src="./Figures/PDB_7nyr.png" width="800" height="800"/>

<img src="./Figures/Fig_Nuo_complex-final-3.png" width="800" height="800"/>



In [12]:
# Display the experimental structure of the complex (PDB entry 7nyr)
filepath3 = os.path.join('PDB_files', '7nyr.cif')
view = nv.show_file(filepath3)
view

NGLWidget()

# The goal of this example is to reconstruct part of this complex using ColabFold

\> I
MIKEIINVVHGTFTQLRSLVMIFGHAFRKRDTLQYPEEPVYLPPRYRGRIVLTRDPDGEERCVACNLCAVACPVGCISLQKAETEDGRWYPEFFRINFSRCIFCGLCEEACPTTAIQLTPDFEMGEFKRQDLVYEKHDLLISGPGKNPDYNYYRVAGMAIAGKPKGAAQNEAEPINVKSLLP

\> H
MSWLTPALVTIILTVVKAIVVLLAVVICGALLSWVERRLLGLWQDRYGPNRVGPFGAFQLGADMVKMFFKEDWTPPFADKMIFTLAPVIAMGALLVAFAIVPITPTWGVADLNIGILFFFAMAGLTVYAVLFAGWSSNNKFALLGSLRASAQTISYEVFLALSLMGIVAQVGSFNMRDIVQYQIDNVWFIIPQFFGFCTFIIAGVAVTHRHPFDQPEAEQELADGYHIEYAGMKWGMFFVGEYIGIVLVSALLATLFFGGWHGPFLDTLPWLSFFYFAAKTGFFIMLFILIRASLPRPRYDQVMAFSWKVCLPLTLINLLVTGALVLAAAQ

\> C
MTADSALYIPPYKADDQDIVVELNSRFGAETFTVQPTRTGMPVLWVPRERLIEVLTFLRQVPKPYVMLYDLHGVDERLRTHRRGLPSADFSVFYHLMSLERNSDVMIKVALSERDLNLPTATRIWPNANWYEREVWDMYGITFTGHPHLTRMLMPPTWQGHPLRKDYPARATEFDPYSLSAAKQDLEQEALRFKPEDWGMKRHGENEDYMFLNLGPNHPSAHGAFRIILQLDGEEIIDCVPEIGYHHRGAEKMAERQSWHSFIPYTDRIDYLGGVMNNLPYVLSVEKLAGIKVPQRVDVIRIMMAEFFRILNHLLYLGTYIQDVGAMTPVFFTFTDRQRAYKVVEAITGFRLHPAWYRIGGVAHDLPRGWDKLVREFLDWMPKRLDEYETAALKNSILRGRTIGVAQYNTKEALEWGTTGAGLRATGCDFDLRKARPYSGYENFEFEVPLAHNGDAYDRCMVKMGEMRQSLRIIEQCLKNMPEGPYKADHPLTTPPPKERTLQHIETLITHFLQVSWGPVMPANEAFQMIEATKGINSYYLTSDGSTMSYRTRIRTPSFAHLQQIPSVINGSMIADLIAYLGSIDFVMADVDR

\> G
MATIHVDGKTLEVDGADNLLQACLSLGLDIPYFCWHPALGSVGACRQCAVKQYTDENDKRGRLVMSCMTPATDNTWISIEDEEAKQFRASVVEWLMTNHPHDCPVCEEGGHCHLQDMTVMTGHNERRYRFTKRTHQNQELGPFIAHEMNRCIACYRCVRYYKDYAGGTDLGVYGAHDNVYFGRVEDGVLESEFSGNLTEVCPTGVFTDKTHSERYNRKWDMQFAPSICHGCSSGCNISPGERYGEIRRIENRYNGSVNHYFLCDRGRFGYGYVNREDRPRQPLLVLSKQKLSLDGALDQAAALLKERKVVGIGSPRASLESNFALRELVGEGNFYSGINEGELDRLRLILQVMQEGPLPVPSIRDIEDHDAVFVLGEDLTQTAARIALALRQSVKGKAVEMAADMKVQPWLDAAVKNIAQHAQNPLFIASVSATRLDDVAEETVHAAPDDLARLGFAVAHAIDPSAPAVADLDPQAQAFAQRIADALLVAKRPLVVSGNSLGNKALIEAAANIAKALKQREKNGSISLVVGEANSLGLALFGGDSVEAALERLTSGQADAVVVLENDLYRRTDAARVDAALAAARVVIVADHQQTATTAKAHLVLPAASFAEGDGTLVSQEGRAQRFFQVFDPTYYDAKNMVREGWRWLHAIHSTLQGKRVDWTQLDHVTEAVAEAKPILAGIRDAAPAASFRIKGLKLAREPHRYSGRTAMRANISVHEPRTPQDIDSAFAFSMEGYSGSQEDRQQIPFAWSPGWNSPQAWNKFQDEVGGHLRAGDPGVRLIEPKGEGLDWFQAVPVPFSAKADSWKVVPLYHLFGSEENSSRAAPIQQRIPETYVALSKEDADRLGVNDGATLGFQLKGQALRLPLRIDEQLGAGLIGLPVGFAGIPAAIAGCSVEGLQEAAQ

\> A
MPNPAELAAHHWGFAAFLLGVVGLLAFMLGVSSLLGSKAFGRSKNEPFESGIVPTGGARLRLSAKFYLVAMLFVIFDVEALFLFAWSVSVRESGWAGLIEATIFIAILLAGLVYLWRIGALDWAPESRRKRQAKLKQ


## We will select chains I and H to build the model.
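ColabFold models a complex when the chain sequences are pasted as a single query separated by `:`. The sketch below shows one way to assemble that query from FASTA text. The sequences are truncated to the first ten residues of chains I, H, and A purely to keep the example short, so the full sequences listed above should be used for a real run. The function names are ours, and the parser expects plain `>` headers without the leading backslash.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def colabfold_query(records, chains):
    """Join the selected chains with ':' as ColabFold expects for complexes."""
    missing = [chain for chain in chains if chain not in records]
    if missing:
        raise KeyError(f"Chains not found in FASTA input: {missing}")
    return ":".join(records[chain] for chain in chains)


# Truncated sequences (first ten residues only), for illustration.
EXAMPLE_FASTA = """\
> I
MIKEIINVVH
> H
MSWLTPALVT
> A
MPNPAELAAH
"""

query = colabfold_query(parse_fasta(EXAMPLE_FASTA), ["I", "H"])
print(query)  # MIKEIINVVH:MSWLTPALVT
```

The printed string is what goes into the `query_sequence` field of the ColabFold notebook.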

Now, let's check the results.

In [13]:
# Load aligned structures
view = nv.show_file("align_AF_7nyr.pdb")
view

NGLWidget()
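Beyond visual inspection, agreement between a predicted and an experimental model is often summarized as the root-mean-square deviation (RMSD) over C-alpha atoms. The sketch below computes it for two chains in PDB-format text that are already superimposed. It assumes that the two models are stored under different chain IDs and that their residues match one-to-one in the same order, which may not hold for `align_AF_7nyr.pdb` without further filtering. The function names are ours.

```python
import math


def ca_coordinates(pdb_text, chain_id):
    """Return (x, y, z) tuples of the C-alpha atoms of one chain.

    Uses the fixed-column PDB layout: atom name in columns 13-16,
    chain ID in column 22, coordinates in columns 31-54.
    """
    coords = []
    for line in pdb_text.splitlines():
        if (
            line.startswith("ATOM")
            and len(line) >= 54
            and line[12:16].strip() == "CA"
            and line[21] == chain_id
        ):
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def rmsd(coords_a, coords_b):
    """RMSD between two equally long, already superimposed coordinate lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("Need two non-empty coordinate lists of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


# Example with hypothetical chain IDs "A" (experimental) and "B" (predicted):
# text = open("align_AF_7nyr.pdb").read()
# print(rmsd(ca_coordinates(text, "A"), ca_coordinates(text, "B")))
```

No superposition is performed here, so the value is only meaningful for structures that were aligned beforehand.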