# Geometric Deep Learning for Protein Structure Data with PyTorch Lightning

## Schedule
- `10:00 - 10:15`: Introduction
- `10:15 - 11:00`: [Notebook 1 - Proteins as Graphs]()
- `11:00 - 11:30`: _Break_
- `11:30 - 12:30`: [Notebook 2 - Graph Datasets and DataLoaders]()
- `12:30 - 13:30`: _Lunch_
- `13:30 - 13:45`: Introduction to geometric deep learning
- `13:45 - 15:00`: [Notebook 3 - Geometric Deep Learning]()
- `15:00 - 15:30`: _Break_
- `15:30 - 16:30`: [Notebook 4 - Training and Tracking]()
- `16:30 - 17:00`: Wrap-up

## Background knowledge

- Protein structures
- Deep learning
- Graph neural networks


### Why graphs?
- **Graphs** are a natural way to represent **interactions** between entities when the task at hand depends on both local neighborhoods and global graph topology.

- **Proteins** are made up of amino acids that are connected by chemical bonds, and **contributions from "neighboring" atoms can affect the properties of a given atom** to drive protein-protein binding, protein folding, and other biological processes that make up the protein's function.

- When analyzing protein-protein binding, aspects such as residue-residue interactions, residue-solvent interactions, and conformational changes contribute to the entropic and enthalpic factors that drive the binding process.
- These interactions can be represented as a graph, where nodes represent atoms or amino acids and edges represent interactions between them.

<div style="text-align: center; margin-right: 0; margin-left: auto; margin-right: auto;">
    <img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fpj.2014.33/MediaObjects/41428_2014_Article_BFpj201433_Fig1_HTML.jpg?as=webp" style="width: 300px;"/>
</div>
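
As a minimal sketch of the graph representation described above, the snippet below builds a residue-level contact graph in plain Python. The coordinates, residue names, and the 6 Å cutoff are illustrative assumptions rather than values from the workshop; in the notebooks, graph construction is handled by Graphein.

```python
import math

# Hypothetical C-alpha coordinates (in angstroms) for a five-residue toy chain.
ca_coords = {
    "MET1": (0.0, 0.0, 0.0),
    "LYS2": (3.8, 0.0, 0.0),
    "VAL3": (7.6, 0.0, 0.0),
    "GLY4": (7.6, 3.8, 0.0),
    "ALA5": (3.8, 3.8, 0.0),
}


def build_contact_graph(coords, cutoff=6.0):
    """Return an undirected edge list linking residues closer than `cutoff`.

    The cutoff is an assumed value chosen for illustration only.
    """
    names = list(coords)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(coords[a], coords[b]) < cutoff:
                edges.append((a, b))
    return edges


# Nodes are residues; edges are spatial contacts between them.
edges = build_contact_graph(ca_coords)
for a, b in edges:
    print(f"{a} -- {b}")
```

A distance cutoff is only one possible edge definition; edges could equally be derived from covalent bonds, hydrogen bonds, or k-nearest neighbors.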

### Why deep learning?
- Deep learning methods can produce data-driven features from the input representation (**feature extraction / feature learning**), useful in learning from complex, high-dimensional data for tasks where the exact features and their relationships are not known.

![https://www.youtube.com/watch?v=LeeUzusWz5g](https://i0.wp.com/semiengineering.com/wp-content/uploads/2018/01/MLvsDL.png?resize=733%2C405&ssl=1)

- Different deep learning architectures can cope with different **unstructured data representations** (i.e., not arranged as vectors of features) such as text sequences, speech signals, images, and graphs.

![](https://sebastianraschka.com/images/blog/2022/deep-learning-for-tabular-data/unstructured-structured.jpeg)

### Why geometric deep learning?
- **Geometric deep learning** is a subfield of deep learning that focuses on learning from data that is represented as graphs or manifolds.

![](https://i.ytimg.com/vi/LeeUzusWz5g/maxresdefault.jpg)

- It is particularly useful for learning from data that has a **non-Euclidean structure** such as social networks, 3D shapes, and molecule/protein structures.
- These models can preserve both **local geometric relations** (e.g., the immediate connections between nodes in a graph or neighboring residues) and **global topological features** (e.g., the overall shape or structure of a protein), which are crucial for understanding the underlying properties of the data.
- Many geometric data types, like graphs representing protein interactions, are **sparse** in nature. Geometric deep learning models can efficiently handle such sparsity, learning significant insights from limited interactions, which is often challenging for traditional models.
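
To illustrate how sparsity can be exploited, here is a minimal sketch that stores a graph as an edge list and averages each node's neighbor features in a single pass over the edges. The toy features, the edge list, and the helper name `mean_aggregate` are assumptions made for illustration; this is one simple aggregation scheme, not the specific layers used in the notebooks.

```python
# Toy graph: 4 nodes, each with a 2-dimensional feature vector (made-up values).
features = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 0.0],
]

# Directed edge list (source, target); each undirected edge is stored twice.
# Only existing edges are stored, so memory scales with the number of edges
# rather than with the square of the number of nodes.
edge_list = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]


def mean_aggregate(features, edge_list):
    """Average the features of each node's incoming neighbors."""
    num_nodes = len(features)
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_nodes)]
    counts = [0] * num_nodes
    for src, dst in edge_list:
        for k in range(dim):
            sums[dst][k] += features[src][k]
        counts[dst] += 1
    # Nodes without neighbors keep a zero vector.
    return [
        [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        for i in range(num_nodes)
    ]


aggregated = mean_aggregate(features, edge_list)
print(aggregated)
```

The loop touches each edge once, so the cost grows with the number of edges instead of with the size of a dense adjacency matrix.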


## Objectives
Develop a codebase for exploring, training, and evaluating graph deep learning models that take protein structures as input for a residue-level prediction task.
- Learn how to featurize protein structures as graphs using [Graphein]()
- Understand the data loading and processing pipeline for graph datasets using [PyTorch Geometric]()
- Learn how to implement graph neural networks using [PyTorch Geometric]()
- Understand the typical deep learning training and evaluation loops using [PyTorch Lightning]()

## Task and Dataset

- **Given an input protein chain, predict for each residue whether or not it belongs to a protein-protein interface.**
- The dataset (in `dataset.txt`) is a subset of the [MaSIF-site dataset](https://www.nature.com/articles/s41592-019-0666-6). 
- Each line is a PDB ID and a chain. We'll use these to extract residues at the interface with other chains and label them as positive examples. All other residues are negative examples.
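
The labelling step can be sketched as follows, assuming a simple distance criterion: a residue counts as an interface residue if any residue of another chain lies within a cutoff. The coordinates, residue identifiers, the 8 Å cutoff, and the helper `label_interface` are all hypothetical; the exact interface definition used in the notebooks may differ.

```python
import math

# Hypothetical representative coordinates (angstroms), one point per residue.
chain_a = {
    "A:10": (0.0, 0.0, 0.0),
    "A:11": (4.0, 0.0, 0.0),
    "A:12": (20.0, 0.0, 0.0),
}
chain_b = {
    "B:5": (0.0, 5.0, 0.0),
    "B:6": (40.0, 0.0, 0.0),
}


def label_interface(target, partner, cutoff=8.0):
    """Label each target residue 1 if a partner residue is within `cutoff`, else 0.

    The distance criterion and cutoff are assumptions for illustration.
    """
    labels = {}
    for name, xyz in target.items():
        near = any(math.dist(xyz, other) <= cutoff for other in partner.values())
        labels[name] = int(near)
    return labels


# Positive examples (1) are interface residues; all others are negative (0).
labels = label_interface(chain_a, chain_b)
print(labels)
```

Because most residues are usually far from any partner chain, labels produced this way tend to be imbalanced toward the negative class.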

## Tips

- In Jupyter, append `?` to a function or class to show its documentation, or `??` to also show its source code.
- Play around with different parameters for the functions and classes to understand their behavior.
- Many of the classes involved in deep learning are "abstract classes" that provide a blueprint for other classes to inherit from. These are of the form `class MyClass(ABC):`. Abstract classes often have methods that need to be implemented by the inheriting class. In practice, this just means that there is a set of methods (with fixed names and fixed input arguments) that you need to implement in your class; you can find out what these are by looking at the documentation or source code of the abstract class. Apart from this, you can add any other methods or attributes to your class as you see fit.
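
The abstract-class pattern from the last tip can be sketched with Python's standard `abc` module. `BaseDataset`, `ChainDataset`, and the chain identifiers below are hypothetical names invented for this example and do not come from any of the libraries used in the workshop.

```python
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Hypothetical blueprint: subclasses must implement `__len__` and `get`."""

    @abstractmethod
    def __len__(self):
        ...

    @abstractmethod
    def get(self, idx):
        ...

    def describe(self):
        # Concrete method shared by every subclass.
        return f"{type(self).__name__} with {len(self)} items"


class ChainDataset(BaseDataset):
    """Concrete subclass that implements the required methods."""

    def __init__(self, chain_ids):
        self.chain_ids = list(chain_ids)

    def __len__(self):
        return len(self.chain_ids)

    def get(self, idx):
        return self.chain_ids[idx]


# Placeholder identifiers, not entries from `dataset.txt`.
dataset = ChainDataset(["1ABC_A", "2XYZ_B"])
print(dataset.describe())

# Instantiating the abstract class directly fails with a TypeError.
try:
    BaseDataset()
    base_instantiable = True
except TypeError:
    base_instantiable = False
```

Forgetting to implement one of the abstract methods in a subclass produces the same `TypeError` at instantiation time, which is usually the quickest way to discover what still needs to be written.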

## Notebooks
1. Data Preparation
    - Construct graphs from protein structures (Notebook 1_1)
    - Explore PyTorch Geometric graphs; construct the dataset and data module (Notebook 1_2)
    - Load and batch data for training (Notebook 1_2)
2. Model Development
    - Explore off-the-shelf graph-based models (Notebook 2_1)
    - Add custom layers and losses (Notebook 2_1)
    - Implement training and validation (Notebook 2_1 & 2_2)
3. Model Evaluation
    - Log, visualise and track model performance (Notebook 2_2)
    - Save and load trained models and checkpoints
    - Make and save predictions
    - Construct configuration files and compare models across runs with different parameters (Notebook 2_3)