Structbio2022 - Practical day 01 (08/11/2022)
=============================================

Part 02
-------

Document written by [Adrián Diaz](mailto:adrian.diaz@vub.be) & [David Bickel](mailto:david.bickel@vub.be).

**Vrije Universiteit Brussel**

### Objectives

- Learn how to extract the structures from the ColabFold results
- Visualize the best predicted model (PDB file)
- Analyze metrics

#### Example

The first example we are going to review is the complex `CcdB:CcdB:Gyr:Gyr`. Then you will use this Jupyter Notebook to analyze your own results.


### Output structure

ColabFold generates different file types during the prediction process:

- PDB files containing the 3D structure
  - **relaxed PDB**: PDB format text files containing the predicted structures after an Amber relaxation procedure has been applied to the unrelaxed structure predictions (see Jumper et al. 2021, Suppl. Methods 1.8.6 for details).
  - unrelaxed PDB: PDB format text files containing the predicted structures, exactly as output by each model.
- PAE matrices: Predicted Aligned Error matrix in PNG format
- pLDDT plot: Confidence metric plot in PNG format
- Logs in txt format
- Numpy arrays
- Alignment files: in A3M format
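
As a rough illustration of how these outputs can be told apart, the sketch below groups file names by type. The helpers `classify_output` and `group_outputs` are hypothetical, and the name patterns (`unrelaxed`, `relaxed`, `_PAE.png`, `_plddt.png`) are assumptions based on the example file names used later in this notebook; adjust them if your ColabFold version names its files differently.

```python
from collections import defaultdict


def classify_output(filename):
    """Return a coarse category for one output file name (assumed patterns)."""
    name = filename.lower()
    if name.endswith(".pdb"):
        # Check "unrelaxed" first, because it contains the substring "relaxed".
        if "unrelaxed" in name:
            return "unrelaxed PDB"
        return "relaxed PDB" if "relaxed" in name else "other PDB"
    if name.endswith("_pae.png"):
        return "PAE plot"
    if name.endswith("_plddt.png"):
        return "pLDDT plot"
    if name.endswith(".a3m"):
        return "alignment"
    if name.endswith(".txt"):
        return "log"
    return "other"


def group_outputs(filenames):
    """Group a list of file names by the category assigned above."""
    groups = defaultdict(list)
    for filename in filenames:
        groups[classify_output(filename)].append(filename)
    return dict(groups)
```

Once `example_dir` is defined, `group_outputs(os.listdir(example_dir))` would give a quick overview of what a run produced.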

In [None]:
import os, subprocess

base_dir = os.path.expandvars('$VSC_SCRATCH')
structbio_dir = os.path.join(base_dir, "structbio2022")
example_dir = os.path.join(structbio_dir, "day_01/examples/ccdb_ccdb_gyr_gyr")

In [None]:
# List the files that ColabFold wrote for the example complex
result = subprocess.run(['ls', example_dir], capture_output=True, text=True)

print(f"stdout (exit code={result.returncode}):")
print(result.stdout)

print(f"stderr (exit code={result.returncode}):")
print(result.stderr)

#### About ranked models

Monomer predictions are ranked by their predicted LDDT (`pLDDT`) scores (see Jumper et al. 2021, Suppl. Methods 1.9.6 for details), while complexes are ranked by the following weighted score:

```
iPtmScore * 0.8 + ptmScore * 0.2
```

In addition to the pTM score, which is another measure of predicted structure accuracy generated by AlphaFold, the AlphaFold-Multimer algorithm produces an interface pTM score (ipTM) that focuses on the interfaces between chains.
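
The weighting can be written as a small function. This is only a sketch of the formula above; the function name and the scores in `models` are made up for illustration and do not come from a real run.

```python
def multimer_ranking_score(iptm, ptm):
    """Weighted confidence for complexes: 0.8 * ipTM + 0.2 * pTM."""
    return 0.8 * iptm + 0.2 * ptm


# Hypothetical (ipTM, pTM) pairs for three models.
models = {
    "model_1": (0.62, 0.70),
    "model_2": (0.85, 0.80),
    "model_3": (0.85, 0.90),
}

# Highest score first: this order corresponds to rank 1, rank 2, ...
ranking = sorted(models, key=lambda m: multimer_ranking_score(*models[m]), reverse=True)
print(ranking)
```

Because ipTM carries four times the weight of pTM, two models with the same interface score are separated only by their overall pTM.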

In summary, we are going to work with:

- Ranked #1 models
- Relaxed models
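
One way to pick such a file automatically is to parse the rank out of the file name. The sketch below assumes the naming pattern `..._relaxed_rank_<N>_model_<M>.pdb` seen in the example used in this notebook; `best_relaxed_model` is a hypothetical helper, not part of ColabFold.

```python
import re

# Assumed file-name pattern, taken from the example file in this notebook.
RANK_PATTERN = re.compile(r"_relaxed_rank_(\d+)_model_(\d+)\.pdb$")


def best_relaxed_model(filenames):
    """Return the relaxed PDB file name with the lowest rank number, or None."""
    ranked = []
    for filename in filenames:
        match = RANK_PATTERN.search(filename)
        if match:
            # Compare ranks as integers so that rank 10 sorts after rank 2.
            ranked.append((int(match.group(1)), filename))
    return min(ranked)[1] if ranked else None
```

Unrelaxed files are skipped because the pattern requires an underscore directly before `relaxed`.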

## Dependencies
Running the following cell imports the required external libraries.

In [None]:
from IPython.display import Image
import ipywidgets as widgets
import math

import nglview as nv
from nglview.color import ColormakerRegistry

## Input file
The following cell sets the path to the predicted structure in PDB format.

In [None]:
pdb_filepath = os.path.join(structbio_dir, 'day_01/examples/ccdb_ccdb_gyr_gyr/multimer_ccdB_ccdB_gyrA_gyrA_relaxed_rank_1_model_3.pdb')

## Visualization
The following cell registers a pLDDT-based color scheme, which is used further down to visualize the structure in this notebook.

In [None]:
cm = ColormakerRegistry
cm.add_scheme_func('plddt','''
 this.atomColor = function (atom) {
     if (atom.bfactor < 50) {
       return 0xff7d45 // very low confidence
     } else if (atom.bfactor < 70) {
       return 0xffdb13 // low
     } else if (atom.bfactor < 90) {
       return 0x65cbf3 // high
     } else {
       return 0x0053d6 // very high
     }
 }
''')

### Visualize chains

In [None]:
view = nv.show_structure_file(pdb_filepath, default_representation=False)
view._remote_call("setSize", target="Widget", args=["1024px", "768px"])

view.add_cartoon("protein", color_scheme="chainid")
view.remove_spacefill()
view.add_surface(opacity=0.15)
view.center() # Center and zoom molecule
view

## Reviewing the metrics

### Confidence Score

AlphaFold produces a per-residue confidence score (pLDDT) between 0 and 100. Regions with pLDDT below 50 may be unstructured in isolation. The pLDDT value is stored in the B-factor field of the output PDB files. Unlike a B-factor, a higher pLDDT is better, so take care when using these files for tasks such as molecular replacement.
 
- **Very high** (pLDDT > 90)
- **Confident** (90 > pLDDT > 70)
- **Low** (70 > pLDDT > 50)
- **Very low** (pLDDT < 50)
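
Since pLDDT sits in the B-factor column, it can be read straight from the PDB text. The sketch below is a minimal example with hypothetical helper names; it assumes standard fixed-width PDB `ATOM` records and assigns boundary values (50, 70, 90) to the higher band, as the `plddt` color scheme defined earlier does.

```python
def plddt_band(plddt):
    """Map a pLDDT value to one of the confidence bands listed above."""
    if plddt < 50:
        return "very low"
    if plddt < 70:
        return "low"
    if plddt < 90:
        return "confident"
    return "very high"


def ca_plddt(pdb_lines):
    """Yield (chain, residue number, pLDDT) for every C-alpha ATOM record."""
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            # Fixed PDB columns: chain ID (22), residue number (23-26),
            # B-factor (61-66), which holds the pLDDT in these files.
            yield line[21], int(line[22:26]), float(line[60:66])


def band_counts(pdb_lines):
    """Count how many residues fall into each confidence band."""
    counts = {}
    for _chain, _resnum, plddt in ca_plddt(pdb_lines):
        band = plddt_band(plddt)
        counts[band] = counts.get(band, 0) + 1
    return counts
```

With the path defined above, `band_counts(open(pdb_filepath))` would summarise how much of the model falls into each band.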

In [None]:
image_path = os.path.join(structbio_dir, 'day_01/examples/ccdb_ccdb_gyr_gyr/multimer_ccdB_ccdB_gyrA_gyrA_plddt.png')
Image(filename=image_path)

In [None]:
view = nv.show_structure_file(pdb_filepath, default_representation=False)
view._remote_call("setSize", target="Widget", args=["1024px", "768px"])

view.add_cartoon("protein", color_scheme="plddt")
view.remove_spacefill()
view.center() # Center and zoom molecule
view

### Inter-domain accuracy

#### About the predicted aligned error (PAE) matrix

The colour at position (x, y) indicates AlphaFold’s expected position error at residue x, when the predicted and true structures are aligned on residue y. The shade of green indicates expected distance error in Ångströms.

* **Dark green** is good (low error)
* **Light green** is bad (high error)

<img src="https://alphafold.ebi.ac.uk/assets/img/pae-2.png" alt="drawing" style="width:500px;"/>

#### Notes
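ColabFold plots the PAE matrix with a different color scheme: instead of shades of green, it uses a blue-to-red scale, where blue indicates low error and red indicates high error.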
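
For a complex, the PAE matrix can also be summarised numerically by averaging it per pair of chains. The sketch below is an illustration under stated assumptions: `mean_pae_blocks` is a hypothetical helper, the matrix is assumed to be square with residues ordered chain after chain, and the toy values are invented. Loading the matrix from the ColabFold output is not shown.

```python
def mean_pae_blocks(pae, chain_lengths):
    """Average PAE for every (chain i, chain j) block of a square PAE matrix."""
    # First residue index of each chain, assuming chains are concatenated in order.
    starts = [sum(chain_lengths[:k]) for k in range(len(chain_lengths))]
    blocks = []
    for start_i, len_i in zip(starts, chain_lengths):
        row = []
        for start_j, len_j in zip(starts, chain_lengths):
            total = sum(pae[i][j]
                        for i in range(start_i, start_i + len_i)
                        for j in range(start_j, start_j + len_j))
            row.append(total / (len_i * len_j))
        blocks.append(row)
    return blocks


# Toy example: chain A has two residues, chain B has one.
toy_pae = [
    [1, 3, 20],
    [3, 1, 22],
    [18, 24, 1],
]
print(mean_pae_blocks(toy_pae, [2, 1]))
```

Low values in the diagonal blocks point to confident placement within a chain, while low values in the off-diagonal blocks suggest that the relative placement of two chains is predicted with confidence.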

In [None]:
image_path = os.path.join(structbio_dir, 'day_01/examples/ccdb_ccdb_gyr_gyr/multimer_ccdB_ccdB_gyrA_gyrA_PAE.png')

Image(filename=image_path)

## Task B

By now, your job should have finished successfully or be about to finish. We suggest that you:

1. Visualize the predicted structure (ranked model #1).
2. Discuss the quality of the prediction taking into account the pLDDT metric. 
3. Analyze the domains using the PAE matrix.


## Next steps

That is all for today from Hydra. Next week we will use these predicted structures as input for the MD simulations, aiming to answer the questions raised today!