# Protein generative models evaluation metrics
## Environment Preparation
### Conda Environment

You can set up the conda environment by running the following command:

In [None]:
conda env create -f fm.yml

You also need to install the following packages:

In [None]:
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.0.0+cu118.html
pip install fair-esm
conda install -c conda-forge -c bioconda foldseek
pip install jupyter-mermaid

Use this command to enable the Mermaid extension for Jupyter Notebook:

In [None]:
jupyter nbextension enable --py jupyter_mermaid

### Foldseek database

The novelty metric is computed by searching the Foldseek PDB database.

In [None]:
conda install -c conda-forge -c bioconda foldseek
mkdir ./foldseek_database
cd ./foldseek_database
foldseek databases PDB pdb tmp 

Foldseek will download the PDB database automatically. After the download, your directory should look like this:

```
foldseek_database
    ├── pdb
    ├── pdb_ca
    ├── pdb_ca.dbtype
    ├── pdb_ca.index
    ├── pdb_clu
    ├── pdb_clu.dbtype
    ├── pdb_clu.index
    ├── pdb.dbtype
    ├── pdb_h
    ├── pdb_h.dbtype
    ├── pdb_h.index
    ├── pdb.index
    ├── pdb.lookup
    ├── pdb_mapping
    ├── pdb_seq.0 -> pdb
    ├── pdb_seq.1
    ...
```

After downloading the Foldseek database, set the `foldseek_database` field in `configs/evaluation.yaml` to the database path.
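
Before pointing the config at the database, it can help to confirm that the download is complete. The sketch below is a hypothetical helper (`missing_foldseek_files` is not part of this repository) that checks for a few of the files listed in the tree above. Which files Foldseek strictly requires is an assumption; the subset is only a sanity check.

```python
from pathlib import Path

# A subset of the files shown in the directory listing above.
# Which files Foldseek strictly requires is an assumption here.
EXPECTED_FILES = ("pdb", "pdb.dbtype", "pdb.index", "pdb_ca", "pdb_h")


def missing_foldseek_files(db_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from db_dir."""
    db_dir = Path(db_dir)
    return [name for name in expected if not (db_dir / name).exists()]


if __name__ == "__main__":
    missing = missing_foldseek_files("./foldseek_database")
    print("Database looks complete" if not missing else f"Missing: {missing}")
```

An empty list means every checked file was found.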

### Maxcluster

We use maxcluster to cluster the designed proteins by structure.

In [None]:
wget https://www.sbg.bio.ic.ac.uk/maxcluster/maxcluster64bit
chmod +x maxcluster64bit

### Example data

We provide some example data in `./example_data` for testing purposes.

```
└── length_70
    ├── sample_0
    │   ├── bb_traj.pdb
    │   ├── sample.pdb
    │   └── x0_traj.pdb
    ├── sample_1
    │   ├── bb_traj.pdb
    │   ├── sample.pdb
    │   └── x0_traj.pdb
```
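
To collect the structures in this layout programmatically, a small helper such as the one below may be enough. It is a sketch that assumes every sample follows the `length_*/sample_*/sample.pdb` pattern shown above; `find_sample_pdbs` is a hypothetical name, not a function from this repository.

```python
from pathlib import Path


def find_sample_pdbs(root):
    """Return the sorted paths of every sample.pdb under root/length_*/sample_*/."""
    return sorted(Path(root).glob("length_*/sample_*/sample.pdb"))


if __name__ == "__main__":
    for pdb_file in find_sample_pdbs("./example_data"):
        print(pdb_file)
```

The paths are sorted lexicographically, so `sample_10` is listed before `sample_2`.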



### ProteinMPNN

We can use the ProteinMPNN model to design a sequence for a given structure. 

In [None]:
git clone https://github.com/dauparas/ProteinMPNN.git

## Evaluation

We provide two ways to evaluate the performance of the model: 

1. Single pdb evaluation: calculate the metrics for one PDB file.

2. Batch evaluation: calculate the metrics for a batch of PDB files whose paths are listed in a CSV file.

### Single pdb evaluation

In [None]:
# Import packages
import os
import time
import numpy as np
import hydra
import torch
import subprocess
import logging
import pandas as pd
import shutil
from datetime import datetime
from biotite.sequence.io import fasta
import GPUtil
from typing import Optional, Union, List
from analysis import utils as au
from analysis import metrics
from data import utils as du
from omegaconf import DictConfig, OmegaConf
from openfold.data import data_transforms
import esm
from pathlib import Path
import mdtraj as md
from openfold.np import residue_constants
from tmtools import tm_align
from openfold.utils.superimposition import superimpose
from tqdm import tqdm
import re

In [None]:
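from hydra import compose, initialize
from EvalRunner import EvalRunner

# Load the config with Hydra's Compose API so that EvalModel is
# available to the cells below.
with initialize(version_base=None, config_path="configs"):
    conf = compose(config_name="evaluation")

EvalModel = EvalRunner(conf)
# EvalModel.calc_all_metrics(sc_output_dir, pdb_path)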


#### Self consistency metrics

```mermaid

graph TD;
    A["Protein generative model"] --> B["ProteinMPNN (inverse folding)"];
    B --> C["ESMFold (folding)"];

```
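
The diagram stops at ESMFold; the refolded structures are then typically compared with the generated backbone. How `calc_designability` aggregates those comparisons is not shown here, so the following is only a sketch under a common convention: keep the lowest self-consistency RMSD across the designed sequences and call the sample designable when it falls below 2 Å. The function name, the input values, and the threshold are all assumptions.

```python
def summarize_designability(sc_rmsds, threshold=2.0):
    """Summarize self-consistency RMSDs (in Angstrom) for one generated backbone.

    sc_rmsds holds one value per ProteinMPNN sequence after refolding.
    The 2.0 Angstrom threshold is an assumed convention, not taken from this repo.
    """
    if not sc_rmsds:
        raise ValueError("at least one scRMSD value is required")
    best = min(sc_rmsds)
    return {"best_sc_rmsd": best, "designable": best < threshold}


if __name__ == "__main__":
    # Hypothetical scRMSD values for three designed sequences.
    print(summarize_designability([3.1, 1.4, 2.7]))
```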


In [None]:
# Example pdb path
pdb_path = "/home/shuaikes/Project/protein-evaluation-notebook/example_data/length_70/sample_0/"
sc_output_dir = os.path.join(pdb_path, "self_consistency")
os.makedirs(sc_output_dir, exist_ok=True)

sc_results = EvalModel.calc_designability(sc_output_dir, pdb_path)
sc_results

#### Sub-structure ratio evaluation

In [None]:
path = os.path.join(pdb_path, "sample.pdb")
sub_ratio = EvalModel.calc_mdtraj_metrics(path)
sub_ratio
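
The exact contents of `sub_ratio` are not shown here. As an illustration of what a sub-structure ratio means, the sketch below computes helix, strand, and coil fractions from a sequence of simplified DSSP codes. With `simplified=True`, mdtraj's `compute_dssp` reports one of `H`, `E`, or `C` per protein residue. The function name and the example string are assumptions.

```python
from collections import Counter


def secondary_structure_fractions(dssp_codes):
    """Return helix, strand, and coil fractions of a simplified DSSP sequence."""
    # Keep only the three simplified codes; anything else (e.g. "NA") is ignored.
    codes = [c for c in dssp_codes if c in ("H", "E", "C")]
    if not codes:
        raise ValueError("no H/E/C codes found")
    counts = Counter(codes)
    total = len(codes)
    return {
        "helix": counts["H"] / total,
        "strand": counts["E"] / total,
        "coil": counts["C"] / total,
    }


if __name__ == "__main__":
    # Hypothetical 10-residue assignment: 4 helix, 2 strand, 4 coil.
    print(secondary_structure_fractions("HHHHEECCCC"))
```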

#### Novelty: pdbTM

In [None]:
path = os.path.join(pdb_path, "sample.pdb")
value = EvalModel.pdbTM(pdb_path)
value
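
pdbTM is usually taken to be the highest TM-score between a generated structure and any structure in the PDB, so a lower value indicates a more novel structure. How `EvalModel.pdbTM` invokes Foldseek and parses its output is not shown here. The following is a minimal sketch, assuming the search hits were written to a tab-separated file whose columns are query, target, and TM-score; the column layout and the function name are assumptions.

```python
import csv


def max_tm_score(tsv_path, score_column=2):
    """Return (best_target, best_score) from a tab-separated hit table.

    Returns None when the table has no rows. The column order
    (query, target, TM-score) is an assumed layout.
    """
    best = None
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue
            score = float(row[score_column])
            if best is None or score > best[1]:
                best = (row[1], score)
    return best
```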

#### Diversity: number of clusters

Diversity cannot be calculated from a single protein structure because it is defined over clusters of structures. Please refer to the `Batch evaluation` section.
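
One common way to report this kind of diversity is the number of structural clusters divided by the number of samples. Whether the batch evaluation uses exactly this definition is not confirmed here; the sketch below, with a hypothetical function name and made-up cluster labels, only illustrates the idea.

```python
def cluster_diversity(cluster_labels):
    """Return the number of distinct clusters divided by the number of samples."""
    if not cluster_labels:
        raise ValueError("at least one cluster label is required")
    return len(set(cluster_labels)) / len(cluster_labels)


if __name__ == "__main__":
    # Four hypothetical samples assigned to three clusters.
    print(cluster_diversity([0, 0, 1, 2]))  # 0.75
    # A single sample always forms one cluster, so the ratio is uninformative.
    print(cluster_diversity([0]))  # 1.0
```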

#### Calculate all metrics

In [None]:
EvalModel.calc_all_metrics(sc_output_dir, pdb_path)

### Batch evaluation

In [None]:
pdb_csv_path = "/home/shuaikes/Project/protein-evaluation-notebook/pdb_path.csv"
sc_results, sub_ratio, novelty_result, clusters = EvalModel.calc_metrics_from_csv(
    pdb_csv_path
)
sc_results, sub_ratio, novelty_result, clusters
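
The layout expected for the CSV file passed above is not documented here. As an illustration only, the sketch below writes one path per row under a hypothetical `pdb_path` column; check `calc_metrics_from_csv` for the column name it actually reads. The function name `write_pdb_csv` is also an assumption.

```python
import csv
from pathlib import Path


def write_pdb_csv(pdb_paths, csv_path, column="pdb_path"):
    """Write one PDB path per row under a single header column."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])  # hypothetical header name
        for path in pdb_paths:
            writer.writerow([str(Path(path))])
    return csv_path
```

For example, `write_pdb_csv(["example_data/length_70/sample_0/sample.pdb"], "pdb_path.csv")` produces a two-line file: the header and one path.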