<a href="https://colab.research.google.com/github/Kuhlman-Lab/ThermoMPNN-D/blob/main/ThermoMPNN-D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>**This is the Colab implementation of ThermoMPNN-D**</center>


<center><img src='https://drive.google.com/uc?export=view&id=1qXMpih7MLeZfRDZF9-iYSlL6SXEY3FdS'></center>

---

ThermoMPNN-D is an updated version of ThermoMPNN for predicting double point mutations. It was trained on an augmented version of the Megascale double mutant dataset. It is state-of-the-art at predicting stabilizing double mutations.

For convenience, we also provide a single-mutant ThermoMPNN model and an "additive" model that finds mutation pairs in a naive fashion by ignoring epistatic interactions. For details, see the [ThermoMPNN-D paper](https://doi.org/10.1101/2024.08.20.608844).
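
To make the "additive" idea concrete, here is a minimal sketch (not the notebook's actual implementation): the additive estimate for a double mutant is taken as the sum of the two single-mutant ddG predictions, with no coupling term. The `single_ddg` values and the `additive_ddg` helper are hypothetical and exist only for illustration.

```python
from itertools import combinations

# Hypothetical single-mutant ddG predictions (kcal/mol); negative = stabilizing.
single_ddg = {"A10V": -0.4, "K25L": -0.3, "D31N": 0.2}


def additive_ddg(mut1, mut2, table):
    """Additive estimate: sum the two single-mutant values, ignoring epistasis."""
    return table[mut1] + table[mut2]


# Score every unordered pair of the example mutations.
pairs = {
    f"{m1}:{m2}": round(additive_ddg(m1, m2, single_ddg), 3)
    for m1, m2 in combinations(single_ddg, 2)
}
print(pairs)
```

An epistatic model differs in that the predicted value for a pair is not constrained to equal this sum.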

### **COLAB TIPS:**
- The cells of this notebook are meant to be executed *in order*, so users should start from the top and work their way down.
- Executable cells can be run by clicking the PLAY button (>) that appears when you hover over each cell, or by using **Shift+Enter**.
- Make sure GPU is enabled by checking `Runtime` -> `Change Runtime Type`
  - Make sure that `Runtime type` is set to `Python 3`
  - Make sure that `Hardware accelerator` is set to `GPU`
  - Click `Save` to confirm

- If the notebook freezes up or otherwise crashes, go to `Runtime` -> `Restart Runtime` and try again.
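
If you are unsure whether the GPU setting took effect, a quick check along the following lines may help. This is a sketch that assumes PyTorch is importable (it typically is on Colab); `describe_accelerator` is a hypothetical helper name and is not part of ThermoMPNN-D.

```python
def describe_accelerator():
    """Return a short description of the available compute device."""
    try:
        import torch  # typically preinstalled on Colab; may be absent elsewhere
    except ImportError:
        return "PyTorch is not installed in this environment."
    if torch.cuda.is_available():
        return f"GPU detected: {torch.cuda.get_device_name(0)}"
    return "No GPU detected; computations would run on the CPU."


print(describe_accelerator())
```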


In [1]:
%%capture

#@title # 1. Set up **ThermoMPNN environment**
#@markdown Import ThermoMPNN and its dependencies to this session. This may take a minute or two.

#@markdown You only need to do this once *per session*. To re-run ThermoMPNN on a new protein, you may start on Step 3.

#@markdown ---

# cleaning out any remaining data
# `!cd` runs in a throwaway subshell; the `%cd` magic changes the notebook's working directory
%cd /content
!rm -rf /content/ThermoMPNN-D
!rm -rf /content/sample_data
!rm /content/*.pdb
!rm /content/*.csv

# import ThermoMPNN-D github repo
import os
if not os.path.exists("/content/ThermoMPNN-D"):
  !git clone https://github.com/Kuhlman-Lab/ThermoMPNN-D.git
  %cd /content/ThermoMPNN-D

# downloading various dependencies - add more if needed later
! pip install omegaconf wandb pytorch-lightning biopython


In [8]:
%%capture
#@title # **2. Set up ThermoMPNN imports and functions**

import os
import sys
from urllib import request
from urllib.error import HTTPError

import numpy as np  # needed by drop_cysteines below

from google.colab._message import MessageError
from google.colab import files


tMPNN_path = '/content/ThermoMPNN-D'
if tMPNN_path not in sys.path:
  sys.path.append(tMPNN_path)


def download_pdb(pdbcode, datadir, downloadurl="https://files.rcsb.org/download/"):
    """
    Downloads a PDB file from the Internet and saves it in a data directory.
    :param pdbcode: The standard PDB ID e.g. '3ICB' or '3icb'
    :param datadir: The directory where the downloaded file will be saved
    :param downloadurl: The base PDB download URL, cf.
        `https://www.rcsb.org/pages/download/http#structures` for details
    :return: the full path to the downloaded PDB file or None if something went wrong
    """

    pdbfn = pdbcode + ".pdb"
    url = downloadurl + pdbfn
    outfnm = os.path.join(datadir, pdbfn)
    try:
        request.urlretrieve(url, outfnm)
        return outfnm
    except Exception as err:
        print(str(err), file=sys.stderr)
        return None

def drop_cysteines(df, mode):
  """Drop any mutations to Cys"""

  if mode.lower() == 'single':
    aatype_to = df['Mutation'].str[-1].values
    is_cys = aatype_to == "C"
    df = df.loc[~is_cys].reset_index(drop=True)

  elif mode.lower() == 'additive' or mode.lower() == 'epistatic':
    muts = df['Mutation'].str.split(':', n=2, expand=True).values # [N, 2]
    is_cys = []
    for m in muts:
      mut1, mut2 = m
      is_cys.append(mut1.endswith("C") or mut2.endswith("C"))

    is_cys = np.array(is_cys)
    df = df.loc[~is_cys].reset_index(drop=True)
  else:
    raise ValueError(f"Invalid mode {mode} selected!")
  return df
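
The `download_pdb` helper above builds its URL by appending `<code>.pdb` to the base download URL. As a hedged illustration, the sketch below validates the code before any network request is made. `build_pdb_url` is a hypothetical helper, and the pattern assumes classic four-character PDB IDs (a digit followed by three alphanumeric characters).

```python
import re

RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/"  # same default as download_pdb


def build_pdb_url(pdbcode, downloadurl=RCSB_DOWNLOAD_URL):
    """Validate a classic 4-character PDB ID and return its download URL."""
    code = pdbcode.strip()
    # Assumption: classic IDs are one digit followed by three alphanumerics.
    if not re.fullmatch(r"[0-9][A-Za-z0-9]{3}", code):
        raise ValueError(f"'{pdbcode}' does not look like a 4-character PDB ID.")
    return downloadurl + code + ".pdb"


print(build_pdb_url("1vii"))  # https://files.rcsb.org/download/1vii.pdb
```

Checking the code up front turns a typo into an immediate, readable error instead of a failed download.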


In [3]:
# %%capture
#@title # **3. Upload or Fetch Input Data**

#@markdown ## You may either specify a PDB code to fetch or upload a custom PDB file.<br><br>

# -------- Collecting Settings for ThermoMPNN run --------- #

!rm /content/*.pdb &> /dev/null

#@markdown PDB code (example: 1PGA):
PDB = "1vii" #@param {type: "string"}

#@markdown Upload Custom PDB?
Custom = False #@param {type: "boolean"}
#@markdown NOTE: If enabled, a `Choose files` button will appear at the bottom of this cell once this cell is run.

#@markdown Chain(s) of Interest (example: A,B,C):
Chains = "" #@param {type:"string"}
#@markdown If left empty, all chains will be used.

# try to upload the PDB file to Colab servers
if Custom:
  try:
    uploaded_pdb = files.upload()
    for fn in uploaded_pdb.keys():
      PDB = os.path.basename(fn)
      if not PDB.endswith('.pdb'):
        raise ValueError(f"Uploaded file {PDB} does not end in '.pdb'. Please check and rename file as needed.")
      os.rename(fn, os.path.join("/content/", PDB))
      pdb_file = os.path.join("/content/", PDB)
  except (MessageError, FileNotFoundError):
    print('\n', '*' * 100, '\n')
    print('Sorry, your input file failed to upload. Please try the backup upload procedure (next cell).')

else:
  # download_pdb prints any fetch error itself and returns None on failure
  fn = download_pdb(PDB, "/content/")
  if fn is None:
    raise ValueError(f"Failed to fetch PDB {PDB} from RCSB. Please double-check the PDB code and try again.")
  pdb_file = fn


In [None]:
#@title # **3. Backup Data Upload (ONLY needed if initial upload failed)**

#@markdown ## Colab automatic file uploads are not very reliable. If your file failed to upload automatically, you can do so manually by following these steps.<br><br>

#@markdown #### 1. Click the "Files" icon on the left toolbar. This will open the Colab server file folder.

#@markdown #### 2. The only thing in this folder should be the "ThermoMPNN-D" directory. If any other files are in here, delete them.

#@markdown #### 3. Click the "Upload to session storage" button under the "Files" header. Choose your file for upload.

#@markdown #### 4. Run this cell. ThermoMPNN will find your file in session storage and use it.


#@markdown Chain(s) of Interest (example: A,B,C):
Chains = "" #@param {type:"string"}
#@markdown If left empty, all chains will be used.

PDB = ""

# use a distinct name so the `files` module imported from google.colab is not shadowed
pdb_files = sorted(f for f in os.listdir('/content/') if f.endswith('.pdb'))

if len(pdb_files) < 1:
  raise ValueError('No PDB file found. Please upload your file before running this cell. Make sure it has a .pdb suffix.')
elif len(pdb_files) > 1:
  raise ValueError('Too many PDB files found. Please clear out any other PDBs before running this cell.')
else:
  pdb_file = os.path.join("/content/", pdb_files[0])
  PDB = pdb_files[0].removesuffix('.pdb')
  print('Successfully uploaded PDB file %s' % (pdb_files[0]))

Successfully uploaded PDB file 1bvc.pdb


In [4]:
#@markdown # **4. Run Model**

#@markdown Stability model to use:
Model = "Additive" #@param ["Epistatic", "Additive", "Single"]

#@markdown ##### Model descriptions:
#@markdown * Single: Single mutation SSM sweep. Very fast and accurate.
#@markdown * Additive: Naive double mutation SSM sweep. Ignores non-additive coupling. Very fast but less accurate than Epistatic model for picking stabilizing mutations.
#@markdown * Epistatic: Full double mutation SSM sweep. Slower than Additive model, but more accurate for picking stabilizing mutations.

#@markdown ---------------

#@markdown Allow mutations to cysteine? (Not recommended)
Include = False #@param {type: "boolean"}
#@markdown Due to assay artifacts surrounding disulfide formation, model predictions for cysteine mutations may be overly favorable.

#@markdown ---------------

#@markdown Explicitly penalize disulfide breakage? (Recommended)
Penalize = True #@param {type: "boolean"}

#@markdown ThermoMPNN can usually detect disulfide breakage and penalize accordingly, but you may wish to explicitly forbid disulfide breakage to be safe. This option applies a flat penalty to make sure that breaking disulfides is always disfavored.

#@markdown --------------

#@markdown Batch size for model inference. (Recommended: 256 for Single/Additive models, 2048 for epistatic models)
BatchSize = 256 #@param {type: "integer"}
#@markdown If you hit a memory error, try lowering the BatchSize by factors of 2 to reduce memory usage.

#@markdown --------------

#@markdown Threshold for detecting stabilizing mutations. (Recommended: -1.0)
Threshold = -0.5 #@param {type: "number"}
#@markdown Only mutations with predicted ddG below this value will be kept for analysis. Higher thresholds will result in retaining more mutations.

#@markdown --------------

#@markdown Pairwise distance constraint for double mutants. (Recommended: 5.0)
Distance = 5.0 #@param {type: "number"}
#@markdown Only mutation pairs within this distance (in Angstrom) will be kept for analysis. Higher cutoffs will result in slower runtime and retaining more mutations.


# parse the comma-separated Chains string into a list of chain IDs
# (an empty list means that all chains will be used)
chain_list = [c.strip() for c in Chains.split(',') if c.strip()]
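
The `Threshold` and `Distance` settings described above act as two independent cutoffs. The following sketch illustrates the intended effect on made-up data. It is not the filtering code used by ThermoMPNN-D: `keep_candidates` and the example rows are hypothetical, and using the CA-CA distance is an assumption based on the output column name.

```python
# Hypothetical (mutation, predicted ddG in kcal/mol, CA-CA distance in Angstrom) rows.
candidates = [
    ("AA1V:KA2L", -0.8, 3.8),
    ("AA1V:DA9N", -0.9, 7.2),  # stabilizing, but the residues are too far apart
    ("KA2L:DA9N", -0.2, 4.5),  # close together, but not stabilizing enough
]


def keep_candidates(rows, threshold=-0.5, distance=5.0):
    """Keep rows with ddG below `threshold` and CA-CA distance within `distance`."""
    return [
        (mut, ddg, dist)
        for mut, ddg, dist in rows
        if ddg < threshold and dist <= distance
    ]


kept = keep_candidates(candidates)
print(kept)  # [('AA1V:KA2L', -0.8, 3.8)]
```

Raising either cutoff retains more pairs; for instance, `keep_candidates(candidates, threshold=0.0)` also keeps the third row.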

In [14]:
#@title # **Run SSM Inference**

import pandas as pd
import numpy as np

from thermompnn.ssm_utils import (
    distance_filter,
    disulfide_penalty,
    get_config,
    get_dmat,
    get_model,
    load_pdb,
    renumber_pdb,
)
from v2_ssm import (
    run_single_ssm,
    run_epistatic_ssm,
    format_output_single,
    format_output_double,
    check_df_size,
)

# ------------ MAIN INFERENCE ROUTINE -------------- #

mode = Model.lower()
pdb = pdb_file
chains = chain_list
threshold = Threshold
distance = Distance
batch_size = BatchSize
ss_penalty = Penalize

cfg = get_config(mode)
cfg.platform.thermompnn_dir = '/content/ThermoMPNN-D'
model = get_model(mode, cfg)
pdb_data = load_pdb(pdb, chains)
pdbname = os.path.basename(pdb)
print(f"Loaded PDB {pdbname}")

if (mode == "single") or (mode == "additive"):
  ddg, S = run_single_ssm(pdb_data, cfg, model)

  if mode == "single":
    ddg, mutations = format_output_single(ddg, S, threshold)
  else:
    ddg, mutations = format_output_double(
      ddg, S, threshold, pdb_data, distance
    )

elif mode == "epistatic":
  ddg, mutations = run_epistatic_ssm(
    pdb_data, cfg, model, distance, threshold, batch_size
  )

else:
  raise ValueError("Invalid mode selected!")

df = pd.DataFrame({"ddG (kcal/mol)": ddg, "Mutation": mutations})

check_df_size(df.shape[0])

if mode != "single":
  df = distance_filter(df, pdb_data, distance)

if ss_penalty:
  df = disulfide_penalty(df, pdb_data, mode)

if not Include:
  df = drop_cysteines(df, mode)

df = df.dropna(subset=["ddG (kcal/mol)"])
if threshold <= 0.0:
  df = df.sort_values(by=["ddG (kcal/mol)"])

if mode != "single":  # sort to have neat output order
  df[["mut1", "mut2"]] = df["Mutation"].str.split(":", n=2, expand=True)
  df["pos1"] = df["mut1"].str[1:-1].astype(int) + 1
  df["pos2"] = df["mut2"].str[1:-1].astype(int) + 1

  df = df.sort_values(by=["pos1", "pos2"])
  df = df[["ddG (kcal/mol)", "Mutation", "CA-CA Distance"]].reset_index(drop=True)

check_df_size(df.shape[0])

try:
  df = renumber_pdb(df, pdb_data, mode)

except (KeyError, IndexError):
  print(
    "PDB renumbering failed (sorry!) You can still use the raw position data. Or, you can renumber your PDB, fill any weird gaps, and try again."
  )




Loading model %s /content/ThermoMPNN-D/vanilla_model_weights/v_48_020.pt
setting ProteinMPNN dropout: 0.0
MLP HIDDEN SIZES: [384, 64, 32, 21]
Loaded PDB 1vii.pdb
ThermoMPNN single mutant predictions generated for protein of length 36 in 0.02 seconds.


630it [00:00, 255009.80it/s]


ThermoMPNN double mutant additive model predictions calculated in 0.01 seconds.


570it [00:00, 731117.21it/s]

Distance matrix generated.
Identified the following disulfide engaged residues: []
ThermoMPNN predictions renumbered.





In [10]:
#@title **Visualize data in an interactive table**
from google.colab import data_table

data_table.enable_dataframe_formatter()
data_table.DataTable(df, include_index=True, num_rows_per_page=10)

| | ddG (kcal/mol) | Mutation | CA-CA Distance |
|---|---|---|---|
| 0 | -0.570106 | EA45A:DA46Q | 3.84 |
| 1 | -0.798509 | RA55W:SA56E | 3.82 |
| 2 | -0.795915 | RA55Y:SA56E | 3.82 |
| 3 | -0.704363 | RA55W:SA56A | 3.82 |
| 4 | -0.701769 | RA55Y:SA56A | 3.82 |
| ... | ... | ... | ... |
| 469 | -0.531596 | KA70L:KA73W | 4.74 |
| 470 | -0.512085 | KA70V:KA73L | 4.74 |
| 471 | -0.508915 | KA70I:KA73W | 4.74 |
| 472 | -0.505712 | KA70F:KA73A | 4.74 |
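
Each entry in the `Mutation` column packs four fields into one string: wildtype residue, chain ID, position, and mutant residue, with `:` joining the two halves of a double mutant. The sketch below decodes such labels using the same character offsets as the verbose-output code in this notebook. `PointMutation` and `parse_mutation` are hypothetical names, and the sketch assumes single-character chain IDs and integer positions (no insertion codes).

```python
from typing import NamedTuple


class PointMutation(NamedTuple):
    wildtype: str
    chain: str
    position: int
    mutant: str


def parse_mutation(label):
    """Parse 'EA45A' or 'EA45A:DA46Q' into a list of PointMutation records."""
    return [
        PointMutation(m[0], m[1], int(m[2:-1]), m[-1])
        for m in label.split(":")
    ]


print(parse_mutation("EA45A:DA46Q"))
```

For example, `EA45A` reads as glutamate 45 on chain A mutated to alanine.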


In [16]:
#@title # **Save Output as CSV**

# ---------- Collect output into DF and save as CSV ---------- #
from google.colab import files

#@markdown Specify prefix for file saving (e.g., MyProtein). Leave blank to use input PDB code.
PREFIX = "example" #@param {type:"string"}

#@markdown NOTE: If you wish to retrieve your files manually, you may do so in the **Files** tab in the leftmost toolbar.

#@markdown NOTE: Make sure you click "Allow" if your browser asks to permit downloads at this step.

#@markdown Verbose output? This means saving more individual columns.
VERBOSE = False #@param {type: "boolean"}

df['ddG (kcal/mol)'] = df['ddG (kcal/mol)'].round(4)

if len(PREFIX) < 1:
  PREFIX = os.path.splitext(pdb_file)[0]  # strip only the final extension
else:
  PREFIX = os.path.join('/content/', PREFIX)

full_fname = PREFIX + '.csv'

if VERBOSE:
  if Model == 'Single':
    df['Wildtype AA'] = df['Mutation'].str[0]
    df['Mutant AA'] = df['Mutation'].str[-1]
    df['Position'] = df['Mutation'].str[2:-1]
    df['Chain'] = df['Mutation'].str[1]

  else:
    df[['Mutation 1', 'Mutation 2']] = df['Mutation'].str.split(':', n=2, expand=True)
    df['Wildtype AA 1'], df['Wildtype AA 2'] = df['Mutation 1'].str[0], df['Mutation 2'].str[0]
    df['Mutant AA 1'], df['Mutant AA 2'] = df['Mutation 1'].str[-1], df['Mutation 2'].str[-1]
    df['Position 1'], df['Position 2'] = df['Mutation 1'].str[2:-1], df['Mutation 2'].str[2:-1]
    df['Chain 1'], df['Chain 2'] = df['Mutation 1'].str[1], df['Mutation 2'].str[1]

df.to_csv(full_fname, index=True)
files.download(full_fname)
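
Once downloaded, the CSV can be post-processed with any tool. As a minimal standard-library sketch, the snippet below ranks rows by predicted ddG. `most_stabilizing` is a hypothetical helper, and the example rows are copied from the sample table in this notebook, rounded to four decimals as in the export.

```python
import csv
import io


def most_stabilizing(csv_text, n=3):
    """Return the n rows with the lowest predicted ddG from exported CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row["ddG (kcal/mol)"]))
    return [(row["Mutation"], float(row["ddG (kcal/mol)"])) for row in rows[:n]]


# Rows in the exported layout (the first, unnamed column is the DataFrame index).
example = """,ddG (kcal/mol),Mutation,CA-CA Distance
0,-0.5701,EA45A:DA46Q,3.84
1,-0.7985,RA55W:SA56E,3.82
2,-0.7959,RA55Y:SA56E,3.82
"""
print(most_stabilizing(example, n=2))
# [('RA55W:SA56E', -0.7985), ('RA55Y:SA56E', -0.7959)]
```

To use it on a real export, read the file first, e.g. `most_stabilizing(open("example.csv").read())`, where the file name depends on the prefix you chose.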



# APPENDIX

## License

The source code for ThermoMPNN-D, including license information, can be found [here](https://github.com/Kuhlman-Lab/ThermoMPNN-D).

## Citation Information

If you use ThermoMPNN-D in your research, please cite the following paper(s):

### Epistatic or Additive model:
Dieckhaus, H., Kuhlman, B., *Protein stability models fail to capture epistatic interactions of double point mutations*. **2024**, bioRxiv, doi: https://doi.org/10.1101/2024.08.20.608844.

### Single mutant model:
Dieckhaus, H., Brocidiacono, M., Randolph, N., Kuhlman, B. *Transfer learning to leverage larger datasets for improved prediction of protein stability changes.* Proc Natl Acad Sci **2024**, 121(6), e2314853121, doi: https://doi.org/10.1073/pnas.2314853121.

## Contact Information

Please contact Henry Dieckhaus at dieckhau@unc.edu to report any bugs or issues with this notebook. You may also submit issues on the ThermoMPNN-D GitHub page [here](https://github.com/Kuhlman-Lab/ThermoMPNN-D/issues).
