[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RecognitionAnalytics/NovoSmithy/blob/main/ProteinLinker.ipynb)



🧱 Designing the Initial Structure
Before using the Protein Segment Stitcher, it's recommended to create an initial scaffold by assembling relevant structural fragments:

Start with the RCSB PDB
Download one or more protein structures from the RCSB Protein Data Bank that are similar to the target protein you're designing.

Use ChimeraX for Structural Design
Tools like ChimeraX allow you to:

Position chains into desired orientations or conformations.

Delete unwanted segments or extraneous domains.

Assemble new architectures by combining parts of different proteins.

Manually inspect and edit chain connectivity, clashes, and spatial fit.

Export Your Designer Structure
Once you've created a rough design (with intended connectivity but likely chain breaks), export the structure as a PDB file. This becomes the input for the Segment Stitcher, which will intelligently bridge the broken regions with polyG linkers.
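Before handing the exported file to the stitcher, it can help to list where the chain breaks actually are. The sketch below is not part of NovoSmithy. It reads PDB ATOM records with plain fixed-column parsing and reports consecutive Cα atoms in the same chain that sit farther apart than a cutoff. The function names, the `make_ca_line` toy helper, and the 4.2 Å cutoff (adjacent Cα atoms are roughly 3.8 Å apart) are assumptions made for this illustration.

```python
import math

def ca_atoms(pdb_lines):
    """Yield (chain_id, residue_number, (x, y, z)) for each C-alpha ATOM record."""
    for line in pdb_lines:
        # PDB is fixed-width: atom name in columns 13-16, chain in 22,
        # residue number in 23-26, coordinates in 31-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[21], int(line[22:26]), xyz

def find_breaks(pdb_lines, max_ca_ca=4.2):
    """Return (chain, residue_before, residue_after, gap_in_angstrom) for each break."""
    breaks = []
    prev = None
    for chain, resnum, xyz in ca_atoms(pdb_lines):
        if prev is not None and prev[0] == chain:
            gap = math.dist(prev[2], xyz)
            if gap > max_ca_ca:
                breaks.append((chain, prev[1], resnum, round(gap, 1)))
        prev = (chain, resnum, xyz)
    return breaks

def make_ca_line(serial, resnum, x):
    """Hypothetical helper: build a toy glycine C-alpha record on the x axis."""
    return "ATOM  %5d  CA  GLY A%4d    %8.3f%8.3f%8.3f" % (serial, resnum, x, 0.0, 0.0)

# Toy chain: residues 1-3 are contiguous, residue 10 follows after a gap.
toy_lines = [make_ca_line(1, 1, 0.0), make_ca_line(2, 2, 3.8),
             make_ca_line(3, 3, 7.6), make_ca_line(4, 10, 20.0)]
breaks = find_breaks(toy_lines)
print(breaks)
```

For a real file, pass `open(path).read().splitlines()` as `pdb_lines`. Every reported gap is a candidate site for a linker.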

LigandMPNN must be downloaded and set up before it can be used. This takes a while (5-10 min). Run the code in the following cell and wait for it to finish completely. Colab will show a popup, but do not click OK until the cell has finished. Once you have restarted the session, you can use Run All as normal.


In [1]:
# @title
import sys
import os
from IPython.display import clear_output

# Clone the NovoSmithy repository unless it is already on disk
if not os.path.isdir('/content/NovoSmithy'):
  !git clone https://github.com/RecognitionAnalytics/NovoSmithy.git /content/NovoSmithy
if '/content/NovoSmithy' not in sys.path:
  sys.path.append('/content/NovoSmithy')

# Clone LigandMPNN, download its model weights, and install its requirements
if not os.path.isdir('/content/LigandMPNN'):
  !git clone https://github.com/dauparas/LigandMPNN.git /content/LigandMPNN
  %cd /content/LigandMPNN
  !bash get_model_params.sh "./model_params"

  # Optional: set up a conda (or other) environment first
  # conda create -n ligandmpnn_env python=3.11
  !pip3 install -r requirements.txt
  %cd /content
if '/content/LigandMPNN' not in sys.path:
  sys.path.append('/content/LigandMPNN')

clear_output()
print("The environment has been set up for LigandMPNN. Please restart the session by clicking 'Runtime' and then 'Restart session'.")


The environment has been set up for LigandMPNN. Please restart the session by clicking 'Runtime' and then 'Restart session'.


In [1]:
# @title Run again after Restart
import sys
import os
!pip install biopython
!pip install py3Dmol
if '/content/NovoSmithy' not in sys.path:
  sys.path.append('/content/NovoSmithy')
if '/content/LigandMPNN' not in sys.path:
  sys.path.append('/content/LigandMPNN')

from ChainStitcher import ChainStich, PDBLoader, PDBSaver, PDBViewer
from ChainStitcher import ClearFolder, load_fasta, CreateSequence,MoveChainSection,RotateChainSection
import random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import numpy as np
from scipy.optimize import minimize
import glob
import json
from Bio.PDB import *
from scipy.spatial.transform import Rotation
import shutil
from IPython.display import clear_output
#from google.colab import drive

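def ClearFolder(inputFolder):
  """Delete every file, symlink, and subdirectory inside inputFolder; the folder itself is kept."""
  if not os.path.isdir(inputFolder):
      return  # nothing to clear if the folder does not exist yet
  for filename in os.listdir(inputFolder):
      file_path = os.path.join(inputFolder, filename)
      try:
          if os.path.isfile(file_path) or os.path.islink(file_path):
              os.unlink(file_path)
          elif os.path.isdir(file_path):
              shutil.rmtree(file_path)
      except Exception as e:
          print('Failed to delete %s. Reason: %s' % (file_path, e))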

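def load_fasta(file_path):
    """Read a LigandMPNN FASTA file and return a list of (metadata, sequence) tuples.

    Records whose header has no overall_confidence field are skipped.
    """
    sequences = []
    with open(file_path, 'r') as file:
        while True:
            # Each record is two lines: a header line followed by the sequence
            comment = file.readline().strip()
            if not comment:
                break  # stop at the end of the file or at the first blank line
            sequence = file.readline().strip()
            if 'overall_confidence' in comment:
                # Header format -> ">working_04, id=1, T=0.044000000000000004, seed=37763, overall_confidence=0.2417, ligand_confidence=0.2186, seq_rec=0.2693"
                # Drop the leading '>' and label the first field so that every item is key=value
                comment = 'file=' + comment[1:]
                json_data = {key: value.strip() for key, value in [item.split('=', 1) for item in comment.split(', ')]}
                # Convert the numeric scores from strings to floats
                json_data['overall_confidence'] = float(json_data['overall_confidence'])
                json_data['ligand_confidence'] = float(json_data['ligand_confidence'])
                json_data['seq_rec'] = float(json_data['seq_rec'])
                sequences.append((json_data, sequence))
    return sequences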

ClearFolder('/content/LigandMPNN/inputs')
ClearFolder('/content/LigandMPNN/outputs/default/backbones')
ClearFolder('/content/LigandMPNN/outputs/default/seqs')



🧬 Protein Chain Stitcher
Protein Chain Stitcher is a specialized tool designed to repair and connect fragmented protein structures within PDB files by intelligently stitching together disjointed segments using flexible poly-glycine (polyG) linkers.

🔧 What It Does
Analyzes a PDB file to identify continuous protein segments, chain breaks, and structural issues.

Determines the optimal stitching strategy by calculating the longest possible path through all valid segment connections.

Bridges chain breaks with polyG linkers wherever the gap falls within a user-defined distance threshold, given in Ångström (the `connection_threshold_Ang` argument).

Outputs a modified PDB with a unified, continuous backbone for downstream modeling, simulation, or structure prediction.
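The "longest path through valid connections" idea can be sketched in a few lines. This is an illustration only, not the `ChainStich` implementation: it assumes each segment is reduced to the coordinates of its N- and C-terminal ends, treats a connection as valid when one segment's C-terminus lies within the threshold of another segment's N-terminus, and finds the longest chain of segments by exhaustive depth-first search. The function names and the toy coordinates are made up for the example.

```python
import math

def valid_links(segments, threshold):
    """Map each segment index to the segments its C-terminus can reach."""
    links = {}
    for i, a in enumerate(segments):
        links[i] = [j for j, b in enumerate(segments)
                    if i != j and math.dist(a["c_term"], b["n_term"]) <= threshold]
    return links

def longest_path(links):
    """Return the longest ordering of segments that only uses valid links."""
    best = []

    def extend(path, seen):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for nxt in links[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                path.append(nxt)
                extend(path, seen)
                path.pop()
                seen.remove(nxt)

    for start in links:
        extend([start], {start})
    return best

# Toy segments laid out along the x axis (coordinates in Angstrom).
toy_segments = [
    {"n_term": (0.0, 0.0, 0.0), "c_term": (10.0, 0.0, 0.0)},     # segment 0
    {"n_term": (18.0, 0.0, 0.0), "c_term": (30.0, 0.0, 0.0)},    # segment 1
    {"n_term": (-30.0, 0.0, 0.0), "c_term": (-8.0, 0.0, 0.0)},   # segment 2
    {"n_term": (200.0, 0.0, 0.0), "c_term": (210.0, 0.0, 0.0)},  # segment 3, too far away
]
order = longest_path(valid_links(toy_segments, threshold=15.0))
print(order)
```

The exhaustive search grows quickly with the number of segments, which is acceptable for the handful of fragments in a typical hand-built design.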

🧠 Use Case
This tool is particularly useful when working with:

Fragmented predictions from structure prediction tools like AlphaFold or ESMFold

Engineered chimeric proteins requiring chain fusion

Loop modeling or de novo backbone design workflows

🧵 Why PolyG?
Poly-glycine segments provide a minimal, flexible linker that can later be refined or rebuilt using loop modeling tools, making them ideal for bridging uncertain or flexible regions in a structure.
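A rough lower bound on linker length follows from geometry. The sketch below assumes adjacent Cα atoms are about 3.8 Å apart and that the linker is fully extended, so it gives the fewest glycines that could possibly span a gap. The stitcher may choose lengths differently, and a practical linker usually needs some slack beyond this minimum. The function name and constant are illustrative.

```python
import math

CA_CA_SPACING = 3.8  # Angstrom; approximate distance between adjacent C-alpha atoms (assumption)

def min_linker_length(gap_angstrom, spacing=CA_CA_SPACING):
    """Fewest inserted residues that let a fully extended chain span the gap.

    n inserted residues create n + 1 C-alpha steps between the two flanking
    residues, so the step count is rounded up and one is subtracted.
    """
    steps = math.ceil(gap_angstrom / spacing)
    return max(steps - 1, 0)

print(min_linker_length(15.0))  # gap equal to the 15 Angstrom threshold used in this notebook
```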

In [None]:
input_pdb_path = r"./NovoSmithy/p96_Left.pdb"
output_pdb_path = r"./connected_protein.pdb"


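# Load the designed structure and set up the stitcher with a 15 Å connection threshold
model = PDBLoader(input_pdb_path)
chainStich = ChainStich(model, excluded_chains=[], connection_threshold_Ang=15)

PDBViewer(model)  # view the structure before stitching
connectedModel = chainStich.ConnectClosest()  # bridge the chain breaks
PDBViewer(connectedModel)  # view the stitched result
PDBSaver(connectedModel, output_pdb_path)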

🔡 Generating Sequences with LigandMPNN
Once your stitched protein structure is ready, the next step is to design amino acid sequences that are compatible with the 3D backbone and any ligand binding geometry you have in mind.

We use LigandMPNN for this purpose.

🧪 What LigandMPNN Does
LigandMPNN is a deep learning-based sequence design tool that:

Takes a backbone PDB file (including polyG-linked, stitched segments) as input

Considers ligand geometry (if present in the PDB)

Generates multiple plausible sequences that fit the structure and optimize interactions with any included ligands
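For orientation, LigandMPNN can also be run directly from its repository folder through `run.py`. The sketch below only assembles such a command from options described in the LigandMPNN README (`--model_type`, `--seed`, `--pdb_path`, `--out_folder`, `--batch_size`, `--number_of_batches`, `--temperature`); check the README for the current option list. It is an assumption that this resembles what `CreateSequence` does internally, and the helper name and default values are made up for the example.

```python
import shlex

def ligandmpnn_command(pdb_path, out_folder, seed=111, batch_size=1,
                       number_of_batches=1, temperature=0.1):
    """Build (but do not run) a LigandMPNN command line as a list of arguments."""
    return [
        "python", "run.py",
        "--model_type", "ligand_mpnn",
        "--seed", str(seed),
        "--pdb_path", pdb_path,
        "--out_folder", out_folder,
        "--batch_size", str(batch_size),
        "--number_of_batches", str(number_of_batches),
        "--temperature", str(temperature),
    ]

cmd = ligandmpnn_command("./connected_protein.pdb", "./outputs/default")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, cwd="/content/LigandMPNN")` if you want to execute it yourself.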

📁 Workflow
Input: The stitched PDB structure from the previous step.

Processing: LigandMPNN will run inference and sample a number of sequences.

Output: Designed sequences are saved in the ./Outputs/ directory, usually as .csv or .fa files with corresponding metadata.

This step is key for transforming your de novo structural model into a functional protein design candidate with sequences that can be synthesized or simulated.
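Once sequences have been collected as `(metadata, sequence)` tuples, the form returned by `load_fasta` in this notebook, a small helper can pick out the most confident designs. The sketch below is illustrative: the function names are hypothetical, the toy records are invented values rather than real LigandMPNN output, and splitting on `:` assumes that chains in a multi-chain design are separated by colons.

```python
def rank_designs(designs, top_n=3):
    """Return the top_n designs sorted by overall_confidence, highest first."""
    return sorted(designs, key=lambda d: d[0]["overall_confidence"], reverse=True)[:top_n]

def split_chains(sequence):
    """Split a multi-chain design into one sequence per chain."""
    return sequence.split(":")

# Toy records with made-up scores and sequences, for illustration only.
toy_designs = [
    ({"file": "toy_a", "overall_confidence": 0.31}, "GGSA:GGTA"),
    ({"file": "toy_b", "overall_confidence": 0.52}, "GASA:GATA"),
    ({"file": "toy_c", "overall_confidence": 0.44}, "GGGA"),
]
best = rank_designs(toy_designs, top_n=2)
for meta, seq in best:
    print(meta["file"], meta["overall_confidence"], split_chains(seq))
```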

In [None]:
sequenceFile = "/content/outputs/possibleSequences.fa"

addSaltBridges = True  # use residue probabilities that improve the stability of the alpha helix
numberAttempts = 10
batch_size = 1
number_batches = 1

generated_sequences = []

# Generate sequences without moving any of the chains
generated_sequences = CreateSequence(output_pdb_path, sequenceFile, generated_sequences, addSaltBridges, numberAttempts, batch_size, number_batches)

A hand-built structure is often difficult to align well. We can use tools to move the added sections around, which gives LigandMPNN a larger space to explore when searching for the best sequence. All measurements are in ångströms (Å).
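The random move itself is simple geometry: draw a random vector, split it into a unit direction and a distance, and shift every atom of the chosen section by that distance along the direction. The sketch below shows this with the standard library only. It is an assumption about what a helper like `MoveChainSection` does with its direction and size arguments, not its actual code, and the function names are hypothetical.

```python
import math
import random

def random_displacement(scale, rng):
    """Return (unit_direction, size) from a Gaussian vector with the given per-axis scale."""
    vec = [rng.gauss(0.0, scale) for _ in range(3)]
    size = math.sqrt(sum(v * v for v in vec))
    direction = [v / size for v in vec]
    return direction, size

def translate(coords, direction, size):
    """Shift every (x, y, z) point by size along direction."""
    return [tuple(c + size * d for c, d in zip(point, direction)) for point in coords]

rng = random.Random(0)
direction, size = random_displacement(10.0, rng)

toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]  # two toy atom positions in Angstrom
moved = translate(toy_coords, direction, size)
print(size, moved[0])
```

Because each axis is drawn independently, the resulting distance is not capped at `scale`; it is frequently larger than 10 Å when `scale` is 10.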

In [None]:
for i in range(25):
    # Reload the stitched structure so that every attempt starts from the same coordinates
    model = PDBLoader(output_pdb_path)
    # Random displacement: each component is drawn from a normal distribution with a 10 Å standard deviation
    randomMoveDirection = np.random.randn(3)*10
    moveSize = np.linalg.norm(randomMoveDirection)
    randomMoveDirection /= moveSize  # unit vector; moveSize holds the distance in Å

    MoveChainSection(model, 'D', None, randomMoveDirection, moveSize)  # chains D, E, and F are moved together
    MoveChainSection(model, 'E', None, randomMoveDirection, moveSize)
    MoveChainSection(model, 'F', None, randomMoveDirection, moveSize)
    moveSize = moveSize + np.random.randn(1)[0]*10  # allow more movement for the clamp
    MoveChainSection(model, 'A', (409,496), randomMoveDirection, moveSize)  # only a small part of chain A is the clamp, so only move those residues

    PDBViewer(model)
    PDBSaver(model, r"./connected_protein2.pdb")

    try:
      generated_sequences = CreateSequence(r"./connected_protein2.pdb", sequenceFile, generated_sequences, addSaltBridges, numberAttempts, batch_size, number_batches)
    except Exception as e:
      # Skip placements that fail, but report why instead of hiding the error
      print('Sequence generation failed on attempt %d: %s' % (i, e))


In [11]:
generated_sequences

[({'file': 'connected_protein_0',
   'id': '1',
   'T': '0.008',
   'seed': '84259',
   'overall_confidence': 0.4954,
   'ligand_confidence': 1.0,
   'seq_rec': 0.7267},
  'SEELHATYARLVAESEAEHAAGPRPVTFADIGGYEELKARLRRLVDGPVTRPEAFAAARATPPKAVLLFGPPGTGKTELAKALATETGRNLIFITGLDLAMWEFEGRPGRVGRIFVEAERRAPCILFIDEIHQVMLTGLYPGPLNPVLLALLREIRALPAGSGVFVIATTHRPELLPEIFFEVGLFDRPIYVPLPTLEEKVEQLRVELRDLEVPEEIDLEAIAERLTGATRADIRRLVEEARRIRAREIIEAIERGEEPPEPVLRMRDFEEALKTFRPSVTEEEIERYRELKERFENRPRGGTVFLPGRELTFEDIEGLEEVKEELRFRVIWPTRYPEVFEALKITPRKAVLLYGEPSVCKTALARAVATEAGLGLLTVTGLDLAMLEFARGPAGVRALFEEARRRALRGTAGLGIALYSEDEELARELAEEIERWIREHPLTTPETEIGRPEDGRFIGIADPDPERVRRLAEALRERLKEIPLTERVKVFIGVGLPRRGTVLFIDEIHLVTETGLYPGELHPVLLELARAIREAPERSGVFVIATTNRPERLPAFLFEPGLFDEPIEVRGGRLRPEAGLTGEENAARVEAALA:SPELRERRRRLIEESERRHAAGPRPVTFADIGGLRELKERLLRIVEGPVTRPEAFAAARAVPPKAVLLFGPPGTGKTELAKAVATETGRNLIFVTGLDLARWEFERRPERVTELFEEARRRAPCILFIDEIHQVMLTGLYPGPLNPVLLRLLLEIRRLPEGSGVFVIATTHRPERLPEIFFEVGLFDLPIYVPLPTLEEKVEELRRELRDLEVPEEIDLAAIAERLTGATRA