<a href="https://colab.research.google.com/github/PabloExperimental/VHH-3D-pipeline-prediction/blob/main/Nanobody_structure_prediction_with_NanoNet%2BDLPacker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This pipeline is based on the following research projects:


*   [NanoNet: Rapid and accurate end-to-end nanobody modeling by deep learning](https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2022.958584/full)
*   [DLPacker: Deep learning for prediction of amino acid side chain conformations in proteins](https://onlinelibrary.wiley.com/doi/10.1002/prot.26311)

The final superimposition step can fail, depending on the predicted structure.



In [2]:
!wget https://github.com/dina-lab3D/NanoNet/archive/refs/heads/main.zip
!unzip main.zip
!chmod 775 NanoNet-main
!pip install py3Dmol
!pip install dlpacker
!pip install biotite

--2024-08-25 08:17:07--  https://github.com/dina-lab3D/NanoNet/archive/refs/heads/main.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/dina-lab3D/NanoNet/zip/refs/heads/main [following]
--2024-08-25 08:17:08--  https://codeload.github.com/dina-lab3D/NanoNet/zip/refs/heads/main
Resolving codeload.github.com (codeload.github.com)... 140.82.113.10
Connecting to codeload.github.com (codeload.github.com)|140.82.113.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘main.zip’

main.zip                [     <=>            ]  39.82M  14.9MB/s    in 2.7s    

2024-08-25 08:17:10 (14.9 MB/s) - ‘main.zip’ saved [41751234]

Archive:  main.zip
e712547468f849e015b4f3de45a1110efc6e1b66
   creating: NanoNet-main/
  inflating: NanoNet-main/LICENSE.txt  
  inflating: NanoNet-

In [3]:
import tensorflow as tf
import numpy as np
import time
import py3Dmol
from dlpacker import DLPacker
import biotite.structure as struc
import biotite.structure.io.pdb as pdb
import traceback

Unzipping pretrained weight files...
Found volume files:  ['/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.003', '/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.010', '/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.011', '/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.008', '/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.006', '/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.009', '/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.012', '/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.002', '/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.004', '/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.007', '/usr/local/lib/python3.10/dist-packages/dlpacker/data/DLPacker_weights.7z.001', '/usr/local/lib/python3.10/dist-packages/dlpacker/

In [4]:
NanoNet = tf.keras.saving.load_model("/content/NanoNet-main/NanoNet", compile=False)



In [5]:
start_workflow = time.time()

def one_hot_encoding(sequence, max_len_sequence, alphabet):

  x_embed = np.zeros((max_len_sequence, len(alphabet)))

  for index_seq, amino in enumerate(sequence):
    index_amino = alphabet.index(amino)
    x_embed[index_seq][index_amino] = 1.0

  return x_embed

### Quoting from the NanoNet paper:

*   [NanoNet: Rapid and accurate end-to-end nanobody modeling by deep learning](https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2022.958584/full)

> 4 Methods
>
> 4.1 Network architecture
>
> The sequences are represented by an input tensor of 140x22, where
> 140 represents the maximal length of the heavy chain and the 22 channels are used for
> one-hot encoding of the 20 amino acids (one channel for unknown amino acid and one for insertion).


I added "X" for an unknown amino acid (UNK) and "*" as a wildcard (WCR).
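
To make the 140x22 layout concrete, here is a minimal pure-Python sketch of the same encoding. It mirrors the notebook's `one_hot_encoding` (rows past the end of the sequence stay all-zero) but adds two things that are assumptions of this sketch, not part of the original function: an explicit length check, and a fallback that maps unrecognised letters to the "X" channel.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX*"  # 20 amino acids + "X" (unknown) + "*" (wildcard)
MAX_LEN = 140


def encode(sequence, max_len=MAX_LEN, alphabet=ALPHABET):
    """Return a max_len x len(alphabet) one-hot matrix as nested lists."""
    if len(sequence) > max_len:
        raise ValueError(f"sequence has {len(sequence)} residues, limit is {max_len}")

    matrix = [[0.0] * len(alphabet) for _ in range(max_len)]
    for position, residue in enumerate(sequence):
        column = alphabet.find(residue)
        if column == -1:
            # Assumption of this sketch: unrecognised letters go to the "X" channel
            column = alphabet.index("X")
        matrix[position][column] = 1.0
    return matrix


matrix = encode("DVQ")
print(len(matrix), len(matrix[0]))  # rows, channels
print(matrix[0].index(1.0))         # channel used by "D"
```

Each encoded residue has exactly one channel set to 1.0, and every row beyond the sequence length sums to zero.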


In [6]:
max_len_sequence = 140

alphabet = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P",
            "Q", "R", "S", "T", "V", "W", "Y", "X", "*"]

alphabet_to_three_letter_code = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "B": "ASX", "C": "CYS",
    "E": "GLU", "Q": "GLN", "Z": "GLX", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO", "S": "SER",
    "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL", "X": "UNK", "*" :"WCR"}

[1MEL - PDB](https://www.rcsb.org/structure/1mel)

The sequence used below is 1MEL_1 with the trailing segment GRYPYDVPDYGSGRA removed.
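
If you start from the full 1MEL_1 sequence, the trailing segment can be removed programmatically. The helper below is a hypothetical sketch (the name `strip_tail` is not part of the notebook), and the toy input contains only the last residues of the domain followed by the tail.

```python
C_TERMINAL_TAIL = "GRYPYDVPDYGSGRA"


def strip_tail(full_sequence, tail=C_TERMINAL_TAIL):
    """Return the sequence without the tail; fail loudly if the tail is absent."""
    if not full_sequence.endswith(tail):
        raise ValueError("sequence does not end with the expected tail")
    return full_sequence[: -len(tail)]


# Shortened toy input: end of the nanobody domain plus the tail
toy_full = "WGQGTQVTVSS" + C_TERMINAL_TAIL
trimmed = strip_tail(toy_full)
print(trimmed)  # WGQGTQVTVSS
```

Raising an error when the tail is missing avoids silently trimming the wrong residues from a sequence that was already shortened.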

In [7]:

sequence = "DVQLQASGGGSVQAGGSLRLSCAASGYTIGPYCMGWFRQAPGKEREGVAAINMGGGITYYADSVK"\
        "GRFTISQDNAKNTVYLLMNSLEPEDTAIYYCAADSTIYASYYECGHGLSTGGYGYDSWGQGTQVTVSS"

x_pred = one_hot_encoding(sequence, max_len_sequence, alphabet)
result = NanoNet.predict((tf.expand_dims(x_pred, axis=0)))



Ref to PDB Format: \
https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html

Protein Data Bank Format: Coordinate Section

| Record Type | Columns | Data | Justification | Data Type |
|---|---|---|---|---|
| ATOM | 1-4 | “ATOM” | | character |
| | 7-11 | Atom serial number | right | integer |
| | 13-16 | Atom name | left | character |
| | 17 | Alternate location indicator | | character |
| | 18-20 | Residue name | right | character |
| | 22 | Chain identifier | | character |
| | 23-26 | Residue sequence number | right | integer |
| | 27 | Code for insertions of residues | | character |
| | 31-38 | X orthogonal Å coordinate | right | real (8.3) |
| | 39-46 | Y orthogonal Å coordinate | right | real (8.3) |
| | 47-54 | Z orthogonal Å coordinate | right | real (8.3) |
| | 55-60 | Occupancy | right | real (6.2) |
| | 61-66 | Temperature factor | right | real (6.2) |
| | 73-76 | Segment identifier | left | character |
| | 77-78 | Element symbol | right | character |
| | 79-80 | Charge | | character |
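
Because the format is fixed-width, the column ranges listed above translate directly into string slices: the 1-based, inclusive range 31-38 becomes the Python slice `[30:38]`. The parser below is a minimal sketch for ATOM records only. The function name and the example coordinates are made up for illustration and are not NanoNet output.

```python
def parse_atom_record(line):
    """Split one fixed-width ATOM record into typed fields (0-based slices)."""
    line = line.rstrip("\n").ljust(80)  # pad short lines so every slice exists
    return {
        "record": line[0:4],
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "alt_loc": line[16].strip(),
        "residue_name": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "insertion_code": line[26].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "segment": line[72:76].strip(),
        "element": line[76:78].strip(),
        "charge": line[78:80].strip(),
    }


# Example record assembled field by field so the column widths stay visible
example = (
    "ATOM  " + "    1" + " " + " N  " + " " + "ASP" + " " + "B" + "   1" + " "
    + "   " + "  11.104" + "   6.134" + "  -6.504" + "  0.00" + "  0.00"
    + " " * 10 + " N" + "  "
)
fields = parse_atom_record(example)
print(len(example), fields["atom_name"], fields["chain"], fields["x"])
```

Running a generated line back through a parser like this is a quick way to check that every field landed in the intended columns.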

In [8]:
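def get_atom_record(atom_data):
  # Format one ATOM record using the fixed-column PDB layout described above
  # --- Refined from ChatGPT result :P
  line = "{:4s}  {:5d} {:4s}{:1s}{:3s} {:1s}{:4d}{:1s}   {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}      {:4s}{:>2s}{:2s}\n".format(
        atom_data['name'],                          # columns 1-4
        # 2 spaces here (columns 5-6)
        atom_data['serial_number'],                 # columns 7-11
        # 1 space here (column 12)
        atom_data['atom_name'],                     # columns 13-16
        atom_data['alternate_location_indicator'],  # column 17
        atom_data['residue_name'],                  # columns 18-20
        # 1 space here (column 21)
        atom_data['chain_identifier'],              # column 22
        atom_data['residue_sequence_number'],       # columns 23-26
        atom_data['code_for_insertions'],           # column 27
        # 3 spaces here (columns 28-30)
        atom_data['x_coordinate'],                  # columns 31-38
        atom_data['y_coordinate'],                  # columns 39-46
        atom_data['z_coordinate'],                  # columns 47-54
        atom_data['occupancy'],                     # columns 55-60
        atom_data['temperature_factor'],            # columns 61-66
        # 6 spaces here (columns 67-72), so the segment identifier starts at column 73
        atom_data['segment_identifier'],            # columns 73-76
        atom_data['element_symbol'],                # columns 77-78, right-justified
        atom_data['charge']                         # columns 79-80
    )

  return line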

In [9]:
# Drop the batch dimension so each row holds the coordinates of one residue
result_change = np.reshape(result, (result.shape[1], result.shape[2]))
# print("Changed: ", result_change.shape)

# (atom name, element symbol) in the order their x, y, z triplets appear in a row:
# row[0:3] = N, row[3:6] = CA, row[6:9] = C, row[9:12] = O, row[12:15] = CB
backbone_atoms = [("N", "N"), ("CA", "C"), ("C", "C"), ("O", "O"), ("CB", "C")]

atom_index = 1
records = []

# Only the first len(sequence) rows correspond to real residues
for amino_index in range(min(len(sequence), result_change.shape[0])):
  row = result_change[amino_index]
  tl_aminoacid = alphabet_to_three_letter_code[sequence[amino_index]]

  for atom_offset, (atom_name, element) in enumerate(backbone_atoms):
    x, y, z = row[3 * atom_offset: 3 * atom_offset + 3]

    records.append(get_atom_record({
      'name': 'ATOM',
      'serial_number': atom_index,
      'atom_name': atom_name,
      'alternate_location_indicator': ' ',
      'residue_name': tl_aminoacid,
      'chain_identifier': 'B',
      'residue_sequence_number': amino_index + 1,
      'code_for_insertions': ' ',
      'x_coordinate': x,
      'y_coordinate': y,
      'z_coordinate': z,
      'occupancy': 0.00,
      # b-factor is zero because it's a prediction!
      'temperature_factor': 0.00,
      'segment_identifier': '  ',
      'element_symbol': element,
      'charge': ' '
    }))
    atom_index += 1

with open("./backbone.pdb", "w") as outfile:
  outfile.write("".join(records))

Reconstruct side-chains with DLPacker

In [10]:
dlp = DLPacker("/content/backbone.pdb")
start_dl_packer = time.time()
dlp.reconstruct_protein(order="sequence", output_filename="/content/structure_repacked.pdb")
print("Repacked ended in %.3f seconds" % (time.time() - start_dl_packer))


Writing output file...
Done!
Repacked ended in 405.479 seconds


# Evaluation
Superimpose the original 1MEL structure with the one predicted by the pipeline.

In [11]:
!wget https://files.rcsb.org/download/1MEL.pdb

--2024-08-25 08:26:25--  https://files.rcsb.org/download/1MEL.pdb
Resolving files.rcsb.org (files.rcsb.org)... 128.6.159.157
Connecting to files.rcsb.org (files.rcsb.org)|128.6.159.157|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘1MEL.pdb’

1MEL.pdb                [ <=>                ] 352.48K  --.-KB/s    in 0.1s    

2024-08-25 08:26:25 (2.53 MB/s) - ‘1MEL.pdb’ saved [360936]



In [12]:
pdb_original_lines = []

with open("./1MEL.pdb", "r") as f:
  pdb_original_lines = f.readlines()

nanobody_atoms = []
for line in pdb_original_lines:
  # Chain identifier 22
  if line[0:4] == "ATOM":
    if line[21] == "A":
      nanobody_atoms.append(line)

with open("./1MEL_1_only_nb.pdb", "w+") as f:
  f.writelines(nanobody_atoms)

In [13]:
print("1MEL - Predicted with NanoNet and reconstructed with DLPacker.\n\n")

view = py3Dmol.view(width=400, height=400)
with open("./structure_repacked.pdb", "r") as structure_repacked_file:
  structure_repacked_content = structure_repacked_file.read()

view.addModel(structure_repacked_content, "pdb")

view.setStyle({"cartoon": {"color": "yellow"}})
view.setBackgroundColor("dark")
view.zoomTo()
view.show()

1MEL - Predicted with NanoNet and reconstructed with DLPacker.




In [14]:
print("1MEL - Original.\n\n")

view = py3Dmol.view(width=400, height=400)
with open("./1MEL_1_only_nb.pdb", "r") as original_file:
  original_content = original_file.read()

view.addModel(original_content, "pdb")

view.setStyle({"cartoon": {"color": "red"}})
view.setBackgroundColor("dark")
view.zoomTo()
view.show()

1MEL - Original.




In [15]:
print("Superimposition\n\n")

pred_nb_file = pdb.PDBFile.read("/content/structure_repacked.pdb")
pred_nb = pred_nb_file.get_structure()[0]
original_nb_file = pdb.PDBFile.read("/content/1MEL_1_only_nb.pdb")
original_nb = original_nb_file.get_structure()[0]
try:

  result_file = pdb.PDBFile()
  pred_nb_common = pred_nb[struc.filter_intersection(pred_nb, original_nb)]
  original_common = original_nb[struc.filter_intersection(original_nb, pred_nb)]
  # Superimpose
  nb_superimposed, transformation = struc.superimpose(
      original_common, pred_nb_common, (pred_nb_common.atom_name == "CA"))

  # Apply to the original structure
  result_superimpose = struc.superimpose_apply(pred_nb, transformation)
  result_file.set_structure(result_superimpose)
  result_file.set_structure(nb_superimposed)
  result_file.write("superimposed_pred.pdb")

  pdbdata = open('./superimposed_pred.pdb', 'r').read()
  view=py3Dmol.view()
  view.addModel(pdbdata,'pdb')
  view.setStyle({"chain": "A"}, {"cartoon": {'color': 'yellow'}})
  view.setStyle({"chain": "B"}, {"cartoon": {'color': 'green'}})
  view.zoomTo()
  view.setBackgroundColor('black')

except Exception:
  print("Failed to superimpose.\n\n")
  traceback.print_exc()

Superimposition


Failed to superimpose.




  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = um.true_divide(
Traceback (most recent call last):
  File "<ipython-input-15-120d5f8b1d96>", line 18, in <cell line: 7>
    result_file.set_structure(result_superimpose)
  File "/usr/local/lib/python3.10/dist-packages/biotite/structure/io/pdb/file.py", line 579, in set_structure
    _check_pdb_compatibility(array, hybrid36)
  File "/usr/local/lib/python3.10/dist-packages/biotite/structure/io/pdb/file.py", line 1193, in _check_pdb_compatibility
    raise BadStructureError("Coordinates contain 'NaN' values")
biotite.structure.BadStructureError: Coordinates contain 'NaN' values
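
The NumPy warning fragments and the NaN error above are consistent with the superimposition receiving no atoms at all, since averaging an empty coordinate set yields NaN. One possible cause, not verified against the actual files: `backbone.pdb` is written with chain identifier B, the reference is filtered to chain A, and biotite's `filter_intersection` treats atoms as shared only when all their common annotations match, chain ID included. Differences in residue numbering between the two files would have the same effect. The stdlib-only sketch below shows how such an empty intersection could be diagnosed. The helper names and the two-atom toy records are made up for illustration.

```python
def atom_keys(pdb_lines, use_chain=True):
    """Collect one (chain, residue number, residue name, atom name) key per ATOM record."""
    keys = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21] if use_chain else ""
        keys.add((chain, int(line[22:26]), line[17:20].strip(), line[12:16].strip()))
    return keys


def make_line(serial, atom, residue, chain, residue_number):
    # Hypothetical helper: a minimal ATOM record without coordinates,
    # padded so that the identifying fields sit in their fixed columns
    return f"ATOM  {serial:5d} {atom:<4s} {residue:>3s} {chain}{residue_number:4d}"


# Toy stand-ins for the predicted (chain B) and reference (chain A) files
predicted = [make_line(1, "N", "ASP", "B", 1), make_line(2, "CA", "ASP", "B", 1)]
reference = [make_line(1, "N", "ASP", "A", 1), make_line(2, "CA", "ASP", "A", 1)]

shared_strict = atom_keys(predicted) & atom_keys(reference)
shared_relaxed = atom_keys(predicted, use_chain=False) & atom_keys(reference, use_chain=False)

print("shared atoms, chain ID compared:", len(shared_strict))
print("shared atoms, chain ID ignored: ", len(shared_relaxed))
```

If the real files behave like this toy case, writing the prediction with the same chain identifier as the reference, or matching atoms on residue number and atom name only, would give the superimposition a non-empty set of atoms to fit.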
